Handbook of Employee Selection
“SIOP emphasizes the scientist-practitioner model. Who exemplifies this model more in the field
of selection than Drs. Farr and Tippins? The new Alliance of Organizational Psychology includes
EAWOP, Division 1 of IAAP, and SIOP members. This book includes the perspectives of authors
with a global view of selection. There is simply no more authoritative text on the planet than this one
for members of the Alliance.” - Gary P. Latham, Secretary of State Professor of Organizational
Psychology, Rotman School of Management, University of Toronto, and Past President of the
Canadian Psychological Association and the Society for Industrial and Organizational Psychology
“Farr and Tippins have assembled an impressive line-up of leading scholars in the field to present a
comprehensive and up-to-date treatment of employee selection. Broad in scope and rich in content,
the Handbook of Employee Selection offers an evidence-based perspective on the design, imple-
mentation, and evaluation of selection systems in organizational contexts. The handbook also offers
an in-depth treatment of criterion development, important legal and ethical issues in employee
selection, and an in-depth discussion of the unique selection issues associated with various organizational
contexts, including the military, blue collar occupations, and multinational corporations. This book
is a must-read for practitioners and academics alike, and is positioned to have a major impact on
the field of employee selection.” - Lillian T. Eby, Professor of Psychology, Associate Editor,
Personnel Psychology, University of Georgia
“James L. Farr and Nancy T. Tippins have produced a ‘must-have’ handbook on employee selection
for anybody engaged in the selection and recruitment of talent at work, and for all students and aca-
demics in HR and personnel/occupational psychology. It contains most of the international leading
lights in the field. It not only includes the state of the art research in this arena, but also how this
can be translated into effective practice. A book shelf in HR is not complete without this volume.”
- Cary L. Cooper, CBE, Distinguished Professor of Organizational Psychology and Health at
Lancaster University Management School, England
“Farr and Tippins’ Handbook of Employee Selection is an impressive compilation of chapters written
by leaders in the field covering (a) psychometric issues, (b) design, implementation, and evaluation
issues, and (c) historical and legal contexts relating to selection of employees in organizations.
Chapters are, at the same time, sophisticated and readable. They summarize the state of the science
and practice to date, and provide signposts to the future of employee selection. Many of the hand-
book’s chapters will be citation classics well into the 21st century. It’s on my bookshelf and should
be on yours too!” - Charles E. Lance, Associate Editor, Organizational Research Methods,
Professor of Psychology, The University of Georgia
“Farr and Tippins have assembled the definitive who’s who in employee selection.” - Milton D.
Hakel, Department of Psychology, Bowling Green State University
HANDBOOK OF
Employee
Selection
EDITED BY
James L. Farr
Pennsylvania State University
University Park, PA
Nancy T. Tippins
Valtera Corporation
Greenville, SC
SECTION EDITORS
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
HF5549.5.S38H36 2010
658.3’112--dc22 2009052677
Chapter 7 Test Administration and the Use of Test Scores ...................................................... 151
Jeff W. Johnson and Frederick L. Oswald
Use of Test Scores .................................................................................................... 151
Decisions to Make Before Collecting Test Scores ................................................... 151
Collection of Test Scores .......................................................................................... 153
Computation of Test Scores...................................................................................... 159
Chapter 14 Personality: Its Measurement and Validity for Employee Selection ........................ 299
Leaetta Hough and Stephan Dilchert
Structure of Personality Variables ........................................................................... 299
Measurement Methods ............................................................................................. 301
Validity of Personality Constructs and Other Factors That Affect
Their Usefulness...................................................................................................308
Conclusions .............................................................................................................. 312
References ................................................................................................................ 312
Chapter 16 Practical Intelligence, Emotional Intelligence, and Social Intelligence .................. 339
Filip Lievens and David Chan
Definitions and Conceptualizations ......................................................................... 339
Measurement Approaches ........................................................................................344
Conceptual Framework for Examining Practical, Emotional, and
Social Intelligence ................................................................................................ 349
Strategies for Future Research ................................................................................. 351
Epilogue.................................................................................................................... 354
References ................................................................................................................ 355
Chapter 19 Evaluation of Measures: Sources of Error, Sufficiency, and Contamination ........... 399
Michael J. Zickar, Jose M. Cortina, and Nathan T. Carter
Reliability .................................................................................................................400
Validity .....................................................................................................................402
Sources of Invariance: A Handful of Hypotheses....................................................407
CFA MI/E in Selection Research ............................................................................. 411
Conclusions .............................................................................................................. 413
References ................................................................................................................ 414
Chapter 37 Selection for Team Membership: A Contingency and Multilevel Perspective......... 801
Susan Mohammed, Jan Cannon-Bowers, and Su Chuen Foo
Conceptual Framework for Understanding Selection for Team Membership..........802
Individual-Level Considerations ..............................................................................804
Team-Level Considerations ......................................................................................809
Discussion................................................................................................................. 815
Conclusions .............................................................................................................. 817
References ................................................................................................................ 818
Chapter 39 The Management Progress Study and Its Legacy for Selection............................... 843
Ann Howard
The Foundations of the Assessment Center ............................................................. 843
The Beginnings of Managerial Assessment ............................................................. 845
Predicting Managerial Success ................................................................................ 847
Successful Managers’ Development Over Time ...................................................... 852
AC Advantages and Disadvantages.......................................................................... 855
Chapter 44 Employee Selection: Musings About Its Past, Present, and Future.......................... 943
Robert M. Guion
Psychometric Musings.............................................................................................. 943
Musings on the Perpetual Disconnect: Research and Practice ................................ 950
Is Education Really Broadening? ............................................................................. 954
A Final Musing......................................................................................................... 956
References ................................................................................................................ 956
Author Index................................................................................................................................. 959
Subject Index ................................................................................................................................ 983
Preface
Industrial and organizational psychologists have long studied employee selection. For more than
100 years, psychologists have identified new processes for developing and evaluating selection instru-
ments and created new predictors and criteria for a wide range of jobs. At the same time, the organizational
environments in which selection tools are used have continued to evolve. Because the literature on selec-
tion is massive, this handbook can only summarize the important ideas that influence selection efforts
today. Undoubtedly, some important ideas have been omitted and many others are yet to be developed.
This handbook was designed to cover the current thinking on not only basic concepts in employee
selection but also specific applications of those concepts in various organizational settings. In addition,
the book looks backward at the milestones that have been passed in employee selection in the United
States and in Europe and considers the future direction of selection in the final chapter. Chapter 1
provides an overview of the entire volume. Throughout the book, we have encouraged authors to (a)
balance the treatment of scientific (i.e., research findings and theory) and practical concerns related
to implementation and operational use and (b) take a global perspective that reflects the concerns of
multinational corporations and cross-cultural differences in testing practices and applicant skill.
Our hope for this handbook is that it serves as a reference for the informed reader possessing an
advanced degree in industrial and organizational psychology, human resource management, or
other related fields, as well as for graduate students in these fields.
for the handbook is professionals who work in the area of employee selection in academic and pro-
fessional settings, a conscientious effort has been made to include the latest scientific thought and
the best practices in application.
Handbooks of this size are not published without the help of many people. We are particularly
grateful for the contributions of the 128 authors who wrote these 44 chapters. Without their
expertise and willingness to share their professional and scientific knowledge, there would be no
handbook. Similarly, we also attribute the existence of this handbook to the 16 section editors (John
P. Campbell, Frank J. Landy, Jerard F. Kehoe, Robert E. Ployhart, David Chan, Leaetta Hough, Ann
Marie Ryan, Neal W. Schmitt, Kevin R. Murphy, Elaine D. Pulakos, P. Richard Jeanneret, Paul R.
Sackett, Rick Jacobs, Ann Howard, Walter C. Borman, and Benjamin Schneider) of the eight major
parts of the handbook, who not only helped us refine our outlines and identify authors but also
guided the authors of each chapter in their respective parts.
Anne Duffy, Senior Editor at Psychology Press/Routledge of the Taylor and Francis Group,
deserves a special acknowledgement for initially cajoling us with good humor and positive affect
into taking this project on and then guiding and encouraging us throughout the process, as do Erin
Flaherty for handling the administrative duties for Taylor and Francis and Betsy Saiani of Valtera
for keeping us organized.
There are innumerable teachers, mentors, colleagues, and friends who have taught us much about
employee selection and gotten us to this point in our careers. We are appreciative of their sup-
port and encouragement and their indirect contribution to this handbook. Of course, our respective
families have made a significant contribution to this book through their encouragement, support,
patience, and tolerance. Thank you, Diane and Mac.
James L. Farr
State College, Pennsylvania
Nancy T. Tippins
Greenville, South Carolina
Editors
James L. Farr
Dr. James L. Farr received his PhD in industrial and organizational psychology from the University
of Maryland. Since 1971 he has been on the faculty of Pennsylvania State University, where he is
currently professor of psychology. He has also been a visiting scholar at the University of Sheffield
(United Kingdom), the University of Western Australia, the Chinese University of Hong Kong, and
the University of Giessen (Germany). His primary research interests are performance appraisal
and feedback, personnel selection, the older worker, and innovation and creativity in work set-
tings. Dr. Farr is the author or editor of over 80 publications in professional journals and books,
including The Measurement of Work Performance (with Frank J. Landy; Academic Press, 1983),
Innovation and Creativity at Work: Psychological and Organizational Strategies (coedited with
Michael West; John Wiley & Sons, 1990), and Personnel Selection and Assessment: Individual and
Organizational Perspectives (coedited with Heinz Schuler and Mike Smith; Lawrence Erlbaum
Associates, 1993). He was the editor of Human Performance from 2000–2006 and has been a
member of the editorial boards of numerous other professional journals, including Journal of Applied
Psychology, Organizational Behavior and Human Decision Processes, Journal of Occupational
and Organizational Psychology, and Journal of Business and Psychology. Active in a number of
professional organizations, Dr. Farr was president of the Society for Industrial and Organizational
Psychology (SIOP) in 1996–97 and has served in a variety of other positions for SIOP. He was an
elected member of the Board of Representatives for the American Psychological Association (APA)
from 1993–1996 and 2002–2004, representing SIOP. He is an elected fellow of SIOP and APA.
A strong believer in the scientist-practitioner model for industrial/organizational (I/O) psychology,
Dr. Farr was a winner of SIOP’s 1980 James McKeen Cattell Award for Research Design (with
Frank J. Landy and Rick Jacobs) and its 1998 M. Scott Myers Award for Applied Research in the
Workplace (with Frank J. Landy, Edwin Fleishman, and Robert Vance). In 2001 he was the winner
of SIOP’s Distinguished Service Award.
Nancy T. Tippins
Dr. Nancy T. Tippins is a senior vice president and managing principal of Valtera Corporation,
where she is responsible for the development and execution of firm strategies related to employee
selection and assessment. She has extensive experience in the development and validation of
selection tests and other forms of assessment, including performance appraisals for all levels of
management and hourly employees as well as in designing performance management programs
and leadership development programs. Prior to joining Valtera, Dr. Tippins worked as an internal
consultant in large Fortune 100 companies (e.g., Exxon, Bell Atlantic, and GTE) developing and
validating selection and assessment tools.
Dr. Tippins is active in professional affairs, including SIOP, where she has served as Chair of
the Committee on Committees, Secretary, Member at Large, and President (2000–2001). In addi-
tion, she served on SIOP’s Ad Hoc Committee on the Revision of the Principles for the Validation
and Use of Personnel Selection Procedures. She is currently the Secretary of the SIOP Foundation.
She has been SIOP’s representative to the APA’s Council of Representatives and served on the
APA’s Board of Professional Affairs. She is currently a commissioner on the Commission for the
Recognition of Specialties and Proficiencies in Professional Psychology and a member of the Joint
Committee for the revision of the Standards for Educational and Psychological Tests. She is on
the editorial boards of the Journal of Applied Psychology and Personnel Psychology, where she was
formerly the associate editor of the scientist-practitioner forum of Personnel Psychology. Dr. Tippins
is a fellow of SIOP, the APA, and the American Psychological Society (APS).
Dr. Tippins received MS and PhD degrees in industrial and organizational psychology from the
Georgia Institute of Technology. She holds an MEd in counseling and psychological services from
Georgia State University and a BA in history from Agnes Scott College.
Contributors
Herman Aguinis
University of Colorado Denver
Denver, Colorado

Paula Caligiuri
Rutgers University
New Brunswick, New Jersey

Lynn A. McFarland
Department of Psychology
Clemson University
Clemson, South Carolina

Heather H. McIntyre
School of Psychology
Georgia Institute of Technology
Atlanta, Georgia

Antonio Mladinic
Pontificia Universidad Catolica de Chile
Santiago, Chile

Susan Mohammed
Department of Psychology
Pennsylvania State University
University Park, Pennsylvania

Stefan T. Mol
Amsterdam Business School
University of Amsterdam
Amsterdam, The Netherlands

Kevin R. Murphy
Department of Psychology
Pennsylvania State University
University Park, Pennsylvania

Brett Myors
Griffith University
Brisbane, Australia

Levi Nieminen
Wayne State University
Detroit, Michigan

Betty Onyura
University of Guelph
Ontario, Canada

Frederick L. Oswald
Department of Psychology
Rice University
Houston, Texas

James L. Outtz
Outtz and Associates
Washington, DC

Jone M. Papinchock
Valtera Corporation
Greenville, South Carolina

Karen B. Paul
3M
St. Paul, Minnesota

Kenneth Pearlman
Creative Personnel Management Consulting
Sarasota, Florida

Pamela L. Perrewé
College of Business
Florida State University
Tallahassee, Florida

Norman Peterson
SPR Center, Inc.
Minneapolis, Minnesota

Robert E. Ployhart
Department of Management
University of South Carolina
Columbia, South Carolina
Employee selection is a major human resources function in work organizations. Employee selec-
tion refers to the process that employers use to make decisions concerning which individuals from
a group to choose for particular jobs or roles within the organization. Although frequently the job
candidates are not employed by the organization at the time selection decisions are made, employee
selection can also refer to cases in which current organizational employees may be candidates for
another job (e.g., a promotion to a higher organizational level or a lateral move to a different unit or
function). Guion (1998) has noted that the same selection logic also applies to other types of human
resource decisions, such as choosing which employees should be offered participation in skill or
competency development programs.
The chapters in Handbook of Employee Selection adopt the point of view that employment selec-
tion decisions should be based on information systematically obtained about job candidates and
that such candidate information has documented usefulness for predicting which candidates are
most likely to meet the objectives of the selection process. Selection objectives differ in their details
across organizations, job characteristics, and occasions but in general fall into three broad catego-
ries: (a) improving organizational effectiveness and efficiency; (b) complying with legal require-
ments and societal norms regarding equal employment opportunities related to candidates’ race,
ethnicity, gender, age, and disability status; and (c) helping individuals gain meaningful and sat-
isfying employment. Thus, the perspective regarding employee selection that is advocated in this
handbook is science- and data-based and consistent with the contemporary approach to organiza-
tional decision-making known as evidence-based management (e.g., Pfeffer & Sutton, 2006). This
perspective has evolved over the past century, and its history is briefly reviewed below.
1 What is now known as I-O psychology in North America, and work and organizational psychology in much of the rest
of the world, has had several names during its history; prior to 1973 the most common label was industrial psychology.
Throughout this handbook we generally use industrial-organizational (often abbreviated as I-O) psychology as the disci-
plinary name.
The growth of work organizations and paid employment necessitated that employers hire or select employees from
a larger group of potential workers (job applicants). Employee selection within work organizations
was largely informal, nonsystematic, and idiosyncratic before the involvement of I-O psychologists.
I-O psychology emphasized the empirical evaluation of the value of employee selection methods
that were based on principles of scientific research methodology being developed in experimental
psychology and on standardized measurement of individual differences (Vinchur, 2007). The basic
cornerstone of employee selection from the perspective of I-O psychology is that measures of the
abilities, knowledge, skills, and other characteristics of job applicants can be used to predict which
individuals will be (more) effective employees. Such measures (usually labeled as tests) are developed
systematically, based on job requirements, to maximize accuracy and minimize error in the assessment
of these individual differences among applicants.
The process of empirical evaluation of employee selection methods soon became known as
test validation, and by the 1920s the general components of an employee selection process and the
procedures for establishing the value of the selection process were much like what they are to
this day (Vinchur, 2007). For example, Kornhauser and Kingsbury (1924) described an initial job
analysis that led to the selection of existing tests or the creation of new tests to be administered to
job applicants. The scores on the selection tests were used to predict the individuals’ performance
on a criterion that represented success or proficiency on the job. Accuracy in the prediction of job
performance (criterion scores) was a measure of the validity of the test(s). Finally, the minimally
acceptable performance score was used to establish a passing or cut score on the test to determine
which applicants would be hired or rejected for the job. Bingham and Freyd (1926) addressed
various questions that should be asked (and answered) in advance of a validation investigation to
determine its scientific and practical feasibility. These issues concerned whether inadequate
current selection procedures were likely causes of performance deficiencies in the target job,
whether there were more job applicants than job openings, whether the number of projected hires
was adequate for scientific evaluation of test validity, whether an accurate and reliable performance
criterion was available, and whether the organizational members would be receptive to the validity
investigation.
Although these components of the selection process and considerations related to the feasibility
of test validation are still generally applicable to contemporary employee selection, we do not want
to convey the impression that nothing has changed since the 1920s! Much has changed. Test vali-
dation and the operational use of validated tests for employee selection decisions in the 1920s and
following decades were based on the assumption that job requirements and organizational contexts
were relatively stable over time. The pace of change in contemporary organizations has increased
considerably, with technological, product, and market changes having large and frequent impacts on
how jobs are performed. It was also assumed in those early decades that individual job performance
was linked to organizational effectiveness in a cumulative and direct way; that is, improvements
in the job performance of individual employees lead directly to improved organizational effective-
ness. There was little concern for performance of work groups or selecting for improved group
functioning. Today, much work is performed by groups, who may work together face-to-face or only
communicate via technology. In addition, the early emphasis in developing selection tests was on the
assessment of cognitive and psychomotor abilities that could be shown to be predictive of task
performance. Personality and motivation were rarely evaluated as possible predictors of performance
criteria, whereas contemporary group-based work places considerable interpersonal and teamwork
performance demands on employees that may be better predicted by noncognitive measures. Prior
to about 30 years ago, a selection test found to be a valid predictor of job performance in an organi-
zational setting was not considered useful for employee selection for the same job in similar organi-
zational settings unless the test score-criterion measure relationship was empirically demonstrated
in each new setting. It was thought that unmeasured situational factors might attenuate or even
eliminate the test-criterion relationship demonstrated in the original setting. More recently, profes-
sional thinking has moved toward the view that validity evidence obtained in one organizational
setting for a particular assessment measure can be used to support the use of that measure in other
settings, provided that sufficient similarity of the job requirements, applicant groups, and organiza-
tional context can be shown.
The remainder of this chapter presents brief summaries of the eight major parts of the handbook.
We do not attempt to address the specifics of each chapter in any detail. Rather, we synthesize the
important issues discussed in the part’s chapters. Some chapters are necessarily discussed more
than others, but that does not reflect our evaluative judgments concerning the chapters. We trust
that reading these brief comments about the parts of the handbook will help send you to the specific
chapters that interest you most at a given point in time.
The four chapters in Part 1 focus on topics that are the basic building blocks for science- and
data-based employee selection. These topics include reliability and validity, design of validation
studies, generalizability of validity evidence, and measurement of work activities within their orga-
nizational context. Important conclusions that can be drawn from the chapters of Part 1 include the
following:
1. In employee selection, the principal concern is for the validity of the predictive inference
that future job behavior can be linked to current test scores.
2. There are different types of validity evidence for the predictive inference and several ways
to obtain that evidence.
3. Although there is support for the general concept that validity evidence obtained in one
organizational setting may support a similar predictive inference in another setting, numer-
ous situational factors affect that transfer of validity evidence.
4. Validity evidence quantitatively aggregated across studies can provide more accurate
estimates of test score-work behavior relationships than evidence drawn from individual
studies, but the quality of primary validity studies, including analysis of work activities, is
critical for the validity of the aggregated data.
5. The validity evidence for criterion (work behavior) measures is as important as the evidence
for predictor (test scores) measures when examining the validity of predictive inferences.
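The quantitative aggregation mentioned in point 4 is, at its simplest, a sample-size-weighted average of validity coefficients across studies. The sketch below uses invented study results and omits the corrections (e.g., for range restriction and criterion unreliability) that a full meta-analysis would apply.

```python
# Hypothetical validity studies: (sample size, observed test-criterion r).
studies = [(120, 0.25), (340, 0.32), (85, 0.18), (210, 0.29)]

# Sample-size-weighted mean validity: larger studies carry
# proportionally more weight in the aggregate estimate.
total_n = sum(n for n, _ in studies)
mean_r = sum(n * r for n, r in studies) / total_n

print(f"k = {len(studies)} studies, total N = {total_n}, mean r = {mean_r:.3f}")
```

As point 4 notes, the aggregate is only as trustworthy as the primary studies feeding it: weighting cannot repair poorly measured criteria or inadequate analyses of work activities.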
Part 2 contains six chapters that address various aspects of how selection systems are used
operationally and how they are managed within the organization. Chapters examine such specific
topics as applicant recruitment, test administration, use of technology, and the organizational fac-
tors affecting the implementation and sustainability of employee selection programs. Themes that
emerge in Part 2 include the following:
Part 3 contains five chapters, each of which examines a different set of predictors frequently used in
employee selection. These include cognitive ability, personality, values, cognitive styles, motives and needs,
emotional and social intelligence, and physical abilities. Although cognitive ability measures have
demonstrated their importance as predictors of job performance in many jobs and organizations,
evidence is presented in other chapters in Part 3 that noncognitive constructs often provide incre-
mental predictive validity beyond that related to cognitive ability tests. In addition, concerns about
the group mean differences and adverse impact often found with cognitive tests have motivated I-O
psychologists to search for combinations of cognitive and noncognitive predictors to meet validity
requirements and social objectives in employment practices. Enhanced theory and improved mea-
surement of noncognitive constructs in recent years have also increased their use in selection.
Part 4 contains four chapters that are focused on decisions that must be made in designing an
assessment and determining the conditions under which it will be used. These chapters address
various tradeoffs (or balancing of multiple goals) that must be considered when making decisions
about the development of an employee selection system and its operational use once developed. In
addition to the validity-adverse impact tradeoff noted above in Part 3, decisions regarding develop-
ment and operation of selection systems in an organization must include considerations of required
time and budget resources and the number of organizational incumbents needed for various pur-
poses. In addition, the relation between the selection system and the managers for whom the new
employees will work often must be negotiated, and, minimally, the selection system developers must
understand who makes the final hiring decision. Finally, decisions must be made regarding what
information (assessment feedback) is provided to job candidates and the organization. For example,
successful candidates often have developmental needs, but should such information be communi-
cated to the candidate only, to organizational management only, or to both? How much information
is provided to unsuccessful candidates? Do such candidates from outside of the organization receive
the same information as those who are current employees seeking a new job?
The six chapters in Part 5 address the measurement of employees’ behavior on the job. Not
too many years ago, probably only one or two chapters could comprehensively cover this topic.
Traditionally in I-O psychology, the job incumbent’s performance of the tasks found on the job
description would have been the criterion measure of interest when assessing the validity of an
employee selection system. For some jobs characterized by high turnover, absence rates, or expen-
sive training costs, a relevant prediction would have been whether a job candidate could be expected
to remain on the job for some period of time, to attend work regularly, or to complete training
successfully. The definition of an “effective employee” has expanded considerably in the past two
decades to include more than someone who completes training programs, performs adequately
on core job tasks, attends work regularly, and remains on the job. The importance of displaying
such behaviors as adaptability, creativity, and organizational citizenship, while minimizing such
counterproductive work behaviors as incivility, verbal and physical aggression, and theft and sabo-
tage of organizational resources, has been recognized and documented in organizational research,
and these additional behaviors have emerged as relevant criteria for employee selection systems.
The expanded definition of effective employee performance makes it more likely that predictors
beyond cognitive ability will contribute to the prediction of effectiveness, because some behavior
domains (e.g., citizenship and counterproductive behavior) are likely to be related to personality
and motivation.
Part 6 contains four chapters that address the environment in which employee selection occurs.
One chapter examines the general nature of ethical and moral dilemmas and, more specifically,
the Code of Ethics of the American Psychological Association, as they relate to employee selec-
tion. Another discusses three documents [Standards for Educational and Psychological Testing
(American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education, 1999), Principles for the Validation and Use of Personnel
Selection Procedures (Society for Industrial and Organizational Psychology, 2003), and Uniform
Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission,
Civil Service Commission, Department of Labor, & Department of Justice, 1978)] containing legal
or professional guidelines and standards that are directly linked to employee selection and pro-
vide potential yardsticks by which selection instruments and programs can be evaluated. The final
two chapters in this part look at legal issues related to selection. One discusses 14 legal principles
derived from U.S. statutory law and from case law based on the written opinions of the U.S. federal
judiciary, noting their application to and implications for employee selection. The other discusses
commonalities and differences in the legal environment of employee selection across 22 countries,
addressing such issues as the subgroups (racial, ethnic, religious, gender) that are protected from
employment discrimination, the employers and employment practices covered by employment leg-
islation, the penalties for unfair employment practices, and procedures for charging and defending
claims of discrimination. The chapters in Part 6 define the important sociolegal environment in
which employee selection must exist.
The chapters in Parts 1–6 of this handbook in large part are written to be somewhat generic in
their content so that they apply well to most types of organizations and jobs. In contrast, the chap-
ters in Part 7 examine employee selection related to specific types of organizations or occupations.
Chapters focus on selection in the U.S. military, public (governmental) agencies, and multinational
organizations with particular emphasis on the factors that raise specific concerns in each of these
types of organizations. Other chapters address selection within particular occupations, including
skilled craft and technician jobs, sales and service jobs, entry- and middle-manager positions, and
executive leadership positions, again focusing on specific concerns. In addition, another chapter
addresses special issues related to selecting individuals for membership in work teams, including
concerns about individual-team fit and complementarity.
Part 8 presents a somewhat different perspective on employee selection than the other parts of
the handbook. As its title denotes, the chapters in Part 8 describe important events or concepts in the
relatively recent history of employee selection, true professional milestones. Three chapters focus
on large-scale, longitudinal projects that had a major impact on science- and data-based employee
selection. These include the development and implementation of the assessment center method at
AT&T (Chapter 39); the advancement of job performance theory that resulted from Project A conducted
within the U.S. Army (Chapter 40); and the development of O*NET™, a comprehensive
database and taxonomy that provides descriptions of all occupations in the U.S. economy that are
linked to many other occupational and labor market databases and are accessible to users on the
Internet (Chapter 41). Chapter 42 addresses the issues of validity generalization and situational
specificity, tracing the development of arguments and related evidence from both perspectives.
The situational specificity perspective, which held that evidence for the validity of a selection
system (or test) was required for each organizational situation in which the selection test was used,
was the dominant professional viewpoint for the first seven or eight decades of data-based selection.
Validity generalization shifted professional thinking by presenting evidence that much of
the situational variation in validity evidence for similar selection tests was related to statistical
artifacts in the sample data sets. In Chapter 42, a somewhat different approach to these questions
is presented, and it is argued that situational moderators of test validity do exist—specifically,
job type and task complexity—and that other moderators are likely to exist, supporting the need
for continued development of situational measurement and inclusion of situational variables in meta-
analytic studies of validity generalization. The final chapter in Part 8 looks somewhat further
back into the history of science-based selection but focuses on selection research and application
in Europe. Considerable scientific and applied activity related to employee selection occurred in
many European countries in parallel with similar efforts in North America, but this work remains
relatively unknown outside of Europe, an unfortunate omission in the knowledge base
of I-O psychologists in other parts of the world.
The handbook concludes with an epilogue written by Robert M. Guion, who draws upon his 50+
years of scientific and professional involvement in employee selection to reflect on the past, present,
and future status of selection. Guion muses on such topics as psychometric issues, the “perpetual
disconnect” between science and practice, the role of narrow and broad constructs in advancing
selection knowledge, and the integration of multilevel concepts into our thinking about selection.
Such musings provide a clear message that there is still much to learn about employee selection
with regard to the psychological constructs that form the basis of effective job performance and to
the implementation of selection procedures that help organizations and employees come together to
achieve both of their goals for success.
REFERENCES
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education (1999). Standards for educational and psychological testing. Washington,
DC: Author.
Bingham, W. V., & Freyd, M. (1926). Procedures in employment psychology: A manual for developing scien-
tific methods of vocational selection. New York, NY: McGraw-Hill.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Depart-
ment of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43,
8290–8315.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Kornhauser, A. W., & Kingsbury, F. A. (1924). Psychological tests in business. Chicago: University of Chicago
Press.
Pfeffer, J., & Sutton, R. I. (2006). Hard facts, dangerous half-truths, and total nonsense: Profiting from
evidence-based management. Boston, MA: Harvard Business School Press.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Vinchur, A. J. (2007). A history of psychology applied to employee selection. In L. L. Koppes (Ed.), Historical
perspectives in industrial and organizational psychology (pp. 193–218). Mahwah, NJ: Lawrence
Erlbaum.
Part 1
Foundations of Psychological
Measurement and Evaluation
Applied to Employee Selection
John P. Campbell and Frank J. Landy,
Section Editors
2 Reliability and Validity
Dan J. Putka and Paul R. Sackett
Reliability and validity are concepts that provide the scientific foundation upon which we construct
and evaluate predictor and criterion measures of interest in personnel selection. They offer a com-
mon technical language for discussing and evaluating (a) the generalizability of scores resulting
from our measures (to a population of like measures), as well as (b) the accuracy of inferences we
desire to make based on those scores (e.g., high scores on our predictor measure are associated with
high levels of job performance; high scores on our criterion measure are associated with high levels
of job performance).1 Furthermore, the literature surrounding these concepts provides a framework
for scientifically sound measure development that, a priori, can enable us to increase the likelihood
that scores resulting from our measures will be generalizable, and inferences we desire to make
based upon them, supported.
Like personnel selection itself, science and practice surrounding the concepts of reliability and
validity continue to evolve. The evolution of reliability has centered on its evaluation and framing
of “measurement error,” as its operational definition over the past century has remained focused on
notions of consistency of scores across replications of a measurement procedure (Haertel, 2006;
Spearman, 1904; Thorndike, 1951). The evolution of validity has been more diverse—with changes
affecting not only its evaluation, but also its very definition, as evidenced by comparing editions
of the Standards for Educational and Psychological Testing produced over the past half century
by the American Educational Research Association (AERA), American Psychological Association
(APA), and National Council on Measurement in Education (NCME) (AERA, APA, & NCME,
1999). Relative to the evolution of reliability, the evolution of validity has been well covered in the
personnel selection literature (e.g., Binning & Barrett, 1989; McPhail, 2007; Schmitt & Landy,
1993; Society for Industrial and Organizational Psychology, Inc., 2003), and will continue to be
well covered in this handbook. Indeed a section of this chapter and the three chapters that follow all
speak directly to developments with regard to validity—particularly as it relates to personnel selec-
tion. For this reason and those we note below, the bulk of this chapter will be devoted to providing
an integrated, modern perspective on reliability.
In reviewing the literature in preparation for this chapter, we were struck by the paucity of
organizational research that has attempted to juxtapose and integrate perspectives on reliability
from the last 50 years with perspectives on reliability from the first half of the 20th century. Indeed,
Borsboom (2006) lamented that to this day many treatments of reliability are explicitly framed or
implicitly laden with assumptions based on measurement models from the early 1900s. While clas-
sical test theory (CTT) certainly has its place in treatments of reliability, framing entire treatments
around it serves to “trap” us within the CTT paradigm (Kuhn, 1962). This makes it difficult for
students of the field to compare and contrast—on conceptual and empirical grounds—perspectives
offered by other measurement theories and approaches to reliability estimation. This state of affairs
is highly unfortunate because perspectives on reliability and methods for its estimation have evolved
greatly since Gulliksen's codification of CTT in 1950, yet these advances have been slow to
disseminate into personnel selection research and practice. Indeed, our review of the literature
reveals what appears to be a widening gap between perspectives on reliability offered in the
organizational research literature and those of the broader psychometric community (Borsboom,
2006). Couple this trend with (a) the recognized decline in the graduate instruction of statistics and
measurement over the past 30 years in psychology departments (Aiken, West, & Millsap, 2008;
Merenda, 2007), as well as (b) the growing availability of statistical software and estimation
methods since the mid-1980s, and we have a situation in which the psychometric knowledge base of
new researchers and practitioners can be dated before they exit graduate training. Perhaps more
disturbing is that the lack of dissemination of modern perspectives on reliability can easily give
students of the field the impression that there have not been many scientifically or practically useful
developments in the area of reliability since the early 1950s.

1 Throughout this chapter we use the term "scores" to generically refer to observed manifestations of a measurement
procedure—thus, scores might be ratings, behavioral observations, test scores, etc.
In light of the issues raised above, our aim in the first part of this chapter is to parsimoniously
reframe and integrate developments in the reliability literature over the past century in a manner that
reflects, to the extent of our knowledge, our modern capabilities. In laying out our discussion, we use
examples from personnel selection research and practice to relate key points to situations readers may
confront in their own work. Given our focus, we should note that several topics commonly
discussed in textbook or chapter-length treatments of reliability are missing from this chapter. For
example, topics such as standard errors of measurement, factors affecting the magnitude of reliabil-
ity coefficients (e.g., sample heterogeneity), and applications of reliability-related data (e.g., correc-
tions for attenuation, measure refinement) receive little or no attention here. The omission of these
topics is not meant to downplay their importance to our field; rather, it just reflects the fact that fine
treatments of these topics already exist in several places in the literature (e.g., Feldt & Brennan,
1989; Haertel, 2006; Nunnally, 1978). Our emphasis is on complementing the existing literature,
not repeating it. In place of these important topics, we focus on integrating and drawing connec-
tions between historically disparate perspectives on reliability. As we note below, such integration
is essential, because the literature on reliability has become extremely fragmented.
For example, although originally introduced as a “liberalization” of CTT over 40 years ago,
generalizability theory is still not well integrated into textbook treatments of reliability in the orga-
nizational literature. It tends to be relegated to secondary sections that appear after the primary
treatment of reliability (largely based on CTT) is introduced, not mentioned at all, or treated as
if it had value in only a limited number of measurement situations faced in research and practice.
Although such a statement may appear as a wholesale endorsement of generalizability theory and
its associated methodology, it is not. As an example, the educational measurement literature has
generally held up generalizability theory as a centerpiece of modern perspectives on reliability, but
arguably, this has come at the expense of shortchanging confirmatory factor analytic (CFA)-based
perspectives on reliability and how such perspectives relate to and can complement generalizability
theory. Ironically, this lack of integration goes both ways, because CFA-based treatments of reli-
ability rarely if ever acknowledge how generalizability theory can enrich the CFA perspective, but
rather link their discussions of reliability to CTT. Essentially, investigators seeking to understand
modern perspectives on reliability confront a fragmented, complex literature.
Overview
Our treatment of reliability is organized into four main sections. The first section offers a concep-
tual, “model-free” definition of measurement error. In essence, starting out with such a model-
free definition of error is required to help clarify some confusion that tends to crop up when one
begins to frame error from the perspective of a given measurement theory and the assumptions
such theories make regarding the substantive nature of error. Next we overlay our conceptual
treatment of error with perspectives offered by various measurement models. Measurement mod-
els are important because they offer a set of hypotheses regarding the composition of observed
scores, which, if supported, can allow us to accurately estimate reliability from a sample of data
and apply those estimates to various problems (e.g., corrections for attenuation, construction of
score bands). Third, we compare and contrast three traditions that have emerged for estimating
reliability: (a) a classical tradition that arose out of work by Spearman (1904) and Brown (1910),
(b) a random-effects model tradition that arose out of Fisher’s work with analysis of variance
(ANOVA), and (c) a CFA tradition that arose out of Jöreskog's work on congeneric test models.
Lastly, we close our discussion of reliability with a critical examination of how we have histori-
cally framed and dealt with measurement error. This closing section leads naturally into a discus-
sion of validity.
Our treatment of validity is entirely conceptual in nature. We do not address operational issues
in the design of studies aimed at obtaining various types of validity evidence. Rather, we attempt
to outline a set of distinctions that we view as central to an understanding of validity. Namely, we
discuss (a) validity as predictor-criterion relationship versus broader conceptualizations, (b) valid-
ity of an inference versus validity of a test, (c) types of validity evidence versus types of validity,
(d) validity as an inference about a test score versus validity as a strategy for establishing job relat-
edness, (e) the predictive inference versus the evidence for it, (f) classical versus modern notions
of reliability as a limiter of validity, and (g) validity limited to inferences about individuals versus
including broader consequences of test score use. Our belief is that a clear understanding of these
foundational issues in validity is essential for effective research and practice in the selection arena.
Reliability
A specification for error is central to the concept of reliability, regardless of one’s theoretical
perspective; but to this day the meaning of the term “error” is a source of debate and confusion
(Borsboom & Mellenbergh, 2002; Murphy & DeShon, 2000; Schmidt, Viswesvaran, & Ones,
2000). The sources of variance in scores that are designated as sources of error can differ as a func-
tion of (a) the inferences or assertions an investigator wishes to make regarding the scores, (b) how
an investigator intends to use the scores (e.g., for relative comparison among applicants or absolute
comparison of their scores to some set standard), (c) characteristics of the measurement procedure
that produced them, and (d) the nature of the construct one is attempting to measure. Consequently,
what is called error, even for scores produced by the same measurement procedure, may legiti-
mately reflect different things under different circumstances. A convenient way to address these
points is to examine how error has come to be operationally defined in the context of estimating reli-
ability. All measurement theories seem to agree that reliability estimation attempts to quantify the
expected degree of consistency in scores over replications of a measurement procedure (Brennan,
2001a; Haertel, 2006). Consequently, from the perspective of reliability estimation, error reflects
the expected degree of inconsistency between scores produced by a measurement procedure and
replications of it. There are several elements of these operational definitions that warrant further
explanation, beginning with the notion of replication. Clarifying these elements will provide an
important foundation for the remainder of our discussion.
Replication
From a measurement perspective, replication refers to the repetition or reproduction of a mea-
surement procedure such that the scores produced by each “replicate” are believed to assess the
same construct.2 There are many ways of replicating a measurement procedure. Perhaps the most
straightforward way would be to administer the same measurement procedure on more than one
occasion, which would provide insight into how consistent scores are for a given person across
occasions. However, we are frequently interested in more than whether our measurement procedure
would produce comparable scores on different occasions. For example, would we achieve
consistency over replicates if we had used an alternative, yet similar, set of items to those that
comprise our measure? Answering the latter question is a bit more difficult in that we are rarely
in a position to replicate an entire measurement procedure (e.g., construct two or more 20-item
measures of conscientiousness and compare scores on each). Consequently, in practice, "parts"
or "elements" of our measurement procedure (e.g., items) are often viewed as replicates of each
other. The observed consistency of scores across these individual elements is then used to make
inferences about the level of consistency we would expect if our entire measurement procedure
were replicated, that is, how consistent we would expect scores to be for a given person across
alternative sets of items we might use to assess the construct of interest. The forms of replication
described above dominated measurement theory for nearly the first five decades of the 20th century
(Cronbach, 1947; Gulliksen, 1950).

2 As we discuss later, the degree to which replicates are assumed to "assess the same construct" differs across measurement
theories. The degree of similarity among replicates has been discussed under the rubric of degrees of part-test
similarity (Feldt & Brennan, 1989) and degrees of parallelism (Lord, 1955). At this point, further discussion of this issue
is unnecessary, but we will revisit it when discussing the role of measurement models in reliability estimation.
Modern perspectives on reliability have liberalized the notion of replicates in terms of (a) the
forms that they take and (b) how the measurement facets (i.e., items, raters, tasks) that serve to
define them are manifested in a data collection design (i.e., a measurement design). For example,
consider a measurement procedure that involves having two raters provide ratings for individuals
with regard to their performance on three tasks designed to assess the same construct. In this case,
replicates take the form of the six rater-task pairs that comprise the measurement procedure, and as
such, are multifaceted (i.e., each replicate is defined in terms of a specific rater and a specific task).
Prior to the 1960s, measurement theory primarily focused on replicates that were defined along a
single facet (e.g., replicates represented different items, different split-halves of a test, or the same
test administered on different occasions).3 Early measurement models were not concerned with rep-
licates that were multifaceted in nature (Brown, 1910; Gulliksen, 1950; Spearman, 1910). Modern
perspectives on reliability also recognize that measurement facets can manifest themselves differ-
ently in any given data collection design. For example, (a) the same raters might provide ratings for
each ratee, (b) a unique, nonoverlapping set of raters might provide ratings for each ratee, or (c) sets
of raters that rate each ratee may vary in their degree of overlap. As we note later, the data collection
design underlying one’s measurement procedure has important implications for reliability estima-
tion, which, prior to the 1960s, was not integrated into measurement models. It was simply not the
focus of early measurement theory (Cronbach & Shavelson, 2004).
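The random-effects (ANOVA) estimation tradition that the chapter contrasts with the classical one can be made concrete with a small sketch. The following Python code is our own illustration, not drawn from the chapter: it estimates variance components for the simpler one-facet, fully crossed persons × raters design via the expected mean squares of a two-way random-effects ANOVA without replication. The data, the function name `one_facet_g_study`, and all numeric settings are invented for the example.

```python
import numpy as np

def one_facet_g_study(scores):
    """Estimate variance components for a fully crossed persons x raters
    design (a one-facet G study). scores: 2-D array, rows = persons,
    columns = raters. Returns (var_person, var_rater, var_residual)."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for the components
    var_resid = ms_pr                       # person x rater interaction + error
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # person (universe-score) variance
    var_r = max((ms_r - ms_pr) / n_p, 0.0)  # rater main-effect (leniency) variance
    return var_p, var_r, var_resid

# Simulated ratings: 200 persons rated by the same 4 raters
rng = np.random.default_rng(0)
n_p, n_r = 200, 4
true_score = rng.normal(0, 1.0, n_p)        # person effects (variance 1.0)
rater_effect = rng.normal(0, 0.5, n_r)      # rater leniency/severity
noise = rng.normal(0, 0.8, (n_p, n_r))      # interaction + error
ratings = true_score[:, None] + rater_effect[None, :] + noise

var_p, var_r, var_res = one_facet_g_study(ratings)
# Generalizability coefficient for relative decisions averaged over n_r raters
g_rel = var_p / (var_p + var_res / n_r)
print(round(var_p, 2), round(var_r, 2), round(var_res, 2), round(g_rel, 2))
```

Note how rater leniency (`var_r`) is segregated from person variance here, whereas in a classical single-facet treatment it would simply be folded into error or ignored, depending on the design.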
Expectation
A second key element of the operational definition of reliability offered above is the notion of
expectation. The purpose of estimating reliability is not to quantify the level of consistency in
scores among the sample of replicates that comprise one’s measurement procedure for a given
study (e.g., items, raters, tasks, or combinations thereof). Rather, the purpose is to use such
information to make inferences regarding (a) the consistency of scores resulting from our mea-
surement procedure as a whole with the population from which replicates comprising our mea-
surement procedure were drawn (e.g., the population of items, raters, tasks, or combinations
thereof believed to measure the construct of interest) and (b) the consistency of the said scores
for the population of individuals from which our sample of study participants was drawn. Thus,
the inference space of interest in reliability estimation is inherently multidimensional. As we
describe in subsequent sections, the utility of measurement theories is that they help us make
this inferential leap from sample to population; however, the quality with which estimation
approaches derived from these theories do so depends on the properties of scores arising from
each replicate, characteristics of the construct one is attempting to measure, and characteristics
of the sample of one's study participants.

3 A key exception here is Cronbach's (1947) treatment of a coefficient of equivalence and stability.
Summary
In this section, we attempted to offer a model-free perspective on error and how it has come to be
operationally defined from the perspective of reliability. We adopted this strategy in part because
of the confusion noted above, but also to bring balance to existing treatments of reliability in the
industrial-organizational (I-O) literature, which explicitly or implicitly tend to frame discussions of
reliability from the CTT tradition. The language historically used in treatments of CTT makes it
difficult for investigators to recognize that inconsistency in scores is not necessarily beyond
meaningful explanation, even though we often conceive of it as random. Another reason we belabor
this point is that, beginning with Spearman (1904), a legacy of organizational research emerged that
focuses more on making adjustments for error in our measures (e.g., corrections for attenuation)
than on developing methods for modeling and understanding error in our measures, which in part
may reflect our tendency to view such error as unexplainable. The implications of this legacy and
the potential for remediation in light of recent methodological advances are touched upon in the
concluding sections of our discussion of reliability.

4 Actually, this is a bit of an overstatement. As alluded to in the opening paragraph, error, in any given situation, will be
partly dependent on the generalization(s) the investigator wishes to make regarding scores. In some cases, investigators
may choose not to treat a given source of inconsistency in scores as error. For example, this might occur in the context of
performance ratings, where inconsistencies in job incumbents' scores across different dimensions of performance may
be viewed as acceptable by the investigator (Scullen, Mount, & Goff, 2000). This example illustrates why the investigator
is a critical part of defining error in any given situation. We will touch upon this topic again later when we discuss
generalizability theory.
5 Readers may question the omission of item response theory (IRT) from the subsequent discussion. Like Brennan (2006),
we tend to view IRT as offering a “scaling” model rather than a “measurement” model because it does not have a built-in
explicit consideration of measurement error. Further, the focus of applications of IRT is often on estimation/scaling of a
latent trait, ability of interest, or calibration of item parameters rather than the isolation and quantification of measurement
error (see Brennan 2006, pp. 6-7). Although we are not downplaying the obvious importance of IRT for psychometrics and
personnel selection, we felt it was beyond the scope of this chapter to address IRT while still addressing reliability and valid-
ity as we have done herein. For a recent, parsimonious treatment of IRT, we refer readers to Yen and Fitzpatrick (2006).
Conceptually, a person’s true score equals the expected value of their observed scores across an
infinite set of replications of the measurement procedure. Given that such an infinite set of rep-
lications is hypothetical, a person’s true score is unknowable; but, as it turns out, not necessarily
unestimatable (see Haertel, 2006, pp. 80–82). As noted earlier, true score represents the source(s) of
consistency in scores across replicates (note there is no “r” subscript on the true score component
in Equation 2.1)—in CTT it is assumed to be a constant for a given person across replicates. Error,
on the other hand, is something that varies from replicate to replicate, and CTT hypothesizes that
the mean error across the population of replicates for any given person will be zero. In addition to
these characteristics, if we look across persons, CTT hypothesizes that there will be (a) no
correlation between true scores and error scores associated with a given replicate (r(Tp, Ep1) = 0),
(b) no correlation between error scores from different replicates (r(Ep1, Ep2) = 0), and (c) no
correlation between error scores from a given replicate and true scores from another replicate
(r(Ep1, Tp2) = 0). Although the CTT
score model does not necessitate that error scores from a given replicate (or composite of replicates)
be uncorrelated with scores from measures of other attributes, the latter is a key assumption under-
lying the use of reliability coefficients to correct observed correlations for attenuation (Spearman,
1910; Schmidt & Hunter, 1996). Essentially, this last assumption implies that inconsistency in a
measurement procedure will be unrelated to any external variables (i.e., variables other than our
replicates) and therefore beyond meaningful prediction. From basic statistics we know that the vari-
ance of the sum of two independent variables (such as T and E) will simply equal the sum of their
variances; thus, under CTT, observed score variance across persons for a given replicate is simply
the sum of true score variance and error variance.
As we detail later, reliability estimation attempts to estimate the ratio σ²T / (σ²T + σ²E), not for a
single replicate, but rather for a measurement procedure as a whole, which as noted earlier is often
conceived as being comprised of multiple replicates. Thus, reliability coefficients are often inter-
preted as the proportion of observed score variance attributable to true score variance, or alterna-
tively, the expected correlation between observed scores resulting from our measurement procedure
and scores that would be obtained had we based our measure on the full population of replicates of
interest (i.e., hypothetical true scores).
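The CTT decomposition described above can be illustrated with a short simulation (a sketch in Python; the variance values are arbitrary choices for illustration, not taken from this chapter). It shows that observed variance is the sum of true and error variance, and that the correlation between two replicates sharing identical true scores recovers the reliability ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                         # persons

T = rng.normal(50, 5, n)            # true scores, variance 25
E1 = rng.normal(0, 3, n)            # error for replicate 1, variance 9
E2 = rng.normal(0, 3, n)            # error for replicate 2
X1, X2 = T + E1, T + E2             # observed scores, X = T + E

# Observed variance equals true plus error variance (25 + 9 = 34),
# because T and E are independent.
obs_var = X1.var()

# Reliability = var(T) / [var(T) + var(E)] = 25/34, roughly .74; the
# correlation between the two replicates approximates this ratio.
r12 = np.corrcoef(X1, X2)[0, 1]
```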
One of the key defining characteristics of CTT is the perspective it takes on replicates. Recall
that earlier we offered a very generic definition for what constitutes a replicate. We described how
we often conceive of parts or elements of a measurement procedure as replicates and use them to
estimate the reliability of scores produced by our procedure as a whole. As we note later, CTT-based
reliability estimation procedures assume that replicates have a certain degree of “parallelism.” For
example, for two replicates to be considered strictly (or classically) parallel they must: (a) produce
identical true scores for a given individual (i.e., Tp for Replicate A = Tp for Replicate B), (b) have iden-
tical mean observed scores, and (c) have identical error variances.6 The commonly used Spearman-
Brown prophecy formula is an example of a CTT-based estimation procedure that is based on the
assumption that replicates involved in its calculation are strictly parallel (Feldt & Brennan, 1989).
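Under strict parallelism, the Spearman-Brown prophecy projects the reliability of a lengthened (or shortened) measure from the reliability of its replicates. A minimal sketch (the function name is ours, not the chapter's):

```python
def spearman_brown(rel, k):
    """Projected reliability when a measure composed of strictly parallel
    replicates is lengthened by a factor of k."""
    return k * rel / (1 + (k - 1) * rel)

# Doubling a measure whose single-replicate reliability is .50
# projects a reliability of 2/3 for the lengthened measure.
doubled = spearman_brown(0.50, 2)
```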
Oftentimes it is not realistic to expect any two replicates to be strictly parallel. For example,
items on a test of cognitive ability are rarely of the same difficulty level, and raters judging
incumbents’ job performance often differ in their level of leniency/severity. Under such condi-
tions, item means (or rater means) would differ, and thus, such replicates would not be considered
strictly parallel. In recognition of this, CTT gradually relaxed its assumptions over the years to
accommodate the degrees of parallelism that are more likely to be seen in practice. The work of
Lord (1955), Lord and Novick (1968), and Joreskog (1971) lays out several degrees of parallelism,
which we briefly review below.
6 Given conditions (a) and (c), such replicates will also have identical observed score variances.
16 Handbook of Employee Selection
Tau-equivalent replicates (a) produce identical true scores for a given individual and have (b)
identical mean observed scores, but may have different error variances (across persons) and as such
different observed variances. Essentially tau-equivalent replicates relax assumptions further, in that
they allow true scores produced by the replicates to differ by a constant (i.e., Tp for Replicate 1 = Tp
for Replicate 2 + C). As such, essential tau-equivalence accommodates the situation in which there
are mean differences across replicates (e.g., items differ in their difficulty, and raters differ in their
leniency/severity). The assumption of essential tau-equivalence underlies several types of coeffi-
cients commonly used in reliability estimation such as coefficient alpha, intraclass correlations, and
as we discuss in the next section, generalizability coefficients.7
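For instance, coefficient alpha can be computed directly from a persons × replicates score matrix; under essential tau-equivalence it recovers the reliability of the composite. A sketch (the simulated variance values and constants are arbitrary illustrations):

```python
import numpy as np

def coefficient_alpha(X):
    """Alpha for a persons x replicates matrix (replicates could be
    items, raters, occasions, etc.)."""
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum()
                          / X.sum(axis=1).var(ddof=1))

# Essentially tau-equivalent replicates: true scores identical up to an
# additive constant (e.g., item difficulty), with equal error variances.
rng = np.random.default_rng(1)
n, k = 50_000, 5
T = rng.normal(0, 1.0, (n, 1))                  # true scores, variance 1
c = np.array([[-.4, -.2, 0., .2, .4]])          # per-replicate constants
E = rng.normal(0, np.sqrt(0.5), (n, k))         # error variance .5 each
X = T + c + E

# Composite true variance = 5^2 * 1 = 25; composite error variance =
# 5 * .5 = 2.5, so reliability of the sum is 25/27.5, about .91.
alpha = coefficient_alpha(X)
```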
One thing that may not be immediately obvious from the description of essential tau-equivalence
offered above is that it does not accommodate the situation in which replicates differ in true score
variance (across persons). Joreskog’s (1971) notion of congeneric test forms (or more generally, con-
generic replicates) accommodated this possibility. Specifically, the congeneric model allows true
scores produced by a given replicate to be a linear function of true scores from another replicate
(i.e., Tp for Replicate 1 = b × Tp for Replicate 2 + C). As we will illustrate in our later section on reli-
ability estimation, this accommodates the possibility that replicates may be differentially saturated
with true score variance or be measured on a different metric.
The degrees of parallelism discussed above have implications for estimating reliability; more
specifically, they have implications for the accuracy of results produced by reliability estimation
methods that we apply to any given set of replicates. As we discuss later, we can apply nearly
any reliability estimation method derived from the classical tradition to any sample of replicates,
regardless of their underlying properties; however, the estimate we get will differ in its accuracy
depending on (a) the extent to which the underlying properties of those replicates conform to the
assumptions above and (b) characteristics of the construct one is attempting to measure. It is beyond
the scope of this chapter, and not its intent, to provide a catalog of coefficients that may be appropriate for estimating reliability depending on the degree of parallelism among the replicates of interest,
because excellent descriptions exist elsewhere in the literature (e.g., Feldt & Brennan, 1989, Table 3,
p. 115). However, in reviewing treatments such as the one offered by Feldt and Brennan (1989), be
cognizant that the myriad coefficients they review (including the commonly used Spearman-Brown
prophecy and coefficient alpha) were formulated to deal with scores arising from measurement
procedures in which (a) replicates were defined by a single facet (e.g., replicates reflect different
items or test parts) and (b) that facet was fully crossed with one’s objects of measurement (e.g., all
test takers are administered the same set of items, all test takers completed the same test on two
different occasions). As we will see below, application of classical reliability estimation methods
in cases in which replicates are multifaceted (e.g., replicates representing task-rater pairs) or cases
in which the design underlying one’s measurement procedure is not fully crossed is problematic
(Cronbach & Shavelson, 2004). The treatment of reliability for measurement procedures character-
ized by multifaceted replicates or involving noncrossed measurement designs leads naturally to the
introduction of G-theory.
Generalizability Theory
G-theory liberalizes CTT in that it has mechanisms within its score model for (a) dealing with
single-faceted and multifaceted replicates, (b) simultaneously differentiating and estimating mul-
tiple sources of error arising from different measurement facets (e.g., items, raters, occasions,
tasks), (c) dealing with scores produced by a wide variety of data collection designs (e.g., crossed,
nested, and ill-structured measurement designs), (d) adjusting the composition of true score and
7 Actually, this statement is somewhat misleading, because coefficient alpha and intraclass correlations simply represent
specific computational forms of a broader class of coefficients known as generalizability coefficients (Cronbach &
Shavelson, 2004).
Reliability and Validity 17
error depending on the generalizations one wishes to make regarding the scores, (e) adjusting the
composition of true score and error depending on how one intends to use the scores (e.g., for rela-
tive comparison among applicants or absolute comparison of their scores to some set standard),
and (f) relaxing some of the assumptions put on the distributional properties of true and error com-
ponents prescribed under CTT. The purpose of this section is to elaborate these features of the G-theory model in a way that is relatively free from G-theory jargon—which has been cited as one
reason why the unifying perspective that G-theory offers on reliability has yet to be widely adopted
by organizational researchers (DeShon, 2002).
Perhaps the most visible way the G-theory model liberalizes the CTT model is its ability to handle measurement procedures comprised of multifaceted replicates. To illustrate this key differ-
ence between the G-theory and CTT models, let us first consider an example in which we have
observed scores based on ratings of job applicants’ responses to three interview questions designed
to assess interpersonal skill. Say that we had the same three raters interview each applicant and
that each rater asked applicants the same three questions (i.e., applicants, raters, and questions are
fully crossed). Thus, we have nine scores for each applicant—one for each of our nine “replicates,”
which in this case are defined by unique question-rater combinations. Under CTT and G-theory, we
might conceive of an applicant’s true score as the expected value of his/her observed score across
the population of replicates—in this case it is the population of raters and questions. However,
if we were to apply the CTT score model to such replicates it would break down because it does
not account for the fact that some replicates share a rater in common and other replicates share a
question in common. As such, the error associated with some replicates will be correlated across
applicants, therefore violating one of the key assumptions underlying the CTT measurement model (i.e.,
errors associated with different replicates are uncorrelated). As we show below, the G-theory mea-
surement model permits the addition of terms to the model that account for the fact that replicates
are multifaceted. The insidious part of this illustration is that the situation above would not prevent
us from applying estimation methods derived from CTT to these data (e.g., calculating coefficient
alpha on the nine replicates). Rather, perhaps unbeknownst to the investigator, the method would
allocate error covariance among replicates that share a rater or question in common to true score
variance because they are a source of consistency across at least some of the replicates (Komaroff,
1997; Raykov, 2001a). That is, the CTT score model and commonly used coefficients derived from
it (e.g., coefficient alpha) are blind to the possibility of multifaceted replicates, which is a direct
reflection of the fact that early measurement theory primarily concerned itself with fully crossed,
single-faceted measurement designs (Cronbach & Shavelson, 2004).
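This allocation of shared rater- and question-linked covariance to true score can be made concrete with a simulation (a sketch; the component values .50, .30, .10, and .10 mirror the interview example used later in this chapter, and question/rater main effects are omitted because, being constant across applicants, they do not affect person-level covariances):

```python
import numpy as np

rng = np.random.default_rng(2)
n_p, n_q, n_r = 20_000, 3, 3

# Crossed applicant x question x rater effects
P  = rng.normal(0, np.sqrt(.50), (n_p, 1, 1))       # applicant effect
PQ = rng.normal(0, np.sqrt(.30), (n_p, n_q, 1))     # applicant x question
PR = rng.normal(0, np.sqrt(.10), (n_p, 1, n_r))     # applicant x rater
E  = rng.normal(0, np.sqrt(.10), (n_p, n_q, n_r))   # residual

# Nine question-rater "replicates" per applicant
X = (P + PQ + PR + E).reshape(n_p, n_q * n_r)

k = X.shape[1]
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum()
                       / X.sum(axis=1).var(ddof=1))
# Alpha lands near .93, well above the roughly .78 generalizability these
# components imply, because covariance shared through a common rater or
# question is counted as true score variance.
```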
To account for the potential that replicates can be multifaceted, G-theory formulates its measure-
ment model from a random-effects ANOVA perspective. Unlike CTT, which has its roots in the
correlational research tradition characteristic of Spearman and Pearson, G-theory has its roots in
the experimental research tradition characteristic of Fisher (1925). As such, G-theory is particularly
sensitive to dealing with replicates that are multifaceted in nature and both crossed and noncrossed
measurement designs. It has long been acknowledged that issues of measurement design have
been downplayed and overlooked in the correlational research tradition (Cattell, 1966; Cronbach,
1957), and this is clearly evident in reliability estimation approaches born out of CTT (Cronbach &
Shavelson, 2004; Feldt & Brennan, 1989). To ease into the G-theory measurement model, we start
with a simple example, one in which we assume observed scores are generated by a replicate of a
measurement procedure that is defined along only one facet of measurement. The observed score
(X) for a given person p that is produced by any given replicate defined by measurement facet “A”
(e.g., "A" might reflect items, occasions, raters, tasks, etc.) is assumed to be an additive function:

Xpa = b0 + up + ua + upa + epa (2.3)

where b0 is the grand mean score across persons and replicates of facet A; up is the main effect of
person p and conceptually the expected value of p’s score across the population of replicates of facet
A (i.e., the analogue of true score); ua represents the main effect of replicate a and conceptually is
the expected value of a’s score across the population of persons; upa represents the p × a interaction
effect and conceptually reflects differences of the rank ordering of persons across the population
of replicates of facet A; and lastly, epa is the residual error that conceptually is left over in Xpa after
accounting for the other score effects.8 As with common random-effects ANOVA assumptions,
these score effects are assumed to (a) have population means of zero, (b) be uncorrelated, and (c)
have variances of σ2P, σ2A, σ2PA, and σ2Residual, respectively (Jackson & Brashers, 1994). It is the
latter variance components that are the focus of estimation efforts in G-theory, and they serve as
building blocks of reliability estimates derived by G-theory.
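For a fully crossed persons × facet-A design, these variance components can be estimated from the familiar expected-mean-square relations of random-effects ANOVA. A minimal sketch (the function name and simulated component values are ours, chosen only for illustration):

```python
import numpy as np

def variance_components(X):
    """Estimate the person, replicate, and confounded interaction/residual
    variance components for a fully crossed persons x replicates matrix X,
    using random-effects ANOVA expected mean squares."""
    n_p, n_a = X.shape
    grand = X.mean()
    ss_p = n_a * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_a = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_p - ss_a
    ms_p = ss_p / (n_p - 1)
    ms_a = ss_a / (n_a - 1)
    ms_res = ss_res / ((n_p - 1) * (n_a - 1))
    # E[MS_p] = var_res + n_a * var_p; E[MS_a] = var_res + n_p * var_a
    return (ms_p - ms_res) / n_a, (ms_a - ms_res) / n_p, ms_res

# Quick check on simulated data with known components (.50, .25, .25)
rng = np.random.default_rng(3)
n_p, n_a = 5_000, 100
X = (rng.normal(0, np.sqrt(.50), (n_p, 1))        # person effects
     + rng.normal(0, np.sqrt(.25), (1, n_a))      # replicate effects
     + rng.normal(0, np.sqrt(.25), (n_p, n_a)))   # interaction + residual
var_p, var_a, var_res = variance_components(X)
```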
Of course, the example above is introduced primarily for pedagogical purposes; the real strength
of the random-effects formulation is that the model above is easily extended to measurement proce-
dures with multifaceted replicates (e.g., replicates that reflect question-rater pairs). For example, the
observed score (X) for a given person p that is produced by any given replicate defined by measure-
ment facets “A” and “B” (e.g., “A” might reflect questions and “B” might reflect raters) is assumed
to be an additive function:

Xpab = b0 + up + ua + ub + upa + upb + uab + epab (2.4)
A key difference to point out between the models specified in Equations 2.3 and 2.4 is the
interpretation of the main effects for individuals. Once again, up is the main effect of person p, but
conceptually it is the expected value of p’s score across the population of replicates defined by fac-
ets A and B. Thus, although the up term in Equations 2.3 and 2.4 provides an analogue to the true
score, the substance of true scores differs depending on the nature of the population(s) of replicates
of interest. Extending this model beyond two facets (e.g., a situation in which replicates are defined
as a combination of questions, raters, and occasions) is straightforward and simply involves adding
main effect terms for the other facets and associated interaction terms (Brennan, 2001b).
One thing that is evident from the illustration of the G-theory model provided above is that,
unlike the CTT model, it is scalable; that is, it can expand or contract depending on the degree to
which replicates underlying a measurement procedure are faceted. Given its flexibility to expand
beyond simply a true and error component, the G-theory model potentially provides investigators
with several more components of variance to consider relative to the CTT model. For example,
using the interview example presented above, we could potentially decompose variance in interview
scores for applicant p on question q as rated by rater r into seven components.9
σ2X = σ2P + σ2Q + σ2R + σ2PQ + σ2PR + σ2QR + σ2PQR, Residual (2.5)
Recall from our discussion of CTT that the basic form reliability coefficients take on is σ2T/(σ2T +
σ2E). This raises the question: from the G-theory perspective, what sources of variance com-
prise σ2T and σ2E? As one might guess from the decomposition above, the G-theory model offers
researchers a great deal of flexibility when it comes to specifying what constitutes error vari-
ance and true score variance in any given situation. As we will demonstrate below, having this flex-
ibility is of great value. As alluded to in our opening paragraph, the sources of variance in scores
that are considered to reflect error (and true score for that matter), can differ depending on (a) the
8 The highest order interaction term and the residual term in G-theory models are confounded because such designs essen-
tially amount to having one observation per cell. Thus, in practice, it is not possible to generate separate estimates of
variance in X attributable to these two effects.
9 Two notes here: First, as we will discuss in the following sections, one’s ability to estimate each of these components
will be limited by the measurement design underlying one’s measurement procedure. The example here assumes a fully
crossed design, which will often not be the case in practice. Second, note that in Equation 2.5 we combine variance com-
ponents for the applicant-question-rater interaction and residual terms; this reflects the fact that these sources of variance
will not be uniquely estimable.
generalizations an investigator wishes to make regarding the scores; (b) how an investigator
intends to use the scores (e.g., for relative comparison among applicants or absolute comparison
of their scores to some set standard); and (c) characteristics of the data collection or measurement
design itself, which can limit an investigator’s ability to estimate various components of variance.
The idea of having flexibility in specifying what components of observed variance contribute to true
score and error is something that is beyond the CTT score model because it only partitions variance
into two components. In the following sections, we highlight how the G-theory model offers inves-
tigators flexibility for tailoring the composition of σ2T and σ2E to their situation.
Eρ2 = σ2P / (σ2P + [σ2PQ/nQ + σ2PR/nR + σ2PQR,Residual/nQnR]) = σ2T / (σ2T + σ2E) (2.6)
where the term in brackets represents σ2E, nQ is the number of interview questions, nR is the number
of raters, and nQnR is the product of the number of questions and raters.10 Note that increasing the
number of questions and/or raters will result in decreasing that part of error associated with ques-
tions and/or raters, respectively. The idea that G-theory allows for the scaling of these effects as a
function of the number of questions and raters sampled is analogous to the role of the Spearman-
Brown prophecy in CTT, in which the number of replicates that comprise a measurement procedure
directly affects the estimated reliability of scores produced by that procedure (Feldt & Brennan,
1989). The key difference here is that G-theory allows one to differentiate and examine the effect
that adjusting the sampling of different types of facets has for reliability (e.g., separately adjusting
the number of questions and raters), whereas the Spearman-Brown prophecy does not allow such
differentiation to occur. As such, applying the Spearman-Brown prophecy to estimate what the reli-
ability of scores would be if the length of a measure is changed can greatly mislead investigators if
the replicates that comprise that measure are multifaceted (Feldt & Brennan, 1989).
To illustrate, let us take the interview example offered above and say σ2P = .50, σ2PQ = .30,
σ2PR = .10, and σ2PQR,Residual = .10. Recall our interview is comprised of three questions and three rat-
ers (i.e., nine question-rater pairs serve as replicates). Using Equation 2.6, the estimated reliability
of the average rating across questions and raters would be .78 (σ2T = .50, σ2E = .14). Now, if we were
to ask what effect "doubling the length of the interview" would have on reliability, and we used the Spearman-Brown prophecy (i.e., 2Eρ2/[1 + Eρ2]) to answer that question, we would achieve an estimate of .88, which is analogous to what we achieve if we replaced nQ, nR, and nQnR in Equation 2.6 with 2nQ, 2nR, and 2nQ2nR. Note that the Spearman-Brown prophecy does not provide the estimated reliability for 18 question-rater pairs (i.e., double the existing number of replicates), but rather an estimated reliability for 36 question-rater pairs (i.e., six questions × six raters). As such, in this case, the Spearman-Brown formula gives us an estimated reliability if the effective length of the interview were quadrupled rather than doubled. Another shortcoming of the Spearman-Brown formula is that it fails to account for the fact that there are multiple ways one can effectively double the length of the interview, each of which may produce a different reliability estimate. For example, we can have 2 questions and 9 raters, which would give us 18 question-rater pairs and result in an average rating reliability of .75 on the basis of Equation 2.6. Alternatively, we can have 9 questions and 2 raters, which would also give us 18 question-rater pairs but result in an average rating reliability of .85 on the basis of Equation 2.6. Essentially there is no mechanism within the Spearman-Brown formula that accounts for the fact that facets may differentially contribute to error. As this example illustrates, making adjustments to the number of levels sampled for one facet (e.g., questions in this case) may have a much more profound effect on error than making adjustments to the number of levels sampled for other facets (e.g., raters) included in the design.
10 Although labeled as a "generalizability" coefficient, note that this formula provides an estimate of σ2T over σ2T + σ2E.
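The arithmetic in this example follows directly from Equation 2.6 and is easy to script. A sketch (the function name is ours):

```python
def g_coef(var_p, var_pq, var_pr, var_res, n_q, n_r):
    """Generalizability coefficient (Equation 2.6) for the average rating
    across n_q questions and n_r raters in a fully crossed design."""
    error = var_pq / n_q + var_pr / n_r + var_res / (n_q * n_r)
    return var_p / (var_p + error)

comps = (.50, .30, .10, .10)
base = g_coef(*comps, 3, 3)        # ~.78: three questions, three raters
quad = g_coef(*comps, 6, 6)        # ~.88: what Spearman-Brown implicitly projects
wide = g_coef(*comps, 2, 9)        # .75: 18 pairs via 2 questions, 9 raters
deep = g_coef(*comps, 9, 2)        # ~.85: 18 pairs via 9 questions, 2 raters
```

Note how the two ways of forming 18 question-rater pairs yield different estimates, because questions contribute more to error than raters in this example.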
Returning to our discussion of the dependency of σ2T and σ2E on the generalizations one wishes
to make regarding their scores, let us now say that a different investigator uses the same interview
procedure described above, but instead only wished to generalize scores from the procedure across
the population of raters. For example, this might be the case if the investigator feels that the ques-
tions get at different parts of the interpersonal skill construct, and as such does not wish to treat
inconsistency in scores across questions (for a given applicant) as error. In such a case, variance
associated with applicant main effects (σ2P) and a function of applicant-question interaction effects
(σ2PQ) would comprise σ2T, and variance associated with interactions between applicants and rat-
ers (σ2PR) and the applicant-rater-questions along with residual error applicant (σ2PQR,Residual) would
comprise σ2E (Brennan, 2001b; DeShon, 2002). In this situation, what the investigator is essen-
tially doing is examining the consistency of scores across raters on the basis of ratings that have
been averaged across the three interview questions—in G-theory this is known as fixing a facet of
measurement.11
11 Note the idea of fixing a facet of measurement for purposes of estimating σ2T and σ2E in the context of G-theory is different from modeling a factor or covariate as fixed in the context of mixed-effects models (DeShon, 2002; Searle, Casella, & McCulloch, 1992).
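Continuing the numerical example, fixing the question facet moves the averaged applicant × question component into the numerator. A hedged sketch of the resulting coefficient (following the decomposition described above; the function name is ours):

```python
def g_coef_fixed_questions(var_p, var_pq, var_pr, var_res, n_q, n_r):
    """Generalizability of the average rating when generalizing over raters
    only: questions are fixed, so var_pq/n_q joins "true" score variance."""
    true = var_p + var_pq / n_q
    error = var_pr / n_r + var_res / (n_q * n_r)
    return true / (true + error)

# With the components used earlier (.50, .30, .10, .10) and a 3 x 3 design,
# fixing questions yields about .93, higher than the ~.78 obtained when
# generalizing over both questions and raters.
fixed_q = g_coef_fixed_questions(.50, .30, .10, .10, 3, 3)
```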
they do in this example) is not addressed by CTT, because it is simply beyond the scope of the CTT
model to handle error of that type (Cronbach & Shavelson, 2004).
12 Another common ratings design faced in practice (particularly with job performance ratings) is one in which ratees are nested within raters (e.g., each group of incumbents is rated by their respective group supervisor). In this case, each ratee
has only one rater, and as such there is no way to distinguish between the σ2PR (typically considered a source of error)
and σ2P (typically considered true score variance). Thus, estimating inter-rater reliability on the basis of data structured
in this manner is not possible.
the applicant-rater interaction (σ2PR) would be just that, but variance attributable to applicant main
effects (σ2P) would also reflect applicant-occasion variance (σ2PO). This stems from the fact that
raters are observing a given applicant on the same occasion, and as such occasion of measurement
serves as a source of consistency in raters’ ratings that would not be present if raters conducted
separate interviews. In both of the examples above, σ2PO is not separable from the other source of
variance with which it is confounded. In the case of separate interviews, raters covary with occa-
sions; in the case of panel interviews, occasions are not replicated for a given applicant. Thus, these
examples illustrate how a measurement facet can hide in different ways to influence the substantive
meaning of σ2E (in the case of the separate interview) and σ2T (in the case of the panel interviews).
The examples above also serve to illustrate an important point—just because we cannot isolate
or estimate a source of variance underlying observed scores does not mean those sources of vari-
ance are not present and influencing our scores (Brennan, 2001b; DeShon, 1998; Feldt & Brennan,
1989; Schmidt & Hunter, 1996). Indeed, it is interesting to take the concept of hidden facets and
use them to frame some common measurement issues in personnel selection. For example, the
magnitude of person-rater interaction variance (σ2PR) in job performance ratings has been found to
be quite large (e.g., Scullen et al., 2000; Schmidt et al., 2000). However, if raters are viewing the
performance of individuals on (a) different occasions, (b) different tasks, and/or (c) different tasks
on different occasions, then part of what we typically label person-rater interaction variance may
actually also reflect several other sources of variance (e.g., person-occasion interaction variance,
person-task interaction variance, and person-task-occasion interaction variance). In other words,
the hidden facets of occasion and task might help explain the sizable person-rater interaction effects
often found in job performance ratings. In the context of assessment centers, hidden facets might
partially explain the common finding of the dominance of exercise effects over dimension effects
(Lance, 2008). For example, dimensions within exercises share an occasion of measurement in
common (and sometimes share raters as well), whereas dimensions in different exercises do not.
As such, all else being equal we would expect scores for dimensions within exercises to be more
consistent with each other than scores for dimensions in different exercises. Thus, what is inter-
preted as an exercise effect in the context of assessment center ratings may partially be explained by
hidden occasion and rater facets of measurement that serve to increase consistency among dimen-
sion scores within exercises relative to dimension scores across exercises (e.g., Cronbach, Linn,
Brennan, & Haertel, 1997). These examples illustrate the potential utility of framing common mea-
surement issues through the lens of hidden facets illuminated by G-theory.
Summary
In this section, we described perspectives on observed scores adopted by two measurement theories
that dominate current discussions of reliability. Through a single example, we illustrated the many
ways in which G-theory liberalizes not only the score model offered by CTT, but also the very per-
spective it offers on reliability. By no means did the discussion above fully illustrate how G-theory
is applied or how reliability coefficients based on G-theory are calculated. For such details, the
reader is referred to other treatments (Brennan, 2001b; DeShon, 2002; Haertel, 2006; Shavelson
& Webb, 1991). Nevertheless, given space constraints, this was not our intent. Rather, we tried, in
a way that was relatively free of G-theory jargon, to show how G-theory offers a way for framing
and dealing with measurement situations that CTT was designed to handle, as well as those that
CTT was never really designed to handle—a key reason why G-theory currently underlies modern
perspectives on reliability (Cronbach & Shavelson, 2004).
Estimation of Reliability
In the sections above, we have outlined conceptual and model-based perspectives on reliability
and measurement error. In this section, we address how these concepts and models translate into
methods for estimating reliability. Reliability is often summarized in terms of (a) coefficients rang-
ing from 0 to 1 or (b) standard errors of measurement (SEMs) expressed in a raw score metric.
Our focus is on the former, partly because of page limits and partly because the latter can typically be
calculated from components of the former.13 As noted earlier, under CTT and G-theory the goal of
reliability estimation is to estimate the ratio σ2T / (σ2T + σ2E). The following sections discuss methods
for estimating this ratio and components of it. Our intent here is not to provide a catalog of different
types of reliability coefficients, nor is it our intent to provide a cookbook on how to estimate reliability
in any given situation. Indeed, as should be clear from the previous section, doing so would not be
fruitful given that the composition of σ2T and σ2E in any given situation partly reflects the aims of
the individual investigator. Rather, we focus on comparing and contrasting different historical tradi-
tions on estimating reliability, examining the pros and cons of each, and speaking to their equifinal-
ity under certain conditions.
The extant literature on reliability estimation is characterized by a multitude of loosely orga-
nized coefficients and estimation methods. Historically, the psychometric literature tended to
organize discussions of reliability estimation in terms of categories or types of reliability (e.g.,
test-retest reliability, split-half, parallel-forms, coefficients of equivalence, stability, precision;
Cronbach, 1947; Gulliksen, 1950). With the advent of G-theory, psychometricians have slowly
gravitated away from categories or types of coefficients that characterized early test theory
because “the categories may now be seen as special cases of a more general classification, gen-
eralizability coefficients” (AERA et al., 1999, p. 27). As Campbell (1976) noted, the G-theory
model “removes the somewhat arbitrary distinctions among coefficients of stability, equivalence,
and internal consistency and replaces them with a general continuum of representativeness”
(p. 202). Interestingly, this movement toward a unitarian perspective on reliability has temporally
coincided with the movement from trinitarian to unitarian perspectives on validity (Brennan,
2006). Ironically, unlike our views on validity, our formal treatments of reliability estimation in
organizational research have remained focused on categories or types of reliability coefficients
(e.g., Aguinis, Henle, & Ostroff, 2001; Guion, 1998; Le & Putka, 2007; Ployhart, Schneider, &
Schmitt, 2006; Schmidt & Hunter, 1996).14 Rather than continuing to bemoan the current state
of affairs, we offer an alternative way of framing discussions of estimating reliability that may
help bring organizational research, practice, and pedagogy more in line with modern psychomet-
ric thought. Before doing so, we offer a quick example to help illustrate the rationale behind the
structure we offer below.
When calculating existing types of reliability coefficients, such as a simple Pearson correlation
calculated between two replicates (Brown, 1910; Spearman, 1910), coefficient alpha (Cronbach,
1951), or intraclass correlation (ICC; Shrout & Fleiss, 1979)—with which most investigators are familiar—it is important to remember that these are just sets of mathematical operations that can be
applied to any set of replicates of our choosing (e.g., raters, items, tasks, occasions). They will all
produce, to varying degrees of quality (depending on the properties of the underlying data and con-
struct being measured), estimates of the ratio σ2T/(σ2T + σ2E). As noted above, the substantive mean-
ing of σ2T and σ2E will depend in large part on the types of replicates to which the mathematical
operations are applied. For example, if we apply them to replicates defined as items, σ2T will reflect
consistency across items; if we apply them to replicates defined as occasions, σ2T will reflect consis-
13 We refer the interested reader to Brennan (1998), Haertel (2006), and Qualls-Payne (1992) for modern treatments of
SEMs in the context of CTT and G-theory. One advantage of SEMs over reliability coefficients is that they can be tai-
lored to individuals being measured (e.g., differential amounts of error depending on individuals’ level of true score),
whereas reliability coefficients are typically associated with groups of individuals. The latter is often cited as one benefit
of IRT-based perspectives on measurement over CTT- and G-theory-based perspectives; however, CTT and G-theory
also offer methods for generating individual-level SEMs (Haertel, 2006).
14 Our speculation on why this occurred is (a) the perceived complexity and jargon-loaded nature of G-theory (DeShon,
2002), (b) the overarching dominance of the correlational research tradition underlying selection research and practice
(Cronbach, 1957; Dunnette, 1966; Guion, 1998), and (c) the steady decline of teaching psychometrics and statistics in our
graduate programs since the 1970s (Aiken et al., 2008; Merenda, 2007).
tency across occasions; if we apply them to replicates defined as raters, σ2T will reflect consistency
across raters; and so on and so forth.15 Unfortunately, our literature has a tendency to associate
certain types of coefficients with certain types of replicates (e.g., coefficient alpha with items, ICCs
with raters). This is unfortunate and misleading, because this simply reflects the type of replicate
with which these procedures happened to be introduced by earlier authors. Computationally, the
procedures are blind to the types of replicates to which they are applied, and many are algebraically
identical (Cronbach & Shavelson, 2004; Feldt & Brennan, 1989). For example, alpha is a specific
type of ICC, and all ICCs can be framed as generalizability coefficients (Brennan, 2001b; McGraw
& Wong, 1996). We organize our discussion below around three traditions for estimating reliability.
The classical tradition largely attempts to estimate reliability directly, with little attention toward
estimating components of it. More modern traditions (e.g., those based on random-effects models
and CFA models) attempt to generate estimates of σ2T and σ2E, or components of them, which gives
investigators flexibility to combine components in different ways to calculate reliability estimates
appropriate for their situation and achieve a better understanding of the sources of error (and true
score) in their measures.
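To make the algebraic identity between alpha and the ICC concrete, the following Python sketch (our illustration, using simulated data; the function names are ours) computes coefficient alpha and ICC(C,k), the two-way consistency ICC for average measures, on the same person × replicate score matrix:

```python
import numpy as np

def coefficient_alpha(x):
    """Cronbach's alpha: x is an n_persons x k_replicates score matrix."""
    n, k = x.shape
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def icc_consistency_avg(x):
    """ICC(C,k) from two-way (persons x replicates) ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ss_person = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_item = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ms_person = ss_person / (n - 1)
    ms_error = (ss_total - ss_person - ss_item) / ((n - 1) * (k - 1))
    return (ms_person - ms_error) / ms_person

rng = np.random.default_rng(7)
true_score = rng.normal(size=(200, 1))                       # persons' true scores
scores = true_score + rng.normal(scale=1.0, size=(200, 5))   # 5 replicates per person

print(coefficient_alpha(scores), icc_consistency_avg(scores))
```

The two values agree to machine precision, illustrating that the distinction between "alpha" and "ICC" is a matter of labeling and history, not computation.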
Classical Tradition
This classical tradition has its roots in using Pearson correlation between two replicates (e.g., split-
halves of a single test, tests administered on two different occasions) to estimate reliability (Brown,
1910; Spearman, 1910). It is based on the premise that the correlation between two strictly paral-
lel replicates (e.g., split-halves of a test, the same test administered on two occasions) equals the
proportion of observed score variance attributable to true scores from a single replicate. If applied
to split-halves of a test, the Spearman-Brown prophecy formula would then be used to “step-up”
the said correlation to arrive at an estimate of reliability for scores produced by the full test. The
primary strength of estimation methods based on this tradition is their simplicity and widespread
familiarity. Pearson correlations are easy to calculate and widely used in selection research and
practice (Schmidt & Hunter, 1996).
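As an illustration of this tradition, the split-half correlation and its Spearman-Brown step-up can be sketched in a few lines of Python (simulated data; the function name is ours):

```python
import numpy as np

def spearman_brown(r, factor=2.0):
    """Step up reliability r of one replicate to a composite `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)

rng = np.random.default_rng(1)
true = rng.normal(size=500)
half1 = true + rng.normal(scale=1.0, size=500)   # score on one half-test
half2 = true + rng.normal(scale=1.0, size=500)   # score on the parallel half
r_half = np.corrcoef(half1, half2)[0, 1]         # reliability of a single half
full_test_rel = spearman_brown(r_half)           # stepped-up reliability of the full test
print(round(r_half, 3), round(full_test_rel, 3))
```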
Early psychometricians realized that the Spearman-Brown approach described above becomes
unwieldy in situations dealing with more than two replicates (e.g., a 10-item conscientiousness
scale). Specifically, they realized that depending on which split-halves of their test they calculated
their correlation on, they would get a different estimate of reliability (Kuder & Richardson, 1937).
In light of this difficulty, researchers developed alternative approaches to estimating reliability
that were a function of replicate variances (e.g., item variances) and observed score variances
(e.g., variance of the full test). These approaches provided a computationally simple solution
that could easily accommodate measures involving two or more single-faceted replicates and are
reflected in Kuder and Richardson’s (1937) KR-20, Guttman’s (1945) set of lambda coefficients,
and Cronbach’s coefficient alpha (Cronbach, 1951).16, 17 Another positive characteristic of these
latter approaches relative to the Spearman-Brown prophecy is that they require replicates to be only essentially tau-equivalent, as opposed to strictly parallel (Novick & Lewis, 1967), although
subsequent research has found that alpha is robust to violations of essential tau-equivalence
(Haertel, 2006).
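The dependence of split-half estimates on the particular split is easy to demonstrate. The following Python sketch (our illustration, with simulated data) computes the stepped-up split-half estimate for every possible halving of a six-item measure:

```python
import numpy as np
from itertools import combinations

def split_half_estimates(x):
    """Spearman-Brown stepped-up reliability for every way of halving the items."""
    n, k = x.shape
    items = range(k)
    estimates, seen = [], set()
    for half in combinations(items, k // 2):
        other = tuple(i for i in items if i not in half)
        if other in seen:          # (A,B) and (B,A) describe the same split
            continue
        seen.add(half)
        r = np.corrcoef(x[:, half].sum(axis=1), x[:, other].sum(axis=1))[0, 1]
        estimates.append(2 * r / (1 + r))
    return estimates

rng = np.random.default_rng(42)
true = rng.normal(size=(300, 1))
x = true + rng.normal(scale=1.2, size=(300, 6))   # a 6-item measure
ests = split_half_estimates(x)
print(len(ests), round(min(ests), 3), round(max(ests), 3))
```

The minimum and maximum estimates differ, which is exactly the indeterminacy that motivated KR-20 and alpha.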
15 Given the discussion raised earlier, σ2T in any of these examples may also reflect variance attributable to one or more
hidden facets of measurement.
16 On a historical note, Cronbach did not invent coefficient alpha per se—Guttman's (1945) L3 coefficient and Hoyt's (1941) coefficient are algebraically identical to alpha and were introduced long before Cronbach's (1951) landmark article.
17 We should note that a subtle difference between Pearson r-based indices of reliability and those noted here (i.e., KR-20, Guttman's lambdas, alpha) is that the latter assess the additive relationship between replicates, whereas Pearson r assesses
the linear relationship between replicates. Differences in the variances of replicates will reduce alpha and other additive
reliability indices, but they will have no effect on Pearson r-based indices because the latter standardizes any variance
differences between replicates away (McGraw & Wong, 1996).
Reliability and Validity 25
Unfortunately, all of the classical estimation approaches described above, from Spearman-Brown
through coefficient alpha, are limited in some important ways. As noted earlier, the CTT model on
which these coefficients are based was developed for use with measurement procedures involving
single-faceted replicates that were fully crossed with one’s objects of measurement (Cronbach &
Shavelson, 2004). The simplicity of calculating a Pearson r, the Spearman-Brown prophecy, and
alpha belies interpretational and statistical problems that arise if one attempts to apply them to
replicates that are (a) not fully crossed with one’s objects of measurement or (b) multifaceted in
nature. As Cronbach and Shavelson (2004) noted in their discussion of the possibility of applying
alpha to replicates that are not fully crossed, “Mathematically, it is easy enough to substitute scores
from a nested sample matrix by simply taking the score listed first for each (person) as belonging in
Column 1, but this is not the appropriate analysis” (p. 400). Nevertheless, application of such clas-
sical estimation methods, regardless of a procedure’s underlying design, has been common practice
in organizational research (Viswesvaran, Schmidt, & Ones, 2005).
To illustrate the problems that arise when classical estimation methods are applied to measure-
ment designs that are not fully crossed, consider an example in which job incumbents are each
rated by two raters on their job performance. Some incumbents may share one or more raters in
common, whereas others may share no raters in common. In this case, standard practice is to (a)
randomly treat one rater for each ratee as “rater 1” and the other as “rater 2,” (b) assign the rat-
ings of “rater 1” to column 1 and the ratings of “rater 2” to column 2 in a data set, (c) calculate the
Pearson correlation between columns to estimate the reliability of a single-rater’s ratings, and then
(d) use the Spearman–Brown prophecy on the said correlation to estimate the reliability for the
average rating (Viswesvaran et al., 2005). Putka et al. (2008) elaborated on several problems with
this common practice, namely (a) the estimates derived from this process can differ depending on
the assignment of raters to columns 1 and 2 for each ratee; (b) Pearson r fails to account for the fact
that residual errors are nonindependent for ratees that share one or more raters in common, which
leads to a downward bias in estimated true score variance (σ2T) (Kenny & Judd, 1986); and (c) the
Spearman–Brown prophecy inappropriately scales the contribution of rater main effect variance to
error variance (σ2E) as a function of the number of raters per ratee, rather than the amount of over-
lap between sets of raters that rate each ratee, leading to an overestimate of σ2E (see also, Brennan,
2001b, p. 236). In addition to these points, Putka and his colleagues offer a solution for dealing
with this type of design that is based on the random-effects model tradition of estimating reliability
(discussed later).
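A small simulation (ours, not Putka et al.'s data) makes problem (a) concrete: when raters carry main effects and the design is ill-structured, the Pearson r obtained under standard practice depends on the arbitrary assignment of each ratee's raters to columns 1 and 2:

```python
import numpy as np

rng = np.random.default_rng(3)
n_ratees, n_raters = 60, 15
true = rng.normal(size=n_ratees)                       # ratee true performance
rater_effect = rng.normal(scale=0.8, size=n_raters)    # rater leniency/severity

# Each ratee is rated by two raters drawn from a shared pool (ill-structured design).
pairs = [rng.choice(n_raters, size=2, replace=False) for _ in range(n_ratees)]
ratings = [
    (true[i] + rater_effect[r1] + rng.normal(scale=0.5),
     true[i] + rater_effect[r2] + rng.normal(scale=0.5))
    for i, (r1, r2) in enumerate(pairs)
]

# "Standard practice": randomly assign each ratee's two raters to columns 1 and 2,
# then correlate the columns. Repeating the assignment shows how arbitrary the result is.
estimates = []
for _ in range(200):
    cols = np.array([pair if rng.random() < 0.5 else pair[::-1] for pair in ratings])
    estimates.append(np.corrcoef(cols[:, 0], cols[:, 1])[0, 1])
print(round(min(estimates), 3), round(max(estimates), 3))
```

The spread between the minimum and maximum estimate is attributable entirely to the column assignment, not to the data.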
Second, with regard to the problem of applying classical methods to multifaceted replicates, the
task-rater example presented earlier clearly showed the hazards of blindly applying alpha to repli-
cates of such nature. However, it would be fallacy to suggest that investigators who adopt classical
methods would actually apply alpha or other classical methods in such a manner. Indeed, early
psychometricians seemed acutely aware of the limitations of the CTT model, and calculating dif-
ferent types of coefficients was the way early psychometricians attempted to deal with the inability
of the CTT model to account for multifaceted replicates. For example, Cronbach (1947) discussed
the coefficient of equivalence and stability (CES), which was calculated by correlating two different
forms of a measure completed by the same respondents on two different occasions (i.e., replicates
defined by form-occasion combinations). Cronbach later realized that emergence of the G-theory
score model in the 1960s eliminated the need to “mix and match” pairs of replicates like this and
provided a generalized solution that applied regardless of whether one was dealing with single-
faceted or multifaceted replicates and regardless of whether one was dealing with crossed or non-
crossed designs (Cronbach & Shavelson, 2004).
Although the tradition of using coefficients such as CES to deal with multifaceted replicates
has faded in psychometrics, it has continued to characterize organizational research and practice,
because we have continued to frame problems of reliability in a way that, for better or worse,
resembles the psychometric literature of the 1940s. For example, Schmidt and his colleagues have
demonstrated how, in the context of fully crossed designs, one can calibrate different sources of
error in scores (e.g., error arising from inconsistencies across items, occasions, raters, etc.) through
the addition and subtraction of Pearson correlations applied to different types of replicates (Schmidt
et al., 2000). Indeed, for fully crossed designs, Schmidt and others illustrated how one can arrive
at estimates for at least some of the variance components estimable based on the random-effects
model underlying G-theory (Schmidt, Le, & Ilies, 2003; Le, Schmidt, & Putka, 2009). However,
it is important to note that the calibration methods based on the classical coefficients alluded to
above will not be able to estimate all components of variance that a given measurement design
may support estimating—even if the design is fully crossed. For example, such methods cannot
be used to estimate the unique contribution of facet main effects (e.g., rater main effects, question
main effects) or interactions among facets (e.g., question-rater effects). Lacking this flexibility is
unfortunate, particularly if one is interested in (a) comparing scores to standards (e.g., cut-off score)
rather than simply making relative comparisons among individuals or (b) simply gaining a more
comprehensive understanding of the sources of variance underlying scores. Remember that the
CTT score model that gave rise to the classical coefficients discussed above was never designed to
account for main effects of measurement facets, largely because they were assumed not to exist (e.g.,
recall parallel measures have equal means) and because they were not of interest in the problems
that Spearman and other early psychometric researchers concerned themselves with (Cronbach &
Shavelson, 2004).
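The calibration logic of Schmidt and colleagues can be sketched as follows (our summary and notation; cf. Schmidt, Le, & Ilies, 2003). With two parallel forms each administered on two occasions in a fully crossed design, the coefficient of equivalence (CE) is the correlation between forms within an occasion, and the CES is the correlation between different forms on different occasions. Because the CE reflects random-response and specific-factor error while the CES additionally reflects transient error, their difference isolates the transient component:

```latex
\begin{align*}
r_{\mathrm{CE}}  &= \operatorname{corr}(\text{form } 1,\ \text{form } 2 \mid \text{same occasion})\\
r_{\mathrm{CES}} &= \operatorname{corr}(\text{form } 1,\ \text{occasion } 1;\ \text{form } 2,\ \text{occasion } 2)\\
\frac{\hat{\sigma}^{2}_{\text{transient}}}{\sigma^{2}_{X}} &\approx r_{\mathrm{CE}} - r_{\mathrm{CES}}
\end{align*}
```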
into the psychometric and I-O literature, because many still seem to equate G-theory with ANOVA-
based variance component estimation procedures that characterized G-theory upon its introduction
to the literature.
Procedures for the direct estimation of variance components that underlie all reliability coef-
ficients are now widely available in common statistical packages (e.g., SAS, SPSS) and allow inves-
tigators to estimate variance components with a few clicks of a button. DeShon (2002) and Putka
and McCloy (2008) provided clear examples of the ease with which variance components can
be estimated within SAS and SPSS. As such, modern methods of variance component estima-
tion are far easier to implement than (a) procedures characteristic of the early G-theory literature
and (b) the calibration techniques discussed by Schmidt et al. (2000), which would require an
investigator to engage in a series of manipulations with various types of coefficients arising out
of the classical tradition. In addition to offering parsimony, modern methods of variance compo-
nent estimation have another key advantage: they can readily deal with missing data and unbal-
anced designs characteristic of organizational research (DeShon, 1995; Greguras & Robie, 1998;
Marcoulides, 1990; Putka et al., 2008). In contrast, ANOVA-based variance component estima-
tors characteristic of the early G-theory literature are not well equipped to handle such messy
designs. Indeed, when confronted with such designs, advocates of G-theory have often suggested
discarding data to achieve a balanced design for purposes of estimating variance components (e.g.,
Shavelson & Webb, 1991)—with modern methods of variance component estimation, the need
for such drastic steps has subsided. The most notable drawback of modern methods of variance
component estimation—largely based on full or restricted maximum likelihood—is that they can
involve rather substantial memory requirements for large measurement designs (Bell, 1985; Littell,
Milliken, Stroup, & Wolfinger, 1996). In some cases, such requirements may outstrip the memory
that Windows-based desktop computers can currently allocate to programs for estimating variance
components (e.g., SAS and SPSS).
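For readers without access to SAS or SPSS, the following Python sketch (ours, with simulated data) illustrates expected-mean-squares estimation of variance components for a balanced, fully crossed persons × raters design, together with the generalizability (relative) and dependability (absolute) coefficients built from them. ANOVA-style estimators such as this are exactly the procedures that break down for unbalanced or missing data, where REML-based routines are preferable:

```python
import numpy as np

def crossed_vc(x):
    """Expected-mean-squares variance-component estimates for a balanced,
    fully crossed persons x raters design with one observation per cell.
    The person x rater interaction is confounded with residual error.
    Returns (sigma2_person, sigma2_rater, sigma2_residual)."""
    n, r = x.shape
    grand = x.mean()
    ss_p = r * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_r = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ms_p = ss_p / (n - 1)
    ms_r = ss_r / (r - 1)
    ms_e = (ss_total - ss_p - ss_r) / ((n - 1) * (r - 1))
    return max((ms_p - ms_e) / r, 0.0), max((ms_r - ms_e) / n, 0.0), ms_e

rng = np.random.default_rng(11)
n, r = 500, 4
x = (rng.normal(scale=1.0, size=(n, 1))      # person effects (variance 1.00)
     + rng.normal(scale=0.7, size=(1, r))    # rater main effects (variance 0.49)
     + rng.normal(scale=0.9, size=(n, r)))   # residual (variance 0.81)
s2_p, s2_r, s2_e = crossed_vc(x)
g_rel = s2_p / (s2_p + s2_e / r)             # generalizability (relative) coefficient
phi = s2_p / (s2_p + (s2_r + s2_e) / r)      # dependability (absolute) coefficient
print(round(s2_p, 2), round(s2_r, 2), round(s2_e, 2), round(g_rel, 3), round(phi, 3))
```

Note that rater main effects count against the dependability coefficient but not the relative one, which is why the two diverge whenever raters differ in leniency.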
The strengths of reliability estimation methods based on the random-effects model tradition
relative to the classical tradition are substantial. First, they fully encompass classical methods in
that they can be used to estimate reliability for measurement procedures involving single-faceted
replicates that are fully crossed with one’s object of measurement. Second, unlike classical meth-
ods, they can easily be used to formulate reliability estimates for measurement procedures involving
multifaceted replicates in which the facets are crossed, nested, or any combination thereof. Third,
the random-effects tradition provides investigators not only with coefficients, but also the variance
components that underlie them. As Cronbach and Shavelson (2004) stated, “Coefficients (reliability)
are a crude device that do not bring to the surface many subtleties implied by variance components”
(p. 394). Variance components allow researchers to get a much finer appreciation of what com-
prises error than simply having one omnibus estimate of error. Readers interested in learning more
about formulation of reliability estimates via variance components estimated by random-effects
models—or more generally, G-theory—are referred to DeShon (2002), Haertel (2006), Putka et al.
(2008), and Shavelson and Webb (1991). For a more thorough technical presentation, one should
consult Brennan (2001b).
they offer no clear mechanism for (a) testing or dealing with violations of CTT and G-theory
measurement model assumptions and (b) specifying or testing alternative factorial composi-
tions of true score—both of which have fundamental implications for the interpretation of reli-
ability estimates.18 It is in this regard that CFA approaches to estimating reliability are strong
(McDonald, 1999).
Unlike reliability estimation approaches born out of the classical and random-effects traditions,
CFA-based approaches force investigators to be specific about the substantive nature of the latent
structure underlying their replicates (indicators, in CFA terms). For example, CFA forces them to
face questions such as:
• Is the covariance shared among replicates (i.e., true score variance from the perspective of
classical and random-effects approaches) accounted for by a single latent true score factor
or multiple latent factors?
• Do indicators of the latent true score factor(s) load equally on that/those factor(s) (i.e., are
they at least essentially tau-equivalent) or is their heterogeneity in factor loadings (i.e., sug-
gesting they are not at least essentially tau-equivalent)?
• What proportion of true score variance (as defined in CTT and G-theory) reflects the
effects of a single latent factor, as opposed to residual covariances?
Although such questions have implications for reliability estimates arising from the classical and
random-effect traditions, neither of these traditions has a built-in mechanism for addressing them.
In essence, they ascribe all shared covariance among replicates to a latent entity (e.g., true score)
regardless of whether it stems from a single factor or multiple factors. Thus, in some ways CFA can
be seen as a way of clarifying the factorial composition of true score variance as conceived by CTT
and G-theory measurement models. One may argue that such clarification is more an issue of valid-
ity rather than reliability (e.g., Schmidt et al., 2000); however, as we discuss below, the dimensional-
ity of the focal construct of interest has implications for the accuracy of reliability estimates based
on the classical and random-effects traditions (Lee & Frisbie, 1999; Rae, 2007; Rogers, Schmitt, &
Mullins, 2002).
The CFA tradition of reliability estimation arose out of Joreskog’s (1971) work on the notion
of congeneric tests discussed earlier. To illustrate, consider a situation in which we administer a
10-item measure of agreeableness to a sample of job applicants. In this case, our replicates are
single-faceted and defined in terms of items, and those items are fully crossed with our objects of
measurement—applicants. From the CFA perspective, we might view the replicates as indicators of
a latent factor representing true score, and then fit a model to the data such that the variance of the
latent factor is set to one, and the factor loadings and unique variances are freely estimated. On the
basis of such a model, the estimated reliability of the sum of the k replicates can be obtained via
\[ \omega \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}}{\left(\sum_{i=1}^{k}\lambda_i\right)^{2} + \sum_{i=1}^{k}\theta_{ii}} \;=\; \frac{\hat{\sigma}_T^{2}}{\hat{\sigma}_T^{2} + \hat{\sigma}_E^{2}} \qquad (2.7) \]
where λi represents the estimated factor loading for the ith of k replicates, and θii represents the estimated unique variance for the ith replicate (McDonald, 1999; Reuterberg & Gustafsson, 1992).19
18 One potential caveat to this regards the fairly recent ability of SAS and SPSS to fit random-effects models that allow for heterogeneity in variance component estimates (e.g., Littell et al., 1996; SPSS Inc., 2005). Such capabilities might be leveraged to test parallelism assumptions underlying the CTT and G-theory score models.
19 McDonald (1999) refers to this coefficient as "omega" (p. 89).
As with the application of classical and random-effects approaches to reliability, the substantive meaning of σ2T and σ2E based on this formulation will differ depending on the type of replicates to which
they are applied (e.g., items, raters, occasions, tasks, etc.). Thus, if applied to replicates defined by
items, the estimate of σ2T provided by the squared sum of loadings will reflect consistency among
items (attributable to the latent true score factor). If applied to replicates defined by raters, the esti-
mate of σ2T provided by the squared sum of loadings will reflect consistency among raters (again,
attributable to the latent true score factor).
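Equation 2.7 is straightforward to compute once a single-factor model has been fit. A Python sketch (our illustration; the loadings below are hypothetical values, not output from any fitted model):

```python
import numpy as np

def omega(loadings, uniquenesses):
    """McDonald's omega for a single-factor model (Equation 2.7)."""
    lam, theta = np.asarray(loadings), np.asarray(uniquenesses)
    true_var = lam.sum() ** 2          # squared sum of loadings -> estimated sigma2_T
    return true_var / (true_var + theta.sum())

# Hypothetical standardized loadings for a 4-item scale.
lam = [0.8, 0.7, 0.6, 0.5]
theta = [1 - l ** 2 for l in lam]      # uniquenesses implied by standardized loadings
print(round(omega(lam, theta), 3))
```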
A key benefit of the CFA approach described above is it will allow one to impose constraints
on parameter estimates (e.g., the λs and θs) that allow one to test various assumptions underlying
the CTT and G-theory score models (see Joreskog & Sorbom, 2001, pp. 124–128; Reuterberg &
Gustafsson, 1992; Werts, Linn, & Joreskog, 1974). If replicates are strictly parallel (an assumption
underlying the Spearman-Brown prophecy; Feldt & Brennan, 1989), they should have equal factor
loadings (i.e., λ1 = λ2 = … = λk) and equal unique variances (i.e., θ11 = θ22 = … = θkk). If replicates are tau-
equivalent or essentially tau-equivalent (an assumption underlying alpha and coefficients based on
the random-effects tradition), they should have equal factor loadings but their unique variances
can differ. To the extent that factor loadings vary across replicates (i.e., the replicates are not at
least essentially tau-equivalent), most reliability estimates based out of the classical and random-
effects tradition (e.g., alpha) will tend to be slightly downwardly biased (Novick & Lewis, 1967).20
Nevertheless, this common claim is based on the premise that the replicates on which alpha is
estimated are experimentally independent—from a CFA perspective this would imply there are no unmodeled sources of covariance among replicates after accounting for the latent true score factor
(Komaroff, 1997; Raykov, 2001a; Zimmerman, Zumbo, & LaLonde, 1993). In light of the fact
that many constructs of interest to organizational researchers are heterogeneous (e.g., situational
judgment) or clearly multidimensional (e.g., job performance), application of the formula shown in
Equation 2.7 would be questionable because it implies that a single common factor accounts for the
covariance among replicates, which in practice may rarely be true.
The observation above brings us to a critical difference between the CFA-based formulation of
reliability noted in Equation 2.7 and those based on the classical and random-effects traditions—
the former specifies a single latent factor as the sole source of covariance among replicates, and
as such only variance in replicates attributable to that factor is treated as true score variance.
Recall from the operational definition of true score offered earlier and the perspective on true
score offered by the CTT and G-theory score models that true score reflects all sources of consis-
tency across replicates. As Ghiselli (1964) noted, “The fact that a single term … has been used to
describe the amount of the trait an individual possesses should not be taken to imply that individual
differences in scores on a given test are determined by a single factor” (p. 220). The implications
of this are that whereas the CFA formulation above ignores any covariance among replicates that
is left over after extracting a first latent factor, classical coefficients such as alpha and coefficients
derived from the random-effects tradition lump such covariance into the estimate of true score
variance (Bost, 1995; Komaroff, 1997; Maxwell, 1968; Raykov, 2001a; Smith & Luecht, 1992;
Zimmerman et al., 1993).
This characteristic of the CFA approach offered above presents investigators with a dilemma:
Should residual covariance observed when adopting such an approach be treated as (a) error variance
(σ2E) or (b) a source of true score variance (σ2T)? In estimates of reliability based on the classical or
random-effects traditions, one does not have much of an option. True score variance as estimated
under each of those traditions reflects any source of consistency in scores, regardless of whether it
20 This downward bias arises from the fact that most reliability estimates based on these traditions rely on the average
covariance among replicates to make inferences regarding the magnitude of true score variance for each replicate. To
the extent that replicates are not essentially tau-equivalent, this average covariance will tend to underestimate true score
variance for each replicate (a component of true score variance of the composite of replicates), thus leading to a slight
underestimation of reliability when all other assumptions are met (e.g., uncorrelated errors among replicates) (Feldt &
Brennan, 1989; Raykov, 2001a).
stems from a first common factor, or what, in CFA-terms, would be viewed as residual covariance
or correlated uniquenesses (Komaroff, 1997; Scullen, 1999). However, with CFA, researchers have
the flexibility to distinguish between true score variance that (a) arises from the first common factor,
and (b) reflects residual covariance among replicates after extracting the first factor (Raykov, 1998;
2001b). Although in theory having this flexibility is valuable because it allows one insight into the
substance of true score variance, it also has practical benefits in that it can allow investigators to
tailor a reliability coefficient to their situation depending on the nature of the construct they are
assessing. To illustrate this flexibility, we offer three examples below that selection researchers and
practitioners may encounter.
First, let us say one (a) designs a measurement procedure to assess a unidimensional construct,
(b) uses a fully crossed measurement design comprising replicates defined by a single facet of
measurement (e.g., items) to assess it, (c) fits the single-factor CFA model described above to the
resulting data, and (d) finds evidence of residual covariance. Assuming there is no pattern to the
residual covariance that would suggest the presence of additional substantively meaningful factors, the investigator would likely desire to treat the residual covariance as a source of error variance
(σ2E) rather than a source of true score variance (σ2T).21 Fortunately, such residual covariance can
be easily incorporated into Equation 2.7 by replacing the term corresponding to the sum of unique
variances with a term that reflects the sum of unique variances and residual covariances or by
simply replacing the denominator with observed score variance (Komaroff, 1997; Raykov, 2001a).
If one were to calculate alpha on these same data, or fit a simple random-effects model to estimate
σ2T, such residual covariance would be reflected in σ2T as opposed to σ2E and thus would produce a
reliability estimate that is higher than the modified omega-coefficient described here when the sum
of the residual covariances is positive (lower when the sum is negative) (Komaroff, 1997). It is
important to note that the comparison made here between alpha and modified omega is based on the
assumption that the replicates in the analysis are at least essentially tau-equivalent. If the replicates
are not at least essentially tau-equivalent, then this would serve to lower the estimate of alpha, thus
either partially or completely offsetting any positive bias created by the presence of residual covari-
ance (Raykov, 2001a).
As another example, let us say one (a) designs a measurement procedure to assess a relatively
heterogeneous, but ill-specified construct (e.g., situational judgment) and again, (b) uses a fully
crossed measurement design comprising replicates defined by a single facet of measurement (e.g.,
scenarios) to assess it, (c) fits the single-factor CFA model described above to the resulting data,
and (d) finds evidence of residual covariance. In this case, the investigator may choose to treat
the residual covariance as a source of true score variance (σ2T) rather than error variance (σ2E).
Unlike the first example, given the heterogeneous nature of the situational judgment construct,
the investigator would not likely expect the covariance among scenarios to be accounted for by
a single factor. For example, the investigator may hypothesize that the scenarios comprising the
assessment vary in the degree to which various combinations of individual differences (e.g.,
interpersonal skill, conscientiousness, and general mental ability) are required to successfully
resolve them. As such, scores on scenarios may differentially covary depending on the simi-
larity of the individual difference profile required to resolve them. Under such conditions, one
would need more than a single factor to account for covariation among replicates, but given the ill-structured nature of the situational judgment construct, the investigator may not find strong evidence for a simple factor structure. As was the case with treating residual covariances as σ2E in
the previous example, Equation 2.7 can easily be modified to treat residual covariances as σ2T by
21 Even if there is a pattern to the residual covariances, the investigator might still wish to treat them as contributing to σ2E
if they reflect an artifact of the particular measurement situation (e.g., Green & Hershberger, 2000). This raises an impor-
tant point: The examples offered here are for illustration; they are not prescriptions for future research and practice.
Ultimately it will be the individual investigator who decides how to treat residual covariance given the characteristics of
the measurement situation he or she faces.
adding a term to the squared sum of loadings that reflects the sum of all residual covariances,
specifically
\[ \varpi' \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2} + \sum_{i=1}^{k}\sum_{j\neq i}^{k}\mathrm{Cov}\left(e_i, e_j\right)}{\left(\sum_{i=1}^{k}\lambda_i\right)^{2} + \sum_{i=1}^{k}\sum_{j\neq i}^{k}\mathrm{Cov}\left(e_i, e_j\right) + \sum_{i=1}^{k}\theta_{ii}} \;=\; \frac{\hat{\sigma}_T^{2}}{\hat{\sigma}_T^{2} + \hat{\sigma}_E^{2}} \qquad (2.8) \]
Note that treating residual covariance as σ2T represents a departure from how the CFA literature on
reliability estimation has generally espoused treating such covariance when estimating reliability
(e.g., Komaroff, 1997; Raykov, 2001b). Nevertheless, the perspective offered by these authors is
largely based on the assumption that the investigator is assessing a unidimensional construct. If
one were to calculate alpha on such replicates, or fit a random-effects model to estimate σ2T, the
covariance among residuals noted above would contribute to σ2T as opposed to σ2E and as such
would produce a coefficient similar in magnitude to what is provided by Equation 2.8 (Komaroff,
1997).
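Both treatments of residual covariance can be expressed in one small function. In the Python sketch below (ours; the numerical values are hypothetical), the k × k residual covariance matrix carries uniquenesses on its diagonal and residual covariances off the diagonal; a flag determines whether the off-diagonal sum counts toward σ2T (Equation 2.8) or toward σ2E (the modified Equation 2.7):

```python
import numpy as np

def omega_residuals(loadings, resid_cov, treat_resid_as_true):
    """Single-factor omega with residual covariances treated either as error
    (modified Eq. 2.7) or as true score variance (Eq. 2.8). `resid_cov` is the
    k x k residual covariance matrix: uniquenesses on the diagonal, residual
    covariances off the diagonal."""
    lam = np.asarray(loadings)
    resid = np.asarray(resid_cov)
    sq_sum = lam.sum() ** 2
    off = resid.sum() - np.trace(resid)       # sum of all off-diagonal residual covs
    total = sq_sum + off + np.trace(resid)    # observed composite variance
    true_var = sq_sum + off if treat_resid_as_true else sq_sum
    return true_var / total

lam = [0.7, 0.7, 0.6]
resid = np.array([[0.51, 0.05, 0.00],
                  [0.05, 0.51, 0.00],
                  [0.00, 0.00, 0.64]])
print(omega_residuals(lam, resid, False),   # residual covariance counted as error
      omega_residuals(lam, resid, True))    # residual covariance counted as true score
```

When the residual covariances sum to a positive value, the second treatment necessarily yields the larger coefficient, mirroring the comparison with alpha discussed in the text.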
Lastly, and as a third example, let us say one (a) designs a measurement procedure to assess a
multidimensional construct (e.g., job performance), (b) uses a fully crossed measurement design
comprising replicates defined by a single facet of measurement (e.g., items) to assess it, and (c)
samples content for the measure in a way that allows one to distinguish between different dimen-
sions of the construct (e.g., samples items corresponding to multiple job performance dimensions).
In this situation, one might be interested in estimating the reliability of scores on each dimension
of the construct separately, as well as estimating the reliability of a composite score based on the
sum of dimension-level scores (e.g., an overall performance score). To achieve a reliability estimate
for scores on the overall composite, the single-factor CFA model described would clearly not be
appropriate. Rather, a multifactor model may be fitted in which each factor reflects dimensions of
the construct being targeted by the measure. Indicators would be allowed to load only on those fac-
tors they are designed to reflect, and the reliability and true score variance of the overall composite
score would be a function of factor loadings and factor covariances (Kamata, Turhan, & Darandari,
2003; Raykov, 1998; Raykov & Shrout, 2002). Any residual covariance among indicators associ-
ated with a given factor could be treated as noted in the earlier examples (i.e., treated as σ2T or σ2E)
depending on how the investigator views such covariance in light of the substantive nature of the
target construct and particular measurement situation. Such multifactor models could also be used
to simultaneously generate separate estimates of reliability of scores for each dimension of the con-
struct (Raykov & Shrout, 2002).
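A sketch of the composite-reliability computation for such a multifactor model (our illustration, following the general logic of Raykov & Shrout, 2002; the loadings, factor correlations, and dimension labels below are hypothetical):

```python
import numpy as np

def composite_reliability(loadings, phi, theta):
    """Reliability of a unit-weighted composite under a multifactor CFA model.
    `loadings` is a k items x f factors matrix (simple structure), `phi` the
    f x f factor covariance matrix, `theta` the k unique variances."""
    L = np.asarray(loadings)
    col_sums = L.sum(axis=0)                          # sum of loadings per factor
    true_var = col_sums @ np.asarray(phi) @ col_sums  # composite true score variance
    return true_var / (true_var + np.asarray(theta).sum())

# Hypothetical two-dimension performance measure: 3 task items, 3 contextual items.
L = np.array([[0.7, 0.0], [0.6, 0.0], [0.8, 0.0],
              [0.0, 0.5], [0.0, 0.7], [0.0, 0.6]])
phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])
theta = 1 - (L ** 2).sum(axis=1)   # uniquenesses implied by standardized loadings
print(round(composite_reliability(L, phi, theta), 3))
```

With a single factor, this reduces to the omega of Equation 2.7, so the multifactor formulation can be viewed as its generalization.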
Although classical and random-effects traditions do not concern themselves with the factorial
composition of true score covariance as the CFA tradition does, estimation methods arising out of
the former traditions have developed to deal with reliability estimation for scores produced by mea-
sures that clearly reflect the composite of multiple dimensions. Such methods have typically been
discussed under the guise of (a) reliability estimation for measures stratified on content (e.g., items
comprising the measure were sampled to assess relatively distinct domains such as deductive and
inductive reasoning) or, more generally, (b) reliability estimation for composite scores (Cronbach,
Schoneman, & McKie, 1965; Feldt & Brennan, 1989). Here “composites” do not necessarily refer
to a compilation of items thought to reflect the same construct (i.e., replicates), but rather compila-
tions of measures designed to reflect distinct, yet related constructs (e.g., proficiency with regard to
different requirements for a trade or profession) or different components of a multidimensional con-
struct (e.g., task and contextual performance). Scores produced by such component measures may
differ in their reliability and their observed relation with one another. Sensitivity to such issues is
clearly seen in classical formulas for the reliability of composites such as stratified coefficient alpha
32 Handbook of Employee Selection
(Cronbach et al., 1965) and Mosier’s (1943) formula for the reliability of a weighted composite.22
In the case of stratified alpha and Mosier’s coefficient, σ2T for the overall composite score reflects
the sum of σ2T for each component of the composite and the sum of covariances between replicates
(e.g., items, raters) comprising different components.23 The fact that covariances between replicates
from different components of the composite contribute to true score variance has a very important
implication—these estimates will likely produce inflated estimates of reliability in cases in which
measures of each component share one or more elements of a facet of measurement (e.g., raters,
occasions) in common.
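The stratified-alpha logic can be illustrated with a small, purely hypothetical data set (the scores and subscale groupings below are invented; the formula is the standard one, with each stratum's variance weighted by one minus its own reliability):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Coefficient alpha. `items` is a list of per-item score lists,
    each giving one score per person."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

def stratified_alpha(strata):
    """Stratified alpha: 1 - sum_j var_j * (1 - alpha_j) / var_total,
    where var_j is the variance of stratum j's subscore."""
    sub_totals = [[sum(s) for s in zip(*items)] for items in strata]
    composite = [sum(t) for t in zip(*sub_totals)]
    num = sum(pvariance(tot) * (1 - cronbach_alpha(items))
              for tot, items in zip(sub_totals, strata))
    return 1 - num / pvariance(composite)

# Hypothetical content-stratified measure: two strata of two items each,
# four examinees (each inner list holds one item's scores across persons).
stratum_a = [[1, 2, 3, 4], [2, 1, 4, 3]]
stratum_b = [[1, 1, 4, 4], [1, 2, 3, 4]]
alpha_strat = stratified_alpha([stratum_a, stratum_b])
```

With inter-rater rather than internal-consistency estimates supplied for each component, the same weighting logic yields a Mosier-type composite reliability; as the text goes on to note, both assume uncorrelated errors across components.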
For example, consider a situation in which one gathers job performance ratings on two dimensions
of performance for each ratee—task performance and contextual performance. Assume that for any
given ratee, the same two raters provided ratings of task performance and contextual performance.
In this case, the measures of task and contextual performance share raters in common and as such
are “linked” (Brennan, 2001b). Thus, the covariation between task and contextual performance
in this example reflects not only covariance between their true scores but also covariance arising
from the fact that they share a common set of raters. Were we to apply stratified alpha or Mosier’s
formula to estimate the reliability of the composite score produced by summing across the two
dimensions (using inter-rater reliability estimates for each component in the aforementioned
formulas), covariance attributable to having a common set of raters would contribute to true score
variance, thus artificially inflating the reliability estimate (assuming we wish to generalize the
measures across raters). Stratified alpha and Mosier’s formula are based on the assumption that
errors of measurement associated with components that comprise the composite are uncorrelated;
to the extent they are positively correlated—a likely case when components share one or more
elements of a facet of measurement in common—the estimates they provide can be substantially
inflated (Rae, 2007). Outside of multivariate G-theory, which is neither widely used nor discussed in the
organizational research literature (Brennan, 2001b; Webb & Shavelson, 1981), there appear to be no
practical, straightforward analytic solutions to this situation on the basis of classical and random-
effects estimation traditions.
22 Note: Mosier's (1943) formula is equivalent to the formula for stratified coefficient alpha if elements comprising a com-
When replicates are multifaceted, those replicates that share a level of a given facet in common (e.g., replicates that share a common rater or occasion of measurement) will covary above and beyond any substantive factors (e.g., interpersonal skill, job performance) that underlie the replicates (DeShon, 1998).
There are numerous ways to account for multifaceted replicates within the CFA framework;
however, only recently have they begun to find their way into the literature (e.g., DeShon, 1998;
Green, 2003; Le et al., 2008; Marsh & Grayson, 1994; Marcoulides, 1996). Many of the methods
being espoused for handling multifaceted replicates in the context of CFA have their roots in the
literature on modeling of multitrait-multimethod data (e.g., Kenny & Kashy, 1992; Widaman, 1985).
For example, in the context of the interview example offered earlier, we might fit a model that not only includes a latent factor corresponding to the construct of interest (e.g., interpersonal skill) but also specifies latent factors that correspond to different raters or interview questions (e.g., all indicators associated with rater 1 would load on a "Rater 1" factor, all indicators associated with rater 2 would load on a "Rater 2" factor). Alternatively, one might allow uniquenesses for those indicators
that share a rater or question in common to covary and constrain those that do not to zero (e.g., Lee,
Dunbar, & Frisbie, 2001; Scullen, 1999). By fitting such models, one can derive estimates of vari-
ance components associated with various elements of one’s measurement design (e.g., person-rater
effects, person-question effects) that resemble what is achieved by fitting a random-effects model
to the data described earlier (e.g., DeShon, 2002; Le et al., 2008; Marcoulides, 1996; Scullen et al.,
2000). As illustrated earlier in our discussion of G-theory, these variance components can then be
used to construct reliability coefficients appropriate for one’s situation.
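That final step, turning variance components into reliability coefficients, can be sketched for the simplest fully crossed persons x raters design using expected mean squares (all ratings below are invented for illustration; a random-effects or CFA routine would recover the same components):

```python
from statistics import mean

# Hypothetical fully crossed design: rows are persons, columns are raters.
ratings = [[7, 8, 6],
           [5, 6, 4],
           [9, 9, 8],
           [4, 5, 3]]
n_p, n_r = len(ratings), len(ratings[0])

grand = mean(x for row in ratings for x in row)
p_means = [mean(row) for row in ratings]
r_means = [mean(col) for col in zip(*ratings)]

# Sums of squares for the two-way random-effects ANOVA (no replication)
ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)
ss_res = ss_tot - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance components via expected mean squares
var_res = ms_res                  # person x rater interaction + residual error
var_p = (ms_p - ms_res) / n_r     # person (universe-score) variance
var_r = (ms_r - ms_res) / n_p     # rater main effect (leniency/severity)

# G-theory style coefficients for the average across the n_r raters
g_relative = var_p / (var_p + var_res / n_r)
g_absolute = var_p / (var_p + (var_r + var_res) / n_r)
```

The relative coefficient generalizes scores for rank-order comparisons among persons; the absolute coefficient also charges rater main effects to error, which matters when scores are compared with a fixed cutoff.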
Unfortunately, as was the case with using classical reliability coefficients to calibrate various
sources of error in scores (e.g., Schmidt et al., 2000), CFA-based approaches to variance component estimation have a few drawbacks. First, they do not easily lend themselves to estimating variance attributable to (a) facet main effects (e.g., rater main effects, question main effects) or (b) interactions among measurement facets (e.g., rater-question interaction effects). Although it is possible to estimate such effects, doing so would require calculating covariances among persons (i.e., persons as columns/variables) across facets of measurement of interest (e.g., raters, questions) as opposed to the typical calculation of covariances among question-rater pairs (i.e., question-rater pairs are
treated as columns/variables) across objects of measurement (e.g., persons).24 Furthermore, it is not
clear how CFA could be leveraged to deal with designs that are more ill-structured in nature (e.g.,
Putka et al., 2008). For example, recall the example earlier where we had performance ratings for
a sample of incumbents that were rated by multiple raters, and the raters that rated each incumbent
varied in their degree of overlap. When confronted with such designs in past applications of
CFA, organizational researchers have generally resorted to random assignment of raters to columns
for each ratee (e.g., Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Scullen et al., 2000; Van
Iddekinge, Raymark, Eidson, & Attenweiler, 2004). As noted earlier, the drawback of doing this is
that it can produce results that (a) vary simply depending on how raters are assigned for each ratee
and (b) fail to account for the nonindependence of residuals for incumbents that share a rater in
common, which serves to downwardly bias estimates of true score variance (Kenny & Judd, 1986;
Putka et al., 2008).
Lastly, for better or worse, the literature on CFA offers myriad ways to parameterize a model to
arrive at variance component estimates, each of which has various strengths and weaknesses that
are still in the process of being ironed out (e.g., Eid, Lischetzke, Nussbeck, & Trierweiler, 2003;
Lance, Lambert, Gewin, Lievens, & Conway, 2004; Marsh & Grayson, 1994). With the random-
effects model discussed above, the number of alternative parameterizations (at least as currently
implemented in common statistical software such as SAS and SPSS) is quite limited. The difficulty
this creates when using CFA to estimate variance components is determining which parameteriza-
tion is most appropriate in a given situation, because a clear answer has not emerged and will likely
ultimately depend on characteristics of the construct being measured and characteristics of one’s
measurement situation (Marsh & Grayson, 1994). This is complicated by the fact that oftentimes the choice of which parameterization is adopted may be less a matter of substantive considerations and more a reflection of the parameterization that allowed the software fitting the model to converge to an admissible solution (Lance et al., 2004). Ultimately, such nuances, coupled with the complexity of CFA-based approaches to variance component estimation, may limit the utility of such approaches for reliability estimation in general selection research and practice.
24 Interested readers are referred to Hocking (1995) and Brennan (2001b, pp. 166–168).
TABLE 2.1
Relative Advantages and Disadvantages of Reliability Estimation Traditions

Characteristics of Estimation Tradition                                      Classical   Random Effects   CFA
Perceived simplicity                                                         Yes         No               No
Widely discussed in organizational literature on reliability estimation      Yes         No               No
Easily implemented with standard statistical software (e.g., SPSS, SAS)      Yes         Yes              No^a
Direct and simultaneous estimation of variance components underlying
  σ2T and σ2E                                                                No          Yes              Yes
Straightforward to apply to nested and ill-structured measurement designs
  confronted in applied organizational research and practice                 No          Yes              No
Capacity to isolate and estimate variance attributable to facet main
  effects and interactions among facets                                      No          Yes              No^b
Offers mechanism for testing and dealing with violations of CTT and
  G-theory measurement model assumptions                                     No          No^c             Yes
Offers mechanism for specifying and testing alternative factorial
  compositions of true score                                                 No          No               Yes
difficult because of ambiguity of what constitutes true score, particularly for measures of heteroge-
neous constructs. This is where the CFA tradition can offer an advantage; however, this advantage
does not come freely—its price is added complexity.
For single-faceted replicates that are fully crossed with one’s objects of measurement, CFA
methods are straightforward to apply and clear examples exist (e.g., Brown, 2006; McDonald,
1999). For multifaceted replicates, a systematic set of examples has yet to be provided for inves-
tigators to capitalize on, which is complicated by the fact that the CFA models can be param-
eterized in numerous different ways to arrive at a solution (Marsh & Grayson, 1994). This has
a tendency to restrict such solutions to psychometrically savvy researchers and practitioners.
Moreover, for the ill-structured measurement designs discussed by Putka et al. (2008), which are
all too common in selection research involving ratings (e.g., assessment centers, interviews, job
performance), it is not clear how the CFA models would overcome the issues raised. Thus we have
a tradeoff between the ease with which modern random-effects models and software can deal
with multifaceted measurement designs of any sort and the model fitting and testing capabilities
associated with CFA, which can serve to not only check on measurement model assumptions,
but also refine our specification (and understanding) of true score for measures of heterogeneous
constructs.
Although we have treated reliability estimation approaches arising out of classical, random
effects, and CFA traditions separately, it is important to recall how we began this section—all of
these traditions can be used to arrive at the same ratio—σ2T/(σ2T + σ2E). The substantive meaning
of σ2T and σ2E will of course depend on the type of replicates examined, the nature of the measure-
ment procedure, and the construct that one is assessing. Nevertheless, all of these traditions can
potentially be leveraged to arrive at an estimate of this ratio and/or components of it. How they
arrive at those estimates, the assumptions they make in doing so, and the strengths and weaknesses
of the methodologies they use is what differentiates them. In cases in which one has a measurement
procedure comprised of single-faceted replicates or multifaceted replicates in which facets are fully
crossed with one’s objects of measurement and one is interested solely in using scores to make
relative comparisons among objects of measurement (e.g., persons), much literature has accumu-
lated indicating that these traditions can produce very similar results, even in the face of moderate
violation of common tau-equivalence assumptions (e.g., Le et al., 2008; Reuterberg & Gustafsson,
1992).
For example, Brennan (2001b) and Haertel (2006) show how random-effects ANOVA models
may be used to estimate variance components and form reliability coefficients that are identical to
the types of reliability coefficients from the classical tradition (e.g., alpha, coefficients of equivalence
and stability). Marcoulides (1996) demonstrated the equivalence of variance components estimated
based on CFA and random-effects ANOVA models fitted to a multifaceted set of job analysis data.
Le et al. (2008) illustrated how one can arrive at similar variance component estimates using func-
tions of Pearson correlations, random-effects models, and CFA models. Lastly, Brennan (2001b)
and Hocking (1995) demonstrated how it is possible to generate variance component estimates
without even invoking the random-effects or CFA models but simply calculating them as functions
of observed variance and covariances (in some ways, akin to Schmidt et al., 2000). Each of these
works illustrates that under certain conditions, the three traditions discussed above can bring investigators to similar conclusions. However, as illustrated above, nuances regarding (a) the generalizations one wishes to make regarding their scores, (b) the intended use of those scores (e.g., relative
comparisons among applicants vs. comparisons of their scores to a fixed cutoff), (c) characteristics
of one’s measurement procedure itself (e.g., nature of its underlying design), and (d) characteristics
of the construct one is attempting to measure (e.g., unidimensional vs. multidimensional, homoge-
neous vs. heterogeneous) make some of these approaches more attractive than others under different
circumstances. Ideally, the well-informed investigator would be in a position to capitalize on the
relative strengths of these traditions when formulating reliability and variance component estimates
of interest given his/her situation.
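The kind of equivalence these authors describe is easy to verify directly in the simplest case: for a fully crossed persons x items design, coefficient alpha computed from the classical formula matches the coefficient built from random-effects mean squares (Hoyt's ANOVA formulation). A short sketch with invented scores:

```python
from statistics import mean, pvariance

# Invented persons x items score matrix (rows = persons, columns = items)
scores = [[2, 3, 3], [4, 5, 6], [1, 2, 2], [5, 5, 6]]
n, k = len(scores), len(scores[0])

# Classical route: alpha from item variances and total-score variance
items = list(zip(*scores))
totals = [sum(row) for row in scores]
alpha_classical = k / (k - 1) * (1 - sum(pvariance(i) for i in items)
                                 / pvariance(totals))

# Random-effects route: alpha as 1 - MS_residual / MS_persons (Hoyt's formulation)
grand = mean(x for row in scores for x in row)
ss_p = k * sum((mean(row) - grand) ** 2 for row in scores)
ss_i = n * sum((mean(col) - grand) ** 2 for col in items)
ss_res = sum((x - grand) ** 2 for row in scores for x in row) - ss_p - ss_i
alpha_anova = 1 - (ss_res / ((n - 1) * (k - 1))) / (ss_p / (n - 1))
```

The two routes agree to machine precision for such crossed designs; the traditions diverge, as the text notes, once designs become nested or ill-structured or once facet main effects must be charged to error.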
25 Per our discussion in earlier sections, we attribute this in part to the tendency of CTT-based treatments of reliability to
The above example might suggest that hierarchical linear models (HLMs) underlying multilevel
modeling efforts could provide a framework for modeling inconsistencies in scores. However, software
for fitting such models has historically been based on algorithms that have been optimized for deal-
ing with nested data structures (e.g., Bryk & Raudenbush, 1992), rather than the myriad of data struc-
tures that characterize measurement designs faced in research and practice. To model inconsistencies
in scores arising from myriad types of measurement designs, a more general modeling framework is
required. Along these lines, efforts are emerging on several fronts. For example, Putka, Ingerick, and
McCloy (2008a) elaborate on how linear mixed models (LMMs) offer a powerful framework for not
only partitioning error in ratings (per G-theory), but also modeling the substantive determinants of
components of error regardless of the measurement design underlying those ratings. Applied empiri-
cal examples of attempts to model inconsistency on the basis of LMMs, and more limited versions
of them are beginning to emerge (e.g., Hox, 1994; LaHuis & Avis, 2007; Laenen, Vangeneugden,
Geys, & Molenberghs, 2006; Putka et al., 2008b; Shrout, 1993; Van Iddekinge, Putka, Raymark, &
Eidson, 2005). Taking a slightly different tack, Engelhard (1994) and Patz, Junker, Johnson, and
Mariano (2002) have described IRT-based approaches to modeling inconsistencies in ratings. The
theme shared by all of these efforts is that they are attempting to model inconsistencies in measures
by capitalizing on modern analytic methods that post-date the development of both CTT and
G-theory, but are not easily placed in the psychometric curricula because such investigations may be
beyond the purview of traditional investigations of reliability and validity. Nevertheless, these may
represent the emergence of a refined psychometric paradigm that may prove particularly useful for
understanding and improving our most error-prone measurement procedures.
We find it deplorable … the measurement requirement occupies a median of only 4.5 weeks in the PhD
curriculum in psychology. A substantial fraction of programs offered no training in test theory or test
construction; only 46% of programs judge that the bulk of their graduates could assess the reliability
of their own measures (p. 43).
Under such conditions, it is nearly impossible for faculty to comprehensively integrate developments in the areas above into classroom discussions of psychometrics and to discuss their implications—almost out of necessity, we limit ourselves to basic treatment of age-old perspectives on measurement.
Couple this observation with the explosion of new statistical software and availability of new esti-
mation methods since the mid-1980s, and it creates a situation where staying psychometrically
current can be a challenge for those in academe, as well as those in practice. Of course, also com-
plicating the trends above is the course of normal science—which leads us to pursue incremental
research that serves to refine measurement models and the perspectives on reliability they offer but
does not emphasize integration of models and perspectives (Kuhn, 1962). Such a lack of integration is unfortunate because integration among psychometric models and perspectives can serve as a source of parsimony, which is critical when one has limited time to devote to such topics in the course of
graduate instruction and in the course of applied research and practice. We hope this treatment has
brought some degree of parsimony to what have often been treated as disparate, loosely related topics. Furthermore, we hope it casts developments in the area of reliability in a novel light for selection
researchers and practitioners and encourages us to explore and capitalize on modern methods for
framing reliability, error, and their underlying components.
Concept of Validity
Validity, according to the 1999 Standards for Educational and Psychological Testing, is “the degree
to which evidence and theory support the interpretations of test scores entailed by proposed uses of
tests” (AERA, APA, & NCME, 1999, p. 9). There is a long history and considerable literature on the
subject of validity. With limited space here, it is impossible to do justice to the subject. We attempt
to highlight a set of important issues in the ongoing development of thinking about validity, but we
direct the interested reader to a set of key resources for a strong foundation on the topic. One key
set of references is the set of chapters on the topic of validity in the four editions of Educational
Measurement, which is that field’s analog to the Handbook of Industrial and Organizational
Psychology. Cureton (1951), Cronbach (1971), Messick (1989), and Kane (2006) each offer detailed
treatment of the evolving conceptualizations of validity. Another key set focuses specifically on
validity in the context of personnel selection. There have been two prominent articles on validity in
the employment context published in the American Psychologist by Guion (1974) and Landy (1986).
There is also a very influential paper by Binning and Barrett (1989). A third key set is made up of
classic highly cited articles in psychology—Cronbach and Meehl’s (1955) and Loevinger’s (1957)
treatises on construct validity.
Our focus here is entirely conceptual. We do not address operational issues in the design of
studies aimed at obtaining various types of validity evidence. Rather, we attempt to outline a set of
issues that we view as central to an understanding of validity.
most commonly an individual; in other settings, a larger aggregate such as a classroom or a school)
from their scores on assessment devices. The generic question “Is this a valid test?” is not a use-
ful one; rather, the question is “Can a specified inference about the target of assessment be validly
drawn from scores on this device?” Several important notions follow from this position.
First, it follows that the inferences to be made must be clearly specified. It is often the case
that multiple inferences are proposed. Consider a technical report stating, “This test representatively
samples the established training curriculum for this job. It measures four sub-domains of job knowl-
edge, each of which is predictive of subsequent on-the-job task performance.” Note that three claims
are made here, dealing with sampling, dimensionality, and prediction, respectively. Each claim is
linked to one or more inferences about a test taker (i.e., degree of curriculum mastery, differentia-
tion across subdomains, relationships with subsequent performance, and incremental prediction of
performance across subdomains).
Second, support for each inference is needed to support the multifaceted set of claims made
about inferences that can be drawn from the test. Each inference may require a different type of
evidence. The claim of representative content sampling may be supported by evidence of the form
historically referred to as “content validity evidence,” namely, a systematic documentation of the
relationship between test content and curriculum content, typically involving the judgment of subject
matter experts. The claim of multidimensionality may be supported by factor-analytic evidence, and
evidence in support of this claim is one facet of what has historically been referred to as construct
validity evidence (i.e., evidence regarding whether the test measures what it purports to measure).
The claim of prediction of subsequent task performance may be supported by what has historically
been referred to as “criterion-related validity evidence,” namely, evidence of an empirical relation-
ship between test scores and subsequent performance. Note that the above types of evidence are
provided as examples—it is commonly the case that multiple strategies may be selected alone or
in combination as the basis for support for a given inference. For example, empirical evidence of a
test-criterion relationship may be unfeasible in a given setting because of sample size limitations,
and the investigator may turn to the systematic collection of expert judgment as to the likelihood
that performance on various test components is linked to higher subsequent job performance.
Third, it may prove the case that some proposed inferences receive support as evidence is gath-
ered and evaluated, whereas others do not. In the current example, what might emerge is strong
support for the claim of representative curriculum sampling and strong support for the claim of
prediction of subsequent performance, but evidence in support of a unidimensional rather than
multidimensional structure. In such cases, one should revise the claims made for the test; in this
case, dropping the claim that inferences can be drawn about differential standing on subdomains
of knowledge.
setting of primary interest for the current volume, the key inference is one of prediction: Can scores
from measures gathered before a selection decision be used to draw inferences about future job
behavior?
Over the last several decades, there has been a move from viewing these as types of validity to
types of validity evidence. All lines of evidence—content sampling, dimensionality, convergence
with other measures, investigations of the processes by which test takers respond to test stimuli, or
relations with external criteria—deal with understanding the meaning of test scores and the infer-
ences that can be drawn from them. Because construct validity is the term historically applied to
questions of the meaning of test scores, the position emerged that if all forms of validity evidence
contributed to understanding the meaning of test scores, then all forms of validity evidence were
really construct validity evidence. The 1999 edition of the Standards pushed this one step further:
If all forms of evidence are construct validity evidence, then “validity” and “construct validity”
are indistinguishable. Thus the Standards refer to “validity” rather than “construct validity” as the
umbrella term. This seems useful, because construct validity carries the traditional connotations
of referring to specific forms of validity evidence, namely convergence with conceptually related
measures and divergence from conceptually unrelated measures.
Thus, the current perspective reflected in the 1999 Standards is that validity refers to the eviden-
tiary basis supporting the inferences that a user claims can be drawn from a test score. Many claims
are multifaceted, and thus multiple lines of evidence may be needed to support the claims made
for a test. A common misunderstanding of this perspective on validity is that the test user’s burden
has been increased, because the user now needs to provide each of the types of validity evidence.
In fact, there is no requirement that all forms of validity evidence be provided; rather, the central
notion is, as noted earlier, that evidence needs to be provided for the inferences that one claims can
be drawn from test scores. For example, if one’s intended inferences make no claims about content
sampling, content-related evidence is not needed. If the claim is simply that scores on a measure
can be used to forecast whether an individual will voluntarily leave the organization within a year
of hire, the only inference that needs to be supported is the predictive one. One may rightly assert
that scientific understanding is aided by obtaining other types of evidence than those drawn on to
support the predictive inference (i.e., forms of evidence that shed light on the construct(s) underly-
ing test scores), but we view such evidence gathering as desirable but not essential. One’s obligation
is simply to provide evidence in support of the inferences one wishes to draw.
In employment settings, the most crucial inference to be supported about any measure is whether
the measure is job-related. Labeling a measure as job-related means “scores on this measure can
be used to draw inferences about an individual’s future job behavior”—we term this the “predictive
inference.” In personnel selection settings, our task is to develop a body of evidence to support the
predictive inference. The next section of this chapter outlines mechanisms for doing so.
Some potential confusion arises from the failure to differentiate between settings where types of
validity evidence are being used to draw inferences about the meaning of test scores rather than to
draw a predictive inference. For example, content-related validity evidence refers to the adequacy
with which the content of a given measure samples a specified content domain. Assume that one is
attempting to develop a self-report measure of conscientiousness to reflect a particular theory that
specifies that conscientiousness has four equally important subfacets: dependability, achievement
striving, dutifulness, and orderliness. Assume that a group of expert judges is given the task of
sorting the 40 test items into these four subfacets. A finding that 10 items were rated as reflecting
each of the four facets would support the inference of adequate domain sampling and contribute to
an inference about score meaning. Note that this inference is independent of questions about the job
relatedness of this measure. One could draw on multiple lines of evidence to further develop the
Reliability and Validity 41
case for this measure as an effective way to measure conscientiousness (e.g., convergence with other
measures) without ever addressing the question of whether predictive inferences can be drawn from
this measure for a given job. When one’s interest is in the predictive hypothesis, various types of
validity evidence can be drawn upon to support this hypothesis, as outlined below.
[Figure 2.1 schematic: boxes represent the observed predictor measure, observed criterion measure, latent predictor domain, and latent criterion domain. Linkage 1 joins the observed predictor measure and the observed criterion measure, established via a local validation study or via validity generalization/transportability evidence. Linkage 3 joins the latent predictor domain and the latent criterion domain, established via logic and judgment based on existing theory and the body of relevant empirical evidence.]
FIGURE 2.1 Routes to providing evidence in support of the predictive inference. (Adapted from Binning, J. F., & Barrett, G. V., Journal of Applied Psychology, 74, 478–494, 1989.)
measure is a direct sample of the criterion construct domain, then the predictive inference is estab-
lished based on expert judgment (e.g., obtained via a job analysis process) (Linkage 5 in Figure
2.1). Having an applicant for a symphony orchestra position sight read unfamiliar music is a direct
sample of this important job behavior. Having an applicant for a lifeguard position dive to the bot-
tom of a pool to rescue a simulated drowning victim is a simulation, rather than a direct sample of
the criterion construct domain. However, it does rely on domain sampling logic, and, like most work
sample tests, aims at psychological fidelity in representing critical aspects of the construct domain.
The second mechanism for linking an observed predictor and a criterion construct domain
is via establishing a pair of linkages, namely (a) the observed predictor–observed criterion link
(Linkage 1 in Figure 2.1) and (b) the observed criterion–criterion construct domain link (Linkage
4 in Figure 2.1). The first of these can be established empirically, as in the case of local criterion-
related evidence, or generalized or transported evidence. Critically, such evidence must be paired
with evidence that the criterion measure (e.g., ratings of job performance) can be linked to the
criterion construct domain (e.g., actual performance behaviors). Such evidence can be judgmental
(e.g., comparing criterion measure content to critical elements of the criterion construct domain
revealed through job analyses) and empirical (e.g., fitting CFA models to assess whether dimen-
sionality of the observed criterion scores is consistent with the hypothesized dimensionality of the
criterion construct domain). It commonly involves showing that the chosen criterion measures do
reflect important elements of the criterion construct domain. Observed measures may fail this test,
as in the case of a classroom instructor who grades solely on attendance when the criterion con-
struct domain is specified in terms of knowledge acquisition, or in the case of a criterion measure
for which variance is largely determined by features of the situation rather than by features under
control of the individuals.
The third mechanism also focuses on a pair of linkages, namely (a) linking the observed predic-
tor scores and the predictor construct domain (Linkage 2 in Figure 2.1) and (b) linking the predictor
construct domain and the criterion construct domain (Linkage 3 in Figure 2.1). The first linkage
involves obtaining data to support interpreting variance in predictor scores as reflecting variance in
a specific predictor construct domain. This reflects one form of what has historically been referred
to as construct validity evidence, namely, amassing theory and data to support assigning a speci-
fied construct label to test scores. For example, if a test purports to measure achievement striving,
one might offer a conceptual mapping of test content and one’s specification of the domain of
achievement striving, paired with evidence of empirical convergence with other similarly specified
measures of the construct. However, showing that the measure does reflect the construct domain is
supportive of the predictive inference only if the predictor construct domain can be linked to the
criterion construct domain. Such evidence is logical and judgmental, requiring a clear articulation
of the basis for asserting that individuals higher in the domain of achievement striving will have
higher standing on the criterion construct domain than individuals lower in achievement striving.
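Empirical convergence of the kind described above is often summarized with simple correlations among measures. The following sketch is purely illustrative (simulated scores; the scale names are hypothetical, not instruments discussed in this chapter): a new achievement-striving scale should correlate more strongly with an established measure of the same construct than with a measure of a different construct.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Latent standings on two distinct constructs (simulated; illustrative only).
achievement = rng.standard_normal(n)
sociability = rng.standard_normal(n)

# Observed scales = latent construct standing + measurement error.
new_scale = achievement + 0.6 * rng.standard_normal(n)
established_ach = achievement + 0.6 * rng.standard_normal(n)
sociability_scale = sociability + 0.6 * rng.standard_normal(n)

# Convergent evidence: high correlation with a same-construct measure.
r_convergent = np.corrcoef(new_scale, established_ach)[0, 1]
# Discriminant evidence: near-zero correlation with a different-construct measure.
r_discriminant = np.corrcoef(new_scale, sociability_scale)[0, 1]

print(f"convergent r = {r_convergent:.3f}, discriminant r = {r_discriminant:.3f}")
```

In practice such correlations would be embedded in a fuller multitrait-multimethod analysis, but the basic pattern (convergent correlations substantially exceeding discriminant ones) is what supports assigning the construct label to the scores.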
Thus, there are multiple routes to establishing the predictive inference. These are not mutually
exclusive; one may provide more than one line of evidence in support of the predictive inference. It
is also not the case that the type of measure dictates the type of evidentiary strategy chosen.
We now turn to how reliability bears on the inferences that typically fall under discussions of
validity (e.g., the inference that observed predictor scores are related to subsequent job behavior).
Historically, the idea that reliability limits or puts an upper bound on validity is grounded in
reliability theory from the early 20th century (i.e., CTT) and notions of validity that peaked in the
1950s, when criterion-related validity was the gold standard. By operationalizing validity as a
correlation between two sets of scores (each of which is assumed to conform to the CTT score
model), it is straightforward to demonstrate that “reliability” serves to put an upper bound on
“validity” (Spearman, 1904, 1910). Clearly, validity and reliability have evolved greatly since the
early 20th century; that evolution raises the question of how the idea that “reliability limits validity”
manifests itself in modern perspectives on these concepts. Here we briefly juxtapose this classical
idea against the general forms of validity evidence that might be garnered to support the inference
that applicants’ scores on a selection procedure can predict their subsequent job behavior.
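The classical result can be made concrete with a small simulation (a hypothetical sketch, not an analysis from this chapter; the reliabilities and true-score correlation are arbitrary values). Under the CTT score model, the expected observed correlation equals the true-score correlation attenuated by the square root of the product of the two reliabilities, which is also Spearman's upper bound on the observed validity coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                 # large n so sample estimates are stable
rho_true = 0.50             # correlation between true scores
rel_x, rel_y = 0.80, 0.60   # reliabilities of predictor and criterion

# Generate correlated, standardized true scores.
t = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=n)
tx, ty = t[:, 0], t[:, 1]

# Add independent error so that var(true) / var(observed) equals the reliability.
x = np.sqrt(rel_x) * tx + np.sqrt(1 - rel_x) * rng.standard_normal(n)
y = np.sqrt(rel_y) * ty + np.sqrt(1 - rel_y) * rng.standard_normal(n)

r_obs = np.corrcoef(x, y)[0, 1]
ceiling = np.sqrt(rel_x * rel_y)   # Spearman's upper bound on the observed r
expected = rho_true * ceiling      # classical attenuation formula

print(f"observed r = {r_obs:.3f}, expected = {expected:.3f}, ceiling = {ceiling:.3f}")
```

Even with a substantial true-score correlation of .50, the observed correlation here cannot exceed the ceiling of about .69 and is expected to land near .35, which is the sense in which "reliability limits validity" in the classical framework.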
Recall from the previous section that one mechanism for establishing evidence of a link between
observed predictor scores and subsequent job behavior involves demonstrating that predictor mea-
sures reflect a representative sampling of actual job behaviors (Linkage 5 in Figure 2.1). Typically,
such behaviors are identified through job analyses that may involve panels of job experts, surveys of
job incumbents, and so forth. Given the many judgments involved in the job analysis process, its
results are prone to the same types of errors that concern us in investigations of reliability (e.g.,
inconsistency among raters’ ratings of job tasks, inconsistency among questions designed to assess
task criticality). To the extent that investigators recognize such inconsistencies as error, and to the
extent that such errors manifest themselves in job analysis results, the investigators’ argument that
predictor measure content is indicative of actual behavior on the job would be weakened. Note that
although this example draws on issues related to reliability, the focus shifts from the reliability of a
test score (the historic focus of the perspective that reliability puts an upper bound on validity) to
the reliability of the processes used to develop predictor measures.
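The consistency of such job-analytic judgments can be examined with standard interrater indices. The sketch below uses hypothetical ratings (illustrative values only) and estimates one-way intraclass correlations for a panel of experts rating task criticality, both for a single rater and for the mean across the panel.

```python
import numpy as np

# Hypothetical data: 6 job experts rate the criticality of 8 tasks (1-5 scale).
# Rows = tasks, columns = raters. Values are illustrative only.
ratings = np.array([
    [5, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [4, 4, 3, 4, 4, 5],
    [1, 2, 1, 1, 2, 1],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [2, 2, 3, 2, 2, 2],
    [4, 3, 4, 4, 3, 4],
], dtype=float)

n_tasks, k = ratings.shape
task_means = ratings.mean(axis=1)
grand_mean = ratings.mean()

# One-way ANOVA mean squares (tasks as the random targets of measurement).
ms_between = k * np.sum((task_means - grand_mean) ** 2) / (n_tasks - 1)
ms_within = np.sum((ratings - task_means[:, None]) ** 2) / (n_tasks * (k - 1))

# ICC(1): reliability of a single rater; ICC(k): reliability of the k-rater mean.
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
icc_k = (ms_between - ms_within) / ms_between

print(f"ICC(1) = {icc1:.2f}, ICC(k) = {icc_k:.2f}")
```

Low values on such indices would signal exactly the kind of inconsistency that weakens content-based arguments, because the task list feeding predictor development would itself rest on unstable judgments.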
A second mechanism for establishing evidence of a link between observed predictor scores and
subsequent job behavior involves demonstrating a link between (a) observed predictor and crite-
rion scores (Linkage 1 in Figure 2.1) and (b) observed criterion scores and the criterion construct
domain (Linkage 4 in Figure 2.1). As noted earlier, arguments for the veracity of the first link are
often based on local criterion-related validity studies or synthetic estimates (e.g., meta-analytic esti-
mates, job component validity). Here, the relation between reliability and validity can become quite
complex, involving subtleties that are often masked in CTT-based discussions of reliability’s effect
on criterion-related validity estimates. For example, if the predictor and criterion being examined
share some element of the measurement design (e.g., they are assessed by common raters or on a
common occasion), then there will be competing biases on the observed validity estimate. All else
being equal, unreliability in the criterion will tend to bias the validity estimate downward, whereas
covariance arising from common elements of the measurement design (e.g., raters, occasions) will
tend to bias it upward (Viswesvaran et al., 2005). In this situation, error (in the reliability sense)
simultaneously biases observed predictor-criterion correlations downward and upward, weakening
one’s arguments because of the uncertainty it creates regarding the predictor-criterion relationship.
Investigators’ arguments
for the veracity of a link between observed criterion scores and the criterion construct domain may
be weakened by (a) unreliability in the job analytic judgments that gave rise to the content of the
observed criterion measure (per the example from the previous paragraph) and (b) unreliability of
the observed criterion scores themselves (e.g., inconsistency in raters’ performance ratings).
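These competing biases can be illustrated with a small simulation (hypothetical variance components and sample size; a sketch rather than a reanalysis of any cited study). Criterion unreliability pulls the observed validity coefficient below its true value, whereas a rater/method component shared by predictor and criterion pushes it back up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000
rho_true = 0.40   # true predictor-criterion correlation

t = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=n)
tx, ty = t[:, 0], t[:, 1]
method = rng.standard_normal(n)   # e.g., halo shared when one rater scores both

# Both measures have reliability .70. For y_same, part of the non-true variance
# is the shared rater/method component rather than purely random error.
x = np.sqrt(.70) * tx + np.sqrt(.10) * method + np.sqrt(.20) * rng.standard_normal(n)
y_diff = np.sqrt(.70) * ty + np.sqrt(.30) * rng.standard_normal(n)
y_same = np.sqrt(.70) * ty + np.sqrt(.10) * method + np.sqrt(.20) * rng.standard_normal(n)

r_diff = np.corrcoef(x, y_diff)[0, 1]   # attenuation only
r_same = np.corrcoef(x, y_same)[0, 1]   # attenuation plus method inflation

print(f"true rho = {rho_true:.2f}; independent sources r = {r_diff:.3f}; "
      f"shared source r = {r_same:.3f}")
```

With these illustrative values, both observed correlations fall short of the true correlation, but the shared-source design recovers a larger coefficient than the independent-source design, and without knowing the variance components one cannot tell how much of that difference reflects substance versus method.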
Lastly, a third mechanism for establishing evidence of a link between observed predictor scores
and subsequent job behavior involves demonstrating a link between (a) observed predictor scores
and the predictor construct domain (Linkage 2 in Figure 2.1) and (b) the predictor and criterion con-
struct domains (Linkage 3 in Figure 2.1). Although we do not typically correct for predictor unreli-
ability in criterion-related validity studies, if we concern ourselves with the link between observed
predictor scores and the theoretical domain that the predictor measure’s content is believed to
represent, then unreliability in the predictor scores can serve to weaken arguments for that link.
The link between predictor and criterion construct domains is established largely on the basis of
theory. At the level of theory, we tend not to concern ourselves with properties of observed scores;
as such, the idea that reliability limits validity does not directly apply. Nevertheless, we arrive at
defensible theory, or at least provide evidence for the relationships that serve as its key tenets, by
empirical means. Unreliability in scores produced by measures of key constructs within a theory can weaken
theory-based arguments by introducing uncertainty into the substantive nature of key relationships
underpinning the theory. For example, it is fairly well recognized that unreliability can have a
deleterious effect on investigators’ ability to detect nonlinear relationships, interaction effects (e.g.,
moderating effects), and increments in criterion-related validity. Theory-based arguments that rest
on the assumption that such effects do not exist may be weakened to the extent that the bodies of
literature on which those arguments are based relied on unreliable measures (e.g., ratings) of key
constructs (e.g., job performance). Null findings may not indicate that such effects fail to exist in
reality, but simply that there is too much “noise in the data” to detect them. The net result is that
we may build up a base of empirical evidence that “consistently” finds no such effects, which may
lead one to falsely conclude that they do not exist (i.e., the scientific allure of consistency in findings
across studies may bias one toward accepting the null findings). This example demonstrates the
subtle way in which a lack of reliability may weaken theory-based arguments for the link between
observed predictor scores and subsequent job behavior.
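A minimal simulation (hypothetical data and arbitrary effect sizes, not drawn from this chapter) shows how measurement error in the predictors attenuates an interaction that truly exists, illustrating why unreliable measures can yield "consistent" null findings for moderating effects.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
# The criterion is generated with a genuine interaction effect of 0.30.
y = x1 + x2 + 0.30 * x1 * x2 + rng.standard_normal(n)

def interaction_beta(rel):
    """OLS interaction estimate when x1 and x2 are measured with reliability rel."""
    e1 = np.sqrt((1 - rel) / rel) * rng.standard_normal(n)
    e2 = np.sqrt((1 - rel) / rel) * rng.standard_normal(n)
    o1, o2 = x1 + e1, x2 + e2
    X = np.column_stack([np.ones(n), o1, o2, o1 * o2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]   # coefficient on the product term

for rel in (1.0, 0.8, 0.6):
    print(f"reliability {rel:.1f}: interaction estimate = {interaction_beta(rel):.3f}")
```

Because the product term inherits error from both components, its effective reliability is roughly the product of the two predictors' reliabilities, so the estimated interaction shrinks sharply as measurement quality declines even though the population effect is constant.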
Such unintended consequences do not threaten the validity of inferences that can be drawn from
scores on an integrity test, because the consequences are not a function of the test itself.
Summary
In conclusion, we have attempted to develop seven major points about validity. These are that (a) we
have moved far beyond early conceptualizations of validity as the correlation between test scores
and criterion measures; (b) validity is not a characteristic of a test, but rather refers to inferences
made from test scores; (c) we have moved from conceptualizing different types of validity to a
perspective that there are different types of validity evidence, all of which contribute to an under-
standing of the meaning of test scores; (d) the key inference to be supported in employment settings
is the predictive inference, namely, that inferences about future job behavior can be drawn from
test scores; (e) there are multiple routes to gathering evidence to support the predictive inference;
(f) unreliability may weaken arguments for the validity of inferences, but it does so in ways more
complex than the traditional notion that reliability puts a specifiable upper bound on validity; and (g)
although evidence about unintended negative consequences of test use (e.g., negative applicant reac-
tions to the test) may affect a policy decision as to whether or not to use the test, such evidence is not
a threat to the predictive inference and does not affect judgments about the validity of the test. Our
belief is that a clear understanding of these foundational issues in validity is essential for effective
research and practice in the selection arena.
References
Aguinis, H., Henle, C. A., & Ostroff, C. (2001). Measurement in work and organizational psychology. In
N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, and
organizational psychology: Volume 1: Personnel psychology (pp. 27–50). London, England: Sage.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Graduate training in statistics, measurement, and methodol-
ogy in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD
programs in North America. American Psychologist, 63, 32–50.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
American Psychological Association (joint committee). (1954). Technical recommendations for psychological
tests and diagnostic techniques. Psychological Bulletin, 51, 201–238.
Bell, J. F. (1985). Generalizability theory: The software problem. Journal of Educational and Behavioral
Statistics, 10, 19–29.
Binning, J. F., & Barrett, G.V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Bliese, P. D. (2002). Multilevel random coefficient modeling in organizational research. In F. Drasgow &
N. Schmitt (Eds.), Measuring and analyzing behavior in organizations (pp. 401–445). San Francisco,
CA: Jossey-Bass.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440.
Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on
Schmidt and Hunter. Intelligence, 30, 505–514.
Bost, J. E. (1995). The effects of correlated errors on generalizability and dependability coefficients. Applied
Psychological Measurement, 19, 191–203.
Brennan, R. L. (1998). Raw-score conditional standard errors of measurement in generalizability theory.
Applied Psychological Measurement, 22, 307–331.
Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications.
Journal of Educational Measurement, 38, 295–317.
Brennan, R. L. (2001b). Generalizability theory. New York, NY: Springer-Verlag.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan
(Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: American Council on Education and
Praeger.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford Press.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of
Psychology, 3, 296–322.
Bryk, A. S., & Raudenbush, S.W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.
Campbell, J. P. (1976). Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organiza-
tional psychology (pp. 185–222). New York: John Wiley & Sons.
Cattell, R. B. (1966). The data box: Its ordering of total resources in terms of possible relational systems. In
R. B. Cattell (Ed.), Handbook of multivariate experimental psychology (pp. 67–128). Chicago, IL: Rand
McNally.
Cronbach, L. J. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12, 1–16.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 292–334.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (pp. 221–237).
Washington, DC: American Council on Education.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measure-
ments: Theory of generalizability for scores and profiles. New York, NY: John Wiley.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for perfor-
mance assessments of student achievement or school effectiveness. Educational and Psychological
Measurement, 57, 373–399.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281–300.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. (1963). Theory of generalizability: A liberalization of reliability
theory. British Journal of Statistical Psychology, 16(2), 137–163.
Cronbach, L. J., Schoneman, P., & McKie, D. (1965). Alpha coefficient for stratified-parallel tests. Educational
& Psychological Measurement, 25, 291–312.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and its successor proce-
dures. Educational and Psychological Measurement, 64, 391–418.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington
DC: American Council on Education.
DeShon, R. P. (1995, May). Restricted maximum likelihood estimation of variance components in generaliz-
ability theory: Overcoming balanced design requirements. Paper presented at the 10th annual conference
of the Society of Industrial and Organizational Psychology, Orlando, FL.
DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models.
Psychological Methods, 3, 412–423.
DeShon, R. P. (2002). Generalizability theory. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing
behavior in organizations (pp. 189–220). San Francisco, CA: Jossey-Bass.
Dunnette, M. D. (1966). Personnel selection and placement. Oxford, England: Wadsworth.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424.
Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific
method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological
Methods, 8, 38–60.
Englehard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted
Rasch model. Journal of Educational Measurement, 31, 93–112.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.,
pp. 105–146). New York: American Council on Education and Macmillan.
Fisher, R.A. (1925). Statistical methods for research workers (1st ed.). Edinburgh, Scotland: Oliver and Boyd.
Ghiselli, E. E. (1964). Theory of psychological measurement. New York: McGraw-Hill.
Green, S. B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88–101.
Green, S. B., & Hershberger, S. L. (2000). Correlated errors in true score models and their effect on coefficient
alpha. Structural Equation Modeling, 7, 251–270.
Greguras, G. J., & Robie, C. (1998). A new look at within-source interrater reliability of 360 degree feedback
ratings. Journal of Applied Psychology, 83, 960–968.
Guion, R. M. (1974). Open a new window: Validities and values in psychological measurement. American
Psychologist, 29, 287–296.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.
Guttman, L. A. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.
Reliability and Validity 47
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110).
Westport, CT: American Council on Education and Praeger.
Hocking, R. R. (1995). Variance component estimation in mixed linear models. In L. K. Edwards (Ed.), Applied
analysis of variance in behavioral science (pp. 541–571). New York, NY: Marcel Dekker.
Hox, J. J. (1994). Hierarchical regression models for interviewer and respondent effects. Sociological Methods
and Research, 22, 300–318.
Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6, 153–160.
Jackson, S., & Brashers, D. E. (1994). Random factors in ANOVA. Thousand Oaks, CA: Sage.
Joreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133.
Joreskog, K. G., & Sorbom, D. (2001). LISREL 8: User’s reference guide. Lincolnwood, IL: Scientific
Software International.
Kamata, A., Turhan, A., & Darandari, E. (2003, April). Estimating reliability for multidimensional composite
scale scores. Paper presented at the annual meeting of the American Educational Research Association,
Chicago, IL.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport,
CT: American Council on Education and Praeger.
Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence assumption in analysis of
variance. Psychological Bulletin, 99, 422–431.
Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor
analysis. Psychological Bulletin, 112, 165–172.
Komaroff, E. (1997). Effect of simultaneous violations of essential tau-equivalence and uncorrelated errors on
coefficient alpha. Applied Psychological Measurement, 21, 337–348.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2,
151–160.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: The University of Chicago Press.
Laenen, A., Vangeneugden, T., Geys, H., & Molenberghs, G. (2006). Generalized reliability estimation using
repeated measurements. British Journal of Mathematical and Statistical Psychology, 59, 113–131.
LaHuis, D. M., & Avis, J. M. (2007). Using multilevel random coefficient modeling to investigate rater effects
in performance ratings. Organizational Research Methods, 10, 97–107.
Lance, C. L. (2008). Why assessment centers don’t work the way they’re supposed to. Industrial and
Organizational Psychology, 1, 84–97.
Lance, C. L., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimen-
sion and exercise variance components in assessment center postexercise dimension ratings. Journal of
Applied Psychology, 89, 377–385.
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist,
41, 1183–1192.
Le, H., & Putka, D. J. (2007). Reliability. In S. G. Rogelberg (Ed.), Encyclopedia of industrial and organiza-
tional psychology (Vol. 2, pp. 675–678). Thousand Oaks, CA: Sage.
Le, H., Schmidt, F. L., & Putka, D. J. (2009). The multifaceted nature of measurement artifacts and its implica-
tions for estimating construct-level relationships. Organizational Research Methods, 12, 165–200.
Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models
for analyzing scores from tests comprised of testlets. Educational and Psychological Measurement, 61,
958–975.
Lee, G., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores
composed of testlets. Applied Measurement in Education, 12, 237–255.
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary,
NC: SAS Institute.
Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph No. 9]. Psychological
Reports, 3, 635–694.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lumsden, J. (1976). Test theory. Annual Review of Psychology, 27, 251–280.
Marcoulides, G. A. (1990). An alternative method for estimating variance components in generalizability
theory. Psychological Reports, 66, 102–109.
Marcoulides, G. A. (1996). Estimating variance components in generalizability theory: The covariance struc-
ture analysis approach. Structural Equation Modeling, 3, 290–299.
Marsh, H. W., & Grayson, D. (1994). Longitudinal confirmatory factor analysis: Common, time-specific,
item-specific, and residual-error components of variance. Structural Equation Modeling, 1, 116–146.
Maxwell, A. E. (1968). The effect of correlated error on reliability coefficients. Educational and Psychological
Measurement, 28, 803–811.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients.
Psychological Methods, 1, 30–46.
McPhail, S. M. (Ed.) (2007). Alternative validation strategies: Developing new and leveraging existing validity
evidence. San Francisco, CA: John Wiley and Sons.
Merenda, P. F. (2007). Psychometrics and psychometricians in the 20th and 21st centuries: How it was in the
20th century and how it is now. Perceptual and Motor Skills, 104, 3–20.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York:
American Council on Education and Macmillan.
Mosier, C. I. (1943). On the reliability of a weighted composite. Psychometrika, 8, 161–168.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait rater and level effects
in 360-degree performance ratings. Personnel Psychology, 51, 557–576.
Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability of job performance
ratings. Personnel Psychology, 53, 873–900.
Ng, K. T. (1974). Spearman’s test score model: A restatement. Educational and Psychological Measurement,
34, 487–498.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements.
Psychometrika, 32, 1–13.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated
test items and its application to large-scale educational assessment data. Journal of Educational and
Behavioral Statistics, 27, 341–384.
Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary practice and theory
(3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Putka, D. J., Ingerick, M., & McCloy, R. A. (2008a, April). Integrating reliability and validity-based perspec-
tives on error in performance ratings. Presentation in J. Cortina (Chair), Write, for these words are true:
Uncovering complexity in I/O. Symposium conducted at the 23rd Annual Society for Industrial and
Organizational Psychology Conference, San Francisco, CA.
Putka, D. J., Ingerick, M., & McCloy, R. A. (2008b). Integrating traditional perspectives on error in ratings:
Capitalizing on advances in mixed effects modeling. Industrial and Organizational Psychology, 1,
167–173.
Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organiza-
tional research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93,
959–981.
Putka, D. J., & McCloy, R. A. (2008, February). Estimating variance components in SPSS and SAS: An anno-
tated reference guide. Retrieved March 23, 2009, from http://www.humrro.org/djp_archive/Estimating
Variance Components in SPSS and SAS.pdf
Qualls-Payne, A. L. (1992). A comparison of score level estimates of the standard error of measurement.
Journal of Educational Measurement, 29, 213–225.
Rae, G. (2007). A note on using stratified alpha to estimate the composite reliability of a test composed of inter-
related nonhomogeneous items. Psychological Methods, 12, 177–184.
Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous items.
Applied Psychological Measurement, 22, 375–385.
Raykov, T. (2001a). Bias of coefficient α for fixed congeneric measures with correlated errors. Applied
Psychological Measurement, 25, 69–76.
Raykov, T. (2001b). Estimation of congeneric scale reliability using covariance structure analysis with nonlin-
ear constraints. British Journal of Mathematical and Statistical Psychology, 54, 315–323.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and interval estimation
using a structural equation modeling approach. Structural Equation Modeling, 9, 195–212.
Reuterberg, S. E., & Gustafsson, J. E. (1992). Confirmatory factor analysis and reliability: Testing measure-
ment model assumptions. Educational and Psychological Measurement, 52, 795–811.
Rogers, W. M., Schmitt, N., & Mullins, M. E. (2002). Correction for unreliability of multifactor measures:
Comparison of alpha and parallel forms of approaches. Organizational Research Methods, 5, 184–199.
Rozeboom, W. W. (1966). Foundations of the theory of prediction. Homewood, IL: Dorsey.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research
scenarios. Psychological Methods, 1, 199–223.
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different
sources of measurement error on reliability estimates for measures of individual-differences constructs.
Psychological Methods, 8, 206–224.
Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability.
Personnel Psychology, 53, 901–912.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Schmitt, N., & Landy, F. J. (1993). The concept of validity. In N. Schmitt, W.C. Borman & Associates (Eds.),
Personnel selection in organizations (pp. 275–309). San Francisco, CA: Jossey-Bass.
Scullen, S. E. (1999). Using confirmatory factor analyses of correlated uniquenesses to estimate method vari-
ance in multitrait-multimethod matrices. Organizational Research Methods, 2, 275–292.
Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings.
Journal of Applied Psychology, 85, 956–970.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York, NY: Wiley.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shrout, P. E. (1993). Analyzing consensus in personality judgments: A variance components approach. Journal
of Personality, 61, 769–788.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological
Bulletin, 86, 420–428.
Smith, P. L., & Luecht, R. M. (1992). Correlated effects in generalizability studies. Applied Psychological
Measurement, 36, 229–235.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of
Psychology, 15, 72–101.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
SPSS. (2005). Linear mixed-effect modeling in SPSS: An introduction to the mixed procedure (Technical Report
LMEMWP-0305). Chicago, IL: Author.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560–620).
Washington, DC: American Council on Education.
Tryon, R. C. (1957). Reliability and behavior domain validity: Reformulation and historical critique.
Psychological Bulletin, 54, 229–249.
Van Iddekinge, C. H., Putka, D. J., Raymark, P. H., & Eidson, C. E., Jr. (2005). Modeling error variance in
job specification ratings: The influence of rater, job, and organization-level factors. Journal of Applied
Psychology, 90, 323–334.
Van Iddekinge, C. H., Raymark, P. H., Eidson, C. E., Jr., & Attenweiler, W. J. (2004). What do structured
selection interviews really measure? The construct validity of behavior description interviews. Human
Performance, 17, 71–93.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2005). Is there a general factor in ratings of job performance?
A meta-analytic framework for disentangling substantive and error influences. Journal of Applied
Psychology, 90, 108–131.
Webb, N. M., & Shavelson, R. J. (1981). Multivariate generalizability of general educational development rat-
ings. Journal of Educational Measurement, 18, 13–22.
Werts, C. E., Linn, R. L., & Joreskog, K. G. (1974). Intraclass reliability estimates: Testing structural assump-
tions. Educational and Psychological Measurement, 34, 25–32.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data.
Applied Psychological Measurement, 9, 1–26.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measure-
ment (4th ed., pp. 111–153). Westport, CT: American Council on Education and Praeger.
Zimmerman, D. W., & Williams, R. H. (1980). Is classical test theory “robust” under violation of the assump-
tion of uncorrelated errors? Canadian Journal of Psychology, 34, 227–237.
Zimmerman, D. W., & Williams, R. H. (1997). Properties of the Spearman correction for attenuation for normal
and realistic non-normal distributions. Applied Psychological Measurement, 21, 253–279.
Zimmerman, D. W., Zumbo, B. D., & LaLonde, C. (1993). Coefficient alpha as an estimate of test reliability
under violation of two assumptions. Educational and Psychological Measurement, 53, 33–49.
3 Validation Strategies
for Primary Studies
NATURE OF VALIDITY
Most early applications of tests as decision-making tools in the selection of personnel in
work organizations involved a validation model in which the scores on tests were correlated with
some measure or rating of job performance, such as the studies of salespersons by Scott (1915) and
streetcar motormen by Thorndike (1911). This view of validity was reinforced in books by Hull
(1928) and Viteles (1932). Subsequent reviews by Ghiselli (1966, 1973) were similarly focused on
what was by then known as criterion-related validity.
During this time, there was a growing recognition that the use of tests could and should be justified
on other logical grounds as well. Standards for Educational and Psychological Tests (American Psychological
Association [APA], American Educational Research Association [AERA], and National Council
on Measurement in Education [NCME], 1954) identified four aspects of validity evidence: content,
predictive, concurrent, and construct. With time, the predictive and concurrent aspects of validity
became seen as simply different research designs, the purpose of which was to establish a predictor-
criterion relationship; hence, they became known as criterion-related validity. Content and construct
validation were seen as alternate methods by which one could validate and defend the use of test
scores in decision-making. A much broader view of the nature of validity is accepted today, and in
general it is seen as the degree to which the inferences we draw from a set of test scores about job
performance are accurate.
In addition, there should be support for the notion that the construct is central to the performance
of job tasks.
This separation of approaches to validity produced numerous problems, not the least of which
was the notion that there were times when one approach was to be preferred over another or that
there were different acceptable standards by which these different aspects of validity were to be
judged. Most important, however, was the realization on the part of measurement scholars that all
were aspects of construct validity—the theoretical reasonableness of our explanations of job behavior. There was a realization that the inferences we derive from test scores are central to all validation work. Content validity and the practices usually associated with it were recognized as desirable
practices in the development of any test. Careful consideration of the “theory” and hypotheses that
underlie our conceptualization of performance and how the constructs central to job performance
are represented in our tests is always important and unifying insofar as our validation efforts are
concerned. Traditional criterion-related research represents one type of evidence that can be col-
lected to confirm/disconfirm these hypotheses. This “unitarian” approach to validity was strongly
argued in the 1985 Standards and has been incorporated most completely in the 1999 version of the
Standards (AERA, APA, & NCME, 1999).
1. Job analysis: The foundation of validation in employment settings always involves the
development of a clear understanding of job and organizational requirements. For exam-
ple, for promotion purposes these would be the requirements of the target job(s) into which
a person might be promoted. For training and development purposes, these would be the
meaningful outcomes in terms of on-the-job performance that are the focus of the training/
development efforts.
2. Systematic development: As measures are developed, they need to follow an architecture
that is firmly grounded in the results of the job analysis. As the development of the mea-
sures is planned and as the tools are being constructed, activities need to be focused on
ensuring that the measures are carefully targeted to address the intended constructs.
3. Independent verification: Once the measures are developed, they need to be subjected
to independent verification that they measure the intended constructs. At times, this can
involve statistical studies to determine whether the measures exhibit expected relation-
ships with other independent measures (e.g., Does the 360-degree assessment of leader-
ship behavior correlate with an independent interview panel’s judgment of a leader’s
behavior?). Often, the independent verification is derived from structured expert reviews
of the measures that are conducted prior to implementation. Regardless of the method, this
“independent verification” is a necessary aspect of verifying the validity of a measure.
approach. Rather, all validation studies and selection practices should aspire to the ethical and
professional guidelines offered in the Principles, which means using sound methods rooted in sci-
entific evidence and exhibiting high standards of quality. However, the Principles’ guidelines are
not formulaic to the exclusion of professional judgment, nor are their applications invariant across
circumstances. In the following paragraphs, several factors are identified that have potential impli-
cations for the strength of the evidence needed by a local validation effort. In general, these factors
represent situations in which conclusions regarding the validity of a selection practice need to be
made with a higher degree of confidence than usual. In turn, these situations tend to warrant intensi-
fied research strategies.
of a central focus of the validation effort. It is our contention that validating selection systems for high-risk occupations is a special circumstance warranting particularly “watertight” validation strategies in which strong evidence should be sought to support the inferences made. In these circumstances, even a test with low validity (e.g., less than r = .10) might be used to make hiring decisions if that relationship is found to be statistically significant.
In some circumstances, the cost of false negatives is more salient. For example, strong evidence
of a test’s validity may be warranted when an organization needs to fill a position or several posi-
tions, but applicants’ test scores are below some acceptable standard, indicating that they are not fit
to hire (i.e., predicted on-the-job performance is low or very low). In this case, the organization’s
decision to reject an applicant on the basis of their test scores would leave a position or several
positions within the organization vacant, a costly mistake in the event that false-negative errors
are present. Demonstrable evidence to support the test’s validity would be needed to justify such a
decision, and in essence, convince the organization that they are better off with a vacant position
than putting the wrong person in the job. In these instances, one might want evidence of a larger
test-criterion relationship (perhaps greater than r = .30) to warrant use of this test and the possible
rejection of competent applicants.
The possibility of false negatives becomes a special concern when members of some subgroup(s)
are selected less frequently than members of another subgroup. When unequal ratios of various
subgroups are selected, the organization must be prepared to show that false negatives are not concentrated primarily in one group as opposed to another. When this is impossible, the legal and social costs
can be very high. Concern about these costs is another reason to be concerned about the false
negatives aside from the concerns associated with possible vacant positions.
Personnel psychologists have long been aware that the utility of selection systems increases as a function of selectivity, such that selection instruments even modestly related to important outcomes can have large payoffs when there are many applicants from which only a few are to
be selected (Brogden, 1951, 1959). On the other hand, as selection ratios become extremely liberal,
such that nearly all applicants are accepted, even selection instruments highly related to perfor-
mance have less positive implications for utility. From a purely utilitarian perspective, it would
seem logical that demonstrating test validity is less of an impetus when selection ratios are liberal
(because even the best tests will have little effect) and more of an impetus when selection ratios are
low. The consequences of large-scale applicant rejections would also seem to justify more rigorous
validation methods from societal and legal perspectives. However, these extremes likely occur less
frequently in practice, as indicated by Schmidt and Hunter (1998), who cite typical selection ratios
that are moderate, ranging on average from .30 to .70.
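Brogden's point about selectivity can be illustrated with a short Python sketch (a hypothetical computation, not from the chapter): under strict top-down selection with a normally distributed predictor, the expected standardized criterion score of those hired equals the validity times the normal ordinate at the cutoff divided by the selection ratio.

```python
from statistics import NormalDist

def expected_criterion_gain(validity, selection_ratio):
    """Mean standardized criterion score of those hired under strict
    top-down selection: validity * phi(z_cut) / SR, where z_cut is
    the predictor cutoff implied by the selection ratio."""
    z_cut = NormalDist().inv_cdf(1 - selection_ratio)
    return validity * NormalDist().pdf(z_cut) / selection_ratio

# The same modest validity pays off far more when selection is stringent
stringent = expected_criterion_gain(0.30, 0.10)  # hire top 10%
liberal = expected_criterion_gain(0.30, 0.90)    # hire top 90%
```

With a validity of .30, hiring the top 10% of applicants yields an expected criterion gain several times larger than hiring the top 90%, which is why the impetus to demonstrate validity is strongest when selection ratios are low.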
In licensing examinations, this utility perspective takes a different form because the major pur-
pose of these examinations is to protect the public from “injuries” related to incompetent practice.
In this case, the license/no-license decision point using test scores is usually set at a point that is
judged to indicate “minimal competence.” Depending on the service provided (e.g., hairdresser vs.
surgeon), the cost of inappropriately licensing a person could be very different. On the other hand,
certification examinations are usually oriented toward the identification of some special expertise in
an area (e.g., pediatric dentistry or forensic photography), hence a decision as to a score that would
warrant certification might result in the rejection of larger numbers or proportions of examinees.
The cost-benefit balance in this situation (assuming all are minimally competent) might accrue
mostly to the individual receiving the certification in the form of greater earning power.
Another factor that can affect the extent of the local validation effort that is required is the avail-
ability of existing validation research. The Principles describes three related validation strategies
that can be used as alternatives to conducting traditional local validation studies or to support the
conclusions drawn at the primary study level. First, “transportability” of validity evidence involves
applying validity evidence from one selection scenario to another, on the basis that the two con-
texts are judged to be sufficiently similar. Specifically, the Principles note that researchers should
be concerned with assessing similarity in terms of job characteristics [e.g., the knowledge, skills,
and abilities (or KSAs) needed to perform the job in each context], job tasks and content, applicant
pool characteristics, or other factors that would limit generalizability across the two contexts (e.g.,
cultural differences). Assessing similarity in this manner usually requires that researchers conduct
a job analysis or rely on existing job analysis materials combined with their own professional exper-
tise and sound judgment.
Second, synthetic validity is a process in which validity for a test battery is “synthesized” from
evidence of multiple predictor-job component relationships (Peterson, Wise, Arabian, & Hoffman,
2001; Scherbaum, 2005). Job analysis is used to understand the various components that comprise
a particular job, and then predictor-job-component relationships are collected for all available jobs
with shared components. Because evidence can be drawn from other jobs besides the focal job, syn-
thetic validity may be a particularly useful strategy for organizations that have too few incumbents
performing the focal job to reach adequate sample sizes for a traditional criterion-related study
(Scherbaum, 2005).
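The synthetic logic can be sketched in a few lines of Python; the job components, importance weights, and component validities below are invented for illustration, and real synthetic validity methods (e.g., the J-coefficient) are more elaborate than this simple weighted combination:

```python
# Invented job components, importance weights (from job analysis), and
# predictor-component validities (gathered across jobs that share each
# component, not just the focal job).
component_importance = {
    "troubleshooting": 0.5,
    "customer_contact": 0.3,
    "record_keeping": 0.2,
}
predictor_component_validity = {
    "troubleshooting": 0.35,
    "customer_contact": 0.20,
    "record_keeping": 0.15,
}

# Crude synthesis: importance-weighted combination of component validities
synthetic_estimate = sum(
    w * predictor_component_validity[c] for c, w in component_importance.items()
)
```

The key design feature is that each component validity can be estimated on samples pooled across jobs, which is what makes the approach workable for small-N focal jobs.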
Third, validity generalization involves using meta-analytic findings to support the conclusion
that predictor-criterion validity evidence can be generalized across situations. Like transportability
strategies, meta-analytic findings provide researchers with outside evidence to support inferences
in a local context. The argument for validity generalization on the basis of meta-analyses is that
some selection tests, such as cognitive ability tests (Ones, Viswesvaran, & Dilchert, 2005), are valid
across selection contexts. Thus, the implication is that with validity generalization strategies, unlike
transportability, in-depth job analyses or qualitative studies of the local organizational context are
unnecessary. In support of this assertion, the program of research initiated by Schmidt and Hunter
and colleagues (for review, see Schmidt & Hunter, 2003) has argued that between-study variability
in validity coefficients can be largely attributed to statistical artifacts, such as range restriction,
unreliability, or sampling error. However, caution is warranted to the extent that meta-analyses have
identified substantive moderators, or in the presence of strong theory indicating that some variable
may moderate the magnitude of validity. Further, with regard to generalization across contexts,
inferences drawn from meta-analytic findings are limited to the contexts of those studies included
in the meta-analysis. At minimum, meta-analytic findings should be referenced in test development
and can be used to supplement evidence at the local level, either via theoretical or statistical means
(Newman, Jacobs, & Bartram, 2007). The argument for more direct use of validity generalization
strategies is dependent on the strength of the meta-analytic findings.
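The core of the Schmidt-Hunter argument, that observed variability in validities may be largely sampling error, can be illustrated with a "bare-bones" aggregation sketch in Python (the coefficients and sample sizes are invented, and the sampling-error term is a common approximation):

```python
def bare_bones_meta(rs, ns):
    """Sample-size-weighted mean validity and the residual variance
    left after subtracting the variance expected from sampling error
    alone; a large residual hints at substantive moderators."""
    total_n = sum(ns)
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    var_error = (1 - mean_r ** 2) ** 2 * len(ns) / total_n  # approximation
    return mean_r, max(var_obs - var_error, 0.0)

# Three hypothetical local studies of the same test
mean_r, residual_var = bare_bones_meta([0.20, 0.30, 0.25], [100, 200, 100])
```

Here the residual variance is zero: all of the between-study spread is attributable to sampling error, the situation in which generalization is most defensible. A large residual would instead point toward moderators and caution against direct generalization.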
Others may be theoretically or rationally justified (e.g., construct to construct linkages) or tested
using latent variable models, although these applications are relatively rare in personnel selec-
tion research (see Campbell, McHenry, & Wise, 1990, for an attempt to model job performance).
Consistent with the unitarian conceptualization of validity, all validity efforts in a selection con-
text are ultimately concerned with demonstrating that test scores predict future job performance,
and each of the various inferential pathways represents sources or types of evidence to support
this common inference. Binning and Barrett (1989, p. 482) described how “truncated” validation
strategies often concentrate exclusively on demonstrating evidence for a single inferential path-
way and as a result provide only partial support for conclusions regarding test validity. A more
cogent argument for validity is built upon demonstration of strong evidence for several inferential
pathways within the nomological network. For example, in addition to demonstrating a statisti-
cal relationship between operational measures from the predictor and performance domain, as
is commonly the main objective in criterion-related validity studies, researchers should provide
evidence of the psychological constructs underlying job performance (as well as the predictor
measures) and demonstrate that the criterion measure adequately samples constructs from the
performance domain.
Criterion Concerns
In practice, criterion-related validity studies are often criticized for failing to adequately address
validity issues surrounding the criterion measure(s) used. The relative lack of scientific scrutiny
focused on criteria, termed the “criterion problem” (Austin & Villanova, 1992), has been a topic
of discussion among personnel psychologists for years (Dunnette, 1963; Fiske, 1951; Guion, 1961).
Universal to these discussions is the call for more rigorous validation evidence with respect to
the criteria that are used. Binning and Barrett (1989) outlined this task, underscoring two inter-
related goals for the validation researcher. First, they suggested that the selection of criteria should
be rooted in job analysis to the same extent that the selection of predictors traditionally is (i.e.,
more attention to rigorous “criterion development”). Other considerations relevant to the task of
criterion development and validation include the use of “hard” or objective criteria versus more
proximal behaviors that lead to these outcomes (Thayer, 1992), use of multiple relevant crite-
ria as opposed to a single overall criterion (Dunnette, 1963), and the possibility that criteria are
dynamic (Barrett, Caldwell, & Alexander, 1985). Second, researchers should be concerned with
demonstrating evidence of construct-related validity for the criterion. Investigators must specify
the latent dimensions that underlie the content of their criterion measures. This involves expansion
of the nomological network to include inferences that link the criterion measure(s) to constructs in
the performance domain (e.g., by demonstrating that criterion measures are neither contaminated
nor deficient with respect to their coverage of the intended constructs in the performance domain)
and link constructs in the performance domain to job demands that require specific ability or moti-
vational constructs (e.g., by demonstrating through job analysis that constructs in the performance
domain are organizationally meaningful). Campbell and his colleagues (e.g., Campbell, McCloy, Oppler, & Sager, 1993) have repeatedly emphasized this point about the nature of criteria or performance constructs. These authors make the somewhat obvious, although often overlooked, point
that performance should be defined as behavior (“what people actually do and can be observed”);
the products of one’s behavior, or what are often called “hard criteria,” reflect not only one’s behavior but also other factors that are not attributable to an individual job incumbent.
Further, we may consider relatively short term or proximal criteria or distal criteria, such as the
impact of one’s career on some field of interest. Any specification of a performance or criterion
domain must also consider the impact of time (Ackerman, 1989; Henry & Hulin, 1989). In any
study of performance, these various factors must be carefully considered when one decides on the
nature of the performance constructs and actual operationalizations of the underlying constructs
and how those measures might or might not be related to measures of other constructs in the
domain of interest.
For instance, the sample size needed to demonstrate a statistically significant predictor-criterion relationship decreases as the magnitude of the relationship that exists between predictor and criterion
(i.e., effect size) increases. Sussman and Robertson (1986), in their assessment of various predic-
tive and concurrent validation designs, found that those strategies that allowed larger sample sizes
gained a trivial increment in power. This suggests that, as long as sample sizes can support the
use of a criterion-related design, further attention toward increasing N may not reap large benefits.
Other factors affecting power include the interrelatedness and number of predictors used, such that
the addition of nonredundant predictors increases power (Cascio, Valenzi, & Silbey, 1978). The
reliability of the predictors and criteria and the decision criteria used for inferring that a relation-
ship is nonzero also impact power.
By incorporating power analysis in validation design, researchers can increase the likelihood
that relationships relevant to key inferences will be tested with sufficient sample size upon which to
have confidence in the results. However, from a scientific standpoint, the importance of demonstrat-
ing that predictor-criterion relationships are statistically significant may be overstated, given that
relationships, which may not be practically meaningful, can reach statistical significance with large
enough sample sizes. For instance, a statistically significant relationship, in which a test accounts for
less than 5% of the variance in job performance, is not unequivocal support for the test’s use. This
is especially evident when there is reason to suggest that other available tests could do a better job
predicting performance. Nonetheless, we agree with Cortina and Dunlap’s (1997) overall contention
that statistical significance testing remains a useful tool to researchers when decision criteria are
needed. For this reason, strong evidence in support of the predictor-criterion relationship should be
derived based on the size and significance of the validity coefficients.
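A minimal power calculation for a validity coefficient, using the Fisher z approximation, can make the sample-size implications concrete (an illustrative sketch, not a procedure prescribed by the chapter):

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_for_correlation(rho, n, alpha=0.05):
    """Approximate power of a two-tailed test that a correlation is
    nonzero, via the Fisher z transformation; the negligible
    opposite tail is ignored."""
    z_rho = atanh(rho)               # Fisher z of the assumed effect
    se = 1 / sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist(mu=z_rho, sigma=se).cdf(z_crit * se)

def n_for_power(rho, target=0.80, alpha=0.05):
    """Smallest N reaching the target power for an assumed validity."""
    n = 10
    while power_for_correlation(rho, n, alpha) < target:
        n += 1
    return n
```

For an assumed validity of .30, roughly 85 cases are needed for 80% power at a two-tailed alpha of .05, illustrating the point that larger effect sizes require smaller samples, and vice versa.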
Operationally, there are several other important considerations in criterion-related research (e.g.,
job analyses that support the relevance of predictor and criterion constructs and the quality of the
measures of each set of constructs). However, those concerns are addressed repeatedly in textbooks
(e.g., Guion, 1998; Ployhart, Schneider, & Schmitt, 2006). In the next section of this chapter, we
address a very important concern that is rarely discussed.
In Phase 2, scores on the criterion measure (i.e., ratings from a single rater on a work sample)
were standardized across raters. There were significant differences between raters that were attrib-
uted to different rating standards and not to group differences in ratees, such as experience, rank,
or supervisor performance ratings. The raters were noncommissioned officers and did not receive
extensive training in the rating task, so that differences between raters in judgmental standards
were not unexpected. As a result, the validity coefficients for both jobs increased (radio repairers,
from .09 to .18; automotive mechanics, from .17 to .24). In Phase 3, validity coefficients were corrected for range restriction, which again increased the estimated validity coefficients (radio repairers, from .18 to .49; automotive mechanics, from .24 to .37). Maier noted that the final
validity coefficients were within the expected range on the basis of previous studies.
The Maier (1988) study illustrates the large effect that data cleaning can have on the accuracy of validity coefficient estimates in a predictive design. Several caveats are warranted, however, so that data cleaning procedures based on sound professional judgment are not perceived as data “fudging.” First, the cleaning procedures need to have
a theoretical or rational basis. Researchers should document any decision criteria used and the sub-
stantive changes that are made. For example, researchers should record methods used for detecting
and dealing with outliers. In addition, a strong case should be built in support of any decisions made.
The researcher bears the burden of defending each alteration made to the data. For example, in the
Maier study, the decision to standardize criterion data across raters (because raters were relatively
untrained and used different rating standards) was supported by empirical evidence that ruled out
several alternative explanations for the mean differences observed between raters.
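The two adjustments described in the Maier study can be sketched as follows; the data layout and the use of the Thorndike Case II formula are our illustrative assumptions about how such cleaning and correction might be coded, not a reconstruction of Maier's actual analysis:

```python
from math import sqrt
from statistics import mean, stdev

def standardize_within_rater(ratings_by_rater):
    """Phase 2-style cleaning: z-score each rater's ratings so that
    rater leniency/severity differences do not masquerade as ratee
    differences (defensible only after ruling out true group
    differences between raters' ratees, as Maier did)."""
    out = {}
    for rater, scores in ratings_by_rater.items():
        m, s = mean(scores), stdev(scores)
        out[rater] = [(x - m) / s for x in scores]
    return out

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Phase 3-style correction (Thorndike Case II, direct restriction
    on the predictor): u is the ratio of applicant-pool to incumbent
    predictor standard deviations."""
    u = sd_unrestricted / sd_restricted
    return (u * r_restricted) / sqrt(1 + (u ** 2 - 1) * r_restricted ** 2)
```

After standardization, a lenient and a harsh rater who rank their ratees identically produce identical criterion scores, and the restriction correction scales an observed coefficient upward as a function of how much the incumbent sample's predictor variance was curtailed.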
of minimally competent is critical) would answer the question correctly. These probabilities are
summed across experts and across items. The result is the cutoff score. A second approach to the
setting of cut scores is to set them by reference to some acceptable level of performance on a crite-
rion variable. In this case, one could end up saying that an individual with a score of 75 on some test
has a 10% (or any other) chance of achieving success on some job. However, this “benchmarking” of scores against criteria does not resolve the problem, because someone will still be asked to make what are sometimes equally difficult decisions about what constitutes acceptable performance. Cizek (2001)
provided a comprehensive treatment of methods of setting performance standards.
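An Angoff-style cut score computation might look like the following sketch; the expert probabilities are invented, and this common variant averages the experts' implied cutoffs across experts rather than simply summing them:

```python
# Invented Angoff-style ratings: each row is one expert's estimated
# probability that a minimally competent examinee answers that item
# correctly (2 experts, 4 items).
expert_item_probs = [
    [0.8, 0.6, 0.9, 0.5],
    [0.7, 0.7, 0.8, 0.6],
]

# Sum within each expert to get that expert's implied raw cutoff,
# then average across experts for the operational cut score.
per_expert_cutoffs = [sum(probs) for probs in expert_item_probs]
cut_score = sum(per_expert_cutoffs) / len(per_expert_cutoffs)
```

On a 4-item test, both hypothetical experts imply a cutoff of 2.8 raw-score points, so an examinee would need at least 3 correct answers to be judged minimally competent.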
The use of score cutoffs to establish minimum qualifications or competency is common in licens-
ing exams. Credentialing exams may require evidence of a higher level of skill or performance
capability in some domain, but they too usually require only a “pass-fail” decision. Validation of
these cutoffs almost always relies solely on the judgments of experts in the performance area of
interest. In these cases, careful explication of the behaviors required to perform a set of tasks and
the level of “acceptable” performance is essential and likely the only possible form of validation.
are likely quite similar to those made in the individual assessments discussed in the previous section
of this chapter but also may occur when judgments are made about job applicants in employment
interviews, assessment centers, and various other instances in which human resource specialists
or psychologists make employment decisions. Clinical judgment is often compared with statisti-
cal judgment in which test scores are combined on the basis of an arithmetic formula that reflects
the desired weighting of each element of information. The weights may be determined rationally
by a group of job experts or by using weights derived from a regression of a measure of overall
job success on scores on various dimensions using different methods of measurement. Meehl’s
original research (1954) showed that even the worst regression estimate was at least as accurate as the judgments made by human decision-makers. A more recent treatment and review of this literature by Hastie and Dawes (2001) has reaffirmed the general conclusion that predictions made by
human experts are inferior to those based on a linear regression model. However, human experts are
required to identify the types of information used in the prediction task. The predictions themselves
are likely best left to some mechanical combination rule if one is interested in maximizing a perfor-
mance outcome. The overall clinical judgment when used to make decisions should be the focus of
the validation effort, but unless it is clear how information is combined by the decision-maker, it is
unclear what constructs are playing a role in their decisions. The fact that these clinical judgments often correlate less strongly with externally relevant and important outcomes than statistical combinations do suggests that the constructs these decision-makers use are not the relevant ones.
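The contrast between clinical and statistical combination can be made concrete with a brief sketch; the predictor names and weights below are invented, and in practice the weights would come from expert judgment or from a regression of job success on the dimensions, as described above:

```python
# Invented predictors and rationally set weights for illustration
weights = {"cognitive_test": 0.5, "structured_interview": 0.3, "biodata": 0.2}

def mechanical_score(standardized_scores):
    """A mechanical (statistical) combination: the same fixed linear
    rule applied to every applicant, unlike a clinical judgment that
    may weight the same information inconsistently across cases."""
    return sum(w * standardized_scores[k] for k, w in weights.items())

composite = mechanical_score(
    {"cognitive_test": 1.0, "structured_interview": 0.5, "biodata": -0.2}
)
```

Because the rule is explicit, the constructs driving each decision are known, which is precisely what validation of a clinical judgment process typically lacks.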
In clinical judgment, the presence or absence of adverse impact can be the result of a combina-
tion of information that does not display sizable subgroup differences or a bias on the part of the
person making the judgment. Psychologists making clinical judgments may mentally adjust scores
on the basis of their knowledge of subgroup differences on various measures. There are again no
studies of which we are aware that address the use or appropriateness of such adjustments.
were actually being evaluated for employment versus a situation in which they were responding “for
research purposes.” To our knowledge, this comparison has not been made frequently, and when it
has been done in meta-analyses, cognitive and noncognitive test validities have not been separated
(Schmitt, Gooding, Noe, & Kirsch, 1984). It is the case that practical considerations have made the
use of concurrent designs much more frequent than predictive designs (Schmitt et al., 1984).
Meta-analytic data suggest that there are not large differences in the validity coefficients resulting
from these two designs. Further, range restriction corrections can be applied to correct for the fact
that data for lower-scoring persons are absent from concurrent studies, but these data are often
absent in reports of criterion-related research. Nor can we estimate any effects on test scores
that might result from the fact that much more is at stake in a testing situation that may result in
employment as opposed to one that is being done for research purposes. Moreover, as Sussman
and Robertson (1986) maintained, the manner in which some predictive studies are designed and
conducted makes them little different from concurrent studies.
and for the variability due to small sample size are also well known and routinely applied in validity
generalization work. However, most primary studies do not report information that allows for
sample-based corrections for criterion unreliability or range restriction. Schmidt and Hunter (1977)
in their original meta-analytic effort used estimates of the sample size of the validity coefficients
they aggregated because not even sample size was available in early reports. Consequently, in
estimating population validity coefficients, meta-analysts have been forced to use assumed artifact
distributions based on the small amount of data that are available. There is some evidence that
these assumptions are approximately correct (e.g., Alexander, Carson, Alliger, & Cronshaw, 1989;
Sackett & Ostgaard, 1994) for range restriction corrections, but the use of such assumed artifact
distributions would not be necessary with adequate reporting of primary data. Unfortunately, such
information for most of our primary database is lost. In addition, researchers disagree regarding
the appropriate operationalization of criterion reliability (Murphy & DeShon, 2000; Schmidt,
Viswesvaran, & Ones, 2000).
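The criterion-unreliability correction at issue is the classical correction for attenuation; a sketch with invented values shows why the choice of reliability estimate matters, since the corrected validity scales with 1/sqrt(r_yy):

```python
from math import sqrt

def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Classical disattenuation: rho = r / sqrt(r_yy). Only criterion
    unreliability is corrected; predictor unreliability is left
    uncorrected in operational validity estimates."""
    return r_observed / sqrt(criterion_reliability)

# Two illustrative operationalizations of criterion reliability
via_interrater = correct_for_criterion_unreliability(0.25, 0.52)
via_alpha = correct_for_criterion_unreliability(0.25, 0.86)
```

The same observed validity of .25 corrects to roughly .35 or .27 depending on which reliability estimate is used, which is why the disagreement over the appropriate operationalization of criterion reliability has real consequences for population validity estimates.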
1. They should be predictive (i.e., longitudinal with data collection at multiple points), con-
current, and of sufficient sample size to allow for adequate power in the tests of hypoth-
eses. Large-scale studies in which organizations continue data collection over time on an
ever-expanding group of participants should be initiated.
2. Multiple criteria should be collected to allow for evaluation of various KSAO-performance
relationships.
3. Data should be collected to allow for artifact corrections such as unreliability in the criteria
and range restriction.
4. Unit-level data should be collected to allow for evaluation of multilevel hypotheses. These
data should include basic unit characteristics and outcome data.
5. Demographic data should be collected to allow for evaluation of subgroup differences
in the level of performance and differences in KSAO-performance relationships across
subgroups.
6. Data on constructs thought to be related (and unrelated) to the target constructs of interest
should be collected to allow for evaluation of broader construct validity issues.
Obviously, these studies would necessitate a level of cooperation and planning not characteris-
tic of multiple researchers, much less multiple organizations. However, real advancement in our
understanding of individual differences in KSAOs and performance will probably not come from
additional small-scale studies or meta-analyses of primary studies that address traditional questions
with sample sizes, research designs, and measurement characteristics that are not adequate.
CONCLUSIONS
It is certainly true that meta-analyses have provided our discipline with strong evidence that many
of the relationships between individual differences and performance are relatively strong and gener-
alizable. However, many situations where validation is necessary do not lend themselves to validity
generalization or the use of meta-analytic databases. As a result, practitioners frequently find them-
selves in situations where well-designed primary studies are required. A focus on the appropriate
designs for these studies is therefore important.
Additionally, without primary studies of the relationships between individual differences and
performance, there can be no meta-analyses, validity transport, or validity generalization. The qual-
ity and nature of the original studies that are the source of our meta-analytic database determine to
a great extent the currency and quality of the conclusions derived from the meta-analyses, statistical
corrections notwithstanding.
We argue that the field would be greatly served by large-scale primary studies of the type con-
ducted as part of Project A (see Sackett, 1990 or Campbell & Knapp, 2001). These studies should
begin with a clear articulation of the performance and predictor constructs of interest. They should
involve the collection of concurrent and predictive data and improve upon research design and
reporting issues that have bedeviled meta-analytic efforts for the past three decades. Demographic
data should be collected and reported. Data should be collected across multiple organizational units
and organizations (and perhaps globally), and data describing the organizational context should be
collected and recorded. We know much more about the complexities of organizational behavior,
research design, measurement, and individual differences than we did 80–100 years ago, and this
should be reflected in how we collect our data and make them available to other professionals.
REFERENCES
Ackerman, P. L. (1989). Within-task intercorrelations of skilled performance: Implications for predicting
individual differences? (A comment on Henry & Hulin, 1987). Journal of Applied Psychology, 74,
360–364.
Aguinis, H. (Ed.). (2004). Test score banding in human resource selection: Legal, technical and societal issues.
Westport, CT: Praeger.
Alexander, R. A., Carson, K. P., Alliger, G. M., & Cronshaw, S. F. (1989). Empirical distributions of range
restricted SDx in validity studies. Journal of Applied Psychology, 74, 253–258.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1985). Standards for educational and psychological testing. Washington,
DC: American Psychological Association.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
American Psychological Association, American Educational Research Association, & National Council on
Measurement in Education. (1954). Technical recommendations for psychological and diagnostic techniques.
Psychological Bulletin, 51, 201–238.
Arthur, W., Jr., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related valid-
ity of assessment center dimensions. Personnel Psychology, 56, 125–153.
Arvey, R. D., Nutting, S. M., & Landon, T. E. (1992). Validation strategies for physical ability testing in police
and fire settings. Public Personnel Management, 21, 301–312.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77,
836–874.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychologi-
cal research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social
Psychology, 51, 1173–1182.
Barrett, G. V., Caldwell, M. S., & Alexander, R. A. (1985). The concept of dynamic criteria: A critical reanalysis.
Personnel Psychology, 38, 41–56.
Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical
reanalysis. Journal of Applied Psychology, 66, 1–6.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Boehm, V. R. (1982). Are we validating more but publishing less? The impact of governmental regulation on
published validation research—An exploratory investigation. Personnel Psychology, 35, 175–187.
Borman, W. C., & Motowidlo, S. J. (1997). Task performance and contextual performance: The meaning for
personnel selection research. Human Performance, 10, 99–109.
Boudreau, J. W., & Ramstad, P. M. (2003). Strategic industrial and organizational psychology and the role
of utility. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology (Vol. 12,
pp. 193–221). Hoboken, NJ: Wiley.
Brogden, H. E. (1951). Increased efficiency of selection resulting from replacement of a single predictor with
several differential predictors. Educational and Psychological Measurement, 11, 173–195.
Brogden, H. E. (1959). Efficiency of classification as a function of number of jobs, percent rejected, and the
validity and intercorrelation of job performance estimates. Educational and Psychological Measurement,
19, 181–190.
Brown, C. W., & Ghiselli, E. E. (1953). Percent increase in proficiency resulting from use of selection devices.
Journal of Applied Psychology, 37, 341–345.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 56, 81–105.
Campbell, J. P. (1990). An overview of the Army Selection and Classification Project (Project A). Personnel
Psychology, 43, 231–240.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variabil-
ity. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 258–299). San
Francisco, CA: Jossey-Bass.
Campbell, J. P., & Knapp, D. J. (Eds.). (2001). Exploring the limits in personnel selection and classification.
Mahwah, NJ: Lawrence Erlbaum.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt &
W. C. Borman (Eds.), Personnel selection in organizations. San Francisco, CA: Jossey-Bass.
Campbell, J. P., McHenry, J. J., & Wise, L. L. (1990). Modeling job performance in a population of jobs.
Personnel Psychology, 43, 313–334.
Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001).
The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel
Psychology, 54, 149–185.
Cascio, W. F., Alexander, R. A., & Barrett, G. V. (1988). Setting cut scores: Legal, psychometric, and profes-
sional issues and guidelines. Personnel Psychology, 41, 1–24.
Cascio, W. F., Valenzi, E. R., & Silbey, V. (1978). Validation and statistical power: Implications for applied
research. Journal of Applied Psychology, 63, 589–595.
Cizek, G. J. (Ed.). (2001). Setting performance standards. Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Validation Strategies for Primary Studies 69
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings.
Chicago, IL: Rand-McNally.
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods,
2, 161–172.
Coward, W. M., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation.
Journal of Applied Psychology, 75, 297–300.
Cropanzano, R., & Wright, T. A. (2003). Procedural justice and organizational staffing: A tale of two paradigms.
Human Resource Management Review. Special Issue: Fairness and Human Resources Management, 13,
7–39.
Davis, J. E. (1989). Construct validity in measurement: A pattern matching approach. Evaluation and Program
Planning. Special Issue: Concept Mapping for Evaluation and Planning, 12, 31–36.
Dunnette, M. D. (1963). A note on the criterion. Journal of Applied Psychology, 47, 251–254.
Edwards, J. R. (2002). Alternatives to difference scores: Polynomial regression analysis and response surface
methodology. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations.
San Francisco, CA: Jossey-Bass.
Fiske, D. W. (1951). Values, theory, and the criterion problem. Personnel Psychology, 4, 93–98.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analyses of assessment center
validity. Journal of Applied Psychology, 72, 493–511.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: Wiley.
Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26,
461–478.
Gibson, W. M., & Caplinger, J. A. (2007). Transportation of validation results. In S. M. McPhail (Ed.),
Alternative validation strategies: Developing new and leveraging existing validity evidence: The profes-
sional practice series. (pp. 29–81). Hoboken, NJ: Wiley.
Goldman, B. M., Gutek, B. A., Stein, J. H., & Lewis, K. (2006). Employment discrimination in organizations:
Antecedents and consequences. Journal of Management, 32(6), 786–830.
Guion, R. M. (1961). Criterion measurement and personnel judgments. Personnel Psychology, 14, 141–149.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world. Thousand Oaks, CA: Sage.
Henry, R. A., & Hulin, C. L. (1989). Changing validities: Ability-performance relations and utilities. Journal
of Applied Psychology, 74, 365–367.
Hogan, J., & Quigley, A. M. (1986). Physical standards for employment and the courts. American Psychologist,
41, 1193–1217.
Hull, C. L. (1928). Aptitude testing. Yonkers, NY: World Book.
Ilgen, D. R., & Pulakos, E. D. (Eds.). (1999). The changing nature of performance. San Francisco, CA:
Jossey-Bass.
Jeanneret, P. R., & Silzer, R. (Eds.). (1998). Individual psychological assessment: Predicting behavior in orga-
nizational settings. San Francisco, CA: Jossey-Bass.
Lance, C. L., Foster, M. R., Gentry, W. A., & Thoresen, J. D. (2004). Assessor cognitive processes in an opera-
tional assessment center. Journal of Applied Psychology, 89, 22–35.
Lance, C. L., Foster, M. R., Nemeth, Y. M., Gentry, W. A., & Drollinger, S. (2007). Extending the nomological
network of assessment center construct validity: Prediction of cross-situationally consistent and specific
aspects of assessment center performance. Human Performance, 20, 345–362.
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist,
41, 1183–1192.
Livingston, S. A. (1980). Comments on criterion-referenced testing. Applied Psychological Measurement, 4,
575–581.
Maier, M. H. (1988). On the need for quality control in validation research. Personnel Psychology, 41,
497–502.
McDaniel, M. A. (2007). Validity generalization as a test validation approach. In S. M. McPhail (Ed.),
Alternative validation strategies: Developing new and leveraging existing validity evidence: The profes-
sional practice series (pp. 159–180). Hoboken, NJ: Wiley.
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence.
Minneapolis, MN: University of Minnesota Press.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American
Psychologist, 30, 955–966.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45, 35–44.
Motowidlo, S. J. (2003). Job performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook
of psychology (Vol. 12, pp. 39–53). Hoboken, NJ: Wiley.
Murphy, K. R. (1986). When your top choice turns you down: The effects of rejected offers on the utility of
selection tests. Psychological Bulletin, 99, 133–138.
Murphy, K. R., & DeShon, R. P. (2000). Interrater correlations do not estimate the reliability of job perfor-
mance ratings. Personnel Psychology, 53, 873–900.
Newman, D. A., Jacobs, R. R., & Bartram, D. (2007). Choosing the best method for local validity estima-
tion: Relative accuracy of meta-analysis versus a local study versus Bayes-analysis. Journal of Applied
Psychology, 92, 1394–1413.
Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Cognitive ability in selection decisions. In O. Wilhelm
(Ed.), Handbook of understanding and measuring intelligence (pp. 431–468). Thousand Oaks, CA:
Sage.
Peterson, N. G., Wise, L. L., Arabian, J., & Hoffman, R. G. (Eds.). (2001). Synthetic validation and valid-
ity generalization: When empirical validation is not possible. In J. P. Campbell & D. J. Knapp (Eds),
Exploring the limits in personnel selection and classification (pp. 411–452). Mahwah, NJ: Lawrence
Erlbaum.
Picano, J. J., Williams, T. J., & Roland, R. R. (Eds.). (2006). Assessment and selection of high-risk operational
personnel. New York, NY: Guilford Press.
Ployhart, R. E., & Schmitt, N. (2007). The attraction-selection-attrition model and staffing: Some multilevel
implications. In D. B. Smith (Ed.), The people make the place: Exploring dynamic linkages between
individuals and organizations (pp. 89–102). Mahwah, NJ: Lawrence Erlbaum.
Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary practice and the-
ory. Mahwah, NJ: Lawrence Erlbaum.
Rothstein, H. R. (1992). Meta-analysis and construct validity. Human Performance, 5, 71–80.
Ryan, A. M., & Sackett, P. R. (1987). A survey of individual assessment practices by I/O psychologists.
Personnel Psychology, 40, 455–488.
Ryan, A. M., & Sackett, P. R. (1992). Relationships between graduate training, professional affiliation, and
individual psychological assessment practices for personnel decisions. Personnel Psychology, 45,
363–385.
Ryan, A. M., & Sackett, P. R. (1998). Individual assessment: The research base. In R. Jeanneret & R. Silzer
(Eds.), Individual psychological assessment: Predicting behavior in organizational settings (pp. 54–87).
San Francisco, CA: Jossey-Bass.
Sackett, P. R. (Ed.). (1990). Special Issue: Project A: The U.S. Army Selection and Classification Project.
Personnel Psychology, 43, 231–378.
Sackett, P. R., & Ostgaard, D. J. (1994). Job-specific applicant pools and national norms for cognitive ability
tests: Implications for range restriction corrections in validation research. Journal of Applied Psychology,
79, 680–684.
Sackett, P. R., & Roth, L. (1991). A Monte Carlo examination of banding and rank order methods of test score
use in personnel selection. Human Performance, 4, 279–295.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment cre-
dentialing, and higher education: Prospects in a post-affirmative action world. American Psychologist,
56, 302–318.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemploy-
ment testing. American Psychologist, 49, 929–954.
Scherbaum, C. A. (2005). Synthetic validity: Past, present, and future. Personnel Psychology, 58, 481–515.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generaliza-
tion. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychol-
ogy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution, and impact of validity generalization
and meta-analysis methods, 1975–2001. In K. R. Murphy (Ed.), Validity generalization: A critical review.
Applied Psychology Series (pp. 31–65). Mahwah, NJ: Lawrence Erlbaum.
Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Trattner, M. H. (1986). The economic impact of job selec-
tion methods on size, productivity, and payroll costs of the federal work force: An empirically based
demonstration. Personnel Psychology, 39, 1–30.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validity studies. Journal
of Applied Psychology, 61, 473–485.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for
three modes of test use. Journal of Applied Psychology, 69, 490–497.
Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability.
Personnel Psychology, 53, 901–912.
Schmitt, N., Gooding, R., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published between
1964 and 1982, and the investigation of study characteristics. Personnel Psychology, 37, 407–422.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive
efficiency of various predictor combinations. Journal of Applied Psychology, 82, 717–730.
Schneider, B. (1987). The people make the place. Personnel Psychology, 40, 437–454.
Schneider, B., Smith, D., & Sipe, W. P. (2000). Personnel selection psychology: Multilevel considerations.
In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations
(pp. 91–120). San Francisco, CA: Jossey-Bass.
Scott, W. D. (1915). The scientific selection of salesmen. Advertising and Selling, 5, 5–7.
Shotland, A., Alliger, G. M., & Sales, T. (1998). Face validity in the context of personnel selection: A multime-
dia approach. International Journal of Selection and Assessment, 6, 124–130.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures. Bowling Green, OH: Author.
Strickland, W. J. (1979). The relationship between program evaluation research and selection system validation:
Application to the assessment center method. Dissertation Abstracts International, 40(1-B), 481–482.
Sussman, M., & Robertson, D. U. (1986). The validity of validity: An analysis of validation study designs.
Journal of Applied Psychology, 71, 461–468.
Terpstra, D. E., & Kethley, R. B. (2002). Organizations’ relative degree of exposure to selection discrimination
litigation. Public Personnel Management, 31, 277–292.
Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices to organizational level measures of
performance. Personnel Psychology, 46, 27–48.
Thayer, P. W. (1992). Construct validation: Do we understand our criteria? Human Performance, 5, 97–108.
Thorndike, E. L. (1911). Individuality. Boston, MA: Houghton-Mifflin.
Viteles, M. S. (1932). Industrial psychology. New York, NY: Norton.
4 Work Analysis
Kenneth Pearlman and Juan I. Sanchez
The purposes of this chapter are to describe and summarize the current state of the art with respect
to work analysis as it applies to employee or personnel selection and to suggest expansions of such
applications in light of emerging and anticipated changes in the world of work. We use the term
“work analysis” broadly to refer to any systematic process for gathering, documenting, and analyz-
ing information about (a) the content of the work performed by people in organizations (e.g., tasks,
responsibilities, or work outputs), (b) the worker attributes related to its performance (often referred
to as knowledge, skills, abilities, and other personal characteristics, or KSAOs), or (c) the context in
which work is performed (including physical and psychological conditions in the immediate work
environment and the broader organizational and external environment). Other terms, such as “job
analysis,” “occupational analysis,” and “job specification” are often used, sometimes interchange-
ably and with somewhat varying definitions in different contexts, to refer to one or more of these
activities. Our use of “work analysis” reflects our preference for a broader term that does not con-
note a focus on any particular aspect or unit of analysis in the study of work.
The present focus on staffing involves just one of a wide range of organizational and human
resources (HR) needs and purposes for which work analysis is used (e.g., training, job evaluation,
performance management, job design, workforce planning, and many others). However, we take a
broad view of this function, ranging from its narrowest conception (external hiring) to a wider range
of related HR functions (personnel placement and classification, internal staffing and promotion,
selection for training or special assignments, and selection out of an organization) and extending to
its broadest conception as reflecting different transition or decision points within an ongoing pro-
cess or “life cycle” of employee assimilation and socialization within an organization (Anderson &
Ostroff, 1997).
This chapter is organized into five major sections. “Traditional Selection-Related Applications of
Work Analysis” examines the primary applications of selection-oriented work analysis. “A Review
of Major Work Analysis Methods and Approaches” provides a review and analysis of major histori-
cal work analysis methods that have been used to support personnel selection. “Key Work Analysis
Practice Issues” is devoted to several key issues that arise in the practical application of work analy-
sis to personnel selection. “Frontiers of Work Analysis: Emerging Trends and Future Challenges”
discusses several emerging trends, issues, and challenges that we see as critical to the continuing
and future relevance and utility of selection-oriented work analysis. “Synopsis and Conclusions”
summarizes and draws some general conclusions on the basis of the material presented in the main
sections.
analysis for employee selection) it is therefore relevant to understand that an organization’s selection
processes are (or should be) derived from its overall selection strategy, which in turn is (or should
be) based on its broader goals and strategies. Just as selection systems must be viewed as part of
broader HR systems (recruitment, training, performance management, and so forth), and ideally
should be integrated in a planful manner with these other systems (which themselves are, ideally,
linked to and aligned with the organization’s broader goals), the work analysis systems that support
them should be similarly viewed. In other words, organizational goals and strategy should drive
selection system goals and strategy, which in turn should drive work analysis strategy, which then
serves as the basis for the many specific decisions involved in designing a particular work analysis
system, program, or project.
Broadly speaking, the purpose of work analysis for personnel selection applications is to ensure
that selection systems are work- or job-related, and hence valid, and thereby have value or utility for
the organization, as well as being legally defensible.1 Within this context, four general categories of
work analysis application can be distinguished, as follows.
1 The legal authority relevant to work analysis practices in the context of personnel selection is embodied in the Uniform
Guidelines on Employee Selection Procedures (Guidelines; EEOC, CSC, DOL, & DOJ, 1978) issued by the agencies
responsible for enforcing federal antidiscrimination law. The Guidelines outline several specific and somewhat differing
work analysis requirements for different selection procedure validation strategies that represent important constraints
and considerations that practitioners must take into account. The various legal and professional guidelines and standards
relevant to domestic and international personnel selection, including work analysis, are treated in depth elsewhere in
this volume and hence are not covered here. Detailed discussion of legal and professional standards for work analysis as
they relate specifically to the applications we describe below can be found in Brannick, Levine, and Morgeson (2007),
Gatewood, Feild, and Barrick (2008), and the Stelly/Goldstein, Gibson/Caplinger, Johnson, and McDaniel chapters in
McPhail (2007).
KSAOs, are then developed and judgmentally linked back to the content domain by subject matter
experts (SMEs). Validity is a function of the strength of these (measure-content domain) linkages,
which in turn are a function of the quality of the original work analysis, the quality of the experts,
and the rigor of the linkage process.
TABLE 4.1
Work Analysis Information Framework

Work-oriented content
  Broad: • Major duties • Major responsibilities
  Moderate: • Task clusters or work activities • Work functions or processes • Material, equipment, tool, and machine categories
  Specific: • Tasks or work steps • Work outputs • Specific materials, equipment, tools, and machines • Work-content-based performance standards • Specific work environment features and working conditions

Worker-oriented content
  Broad: • Position/job/occupational titles
  Moderate: • Generalized work behaviors • Worker functions or processes
  Specific: • Worker-behavior-based performance standards • Behavioral indicators

Attribute requirements
  Broad: • Personality traits, values, and interests • Aptitudes and abilities
  Moderate: • Generic or cross-functional skills
  Specific: • Specialized/technical skills • Specialized/technical knowledge
distinction). Qualitative approaches build up work or job information, often from scratch (e.g., on
the basis of observing or interviewing job incumbents to determine specific tasks performed) and
generally one job at a time, yielding detailed, narrative information that is customized to indi-
vidual jobs or specific work within an organization. Quantitative approaches are generally based
on the use of structured work analysis questionnaires or surveys composed of pre-established
lists of different types of work or job descriptors (i.e., work characteristics or units of analysis,
such as work behaviors, worker functions, or KSAOs). These usually include rating scales that
permit SMEs (incumbents, supervisors, or job analysts) to quantify their judgments about individual
descriptors along dimensions of interest (e.g., performance frequency, importance, level of
complexity, and consequences of error).
Work analysis content refers to the types of work descriptors used and the level of analysis or
detail represented by these descriptors. McCormick (1976) has usefully distinguished among three
broad descriptor categories, including (in slightly adapted form) (a) work-oriented content descrip-
tors, in which the descriptive frame of reference is the work to be done, including the purpose,
steps, tools and materials, required resources, and conditions under which work is accomplished
(examples include tasks, activities, duties, responsibilities, working conditions, and work outputs);
(b) worker-oriented content descriptors, in which the descriptive frame of reference is what workers
do to carry out work (examples include worker functions, processes, or behaviors); and (c) attri-
bute requirement descriptors, in which the descriptive frame of reference is the attributes needed
by a worker to do the specified work (examples include skills, knowledge, abilities, and tempera-
ments or dispositions). Distinctions among these categories can at times be blurry because of some
natural overlap and the unavoidable imprecision of language (e.g., the second and third categories
are sometimes considered as a single “worker-oriented” or “attribute” category); nonetheless, they
have proven to be conceptually useful. Work descriptors can be further differentiated in terms of
the level of analysis or description reflected by a particular category of descriptor. Work-oriented
content descriptors can range from narrow and specific (such as tasks performed) to broad and gen-
eral (such as major duties or responsibilities), as can worker-oriented content descriptors (ranging
from specific behavioral performance indicators to broad position or job titles). Similarly, attribute
requirement descriptors can be represented by very narrowly defined characteristics (specialized
skills or knowledge) or very broadly defined human attributes (such as aptitudes and abilities or
personality traits).
Table 4.1 illustrates how these two facets of work analysis content interact and play out in terms
of specific descriptor types. It provides examples of descriptors representing each work descriptor
category within each of three levels of analysis (broad, moderate, and specific; these are admittedly
imprecise distinctions that nonetheless seem to work reasonably well as a descriptive frame of
reference for our purposes). Note that the qualitative-quantitative work analysis process distinction
is not represented here but is represented in our later expansion of this table and associated text
discussion of specific work analysis methodologies.
TABLE 4.2
Significant Work Analysis Methods Organized by Work Analysis Information Framework Cells

Work-oriented content
  Broad: • Job Diagnostic Survey • PPRF Work Styles • PIC Checklist
  Moderate: • Minnesota Job Description Questionnaire
  Specific: • TI/CODAP (a)

Worker-oriented content
  Broad: • Dictionary of Occupational Titles classification structure
  Moderate: • Position Analysis Questionnaire
  Specific: • Cognitive Task Analysis (a) • Critical Incident Technique (a)

Attribute requirements
  Broad: • Fleishman Ability Requirement Scales • Holland Interest Taxonomy
  Moderate: • SCANS • Work Keys
  Specific: • Job Element Method (a) • CIP-2000 Knowledge Taxonomy

Hybrid (multidescriptor and/or multilevel)
  Broad: • Functional Job Analysis (a) • MOSAIC (a) • SHL Universal Competency Framework
  Moderate: • Competency Modeling • Strategic Job Modeling (a) • O*NET (a)

(a) Method generates some or all of its information using qualitative processes (see earlier text discussion).
The Position Analysis Questionnaire (PAQ; McCormick, Jeanneret, & Mecham, 1972), one
of the most extensively researched and widely used of all work analysis methods, is a structured
questionnaire completed by incumbents or supervisors that describes jobs in terms of 27 stan-
dardized worker-oriented dimensions that are neither highly specific nor extremely broad and are
common across nearly all jobs. It thus lends itself well to quantitative cross-job comparisons, job
family development, and validity evidence extension applications; in particular, synthetic validity
(originated by McCormick as “job-component” validity on the basis of PAQ research) and valid-
ity transportability. TI/CODAP (an acronym for Task Inventory/Comprehensive Occupational Data
Analysis Programs; Christal & Weissmuller, 1988) refers to a collection of computer programs and
applications that analyze and array in various ways quantitative data collected from standardized
task inventories on enlisted military and officer jobs. Following initial implementation in the U.S.
Air Force in 1967, it was eventually adopted as the primary work analysis tool for all branches of the
military and has expanded into widespread use in academia, business, industry, and federal, state,
and local governments. The method is simple and efficient, permitting relatively quick gathering
and analysis of detailed task information from large samples of workers. Its disadvantages include
the often lengthy process required for task inventory development and the fact that its exclusively
work-oriented output is not directly suitable for applications requiring worker-oriented or KSAO-
type data without additional steps or procedures.
The Fleishman Ability Requirement Scales (ARS) methodology is an outgrowth of several lines
of programmatic research involving task characteristics and human ability taxonomies begun in the
1960s (Fleishman & Quaintance, 1984). This led to the development of a highly construct-valid set
of 52 cognitive, physical, psychomotor, and sensory abilities, along with associated rating scales
based on empirically calibrated task or behavioral anchors used to evaluate the ability requirements
of tasks, broader job components, or entire jobs. Later research extended the approach to the cover-
age of social/interpersonal dimensions, knowledge, and skills. The method’s chief assets are the
strong research base underlying the ability sets and rating scales, its relative ease of understanding
and use, and the utility of the resulting ability requirement profiles for quantitative job comparison
and job family development, as well as in providing a direct linkage to potential selection tools
within a criterion-related or construct validation context. It is less well suited to applications requir-
ing work-oriented descriptor content, such as validity evidence development (content validity) and
criterion development, although Landy (1988) provided a case study illustrating how this can be
accomplished.
There is another subset of work analysis methods that can be thought of as “taxonomy-driven,”
that is, methods in which judgments or ratings are made in terms of a predetermined list of attribute
requirement descriptors (which may represent one or more levels of analysis) determined through
prior research to represent all (or the important) elements of a meaningful KSAO domain. In this
respect they are similar to the ARS methodology, but differ from the PAQ approach, in which
analyst ratings are made on specific items and are subsequently converted statistically into broader
dimension scores. Examples of such methods include the Holland interest taxonomy (Holland,
1973), the SHL Universal Competency Framework (Bartram, 2005), the CIP-2000 knowledge tax-
onomy (embodied in the U.S. Department of Education’s current Classification of Instructional
Programs; USDE, 2002), the SCANS skill set (Secretary’s Commission on Achieving Necessary
Skills, 1992), and Work Keys (McLarty & Vansickle, 1997). Not all of these were developed as work
analysis methods per se (such as SCANS and CIP-2000), but they all can be or have been readily
adapted for such application.
Several of the methods included in Table 4.2 might be thought of as “special-purpose” methods.
The Job Element Method (Primoff, 1975), developed at the U.S. Civil Service Commission (now the
U.S. Office of Personnel Management) during the 1950s, was an innovative approach to developing
selection standards, primarily for trades and labor jobs, and an early application of synthetic validity,
although it is no longer in significant use. The Job Diagnostic Survey (JDS;
Hackman & Oldham, 1976) measures, through survey items, the “motivating potential” of jobs in
terms of five broad work-oriented job dimensions—skill variety, task identity, task significance,
autonomy, and feedback—predicted by the method’s underlying “job characteristics theory” to be
positively related to work performance because they reflect inherently motivating characteristics of
work. The Minnesota Job Description Questionnaire (MJDQ; Dawis, 1991) is based on a taxonomy
of 21 “occupational reinforcers” that describe occupations in terms of various job circumstances or
conditions that can be related to individuals’ work preferences. The JDS and MJDQ are generally
viewed as aids to job design, work satisfaction prediction, and applications other than conventional
personnel selection (they do not lend themselves to the identification of sets of KSAOs on which
selection tools would be built), but instruments of this type do offer possible benefits for selection
that we discuss later in the chapter.
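The JDS's five dimensions are typically combined into a single index, the Motivating Potential Score (MPS) from Hackman and Oldham's job characteristics theory. A minimal sketch (the formula is standard to that theory; the rating values are invented for illustration):

```python
def motivating_potential_score(skill_variety, task_identity, task_significance,
                               autonomy, feedback):
    """Motivating Potential Score from job characteristics theory.

    The three work-meaningfulness dimensions are averaged, then multiplied
    by autonomy and feedback, so a near-zero score on either of the latter
    drives the whole index toward zero.
    """
    meaningfulness = (skill_variety + task_identity + task_significance) / 3
    return meaningfulness * autonomy * feedback

# A hypothetical job rated 6, 5, 4 on the meaningfulness dimensions and
# 5, 4 on autonomy and feedback (7-point JDS scales):
print(motivating_potential_score(6, 5, 4, 5, 4))  # 100.0
```

The multiplicative form captures the theory's claim that autonomy and feedback are necessary conditions for motivating work, not merely additive contributors.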
We characterize a final group of methods displayed in Table 4.2 as “new or emerging” approaches
to work analysis in that they attempt, in one way or another, to meet the challenges of the chang-
ing nature of work. (We provide a more in-depth discussion of one such approach, Competency
Modeling, later in this chapter.) The Personality-Related Position Requirements Form (PPRF;
Raymark, Schmit, & Guion, 1997) and the Performance Improvement Characteristic (PIC) check-
list (Hogan, Davies, & Hogan, 2007) attempt to measure the personality requirements of jobs via
structured questionnaires, thereby providing a means to supplement more conventional (task- or
ability-oriented) work analysis methods with the identification of some of the nontask and noncog-
nitive requirements of work, as we discuss later. Cognitive task analysis (Schraagen, Chipman, &
Shalin, 2000) encompasses various qualitative methods for analyzing the (unobservable) mental
processes (e.g., strategies and knowledge) used by experts to perform cognitively complex work
tasks and how experts and novices differ in such performance. Used primarily to support training
applications, such methods have not as yet demonstrated obvious or easy applicability to employee
selection—especially given their cost, complexity, and time- and labor-intensiveness—but they do
warrant our continued interest, particularly in view of the increasing cognitive complexity of work
in many sectors of our economy.
The U.S. Office of Personnel Management’s Multipurpose Occupational Systems Analysis
Inventory-Close-Ended (MOSAIC) system (Rodriguez, Patel, Bright, Gregory, & Gowing, 2002),
which has become the federal government’s primary work analysis system, is a multipurpose,
automated, survey-based work analysis approach used to simultaneously collect information (from
incumbents and supervisors) on many jobs within a broad occupational area (such as information
technology, professional and administrative, trades and labor). It is based on use of a common lan-
guage for both competencies (KSAOs) and generalized task statements that are relevant to an entire
occupational area (rather than specific jobs), and hence lends itself well to cross-job comparisons in
terms of work- and worker-oriented descriptors. It is useful for personnel selection as well as other
HR applications (such as workforce planning, training needs assessment, and performance manage-
ment), the integration of which is facilitated by the underlying common descriptor base.
The Occupational Information Network (O*NET™), treated in depth elsewhere in this volume,
was developed by the U.S. Department of Labor in the mid-1990s (Peterson et al., 1999) to replace
an outmoded DOT system with an online electronic database of greatly expanded job information
and more effective methodologies for data collection, analysis, and updating. O*NET was premised
on the need for a comprehensive, theoretically based, and empirically validated common language
that represented a hierarchical, taxonomic approach to work description and would therefore be
capable of describing the characteristics of work and workers at multiple levels of analysis. Its
centerpiece is the O*NET Content Model, which serves as the blueprint and integrating framework
for the various descriptive components of the system. The content model encompasses six major
domains of job descriptors representing some 20 individual job descriptor categories or taxonomies
that reflect various (over 270) work- and worker-oriented descriptors on which data are collected
from trained analysts and job incumbents by means of various structured questionnaires and sur-
veys. O*NET research and development is ongoing, with core data having been collected on most
of its 812 substantively defined occupations, whereas several of the newer descriptor taxonomies
(e.g., Tools and Technology, Detailed Work Activities) and occupational task content are still being
built out across the occupational spectrum. Nonetheless, the internal research studies completed to
date provide a great deal of supportive reliability and validity evidence for the system. Moreover,
its potential for various research uses (e.g., see Dierdorff & Morgeson, 2007) and applications (as
evidenced in large and steadily increasing numbers of user downloads of O*NET online products
and tools) is becoming increasingly evident.
Strategic Job Modeling (SJM) is a term coined by Schippmann (1999) to describe an approach to
work analysis that can serve as a basis for integrated HR systems. Its core is a conceptual framework
outlining the key descriptive elements on the work and person sides of the performance equation.
The approach itself consists of a series of steps, suggested guidelines, procedures, and work aids for
obtaining information on these descriptors within the context of a given SJM project. No specific
instrumentation is provided as part of the method; it is essentially a practitioner guide for conduct-
ing work analysis with an explicit emphasis on taking account of strategic organizational issues as a
point of departure. It attempts to integrate the rigor of traditional work analysis methods and prac-
tices within the less well-defined terrain of strategic work analysis (which we discuss later in this
chapter), resulting in a potentially significant evolutionary next step for work analysis and serving as
a stimulus for further thinking and development in such directions (see, e.g., Barney, 2000).
TABLE 4.3
Descriptor Appropriateness for Selection-Related Work Analysis Applications

                                            Level of Analysis
Work Descriptor Category     Broad          Moderate                   Specific
Work-oriented content        CD             CD, VD, VE-SV              CD, VD
Worker-oriented content      VE-VG          VE-VG, VE-SV, VE-VT        CD, VE-VT
Attribute requirements       PD, VE-VG      PD, VD, VE-VG              PD, VD

Note: CD = criterion development applications; PD = predictor development applications; VD = validity evidence development (content validity) applications; VE = validity evidence extension applications; VE-VG = encompassing validity generalization; VE-SV = synthetic validity; VE-VT = validity transportability.
at a moderate level of analysis for synthetic validity, whereas legal considerations suggest that
worker-oriented content descriptors at specific or moderate levels of analysis are most appropriate
for validity transportability.
Alternative work analysis methods can also be evaluated in terms of various specific practi-
cal and technical considerations, such as those described by Gatewood et al. (2008) and further
adapted and expanded here, including such things as the method’s (or instrument’s) (a) availability
and readiness for use; (b) versatility of application to different jobs and work settings, descriptor
needs, and units of analysis; (c) degree of standardization; (d) acceptability to users and stakehold-
ers; (e) analyst or respondent training requirements; (f) sample size needs; (g) reliability and valid-
ity of resulting data; (h) cost; and (i) suitability and utility for the specific intended application(s).
Such evaluations are best undertaken without expecting to find or create the “perfect” work analysis
system or method, because each has strengths and limitations and relatively greater or lesser suit-
ability for particular purposes or applications. Moreover, it is not always necessary to pick a single
method, because it is often possible to create hybrid methods that borrow or adapt elements from
several different approaches.
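One way to make such evaluations explicit is a simple weighted-criteria comparison across candidate methods. A hypothetical sketch (the criteria follow (a)-(i) above; the method names, weights, and 1-5 ratings are all invented for illustration):

```python
# Hypothetical weighted comparison of two candidate work analysis methods
# against evaluation criteria (a)-(i); weights reflect one project's priorities.
criteria_weights = {
    "availability": 0.10, "versatility": 0.15, "standardization": 0.10,
    "acceptability": 0.10, "training_needs": 0.05, "sample_size": 0.05,
    "reliability_validity": 0.25, "cost": 0.10, "suitability": 0.10,
}

method_ratings = {
    "off-the-shelf questionnaire": {
        "availability": 5, "versatility": 3, "standardization": 5,
        "acceptability": 4, "training_needs": 2, "sample_size": 3,
        "reliability_validity": 4, "cost": 2, "suitability": 4,
    },
    "custom hybrid method": {
        "availability": 3, "versatility": 5, "standardization": 3,
        "acceptability": 4, "training_needs": 4, "sample_size": 4,
        "reliability_validity": 4, "cost": 4, "suitability": 4,
    },
}

def weighted_score(ratings, weights):
    """Sum of criterion ratings weighted by the project's priorities."""
    return sum(weights[c] * ratings[c] for c in weights)

for method, ratings in method_ratings.items():
    print(f"{method}: {weighted_score(ratings, criteria_weights):.2f}")
```

As the surrounding text cautions, such a rubric identifies relative suitability for a particular application, not a universally "best" method; a hybrid borrowing elements from several approaches may outscore any single one.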
Reflecting more broadly on this review, it appears that work analysis is at a crossroads—one
rooted in the fact that whereas work in many economic sectors has been changing a lot, work analy-
sis methodology has been changing only a little. The general phenomenon of the changing nature of
work, workers, and the workplace resulting from broader economic, demographic, and technologi-
cal changes has been extensively described and documented for at least the last 20 years (Coates,
Jarratt, & Mahaffie, 1990; Johnston & Packer, 1987), as have its specific ramifications for many
organizational and HR practices, including personnel selection and work analysis (Offermann & Gowing,
1993; Pearlman & Barney, 2000; Sanchez, 1994). Rather than slowing down or stabilizing, the pace
of such changes, if anything, appears to be accelerating (Landy, 2007). However, for work analysis,
after periods of substantial innovation in the middle part of the 20th century, with a few exceptions
the last several decades have been largely ones of methodological refinements, variations, and new
combinations of tried-and-true methods.
Like all organizational practices, work analysis must continue to adapt and evolve to maintain
its relevance and utility. Although the traditional concept of a job may not be “dead,” as some have
argued (Bridges, 1994), its changed settings and dynamics have created new and formidable chal-
lenges for traditional work analysis assumptions and practices. Among those who have speculated
about this (Cunningham, 2000; Fogli & Whitney, 1998; Levine & Sanchez, 2007; Pearlman &
Barney, 2000), there has been some consensus that such challenges imply the need for work analysis
methods with a greater ability to capture such things as (a) strategic and future work requirements
that are based on a more macro, “top-down” (i.e., organization-level) than a micro, “bottom-up”
(i.e., individual and job level) orientation; (b) broader and multiple work roles and work processes
rather than individual jobs and work content; (c) broader sets of worker attributes (e.g., personal-
ity, attitudes, and values) relevant to contextual performance (Borman & Motowidlo, 1993), team
and organizational performance outcomes, and task and individual performance outcomes; and
(d) important elements of the broader work and organizational environment, as well as to incorpo-
rate interdisciplinary perspectives and methodological innovations that would facilitate and enhance
such pursuits. (We elaborate on many of these points later in this chapter.)
There are signs that critical mass has been building toward a new generation of work analysis
approaches that reflect some of these directions, as exemplified by some of the more expansive, ver-
satile, future-oriented, and multipurpose approaches to work analysis that have appeared in orga-
nizational practice and the research literature in recent years, such as PPRF, MOSAIC, O*NET,
various forms of strategic work analysis (such as SJM), work analysis for teams (Brannick et al.,
2007; Jones, Stevens, & Fischer, 2000), and as reflected in an increasing number of recent texts
that recognize and promote such forward-looking perspectives (e.g., Brannick et al., 2007; Landy
& Conte, 2007; Ployhart, Schneider, & Schmitt, 2006). Nonetheless, much of this territory remains
uncharted and unexplored.
equal, more data will have to be collected and that the work analysis process will be correspond-
ingly more costly and labor-intensive than when a broader level of description is involved. Another
point to recognize is that it is always possible to “aggregate up,” or combine data gathered on specific
descriptors into broader descriptor categories, but it is impossible to disaggregate broader descriptor
data to a level of detail on which data have not been collected. The implication of this for practice
is to use the most specific level of analysis for which any future use of the data may be planned or
anticipated. A corollary recommendation is, to the degree that either multiple applications are cur-
rently at hand or unexpected applications may emerge in the future, to collect data on descriptors
representing multiple levels of analysis, perhaps with a bias toward gathering more rather than less
detailed information; this will facilitate the use of the work analysis results in support of more than
one application. A final point, applicable to situations in which multiple jobs are to be analyzed and
compared, is the importance of recognizing that the use of highly specific descriptors (e.g., tasks)
tends to exaggerate the degree of between-job differences, whereas the use of very broad descriptors
(such as abilities) tends to minimize such differences. This is why descriptors at a moderate level
of analysis, such as PAQ dimensions and their derivatives (generalized work behaviors), have often
been found optimal for such applications.
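The "aggregate up" point lends itself to a concrete illustration: specific-descriptor ratings can always be combined into broader dimension scores, but dimension-level data cannot be split back into task ratings that were never collected. A hypothetical sketch (the tasks, dimensions, and ratings are invented):

```python
from statistics import mean

# Hypothetical specific-level data: mean importance ratings for individual
# tasks, each mapped to a broader "generalized work behavior" dimension.
task_ratings = {
    "logs customer calls": 4.0,
    "resolves billing disputes": 5.0,
    "drafts status reports": 3.0,
    "maintains project schedule": 3.5,
}
task_to_dimension = {
    "logs customer calls": "customer interaction",
    "resolves billing disputes": "customer interaction",
    "drafts status reports": "administration",
    "maintains project schedule": "administration",
}

def aggregate_up(ratings, mapping):
    """Combine specific-descriptor ratings into broader dimension scores."""
    dims = {}
    for task, rating in ratings.items():
        dims.setdefault(mapping[task], []).append(rating)
    return {dim: mean(vals) for dim, vals in dims.items()}

print(aggregate_up(task_ratings, task_to_dimension))
# {'customer interaction': 4.5, 'administration': 3.25}
```

The reverse operation has no analogue: given only the dimension means (4.5 and 3.25), there is no way to recover the four task ratings, which is exactly why the text recommends collecting data at the most specific level any anticipated application may require.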
and attribute-oriented descriptors, suggesting that SMEs’ inferences about attribute requirements—
whether via direct or indirect estimation—are well grounded in their judgments of work activities
or job tasks. Although the specific processes and individual judgment policies through which such
ratings are made remain a matter for further research, these findings nonetheless provide reasonably
strong underpinnings for the various methods and techniques that have been developed and used to
make such linkages in practice.
thinking of work analysis in terms of its more broadly conceived consequential validity, because
this provides a framework for demonstrating and documenting to organizational leaders and stake-
holders its pivotal role in linking personnel selection (as well as many other HR practices) to critical
aspects of work performance (as revealed through work analysis). This value is further enhanced to
the degree that such practices can be shown to produce significant “returns on investment” (Becker,
Huselid, & Ulrich, 2001); for example, in the form of individual or organization-level performance
improvements.
organization’s vision, mission, strategy, structure, culture, and values—than are addressed in con-
ventional methods, which, if present at all (e.g., as they are in FJA, PAQ, and O*NET), tend to be
limited to those associated with specific jobs and work settings (work schedule, working conditions,
environmental hazards) rather than the broader organizational and external context (business strat-
egy, competitive environment, market conditions). Absent such measures, we may fail to detect the
need for potentially important or useful worker attributes and potentially critical selection proce-
dure validation criteria. For example, it is plausible that areas where noncognitive attributes (such
as personality traits, attitudes, and values) might have their greatest predictive value remain largely
unexplored because of our historical focus on more conventional (job- and performance-oriented)
criteria that ignore the broader organizational context of work. Models for explaining and under-
standing this are only just beginning to emerge (Tett & Burnett, 2003). We would go so far as to
argue that the definition, measurement, and mapping of the work environment—in effect, creating
a “common language” of work context—at multiple levels of analysis is the next major frontier in
work analysis. This is not a small challenge. It is a problem of long standing in psychology as a
whole (Frederiksen, 1972) and continues to be one of acute importance in I-O psychology today
(Johns, 2006).
similarly recommended for selection-oriented work analysis (Goldstein, 1997). This would provide
the critical context to facilitate more specific work analytic efforts, thereby also facilitating the
direct generation of worker KSAOs related to broader organizational criteria, strategies, and goals.
Such an approach could in turn provide a framework from which other, more broadly conceived,
selection-related applications of work analysis (in addition to the four outlined earlier) might be
explored and capitalized on. For example, one such application could be the provision (via ads, real-
istic job previews, or other recruiting tools) of customized information to applicants about various
aspects of work and worker requirements (e.g., context factors, career ladders and lattices based on
job interrelationships and skill or knowledge transferability) that are potentially related to applicant
attraction and subsequent organizational commitment. Another example is collecting work analy-
sis data on contextual and other factors relevant to selection for nontraditional or nonperformance
criteria, such as successful post-hire assimilation and socialization, or different levels of employee
“fit” (person-job, person-role, person-team, person-organization; Higgs, Papper, & Carr, 2000). Yet
another example is using work analysis questionnaires more in the mode of an organizational survey
(i.e., as an ongoing or regularly recurring intervention) rather than exclusively as a “one-and-done”
tool for work profiling; this could provide a measure of work content and work context stability/
volatility and offer insights into the nature of such content or context changes and their potential
implications for selection-related worker attributes.
All of this suggests a potentially useful reconceptualization of work analysis as organizational
strategy—that is, as a strategic tool—and hence characterized by a strong organization develop-
ment (OD) component, as has been suggested by Schippmann (1999) and Higgs et al. (2000). It
also suggests the need to bridge a historical and professional disconnect between those who have
tended to view jobs and work from a traditional, “micro” perspective (e.g., personnel and industrial
psychologists, training and education professionals, cognitive psychologists, occupational analysts,
industrial engineers, and human factors specialists) and those who have tended to look at work
from a more “macro” perspective (e.g., labor economists; sociologists; business and management
consultants; demographers; ethnologists; and clinical, social, and organizational psychologists). In
our view, the work analysis enterprise would be better served by an integration of these perspec-
tives, facilitated by much more interdisciplinary work among such professionals than historically
has been the case, as some have called for (Barney, 2000; Cunningham, 2000). Such a reframing
and associated changes in work analysis practice—and practitioners—underlie what we believe to
be a broader and potentially more useful concept of “strategic work analysis” (SWA) as a systematic
effort to identify or define current or anticipated work or worker requirements that are strategically
aligned with an organization’s mission and goals. This would subsume some other related terms and
practices in current use, such as future-oriented job analysis, strategic job analysis (which is some-
times used as a synonym for the prior term—and sometimes not), strategic job (or work) modeling,
and competency modeling, which, given its substantial impact on contemporary work analysis dis-
cussion and practice, we now consider in greater depth.
Competency modeling (CM) is a form of work analysis whose use has become widespread since
about the mid-1980s. Its proponents (chiefly practitioners and consultants) view it as a fundamen-
tally new and different form of work analysis that holds the key to developing integrated HR systems
that are aligned with an organization’s strategy, culture, or values. Its critics (chiefly researchers
and more traditional work analysis professionals) have viewed it as simply an alternative label for
conventional work analysis and have expressed doubts and concerns about its technical rigor and
overall validity (Barrett & Callahan, 1997), concerns that have been borne out in empirical evalu-
ations (Lievens, Sanchez, & De Corte, 2004; Morgeson, Delaney-Klinger, Mayfield, Ferrara, &
Campion, 2004). Discussions of the subject are complicated by the fact that there is no professional
consensus regarding the meaning of either “competency” as a concept or “competency modeling”
as a methodology. There appear to be almost as many definitions of these terms as there are users
of them (Schippmann et al., 2000). This problem is compounded by the fact that most existing
definitions describe a complex and multifaceted concept, such as that offered by Spencer, McClelland,
and Spencer (1994). They define competency as a combination of motives, traits, self-concepts,
attitudes, values, content knowledge, or cognitive behavior skills and as any individual characteristic
that can be reliably measured or counted and that can be shown to differentiate superior from aver-
age performers. The difficulty with such definitions is that they lump together in a single construct
attributes representing vastly different domains, characteristics, and levels of analysis, which limits
its value conceptually (Clouseau-like, it means everything, therefore it means nothing) and practi-
cally (as a useful or measurable descriptor or unit of analysis in work analysis).
The typical output of a CM project is a set of worker attributes (competencies) believed to con-
tribute to an organization’s broad strategy and goals, culture, or values. As such, these attributes
are (at least in theory) applicable across the entire organization, or within large units or functional
areas, and thereby (in a project’s fullest application) able to serve as a common framework underly-
ing the various components of an integrated HR system, such as training and development, perfor-
mance management, compensation, and selection/promotion. Each competency is given a name or
label (and, in some cases, a definition) and is usually accompanied by a set of behavioral indicators
(BIs) that exemplify desirable (and, in some cases, undesirable, as well as “moderate”) behavioral
manifestations of the competency and thereby serve as the basis for measuring individuals’ stand-
ing on the competency. Multiple sets of BIs are often developed to address a given competency’s
manifestation across different job families (sales, engineering), across functional specialties within
a job family (account executive, technical consultant, customer service representative), or across
occupational levels within a single job. For example, a “systems thinking” competency (defined as
“making calculated decisions that take into account impact on other activities, units, and individu-
als”) might have different BIs for sales managers (“evaluates the impact on others before changing
work processes”) than for account team leaders (“helps staff understand how their function relates
to the overall organization”), as appropriate for these different roles.
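The CM output format described above maps naturally onto a simple data structure. A sketch using the chapter's "systems thinking" example (the class design itself is ours, for illustration only, not part of any CM method):

```python
from dataclasses import dataclass, field

@dataclass
class Competency:
    name: str
    definition: str
    # Behavioral indicators keyed by the role or job family whose
    # manifestation of the competency they exemplify.
    behavioral_indicators: dict[str, list[str]] = field(default_factory=dict)

systems_thinking = Competency(
    name="systems thinking",
    definition=("making calculated decisions that take into account impact "
                "on other activities, units, and individuals"),
    behavioral_indicators={
        "sales manager": [
            "evaluates the impact on others before changing work processes",
        ],
        "account team leader": [
            "helps staff understand how their function relates to the "
            "overall organization",
        ],
    },
)

# A competency model is then a set of such attributes shared across the
# organization, each made measurable through its role-specific BIs.
for role, bis in systems_thinking.behavioral_indicators.items():
    print(f"{role}: {bis[0]}")
```

Keying the BIs by role makes explicit that the competency itself is organization-wide while its behavioral manifestations are local, which is the structural feature that lets a single model underpin multiple HR applications.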
We believe there is a huge chasm between the CM ideal envisioned by its proponents and the
actual practices that, under the rubric of CM, produce the type of output described above. In our
experience, the majority of such practices fall into one of two categories. The first category involves
entirely traditional work analysis, of one form or another, which leads to the development of sets
or taxonomies of well-defined, work-related (but not strategy-, culture-, or values-related) person
attributes (KSAOs) and associated metrics (behavioral or numerical rating scales, tests, or other
instrumentation) that meet accepted professional standards for such work. Such activities are
labeled as CM, and the KSAOs called competencies, to satisfy explicit or implicit requirements
of organizations or particular leaders. The second category purports to derive attributes related
to organizational strategy, culture, and values but entails the use of poorly conceived, incomplete,
or otherwise inadequate procedures (e.g., convenience samples engaged in unstructured group
discussions conducted without reference to individual or organizational work performance) that
lead to the development of ad hoc, idiosyncratic, ill-defined (or undefined) concepts or “folk
constructs”—ad hoc or “armchair” concepts or labels devised without reference to existing research
or theory—that are often little more than a “wish list” of desired worker attributes or purported
organizational values along with brainstormed (and typically unvetted and unvalidated) examples
(behavioral indicators) of good performance for each identified competency.
The above discussion is not meant to impugn the CM ideal of its proponents, but rather to high-
light the disconnect we perceive between this ideal and most contemporary CM practice, which
is either fairly rigorous but not explicitly “strategic” (the first category described above) or osten-
sibly strategic but not very rigorous, and hence ultimately unsuccessful in its strategic intent (the
second category described above). We would, in fact, classify this ideal as a third (although still
mostly unrealized) category of CM practice—one which combines the laudable goals of (a) linking
organizational strategy (and other organization-level variables and outcomes) to desired individual
employee attributes (an early model for which was presented by McLagan, 1988) and (b) utilizing
rigorous development methodologies of both conventional work analysis and other disciplines to
ensure the validity of these linkages, much as proposed by Schippmann (1999) and Barney (2000).
For example, the “traditional” Critical Incident Technique can be readily adapted to the genera-
tion of genuinely strategically driven competencies and associated BIs, requiring only a change
of frame of reference for incident generation from specific jobs to various organization-level vari-
ables, combined with the use of SMEs appropriate for this frame of reference. The unique aspect
of this category of CM practice is its explicit strategic organizational focus, without reference to
the work performed in any particular jobs. This is why we believe it is most appropriately regarded
as simply one particular form of the more broadly conceived SWA concept we proposed above,
and why (along with all of the conceptual and definitional ambiguities noted above) it has been
argued (Pearlman, 1997) that the terms “competency” and “competency modeling” be abandoned
altogether.
Viewing the CM ideal as one form of SWA (or, perhaps more precisely, as SWA with a particular
output format, as outlined above) clarifies and hence enhances our understanding of its potential
value to organizations. For example, CM proponents tout the utility of CM output for such purposes
as employee selection, promotion assessment, succession planning, and performance management
(including multisource feedback, via assembling and adapting BIs from an entire set of competen-
cies into a survey format). However, a less frequently recognized value of this particular SWA out-
put format, when rigorously developed, lies in the fact that it in effect functions as a set of relatively
concrete and meaningful performance standards for often abstract or complex constructs, as cus-
tomized for a given organization. As such, it serves a broader (and more subtle) purpose of influenc-
ing and (when built into performance management and feedback systems) incenting employees to
behave in ways that are consistent with the organization’s vision, mission, strategic objectives, and
values, especially to the degree that the competency model is widely communicated and promoted
by leaders throughout the organization, as is often the case. In other words, it serves as a strategic
tool to build and reinforce an organization’s culture.
To the degree that this is regarded as a major purpose of a given CM application, it arguably
should not be evaluated in terms of the same criteria as conventional work analysis methods used for
conventional HR applications. For example, it can be argued that, to the degree they achieve the goal
of influencing people to behave in strategically aligned ways within a given organizational context,
competency models that are idiosyncratic to particular organizations (as opposed to taxonomically
or theoretically driven) are of legitimate value (Becker et al., 2001), as would be consistent with the
OD, rather than psychometric, orientation of such applications—an assertion that gives new mean-
ing to the (tongue-in-cheek) adage, “It works great in practice, but it’ll never work in theory!” This
is not an argument for poor methodology or sloppy development work. Obviously, competencies
that are so nebulous or behaviorally “empty” as to defy organizationally meaningful definition and
behavioral referents will scarcely lend themselves to objective measurement or be of any organiza-
tional utility. As we noted above, this is where conventional work analysis methods, appropriately
adapted, can play a significant role by bringing methodological rigor to the development process.
The major need going forward, as we see it, is for creative thought and research addressing such
potential adaptations (such as the beginning efforts of Lievens et al., 2004, and Lievens & Sanchez,
2007), as well as the development of new data collection methods and approaches, to support all
varieties of SWA.
such as selection and staffing (especially validity evidence extension applications, because it is
ideally suited for cross-job comparison), training and career development, performance manage-
ment, and workforce planning. At a broader level it could provide the means for tracking trends and
changes in work content and occupational structure across the economy; for assessing and addressing
national "skill gaps" and skill transferability and occupational portability issues; and for studying
selection and talent allocation issues at a national level. From a scientific standpoint, it would
constitute a critical research tool for advancing theory development regarding work performance
and the structure of work or occupations—in effect, moving us closer to a “unified theory of work”
(Vaughn & Bennett, 2002).
This notion of “a complete, universally applicable information system for human resources allo-
cation” (Peterson & Bownas, 1982, p. 49) based on taxonomic information about work, work envi-
ronments, and human attributes—a “common language” of people and jobs—has long been viewed
as something of a “holy grail,” enticing work analysis researchers and practitioners for the better
part of 80 years. Such a system, when fully realized, would have underpinnings of structure (mean-
ing logical interrelationships among both descriptor categories and specific elements within those
categories) and standardization (meaning common definitions, rules, and metrics) that thereby pro-
mote common understanding and usage of system elements among all users and stakeholders. This
was the driving vision behind the U.S. Labor Department’s Occupational Research Program of the
1930s and its development of the DOT, and was reflected to varying degrees in such later systems
as FJA, the PAQ, ARS, MOSAIC, and O*NET. In our view, no single system has as yet been able
to fully realize this vision, although O*NET probably comes the closest in terms of its scope and
analytic capabilities.
Despite the simplicity and elegance of the concept, the practical realization of such a system is
enormously complex. This led Higgs et al. (2000) to conclude that “most systems like this … have
held great conceptual promise but … have eventually died of their own administrative weight and
expense” (p. 108). Many complex choices and decisions must be made in the conception, design,
and implementation of such a system, depending on the applications or objectives at issue, such
as (a) descriptor coverage—how many and which work- and worker-oriented attribute domains
will be included in the system—and the associated question of whether the common framework
will be operationalized as a single set of a relatively limited number of descriptor elements rep-
resenting a single level of description, or as multiple descriptor sets or taxonomies representing
multiple attribute domains and levels of descriptions; (b) descriptor level of analysis (the breadth
or narrowness of descriptor definition, as well as whether to allow multiple levels of analysis
via the use of hierarchical descriptor element taxonomies); (c) whether descriptor coverage will
apply (or will be designed so as to allow or promote application) to work, to workers, or to both;
(d) whether individual jobs will be described exclusively in terms of descriptor sets that are used
across all jobs in the system or will also include some types of job-specific information (such as
tasks or tools/technology); (e) the characteristics of the metrics or scales by which descriptors will
be quantified; (f) the policy and deployment questions of how much and which parts (descriptors)
of a common framework will be required for use by all organizational units and which parts, if
any, can be user-specific, which speaks to the critical issue of gaining the support and cooperation
of multiple users and stakeholders, without which the system is unlikely to succeed; (g) devising
efficient and effective procedures for ongoing data collection; and (h) devising procedures for
maintaining and updating the system’s data structure, which itself involves numerous technical
and practical challenges (e.g., the dilemma of changing or incorporating new data elements to
respond to changed needs or realities while maintaining comparability and continuity with the
prior data structure).
Despite this wide range of options, decisions, and challenges, we believe that the vision of such
a system continues to be both worthy and viable, if approached in manageable steps, segments, or
prototypes, on the basis of sound professional judgment, and undertaken with broad and high-level
organizational support.
94 Handbook of Employee Selection
REFERENCES
Anderson, N., & Ostroff, C. (1997). Selection as socialization. In N. Anderson & P. Herriot (Eds.), International
handbook of selection and assessment (pp. 413–440). New York, NY: Wiley.
Arvey, R. D., Salas, E., & Gialluca, K. A. (1992). Using task inventories to forecast skills and abilities. Human
Performance, 5, 171–190.
Baranowski, L. E., & Anderson, L. E. (2005). Examining rating source variation in work behavior to KSA link-
ages. Personnel Psychology, 58, 1041–1054.
Barney, M. (2000). Interdisciplinary contributions to strategic work modeling. Ergometrika, 1, 24–37.
Barrett, G. V., & Callahan, C. M. (1997, April). Competencies: The Madison Avenue approach to professional
practice. In R. C. Page (Chair), Competency models: What are they and do they work? Practitioner forum
conducted at the meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
Bartram, D. (2005). The great eight competencies: A criterion-centric approach to validation. Journal of Applied
Psychology, 90, 1185–1203.
Becker, B. E., Huselid, M. A., & Ulrich, D. (2001). The HR scorecard: Linking people, strategy, and perfor-
mance. Boston: Harvard Business School Press.
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual
performance. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations
(pp. 71–98). San Francisco, CA: Jossey-Bass.
Boselie, P., Dietz, G., & Boon, C. (2005). Commonalities and contradictions in HRM and performance research.
Human Resource Management Journal, 15, 67–94.
Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job and work analysis: Methods, research, and appli-
cations for human resource management. Los Angeles, CA: Sage.
Bridges, W. (1994). Job shift: How to prosper in a world without jobs. Reading, MA: Addison-Wesley.
Cascio, W. F., & Young, C. E. (2005). Work-family balance: Does the market reward firms that respect it? In
D. F. Halpern & S. E. Murphy (Eds.), From work-family balance to work-family interaction: Changing
the metaphor (pp. 49–63). Mahwah, NJ: Lawrence Erlbaum.
Work Analysis 95
Christal, R. E., & Weissmuller, J. J. (1988). Job-task inventory analysis. In S. Gael (Ed.), The job analysis hand-
book for business, industry, and government (Vol. II, pp. 1036–1050). New York, NY: Wiley.
Coates, J. F., Jarratt, J., & Mahaffie, J. B. (1990). Future work: Seven critical forces reshaping work and the
work force in North America. San Francisco, CA: Jossey-Bass.
Cunningham, J. W. (2000). Introduction to a new journal. Ergometrika, 1, 1–23.
Dawis, R. V. (1991). Vocational interests, values, and preferences. In M. D. Dunnette & L. M. Hough (Eds.),
Handbook of industrial and organizational psychology (2nd ed., Vol. 2, pp. 833–872). Palo Alto, CA:
Consulting Psychologists Press.
Dierdorff, E. C., & Morgeson, F. P. (2007). Consensus in work role requirements: The influence of discrete
occupational context on role expectations. Journal of Applied Psychology, 92, 1228–1241.
Droege, R. C. (1988). Department of Labor job analysis methodology. In S. Gael (Ed.), The job analysis hand-
book for business, industry, and government (Vol. II, pp. 993–1018). New York, NY: Wiley.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department
of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166),
38295–38309.
Fine, S. A., & Wiley, W. W. (1971). An introduction to Functional Job Analysis: A scaling of selected tasks.
Kalamazoo, MI: Upjohn Institute.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Fleishman, E. A., & Quaintance, M. K. (1984). Taxonomies of human performance. Orlando, FL: Academic
Press.
Fogli, L., & Whitney, K. (1998). Assessing and changing managers for new organizational roles. In R. Jeanneret &
R. Silzer (Eds.), Individual psychological assessment (pp. 285–329). San Francisco, CA: Jossey-Bass.
Frederiksen, N. (1972). Toward a taxonomy of situations. American Psychologist, 27, 114–123.
Gael, S. (Ed.). (1988). The job analysis handbook for business, industry, and government. New York, NY: Wiley.
Gatewood, R. D., Feild, H. S., & Barrick, M. (2008). Human resource selection (6th ed.). Mason, OH: Thomson/
South-Western.
Gibson, C. B., Porath, C. L., Benson, G. S., & Lawler, E. E., III. (2007). What results when firms implement
practices: The differential relationship between specific practices, firm financial performance, customer
service, and quality. Journal of Applied Psychology, 92, 1467–1480.
Goffin, R. D., & Woycheshin, D. E. (2006). An empirical method of determining employee competencies/
KSAOs from task-based job analysis. Military Psychology, 18, 121–130.
Goldstein, I. L. (1997). Interrelationships between the foundations for selection and training systems. In
N. Anderson & P. Herriot (Eds.), International handbook of selection and assessment (pp. 529–541).
New York, NY: Wiley.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process.
In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 3–34). San Francisco,
CA: Jossey-Bass.
Guest, D., Conway, N., & Dewe, P. (2004). Using sequential tree analysis to search for “bundles” of HR prac-
tices. Human Resource Management Journal, 14, 79–96.
Guthrie, J. (2001). High-involvement work practices, turnover, and productivity: Evidence from New Zealand.
Academy of Management Journal, 44, 180–192.
Hackman, J. R., & Oldham, G. R. (1976). Motivation through the design of work: Test of a theory. Organizational
Behavior and Human Performance, 16, 250–279.
Hanson, M. A., Borman, W. C., Kubisiak, U. C., & Sager, C. E. (1999). Cross-domain analysis. In N. G.
Peterson, M. D. Mumford, W. C. Borman, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational
information system for the 21st century: The development of O*NET (pp. 247–258). Washington, DC:
American Psychological Association.
Harvey, R. J. (1986). Quantitative approaches to job classification: A review and critique. Personnel Psychology,
39, 267–289.
Harvey, R. J., & Wilson, M. A. (2000). Yes Virginia, there is an objective reality in job analysis. Journal of
Organizational Behavior, 21, 829–854.
Higgs, A. C., Papper, E. M., & Carr, L. S. (2000). Integrating selection with other organizational processes and
systems. In J. F. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies
(pp. 73–122). San Francisco, CA: Jossey-Bass.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail
(Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence
(pp. 181–229). San Francisco, CA: Jossey-Bass.
Holland, J. L. (1973). Making vocational choices: A theory of careers. Englewood Cliffs, NJ: Prentice-Hall.
Huselid, M. (1995). The impact of human resource management practices on turnover, productivity, and corpo-
rate performance. Academy of Management Journal, 38, 635–672.
Johns, G. (2006). The essential impact of context on organizational behavior. Academy of Management Review,
31, 386–408.
Johnston, W. B., & Packer, A. E. (1987). Workforce 2000. Indianapolis, IN: Hudson Institute.
Jones, R. G., Stevens, M. J., & Fischer, D. L. (2000). Selection in team contexts. In J. F. Kehoe (Ed.), Managing
selection in changing organizations: Human resource strategies (pp. 210–241). San Francisco, CA:
Jossey-Bass.
Kanfer, R., & Kantrowitz, T. M. (2002). Emotional regulation: Command and control of emotion in work
life. In R. G. Lord, R. J. Klimoski, & R. Kanfer (Eds.), Emotions in the workplace (pp. 433–472). San
Francisco, CA: Jossey-Bass.
Landy, F. J. (1988). Selection procedure development and usage. In S. Gael (Ed.), The job analysis handbook
for business, industry, and government (Vol. I, pp. 271–287). New York, NY: Wiley.
Landy, F. J. (2007). The validation of personnel decisions in the twenty-first century: Back to the future. In
S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity
evidence (pp. 409–426). San Francisco, CA: Jossey-Bass.
Landy, F. J., & Conte, J. M. (2007). Work in the 21st century (2nd ed.). Boston, MA: Blackwell.
Lawler, E. E., III, Mohrman, S. A., & Benson, G. (2001). Organizing for high performance: Employee involve-
ment, TQM, reengineering, and knowledge management in the Fortune 1000. San Francisco, CA: Jossey
Bass.
Levine, E. L., & Sanchez, J. I. (2007). Evaluating work analysis in the 21st century. Ergometrika, 4, 1–11.
Lievens, F., & Sanchez, J. I. (2007). Can training improve the quality of inferences made by raters in compe-
tency modeling? A quasi-experiment. Journal of Applied Psychology, 92, 812–819.
Lievens, F., Sanchez, J. I., & De Corte, W. (2004). Easing the inferential leap in competency model-
ing: The effects of task-related information and subject matter expertise. Personnel Psychology, 57,
881–904.
Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A compari-
son of the CVI, T, rWG(J), and r*WG(J) indexes. Journal of Applied Psychology, 84, 640–647.
MacDuffie, J. P. (1995). Human resource bundles and manufacturing performance: Organizational logic and
flexible production systems in the world auto industry. Industrial and Labor Relations Review, 48,
197–221.
McCormick, E. J. (1976). Job and task analysis. In M. D. Dunnette (Ed.), Handbook of industrial and organi-
zational psychology (pp. 651–696). Chicago, IL: Rand McNally.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. M. (1972). A study of job characteristics and job dimen-
sions as based on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology, 56,
347–368.
McLagan, P. A. (1988). Flexible job models: A productivity strategy for the information age. In J. P. Campbell,
R. J. Campbell, & Associates (Eds.), Productivity in organizations (pp. 369–387). San Francisco, CA:
Jossey-Bass.
McLarty, J. R., & Vansickle, T. R. (1997). Assessing employability skills: The Work Keys™ super-system. In
H. F. O’Neil, Jr. (Ed.), Workforce readiness: Competencies and assessment (pp. 293–325). Mahwah, NJ:
Lawrence Erlbaum.
McPhail, S. M. (Ed.). (2007). Alternative validation strategies: Developing new and leveraging existing valid-
ity evidence. San Francisco, CA: Jossey-Bass.
Mitchell, J. L., & Driskill, W. E. (1996). Military job analysis: A historical perspective. Military Psychology,
8, 119–142.
Morgeson, F. P., & Campion, M. A. (1997). Social and cognitive sources of potential inaccuracy in job analysis.
Journal of Applied Psychology, 82, 627–655.
Morgeson, F. P., & Campion, M. A. (2000). Accuracy in job analysis: Toward an inference-based model.
Journal of Organizational Behavior, 21, 819–827.
Morgeson, F. P., Delaney-Klinger, K., Mayfield, M. S., Ferrara, P., & Campion, M. A. (2004). Self-presentation
processes in job analysis: A field experiment investigating inflation in abilities, tasks, and competencies.
Journal of Applied Psychology, 89, 674–686.
Morris, J. A., & Feldman, D. C. (1996). The dimensions, antecedents, and consequences of emotional labor.
Academy of Management Review, 21, 986–1010.
Offerman, L. R., & Gowing, M. K. (1993). Personnel selection in the future: The impact of changing demo-
graphics and the nature of work. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection
in organizations (pp. 385–417). San Francisco, CA: Jossey-Bass.
Pearlman, K. (1980). Job families: A review and discussion of their implications for personnel selection.
Psychological Bulletin, 87, 1–28.
Pearlman, K. (1997, April). Competencies: Issues in their application. In R. C. Page (Chair), Competency
models: What are they and do they work? Practitioner forum conducted at the meeting of the Society for
Industrial and Organizational Psychology, St. Louis, MO.
Pearlman, K., & Barney, M. F. (2000). Selection for a changing workplace. In J. F. Kehoe (Ed.), Managing selec-
tion in changing organizations: Human resource strategies (pp. 3–72). San Francisco, CA: Jossey-Bass.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job
proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373–406.
Peterson, N. G., & Bownas, D. A. (1982). Skill, task structure, and performance acquisition. In M. D. Dunnette
& E. A. Fleishman (Eds.), Human performance and productivity: Vol. 1. Human capability assessment
(pp. 49–105). Hillsdale, NJ: Lawrence Erlbaum.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (Eds.). (1999). An
occupational information system for the 21st century: The development of O*NET. Washington, DC:
American Psychological Association.
Ployhart, R. E., Schneider, B., & Schmitt, N. (2006). Staffing organizations: Contemporary theory and prac-
tice. Mahwah, NJ: Lawrence Erlbaum.
Primoff, E. S. (1975). How to prepare and conduct job-element examinations (U.S. Civil Service Commission
Technical Study 75-1). Washington, DC: U.S. Government Printing Office.
Primoff, E. S., & Fine, S. A. (1988). A history of job analysis. In S. Gael (Ed.), The job analysis handbook for
business, industry, and government (Vol. I, pp. 14–29). New York, NY: Wiley.
Raymark, P. H., Schmit, M. J., & Guion, R. M. (1997). Identifying potentially useful personality constructs for
employee selection. Personnel Psychology, 50, 723–736.
Rodriguez, D., Patel, R., Bright, A., Gregory, D., & Gowing, M. K. (2002). Developing competency models to
promote integrated human-resource practices. Human Resource Management, 41, 309–324.
Sackett, P. R. (2003). Exploring strategies for clustering military occupations. In A. K. Wigdor & B. F. Green
(Eds.), Performance assessment for the workplace (Vol. II, pp. 305–332). Washington, DC: National
Academy Press.
Sanchez, J. I. (1994). From documentation to innovation: Reshaping job analysis to meet emerging business
needs. Human Resource Management Review, 4, 51–74.
Sanchez, J. I. (2000). Adapting work analysis to a fast-paced and electronic business world. International
Journal of Selection and Assessment, 8, 204–212.
Sanchez, J. I., & Fraser, S. L. (1994). An empirical procedure to identify job duty-skill linkages in managerial
jobs: A case example. Journal of Business and Psychology, 8, 309–326.
Sanchez, J. I., & Levine, E. L. (2000). Accuracy or consequential validity: Which is the better standard for job
analysis data? Journal of Organizational Behavior, 21, 809–818.
Schippmann, J. S. (1999). Strategic job modeling: Working at the core of integrated human resources. Mahwah,
NJ: Lawrence Erlbaum.
Schippmann, J. S., Ash, R. A., Battista, M., Carr, L., Eyde, L. D., Hesketh, B., et al. (2000). The practice of
competency modeling. Personnel Psychology, 53, 703–740.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences as moderators of aptitude test validity in
selection: A red herring. Journal of Applied Psychology, 66, 166–185.
Schraagen, J. M., Chipman, S. F., & Shalin, V. L. (Eds.). (2000). Cognitive task analysis. Mahwah, NJ:
Lawrence Erlbaum.
Secretary’s Commission on Achieving Necessary Skills. (1992, April). Learning a living: A blueprint for high
performance: A SCANS report for America 2000. Washington, DC: U.S. Department of Labor.
Snow, C. C., & Snell, S. A. (1993). Staffing as strategy. In N. Schmitt, W. C. Borman, & Associates (Eds.),
Personnel selection in organizations (pp. 448–478). San Francisco, CA: Jossey-Bass.
Spencer, L. M., McClelland, D. C., & Spencer, S. (1994). Competency assessment methods: History and state
of the art. Boston: Hay-McBer Research Press.
Spreitzer, G. M. (1995). Psychological empowerment in the work place: Construct definition, measurement,
and validation. Academy of Management Journal, 38, 1442–1465.
Staw, B. M., & Epstein, L. (2000). What bandwagons bring: Effects of popular management techniques on
corporate performance, reputation, and CEO pay. Administrative Science Quarterly, 45, 523–559.
Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal
of Applied Psychology, 88, 500–517.
U.S. Department of Education, National Center for Education Statistics. (2002). Classification of instructional
programs: 2000 edition. Washington, DC: Author.
U.S. Department of Labor. (1965a). Dictionary of Occupational Titles (3rd ed., Vol. 1). Washington, DC: U.S.
Government Printing Office.
U.S. Department of Labor. (1965b). Dictionary of Occupational Titles (3rd ed., Vol. 2). Washington, DC: U.S.
Government Printing Office.
Vaughn, D. S., & Bennett, W. Jr. (2002, November). Toward a unified theory of work. Organizational simula-
tions and policy analyses (AFRL-HE-AZ-TP-2002-0014). Mesa, AZ: Air Force Research Laboratory.
Williams, A. P. O., & Dobson, P. (1997). Personnel selection and corporate strategy. In N. Anderson & P. Herriot
(Eds.), International handbook of selection and assessment (pp. 219–245). New York, NY: Wiley.
Wilson, M. A. (1997). The validity of task coverage ratings by incumbents and supervisors. Journal of Business
and Psychology, 12, 85–95.
Wilson, M. A. (2007). A history of job analysis. In L. Koppes (Ed.), Historical perspectives in industrial and
organizational psychology (pp. 219–241). Mahwah, NJ: Lawrence Erlbaum.
5 Current Concepts of Validity, Validation, and Generalizability
In this chapter, we consider how psychologists’ understanding of the construct “validity,” and of the
links between that construct and the procedures used to investigate and establish it, has evolved over
the last 50 years (Angoff, 1988; Langenfeld & Crocker, 1994). At one
time, psychologists spoke of different types of validity and advocated several seemingly distinct
strategies for defining and estimating the validity of tests (American Psychological Association
et al., 1954; Guion, 1980; Landy, 1986). In particular, it was assumed that there were substantial and
important distinctions between content, construct, and criterion-related validity, and this assump-
tion was incorporated into the guidelines that still regulate the enforcement of equal employment
law in the United States (Guion, 1980; Kleiman & Faley, 1985; Uniform Guidelines, 1978). Other
variants, including synthetic validity (Lawshe, 1952) and convergent and discriminant validity
(Campbell & Fiske, 1959) were introduced, adding to the complexity of the debate about how to
define and understand validity.
Since the late 1980s there has been clear and steady evolution in our understanding of valid-
ity. First, it is now widely accepted that there are not many different types of validity; rather,
validity is now recognized as a unitary concept (American Educational Research Association
[AERA], American Psychological Association [APA], & National Council on Measurement in
Education [NCME], 1985). It is also widely accepted that validity is a property of the infer-
ences or generalizations made on the basis of test scores, not a property of the tests themselves
(e.g., Lawshe, 1985). That is, it does not make sense to label a test “valid” or “not valid.” The
inferences made about people on the basis of test scores (e.g., Sam received a high score on a
conscientiousness scale, therefore Sam is probably dependable and reliable) might or might not
be valid, and validation is a process of determining whether or not these generalizations about
individuals are correct and meaningful. There are various strategies for investigating validity,
some of which concentrate on content-based evidence, some of which concentrate on the rela-
tionships between test scores and other criteria, but the goal of all methods of validation is to
determine whether particular conclusions one draws about individuals on the basis of test scores
are valid.
In this chapter, we will make five key points in discussing the meaning of the construct “validity”
and its relationship with the procedures commonly used for validating tests and assessments:
1. Virtually any investigation into the meaning and implications of test scores can be thought
of as validation.
2. Topics such as reliability, scaling, and factor structure represent specialized niches within
the broader umbrella of validity.
3. The key concern of virtually all methods of validation is with testing the inferences and
generalizations made on the basis of test scores.
4. The validity of tests and assessments cannot be divorced from the uses and applications of
those tests.
5. Decisions about applied selection programs rely on a wide range of generalizations about
specific facets of the validity of the selection process.
TABLE 5.1
Trends in the Development of Factor Analysis, Reliability, and Validity

1900–1930
  Factor analysis: Spearman introduces factor analysis; multiple factor analysis developed
  Reliability: Correction for attenuation; Spearman-Brown formula; classic true score model
  Validity: Limited consensus about definition or estimation of validity

1930–1960
  Factor analysis: Oblique rotation; hierarchical factor analysis
  Reliability: Coefficient alpha
  Validity: Technical Recommendations for Psychological Tests and Diagnostic Techniques (1954); predictive, concurrent, content, and construct distinctions

1960–1990
  Factor analysis: Procrustes rotation; confirmatory factor analysis; structural equation modeling
  Reliability: Item response theory; generalizability theory
  Validity: Standards for Educational and Psychological Tests and Manuals (1966, 1985); validity defined in terms of appropriateness of inferences from test scores

1990–present
  Factor analysis: Widespread availability and use of SEM
  Reliability: Reliability estimation via SEM
  Validity: Standards for Educational and Psychological Testing (1999); construct validity as an umbrella
There was no real agreement about how validity should be assessed or about the standards of evidence
for defining validity. In 1954, a joint committee of the APA, the AERA, and the National Council
on Measurements Used in Education released their Technical Recommendations for Psychological
Tests and Diagnostic Techniques (APA, 1954). This was the first of several sets of testing standards
released over the next half-century, and it served to define many of the key terms in psychological
measurement.
The 1954 Technical Recommendations codified a view of validity that sorted validity into several
distinct types: criterion-related validity, content validity, and construct validity. This view of validity
is still central to equal employment law in the United States (Guion, 1980; Kleiman & Faley,
1985; Landy, 1986; Uniform Guidelines, 1978) but is no longer accepted as useful by researchers.
Subsequent testing standards explicitly rejected the idea of different types of validity and advocated
the idea that validity is a unitary concept (Standards for Educational and Psychological Testing,
1985, 1999), and that validity is a property of the inferences made on the basis of test scores, not a
property of the tests themselves (e.g., Lawshe, 1985).
The current consensus suggests that there are various strategies for investigating validity, some
of which concentrate on content-based evidence, some of which concentrate on the relationships
between test scores and other criteria, but the goal of all methods of validation is to determine
whether particular conclusions one draws about individuals on the basis of test scores are valid. The
overarching concept that is increasingly used to describe and refer to all types of validation research
is “construct validation” (Cronbach, 1988, 1989; Cronbach & Meehl, 1955). As with factor analysis
and reliability, thinking about validity has moved in a very clear direction: toward a recognition
of the fundamental links among virtually all methods of collecting, analyzing, and interpreting
data about the meaning of test scores (McDonald, 1999), all of which therefore represent
investigations of construct validity.
There has been a growing recognition that analyses that would once have been thought of as
addressing quite different questions (e.g., factor analysis, assessment of reliability) are better thought
of as parts of the general process of validation. One implication is that topics that have been tradi-
tionally thought of as quite distinct by personnel psychologists may need to be reconsidered from
the perspective of a more unified view of psychometrics. This trend is most evident when consider-
ing the question of whether there is any useful difference between reliability and validity.
In this view, reliability and generalizability theory can and should be subsumed under the broad
heading of construct validity.
The aim of validity is to understand the meaning of test scores. One way to do this is to develop
and test structural models of test scores, identifying the sources of variance in test scores. When you
say that a test shows a high level of construct validity, this is the same as saying that a large portion
of the variance in test scores is due to the construct the test is designed to measure. Measurement
error, in the classical sense, is one of many things that can cause test scores to imperfectly repre-
sent constructs. Generalizability theory (Brennan, 1983; Cronbach et al., 1972) suggests that there
is often variation in test scores that is due to things other than random measurement error or the
construct being measured (e.g., systematic variation attributable to measurement conditions, rater
differences, etc.); this theory lays out several methods of estimating these various effects.
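The distinction drawn here between classical measurement error and generalizability theory’s richer decomposition can be summarized with the standard textbook formulations (these are conventional formulas, not equations given in this chapter):

```latex
% Classical test theory: observed score = true score + random error
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E,
\qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

% Generalizability theory (one-facet persons-by-raters design):
% observed-score variance is partitioned into person, rater,
% and residual (interaction plus error) components
\sigma^2_X = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}
```

In classical test theory, all nonconstruct variance is lumped into \(\sigma^2_E\); generalizability theory names and estimates the systematic pieces (here, rater effects) separately.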
Reliability assessment represents a very special, and possibly very small, niche within the larger
domain of construct validity. Classical test theory (CTT) deals with random measurement error,
which can be dealt with in relatively straightforward ways if the extent of error is known. When
the assumptions of CTT are met (e.g., observed scores can be sensibly divided into true scores and
random measurement errors), reliability theory can provide elegant and powerful approaches for
interpreting data, but the idea that one can or should investigate reliability without at the same time
thinking about validity is not a sensible one (Murphy & Deshon, 2000b). Several recent studies have
discussed or illustrated ways of linking assessments of reliability with assessments of construct
validity (Kraiger & Teachout, 1990; Murphy & DeShon, 2000a).
The goal of virtually all psychometric investigations is the same—to make sense of test scores. The
methods of evaluating reliability described here can all contribute to that goal. Knowing whether or
not test scores are stable over time (test-retest reliability), whether different evaluators reach similar
conclusions (inter-rater reliability), or whether the different items on a test hang together (coefficient
alpha) can help in understanding what tests mean. It is not sensible, and indeed it is not clear that it
is possible, to investigate reliability without learning about validity or to investigate validity without
learning about reliability (McDonald, 1999; Murphy & DeShon, 2000b; Nunnally, 1975).
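As an illustration of one of these indices, coefficient alpha can be computed directly from item-level scores. A minimal sketch in Python; the function name and example data are ours, not from this chapter:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Coefficient alpha from item-level scores.

    `items` is a list of k lists, one per test item; items[i][j] is
    person j's score on item i.  Alpha rises as the items covary,
    i.e., as the items "hang together" (internal consistency).
    """
    k = len(items)
    n = len(items[0])
    # Sum of the individual item variances (population variance)
    item_var_sum = sum(pvariance(item) for item in items)
    # Variance of each person's total score across items
    totals = [sum(item[j] for item in items) for j in range(n)]
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Four respondents, three items; responses are highly consistent
scores = [[1, 2, 3, 4],
          [1, 2, 3, 5],
          [2, 2, 3, 4]]
print(round(cronbach_alpha(scores), 3))  # → 0.962
```

When every item orders respondents identically, the total-score variance dominates the item variances and alpha approaches 1.0.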
In predictive designs, criterion measures are typically collected at a
later time (e.g., post-hire measures). In concurrent designs, test scores and criterion measures are
often obtained at roughly the same time. However, it is not the temporal relationship between tests
and criteria that distinguishes these two designs, but rather the question of whether test scores play
a direct or indirect role in defining the sample. In predictive designs, test scores are not used to
select job applicants. Job applicants might be selected at random or might be hired on the basis of
screening criteria that are uncorrelated with the test being validated; however, in a predictive design,
the sample of individuals who provide predictor and criterion scores is assumed to be representative
of the broader population of applicants that the tests being validated will eventually be used to
screen. In concurrent designs, test scores, or measures that are likely to be correlated with test
scores, are used to screen job applicants. For example, the most common concurrent design is one
in which the scores of job incumbents (who were selected on the basis of the tests being validated
or other selection processes) are correlated with performance measures. In theory, this should lead
to potentially serious levels of range restriction and should substantially depress validity estimates.
However, several reviews have shown that concurrent validity estimates are generally quite similar
to estimates obtained in predictive designs (Barrett, Phillips, & Alexander, 1981; Guion & Cranny,
1982; Nathan & Alexander, 1988; Schmitt, Gooding, Noe, & Kirsch, 1984). In part, this is because
severe range restriction is fairly rare, and in part it is because correction formulas ignore many of
the factors that affect the outcomes of real-world validity studies.
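The correction formulas at issue here are the standard ones for criterion unreliability (correction for attenuation) and for direct range restriction (Thorndike’s Case II). A sketch of both; the function names and example values are illustrative, not taken from this chapter:

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_yy):
    """Correct an observed validity for unreliability in the criterion:
    r_c = r_xy / sqrt(r_yy)."""
    return r_xy / sqrt(r_yy)

def correct_for_range_restriction(r, u):
    """Thorndike Case II correction for direct range restriction.

    `u` is the ratio of the unrestricted to the restricted predictor
    standard deviation (u > 1 when incumbents were preselected)."""
    return (r * u) / sqrt(1 - r**2 + (r**2) * (u**2))

# A hypothetical observed incumbent-sample validity of .30, with a
# criterion reliability of .64 and a predictor SD ratio of 1.5:
r = correct_for_range_restriction(0.30, 1.5)   # ≈ .427
print(round(correct_for_attenuation(r, 0.64), 3))  # → 0.533
```

Both corrections assume their artifacts operated in the textbook manner; as the text notes, real concurrent studies often violate those assumptions, which is one reason observed concurrent and predictive validities end up so similar.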
There have been hundreds, if not thousands, of studies examining the relationships between
personnel selection tests and measures of job performance, performance in training, and similar
criteria. One of the major tasks of personnel psychologists over the last 40 years has been to make
sense of this vast body of literature; meta-analysis and validity generalization methods have con-
tributed substantially to this task.
Validity Generalization
In a series of papers, Schmidt, Hunter, and their colleagues have argued that it is often possible to
generalize from existing research on test validity to draw conclusions about the validity of tests in
various settings. In particular, they have shown that it is often possible to generalize the findings of
existing validity research to new settings, thus providing a general solution to the same problem that
motivated the synthetic validity model (i.e., the problem of estimating validity in a particular context
without doing a new validity study in that context). Personnel psychologists had long assumed that
validity estimates were too low and that the results of validity studies were simply too inconsistent
from organization to organization or from job to job to permit this sort of generalization. Schmidt
and Hunter’s critical insight was that statistical artifacts, such as sampling error, range restriction,
and unreliability, had led personnel psychologists to underestimate the size and overestimate the
variability of validity coefficients.
Schmidt and Hunter (1977) developed a validity generalization model that estimates and corrects
for the effects of statistical artifacts (e.g., limited reliability and range restriction) on the distribu-
tions of test validity estimates. The Schmidt-Hunter validity generalization model suggests that for
almost every test, true validities are (a) substantially larger and (b) much less variable than psychol-
ogists have traditionally believed (Murphy, 2000, 2003; Salgado, Anderson, Moscoso, Bertua, &
de Fruyt, 2003; Schmidt, 1992; Schmidt & Hunter, 1977, 1980, 1998).
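The core of the Schmidt-Hunter argument can be shown in a "bare-bones" meta-analysis sketch that corrects only for sampling error (the full model also corrects for unreliability and range restriction via artifact distributions, which this sketch omits). The study correlations and sample sizes below are hypothetical.

```python
def validity_generalization(rs, ns):
    """Bare-bones Schmidt-Hunter meta-analysis: sample-size-weighted mean
    validity, observed variance, and the residual variance left after
    subtracting the variance expected from sampling error alone."""
    N = sum(ns)
    k = len(rs)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / N
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / N
    var_error = (1 - r_bar**2) ** 2 / (N / k - 1)  # expected sampling-error variance
    var_residual = max(var_obs - var_error, 0.0)
    return r_bar, var_obs, var_residual

rs = [0.18, 0.30, 0.22, 0.35, 0.25]   # observed validities from 5 studies
ns = [60, 120, 80, 150, 90]           # study sample sizes
r_bar, var_obs, var_res = validity_generalization(rs, ns)
# Here the residual variance is zero: all of the apparent variability in
# validity across these studies is consistent with sampling error alone.
```

This is precisely Schmidt and Hunter's insight in miniature: validities that look inconsistent across studies may be entirely consistent once sampling error is accounted for.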
There is a large body of research examining the criterion-related validity of different selection
tests, particularly written tests that measure general or specific facets of cognitive ability. Validity
generalization analyses of these studies suggest that measures of cognitive ability are positively cor-
related with measures of performance in virtually all jobs, and that the results are sufficiently con-
sistent across hundreds of validity studies (after correcting for the effects of sampling error, limited
reliability, etc.) to support the inference that ability tests are valid predictors of performance in vir-
tually any job (Hunter, 1986; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; Ree & Earles,
1992; Schmidt & Hunter, 1998; Schmidt, Ones, & Hunter, 1992). Broad validity generalization has
also been claimed for personality traits such as conscientiousness (Barrick & Mount, 1991).
104 Handbook of Employee Selection
Since the late 1970s several variations of the basic model first proposed by Schmidt and Hunter
(1977) have been proposed. Murphy’s (2003) edited volume on validity generalization includes
excellent chapters that review the history of validity generalization models (Schmidt, Landy), pro-
pose new analytic models on the basis of maximum likelihood estimation (Raju and Drasgow) and
Bayesian statistics (Brannick and Hall), and discuss the conceptual and analytic challenges in apply-
ing these models to draw inferences about the validity of selection tests (Bobko and Roth, Burke
and Landis). Bayesian models for validity generalization appear to have considerable potential for
addressing the problem of how to interpret and apply validity generalization results (Newman,
Jacobs, & Bartram, 2007).
Synthetic Validity
When Lawshe (1952) introduced the method of synthetic validity, he noted that jobs can be broken
down into several components, and these components (e.g., supervision, obtaining information
from gauges and devices) are shared across several settings. Jobs differ in terms of their particu-
lar mixes of components, but the general building blocks are quite similar across many jobs and
organizations. Lawshe realized that if one could obtain estimates of the validity of tests for each
of these components, it would be possible to combine these to produce an estimate of the valid-
ity of the test or tests for any job, regardless of its specific content (Peterson, Wise, Arabian, &
Hoffman, 2001).
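Lawshe's logic can be sketched as a simple calculation. This is a deliberately crude illustration with hypothetical component validities and job weights; a serious synthetic validity model would also account for intercorrelations among the tests and the components, which this weighted average ignores.

```python
def synthetic_validity(component_validities, component_weights):
    """Rough synthetic-validity sketch: combine per-component validity
    estimates in proportion to each component's weight in a target job."""
    total = sum(component_weights.values())
    return sum(component_validities[c] * w
               for c, w in component_weights.items()) / total

# Hypothetical catalog of component-level validities for one test
validities = {"supervision": 0.20, "reading_gauges": 0.35, "customer_contact": 0.15}

# A particular job weighted heavily toward monitoring equipment
job_profile = {"supervision": 1, "reading_gauges": 3, "customer_contact": 1}
est = synthetic_validity(validities, job_profile)  # weighted toward .35
```

A different job with a different mix of the same components would yield a different estimate from the same catalog, without a new local validity study.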
Implementing this method would certainly be a challenge: it requires describing jobs in terms of
a relatively universal taxonomy of components, establishing the validity of different classes of
tests for each component, and then building a validity estimate from what is known about the
validities of the tests for each component and the intercorrelations of the tests and the components.
Nevertheless, the method has clear conceptual appeal. However, despite more than 50 years of
research on synthetic validity, this method has not proved completely practical or successful (Steel,
Huffcutt, & Kammeyer-Mueller, 2006). Advances in meta-analysis have made this method more
feasible (Steel et al., 2006), but they have also suggested that there may be simpler solutions to the
problem of estimating validity in a job where no validity studies exist.
Consequential Validity
Messick (1988, 1989) suggested that we should consider the consequences of test use when evaluat-
ing validity. For example, the use of cognitive ability tests to make high-stakes decisions about indi-
viduals contributes to the racial segregation of some workplaces and schools (Gottfredson, 1986;
Herrnstein & Murray, 1994; Jensen, 1980). These tests appear to measure cognitive abilities reason-
ably well and they predict important criteria, but they also lead to very negative consequences for
some examinees. Should this unfortunate fact influence our evaluation of the validity of cognitive
ability tests? The answer probably depends on how narrowly or broadly one defines validity.
If validity is defined solely in terms of whether a test measures what it is supposed to measure
or predicts what it is supposed to predict, the consequences of test use might arguably be irrelevant
to validity, except where consequences provide evidence relevant to the intended meaning or use of
tests. If validity is defined in terms of an overall assessment of the value or the impact of a test, these
same consequences might be central to assessing validity.
One resolution to the question of whether or not the consequences of test use should factor
into evaluations of validity is to remember that tests are neither valid nor invalid; rather, they are
valid for some purposes and not for others. Thus, the validity of a test cannot be fully defined
without reference to the way the test will be used, and different perspectives on the purpose of
testing might lead to quite different assessments of validity. Thus, the same test might be seen as
valid for the purpose of maximizing expected performance while minimizing costs and invalid
for making decisions that do not unduly disadvantage members of particular racial or ethnic
groups.
Current Concepts of Validity, Validation, and Generalizability 105
CONTENT-BASED APPROACHES
All methods of validation share the same basic aim: to present evidence and theory that supports
particular inferences made about test scores. One common method of validation involves examina-
tion of the content of a test using expert judgment and/or empirical analyses to assess the relation-
ship between the content of a test and the content domain the test purports to measure or predict
(AERA, APA, & NCME, 1985). The assumption underlying this method of validation is that if the
test is a representative sample of the domain, people who receive high scores on the test are likely
to exhibit a high level of the attribute the test is designed to measure.
In the context of personnel selection, the term “content validity” has taken on a unique and
special meaning—using assessments of the content of a test to make inferences about the likelihood
that individuals who receive high scores on a selection test will also perform well on the job
(Binning, 2007; Guion, 1998; Murphy & Davidshofer, 2005; Uniform Guidelines on Employee
Selection Procedures, 1978). In principle, this inference could be tested directly by standard
methods of criterion-related validation (i.e., by correlating scores on these tests with measures of job
performance). Content-oriented methods provide a different approach, in which inferences about
the predictive value of test scores are established on the basis of a systematic comparison between
the content of a test and the content of a job. If a systematic match between the content of a test
and the content of a job can be established, the usual inference is that people who do well on the test
will also do well on the job.
There are several different approaches that are used to assess the match between test content and
job content. One approach laid out in the Uniform Guidelines on Employee Selection Procedures
(1978) is to argue that the test is a valid predictor of performance if it can be shown to be a repre-
sentative sample of the work domain. Section 14C(4) of the Guidelines states,
To demonstrate the content validity of a selection procedure, a user should show that the behavior(s)
demonstrated in the selection procedure are a representative sample of the behavior(s) of the job in
question or that the selection procedure provides a representative sample of the work product of the job.
Similarly, the third edition of the Principles for the Validation and Use of Personnel Selection
Procedures (Society for Industrial and Organizational Psychology [SIOP], 1987) stated that a test is
content valid if it is “… a representative sample of the tasks, behaviors, or knowledge drawn from
that [job] domain” (p. 19). Under this definition, content validity can be established if experts agree
that a test adequately samples the portions of the job domain that are critical to adequate perfor-
mance. This strategy is based on the inference that people who do well on a sample of the job will
also do well if selected into that job.
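Expert agreement of this kind is often quantified. One standard index, not discussed in the text above but consistent with Lawshe's approach, is the content validity ratio (CVR; Lawshe, 1975), sketched here with hypothetical panel data.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: ranges from -1 to +1, and is positive whenever more
    than half of the expert panel rates an item as essential."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 9 of 10 experts rate an item as essential
cvr_strong = content_validity_ratio(9, 10)   # 0.8
cvr_split = content_validity_ratio(5, 10)    # 0.0, an evenly split panel
```

Items with low or negative CVR values are candidates for removal on the grounds that experts do not agree they sample critical job content.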
An alternate strategy is to describe a test as content-valid if it measures knowledge, abilities, and/
or skills that are required for successful job performance. Section 14C(1) of the Uniform Guidelines
(1978) notes:
Selection procedures which purport to measure knowledge, skills, or abilities may in certain
circumstances be justified by content validity, although they may not be representative samples, if the
knowledge, skill, or ability measured by the selection procedure can be operationally defined … and if
that knowledge, skill, or ability is a necessary prerequisite to successful job performance.
Similarly, the Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 2003)
state “Evidence based on test content may include logical or empirical analyses that compare the
adequacy of the match between test content and work content, worker requirements, or outcomes
of the job” (p. 6) and “Evidence for validity based on content typically consists of a demonstration
of a strong linkage between the content of the selection procedure and important work behaviors,
activities, worker requirements, or outcomes on the job.” (p. 21)
Linkage methods represent a variation on the performance requirements approach described
above in which subject matter experts are asked to make judgments about overlaps in the
knowledge, skills, and abilities (KSAs) needed to do well on a test and those needed to do well
on a job. Goldstein and Zedeck (1996) argued that “… when there is congruence between the
KSAs required to perform on the job and the KSA required to perform on the testing instrument,
then it should be possible to make inferences about how the test scores relate to job behavior.”
(p. 28; see also Goldstein, Zedeck, & Schneider, 1993). This method involves (a) linking KSAs to
specific job elements (i.e., making a judgment that specific KSAs are required for or contribute to
performance of specific aspects of a job), (b) linking KSAs to test items or to subtests (these steps
are often done by independent groups of experts; Goldstein & Zedeck, 1996, refer to this step as
retranslation), and (c) assessing the communalities between these KSA lists. If the same KSAs
are judged to be required for performing well on the test and for performing well on the job, then
the inference is made that people who do well on the test will also do well on the job.
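One simple way to summarize the final step of such a linkage exercise is to compute the overlap between the two KSA lists. The KSA labels below are hypothetical; in practice, each link would itself rest on expert judgments rather than exact string matches.

```python
def ksa_overlap(test_ksas, job_ksas):
    """Fraction of job-required KSAs that experts also judged to be
    required by the test (a crude linkage summary statistic)."""
    return len(set(test_ksas) & set(job_ksas)) / len(set(job_ksas))

# Hypothetical retranslation results for a clerical job and a selection test
job = {"filing", "data_entry", "phone_etiquette", "scheduling"}
test = {"filing", "data_entry", "scheduling", "arithmetic"}
coverage = ksa_overlap(test, job)  # 3 of the 4 job KSAs are covered
```

A low coverage value would signal that important job requirements are not represented on the test, undermining the content-based inference.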
Critics of content-oriented strategies have noted, however, that the content of a test might be
representative of the job, but the responses people make to these items might have little or
nothing to do with responses to the same sort of content in a work setting (Guion, 1977; Murphy &
Davidshofer, 2005; Sackett, 1987). The Binning and Barrett (1989) framework distinguishes several
relationships that bear on the validity of a selection test:
• Measure-to-measure relationships: How well do scores on the test correlate with perfor-
mance ratings (criterion-related validity)?
• Measure-to-construct relationships: How well do test scores represent cognitive ability?
How well do performance ratings represent job performance (construct validity)?
• Predictor-measure-to-criterion-construct relationships: Do test scores predict job perfor-
mance (as opposed to performance ratings)?
Typically, criterion-related validity studies start (and sometimes end) with measure-to-measure
relationships (e.g., the correlation between test scores and performance ratings). As the Binning
and Barrett framework makes clear, these correlations are not really the main focus of a validity
study. Validity studies usually start with the assumption that there are strong links between the
criterion measure and the construct it is supposed to represent; here this is equivalent to assuming
that performance ratings are good measures of job performance, an assumption that is probably not
true (Murphy & Cleveland, 1995). If there is strong evidence of the construct validity of criterion
measures, the Binning and Barrett (1989) framework suggested that it should be possible to make
strong inferences about the link between test scores and the performance construct (this link is
shown with a heavy line in Figure 5.1). However, establishing a strong link between the predictor
and the criterion construct requires, first and foremost, good criterion measures. Given the long and
sorry history of criterion measures used in personnel psychology (Austin & Villanova, 1992), it is
hard to be optimistic about our ability to make well-founded inferences about links between test
scores and poorly measured criteria.
This framework implies that construct validity issues may have different meanings for predic-
tors and criteria. The simplest case is one in which the test and the criterion show strong evi-
dence of construct validity. In that case, empirical correlations between tests and criterion measures
provide convincing evidence that the underlying constructs are related. On the other hand, it is
[Figure 5.1 depicts the links among an ability test, performance ratings, and the underlying
constructs of cognitive ability and job performance.]
FIGURE 5.1 Relationships between constructs and measures. (Adapted from Binning, J. F., &
Barrett, G. V., Journal of Applied Psychology, 74, 478–494, 1989.)
possible that a test that shows poor construct validity could nevertheless show good performance
as a predictor. The correlations between test scores and both criterion measures and the criterion
constructs they represent do not necessarily depend on the construct validity of the predictor. On
the other hand, poor construct validity on the criterion side renders empirical evidence of correla-
tions between predictors and criterion measures potentially useless. Nobody really cares if tests are
correlated with performance ratings; what is important is whether or not test scores are correlated
with performance.
LOCAL GENERALIZATIONS
First, it is important to clarify the meaning of “generalization” in the context of this discussion
about implementation decisions. Generalizations that support (justify) implementation decisions
refer to inferences that previous validation research results apply in the specific local setting
in which a selection program is being designed and implemented. We refer to these as “local
generalizations.” Local generalization is somewhat different from the generalization that
underlies the typical scientific inference from empirical evidence to theoretical conclusion.
For example, the inference from research evidence that conscientiousness leads to low error
rates is an inference from data to a theoretical proposition. This inference is different from the
conclusion that, say, Hogan Personality Inventory conscientiousness scores will predict filing
accuracy among new clerical hires in XYZ Corporation. The latter conclusion is an inference
that a theoretical proposition applies in a specific local setting. This is a local generalization. The
guidelines in the Standards and Principles about selection validation methods as described above
are primarily focused on assuring meaningful generalizations from data to theory. Similarly, the
principles of scientific inference (e.g., see Cook & Campbell, 1979) describe the scientific controls
and conditions necessary for a set of data to lead to unambiguous, unconfounded theoretical
conclusions. In contrast, the Standards and Principles provide far less guidance about sound
generalizations from theory to specific settings.
In a modest effort to codify the basis for making sound local generalizations, we suggest that the
tradition of selection research and practice supports two qualitatively different rationales, either of
which can support local generalizations. The first is a sampling rationale. The essential feature of
this rationale is that important features of the local, specific setting such as applicant type, predic-
tor construct, and criterion construct are represented by the set of validation research results from
which one is making a local generalization. The main focus of this rationale is the claim that fea-
tures of the local setting are adequately sampled in some sense by the study conditions that produced
the set of results. It should be understood that there is, or could be, a theoretical component to this
sampling rationale. The theoretical component would serve to identify some finite set of features
of the local setting that are relevant to the use of the selection procedure. This set of theoretically
relevant features would determine the manner in which the local setting should be sampled by the
set of research results. Consider the example of an organization’s decision to include personality
assessment in an assessment center. Lievens, Chasteen, Day, and Christiansen (2006) argued that
the validity of personality assessments depends on the “strength” of the work situation. If that is an
accepted theoretical conclusion, then the sampling approach to local generalization would require
that the research samples adequately represent the situation strength profile of the local setting.
The second type of rationale supporting local generalizations is based on a purely theoretical
justification. In this case, the argument justifying local generalization is that the theoretical rationale
accounting for the validity of a selection procedure is judged to apply to the local setting because
the defined theoretical components are known to be relevant in the local setting or because the
theoretical explanation discounts major situational differences as being irrelevant to the theoretical
framework that explains the predictor-criterion relationship. An expert makes this judgment on
the basis of the logical implications of the accepted theoretical framework. Consider the example
of job component validity (JCV), which is a method for local generalization because its purpose
is to derive an estimate of validity in a specific setting on the basis of previous research results.
Hoffman, Rashovsky, and D’Egidio (2007) described this method of generalization as relying on a
theoretical framework linking validity to job components and a “gravitational” hypothesis justifying
assumptions about the meaning of long-term incumbents’ test scores. A JCV estimate is derived
from a theoretical rationale, not a sampling rationale.
Finally, it is important to note that local generalizations are equally relevant to conclusions
about local measurement properties and prediction properties. From an implementation perspec-
tive, measurement inferences are used to identify and attach meaning to the predictors and criteria,
and prediction inferences define the impact or usefulness of selection decisions. Local general-
ization inferences provide confidence that the measurement and prediction inferences are accu-
rate in the context in which these measures will be used. In the Binning and Barrett framework
described above, measurement inferences are represented by the vertical links and prediction infer-
ences by the horizontal links. Our view of this model is that it is intended to capture the diverse
types of data-to-theory inferences underlying selection validity and is not intended to represent
local generalizations. We take this view because the two considerations represented in the model—
criterion-versus-predictor and measure-versus-construct considerations—are separate from the
general-versus-specific considerations underlying local generalizations. To capture considerations
of local generalization, the model could be expanded to include this third dimension distinguishing
general theoretical conclusions from generalizations to local settings. In this extended model, link-
ages would represent local generalizability across this third dimension.
Whatever the operational considerations involved, such a decision is justified by the local
generalization that the usefulness of the selection program is similarly constant across those times
and situations.
Each of the following briefly describes a characteristic of typical selection programs that implies
some specific local generalization necessary to support the design of the selection program.
1. One set of selection procedures is applied across a wide range of work differences. These
include differences between positions within jobs, jobs within job families, job families
within organizations, across changes in the work, and across time. It is often the case
that selection program designers seek to create job families that are as wide as possible
but still admit a single selection solution across all differences within the family. This is
often driven by considerations of cost, efficiency, and ease of employee movement within
the family of jobs. But program designers make a decision about the actual range of jobs
within the family that will share the same selection process. This decision is based, in
part, on a conclusion that the range of true criterion differences within the family is small
and the range of predictor validity values across those criteria is similarly small. It should
be noted that the local generalization about the range of criterion differences is not an
inference about measurement. This is an inference about the meaning of the true latent
structure of the diverse criteria and not a conclusion about the measurement of these cri-
teria. After all, the organization values workplace behavior, not the measure of workplace
behavior.
2. All candidates are treated the same with respect to their scores on the selection proce-
dures. Although this approach may, to a great extent, grow out of the regulatory pressure
not to distinguish between protected and majority groups, other differences between can-
didates are also often ignored, such as differences between experienced and inexperienced
candidates, candidates with internal performance records and those with none, and candi-
dates with recent educational experience and those with older education experience. The
underlying generalization is not that candidate factors have no effect on validity or score
levels but that the effects of such candidate differences are small compared to the costs that
would be incurred by tailoring selection practices to such candidate differences.
The generalization that supports the decision to treat all candidates’ scores the same is
a measurement inference and a prediction inference. The measurement inference is that the
meaning of candidate scores is independent of candidate differences such as work experi-
ence, education, and test preparation. A particularly compelling version of this practice is
the strategy of administering a selection procedure only in English. Although the organi-
zation-level rationale for this decision may be that English is the language of the job, the
underlying measurement conclusion supporting this practice is that linguistic differences
between candidates do not affect the meaning of the assessment scores. Of course, this is
a less important conclusion for selection procedures that have little or no verbal content.
The companion prediction conclusion supporting this practice is that candidate differ-
ences do not systematically moderate the predictive validity of the selection procedure.
Consider the example of a sales situational judgment test (SJT) developed in the context
of a particular sales job within an organization. Candidate differences in a priori knowl-
edge of sales job procedures may systematically influence the predictive validity of the
sales SJT.
3. Once implemented, all predictors in a selection system are often managed in the same
way. Predictors differ with respect to the variance of their validities, but different predictors are
often treated as if their validities were equally unchanging across time and situations. For example,
it is not uncommon for a combination of cognitive and personality
predictors to be implemented in the same way across the same family of jobs regardless
of evidence that personality is likely to have more variable validity across those jobs. This
simplification of process is almost always attributed, in part, to the administrative value of
selection processes that are easy to manage. This may also be attributable, in part, to the
scientific result that the combined effect of positively correlated predictors is near optimal
when they are given equal weight even if they have unequal validities.
The local generalization that justifies the practice of treating all predictors alike is about
prediction validity differences across time and conditions. The supporting conclusion is
that such differences are small relative to the cost and administrative burden that would be
caused by tailoring selection processes to such differences.
4. Relatively short tests are chosen to represent desired predictor constructs. In large-scale
commoditized employment processes, it is not uncommon for vendors to offer and orga-
nizations to choose shorter assessments of desired predictor constructs. (This point is
separate from the relative merits of computer adaptive assessments.) For example, it is not
uncommon for operational cognitive ability batteries to consist of three tests each requir-
ing 5 minutes to administer. Similarly, short personality forms are increasingly available.
The practical consideration is obvious: shorter tests cost less. The underlying local generalization
is a measurement inference about the predictors, namely that the cumulative effects of
shorter length do not outweigh the savings. The possible effects of shorter length may
include predictor construct under-representation, lower validity, and higher variability in
scores and validity across individual differences and test conditions.
5. New hires are expected to be acceptable on every facet of work behavior. Individual
selection studies typically focus on criteria that capture a manageable number of work
behaviors. Most often, these include some combination of task performance behaviors,
contextual behaviors, overall performance, and turnover. Typically, validity studies either
focus on specific facets of work behavior or aggregate a limited number of facets such as
task performance and contextual performance to create an overall performance measure.
In contrast, selected candidates are expected to demonstrate acceptable behavior in every
facet of work behavior valued by the organization. For example, new hires are expected
to be productive enough, fast enough, helpful enough, honest enough, reliable enough,
prompt enough, have appropriate enough personal habits, stay long enough, work safely
enough, be easy enough to work with, and so on.
The selection program designer’s conclusion that those selected will be acceptable in
every facet of employee value is, at root, a local generalization about the criterion constructs
measured in validity studies. There appear to be three possible conclusions about criterion
constructs that would support the expectation that selected employees are acceptable on all
facets of work behavior. First, one might conclude that each facet of work behavior valued
in the local job and organization is represented among the criteria used in existing validity
studies. Second, one might conclude that aggregate measures of performance commonly
reported in existing validity studies have sufficient construct scope that the validity with
which they are predicted is generalizable to any particular facet of work behavior. Third, one
could rely on a criterion "halo" argument: if existing studies demonstrated validity for some
facets of work behavior, then that level of validity may be generalized to unmeasured facets of
work behavior because people who are good at some things tend to be good at other things.
6. Test scores are not adjusted to eliminate the effects of candidates’ previous experience
or insight about tests. Selection program designers decide whether to adjust test scores on
the basis of candidate experience with or insight about the tests. Selection programs rarely
adjust cognitive test scores to correct for effects of retesting, test preparation training,
test anxiety counseling, or other test-related experience that may impact test scores. This
is a consequential decision because recent evidence (Hausknecht, Halpert, Di Paolo, &
Moriarty, 2007) continues to confirm that retesting leads to approximately a 0.25 stan-
dard deviation improvement in selection test scores, and new evidence (Lievens, Reeve, &
Heggestad, 2007) suggests retesting changes measurement constructs and decreases pre-
dictive validity. In contrast, it appears more likely that some selection programs have
implemented adjustment strategies for personality assessment scores. (We have no empiri-
cal data on this point, but the research attention given to indices of socially desirable
responding suggests that some selection programs use such indices to adjust personality
scores.)
Two types of generalizations are needed to support decisions made about score adjust-
ments. The first is a measurement inference about the effect of candidate experience and
insight differences on scores and whether those score effects are construct-relevant. The
second is a prediction inference about the effects such score differences have on criterion
validity. A decision not to adjust scores would be justified by the conclusion that measure-
ment effects are small or, even if measurement effects are not small, effects on criterion
validity are small. As with all other program decisions described here, other factors unre-
lated to measurement and prediction will play a significant role in decisions about score
adjustment and are likely to force the program designer into an evaluation of the tradeoffs
between competing interests.
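The equal-weighting result invoked in the third example above can be checked with a short calculation. The correlations below are hypothetical: a cognitive test and a personality scale with unequal validities, modestly intercorrelated, compared under regression (optimal) weights versus unit weights.

```python
import math

def composite_validity(w1, w2, r12, r1y, r2y):
    """Validity of the composite w1*x1 + w2*x2 given the predictor
    intercorrelation r12 and the two predictor-criterion validities."""
    num = w1 * r1y + w2 * r2y
    den = math.sqrt(w1**2 + w2**2 + 2 * w1 * w2 * r12)
    return num / den

# Hypothetical battery: cognitive test r = .50, personality scale r = .25,
# predictors intercorrelated .30
r12, r1y, r2y = 0.30, 0.50, 0.25

# Regression (optimal) weights for the two-predictor case
det = 1 - r12**2
b1 = (r1y - r12 * r2y) / det
b2 = (r2y - r12 * r1y) / det

r_optimal = composite_validity(b1, b2, r12, r1y, r2y)
r_equal = composite_validity(1.0, 1.0, r12, r1y, r2y)
```

With these values, equal weights recover over 90% of the optimal composite validity despite the predictors' unequal validities, which helps explain why administratively simple, uniform treatment of predictors often costs little in practice.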
A key point here is that each of these six examples of a common selection practice represents
a set of decisions made by the selection program designer on the basis of conclusions about the
effectiveness of the selection practice in the local setting. Even if one argues that these practices
simply reflect a pragmatic tolerance of imperfection or the unknown, all decisions affect the value
of a selection program for an organization. In this discussion about professional selection practice,
decisions with such consequences are assumed to have a substantive rationale relevant to the valid-
ity of the selection program.
These six examples of local generalizations may or may not apply to any particular selection
program. On the basis of these and other examples of specific local generalizations, Table 5.2
describes a simple, more general taxonomy of local generalizations relevant to all selection programs.
This simple model is organized around two facets of any local generalization. In an applied selection
context, all local generalizations are generalizations of either a theoretical measurement inference
or a theoretical prediction inference and apply to the predictor, or the criterion, or both. Within the
family of theoretical measurement inferences, some relate to the meaning of scores; others relate to characteristics of the score distributions. This is an important reminder that selection designers and implementers care about more than relationships; score levels are also important. Within the family of theoretical prediction inferences, some relate to the relationship between the predictor and the criterion; others refer to independent factors that may moderate that relationship. Admittedly, this
is a fine distinction. However, many of the most typical or important challenges for local selection
implementation work involve a consideration of possible moderators.
TABLE 5.2
Local Generalizations Relevant to Selection Programs

Measurement: Score meaning
Predictor
• Chosen predictors assess intended constructs
• Locally developed predictors assess intended constructs
• Motives of local test takers do not alter intended constructs
• Local test preparation options do not alter intended constructs
• Local retesting practices do not alter intended constructs
• Local accommodations to assessment procedures for candidates with disabilities do not alter intended constructs
• Local exemptions/waivers from assessment requirements are based on alternative representations of the constructs assessed by the predictor
Criterion
• Local criteria of interest (implicit or explicit) associated with the intended use of the selection procedures represent the constructs underlying assumed levels of validity
• Local job analysis methods accurately represent the major facets of work and worker attributes

Measurement: Score psychometrics
Predictor
• Response strategies of local candidates do not alter expected scores
• Score reliability in local candidate samples is sufficient
• Local circumstances such as recruiting practices do not alter expected group differences
Criterion
• Systematic errors by criterion raters, if used, do not alter the level of performance targeted by choice of qualifying standards, if any

Prediction: Predictor-criterion relationship
Predictor
• Local conditions do not alter the level of validity inferred from previous research
Criterion
• The introduction of the selection program does not, itself, alter the importance of criteria targeted by the program's intended use

Prediction: Prediction moderators
Predictor
• Local conditions do not cause irrelevant changes in differential prediction slopes or intercepts in analyses of differential prediction
• Local candidate characteristics ignored in the selection process, such as level of education and experience, do not moderate level of validity
• Local differences between positions, jobs, or work families for which the same selection process is applied do not alter the level of validity expected from previous research
Criterion
• Local conditions do not introduce unmeasured criterion constructs, leading to biased differential prediction results
job element validities across the job elements thought to be relevant to a whole job. Finally, we have
not included evidence based on response processes or evidence based on consequences of testing,
both of which are included in the Standards. Evidence based on consequences does not derive from
a unique methodological approach, as do the five methods described here. In effect, asserting that
evidence based on consequences is within the scope of selection validity simply expands the scope
of criteria to include other possible consequences than group effects. It has no unique implications
for validation methodology other than a possible organization level of analysis. For this reason, it
has no unique implications for local generalization other than to suggest a broader scope. Similarly,
evidence based on response processes is distinguished primarily by the nature of its theoretical
foundation, not its methodology. In fact, one could argue that every predictor-criterion relation-
ship of interest to selection psychology could admit a theoretical analysis based on the examinee’s
response process.
For each of the five types of evidence, we describe its relevance to local generalizations. Evidence
supporting local generalizations justifies a conclusion that validity relationships demonstrated by the
validation methodology apply in the specific local setting. The relevance of each type of evidence
can be evaluated by examining its relevance to measurement and prediction conclusions and by
examining the extent to which the evidence provides a sampling or theoretical rationale linking the
specific local setting to the research evidence.
A final point about the evaluations that follow is that all of the five categories of validation
methods are characterized by the types of variables they utilize (e.g., item content, performance
measures, inter-item correlations) and the types of information they gather (e.g., empirical data
and expert judgments). These validation methods are not distinguished by their application of the
basic requirements of unconfounded inference. Examples of such requirements include the repre-
sentativeness of samples with respect to intended populations, the experimental independence of
cause and effect measures, a temporal sequence of observations consistent with the hypothesized
relationship, and adequate sample sizes. These fundamental requirements of scientific inference are
applicable to all five categories of methods although often in somewhat different ways. However,
because they are not specific to any one category of validation methods, these fundamental principles of sound inference are not included in the evaluation of the five methods relevant to local generalizations. We assume the version of each method we evaluate complies with these fundamental requirements.
can discount many possible setting-specific moderators. This, in turn, greatly facilitates the local generalization that validity levels observed in research settings apply to the specific local setting.
A compelling example of this type of theory-based generalization is offered by Hogan, Davies,
and Hogan (2007), who developed a theory of personality at work to explain why the validity
of personality generalizes across contexts. Third, criterion methods are empirical. They involve
samples of research participants. This sampling feature enables selection program implementers to
compare local candidate pools to well-described research samples. Generalizing validity evidence
to the specific local setting is supported by the demonstration that the research samples represent
the local candidate pool.
An important strength of criterion methods is that they can provide sampling and theory-based
evidence. This is valuable because sampling-based generalizations and theory-based generaliza-
tions are interdependent. Theory-based explanations for valid relationships allow one to recognize
what sample characteristics are relevant to the generalization from research samples to local candi-
date pools. In the example above, empirically supported theory implies that GMA-task proficiency
validity is not moderated by organization culture. On the basis of this theoretical conclusion, the
local selection designer is justified in not considering organization culture when judging whether
the research samples represent the local candidate pool.
and in some cases much greater weight, to the current local study estimates than would the
medical model.
Synthetic validation consists of methods for constructing local generalization arguments on the
basis of similarity of job components across jobs/settings, their relative importance, and informa-
tion about predictive validity values for individual components. Content validation evidence could
be used in the same way, although it is typically not used in this fashion. Typically, content evidence
is gathered in the specific setting in which the selection procedure will be used. Any validation
inference based on locally gathered content evidence is an inference about local validity.
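The component-based logic of synthetic validation can be illustrated with a toy calculation. Every number below is invented, and real synthetic validation (e.g., job component validity) uses more elaborate estimation than this simple importance-weighted average; the sketch only shows the shape of the argument: component validities established elsewhere, combined according to local component importance.

```python
# Validity of a predictor for individual job components, assumed to be
# established by prior research in other settings (invented values).
component_validity = {
    "data_analysis": 0.45,
    "customer_contact": 0.20,
    "equipment_operation": 0.30,
}

# Relative importance of each component in the local job, assumed to come
# from a local job analysis (weights sum to 1.0; invented values).
local_importance = {
    "data_analysis": 0.50,
    "customer_contact": 0.30,
    "equipment_operation": 0.20,
}

# Importance-weighted combination as a rough local validity estimate.
synthetic_estimate = sum(
    component_validity[c] * local_importance[c] for c in component_validity
)
print(f"Synthetic validity estimate for the local job: {synthetic_estimate:.3f}")
```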
Meta-analyses of prediction results and, less often, measurement data can support three separate
rationales for justifying local generalizations. First, meta-analyses summarize information about
validity differences across samples. This provides a sampling rationale for local generalizations to
the extent that the local setting is well represented by the range of samples contributing to the meta-
analysis results. Second, meta-analyses lead to theoretical conclusions about the likely presence and
meaning of situational variables that moderate validity. Where meta-analyses indicate that mea-
sured moderators explain relatively little of the observed validity variability, a theoretical argument
can be made that such validities are unlikely to be affected to any great extent by any situational
moderators, measured or unmeasured (Schmidt & Hunter, 1998). Where that theoretical argument
is accepted, local generalizations need not depend on any evaluations of situational moderators. In
effect, the sampling rationale becomes moot. Finally, mathematical techniques have been developed
and evaluated by which meta-analytic mean validity values may be translated into estimates of local
validity values (Brannick, 2001; Newman, Jacobs, & Bartram, 2007; Schmidt & Raju, 2007).
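The flavor of these translation techniques can be conveyed with a minimal empirical-Bayes sketch in the spirit of Brannick (2001): a noisy local validity estimate is shrunk toward the meta-analytic mean, with weights proportional to each source's precision. The inputs and the simple precision weighting below are illustrative assumptions, not the published estimators.

```python
# Meta-analytic prior for true validity (invented values).
meta_mean = 0.30   # mean true validity from the meta-analysis
meta_tau2 = 0.01   # between-study variance of true validities

# Small local criterion-related study (invented values).
r_local = 0.15     # observed local validity coefficient
n_local = 60       # local sample size
# Approximate sampling variance of a correlation coefficient.
v_local = (1 - r_local**2) ** 2 / (n_local - 1)

# Precision-weighted (shrinkage) estimate of the local true validity.
w_prior = 1 / meta_tau2
w_local = 1 / v_local
posterior = (w_prior * meta_mean + w_local * r_local) / (w_prior + w_local)

print(f"Local study alone:  {r_local:.3f}")
print(f"Meta-analytic mean: {meta_mean:.3f}")
print(f"Shrinkage estimate: {posterior:.3f}")  # falls between the two
```

With a larger local sample, v_local shrinks and the estimate moves toward the local result, which is the sense in which a Bayesian approach can give greater weight to current local estimates than the medical model would.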
Other categories of evidence provide less direct rationales for local generalization. Criterion-related evidence and internal structure evidence do not, themselves, aggregate information in a way
that leads directly to local generalizations. Certainly meta-analysis may be applied to results from
both to provide a direct rationale for local generalizability. This observation is not a criticism as
much as it simply points out the boundaries of certain classes of validation evidence.
Two final observations merit attention here. The first is that two traditions have emerged and have
been supported for justifying local generalizations. First is the empirical tradition, which is repre-
sented by meta-analytic techniques applied primarily to the results of criterion-oriented research.
The empirical tradition supports a sampling rationale by virtue of the fact that all empirical research
involves samples of participants who generate data in some applied or laboratory setting. The empir-
ical tradition also supports theory-based local generalizations by which the theoretical framework
explaining the valid relationship of interest can be interpreted to draw conclusions about the likely
strength of situation-specific moderators. Where theoretical analysis suggests few important mod-
erators, local generalizations are justified.
The second tradition is captured in content validation and synthetic validation methods. This
tradition might be called the expert judgment tradition of validation and local generalization. It
is important to acknowledge this tradition and, perhaps, check its boundaries from time to time.
This tradition manifests itself in content validation and synthetic validation methods in which
judgments by job experts of the relevance and/or validity of worker attributes with respect to job components create the foundation for the validity inference. When these judgments are made by
job experts who are work scientists, they might be expected to be grounded in empirical research
findings relevant to the judgment task. Where these judgments are made by job experts who are
not scientists, they are presumed to be grounded in experiential learning about the requirements of
successful job performance. Although not without criticism as noted above, the Principles endorses
both as sources of validity evidence. The question that should be raised about the expert judgment
tradition is about the degree of abstraction from observable work behavior that begins to degrade
the accuracy of expert judgments. There would be two benefits of a professional effort to identify
standards of validation that must be satisfied by expert judgment processes. One benefit would
apply directly to content and synthetic validation methods by defining process standards necessary
to warrant judgments that could be treated as a source of validity evidence. The second benefit
would be that the same standards could be applied to the process by which selection designers and
implementers make judgments about the appropriateness of local generalizations. Both are expert
judgments about the strength of an inference in which the inference relies on the experts’ knowledge
of the relevant research foundation and ability to interpret and synthesize that knowledge into
informative, accurate judgments.
The final observation acknowledges the importance for local generalizations of the accepted
body of knowledge representing selection psychology. Local generalizations, like any theoretical
inference, are made in the context of the accumulated body of knowledge relevant to the issue.
Although much of this section addresses the manner in which local generalizations are informed
by diverse types of validation methods and processes, local generalizations are likely to be less
accurate if each requires a repeated reevaluation of the full litany of all relevant research findings.
Rather, local generalizations would benefit greatly from access to professionally representative, if not consensus, summaries of the existing body of research relevant to the validity issue in question. Currently, the profession of industrial-organizational (I-O) psychology has no such systematic
forum for documenting what we know in a way that conveys broad professional acceptance.
In summary, we have described validity as a changing and multifaceted concept. Validity is the
accumulation of evidence of many types, including psychometric and structural relationships, evi-
dence of meaning, evidence of prediction, and evidence of consequences to support many types of
professional decisions relating to the implementation of selection programs. No single claim about
validity and no single method of validation are likely to capture all of these facets.
REFERENCES
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1985). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: Author.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. Braun (Eds.), Test validity (pp. 19–32).
Hillsdale, NJ: Lawrence Erlbaum.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77,
836–874.
Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical
reanalysis. Journal of Applied Psychology, 66, 1–6.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Binning, J. F. (2007). Validation strategies. In S. G. Rogelberg (Ed.), Encyclopedia of industrial and organiza-
tional psychology (Vol. 2, pp. 859–864). Thousand Oaks, CA: Sage.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Brannick, M. T. (2001). Implications of empirical Bayes meta-analysis for test validation. Journal of Applied
Psychology, 86, 468–480.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Test Program.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings.
Chicago, IL: Rand McNally.
Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer & H. Brown (Eds.), Test valid-
ity. Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. Linn (Ed.), Intelligence: Measurement,
theory, and public policy. Urbana: University of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measure-
ments: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281–302.
Dimitrov, D. M. (2002). Reliability: Arguments for multiple perspectives and potential problems with general-
ization across studies. Educational and Psychological Measurement, 62, 783–801.
Current Concepts of Validity, Validation, and Generalizability 121
Goldstein, I. L., & Zedeck, S. (1996). Content validation. In R. Barrett (Ed.), Fair employment strategies in
human resource management (pp. 27–37). Westport, CT: Quorum Books.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process.
In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations. San Francisco, CA: Jossey-
Bass.
Gottfredson, L. S. (1986). Societal consequences of the g factor in employment. Journal of Vocational Behavior,
29, 379–410.
Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are
and how to use them. Educational and Psychological Measurement, 66, 930–944.
Guion, R. M. (1977). Content validity, the source of my discontent. Applied Psychological Measurement, 1,
1–10.
Guion, R. M. (1978a). “Content validity” in moderation. Personnel Psychology, 31, 205–213.
Guion, R. M. (1978b). Scoring content domain samples: The problem of fairness. Journal of Applied Psychology,
63, 499–506.
Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385–398.
Guion, R. M. (1998). Assessment, measurement and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Guion, R. M., & Cranny, C. J. (1982). A note on concurrent and predictive validity designs. Journal of Applied
Psychology, 67, 239–244.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A
meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology,
92, 373–385.
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New
York, NY: Free Press.
Hoffman, C. C., Rashovsky, B., & D’Egidio, E. (2007). Job component validity: Background, current research,
and applications. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leverag-
ing existing validity evidence. San Francisco, CA: Jossey-Bass.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail
(Ed.). Alternative validation strategies: Developing new and leveraging existing validity evidence. San
Francisco, CA: Jossey-Bass.
Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of
Vocational Behavior, 29, 340–362.
Jensen, A. R. (1980). Bias in mental testing. New York, NY: Free Press.
Kleiman, L. S., & Faley, R. H. (1985). The implications of professional and legal guidelines for court decisions
involving criterion-related validity: A review and analysis. Personnel Psychology, 38, 803–833.
Kraiger, K., & Teachout, M. S. (1990). Generalizability theory as construct-related evidence of the validity of
job performance ratings. Human Performance, 3, 19–36.
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist,
41, 1183–1192.
Langenfeld, T. E., & Crocker, L. M. (1994). The evolution of validity theory: Public school testing, the courts,
and incompatible interpretations. Educational Assessment, 2, 149–165.
Lawshe, C. H. (1952). Employee selection. Personnel Psychology, 6, 31–34.
Lawshe, C. H. (1985). Inferences from personnel tests and their validities. Journal of Applied Psychology, 70,
237–238.
Lievens, F., Chasteen, C. S., Day, E. A., & Christiansen, N. D. (2006). Large-scale investigation of the role of
trait activation theory for understanding assessment center convergent and discriminant validity. Journal
of Applied Psychology, 91, 247–258.
Lievens, F., Reeve, C. L., & Heggestad E. D. (2007). An examination of psychometric bias due to retesting on
cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–1682.
McCormick, E. J. (1959). The development of processes for indirect or synthetic validity: III. Application of
job analysis to indirect validity. A symposium. Personnel Psychology, 12, 402–413.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment inter-
views: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599–616.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.
McHenry, J. J, Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). Project A validity results:
The relationship between predictor and criterion domains. Personnel Psychology, 43, 335–355.
McKay, P. F., & McDaniel, M. A. (2006). A Re-examination of Black-White mean differences in work perfor-
mance: More data, more moderators. Journal of Applied Psychology, 91, 538–554.
MacKenzie, S. B., Podsakoff, P. M., & Jarvis, C. B. (2005). The problem of measurement model misspecification in behavioral and organizational research and some recommended solutions. Journal of Applied Psychology, 90, 710–730.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of mea-
surement. In H. Wainer & H. Brown (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational
Researcher, 18, 5–11.
Murphy, K. R. (2000). Impact of assessment of validity generalization and situational specificity on the science
and practice of personnel selection. International Journal of Selection and Assessment, 8, 194–206.
Murphy, K. R. (2003). Validity generalization: A critical review. Mahwah, NJ: Lawrence Erlbaum.
Murphy, K. R. (2010). How a broader definition of the criterion domain changes our thinking about adverse
impact. In J. Outtz (Ed.), Adverse impact (pp. 137–160). San Francisco, CA: Jossey-Bass.
Murphy, K. R., & Cleveland, J. (1995). Understanding performance appraisal: Social, organizational and
goal-oriented perspectives. Newbury Park, CA: Sage.
Murphy, K. R., & Davidshofer, C.O. (2005). Psychological testing: Principles and applications (6th ed). Upper
Saddle River, NJ: Prentice Hall.
Murphy, K. R., & DeShon, R. (2000a). Inter-rater correlations do not estimate the reliability of job performance
ratings. Personnel Psychology, 53, 873–900.
Murphy, K. R., & DeShon, R. (2000b). Progress in psychometrics: Can industrial and organizational psychol-
ogy catch up? Personnel Psychology, 53, 913–924.
Murphy, K. R., & Shiarella, A. (1997). Implications of the multidimensional nature of job performance for the
validity of selection tests: Multivariate frameworks for studying test validity. Personnel Psychology, 50,
823–854.
Nathan, B. R., & Alexander, R. A. (1988). A comparison of criteria for test validation: A meta-analytic investi-
gation. Personnel Psychology, 41, 517–535.
Newman, D. A., Jacobs, R. R., & Bartram, D. (2007). Choosing the best method for local validity estima-
tion: Relative accuracy of meta-analysis versus a local study versus Bayes-analysis. Journal of Applied
Psychology, 92, 1394–1413.
Nunnally, J. (1975). Psychometric theory 25 years ago and now. Educational Researcher, 4, 7–14, 19–21.
Peterson, N. G., Wise, L. L., Arabian, J., & Hoffman, R. G. (2001). Synthetic validation and validity gen-
eralization when empirical validation is not possible. In J. Campbell & D. Knapp (Eds.), Exploring
the upper limits of personnel selection and classification (pp. 411–451). Mahwah, NJ: Lawrence
Erlbaum.
Ree, M. J., & Earles, J. A. (1992). Intelligence is the best predictor of job performance. Current Directions in
Psychological Science, 1, 86–89.
Roth, P. L., BeVier, C. A., Bobko, P., Switzer, F. S., III, & Tyler, P. (2001). Ethnic group differences in
cognitive ability in employment and education settings: A meta-analysis. Personnel Psychology, 54,
297–330.
Sackett, P. R. (1987). Assessment centers and content validity: Some neglected issues. Personnel Psychology,
40, 13–25.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., & de Fruyt, F. (2003). International validity generaliza-
tion of GMA and cognitive abilities: A European Community meta-analysis. Personnel Psychology, 56,
573–605.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge
in psychology. American Psychologist, 47, 1173–1181.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generaliza-
tion. Journal of Applied Psychology, 62, 643–661.
Schmidt, F. L., & Hunter, J. E. (1980). The future of criterion-related validity. Personnel Psychology, 33,
41–60.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psycho-
logy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43,
627–670.
Schmidt, F. L., & Raju, N. S. (2007). Updating meta-analytic research findings: Bayesian approaches versus the
medical model. Journal of Applied Psychology, 92, 297–308.
Schmidt, F. L., Viswesvaran, C., & Ones, D. S. (2000). Reliability is not validity and validity is not reliability.
Personnel Psychology, 53, 901–912.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analysis of validity studies published
between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37,
407–422.
Society for Industrial and Organizational Psychology (1987). Principles for the validation and use of personnel
selection procedures (3rd ed.). Bowling Green, OH: Author.
Society for Industrial and Organizational Psychology (2003). Principles for the validation and use of personnel
selection procedures (4th ed.). Bowling Green, OH: Author.
Steel, P. G., Huffcutt, A. I., & Kammeyer-Mueller, J. (2006). From the work one knows the worker: A sys-
tematic review of challenges, solutions and steps to creating synthetic validity. International Journal of
Selection and Assessment, 14, 16–36.
Stelly, D. J., & Goldstein, H. W. (2007). Application of content validation methods to broader constructs. In
S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity
evidence. San Francisco, CA: Jossey-Bass.
Thorndike, R. M., & Lohman, D. F. (1990). A century of ability testing. Chicago, IL: Riverside.
Uniform guidelines on employee selection procedures. (1978). Federal Register, 43(166), 38290–39309.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (2005). Is there a general factor in ratings of performance?
A meta-analytic framework for disentangling substantive and error influences. Journal of Applied
Psychology, 90, 108–131.
Part 2
Implementation and Management
of Employee Selection Systems
in Work Organizations
Jerard F. Kehoe and Robert E. Ployhart,
Section Editors
6 Attracting Job Candidates
to Organizations
Recruiting is more complex today than it has ever been. Technology is helping corporations find skilled, cost-effective talent in all corners of the world, enabling globally integrated workforces. However, to be successful, corporations need recruiting models that accommodate growth markets and mature markets, entry-level and experienced professionals, and cross-generational applicants (twenty-somethings, thirty-somethings, and so on). Corporations must also develop successful recruiting strategies to secure hot skills or market-value skills. Recruiting models must leverage global best practices while complying with local legislation and managing local cultures. Recruiting must move candidates through the hiring process more quickly than ever while managing greater volumes of applicants than in the past.
The ability to attract individuals to work at organizations is a topic of perennial research inter-
est. Major reviews of the research on recruitment appear periodically (e.g., Barber, 1998; Breaugh
& Starke, 2000; Ployhart, 2006; Rynes, 1991; Rynes & Cable, 2003; Rynes, Heneman, & Schwab,
1980; Taylor & Collins, 2000), noting findings and lamenting a lack of progress in addressing many
questions (Saks, 2005). In this chapter, we look at current questions regarding applicant attraction
arising from the recent workplace trends noted above as a means of framing a practical research
agenda for the future. Specifically, we address what is known and what we need to know about
applicant attraction in light of globalization, advances in technology in recruitment, and organiza-
tional efforts toward more strategic talent management.
We have organized our review from a more traditional recruitment process perspective into the
three stages of reaching potential applicants, maintaining applicant interest, and securing offer
acceptance (Barber, 1998), recognizing that the same organizational practices may affect each of
these phases and that applicants may not clearly distinguish among them. We also have forgone a study-by-study review, because such reviews are already plentiful, in favor of a high-level integration of key findings.
PRACTICE IMPLICATION
Companies today should have some working knowledge of what sources of information are
being viewed by applicants. For example, blogs are becoming a common way for applicants
(and employees) to share experiences and perceptions of a company and its hiring processes
in real time. All recruiters should “Google” their company’s name to see what information is available to applicants and should read that information, regardless of its source, to gain insight into the image being portrayed about the company.
Realities of the job and working within a company have been a particular research focus for
quite some time (see Phillips, 1998, and Premack & Wanous, 1985, for early reviews). Studies
suggest that the presentation of negative or realistic information will have differential effects on
different categories of job seekers [e.g., those with less prior job experience (Meglino, DeNisi, &
Ravlin, 1993); higher quality applicants (Bretz & Judge, 1998; Thorsteinson, Palmer, Wulff, &
Anderson, 2004)]. Further, Highhouse, Stanton, and Reeve (2004) found that negative information about prospective employers is discounted more than positive information. However, Reeve, Highhouse, and Brooks (2006) also showed that a single piece of negative information can outweigh the accumulated effects of multiple moderately positive pieces of information (i.e., the relative balance of positive and negative information matters less than the intensity of one's affective response to a piece of negative information). The overall conclusion of this line of research is
that although providing realistic information (and negative information) may adversely affect the
attraction of some desired applicants, its overall effect on applicant pool quality and quantity may
depend on the specifics of the job, industry, labor market, job seekers, and nature of the informa-
tion itself.
PRACTICE IMPLICATION
The challenge is for recruiters and hiring managers to acquire and perfect techniques that
balance the “sell” with the job and company “realities.” A suggestion for recruiters would be
to create a realistic job preview discussion guide that can be used by interviewers, recruiters,
and hiring managers. The guide could have competitive advantage talking points as well as
information on the realities of working in the company and in a particular role. Interviewers
should be taught how to best sell the company and how to set realistic expectations. Indeed,
recent rises in truth-in-hiring or fraudulent inducement lawsuits suggest that special care
be taken when making specific statements regarding time to promotion and variable pay
(Hansen, 2007).
Attracting Job Candidates to Organizations 129
Highhouse and colleagues have generated several studies integrating decision-making theory
with recruitment research that address issues of how to present information to potential applicants.
Some conclusions from these studies are as follows:
1. Adding more relevant information to job postings can lead to greater job attractiveness
perceptions (Yuce & Highhouse, 1997; see also Barber & Roehling, 1993 for similar
findings).
2. Lack of information (e.g., about pay) can lead to negative inferences and lower attractive-
ness perceptions (Yuce & Highhouse, 1997; however, see also Highhouse & Hause, 1995
and Maurer et al., 1992).
3. Suggesting vacancy scarcity or time scarcity (limited application period) can lead to more
positive employer image inferences (Highhouse, Beadle, Gallo, & Miller, 1998).
PRACTICE IMPLICATION
The practical advice is “more information is better.” Organizations should review job postings
to add position- and organization-relevant information. If appropriate, vacancy and/or time
scarcity should be emphasized.
Methodological shortcomings of research have been pointed out in many previous reviews; these
reviews typically end with a call for research designs that recognize the complexities involved in
attraction (e.g., consider multiple sources; consider specific information provided; consider context
such as labor market; consider job seeker experience; consider organizational-level antecedents and
outcomes, not just the individual level). That is, research seldom examines more than one factor at a time or the factors' relative influences. Another shortcoming is that the focus on recent college graduates leaves a gap in knowledge about the generalizability of findings to other groups.
PRACTICE IMPLICATION
Evaluate an organization’s image and reputation, contrast what is found with what is desired,
and then develop planned activities to affect images held by job seekers.
PRACTICE IMPLICATION
Evaluate how recruitment and advertising practices might be used to enhance organizational
image. In particular, consider how to engage applicants via media that are rich and credible.
We now discuss generating applicant interest in light of workplace trends regarding globalization,
technology, and strategic talent management.
SOURCING GLOBALLY
It is critical for an employer to recognize that in a global market, there may be large differences in
familiarity, reputation, and image across regions; hence, recruitment activities may need to vary to
create the type and level of employer knowledge desired (i.e., a familiar organization with a positive
reputation in one locale may have to engage in more and different recruitment activities in another
locale). To tap the same caliber of talent in different locations may require different sourcing strate-
gies. For example, an Internet presence may greatly expand an organization’s reach, but, because
of differences in Internet access, it may lead to missing key talent sources in some locations. The
issue of the “digital divide” has been raised with regard to the use of technology in recruiting
(Cober, Brown, Keeping, & Levy, 2004), although evidence for negative effects of technology on
pool diversity in North America is mixed (see Russell, 2007, for a review). However, in terms of
global recruiting, access remains an important concern. Another example would be the need for
appropriate translations. Although the ubiquity of English as the language of business can lead to
Williams & Bauer, 1994). Harrison, Kravitz, Mayer, Leslie, and Lev-Arey (2006) provided a meta-
analytic review of this research and suggested that the use of less prescriptive affirmative action
plans (i.e., not preferential treatment), communication of detailed features, and providing justifica-
tions for use are likely to result in the most positive reactions across the applicant pool. Initially, it
might seem that this research would readily generalize globally. However, Myors et al. (2008) noted
that the United States is somewhat of an outlier in its prohibition on quotas and within-group stan-
dards. The greater prevalence of preferential treatment in hiring outside of the United States means
that one needs to consider how culture and economic conditions will affect applicant perceptions of
policy statements and hiring practices.
Researchers have specifically considered how the portrayal of a diverse workforce in recruitment
advertisements affects attraction (e.g., Avery, Hernandez, & Hebl, 2004; Perkins, Thomas, & Taylor,
2000). For example, Avery (2003) found that Black and White individuals' reactions to ad diversity were complex, depending upon the individual's other-group orientation as well as whether the ads portrayed more or less racial vertical integration. Despite some complexity in findings, the general advice has been
that ads depicting greater diversity should be used (Avery & McKay, 2006). On a global basis, this
would extend beyond considerations of racial/ethnic and gender diversity to diversity of culture (e.g.,
through dress). Vertical integration of the organization by individuals of different nationalities may
also affect attraction across countries.
Table 6.1 outlines some suggested research directions related to globalization and attracting
applicants.
TABLE 6.1
Reaching Potential Applicants—Unanswered Research Questions

Globalization
- Are there differences in recruitment source effects across countries because of differences in economies and cultures?
- Does the role of various attributes of information (e.g., valence, specificity) in attraction vary by culture and economic conditions?
- How do differences in familiarity, reputation, and image across countries affect an organization's ability to attract applicants?
- How does research on U.S.-based preferential treatment and diversity policy statements apply to attraction of applicants globally, where different norms and standards apply with regard to hiring?
- Are the effects of media richness, media credibility, and specific recruitment activities moderated by culture?
- When and how should recruitment materials and activities be culturally adapted?
- How does vertical integration by nationality affect attraction on a global basis?

Technology
- Does the use of technology in recruitment only influence the "how" and "amount" of information delivery to potential applicants, or does it also alter "what" attracts applicants?
- How can individual differences in what attracts be leveraged to customize messages in recruitment?
- How does job seeker web search behavior influence how to best attract applicants?
- What are the most effective, innovative uses of technology for generating interest among high-quality applicants?
- How does technology's role in attraction differ for different job levels and applicant pools?
- How do targeted individuals react to customized versus noncustomized messages?

Strategic Talent Management
- How do organizational recruitment activities vary according to strategic talent management objectives?
- To what extent do strategic ad placements affect recruiting outcomes differently than less targeted advertising?
- How do policy/benefit emphases in attraction of targeted talent affect recruiting outcomes for targeted groups?
- How can source-of-hire information be used effectively in targeted recruitment?
- What is a cost-effective level of customization in recruitment?
PRACTICE IMPLICATION
Organizations need to evaluate sourcing strategies in different markets and culturally adapt
recruiting materials as needed. What are considered valued organizational attributes in differ-
ent cultures should be considered in developing recruiting strategies and materials.
In virtual worlds such as Second Life, users create avatars, interact with one another, participate in activities, and purchase virtual property and goods. Because Second Life provides
access to a wide range of individuals who are geographically dispersed, recruiting in Second
Life is taking place. For example, KPMG conducted an online fair, allowing reach of many more
applicants than a traditional job fair (Flinders, 2007). Other approaches include organizations
such as Merrill Lynch’s hosting webinars to efficiently reach multiple people in multiple locations
(Mullich, 2004) and organization involvement in electronic gaming as a means of garnering attraction. Cosmetics maker L'Oreal bypassed traditional visits to top business schools, instead searching for talent via an online competition. More than 30,000 students from over 110 countries logged on,
and it is now one of the world’s largest business planning contests (see http://www.estrat.loreal.
com/_int/_en/whatis.aspx?).
In the past, considering individual differences in reactions to recruitment materials and selection processes was viewed as having limited practical value because developing different content and processes for different types of prospective applicants was resource-prohibitive. Technology now makes it far easier to vary what different applicants are provided, and thus there is renewed interest in individual differences. Customizing messages can occur in
terms of the characteristics of those who frequent a particular website, LISTSERV, or job board.
Customization can also occur in the specific types of messages sent to various groups.
Research on customization of messages is fairly nascent. In a general vein, researchers have
compared websites that are more recruiting-oriented to those that are more screening-oriented (Hu,
Su, & Chen, 2007; Williamson, Lepak, & King, 2003). Although some of this research suggests
that more recruiting-oriented websites generate greater attraction, several studies have shown that
providing self-screening information (e.g., assessments of fit with the position in terms of abilities,
values, and needs) is seen as particularly valuable by applicants and directly affects variables such
as information recall and site viewing time (e.g., Dineen, Ash, & Noe, 2002; Dineen et al., 2007).
Given that fit perceptions influence self-selection (Cable & Judge, 1996; Kristof-Brown, Zimmerman,
& Johnson, 2005), enabling accurate self-assessment of fit can be a vital role for websites.
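The kind of self-screening tool these studies describe can be sketched as a simple profile comparison. The sketch below is purely illustrative: the value dimensions, the 1-5 rating scale, and the distance-based scoring are our own assumptions, not the method used in the cited studies.

```python
# Hypothetical web-based fit self-assessment: an applicant rates the
# importance of several work values (1-5); the tool compares that profile
# with the organization's profile and returns a 0-1 fit score.

def fit_score(applicant: dict, organization: dict) -> float:
    """Mean agreement across shared value dimensions (1.0 = identical)."""
    dims = sorted(set(applicant) & set(organization))
    if not dims:
        raise ValueError("no shared dimensions to compare")
    # Ratings are on a 1-5 scale, so the maximum gap per dimension is 4.
    gaps = [abs(applicant[d] - organization[d]) / 4.0 for d in dims]
    return 1.0 - sum(gaps) / len(gaps)

org_profile = {"teamwork": 5, "autonomy": 2, "pay": 3, "work_life": 4}
applicant_profile = {"teamwork": 4, "autonomy": 3, "pay": 3, "work_life": 5}

score = fit_score(applicant_profile, org_profile)
# A site could then translate the score into feedback that aids self-selection.
message = "strong fit" if score >= 0.75 else "weak fit"
```

In practice, such a tool would matter less for its exact scoring rule than for giving applicants an honest basis for self-selection before they invest in applying.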
Another individual difference consideration with regard to technology and recruitment is vari-
ability in job seeker web search behavior (Cober, Brown, Keeping, & Levy, 2004). That is, consider-
ation of the amount of time spent on information accessed, number of different components or sites
accessed, and total time spent on web searching may be important in assessing which applicants are
best reached through these media. Cober et al. (2004) focused on web search behavior as a mediator of website effects on attraction to a specific organization; extending this research to the broader question of how applicants avail themselves of recruitment information would usefully build on earlier research on recruitment sources.
Table 6.1 (p. 132) provides a summary of research directions on technology and applicant
attraction.
PRACTICE IMPLICATION
In implementing technology in recruitment, consider content, usability, and aesthetics in
conjunction with one another. Organizations that develop innovative uses of technology set
themselves apart from competitors. Messages should be customized where appropriate, and
applicants should be provided with self-screening opportunities.
Organizations increasingly place recruitment under the umbrella of talent management practices and emphasize alignment of recruitment practices with
strategic goals (Lewis & Heckman, 2006).
One of the keys to strategic talent management is to identify which talent pools are pivotal, that is, those in which small gains in applicant quantity or quality will yield large returns (Boudreau & Ramstad, 2005;
Lewis & Heckman, 2006). Researchers have shown that differences in applicant pool quality have util-
ity in that hiring higher quality applicants has benefits for the organization (Connerley, Carlson, &
Mecham, 2003). One then has to identify which practices might lead to achieving these gains.
We can envision ways in which the topic of generating applicant interest might be approached
with a consideration of the organization's strategy for talent management. For example, which talent pools is the organization most concerned about retaining and, therefore, for which applicant pools should concerns about applicant quality predominate? Which talent pools has the organization targeted for greater internal development investment, and how might that affect recruitment targets? Although it is easy to see how such questions can serve in a practical way
to ensure that organizational resources for recruitment are directed in keeping with strategy, it also
may be important to consider an organization's strategy in determining what it considers recruitment effectiveness (i.e., for a given job type, one organization's strategy might suggest that generating strong interest among high-talent applicants is critical, whereas another's might suggest that applicant pool quality does not have to be as strong).
Organizational strategies for recruitment are likely to vary based on context. For example,
Hallier (2001) discussed how recruitment for greenfield sites might seek to attract a different type
of applicant than might be traditionally sought for individuals in similar jobs at other locations. As
another example, Taylor (2006) pointed to the need for better research on recruitment strategies for
small firms.
Strategic ad placement to attract specific talent pools has often been discussed in the context of
recruiting a more diverse applicant pool (see Avery & McKay, 2006 for a review) but can be consid-
ered more broadly as a talent management practice. For example, an organization concerned with
attraction of a specific talent pool may think about ad placement in terms of reaching that particular
pool. Technology has enabled location- or context-based advertising on the web in innovative ways
(e.g., placing job ads next to specific types of just-released content on websites).
Similarly, whereas targeted recruitment messages, targeted campuses, and targeted job fairs have all been discussed in terms of attracting minority applicants (Avery & McKay, 2006), one can envision other targeted messages and targeted sourcing
depending on other talent pool targets. For example, organizations can send customized text mes-
sages to potential applicants at certain universities or belonging to certain professional organiza-
tions to advertise openings or events. Cisco has held focus groups with ideal recruitment targets to
uncover how they spend free time and websites they visit to better target their recruitment efforts
(e.g., recruiters attend art fairs, microbrewery festivals, and home and garden shows; advertisements
are placed at local movie theatres; Nakache, 1997). The company has also used a “friends program”
in which prospects are matched with current employees with similar background and skills who call
to discuss what it is like to work there (Nakache, 1997). In the Dineen et al. (2007) study mentioned
earlier, customizing the message or providing feedback regarding organizational fit likelihood was
found to increase the attraction of qualified applicants and decrease the attraction of unqualified
individuals. Although customization and provision of feedback can be more costly to initiate, the long-run cost savings of creating a more targeted applicant pool are apparent. Cober,
Brown, Keeping, and Levy (2004) noted that we need more research on the effects of such messages
on intended targets as well as what the unintended effects are on those outside of the target group.
Although research on how organizational policies and benefits affect recruitment outcomes is
not new (e.g., job security rights, Roehling & Winters, 2000; salary negotiations, Porter, Conlon &
Barber, 2004; work-family balance initiatives, Nord, Fox, Phoenix, & Viano, 2002; affirmative
action policies, Harrison et al., 2006), strategic talent management suggests a direct tie of policy/
benefit promotion to potential targeted applicant groups to success in recruiting those groups. For
example, companies interested in attracting women into occupations where they are underrepresented (e.g., engineering) may emphasize work-life initiatives; however, surprisingly little empirical research has actually examined the impact of work-life initiatives on attraction (Carless & Wintle, 2007; Casper & Buffardi, 2004), let alone on offer acceptance.
Applicant tracking systems (ATSs) allow organizations to better determine which sources yield desired applicants (e.g., by tracking location-based advertising). However, source-of-hire tracking can be of poor quality if insufficient attention is paid to the quality of tracking tokens or to developing drop-down self-report source lists that are specific and easy for applicants to use (Ruiz, 2007).
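To make source-of-hire analysis concrete, here is a minimal sketch; the source labels and record layout are invented for illustration and do not reflect any particular ATS.

```python
# Hypothetical source-of-hire tally: each applicant record carries a source
# token (e.g., from a tagged application URL or a self-report drop-down),
# letting the organization compare yield across sources.
from collections import Counter

applicants = [
    {"source": "employee_referral", "hired": True},
    {"source": "job_board",         "hired": False},
    {"source": "job_board",         "hired": True},
    {"source": "campus_fair",       "hired": False},
    {"source": "job_board",         "hired": False},
]

pool = Counter(a["source"] for a in applicants)
hires = Counter(a["source"] for a in applicants if a["hired"])

# Yield ratio per source: hires divided by applicants from that source.
yield_by_source = {s: hires[s] / n for s, n in pool.items()}
```

The analysis is only as good as the tokens: if many applicants choose a vague "other" source or tokens are lost across referral hops, the yield ratios will mislead.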
Cable and Turban (2001) stated that, “There are not recruitment ‘best practices’ across firms”
because firms need to assess what employer knowledge their target market holds before developing
a recruitment strategy. Approaching recruitment from an integrated talent management perspective
suggests a strong shift away from examining what works best across applicants to what works best
for specific targets of interest in specific contexts. This is hardly a new suggestion—it has been made
by Rynes (Rynes, 1991; Rynes & Cable, 2003; Rynes et al., 1980) in all of her reviews—but the cur-
rent zeitgeist with regard to strategic talent management may increase the likelihood of the research
shift. Once again, Table 6.1 (p. 132) summarizes research suggestions on attracting targeted talent.
PRACTICE IMPLICATION
Organizations need to identify pivotal talent pools and use strategic ad placement, targeted
messages, and targeted sourcing for those pools. A quality ATS to monitor, evaluate, and
evolve recruiting efforts is key to success.
MAINTAINING INTEREST
Although it is clear that many of the factors that relate to generating interest (e.g., organizational
image, job attributes) relate to maintaining interest, there are several research topics concentrated
primarily on keeping applicants in the pipeline once they have applied. Some general findings are
discussed in the following sections.
APPLICANTS PREFER RECRUITERS WHO TREAT THEM WELL AND ARE INFORMATIVE
Applicants prefer and react more positively to recruiters who treat them well and are informative
(see Breaugh & Starke, 2000 or Rynes & Cable, 2003 for reviews of this research).
In other words, applicants seek informational and interactional justice (Bell, Wiechmann, &
Ryan, 2006) as they proceed through the hiring process. However, it is important to note that the
general consensus is that the job and the organization are more important than the recruiter (Chapman, Uggerslev, Carroll, Piasentin, & Jones, 2005).
In our view, advice to organizations on this topic remains pithy: treat applicants nicely, make sure
recruiters know the jobs for which they are recruiting, and train recruiters. It seems that researchers should focus more on the micro level of interactions between recruiters and applicants to better
inform training as to what causes affective shifts among applicants. For example, we need to under-
stand better where the balance lies between technology-enabled processes and high-touch recruit-
ment practices—at what point do applicants perceive less high touch and become less satisfied?
PRACTICE IMPLICATION
Recruiters need to be carefully selected. Training and enabling of recruiters requires
investment.
PRACTICE IMPLICATION
Rather than focusing on recruiter-applicant similarity as critical, efforts should be directed
at finding out applicant views of organizational diversity climate, where those views derive
from, and how they are influenced by organizational recruitment activities.
PRACTICE IMPLICATION
Site visit features (i.e., how time is spent) need to be investigated in terms of which ones relate
to actual offer acceptance.
PRACTICE IMPLICATION
Data on perceptions of the hiring tools used should be regularly gathered and monitored;
however, conclusions should be couched in terms of the many other factors exerting influence
on applicant decision-making.
PRACTICE IMPLICATION
Rather than passively accepting self-selection rates, organizations should investigate reasons
for self-selection to uncover whether any causes are problematic or are leading to a loss of
desirable applicants. Also, desirable self-selection should be facilitated through tools that
enable assessment of job and organization fit.
TABLE 6.2
Maintaining Interest—Unanswered Research Questions

Globalization
- How do cultural and economic differences affect perceptions of the interactional and informational justice of encounters with recruiters?
- Do recruiter/applicant similarity effects exist on features such as culture or nationality?
- What is the role of culture in reactions to selection tools?
- How do applicants form inferences of organizational climate for global inclusion?

Technology
- What is the relative balance between high-touch and high-tech in maintaining attraction?
- How effective are innovative technologies at maintaining applicant interest?
- What features of Internet-based testing lead to more positive or more negative reactions?
- How can organizations stimulate positive and counteract negative word-of-mouse?

Strategic Talent Management
- What factors differentially affect the interest of high-quality versus low-quality applicants?
- How do competitor recruitment practices affect organizational ability to maintain interest?
- Do recruiter characteristics have differential effects on targeted versus nontargeted individuals? Which characteristics?
- Are there differences in self-selection rates and reasons for targeted and nontargeted individuals?
Trends toward globalization, changes in technology, and increased strategic talent management
suggest some new directions for research on maintaining applicant interest. These are listed in
Table 6.2 and described below.
Research has not revealed any strong, consistent pattern of relations between type of selection tool and applicant reactions that indicates particular cultural values as key to reactions (see Anderson & Witvliet, 2008, for a review).
For example, Ryan et al. (2009) found that similarities outweighed differences in perceptions of
selection tools where cultural values were directly assessed and that country gross domestic product
(GDP) was a greater contributor to variance in perceptions at the country level than were national
standings on cultural values.
PRACTICE IMPLICATION
There is a need to consider how job markets affect the ability to maintain interest, and mecha-
nisms for doing so need to be implemented in hot and/or valued skill markets. Cultural influ-
ences should be considered throughout selection process design.
Research suggests that technology-mediated interactions (e.g., telephone and videoconference interviews) are often not viewed as positively as face-to-face encounters (Chapman & Rowe, 2002; Chapman, Uggerslev, & Webster, 2003; Silvester & Anderson, 2003; Silvester, Anderson, Haddleton, Cunningham-Snell, & Gibb, 2000; Straus, Miles, & Levesque, 2001).
Some researchers have questioned whether interested individuals might be reluctant to complete
applications and screening tools on the Internet because of privacy concerns (Lievens & Harris,
2003). Although there are a few studies that suggest privacy concerns relate to willingness to use
Internet-based recruitment tools (Bauer et al., 2006; Harris, Ispas, & Mano, 2007; Harris, Van Hoye,
& Lievens, 2003), there is no compelling evidence that applicants are unwilling to complete online
applications in actual practice. However, some popular technological innovations may raise privacy
concerns if not appropriately managed. For example, the practice of tagging and tracking visitors
to corporate career websites and then deploying company ads in other web locations they visit to
maintain a company’s presence in the job seeker’s mind has increased (Ruiz, 2008). Ensuring that
personal data on the individual are not captured and maintained as well as ensuring that pervasive
advertising does not turn off applicants is important.
Technology by itself is neither good nor bad. It has great potential in the recruiting process, but the points at which it helps and hurts the recruitment process are far from understood.
PRACTICE IMPLICATION
Investigating applicant pool preferences with regard to technology can lead to appropriate
adaptations that are targeted to those applicants. Attending to how technology-enabled pro-
cesses are messaged to applicants is also important. Organizations must combat negative and
stimulate positive “word of mouse.”
propositions regarding minority job seekers and influences in site visits). Researchers may be less
interested in overall applicant pool self-selection effects and may instead need to drill down into specific reasons for self-selection among different target groups.
PRACTICE IMPLICATION
Investigate whether high-quality applicants react differently to recruitment activities than
low-quality applicants and then adapt to their preferences. Focusing efforts on high-quality
applicants will lead to a more effective recruitment process, but should be coupled with efforts
to dissuade low-quality applicants in effective and nonoffensive ways.
ACCEPTING OFFERS
The ratio of job acceptances to offers is considered an important indicator of recruiting success for
many organizations (i.e., Do the ones we want want us?). The factors mentioned earlier as affect-
ing attraction are sometimes not as critical to an acceptance: Individuals gather more information,
eliminate options on those factors, and change their criteria as they proceed through the job search
process. Prior reviews have noted that weak methodologies (e.g., examining intent to accept at early
stages of the process) have clouded findings on what affects decisions to accept offers. However,
several general conclusions have emerged.
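The acceptance-to-offer ratio mentioned above is simple arithmetic; with invented numbers:

```python
# Offer acceptance rate: accepted offers divided by offers extended
# (numbers are invented for illustration).
offers_extended = 40
offers_accepted = 26
acceptance_rate = offers_accepted / offers_extended
```

The interpretive work lies not in the ratio itself but in segmenting it, e.g., by talent pool or location, so that a high overall rate does not mask losses among the most desired applicants.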
TIMELINESS
Timeliness of offer is important (Rynes et al., 1991). Pressures by hiring managers to speed up the
recruitment process are not without empirical backing, because one can lose desirable individuals
with delays. However, Rynes and Barber (1990) noted that although offers in hand are generally
favored over uncertain possibilities from other organizations, this likely varies with applicant quality: highly competitive individuals can afford to wait longer but also may be "snatched up" sooner.
PRACTICE IMPLICATION
Evaluation of offer acceptance influencers needs to be made with data gathered later in the
process. Data gathered earlier can be used to assess influences on attraction and maintaining
interest but should not be considered as necessarily accurate when it comes to influences on
offer acceptance.
INDUCEMENTS
Inducements can make a difference (Rynes & Barber, 1990). Organizations use signing bonuses
routinely in tight labor markets for particular job categories (e.g., nursing). For example, the
U.S. Army began offering “quick ship” bonuses of $20,000 for recruits who agree to report
for training within 30 days (Shanker, 2007). Although inducements certainly affect offer
acceptance, we lack empirical data on their effects on applicant quality and turnover; a concern is that bonuses may appeal to individuals willing to move around for short-term gains (Hansen, 2006; Medland, 2003).
WORK-LIFE BALANCE
The role of work-life balance in job choice appears to be increasing in importance. Much has been
written about how younger generations are more concerned about work-life balance than previous
labor market entrants (Charles & Harris, 2007; Fleetwood, 2007; Roos, Trigg, & Hartman, 2006).
Of particular note is the importance of this factor to high-quality applicants. For example, in a 2007 study of turnover at IBM, top performers reported leaving in search of greater work-life balance, whereas solid contributors left for more traditional factors such as pay and promotion opportunities.
Trends in the workplace such as globalization and technological innovations have not appeared
to have a great impact on recruitment research on offer acceptance, but there are a few areas in
which we think research attention is warranted (see Table 6.3).
TABLE 6.3
Accepting Offers—Unanswered Research Questions

Globalization
- What are cultural and economic influences on the relative roles of pay, promotion opportunities, job security, signing inducements, social influencers, and work-life balance on job offer acceptance?
- What factors influence relocation decisions?
- How does organizational off-shoring and outsourcing affect applicant perceptions of job security, and how does that affect offer acceptance?

Technology
- How does organizational use of innovations in technology in recruitment affect job offer acceptance?

Strategic Talent Management
- How do changes in recruitment activities to target applicant pools affect subsequent training and socialization activities and other HR practices?
In some locations, jobs are seen as tickets to upward mobility, and so salary plays a bigger role in job interest there. Potential recruits often switch jobs frequently in efforts to obtain salary increases. Further,
because compensation packages and hours worked vary widely through the world, global recruiting
requires adjustments to offers to make them attractive in different locations (Brandel, 2006). We
have already noted that the role of social influencers will likely vary with the role accorded family
and friends in a given culture. Hence, although comparative empirical research on offer acceptance
and country is not available, existing evidence strongly suggests some differential rating of factors
by culture and economic conditions.
One of the biggest concerns of those involved in global hiring is anticipating where to find cost-effective labor that is willing to do lower-level work and stay in the role. For example, 20 years ago, India had a pool of skilled labor; as more companies tapped India, attrition and the cost of labor spiked. By 2008, companies were quickly moving to countries such as Malaysia, Vietnam, and South Africa to find skilled, cost-effective labor. Increasing off-shoring and outsourcing to low-cost countries may also lead applicants in locations where organizations have more mature businesses to take a closer look at job security in decision-making than has been the case in the past. We know little about how publicity regarding an organization's layoffs and off-shoring affects willingness to apply and to accept offers (Aiman-Smith, Bauer, & Cable, 2001; Kammeyer-Mueller & Liao, 2006).
PRACTICE IMPLICATION
Recognizing cultural and market influences on the relative importance of factors in offer
acceptance is important. Anticipating how market changes will affect recruitment efforts and
organizational image can facilitate the effectiveness of future efforts.
PRACTICE IMPLICATION
Review and evaluate technology use and messaging in later stages of the process, not just
early ones.
PRACTICE IMPLICATION
Narrow recruitment efforts through targeting. Focus on the early identification of talent and
generating interest in those pools. Evaluate how attraction strategies such as inducements can
affect internal equity.
CONCLUSIONS
In this chapter, we have summarized conclusions in recruitment research that have broad general-
ity. This focus may leave the reader wondering about the pros and cons of more contextualized
approaches to researching recruitment that are industry, job-level, geography, or target appli-
cant group-specific. Our conclusion is that contextual factors must be evaluated, but they are not
necessarily going to change recruitment theory and models. For example, although we provided
numerous examples in this chapter in which culture might make a difference, we also provided
numerous examples in which it does not appear to greatly influence research findings or practice. Advances in recruitment research and practice will come from better articulating when one ought to attend to these contextual factors and when they can be ignored or addressed with only minor modifications in approach.
We have eschewed a traditional review of the recruitment literature for a more focused look at
how some of the key conclusions of research should be interpreted in light of the important trends of
increasing globalization, increasing use of technology in recruiting, and increasing attention to stra-
tegic talent management. Our review leads us to conclude that although organizations are certainly
using new practices and adopting new strategies in response to these trends, the research base lags
practice in these areas. Increasing our attention to recruitment processes with these trends in mind
should yield more theory-driven practices than those adopted today, while at the same time better
informing our understanding of what influences attraction to organizations.
146 Handbook of Employee Selection
REFERENCES
Aiman-Smith, L., Bauer, T. N., & Cable, D. M. (2001). Are you attracted? Do you intend to pursue? A recruit-
ing policy-capturing study. Journal of Business and Psychology, 16, 219–237.
Anderson, N. (2003). Applicant and recruiter reactions to new technology in selection: A critical review and
agenda for future research. International Journal of Selection and Assessment, 11, 121–136.
Anderson, N., & Witvliet, C. (2008). Fairness reactions to personnel selection methods: An international com-
parison between the Netherlands, the United States, France, Spain, Portugal and Singapore. International
Journal of Selection and Assessment, 16, 1–13.
Avery, D. R. (2003). Reactions to diversity in recruitment advertising—Are differences Black and White?
Journal of Applied Psychology, 88, 672–679.
Avery, D. R., Hernandez, M., & Hebl, M. R. (2004). Who’s watching the race? Racial salience in recruitment
advertising. Journal of Applied Social Psychology, 34, 146–161.
Avery, D. R., & McKay, P. F. (2006). Target practice: An organizational impression management approach to
attracting minority and female job applicants. Personnel Psychology, 59, 157–187.
Baack, D. W., & Singh, N. (2007). Culture and web communications. Journal of Business Research, 60,
181–188.
Barber, A. E. (1998). Recruiting employees: Individual and organizational perspectives. Thousand Oaks, CA: Sage.
Barber, A. E., & Roehling, M. V. (1993). Job postings and the decision to interview: A verbal protocol analysis.
Journal of Applied Psychology, 78, 845–856.
Bauer, T., Truxillo, D., Tucker, J., Weathers, V., Bertolino, M., Erdogan, B., & Campion, M. (2006). Selection
in the information age: The impact of privacy concerns and computer experience on applicant reactions.
Journal of Management, 32, 601–626.
Beach, L. R. (1993). Broadening the definition of decision making: The role of prechoice screening of options.
Psychological Science, 4, 215–220.
Bell, B. S., Wiechmann, D., & Ryan, A. M. (2006). Consequences of organizational justice expectations in a
selection system. Journal of Applied Psychology, 91, 455–466.
Boudreau, J. W., & Ramstad, P. M. (2005). Talentship and the new paradigm for human resource management:
From professional practice to strategic talent decision science. Human Resource Planning, 28, 17–26.
Braddy, P. W., Thompson, L. F., Wuensch, K. L., & Grossnickle, W. F. (2003). Internet recruiting: The effects
of web page design features. Social Science Computer Review, 21, 374–385.
Brandel, M. (2006). Fishing in the global talent pool. Computerworld, 40, 33–35.
Breaugh, J., & Starke, M. (2000). Research on employee recruiting: So many studies, so many remaining ques-
tions. Journal of Management, 26, 405–434.
Bretz, R. D., & Judge, T. A. (1998). Realistic job previews: A test of the adverse self-selection hypothesis.
Journal of Applied Psychology, 83, 330–337.
Brooks, M. E., & Highhouse, S. (2006). Familiarity breeds ambivalence. Corporate Reputation Review, 9,
105–113.
Brooks, M. E., Highhouse, S., Russell, S., & Mohr, D. (2003). Familiarity, ambivalence, and firm reputation: Is
corporate fame a double-edged sword? Journal of Applied Psychology, 88, 904–914.
Cable, D. M., & Judge, T. A. (1996). Person-organization fit, job choice decisions, and organizational entry.
Organizational Behavior and Human Decision Processes, 67, 294–311.
Cable, D. M., & Turban, D. B. (2001). Establishing the dimensions, sources and value of job seekers’
employer knowledge during recruitment. Research in Personnel and Human Resources Management,
20, 115–163.
Cable, D. M., & Yu, K. Y. T. (2007). How selection and recruitment practices develop the beliefs used to assess
fit. In C. Ostroff & T. A. Judge (Eds.), Perspectives on organizational fit (pp. 155–182). New York, NY:
Lawrence Erlbaum.
Carless, S. A., & Wintle, J. (2007). Applicant attraction: The role of recruiter function, work-life balance poli-
cies and career salience. International Journal of Selection and Assessment, 15(4), 394–404.
Casper, W. J., & Buffardi, L. C. (2004). Work-life benefits and job pursuit intentions: The role of anticipated
organizational support. Journal of Vocational Behavior, 65(3), 391–410.
Chapman, D. S., & Rowe, P. M. (2002). The influence of videoconference technology and interview structure
on the recruiting function of the employment interview: A field experiment. International Journal of
Selection and Assessment, 10(3), 185–197.
Chapman, D. S., Uggerslev, K. L., Carroll, S. A., Plasentin, K. A., & Jones, D. A. (2005). Applicant attrac-
tion into organizations and job choice: A meta-analytic review of the correlates of recruiting outcomes.
Journal of Applied Psychology, 90, 928–944.
Attracting Job Candidates to Organizations 147
Chapman, D. S., Uggerslev, K. L., & Webster, J. (2003). Applicant reactions to face-to-face and technology-
mediated interviews: A field investigation. Journal of Applied Psychology, 88(5), 944–953.
Chapman, D. S., & Webster, J. (2003). The use of technologies in the recruiting, screening, and selection pro-
cesses for job candidates. International Journal of Selection and Assessment, 11(2–3), 113–120.
Charles, N., & Harris, C. (2007). Continuity and change in work-life balance choices. British Journal of
Sociology, 58(2), 277–295.
Cober, R. T., Brown, D. J., Keeping, L. M., & Levy, P. E. (2004). Recruitment on the Net: How do organiza-
tional web site characteristics influence applicant attraction? Journal of Management, 30, 623–646.
Cober, R. T., Brown, D. J., & Levy, P. E. (2004). Form, content, and function: An evaluative methodology for
corporate employment Web sites. Human Resource Management, 43, 201–218.
Cober, R. T., Brown D. J., Levy, P. E., Cober, A. B., & Keeping, L. M. (2003). Organizational Web sites: Web
site content and style as determinants of organizational attraction. International Journal of Selection and
Assessment, 11, 158–169.
Collins, C. J., & Han, J. (2004). Exploring applicant pool quantity and quality: The effects of early recruitment
practices, corporate advertising, and firm reputation. Personnel Psychology, 57, 685–717.
Collins, C. J., & Stevens, C. K. (2002). The relationship between early recruitment-related activities and the
application decisions of new labor-market entrants: A brand equity approach to recruitment. Journal of
Applied Psychology, 87, 1121–1133.
Connerley, M. L., Carlson, K. D., & Mecham, R. L. (2003). Evidence of differences in applicant pool quality.
Personnel Review, 32, 22–39.
Connerley, M. L., & Rynes, S. L. (1997). The influence of recruiter characteristics and organizational recruit-
ment support on perceived recruiter effectiveness: Views from applicants and recruiters. Human Relations,
50, 1563–1586.
Dineen, B. R., Ash, S. R., & Noe, R. A. (2002). A web of applicant attraction: Person-organization fit in the
context of Web-based recruitment. Journal of Applied Psychology, 87, 723–734.
Dineen, B. R., Ling, J., Ash, S. R., & DelVecchio, D. (2007). Aesthetic properties and message customization:
Navigating the dark side of web recruitment. Journal of Applied Psychology, 92, 356–372.
Fleetwood, S. (2007). Why work-life balance now? International Journal of Human Resource Management,
18, 387–400.
Fletcher, R. (2006). The impact of culture on web site content, design, and structure: An international and a
multicultural perspective. Journal of Communication Management, 10, 259–273.
Flinders, K. (2007, October 23). Harnessing Generation Y. Computer Weekly, 142, pNA.
Gibby, R. E., & Boyce, A. S. (2007, April). A cross-cultural look at items of numerical reasoning. Presented at
the annual meetings of the Society for Industrial and Organizational Psychology, New York, NY.
Hallier, J. (2001). Greenfield recruitment and selection: Implications for the older worker. Personnel Review,
30, 331–350.
Hansen, F. (2006). Refining signing bonuses. Workforce Management, 85, 39–41.
Hansen, F. (2007). Avoiding truth-in-hiring lawsuits. Workforce Management Online. Retrieved December 13,
2007, from http://www.workforce.com
Harris, M. M., Ispas, D., & Mano, H. (2007, April). Applicant perceptions of recruitment sources: A Romanian
sample. Presented at the annual meetings of the Society for Industrial and Organizational Psychology,
New York, NY.
Harris, M. M., Van Hoye, G., & Lievens, F. (2003). Privacy and attitudes towards internet-based selec-
tion systems: A cross-cultural comparison. International Journal of Selection and Assessment, 11,
230–236.
Harrison, D. A., Kravitz, D. A., Mayer, D. M., Leslie, L. M., & Lev-Arey, D. (2006). Understanding attitudes
toward affirmative action programs in employment: Summary and meta-analysis of 35 years of research.
Journal of Applied Psychology, 91, 1013–1036.
Hausdorf, P. A., & Duncan, D. (2004). Firm size and internet recruiting in Canada: A preliminary investigation.
Journal of Small Business Management, 42, 325–334.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated
model and meta-analysis. Personnel Psychology, 57, 639–683.
Hermeking, M. (2005). Culture and Internet consumption: Contributions from cross-cultural marketing and
advertising research. Journal of Computer-Mediated Communication, 11, 192–216.
Highhouse, S., Beadle, D., Gallo, A., & Miller, L. (1998). Get ‘em while they last! Effects of scarcity informa-
tion in job advertisements. Journal of Applied Social Psychology, 28, 779–795.
Highhouse, S., & Hause, E. L. (1995). Missing information in selection: An application of the Einhorn-Hogarth
ambiguity model. Journal of Applied Psychology, 80, 86–93.
Highhouse, S., Stanton, J. M., & Reeve, C. L. (2004). Examining reactions to employer information using a
simulated web-based job fair. Journal of Career Assessment, 12, 85–96.
Highhouse, S., Thornbury, E. E., & Little, I. S. (2007). Social-identity functions of attraction to organizations.
Organizational Behavior and Human Decision Processes, 103, 134–146.
Hu, C., Su, H., & Chen, C. B. (2007). The effect of person-organization fit feedback via recruitment web sites
on applicant attraction. Computers in Human Behavior, 23, 2509–2523.
Judge, T. A., & Bretz, R. D. (1992). Effects of work values on job choice decisions. Journal of Applied
Psychology, 77, 261–271.
Kammeyer-Mueller, J., & Liao, H. (2006). Workforce reduction and job seeker attraction: Examining
job seekers’ reactions to firm workforce reduction policies. Human Resource Management, 45,
585–603.
Karande, K., Almurshidee, K. A., & Al-Olayan, F. (2006). Advertising standardisation in culturally similar
markets: Can we standardise all components? International Journal of Advertising, 25, 489–512.
Kilduff, M. (1990). The interpersonal structure of decision making: A social comparison approach to organiza-
tional choice. Organizational Behavior and Human Decision Processes, 47, 270–288.
Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005). Consequences of individual’s fit at work:
A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel
Psychology, 58, 281–342.
Legree, P. J., Gade, P. A., Martin, D. E., Fischl, M. A., Wilson, M. J., Nieva, V. F., McCloy, R., & Laurence, J.
(2000). Military enlistment and family dynamics: Youth and parental perspectives. Military Psychology,
12, 31–49.
Lewis, R. E., & Heckman, R. J. (2006). Talent management: A critical review. Human Resource Management
Review, 16, 139–154.
Lievens, F., & Harris, M. M. (2003). Research on internet recruiting and testing: Current status and future
directions. In C. Cooper & I. Robertson (Eds.), International review of industrial and organizational
psychology (Vol. 18, pp. 131–165). Chichester, England: Wiley.
Lievens, F., & Highhouse, S. (2003). The relation of instrumental and symbolic attributes to a company’s attrac-
tiveness as an employer. Personnel Psychology, 56, 75–102.
Marcus, B. (2003). Attitudes towards personnel selection methods: A partial replication and extension in a
German sample. Applied Psychology: An International Review, 52(4), 515–532.
Matsumoto, D. (2001). The handbook of culture and psychology. New York, NY: Oxford University Press.
Maurer, S. D., Howe, V., & Lee, T. W. (1992). Organizational recruiting as marketing management: An inter-
disciplinary study of engineering graduates. Personnel Psychology, 45, 807–833.
McKay, P. F., & Avery, D. R. (2006). What has race got to do with it? Unraveling the role of racioethnicity in
job seekers’ reactions to site visits. Personnel Psychology, 59, 395–429.
Medland, M. E. (2003). When to pay signing bonuses. HR Magazine, 48, 99–102.
Meglino, B. M., DeNisi, A. S., & Ravlin, E. C. (1993). Effects of previous job exposure and subsequent job
status on the functioning of a realistic job preview. Personnel Psychology, 46(4), 803–822.
Moscoso, S., & Salgado, J. F. (2004). Fairness reactions to personnel selection techniques in Spain and Portugal.
International Journal of Selection and Assessment, 12, 187–196.
Mullich, J. (2004). Finding the schools that yield the best job-applicant ROI. Workforce Management, 83,
67–68.
Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). Broadening
international perspectives on the legal environment for personnel selection. Industrial and Organizational
Psychology, 1, 266–270.
Nakache, P. (1997). Cisco’s recruiting edge. Fortune, 136, 275–276.
Neal, T. M. (2005, August 22). Military’s recruiting troubles extend to affluent war supporters. Washington
Post.
Nord, W. R., Fox, S., Phoenix, A., & Viano, K. (2002). Real-world reactions to work-life balance programs:
Lessons for effective implementation. Organizational Dynamics, 30, 223–238.
Orbe, M. P. (1998). Constructing co-cultural theory: An explication of culture, power, and communication.
Thousand Oaks, CA: Sage.
Perkins, L. A., Thomas, K. M., & Taylor, G. A. (2000). Advertising and recruitment: Marketing to minorities.
Psychology & Marketing, 17, 235–255.
Phillips, J. M. (1998). Effects of realistic job previews on multiple organizational outcomes: A meta-analysis.
Academy of Management Journal, 41, 673–690.
Ployhart, R. E. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of
Management, 32, 868–897.
Porter, C. O. L. H., Conlon, D. E., & Barber, A. E. (2004). The dynamics of salary negotiations: Effects on
applicant’s justice perceptions and recruitment decisions. International Journal of Conflict Management,
15(3), 273–303.
Premack, S. L., & Wanous, J. P. (1985). A meta-analysis of realistic job preview experiments. Journal of
Applied Psychology, 70, 706–719.
Ravlin, E. C., & Meglino, B. M. (1987). Effect of values on perception and decision making: A study of alterna-
tive work values measures. Journal of Applied Psychology, 72, 666–673.
Reeve, C. L., Highhouse, S., & Brooks, M. E. (2006). A closer look at reactions to realistic recruitment mes-
sages. International Journal of Selection and Assessment, 14, 1–15.
Riordan, C. M., & Holiday-Wayne, J. (2008). Are all measures the same? A critical review and examination of demographic similarity measures in relational demography within groups research. Organizational Research Methods, 11, 562–592.
Roehling, M. V., & Winters, D. (2000). Job security rights: The effects of specific policies and practices on the
evaluation of employers. Employee Rights and Responsibilities Journal, 12, 25–38.
Roos, P. A., Trigg, M. K., & Hartman, M. S. (2006). Changing families/changing communities: Work, family
and community in transition. Community, Work & Family, 9, 197–224.
Ruiz, G. (2007). Where’d they come from? Workforce Management, 86, 39–40.
Ruiz, G. (2008). PeopleFilter amplifies ability to track job applicants. Workforce Management Online. Retrieved
February 2, 2008, from http://www.workforce.com
Russell, D. P. (2007). Recruiting and staffing in the electronic age: A research-based perspective. Consulting
Psychology Journal: Practice and Research, 59, 91–101.
Ryan, A. M., Boyce, A. S., Ghumman, S., Jundt, D., Schmidt, G., & Gibby, R. (2009). Going global: Cultural
values and perceptions of selection procedures. Applied Psychology: An International Review, 58,
520–556.
Ryan, A. M., & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and decisions: A critical
review and agenda for the future. Journal of Management, 26, 565–606.
Ryan, A. M., Sacco, J. M., McFarland, L. A., & Kriska, S. D. (2000). Applicant self-selection: Correlates of
withdrawal from a multiple hurdle process. Journal of Applied Psychology, 85, 163–179.
Rynes, S. L. (1991). Recruitment, job choice, and post-hire consequences: A call for new research directions. In
M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 399–444). Palo Alto,
CA: Consulting Psychologists Press.
Rynes, S. L., & Barber, A. E. (1990). Applicant attraction strategies: An organizational perspective. Academy
of Management Review, 15, 286–310.
Rynes, S. L., Bretz, R. D., & Gerhart, B. (1991). The importance of recruitment in job choice: A different way
of looking. Personnel Psychology, 44, 487–521.
Rynes, S. L., & Cable, D. M. (2003). Recruitment research in the twenty-first century. In W. C. Borman,
D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Volume 12 Industrial-Organizational
Psychology (pp. 55–76). Hoboken, NJ: John Wiley & Sons.
Rynes, S. L., Heneman, H. G., & Schwab, D. P. (1980). Individual reactions to organizational recruiting: A
review. Personnel Psychology, 33, 529–542.
Saks, A. M. (2005). Job search success: A review and integration of the predictors, behaviors, and outcomes.
In S. D. Brown & R. W. Lent (Eds.), Career development and counseling: Putting theory and research to
work (pp. 155–179). Hoboken, NJ: John Wiley & Sons.
Saks, A. M., Leck, J. D., & Saunders, D. M. (1995). Effects of application blanks and employment equity on
applicant reactions and job pursuit intentions. Journal of Organizational Behavior, 16, 415–430.
Shanker, T. (2007, August 11). Army, shedding a slump, met July recruiting goal. The New York Times. Retrieved
from http://www.nytimes.com/2007/08/11/us/11recruit.html
Silvester, J., & Anderson, N. (2003). Technology and discourse: A comparison of face-to-face and telephone
employment interviews. International Journal of Selection and Assessment, 11, 206–214.
Silvester, J., Anderson, N., Haddleton, E., Cunningham-Snell, N., & Gibb, A. (2000). A cross-modal compari-
son of telephone and face-to-face interviews in graduate recruitment. International Journal of Selection
and Assessment, 8, 16–21.
Straus, S. G., Miles, J. A., & Levesque, L. L. (2001). The effects of videoconference, telephone, and face-to-
face media on interviewer and applicant judgments in employment interviews. Journal of Management,
27, 363–381.
Taylor, M. S., & Collins, C. J. (2000). Organizational recruitment: Enhancing the intersection of theory and
practice. In C. L. Cooper & E. A. Locke (Eds.), Industrial and organizational psychology: Linking theory
and practice (pp. 304–334). Oxford, England: Basil Blackwell.
Taylor, S. (2006). Acquaintance, meritocracy and critical realism: Researching recruitment and selection pro-
cesses in smaller and growth organizations. Human Resource Management Review, 16, 478–489.
Thorsteinson, T. J., Palmer, E. M., Wulff, C., & Anderson, A. (2004). Too good to be true? Using realism to
enhance applicant attraction. Journal of Business and Psychology, 19, 125–137.
Triandis, H. C. (Ed.). (1995). Individualism & collectivism. Boulder, CO: Westview Press.
Turban, D. B. (2001). Organizational attractiveness as an employer on college campuses: An examination of the
applicant population. Journal of Vocational Behavior, 58, 293–312.
Turban, D. B., & Cable, D. M. (2003). Firm reputation and applicant pool characteristics. Journal of
Organizational Behavior, 24, 733–751.
Turban, D. B., Campion, J. E., & Eyring, A. R. (1995). Factors related to job acceptance decisions of college
graduates. Journal of Vocational Behavior, 47, 193–213.
Turban, D. B., & Dougherty, T. W. (1992). Influences of campus recruiting on applicant attraction to firms.
Academy of Management Journal, 35, 739–765.
Turban, D. B., Forret, M. L., & Hendrickson, C. L. (1998). Applicant attraction to firms: Influences of organiza-
tion reputation, job and organizational attributes, and recruiter behaviors. Journal of Vocational Behavior,
52, 24–44.
Van Hoye, G., & Lievens, F. (2005). Recruitment-related information sources and organizational attractiveness:
Can something be done about negative publicity? International Journal of Selection and Assessment, 13,
179–187.
Van Hoye, G., & Lievens, F. (2007). Investigating web-based recruitment sources: Employee testimonials vs.
word-of-mouse. International Journal of Selection and Assessment, 15, 372–382.
Vance, C. M., & Paik, Y. (2006). Managing a global workforce: Challenges and opportunities in international human resource management. London, England: M. E. Sharpe.
Vecchio, R. P. (1995). The impact of referral sources on employee attitudes: Evidence from a national sample.
Journal of Management, 21, 953–965.
Williams, M. L., & Bauer, T. N. (1994). The effect of a managing diversity policy on organizational attractive-
ness. Group & Organization Management, 19, 295–308.
Williamson, I. O., Lepak, D. P., & King, J. (2003). The effect of company recruitment Web site orientation of
individuals’ perceptions of organizational attractiveness. Journal of Vocational Behavior, 63, 242–263.
Younger, J., Smallwood, N., & Ulrich, D. (2007). Developing your organization’s brand as a talent developer.
Human Resource Planning, 30, 21–29.
Yuce, P., & Highhouse, S. (1997). Effects of attribute set size and pay ambiguity on reactions to ‘help wanted’
advertisements. Journal of Organizational Behavior, 19, 337–352.
Zottoli, M. A., & Wanous, J. P. (2000). Recruitment source research: Current status and future directions.
Human Resource Management Review, 10, 353–382.
Zusman, R. R., & Landis, R. S. (2002). Applicant preferences for web-based versus traditional job postings.
Computers in Human Behavior, 18, 285–296.
7 Test Administration and
the Use of Test Scores
TABLE 7.1
General Considerations for the Use of Selection Systems
• Discuss whether or not the organization has a “selection problem.” Personnel selection is only one option or
approach out of many possibilities (training, motivational interventions, or organizational redesign may supplement
or supplant selection).
• Define the desired purpose(s) for which test scores will be used. Direct uses include predicting performance,
minimum proficiency, safe behaviors, and reduced turnover; less direct uses include identifying training needs,
identifying potential for career advancement, or providing sponsored licensure or certification.
• Determine the characteristics to be measured by tests. KSAOs may be more directly measurable for applicants
with job experience; conversely, entry-level jobs or future job requirements may require the measurement of more
distal constructs (e.g., ability, motivation).
• Identify the practical constraints that influence your selection process—and how to deal with them. These
constraints include the volume of testing over time, the number of jobs into which individuals are being selected or
classified, the available resources of the organization (e.g., type of tests, number of computer testing terminals,
availability of test administrators), and the level of test security that can be achieved.
This section presents several issues that must be addressed prior to data collection when determin-
ing how test scores will be used. Table 7.1 provides a summary of these issues.
First, for what purpose will the test scores be used? We focus on selection issues in this chapter,
but testing is useful for several organizational purposes. Common uses of test scores are (a) to select
job applicants for employment, (b) to identify training needs, (c) to promote current employees or
determine who will be put on a fast track for later promotion, and (d) to support licensure or certification.
Second, what is the ultimate goal in collecting these test scores? Typical goals to which a testing
program can contribute are maximizing the mean level of task-specific or contextual job perfor-
mance; maximizing the number of employees who meet a minimum level of proficiency; improving
employee retention; increasing the diversity of skill sets in the organization; or minimizing counter-
productive behaviors such as theft, absenteeism, unsafe work behavior, or drug abuse. Taxonomies
of job performance have helped sharpen the models, definitions, and metrics that underlie organi-
zational objectives, but they also have opened our eyes to the fact that multiple objectives are often
of interest, making selection goals complicated. If the objective is a combination of two or more of
these goals, as it often is, then a decision must be made about the relative importance of each goal
to the organization. If no explicit consideration is paid to how testing relates to major organizational
goals, decisions made along the way could contradict those made in conjunction with other practices
and policies of the organization.
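To make the goal-weighting decision concrete, here is a minimal sketch of one common approach: combining each candidate's predicted standing on several goals into a single weighted index. The goal names, weights, and scores below are invented for illustration, not taken from the chapter.

```python
# Hypothetical example: rank-ordering candidates on a weighted composite
# of several organizational goals. All names and numbers are invented.

def composite_score(predicted, weights):
    """Weighted sum of a candidate's predicted standing on each goal."""
    return sum(weights[goal] * value for goal, value in predicted.items())

# Relative importance the organization assigns to each goal (sums to 1.0).
weights = {"task_performance": 0.5, "retention": 0.3, "safety": 0.2}

# Standardized (z-score) predictions for two candidates.
candidate_a = {"task_performance": 1.2, "retention": -0.4, "safety": 0.8}
candidate_b = {"task_performance": 0.6, "retention": 1.0, "safety": 0.2}

# Different profiles can yield the same composite, which is exactly why
# the relative weights need to be an explicit organizational decision.
print(round(composite_score(candidate_a, weights), 2))  # 0.64
print(round(composite_score(candidate_b, weights), 2))  # 0.64
```

Changing the weights (e.g., raising retention to 0.5) reorders the two candidates, which is the sense in which the stated importance of each goal drives later selection decisions.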
Third, what characteristics do you want to measure? For example, when selecting applicants
for employment, it is legitimate to select for characteristics that are important for the current job,
future job requirements, or person-organization fit. Prediction and applicant reactions to testing are
usually most favorable when the test or test battery measures characteristics that are more proximal
to performance, such as job knowledge and skill. Job knowledge tests and work samples tend to
be more expensive than measures of more indirect determinants of job performance, such as abil-
ity and personality; however, it may not be possible to test for specific job knowledge and skill for
entry-level positions.
Fourth, what volume of testing is necessary? Tests can be used to select a single individual
from a small number of applicants into a specific position or to generate a large pool of individu-
als who may be selected or classified into multiple jobs. Given the probabilistic nature of selection
benefits (i.e., small gains accumulate over large numbers of selection decisions), test users must
determine if the costs associated with a small volume of testing are worth the benefits that are
likely to be obtained. Testing in small versus large organizations can influence the way test scores
are used, because large organizations tend to have the resources that allow them more freedom to
do different things with test scores. Smaller organizations may be limited to buying off-the-shelf
tests that are scored and interpreted by outside vendors, and support for the use of the measure may
require relying on multiple sources of external information (e.g., job analysis, transporting validity
from other situations, and meta-analyses). Although this is also an option for larger organizations,
they are more likely to have the option of developing their own tests and conducting a local valida-
tion study. This offers more freedom in customizing, weighting, combining, reporting, and applying
test scores.
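One standard way to formalize this cost-benefit judgment, although the chapter does not present it, is a Brogden-Cronbach-Gleser-style utility estimate: the gain per hire (validity × dollar value of one standard deviation of performance × mean standardized score of those hired) multiplied by the number of hires, minus testing costs. All figures in the sketch below are invented.

```python
# Sketch of a Brogden-Cronbach-Gleser-style utility estimate, a standard
# model in the selection literature. All input values are hypothetical.

def selection_utility(n_hired, validity, sd_y, mean_z,
                      cost_per_applicant, n_applicants):
    """Estimated dollar gain from test-based selection, net of testing costs."""
    gain = n_hired * validity * sd_y * mean_z
    cost = n_applicants * cost_per_applicant
    return gain - cost

# Small program: 10 hires from 50 applicants.
small = selection_utility(10, 0.30, 10_000, 0.8, 50, 50)
# Large program: 500 hires from 2,500 applicants, same validity and costs.
large = selection_utility(500, 0.30, 10_000, 0.8, 50, 2_500)

print(round(small), round(large))  # 21500 1075000
```

The per-hire gain is identical in both cases; only the volume differs, which is why organizations with a small volume of testing must weigh costs more carefully.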
Finally, the mode of administration can influence how test scores are used. Will the test be
administered via paper and pencil, computer, role play, work sample, or even interactive voice
response (IVR)? Will the test be proctored or unproctored, one-on-one or group administration?
The shift to Internet or computer-based testing opens the door to a much wider range of test formats
(e.g., video, interactive, adaptive) and new constructs that can be measured more reliably (e.g., inter-
personal skills, ability to speak a foreign language).
The decisions we have just briefly covered influence later decisions in the testing process. The
remainder of this chapter discusses (a) collection of test scores, (b) computation of test scores, and
(c) making selection decisions on the basis of test scores.
Alternate Forms
A common and effective strategy for enhancing test security is creating alternate forms of the test while maintaining comparable scores across forms. Given that a test reflects a representative sample of content from the construct domain of interest, it should
be possible to develop alternate measures of the same construct that exhibit similar psychometric
properties in terms of reliability and validity. High correlations between scores across test forms
provide evidence that scores from a particular test reflect an individual's standing on an
underlying construct rather than an understanding of content unique to
a particular test form. The following subsections review (a) creating alternate forms, (b) adaptive
tests, and (c) test form equating.
It is possible to create alternate forms of personality tests using the same domain-sampling pro-
cedure as for ability tests because there are many potential items to measure personality constructs.
However, this may not be necessary because the "correct" answer on a personality test is usually
easy to determine. Prior exposure to personality items is not needed to fake a personality
test, as demonstrated in directed faking studies (Viswesvaran & Ones, 1999). A common practice
is simply to randomize the presentation order of the same items when creating alternate forms.
In general, the importance of having alternate forms corresponds to the extent to which a single
correct answer can be determined for the test questions. Thus, alternate forms are less important for
personality and biodata tests; more important for SJTs, interviews, and work simulations; and most
important for knowledge or ability tests.
A recent trend is the development of many forms (e.g., 10 instead of 2) in an attempt to minimize
cheating and keep a testing program going if one of the test forms is compromised. Oswald et al.
(2005) extended a method proposed by Gibson and Weiner (1998) for creating many test forms on
the basis of the statistics from a pool of items that may not have appeared on the same test form or
have been given to the same sample. This method of generating parallel forms potentially mini-
mizes the exposure of any single test item; in fact, item exposure can be a constraint built into the
procedure for generating test forms (see van der Linden, 2005). A more common strategy for creat-
ing parallel forms is to create three or four unique alternate forms, then create additional forms by
changing the item order and/or by taking some items from each of the original forms. Note, how-
ever, that mixing items from two or more unique forms to create a new form means that some parts
of the unique forms are compromised if the new form is stolen.
Equating
When alternate forms are used, it is necessary to equate test scores across forms so that a given
score on one form is equivalent to the same score on the other form. Although equating can intro-
duce some additional error into the selection process as opposed to using the same measure across
all applicants, careful attention to the process of test development and the establishment of equiv-
alent forms reduces such error. When samples are randomly equivalent and there are common
anchor items across test forms that have similar parameters across samples, several different equat-
ing methods from item response theory (IRT) yield similarly good results (Kim, Choi, Lee, & Um,
2008). In cases in which sample sizes are smaller (less than 500 per item) or IRT assumptions are
untenable, it is necessary to rely on other equating methods. There are two other kinds of equating
methods: (a) linear equating and (b) equipercentile equating. In linear equating, scores on two tests
are considered to be equated if they correspond to the same number of SD units from the mean.
Because linear equating is entirely analytical and does not require data at each point in the test
score range, it can map scores from one form to the other throughout the entire range of
scores while requiring smaller sample sizes. A disadvantage of linear
equating is that it requires the assumption that differences in the shapes of the raw-score distribu-
tions for each form are trivial.
In equipercentile equating, scores on two tests are considered to be equated if they correspond to
the same percentile rank (for details, see Livingston, 2004, pp. 17–23). A problem with equipercen-
tile equating is that it requires very large sample sizes to precisely equate the entire range of scores
on each test. In this method, large errors of estimation are likely in score ranges where data are
scant or erratic, so there must be many observations for each possible score on each form (Peterson,
Kolen, & Hoover, 1989). Methods are available that smooth the empirical distribution of the data,
allowing for more reasonable equipercentile equating in ranges of the scale with less data, assum-
ing the smoothed distribution is the correct one underlying the data (see Dorans, Pommerich, &
Holland, 2007). If the only score that needs to be equated is a cut score and only a pass-fail decision
is communicated to applicants, we recommend equating the cut score on the basis of equipercentile
equating because that will lead to the same pass rate within each form. Whether simpler methods
are as useful as IRT-based approaches is an empirical question, but given very similar empirical
outcomes between IRT and classical test theory (Fan, 1998; MacDonald & Paunonen, 2002), we
suspect they will often yield similar results.
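The two non-IRT methods just described can be sketched in a few lines. The means, SDs, and score vectors below are illustrative stand-ins for the statistics that would come from the equating samples for each form; the equipercentile version uses the raw empirical distributions with no smoothing.

```python
import numpy as np

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a Form X score to the Form Y scale so that it sits the same
    number of SD units from the mean (linear equating)."""
    return mean_y + sd_y * (x - mean_x) / sd_x

def equipercentile_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score to the Form Y score with the same percentile
    rank, using the raw empirical distributions (no smoothing)."""
    pr = np.mean(np.asarray(form_x_scores) <= x)   # percentile rank on Form X
    return float(np.quantile(form_y_scores, pr))   # matching Form Y score

# Form X is harder here, so a Form X score of 30 (one SD above the
# Form X mean) maps to one SD above the Form Y mean: 32
print(linear_equate(30, mean_x=25, sd_x=5, mean_y=28, sd_y=4))  # 32.0
```

Note that the equipercentile mapping is only as stable as the data at each score point, which is why smoothing (Dorans et al., 2007) matters in sparse score ranges.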
Online Testing
Where possible, we recommend using computer administration to enhance test security. Although
computer-based testing does not resolve all security issues (e.g., test-takers can still memorize items),
the elimination of paper forms that are more easily stolen is a clear advantage. However, to maintain
the security advantage of computer-administered tests, strict physical and information technology
(IT) security measures must be in place to protect against unauthorized access to the testing soft-
ware. Reputable vendors that specialize in online testing will have extensive security procedures for
protecting their intellectual property. We do not recommend that organizations conduct large-scale
testing using their own computer system unless extensive and up-to-date security measures are in
place and have been thoroughly tested for this purpose.
Unproctored Internet testing is increasingly common and presents unique problems for main-
taining test security. There are several things that can be done to help minimize test exposure and
motivation to cheat in this situation (Tippins et al., 2006). First, there should be a single point of
entry so that applicants can only complete the test one time (applicants should not be allowed to
retest to improve their score without the knowledge and approval of the hiring organization). The
system should recognize returning applicants so they are not allowed to take the test again. Second,
an applicant tracking system should be used that collects substantial identification information from
the applicant. Third, applicants should be warned about the consequences of cheating and how their
identity and their answers may be verified. Fourth, item order should be randomized with each
administration and/or items should be drawn from item banks so that the same set of items is not pre-
sented to all applicants in the same order. Finally, unproctored Internet tests might be used for initial
testing or to screen out candidates who are highly unlikely to be suitable for the job if they are then
followed up with a proctored test (see Nye, Do, Drasgow, & Fine, 2008). Cheating at the unproctored
stage of testing offers little advantage if the cheater is unlikely to pass the second proctored stage.
In a variation of this approach, it has been suggested that a proctored testing session could contain
shortened versions of the tests administered in the unproctored session to corroborate the
unproctored test scores (Segall, 2001). Scores that are too discrepant between the two sessions would
be called into question. Ultimately, organizations using unproctored Internet testing must accept that
the test items are in the public domain and will be exposed to anyone who really wants access.
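A minimal sketch of that verification logic: flag candidates whose unproctored score exceeds their proctored score by more than measurement error plausibly allows. The reliability values and the one-tailed cutoff are illustrative assumptions, not prescriptions from Segall (2001), and both scores are assumed to be on a common z-score metric.

```python
import math

def flag_discrepant(unproctored_z, proctored_z, rel_u, rel_p, crit=1.645):
    """Flag a candidate whose unproctored score exceeds the proctored
    verification score by more than chance measurement error would
    allow (one-tailed, alpha = .05; reliabilities are illustrative)."""
    # Classical standard error of the difference between the two scores
    se_diff = math.sqrt((1 - rel_u) + (1 - rel_p))
    return (unproctored_z - proctored_z) / se_diff > crit

# A two-SD drop from the unproctored to the proctored session is flagged
print(flag_discrepant(1.5, -0.5, rel_u=0.85, rel_p=0.85))  # True
```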
TESTING TIME
The amount of time available for testing is often a constraint that influences decisions about the
types of tests to be administered, the mode of administration, the number of tests included in a test
battery, the number of items in a test, and the number of stages in the selection system. For example,
if financial or other considerations place a strict time limit of 1 hour on test administration, that
limits the number of different constructs that can be assessed, the types of constructs that can be
assessed (e.g., reading comprehension takes longer than perceptual speed and accuracy), and the
testing method (e.g., SJTs generally take longer than biodata inventories). Our general advice when
presented with such constraints is to go for depth over breadth. It may be tempting to measure as
many different constructs as possible, but it is much better to measure a few things well than to mea-
sure many things poorly. Because testing time is finite, organizations should maintain their focus in
effectively measuring a relatively small handful of key constructs relevant for selection purposes,
fully recognizing that not all constructs of potential relevance can be measured. One way to reduce
the size of a test battery is to remove tests that are highly correlated with other tests in the battery
and do not provide incremental validity. Such a reduced test battery can often maintain its worth in
terms of reliability and validity for its intended purposes (see Donnellan, Oswald, Baird, & Lucas,
2006; Stanton, Sinar, Balzer, & Smith, 2002). In addition, many organizations cut down on testing
time by using unproctored web administration of noncognitive predictors (Tippins et al., 2006).
Those who meet a minimum score are invited to the proctored testing session where more tests
are administered. If the unproctored test scores can be linked to the proctored test scores via the
applicant name or identification number, then the number of constructs assessed can be increased
without increasing proctored testing time.
Power tests are defined as those tests for which speed is not relevant to the measurement of the
construct, such as for many cognitive ability and achievement tests. Thus, test-takers should be
given adequate time to complete the entire test. However, because unlimited time is not available
for test administration, a rule of thumb for a minimum amount of testing time for power tests is the
time it takes for 90% of the examinees to complete 90% of the items. Unlike power tests, speeded
tests require test-takers to perform quickly on simpler tasks within the time allotted. For paper-and-
pencil administration, speeded tests must be long enough, and the time limit short enough, that no one is
able to finish. The test score is then the number of correct responses minus a correction for guessing
(see Cronbach, 1990). For computer-administered tests, the computer can keep track of the time it
takes to complete a set number of items.
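The scoring rule for speeded tests can be stated compactly. This sketch uses the classical correction for guessing (rights minus wrongs divided by the number of options minus one); omitted items are neither rewarded nor penalized.

```python
def speeded_score(num_right, num_wrong, num_options):
    """Number-right score with the classical correction for guessing,
    R - W/(k - 1): under purely random guessing among k options, the
    expected net gain from guessing is zero. Omits are not penalized."""
    return num_right - num_wrong / (num_options - 1)

# 40 right and 10 wrong on 5-option items: 40 - 10/4 = 37.5
print(speeded_score(40, 10, 5))  # 37.5
```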
Issues surrounding equity in testing could dramatically influence decisions on the amount of
testing time allocated. Consider job applicants who are nonnative speakers of the language in which
the test is written. If it is safe to assume that language skills are not relevant to the construct being
assessed by the test, then testing time should be set so that nonnative speakers have time to complete
the test when they may take more time to read instructions and test items. Alternatively, tests could
be redesigned so that written language or other factors irrelevant to the construct are minimized,
such as administering a video-based form of a test instead of a traditional written form (Chan &
Schmitt, 1997; Weekley & Jones, 1997).
Consideration of individuals covered by the Americans with Disabilities Act (ADA; 1990) is
also an organizational imperative. If the test is not speeded, applicants with disabilities preventing
them from completing the test in the normally allotted time should be given extra time to complete
the test. Readers are referred to Campbell and Reilly (2000) for an excellent discussion of ADA
accommodations in testing.
Different test formats are sometimes unavoidable: some applicants may require that a test be read to them; tests administered in different countries may require that the
test be translated accurately into several languages; and organizations that cannot always administer
their test by computer must keep a paper form on hand. In some cases it is
reasonable to assume that differences in test format are merely cosmetic and have no bearing on
construct measurement. However, in many cases it is necessary to formally test whether format dif-
ferences lead to score differences that are irrelevant to the constructs being measured.
Statistical tests for measurement invariance allow one to determine whether constructs are being
measured in the same or similar way between two test formats. Ideally, scores from different test
formats would exhibit strict measurement invariance (e.g., similar patterns and magnitudes of factor
loadings, similar error variance estimates; Vandenberg & Lance, 2000). When only partial invariance
holds, further work is needed to determine which items behave differently across
test forms and need to be revised or deleted. Recent simulation and empirical work suggests that
partial invariance may only have a slight impact on selection outcomes in many practical situations
(Millsap & Kwok, 2004; Stark, Chernyshenko, & Drasgow, 2004), but the infancy of this work
prevents us from making any conclusive statements (for a current and comprehensive review of
tests of measurement invariance, see Schmitt & Kuljanin, 2008). Large sample sizes are desirable
in conducting appropriate tests for measurement invariance (see Meade & Bauer, 2007); otherwise,
one may have to rely on a strong rational basis for the equivalence of test formats (e.g., providing
evidence that the formats of the different tests do not influence the construct of interest and should
be unrelated to differences in the samples tested).
RETESTING
According to professional guidelines, allowing candidates to retest at a later date is a best practice
(Society for Industrial and Organizational Psychology, 2003; U.S. Department of Labor, 1999).
Because any single assessment can be influenced by various types of systematic measurement error
unrelated to the construct being measured (e.g., illness, extreme anxiety, not meeting the testing
prerequisites), it is reasonable to offer the opportunity to retest when it is appropriate and feasible.
Retesting is typically not a relevant concern for small-scale testing programs. If a small company
is testing to fill a specific position, the opportunity to retest cannot be offered after the position has
been filled. Candidates should be given the chance to retest when job opportunities are continuously
available, especially for internal candidates. On the other hand, how are the psychometric
qualities of the test affected by allowing candidates to take it two, three, or four times? Is the
purpose of the test defeated if a candidate can retake the test as many times as it takes to finally
pass? Score increases from retesting could be partially due to regression to the mean, because low
test-scorers are more likely to retest, and their retest scores would tend to increase by chance alone.
Alternatively, score increases could be due to practice effects on the test-specific content (e.g., mem-
orizing the correct answers) or in picking up on effective test strategies (e.g., learning that picking
the longest multiple-choice answer is beneficial when in doubt). These undesirable effects need to be
considered, if not controlled for or prevented by disallowing retests. On the other
hand, retest scores can also reflect desirable effects. Some test-takers could be less anxious about
the test when they retake it, which could lead to score increases even if their underlying knowledge
upon retest remains the same. Test-takers may also consider the types of questions they missed
when they first tested and concentrate subsequent study in those areas so that the retest reflects true
increases in the construct being measured.
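A quick simulation shows how much apparent retest gain regression to the mean alone can produce when only low scorers retest; the reliability and cut score below are illustrative assumptions, not estimates from any study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_score = rng.normal(0, 1, n)            # stable standing on the construct
test1 = true_score + rng.normal(0, 0.6, n)  # observed score = true + error
test2 = true_score + rng.normal(0, 0.6, n)  # retest with fresh, independent error

# Low scorers on the first test are the ones who retest
retesters = test1 < -0.5
gain = test2[retesters].mean() - test1[retesters].mean()
print(round(gain, 2))  # a substantial positive "gain" with no true change
```

Nothing about the examinees changed between administrations; the mean increase among retesters comes entirely from selecting on an error-laden first score.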
Some recent studies have examined the effect of retesting on scores for different types of tests.
In a study of admission exams for medical students, Lievens, Buyse, and Sackett (2005) found stan-
dardized mean gains in test scores (after correcting for test-retest reliability) of 0.46 for a cognitive
ability test, 0.30 for a knowledge test, and 0.40 for a SJT. Retesting was conducted using alternate
forms. Raymond, Neustel, and Anderson (2007) found standardized mean gains of 0.79 and 0.48 for
two different medical certification exams. In both cases, these gains were nearly the same whether
the identical test or a parallel test was used. This latter finding was contrary to a recent meta-
analysis of retesting effects on cognitive ability tests that found an adjusted overall effect size of
0.46 for identical forms and 0.24 for alternate forms (Hausknecht, Halpert, Di Paolo, & Moriarty
Gerrard, 2007). Overall, Hausknecht et al. (2007) found a gain of 0.26 adjusted SD units between
time 1 and time 2 and 0.18 between time 2 and time 3. The portion of these effects that was due
to regression to the mean was estimated to be less than 10%. Test coaching, defined as instruction
aimed at improving test scores (through learning either test-related skills or test-taking strategies),
had a large effect. The effect size for individuals who had some form of test coaching was 0.70, as
opposed to 0.24 for individuals who did not have any coaching. Another moderator was the type
of cognitive ability assessed, with tests of quantitative and analytical ability showing larger mean
gains than tests of verbal ability.
Do scores on a retest reflect the individual’s ability better than scores on the initial test? Using
a within-person analysis of different forms of a knowledge test, Lievens et al. (2005) found
significantly higher validity coefficients for retest scores than for initial scores. There were no
significant differences in validity coefficients for the cognitive ability test or the SJT. However,
in a follow-up study, Lievens, Reeve, and Heggestad (2007) conducted within-group analyses of
two administrations of the same cognitive ability test and found a validity coefficient for retest
scores of .00, compared to .19 for initial scores. The initial score correlation was not significant,
but it suffered from range restriction and a small sample size. The validity coefficient for the larger
group of examinees who did not retest was .48 (p < .01). Lievens et al. (2007) further showed that
the psychometric properties of the test were not invariant from time 1 to time 2, meaning that test
scores appeared to reflect different constructs across administrations. The initial test was more
strongly related to general cognitive ability, whereas the retest was more strongly related to memory.
This may have been because the retest was the same test rather than an alternate form as in Lievens
et al. (2005), allowing the test-taker to benefit from remembering the content from the initial test.
These findings are informative, but a great deal more research is necessary before conclusions can
be drawn about the relative validity of retest scores.
There is also a need for research on the effect of retesting on adverse impact. Specifically, are
expected score gains upon retesting equivalent for different gender or racial/ethnic subgroups?
Are candidates within different subgroups equally likely to retest when given the opportunity?
Questions such as these must be answered before it is possible to speculate on the impact of retesting
on adverse impact ratios found in practice.
A common question is how long a candidate should have to wait before being allowed to retest.
There is little research to inform this decision, although existing studies suggest that the length of
the time interval does not influence test score gains when alternate forms are used. Hausknecht et
al. (2007) noted that score gains tended to decrease as the time interval increased only for identical
forms. Raymond et al. (2007) found no effect of time delay on score increases regardless of whether
identical or alternate forms were used. Using identical forms, Burke (1997) found different retest
gains across components of a cognitive and psychomotor ability selection battery, but the retesting
time interval (ranging from 1 to 5 years) did not moderate these gains by any significant amount.
The appropriate time interval for retesting in employment settings depends on factors such as the
availability of alternate forms, cost of testing, the need to fill positions, and the type of test. A mini-
mum of 6 months between administrations has been a common rule of thumb for large-scale ability
or knowledge testing programs, and 30–60 days is more typical for certain types of skills tests (e.g.,
typing, software proficiency).
Choosing Predictors
A common problem faced by personnel selection researchers and practitioners is choosing a set of
predictors from a larger set of potential predictors for the purpose of creating a predictor battery.
In large organizations, an experimental predictor battery may have been assembled for a validation
study, and the choice of which predictors to include is based on the psychometric characteristics of
the predictors as measured in that study. In smaller organizations, the choice may be based on pub-
lished norms or meta-analysis results for different kinds of predictors. Often, there are additional
practical constraints such as test availability, cost of the tests, and required testing time.
Two goals are often at odds with one another when assembling a predictor
battery—maximizing criterion-related validity while simultaneously attempting to minimize
adverse impact against protected groups (e.g., racial/ethnic minorities or females). Creating a
composite of several valid predictors is a common strategy for reducing the degree to which a
selection procedure produces group differences (Campbell, 1996; Sackett & Ellingson, 1997;
Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997). The problem is that most valid predictors of
performance are cognitive in nature, and those predictors also tend to have the largest potential for
adverse impact. Therefore, adding a cognitive predictor that increases the validity of the composite
will often have the simultaneous effect of increasing adverse impact (Sackett & Ellingson, 1997).
Compounding the problem is the fact that adding a predictor with little adverse impact to a
predictor with large adverse impact typically does not reduce the adverse impact of the composite
to the extent that would generally be expected (Potosky, Bobko, & Roth, 2005; Sackett & Ellingson,
1997). Sackett and Ellingson (1997) gave an example of two uncorrelated predictors. One predictor
had a standardized mean subgroup difference (d) of 1.00 and the other had a d of 0.00. Most
researchers would expect that the two predictors would offset each other, so the d of an equally
weighted composite of the two predictors would be 0.50. In fact, the d of this composite would
be 0.71. Potosky et al. (2005) further demonstrated the difficulty in reducing adverse impact with
a predictor composite by pointing out that range restriction in many published studies results in
d values for some noncognitive predictors that are too small relative to the d values that exist in
applicant populations. In other words, the potential for adverse impact in many predictors has been
underestimated because d has been computed in range-restricted samples of job incumbents rather
than in the full range of job applicant samples. On the other hand, a meta-analysis of cognitive ability
and race/ethnic differences found that although overall Black-White d values hover around 1.0, the d
values are about 0.60 in high-complexity jobs (Roth, Bevier, Bobko, Switzer, & Tyler, 2001). Thus,
the results of these two meta-analyses suggest that the potential adverse impact of cognitive ability
may be less than previously estimated and the potential adverse impact of alternative predictors may
be greater than previously estimated.
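The arithmetic behind the counterintuitive 0.71 result is easy to reproduce. For a unit-weighted composite of standardized predictors, the composite d follows from the sum of the predictor d values and the composite SD; the sketch below assumes equal within-group predictor variances, in line with the Sackett and Ellingson (1997) example.

```python
import numpy as np

def composite_d(d_values, corr):
    """d for a unit-weighted composite of standardized predictors: the
    subgroup mean difference is the sum of the predictor d values, and
    the composite SD is the square root of the sum of all entries of
    the predictor intercorrelation matrix (equal within-group variances
    assumed)."""
    return np.sum(np.asarray(d_values)) / np.sqrt(np.sum(np.asarray(corr)))

# Two uncorrelated predictors, d = 1.00 and d = 0.00: the composite d
# is 1/sqrt(2) = 0.71, not the intuitive average of 0.50
print(round(composite_d([1.0, 0.0], [[1, 0], [0, 1]]), 2))  # 0.71
```

The composite mean difference is diluted only by the square root of the number of predictors, which is why adding a zero-d predictor reduces d far less than intuition suggests.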
The mathematical presentation of Sackett and Ellingson (1997) and the meta-analysis of Potosky
et al. (2005) demonstrated that reducing adverse impact by adding predictors to a composite is not
as easy as it seems at first glance. The take-away message is that reducing adverse impact is not
a simple matter of adding a noncognitive predictor or two to a predictor composite that includes a
measure of cognitive ability. Researchers and practitioners trying to balance validity against poten-
tial adverse impact when creating predictor composites should explore a wider variety of alterna-
tive predictors and weighting schemes rather than relying on offsetting the adverse impact of one
predictor with another.
How does one choose between alternative composites when the composite that has the highest
validity may have the worst adverse impact, and vice versa? To help inform the decision about which
of two or more potential predictor composites does the best job of maximizing validity and mini-
mizing adverse impact, Johnson, Abrahams, and Held (2004) created a combined validity/adverse
impact index score (VAI). This is a single number that indicates the extent to which a unit-weighted
composite score balances the level of criterion-related validity with the average standardized mean
subgroup difference. Importantly, this index gives more weight to higher levels of d. For example,
an increase in d from 0.00 to 0.10 is not as damaging as an increase from 0.50 to 0.60. The increases
are of the same magnitude, but in the first case potential adverse impact is still low and in the second
case potential adverse impact is becoming less acceptable. The composite that most effectively bal-
ances validity against potential adverse impact according to this procedure is the one at which VAI
is at its maximum. The VAI formula can be adjusted to reflect the relative value the user places on
validity and adverse impact.
Weighting Predictors
When multiple predictors are used, a decision must be made on how much weight to apply to each
predictor when computing a composite score. Two types of weights are considered here—statistical
and rational weights. Statistical weights are data-driven, whereas rational weights are specified by
the researcher, perhaps with input from SMEs. The most common statistical weights are derived using
multiple regression because regression weights maximize the prediction of the criterion in a sample.
However, regression weights have numerous limitations that suggest alternative weighting schemes
are often more desirable. First, criterion scores must be available, which would not be the case if a
content-oriented validation strategy is used. Second, regression weights focus entirely on prediction
so they cannot take into account adverse impact or other practical considerations. Third, regression
weights can be difficult to interpret and explain to stakeholders.
Finally, the question of ultimate interest is how well regression weights predict when they are
applied to data in other independent samples (e.g., job applicants), not to the specific sample in
which the weights were derived. There are two primary factors influencing the extent to which
regression weights derived in one sample will predict in another. First, sampling error variance
makes regression weights prone to inaccuracy compared to unit weights when sample sizes are
relatively small (i.e., less than about 180; Schmidt, 1971). Second, high predictor intercorrelations
(i.e., multicollinearity) lead to instability in regression weights, making them less applicable to
other samples (Green, 1977). Thus, regression weights derived in a small sample and/or with highly
intercorrelated predictor variables could be very different from the most stable or optimal weights
based on the population of interest.
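A small simulation makes the cross-validation point concrete. In this illustrative population, every predictor is equally valid and equally intercorrelated, so unit weights are optimal in the population; regression weights estimated in a small calibration sample then tend to predict slightly worse in large holdout "applicant" samples. All parameter values are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def validity(pred, crit):
    return np.corrcoef(pred, crit)[0, 1]

# Illustrative population: three predictors, equally intercorrelated (.50)
# and equally valid (.30), so unit weights are optimal in the population
k, rho, val = 3, 0.5, 0.3
cov = np.full((k + 1, k + 1), rho)
np.fill_diagonal(cov, 1.0)
cov[-1, :k] = cov[:k, -1] = val    # predictor-criterion correlations

reg_r, unit_r = [], []
for _ in range(200):
    calib = rng.multivariate_normal(np.zeros(k + 1), cov, 60)      # small N
    holdout = rng.multivariate_normal(np.zeros(k + 1), cov, 5000)  # "applicants"
    X = np.column_stack([np.ones(60), calib[:, :k]])
    b = np.linalg.lstsq(X, calib[:, -1], rcond=None)[0]            # regression weights
    reg_r.append(validity(holdout[:, :k] @ b[1:], holdout[:, -1]))
    unit_r.append(validity(holdout[:, :k].sum(axis=1), holdout[:, -1]))

print(round(float(np.mean(reg_r)), 3), round(float(np.mean(unit_r)), 3))
```

With larger calibration samples or more differentiated population weights, the ordering can reverse, which is consistent with the sample-size caveat above.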
De Corte, Lievens, and Sackett (2007) presented a procedure for weighting predictors in such
a way that the tradeoff between selection quality (e.g., validity, average criterion score of those
selected) and adverse impact is Pareto-optimized. Pareto optimization means that mean subgroup
differences are minimized for a given level of validity (or validity is maximized for a given level of
mean subgroup differences). The procedure applies optimization methods from the field of opera-
tions research and involves nonlinear programming. The authors offer a computer program for
application of the procedure. Unfortunately, the Pareto-optimal composites require down-weighting
cognitive ability considerably to reduce mean subgroup differences on the composite by a practi-
cally significant amount, and this compromises validity in most cases.
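The tradeoff that De Corte et al. (2007) optimize formally can be illustrated with a simple sweep over weights for two predictors. The validities, subgroup differences, and intercorrelation below are illustrative assumptions, and this sketch is a grid search, not their nonlinear-programming procedure.

```python
import numpy as np

# Illustrative values: a cognitive predictor (valid but with a large
# subgroup difference) and a noncognitive predictor (less valid, no d)
val = np.array([0.50, 0.30])   # criterion validities
d   = np.array([1.00, 0.00])   # standardized subgroup differences
r12 = 0.20                     # predictor intercorrelation

def composite_stats(w):
    """Validity and subgroup d of the composite w[0]*x1 + w[1]*x2."""
    R = np.array([[1, r12], [r12, 1]])
    sd = np.sqrt(w @ R @ w)
    return (w @ val) / sd, (w @ d) / sd

# Tracing weight combinations maps out the validity/adverse impact tradeoff
for w1 in (1.0, 0.7, 0.4, 0.1):
    v, dd = composite_stats(np.array([w1, 1 - w1]))
    print(f"cognitive weight {w1:.1f}: validity = {v:.2f}, d = {dd:.2f}")
```

Even this toy example shows the pattern the authors report: driving d down to a practically small value requires weighting the cognitive predictor so lightly that composite validity drops well below its maximum.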
Turning from empirical to rational weighting schemes, examples include job analysis ratings of
importance (Goldstein, Zedeck, & Schneider, 1993) and expert judgments of predictor importance
(Janz, Hellervik, & Gilmore, 1986). Johnson (2007) has found that weighting each predictor by the
number of performance dimensions to which it is relevant yields higher validities than does apply-
ing unit weights. Given the multidimensional nature of job performance, placing greater weight
on predictors that are likely to influence a wider range of outcomes is attractive from a conceptual
as well as a predictive standpoint. There are also other considerations that may influence the
weighting scheme, for better or for worse, such as equally weighting cognitive and noncognitive
portions of the predictor battery or weighting to please stakeholders
(e.g., the CEO thinks the interview should be given more weight than the cognitive test).
An important point to keep in mind when considering alternative weighting schemes is that the
weights that are directly applied to a set of predictors, called nominal weights, may not have the desired
effect. Brannick and Darling (1991) showed that variables are not actually weighted as intended unless
they are uncorrelated. This is because applying a nominal weight to one variable also applies an
implicit weight to each of the variables with which that variable is correlated. As a result, the effective
weight applied to a given variable is not the nominal weight, but instead reflects a combination of the
nominal weight and the implicit weights resulting from that variable’s correlation with each of the
other variables in the composite (see Guion, 1998). Because many components of a selection battery
are likely to be positively correlated to some extent, the composite scores created by weighting predic-
tors will not actually reflect the intended weighting scheme. Brannick and Darling (1991) presented
a procedure for translating nominal weights into effective weights, which, when applied, allow each
predictor to contribute to the composite in the manner that was originally intended.
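One simple way to see the dilution is to take each predictor's covariance with the composite as its effective contribution. This is an illustration of the idea, not Brannick and Darling's (1991) translation procedure.

```python
import numpy as np

def effective_weights(nominal, corr):
    """Proportional contribution of each predictor to the composite,
    taking its covariance with the composite (nominal weight plus the
    implicit weights carried through its correlations with the other
    predictors) as the effective contribution."""
    w = np.asarray(nominal, float)
    R = np.asarray(corr, float)
    contrib = R @ w                 # cov(x_i, composite) for each predictor
    return contrib / contrib.sum()  # scale to proportions

# An intended 3:1 weighting (75%/25%) is diluted to roughly 56%/44%
# when the two predictors correlate .60
print(effective_weights([3, 1], [[1, 0.6], [0.6, 1]]))
```

With uncorrelated predictors the effective proportions match the nominal ones exactly, which is the special case Brannick and Darling identified.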
Before spending time devising a weighting scheme and worrying about whether our nominal weights
correspond to our effective weights, we should first ask to what extent differential weighting
of predictors influences the overall composite score. Koopman (1988) and Ree, Carretta, and
Earles (1998) showed that very different sets of weights can lead to highly correlated composite
scores (often above .95). Bobko, Roth, and Buster (2007) reviewed the literature on the usefulness
of unit weights and concluded that unit weights are highly appropriate under many circumstances,
including when using a content-oriented validation strategy. We recommend applying unit weights
to standardized variables in most situations because different sets of weights often make little dif-
ference in the resulting composite score, effective weights rarely correspond to the nominal weights
applied to predictors, and unit weights have been shown to be superior or equivalent to other weight-
ing schemes (Bobko et al., 2007). These weights are the easiest to calculate, the easiest to explain,
and the most generalizable to different situations. Unless sample sizes are large, the number of
predictors is small, and predictor intercorrelations are low, it is probably best to keep weighting of
predictors simple (Bobko et al., 2007; Cohen, 1990).
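The insensitivity of composites to the choice of weights can be verified directly. For standardized predictors with intercorrelation matrix R, the population correlation between two weighted composites is w1'Rw2 / sqrt(w1'Rw1 * w2'Rw2). A brief Python sketch with hypothetical values (three predictors intercorrelated .30) shows that even tripling one predictor's weight leaves the composites correlated near .94:

```python
import numpy as np

# Hypothetical intercorrelation matrix: three predictors, all r = .30.
R = np.full((3, 3), 0.3)
np.fill_diagonal(R, 1.0)

w_unit = np.array([1.0, 1.0, 1.0])  # unit weights
w_diff = np.array([3.0, 1.0, 1.0])  # one predictor weighted 3x

# Population correlation between the two weighted composites.
num = w_unit @ R @ w_diff
den = np.sqrt((w_unit @ R @ w_unit) * (w_diff @ R @ w_diff))
print(round(num / den, 3))  # 0.937: very different weights, nearly identical rankings
```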
Going beyond choosing and weighting predictors, three review papers take a broader focus on
the frequent tradeoff between maximizing validity and reducing adverse impact (Kravitz, 2008; Ployhart &
Holtz, 2008; Sackett, Schmitt, Ellingson, & Kabin, 2001). These authors examined a wide variety
of organizational strategies indicating which strategies appear to have promise (e.g., administering
tests in a video format to reduce an unnecessary language burden), which strategies have not been
as fruitful (e.g., modifying tests based on differential item functioning statistics), and which longer-
term strategies lack empirical evidence but may be promising (e.g., engaging the organization in
community-based projects in a broad effort to increase visibility and attract more qualified minority
applicants).
decision was made, such as top-down selection versus banding; Truxillo & Bauer, 1999, 2000). The
type of score information provided should depend on the purpose of the assessment and the nature
of the applicant. For example, if an assessment center is used to evaluate internal candidates for
promotion, it is probably beneficial to the organization and the candidates to provide extensive feed-
back on how the candidate did on each exercise. This provides feedback on developmental needs
that could lead to targeted training interventions and performance improvement among employees
(see Chapter 20, this volume, for details on providing assessment feedback to applicants). On the
other hand, if a standardized test is used to screen out many external applicants who do not meet a
minimum cut score, no more than a pass/fail decision need be communicated. If a pool of potential
candidates who exceed a cut score is identified, but the highest-scoring individuals in that pool will
be given the first opportunity for job openings, then some information about where the individual
stands among other candidates (e.g., a ranking or percentile score) would be appropriate to com-
municate an idea of how likely a job offer is in the near future.
Raw test scores are often transformed to make them more interpretable to applicants. This is espe-
cially important when the selection instrument contains tests that do not have right and wrong answers
(e.g., personality tests). To avoid confusion, we recommend not reporting negative scores, which
can occur with raw scores from biodata inventories or with z-scores. A common transformation
for score reporting is to standardize scores to have a mean of 50 and SD of 10 (called T-scores).
This provides information about how the applicant did on each test but does not explicitly state the
number correct. Another transformation is to convert the total score to a 100-point scale, and when
a cut score is used, the cut score can be set at a value such as 70. This reporting method is easy to
understand, because scores are similar to grades in school, and provides applicants with a good idea
of where they stand compared to the maximum. A downside is that it may incorrectly imply that the score
represents the percentage of items answered correctly. Scores may also be placed into categories for
the purposes of communication to management or other constituencies (e.g., “excellent,” “good,” “bor-
derline,” “poor”). These coarser categories are sometimes used for selection, although that will reduce
the validity of the selection measure, as discussed in the next section.
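The transformations described above are straightforward to compute. A minimal Python sketch (using a hypothetical test mean and SD) illustrates the T-score conversion:

```python
def t_score(raw, mean, sd):
    """Transform a raw score to a T-score (mean 50, SD 10)."""
    z = (raw - mean) / sd
    return 50 + 10 * z

# Hypothetical test with mean 30 and SD 6:
print(t_score(30, 30, 6))  # 50.0 (average applicant)
print(t_score(39, 30, 6))  # 65.0 (1.5 SDs above the mean)
print(t_score(21, 30, 6))  # 35.0 -- no negative scores to explain
```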
METHODS OF SELECTION
Given a selection process that shows linear prediction of a meaningful criterion, top-down selection
using a linear composite of the standardized scores is the standard method for maximizing the util-
ity of criterion-related validity coefficients, assuming there is no nonrandom attrition such as when
the top-ranked talent decides to take a job elsewhere (Murphy, 1986). Any alternative to top-down
selection usually means compromising the validity and utility of the test at least to some extent, and
sometimes to a great extent (Schmidt, Mack, & Hunter, 1984; Schmidt, 1991). However, as we have
seen, maximizing validity often means high adverse impact, so many larger organizations look for
alternatives to top-down selection. Three major alternatives are prominent in the selection litera-
ture. The first alternative is setting a cut score, above which applicants are selected and below which
applicants are rejected. Often everyone who passes the test is considered qualified, but the number
of job openings is smaller than the number of qualified applicants. Job offers may then be made
in a top-down fashion, making the cut score almost irrelevant. Alternatively, other considerations
may come into play, such as job experience or other additional skill sets. In these cases, selection is
operating more like a multiple-hurdle selection system.
164 Handbook of Employee Selection
In fact, setting a cut score for a test as part of a multiple hurdle selection system is the second major
alternative to top-down selection. Those scoring above the cut score will move to the next stage of the
selection process, which may be another test or something as simple as an unstructured interview or
reference check. It is possible to set up a multiple-hurdle selection system such that the selection cutoffs
and the order of the predictors reduce mean subgroup differences and adverse impact ratios while
retaining the highest possible levels of mean predicted performance (De Corte, Lievens, & Sackett,
2006; Sackett & Roth, 1996). Ideally this approach could be Pareto-optimized, as we discussed with
regard to weighting predictor composites (De Corte et al., 2007), but the multivariate conditions are
very complex to solve even with nonlinear programming. Multiple hurdles offer the advantage of
reducing the overall cost of the selection procedure, because all applicants do not complete each com-
ponent. This allows the more expensive tests (e.g., computer simulations) or time-intensive tests (e.g.,
assessment center) to be administered to smaller numbers of applicants at the end of the process. This
increases the utility of the selection system compared with a compensatory approach by reducing the
cost of selection, but it depends on the assumption that low scores at an early stage of selection can-
not be compensated for by high scores at a later stage. A disadvantage of multiple hurdles is that the
reliability of the entire selection system is reduced because the reliability of the entire system is the
product of the reliabilities of each element of the system (Haladyna & Hess, 1999).
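This reliability penalty is easy to quantify. A short Python sketch with hypothetical component reliabilities shows how quickly the product degrades below any single hurdle's reliability:

```python
import math

# Reliability of a multiple-hurdle system as the product of the
# component reliabilities (Haladyna & Hess, 1999). Values hypothetical.
hurdle_reliabilities = [0.90, 0.85, 0.80]
system_reliability = math.prod(hurdle_reliabilities)
print(round(system_reliability, 3))  # 0.612: lower than any single hurdle
```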
The third alternative is the use of test score banding procedures. Banding is a broad term
that encompasses any selection procedure that groups test scores together and considers them to
be equivalent. For example, standard error of the difference (SED) banding considers scores to
be equivalent unless they are significantly different from each other (Cascio, Outtz, Zedeck, &
Goldstein, 1991). Several empirical papers provide information on the necessary tradeoff between
maximizing validity or mean predicted performance and minimizing adverse impact that results
from the use of banding (e.g., Campion et al., 2001; Sackett & Roth, 1991; Schmitt & Oswald,
2004). The tradeoff tends to be larger when the bands are larger, the selection ratio is smaller, and
the standardized mean difference between groups is larger. However, these rules are not set in stone
because results also depend on the banding method and how the size of the band happens to align
with the cutoff point for selection in a particular data set. Ironically, there is very little evidence that
banding has much of a practical effect in reducing adverse impact (Barrett & Lueke, 2004), except
in the case of top-down selection of protected group members within bands (Sackett & Roth, 1991).
This is not a viable strategy because the Civil Rights Act of 1991 prohibits selection on the basis of
protected class status without a consent decree, and random selection within bands may actually
increase adverse impact (Barrett & Lueke, 2004).
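For reference, the width of an SED band follows directly from the test's reliability. A short Python sketch with hypothetical values (SD = 10, reliability = .91) illustrates the computation:

```python
import math

# Standard-error-of-difference (SED) banding (Cascio et al., 1991).
# Hypothetical test: SD = 10, reliability = .91.
sd, reliability = 10.0, 0.91
sem = sd * math.sqrt(1 - reliability)  # standard error of measurement: 3.0
sed = sem * math.sqrt(2)               # SE of the difference between two scores
band_width = 1.96 * sed                # scores within this range are not
                                       # significantly different (p < .05)
top_score = 85
print(round(band_width, 2))                # 8.32
print(round(top_score - band_width, 2))    # 76.68: banded with the top score of 85
```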
Although there may be many situations in which the use of these alternatives to top-down selec-
tion makes sense for the organization, we do not recommend adopting them solely for the purpose
of reducing adverse impact. When cognitive ability measures are incorporated into a selection bat-
tery, large tradeoffs between adverse impact and validity are often impossible to avoid. Although
nothing in today’s U.S. legal system bars an organization from sacrificing validity to reduce adverse
impact, doing so fails to take full advantage of what selection research on validity has to offer
(Pyburn, Ployhart, & Kravitz, 2008). Furthermore, we would agree that the courts could perceive a
procedure that decreases the validity of the predictors (cognitive and noncognitive) as a decrease in
the job relevance of the selection system. Addressing adverse impact concerns in a proactive manner
should go well beyond simply assigning applicants to coarse categories (e.g., active recruitment of
minorities, fostering a climate for diversity, engagement in the community).
A widely used judgmental method is the Angoff (1971) method, in which expert judges estimate the
probability that a minimally qualified candidate will answer each test item correctly. The estimates are
summed across items to calculate the expected value of the mean test score for minimally qualified
candidates. A criticism of the Angoff method is that judges generally find it difficult to estimate
these probabilities. In fact, they tend to overestimate them, resulting in higher cut scores than those
determined by other methods. The cut score is often subsequently adjusted by lowering it one or two
standard errors of measurement of the test. Despite this general problem in estimation, Angoff-like
methods have the advantage of being well received by the courts (Kehoe & Olson, 2005).
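The arithmetic of the Angoff method is simple, as a brief Python sketch with hypothetical judge ratings shows:

```python
# Angoff method sketch (hypothetical judge ratings): each judge estimates
# the probability that a minimally qualified candidate answers each item
# correctly; item means are summed to give the raw cut score.
judge_ratings = [
    [0.8, 0.6, 0.7, 0.9],  # judge 1, items 1-4
    [0.7, 0.5, 0.9, 0.9],  # judge 2, items 1-4
]
n_items = len(judge_ratings[0])
item_means = [sum(j[i] for j in judge_ratings) / len(judge_ratings)
              for i in range(n_items)]
raw_cut = sum(item_means)
print(round(raw_cut, 2))       # 3.0 expected-correct items out of 4

# Because judges tend to overestimate, the cut is often lowered by one
# or two SEMs of the test (SEM value hypothetical):
sem = 0.4
adjusted_cut = raw_cut - 1 * sem
print(round(adjusted_cut, 2))  # 2.6
```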
Empirical methods for establishing cut scores are based on the relationship between test per-
formance and criterion performance, so a criterion-related validation study is required to use these
methods. In the regression technique, the criterion score associated with successful job performance
is determined and regression is used to find the test score corresponding to that predicted criterion
score. Forward regression regresses criterion scores on test scores, and reverse regression regresses
test scores on criterion scores. These methods produce different cut scores, so both methods could
be used to produce a range of cut scores and expert judgment could be used to set the appropriate
cut score within that range (Mueller et al., 2007).
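A small Python sketch with hypothetical validation data illustrates how the two regression directions yield different cut scores:

```python
import numpy as np

# Hypothetical validation data: test scores (x) and criterion scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.0, 3.0, 5.0, 5.0])
y_min = 3.0  # criterion level defined as minimally successful performance

# Forward regression: regress y on x, solve for x giving predicted y = y_min.
b_fwd, a_fwd = np.polyfit(x, y, 1)
cut_forward = (y_min - a_fwd) / b_fwd

# Reverse regression: regress x on y, predict x directly at y = y_min.
b_rev, a_rev = np.polyfit(y, x, 1)
cut_reverse = a_rev + b_rev * y_min

print(round(cut_forward, 2), round(cut_reverse, 2))  # 2.56 2.61 -- they differ
```

The two values bracket a range within which expert judgment can locate the operational cut score, as Mueller et al. (2007) suggest.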
Expectancy charts are a useful way of presenting information on the relationship between test
performance and criterion performance and can be used to help set a cut score. Expectancy charts
can graphically display expected criterion performance or the percentage of those selected who
would be expected to be successful on the job given a set of alternative cut scores. The advantages
of using expectancy charts to set cut scores are (a) they are easy for decision-makers to understand,
(b) the cut score is based on expected criterion performance, and (c) the courts have shown support
for these types of methods (Kehoe & Olson, 2005).
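The computation underlying an expectancy chart is simply the conditional success rate among those at or above each candidate cut score, as a Python sketch with hypothetical data shows:

```python
# Expectancy-chart sketch: for each candidate cut score, the percentage of
# selected applicants expected to be successful. Data are hypothetical.
scores  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
success = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # 1 = successful on the job

def pct_successful(cut):
    """Percentage successful among applicants scoring at or above `cut`."""
    selected = [s for sc, s in zip(scores, success) if sc >= cut]
    return 100 * sum(selected) / len(selected)

for cut in (3, 5, 8):
    print(cut, round(pct_successful(cut)))
# cut 3 -> 75% successful; cut 5 -> 83%; cut 8 -> 100%
```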
The method used to select cut scores deserves a great deal of attention because cut scores have
increasingly been subject to legal challenge. When cut scores are used, test users should put as much
effort into setting cut scores as they put into establishing the validity of the test. Test users should
carefully consider the need for a cut score and determine if a top-down selection strategy would be
more desirable. If a good business necessity case can be made, defending a top-down strategy may
be easier than defending a cut score because there is a great deal of room for interpretation in the
legal and professional literature on cut scores (e.g., what constitutes minimum qualifications). Test
users must be careful to have a clear and consistent rationale based on sound professional judgment
for what is done at each step of the cut score setting process. Because cut scores are likely to remain
in use in many selection contexts, future research could profitably investigate further the major sub-
stantive and methodological factors involved in the justification and setting of cut scores.
this meta-analysis also suggests that fit measures may be useful tools postselection for developmental
purposes, such as when working with employees who may develop performance issues or who are
withdrawing and may be considering leaving the organization because of some type of misfit.
A major issue with selection for any type of fit is calculating a fit score. Common techniques are
difference scores and correlations. Difference scores suffer from several methodological problems
(Edwards, 1994), including the confounding of effects of the variables composing the difference
score and attenuation of reliability. Correlations reflect similarity in profile shape but not absolute
differences between scores. Arthur et al. (2006) found stronger relationships when fit was calculated
via correlations than via difference scores. Polynomial regression is the appropriate analysis method
when evaluating the relationship between fit and a criterion (Edwards, 1994), but this cannot be used
for within-subject analyses to assign scores to individuals. As yet, there is no ideal way to measure
fit (Kristof-Brown et al., 2005).
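The difference between these two indices can be made concrete with a short Python sketch (using hypothetical value profiles): a profile correlation captures similarity in shape, whereas a difference-score index captures absolute discrepancy.

```python
import numpy as np

# Hypothetical 1-9 ratings on five work characteristics.
org_profile    = np.array([8, 3, 6, 2, 7])
person_profile = np.array([7, 2, 6, 3, 8])

# Profile correlation: similarity in shape, ignoring absolute level.
r_fit = np.corrcoef(org_profile, person_profile)[0, 1]

# Difference-score index (here, mean squared difference): absolute
# discrepancy, but it confounds the contributions of the two profiles.
d_fit = np.mean((org_profile - person_profile) ** 2)

print(round(r_fit, 2))  # 0.93: high shape similarity
print(d_fit)            # 0.8: small absolute discrepancy
```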
On the basis of current research, we recommend that P-O fit only be used for selection when the
goal is to minimize turnover and a local validation study is possible. A procedure similar to that of
McCulloch & Turban (2007) holds promise and should be legally defensible. They had call center
managers describe the characteristics of the call center using a Q-sort measure to sort 54 work
descriptors using a 9-point scale in a forced normal distribution. This defined the call center profile.
Call center representatives sorted the same descriptors in terms of how much they valued each char-
acteristic. The P-O fit score was the correlation between the individual profile and the call center
profile. P-O fit was correlated .36 with employee retention and uncorrelated with job performance.
Personality has also been used as a content domain for assessing P-O fit (Kristof-Brown et
al., 2005). Tett, Jackson, Rothstein, and Reddon (1999) showed that personality scales that are
positively related to a performance dimension in some situations may have legitimately negative
correlations with the same performance dimension in other situations. For example, a person high
in agreeableness may do well in an organization that has a team-based, cooperative culture but
may have difficulty in an organization with a culture that is highly competitive and adversarial.
This suggests that the first organization should select applicants that are high on agreeableness
and the second organization should select applicants that are low on agreeableness, but selecting
individuals who are low on a positive trait measure is probably not the best approach. Rather, the
second organization should include a measure of competitiveness in the selection system and use
top-down selection.
CONCLUSIONS
Clearly, the use of test scores in personnel selection is embedded within a large and multifaceted
context. Given that selection is only one part of a set of system-wide organizational policies and
practices, there are many important questions worth investing the time to answer. In this
chapter, we highlighted questions that must be asked (a) at the outset of the
testing program, (b) with regard to collection of test scores, (c) when computing test scores, and (d)
when making selection decisions. Test users should be aware of the legal implications of each deci-
sion made to avoid potential litigation problems. They should also be aware of the organizational
implications of each decision to ensure that the testing program is consistent with other organiza-
tional goals. When used properly by an experienced professional, test scores have great potential to
positively influence organizational outcomes.
REFERENCES
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measure-
ment (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Arthur, W., Bell, S. T., Villado, A. J., & Doverspike, D. (2006). The use of person-organization fit in employment deci-
sion making: An assessment of its criterion-related validity. Journal of Applied Psychology, 91, 786–801.
Test Administration and the Use of Test Scores 167
Barrett, G. V., & Lueke, S. B. (2004). Legal and practical implications of banding for personnel selection. In
H. Aguinis (Ed.), Test-score banding in human resource selection: Technical, legal, and societal issues
(pp. 71–111). Westport, CT: Praeger.
Bobko, P., Roth, P. L., & Buster, M. A. (2007). The usefulness of unit weights in creating composite scores: A
literature review, application to content validity, and meta-analysis. Organizational Research Methods,
10, 689–709.
Brannick, M. T., & Darling, R. W. (1991). Specifying importance weights consistent with a covariance struc-
ture. Organizational Behavior and Human Decision Processes, 50, 395–410.
Burke, E. F. (1997). A short note on the persistence of retest effects on aptitude scores. Journal of Occupational
and Organizational Psychology, 70, 295–301.
Campbell, J. P. (1996). Group differences and personnel decisions: Validity, fairness, and affirmative action.
Journal of Vocational Behavior, 49, 122–158.
Campbell, W. J., & Reilly, M. E. (2000). Accommodations for persons with disabilities. In J. F. Kehoe (Ed.),
Managing selection in changing organizations (pp. 319–370). San Francisco, CA: Jossey-Bass.
Campion, M. A., Outtz, J. L., Zedeck, S., Schmidt, F. L., Kehoe, J. F., Murphy, K. R., & Guion, R. M. (2001).
The controversy over score banding in personnel selection: Answers to 10 key questions. Personnel
Psychology, 54, 149–185.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge
University Press.
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test
score use in personnel selection. Human Performance, 4, 233–264.
Chan, D., & Schmitt, N. (1997). Video versus paper-and-pencil method of assessment in situational judg-
ment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied
Psychology, 82, 143–159.
Chang, H. H., & Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items on adaptive
testing. Psychometrika, 73, 441–450.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York, NY: Harper Collins.
De Corte, W., Lievens, F., & Sackett, P. R. (2006). Predicting adverse impact and mean criterion performance
in multistage selection. Journal of Applied Psychology, 91, 523–537.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Donnellan, M. B., Oswald, F. L., Baird, B. M., & Lucas, R. E. (2006). The Mini-IPIP scales: Tiny-yet-effective
measures of the Big Five factors of personality. Psychological Assessment, 18, 192–203.
Dorans, N. J., Pommerich, M., & Holland, P. W. (2007). Potential solutions to practical equating issues. New
York, NY: Springer.
Edwards, J. R. (1994). The study of congruence in organizational behavior research: Critique and a proposed
alternative. Organizational Behavior and Human Decision Processes, 58, 51–100.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person
statistics. Educational and Psychological Measurement, 58, 357–381.
Gibson, W. M., & Weiner, J. A. (1998). Generating random parallel test forms using CTT in a computer-based
environment. Journal of Educational Measurement, 35, 297–310.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process.
In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 3–34). San Francisco,
CA: Jossey-Bass.
Green, B. F. (1977). Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 12,
264–288.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Haladyna, T., & Hess, R. (1999). An evaluation of conjunctive and compensatory standard-setting strategies for
test decisions. Educational Assessment, 6, 129–153.
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A
meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology,
92, 373–385.
Huselid, M. A., Jackson, S. E., & Schuler, R. S. (1997). Technical and strategic human resource management
effectiveness as determinants of firm performance. Academy of Management Journal, 40, 171–188.
Janz, T., Hellervik, L., & Gilmore, D. (1986). Behavior description interviewing: New, accurate, cost effective.
Boston, MA: Allyn and Bacon.
Johnson, J. W. (2007). Synthetic validity: A technique of use (finally). In S. M. McPhail (Ed.), Alternative
validation strategies: Developing new and leveraging existing validity evidence (pp. 122–158). San
Francisco, CA: Jossey-Bass.
Johnson, J. W., Abrahams, N., & Held, J. D. (2004, April). A procedure for selecting predictors considering
validity and adverse impact. Poster presented at the 19th Annual Conference of the Society for Industrial
and Organizational Psychology, Chicago, IL.
Kehoe, J. F., & Olson, A. (2005). Cut scores and employment discrimination litigation. In F. J. Landy (Ed.),
Employment discrimination litigation: Behavioral, quantitative, and legal perspectives (pp. 410–449).
San Francisco, CA: Jossey-Bass.
Kim, D-I., Choi, S. W., Lee, G., & Um, K. R. (2008). A comparison of the common-item and random-groups
equating designs using empirical data. International Journal of Selection and Assessment, 16, 83–92.
Koopman, R. F. (1988). On the sensitivity of a composite to its weights. Psychometrika, 53, 547–552.
Kravitz, D. A. (2008). The validity-diversity dilemma: Beyond selection—The role of affirmative action.
Personnel Psychology, 61, 173–193.
Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005). Consequences of individuals’ fit at work:
A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel
Psychology, 58, 281–342.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). Retest effects in operational selection settings: Development and
test of a framework. Personnel Psychology, 58, 981–1007.
Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on
cognitive ability tests in selection settings. Journal of Applied Psychology, 92, 1672–1682.
Lievens, F., & Sackett, P. R. (2007). Situational judgment tests in high-stakes settings: Issues and strategies
with generating alternate forms. Journal of Applied Psychology, 92, 1043–1055.
Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: Educational Testing Service.
MacDonald, P., & Paunonen, S. V. (2002). A Monte Carlo comparison of item and person characteristics based
on item response theory versus classical test theory. Educational and Psychological Measurement, 62,
921–943.
McCulloch, M. C., & Turban, D. B. (2007). Using person-organization fit to select employees for high-turnover
jobs. International Journal of Selection and Assessment, 15, 63–71.
Meade, A. W., & Bauer, D. J. (2007). Power and precision in confirmatory analytic tests of measurement invari-
ance. Structural Equation Modeling: A Multidisciplinary Journal, 14, 611–635.
Millsap, R. E., & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two
populations. Psychological Methods, 9, 93–115.
Mueller, L., Norris, D., & Oppler, S. (2007). Implementation based on alternate validation procedures: Ranking,
cut scores, banding, and compensatory models. In S. M. McPhail (Ed.), Alternative validation strategies:
Developing new and leveraging existing validity evidence (pp. 349–405). San Francisco, CA: Jossey-
Bass.
Murphy, K. R. (1986). When your top choice turns you down: Effect of rejected offers on the utility of selection
tests. Psychological Bulletin, 99, 133–138.
Nye, C. D., Do, B.-R., Drasgow, F., & Fine, S. (2008). Two-step testing in employment selection: Is score infla-
tion a problem? International Journal of Selection and Assessment, 16, 112–120.
Oswald, F. L., Friede, A. J., Schmitt, N., Kim, B. H., & Ramsay, L. J. (2005). Extending a practical method for devel-
oping alternate test forms using independent sets of items. Organizational Research Methods, 8, 149–164.
Peterson, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.),
Educational measurement (3rd ed.). New York, NY: American Council on Education and MacMillan.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and
sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and alternative measures
to predict job performance and reduce adverse impact: Corrected estimates and realistic expectations.
International Journal of Selection and Assessment, 13, 304–315.
Pyburn, K. M., Jr., Ployhart, R. E., & Kravitz, D. (2008). The validity-diversity dilemma: Overview and legal
context. Personnel Psychology, 61, 143–151.
Raymond, M. R., Neustel, S., & Anderson, D. (2007). Retest effects on identical and parallel forms in certifica-
tion and licensure testing. Personnel Psychology, 60, 367–396.
Ree, M., Carretta, T., & Earles, J. (1998). In top-down decisions, weighting variables does not matter: A con-
sequence of Wilks’ theorem. Organizational Research Methods, 1, 407–420.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive
ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
Sackett, P. R., & Ellingson, J. E. (1997). The effect of forming multi-predictor composites on group differences
and adverse impact. Personnel Psychology, 50, 707–721.
Sackett, P. R., & Roth, L. (1991). A Monte Carlo examination of banding and rank order methods of test score
use in personnel selection. Human Performance, 4, 279–295.
Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on
performance and minority hiring. Personnel Psychology, 49, 549–572.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. (2001). High-stakes testing in employment, creden-
tialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist, 56,
302–318.
Schmidt, F. L. (1971). The relative efficiency of relative and simple unit predictor weights in applied differen-
tial psychology. Educational and Psychological Measurement, 31, 699–714.
Schmidt, F. L. (1991). Why all banding procedures in personnel selection are logically flawed. Human
Performance, 4, 265–277.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for
three modes of test use. Journal of Applied Psychology, 69, 490–497.
Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. A. Weekley &
R. E. Ployhart (Eds.), Situational judgment tests: Theory, measurement, and application (pp. 135–156).
Mahwah, NJ: Lawrence Erlbaum.
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human
Resource Management Review, 18, 210–222.
Schmitt, N., & Oswald, F. L. (2004). Statistical weights of ability and diversity in selection decisions based on
various methods of test-score use. In H. Aguinis (Ed.), Test-score banding in human resource selection:
Technical, legal, and societal issues (pp. 113–131). Westport, CT: Praeger.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and efficiency among
various predictor combinations. Journal of Applied Psychology, 82, 719–730.
Segall, D. O. (2001, April). Detecting test compromise in high-stakes computerized adaptive testing: A verifica-
tion testing approach. Paper presented at the annual meeting of the National Council on Measurement in
Education, Seattle, WA.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Stanton, J. M., Sinar, E. F., Balzer, W. K., & Smith, P. C. (2002). Issues and strategies for reducing the length
of self-report scales. Personnel Psychology, 55, 167–194.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item functioning
and differential test functioning on selection decisions: When are statistically significant effects practi-
cally important? Journal of Applied Psychology, 89, 497–508.
Tett, R. P., Jackson, D. N., Rothstein, M., & Reddon, J. R. (1999). Meta-analysis of bi-directional relations in
personality-job performance research. Human Performance, 12, 1–29.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006).
Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Truxillo, D. M., & Bauer, T. N. (1999). Applicant reactions to test score banding in entry-level and promotional
contexts. Journal of Applied Psychology, 84, 322–339.
Truxillo, D. M., & Bauer, T. N. (2000). The roles of gender and affirmative action in reactions to test score use
methods. Journal of Applied Social Psychology, 30, 1812–1828.
U.S. Department of Labor. (1999). Testing and assessment: An employer’s guide to good practices. Washington,
DC: U.S. Department of Labor, Employment and Training Administration.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature:
Suggestions, practices and recommendations for organizational research. Organizational Research
Methods, 3, 4–70.
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality
measurement. Educational and Psychological Measurement, 59, 197–210.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25–49.
8 Technology and
Employee Selection
172 Handbook of Employee Selection
TABLE 8.1
How Technology Has Changed I-O Practice and Educational Requirements

What Is at Stake
• Retaining roles as functional architects of HR systems vs. system administrators.
• Setting strategy and making decisions vs. acting as service providers.
• Moving I-O psychology forward with theories and analytical methods required to understand modern work environments vs. missing opportunities to advance the discipline in step with technology.
In this chapter, we first provide an overview of common technology-based systems and their
benefits and limitations. Then, we cover a core set of critical issues common to technology-based
selection systems—are they always the best approach, do they add utility, and are they fair and
defensible? The issues include measurement equivalence, unproctored Internet administration,
advanced delivery methods, demographic issues regarding technology, applicant reactions, and
legal considerations. Next, we discuss a range of implementation issues confronting the practi-
tioner: how to build and maintain the system; how to transition from other systems and/or legacy
paper-pencil processes; and how to integrate between selection systems and other components of
the talent management process. Finally, we discuss current HR technology and how to stay up to
date. This section will address technology advances that are new as of this writing, the drivers for
other innovation trends that appear likely for the future, and references for staying current in HR
technology applications.
The next section highlights some of the common technologies currently in use by organizations
for personnel selection.
software hubs for additional services (e.g., job posting and background checking) to further extend
the value they provide to the hiring process through automation.
In addition to the main functions of data tracking and management, an ATS will enable storage
and reporting of candidate quality and flow rates throughout the staffing process. This allows
for computation of effectiveness metrics, such as the success rates of recruiters and recruiting
channels, time to hire, and the criterion validity of candidate information. Data storage and
reporting are also critical for understanding how the system as a whole is operating with respect to
critical outcomes such as the diversity mix of the candidate pool at each stage of selection. These
reports are required to support government recordkeeping requirements and to respond to audit and
challenge requests.
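The stage-by-stage reporting described above can be sketched briefly. The Python fragment below uses hypothetical group labels and counts to compute pass rates for one selection stage and impact ratios against the highest-scoring group, with the four-fifths rule as an illustrative benchmark; it is a sketch of the computation, not a substitute for a full adverse impact analysis.

```python
# Sketch: stage pass rates and impact ratios from ATS funnel data.
# Group names and counts below are hypothetical.

def pass_rates(stage_counts):
    """Map each group to advanced/entered for one selection stage."""
    return {g: advanced / entered for g, (entered, advanced) in stage_counts.items()}

def impact_ratios(rates):
    """Ratio of each group's pass rate to the highest group's rate."""
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

screening = {"group_a": (1000, 600), "group_b": (800, 400)}
rates = pass_rates(screening)
ratios = impact_ratios(rates)
flagged = [g for g, r in ratios.items() if r < 0.80]  # four-fifths rule check
print(rates, ratios, flagged)
```

Running the same computation at each stage of the funnel shows where group differences in flow rates emerge.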
On its own, an ATS typically provides very little sophistication to support the measurement of
people, so supplemental processes are often added to support the measurement required for strong
selection decisions, and these tools are usually required to integrate with the processes and data
formats supported by the ATS.
experiences (e.g., with equipment, work processes, or business issues common to the job). Because
it is possible for HR administrators and hiring managers to build these questions themselves within
the software, sometimes with little oversight, it is important to emphasize that screening questions,
like test questions, are pre-employment tools that must be job-relevant and valid. This requirement
can create a challenge for the broad distribution of the user rights to construct these automated
screening questions.
Sophisticated versions of these screening tools are based on the same measurement principles as
the weighted application blank technique (e.g., England, 1971), in which scores are set based on the
value of a particular response for predicting job success. Here again, the sophistication of the mea-
surement should be balanced against the qualifications of the users. Unlike a weighted application
blank, where an I-O psychologist would likely guide the design of the tool, the design of the scoring
scheme in many automated tools may be delegated to a broader range of system users. Software
developers and users may perceive this feature as a benefit, but the flexibility comes with substantial
risk. For example, if they are not constructed carefully, basic qualification questions can be too strict
or bear little relationship to performance on the job. Well-designed systems will include
controls that limit access to question construction, scoring, and deployment to qualified users.
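A minimal sketch of this weighted-scoring logic, with hypothetical items, options, and weights, is shown below; in a production system the weights would be derived empirically, and, as noted above, the ability to edit them should be restricted to qualified users.

```python
# Sketch of weighted-application-blank scoring: each response option carries
# a weight, and a candidate's score is the sum of the chosen options' weights.
# Items, options, and weights here are hypothetical.

ITEM_WEIGHTS = {
    "years_experience": {"0-1": 0, "2-4": 2, "5+": 3},
    "shift_availability": {"days_only": 1, "any_shift": 2},
}

def score_candidate(responses, weights=ITEM_WEIGHTS):
    return sum(weights[item][answer] for item, answer in responses.items())

candidate = {"years_experience": "2-4", "shift_availability": "any_shift"}
print(score_candidate(candidate))  # -> 4
```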
Another inherent limitation with automated screening tools stems from the fact that they are
designed to be deployed early in a selection process when the commitment of the job seeker to the
selection process can be low. This places operational pressure to keep screening content brief, so
the candidate is not overburdened early in the process. For example, many employers desire their
screening and application process to require no more than 20 or 30 minutes, and this requirement
places restrictions on the types of questions that may be used. Additionally, screening questions that
are embedded in the application are broadcast widely, usually on the employer’s “Careers” page of
its website. Because screening content is frequently distributed broadly on open-access websites,
ability, knowledge, skill, and other question types in which right/wrong scoring is typical should be
avoided because of concerns for item security and cheating by candidates. Security considerations
for Internet-based selection content will be expanded on later in this chapter.
Automated Testing
Compared with screening tools, testing provides greater insight into individual characteristics by
deploying psychometric instruments that can provide more accurate measurement of constructs that
are difficult to index with screening questions, such as abilities, traits, and knowledge.
Automated delivery systems for these tools tend to have several common features. Test-takers
are typically invited to take the assessment by providing secure log-in and password information;
standardized instructions, help menus, and practice items are then provided to orient the test-taker.
During the test session, several features are deployed to facilitate test-taking, including count-down
timers, progress indicators, and other navigational aids. Many test delivery systems also include
advanced features such as the capability to present audio, video, and animated graphics as part of
the question stimuli; they may also allow for alternative response formats such as “drag and drop”
controls to provide a greater flexibility for handling a range of item types.
Despite the rapid introduction of new technology into test delivery, it can be argued that little
has changed regarding the test instruments themselves; it is the conditions under which tests are
administered that have changed more dramatically. Many systems allow for the deployment of a
test in any location where the Internet is accessible, thereby eliminating the role of the test proc-
tor. Internet delivery of tests has raised new debate over the appropriate security conditions for
these measures (Naglieri et al., 2004). Professional standards and guidelines for testing dictate the
importance of material security, standardization of the test environment, and the control of factors
that may impact test performance aside from the construct being assessed (American Educational
Research Association, American Psychological Association, and National Council on Measurement
in Education, 1999; Society for Industrial and Organizational Psychology [SIOP], 2003). However,
operational pressures push toward remote deployment methods because they allow employers to
176 Handbook of Employee Selection
efficiently gain insight into job candidates before they are brought on-site for more expensive activi-
ties, assuming that the test results are not damaged by the lack of standardization in the testing
environment. We return to this issue in more detail later in the chapter.
In the sections that follow, we describe some of the issues that arise because of the increased use
of technology in the selection process.
The literature lacks such studies that specifically use pre-employment tests (Salgado
& Moscoso, 2003, is a lone exception employing a within-group design). New studies
have introduced other interesting factors that may play a role. For example, Meade, Michels, and
Lautenschlager (2007) found scores to be comparable for many personality constructs; however,
there were exceptions when the authors introduced having (or not having) a choice in whether the
test was taken via paper or Internet. They also distinguish between computerized and Internet-
based tests in the literature, pointing out that bandwidth and Internet connectivity affect test down-
load speeds, which in turn affect administration time and therefore measurement, particularly for
speeded tests. (Note that several studies cited in this section were focused on computer-based rather
than Internet-based testing, including George et al., 1998; Mead & Drasgow, 1993; and Richman
et al., 1999.) Ideally, future research will examine equivalency between paper, computer, and online
personnel tests using repeated-measure designs.
Examinee Identification
Currently there is no way to verify that the person taking the test on the Internet is the actual can-
didate. Tippins et al. (2006) described and critiqued options for retesting all or some candidates
(e.g., randomly selected individuals) and the practical challenges of administering this program
and interpreting any score differences that are observed between administrations. In the future,
biometric verifications, real-time video communications, and other technology may make examinee
identification more practical.
Test Security
Security is a concern mostly for the publisher and user of the test. The user might be concerned
about his or her data, although the encryption protocols for Internet-based tests can theoretically be
as good as those for any other data exchange. Although the delivery system can block the ability
to share and distribute the test using the local computer’s operating system, there is no stopping the
determined candidate from using other technology to circumvent that security (e.g., by taking pic-
tures of the computer screen). Test publishers typically include a statement that the candidate must
agree to in order to access the test, stipulating that he or she will take the test honestly and will not
distribute the information. The organization also may attempt to limit access to the test to serious
candidates who have been prescreened rather than opening the test content to the general public.
Cheating
Clearly, candidates who want to cheat will find that unproctored Internet testing provides them
with opportunities because no one is watching them take the test. To date there have been many
conference presentations, but few publications, on the effect of cheating on test results. In the
related literature on personality testing, meta-analyses of social desirability, response distortion, and
faking provide evidence that paper personality tests do not appear to differ from computerized tests
in this regard (Dwight & Feigelson, 2000; Richman et al., 1999). It is likely that mitigating
factors in a candidate’s decision to cheat will include the existence of “right” answers (such as those
on an ability test) versus opinions or attitudes (such as in personality tests in which the scales of
interest and the right answers are less obvious to the candidate) and the extent to which the test event
is seen as a hurdle (if not viewed as pass/fail, there is less at stake and therefore less motivation to
cheat). Therefore, the organization might attempt to deemphasize the high-stakes nature of the test, for
instance, by describing it as one data point among many. Warning candidates that they may be retested
in a proctored second hurdle might also deter some cheaters.
Test Standardization
Dispensing with proctoring limits the ability to control the test environment, leading to situa-
tions that may disadvantage some candidates. In addition to the distractions that may be present
in an unsupervised setting, hardware and unforeseen software issues may arise. Differences
between examinees’ computers and Internet connections could affect testing, particularly on
speeded tests. The remedy for this is to use as little local software as possible—ideally, with
current technology this means a browser-based test—and to have instructions and troubleshooting
tips available for the examinee to confirm optimal usability of the test system prior to
administration.
As shown in this section, unproctored testing brings ethical issues to the forefront. The threat of
cheating or unfair conditions creates an ethical dilemma for I-O psychologists, whose duty is to pro-
vide a fair and consistent test experience for all and an equal opportunity for test takers to perform
up to their level of ability or trait.
on a unitary, well-defined construct, and it is thus better suited for testing cognitive abilities
or knowledge areas versus personality or other softer constructs. CAT also requires a large
investment in the item bank development and calibration research. A single adaptive scale that
is designed for use with test-takers of wide ability ranges can easily require hundreds of test
items, each scaled on over a thousand test-takers. Even with the extra security and precision
afforded by a CAT, if the test is deployed without a proctor, the measure will still likely require
some form of follow-up test to estimate whether the candidate took the initial test without
assistance. Despite these concerns, the increase in unproctored testing provides a business
rationale for the investment in these tools by a broader range of organizations than has been
the case in the past, so it is possible that the theory and techniques behind adaptive testing may
increase in their popularity.
In practice, the use of CAT has been more prominent in high-stakes certification testing than
in the employment context; however, Internet-based CAT has now been developed for high-
volume selection (e.g., Gibby, Biga, Pratt, & Irwin, 2008). Techniques have also been developed
for applying adaptive testing technologies to personality assessment (e.g., McCloy, Heggestad, &
Reeve, 2005; Roberts, Donoghue, & Laughlin, 2000; Stark, Chernyshenko, Drasgow, & Williams,
2006), and applications of personality-based CAT are under development for application in the
armed services (e.g., Borman et al., 2008). The reemergence of CAT as a tool for Internet-based
selection has also contributed to confusion among lay people and software programmers alike as
to the differences between CAT and branched item delivery. Branched items (sometimes referred
to as “chained” items) are constructed to represent test content where a series of events has inter-
dependencies, and the choice of a response to an initial item determines the next item in the chain.
For example, a series of questions about medical diagnosis may begin with a set of symptoms and
the respondent must choose between several diagnostic options. On the basis of the option the
respondent picks, the next test question will provide results from their choice and the next step
must then be chosen. In this manner, a decision tree may be constructed to determine whether
the respondent knows the correct path toward a diagnosis. The ability to branch item presentation
on the basis of prior responses is also enabled by computer-based test delivery. However, unlike
adaptive delivery, branching in this manner does not meet the assumption of local independence
of test items that is required for adaptive testing. Thus, branched tests are able to reflect complex
test content, but they do not operate in the same manner and with the same benefits as adaptive
tests. Nevertheless, the concept of branching is appealing and several vendors of Internet test
engines actively market the capability as adaptive testing. Branching is perceived as adding real-
ism because later items can represent the consequences of prior actions. Although this feature
may add interest to low-stakes assessments such as post-training knowledge checks, there are
reasons to be skeptical regarding the quality of these measures if they violate assumptions
of the underlying measurement theory. As is the case with many aspects of technology-driven
assessment, the challenge is the education of lay decision-makers on the differences between
these methodologies.
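The distinction can be made concrete with a small sketch. The branched delivery below is a fixed decision tree over hypothetical medical-diagnosis items: a (current item, response) pair simply selects the next item, and nothing in the logic estimates ability or item information as a true CAT would.

```python
# Sketch of branched ("chained") item delivery: the pair (current_item,
# response) determines the next item. Item IDs and responses are hypothetical.

BRANCHES = {
    ("symptoms", "order_bloodwork"): "bloodwork_results",
    ("symptoms", "order_xray"): "xray_results",
    ("bloodwork_results", "diagnose_anemia"): "END_CORRECT",
    ("bloodwork_results", "diagnose_flu"): "END_INCORRECT",
}

def next_item(current_item, response):
    return BRANCHES.get((current_item, response), "END_INCORRECT")

path = ["symptoms"]
for response in ("order_bloodwork", "diagnose_anemia"):
    path.append(next_item(path[-1], response))
print(path)  # -> ['symptoms', 'bloodwork_results', 'END_CORRECT']
```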
To enable better understanding of these advanced testing techniques, students of I-O psychol-
ogy should be educated on appropriate test theory and methods. For instance, item response the-
ory (IRT) will become increasingly important because it provides the foundation for CAT and it
improves measurement over classical test theory (CTT; Embretson, 2000). Historically, many I-O
psychology graduate programs have taught CTT but have placed less emphasis on IRT, which is
more complex and arguably less accessible if covered as one section of a general survey course on
testing. The inclusion of a full IRT course in the curriculum would be advantageous for the field.
be presented in digitized video clips during an assessment of applicants for customer service posi-
tions, thereby providing a more realistic context than is possible through written scenarios and more
standardized than is possible with live role players.
Advances in computer animation, driven largely by the growth of the computer gaming industry,
have provided another set of tools that are gaining in popularity for assessment. Some assessment
platforms have been developed that incorporate animated avatars and simulated work environments
in an attempt to accommodate a wider range of stimuli. Questions remain as to whether the elicited
responses are adequately representative of real-world actions; however, visual appeal and the ability
to simulate rare events will likely spark enough applications of these assessments to provide a test
bed for research on these questions.
Additionally, new forms of test items can be created by combining animation with graphical
tools such as drag-and-drop controls. For example, work processes that involve the placement of
objects in a location and sequence can be modeled in animated environments. Test takers then click
and drag images to the correct location to complete the assessment. These item types can be used
to simulate processes such as food preparation, housekeeping rules, assembly tasks, and medical
procedures.
Aside from the measurement benefits of higher fidelity stimulus presentation and enhanced
standardization, increased use of high-fidelity item types may also serve to engage job appli-
cants in the assessment task. This outcome is of increasing value in a recruiting market geared
toward younger job seekers who may become frustrated with lengthy application and assessment
processes.
A final benefit from high-fidelity item presentation may be in its potential to reduce score
differences among demographic groups. Outtz (1998) presented findings suggesting that video
presentation of stimuli may reduce observed score differences. Although this finding has not
been widely replicated (e.g., Weekley & Jones, 1997), and the reasons behind it remain unclear
(hypotheses include increased social context cues, reduced reading demands, or a psychometric
artifact due to differences in reliability between assessment methods), methods for better
understanding this pervasive issue would be advantageous and deserve further
investigation.
The use of high-fidelity item presentation also raises various challenges and issues. From a prac-
tical perspective, the deployment of assessments that include video, audio, and/or graphical stimuli
requires additional server capacity for the provider, bandwidth at the delivery site, and compatible
playback software on the local computer used for the assessment. Installation and maintenance of
these components can be complex and expensive, and failures can quickly erode the benefits of
standardization across assessment participants.
For developers of assessment software, the apparent benefits of higher fidelity presentation must
be counter-balanced with the challenges posed by the specificity of the stimuli. To appreciate this
point, consider the customer service assessment mentioned above. Should the assessment provider
wish to deploy the assessment in a healthcare environment, the stimulus video would likely need to
be reproduced in a healthcare setting. In contrast, if the assessment were simply a written SJT, the
adaptation could be as easy as changing the word “customer” to “patient.” In general, the higher
the realism provided in the assessment stimuli, the less generalizable that assessment is likely to be
across the environments where it could be deployed.
We offer one final note of caution regarding computer-driven high-fidelity assessments.
Psychologists have long been seduced by the idea that microbehaviors are closely related to
brain function and therefore can reveal important truths about individuals. With automation,
this reductionism is fed by various new microbehaviors that may be collected in the course of
computer-delivered assessments. Mouse-over or “hover-time” measures, click patterns, page-
view counts, and similar metrics are possible to extract from some assessment delivery vehi-
cles. To be sure, adequate research has yet to be conducted on the potential value that may be
hidden in these data piles; our prediction is that these endeavors are more likely to be a red
herring than a guidepost on the path toward understanding and predicting important workplace
behavior.
DEMOGRAPHIC CONSIDERATIONS
Computer ownership and usage are not ubiquitous. Although access rates to computers and the
Internet have risen dramatically since the mid-1990s (National Telecommunications and Information
Administration [NTIA], 2002; Pew Research Center, 2007), there remains a large enough shortfall
in certain portions of the population that employers need to consider the impact of adopting auto-
mated assessment processes on their resultant applicant pools. There are two reasons why employ-
ers should be concerned about the accessibility of technology. First, limited access to the Internet
may discourage some applicants from applying. Second, automation could negatively affect the
attitudes of some applicants toward the hiring process and the company.
In many cases, requiring applicants to use the Internet as an entry into the hiring process will
have a negligible effect on the composition of the applicant pool; however, the degree to which this
projection is true likely depends on a range of factors such as the level of the job, the location and
type of work involved, and the characteristics of the available qualified job seekers in the market.
The extent of the impact should be considered as recruitment processes and automated tools for
screening and assessing applicants are implemented. For example, data from the NTIA and the Pew
Research Center show that Internet usage at home has increased rapidly for all segments of society
in the past few years. Furthermore, Internet usage patterns also show that job search is a com-
mon activity for demographic groups who use the Internet outside of their homes (NTIA, 2002).
However, it is also important to note that some groups (such as those over age 65 and those with
less than a high-school education) have lower Internet usage rates (i.e., below 50% as of 2007; Pew
Research Center, 2007).
When evaluating the potential impact of Internet access rates on applicant pool demographics,
employers should collect evidence on the use of the Internet for job search by different demographic
groups. Carefully monitored demographics can reveal whether some segments of the applicant
population are not accessing the selection process, suggesting that parallel application processes should be
pursued. However, given the trend toward wider Internet access, increasing computer literacy, and
strong competition for labor, organizations will risk losing qualified candidates if an Internet-based
recruitment strategy is not pursued to some extent.
of these effects may be magnified if negative reactions are more predominant for individuals
belonging to legally protected demographic groups. Internet-based selection techniques may be
especially susceptible to negative reactions if some applicant groups do not have easy Internet
access.
Generally, however, research has not shown large differences in applicant reactions to selection
procedures as a function of race (Hausknecht, Day, & Thomas, 2004), although under some conditions
these differences may emerge more prominently (Ryan & Ployhart, 2000). When group differ-
ences in reactions toward technology-based selection procedures have been examined directly, few
meaningful differences have been found for racial groups, whereas age group differences were
potentially more of a concern (Reynolds & Lin, 2003).
In sum, the available research suggests that some of the more negative assumptions regarding
reactions to technology have not been realized. In fact, traditionally disadvantaged groups may
perceive some benefit to the objectivity provided by an automated system. Compared with traditional
processes that may have required a visit to the employer to complete an application, finding
access to the Internet may prove to be less of a challenge. More work in this area will be required
to better understand the factors that affect applicant perceptions of technology-based selection tools
and their potential to influence broader attitudes and behavior during the application and socializa-
tion process.
LEGAL CONSIDERATIONS
Despite the “wild west” attitude inspired by many software vendors who have reshaped recruitment
and selection processes, all of the usual employment laws apply to these new techniques. Less well
known are the risks some of the newer procedures may carry for employers because traditional con-
cepts related to the formal submission of a job application are reshaped by new technologies. Two
areas of regulation have focused most directly upon Internet-based hiring: the U.S. federal rules
regarding recordkeeping for Internet applicants and European data protection regulations.
• All expressions of interest that are submitted through the Internet or related technologies
that the employer considers for a particular position, regardless of whether the individual
is declared an “Internet applicant”
• Internal resume databases, including items for each record such as the date each was
added, the positions for which each search of the database was made, and the criteria used
for each search
• External database search information, such as the positions for which each search was
made, the date of the search, the search criteria used, and records for each individual who
met the basic requirements for the position
• All tests, test results, and interview notes
• The gender, race, and ethnicity of each Internet applicant
The OFCCP also provided elaborate guidance on what constitutes a basic qualification. These are
qualifications that are either advertised (e.g., posted on a website) or established prior to considering
any expression of interest for a particular position (e.g., when searching an external resume data-
base). Furthermore, basic qualifications must meet three criteria: (a) they must not involve compari-
son of qualifications between applicants (e.g., 2 years of engineering experience is noncomparative;
the most experienced engineer in a pool of engineering applicants is comparative), (b) they must be
objective (i.e., not dependent upon judgment), and (c) they must be “relevant to performance of the
particular position and enable the [employer] to accomplish business-related goals.” By definition,
passing an employment test cannot be a basic qualification.
In 2004, an interagency task force proposed additional guidance in the form of
five new Questions and Answers regarding the Uniform Guidelines on Employee Selection
Procedures (Equal Employment Opportunity Commission, U.S. Department of Labor, U.S.
Department of Justice, and Office of Personnel Management, 2004). These questions and answers
clearly linked Internet-based selection processes to the rules set forth in the Uniform Guidelines.
They also defined Internet applicants, but their definition differs from the OFCCP version. These
questions and answers have not been issued in final form, and given evident differences with the
OFCCP’s finalized rules, they may never be published in final form.
differing standards for varying circumstances; therefore, privacy protection in the United States is
guided more by limited legislation and regulation, and by self-regulation.
Organizations that seek to deploy Internet-based HR systems that involve international data
transfers have several compliance options. Two approaches are most common. First, organizations
can seek permission from data providers (i.e., job seekers) regarding how and where their data will
be processed. Second, U.S.-based organizations may join a Department of Commerce program that
certifies them as a safe harbor for personal data. This certification states the organization’s willing-
ness to adhere to seven Safe Harbor Privacy Principles that the Commerce Department negotiated
with the EU. This program, and the EU laws to which it relates, are described at http://www.export.
gov/safeharbor. The seven principles are as follows:
In addition to the EU rules, a growing number of U.S. laws deal with privacy considerations. For
example, California has enacted laws that require a privacy statement to be posted on
any site that collects personal information from California residents, and additional rules require
the disclosure of any breach of personal data security.
Several organizations have been established to help companies comply with the various aspects
of the expanding protections and rules regarding data privacy. Services such as TRUSTe will review
data privacy policies and procedures in light of the applicable rules and will act as arbiters to settle
disputes over data handling, thereby satisfying the seventh safe harbor principle of enforcement that
requires the presence of a formal resolution mechanism.
Clearly, the liabilities associated with these responsibilities need to be carefully examined with
respect to any HR processes, and online recruitment and selection processes are of particular con-
cern because of their broad reach to the public at large. Companies that use online recruitment and
selection processes should be aware of these privacy considerations and take steps to ensure their
online tools are compliant; otherwise, Internet-based systems that collect information broadly from
job seekers will raise risks associated with liability for data processing that is inconsistent with EU
and U.S. laws.
Case Law
As of this writing, few cases have moved through the courts that relate to the appropriate use
of technology in the hiring process; cases relating more generally to data privacy are more
prevalent (e.g., FTC v. Toysmart.com, 2000). However, more of these cases are sure to emerge
and provide further guidance regarding the risks and boundaries of technology-based selection
practices.
IMPLEMENTATION ISSUES
I-O psychologists must consider a range of issues when implementing technology-driven selection
systems. This section discusses the issues in terms of the development, administration, and support
of technology-based selection systems.
DEVELOPMENT
Designing and managing technology-based selection systems draws on the analytical and
decision-making skills of an I-O psychologist, yet it also requires skills that are not central to the
discipline, such as business acumen and IT expertise. Development includes making a business case
for purchasing or building a solution, acting as an information architect for the new system, and
managing the transition from traditional to technology-based systems.
TABLE 8.2
Costs of Implementing Technology-Based Selection

Development
• Software development and/or customization
• Technology integration (with applicant tracking systems, enterprise resource planning, etc.)
• Equivalency studies (paper/computer/Internet comparisons)
• Hardware (as applicable); for example, fax machines or scanners for scoring paper tests electronically

Deployment
• Field testing/quality assurance
• System hosting fees
• Installation costs if locally hosted
• Account setup (process flows, permissions, reporting requirements)
• Training

Maintenance
• Tracking and reporting
• Upgrades
• Security monitoring
• Backups and failure recovery
The solution must also be compatible with the organization’s special requirements (such as security protocols and firewalls). Whether the solution is created in-house or procured from a vendor, the IT department must assist with its implementation.
Many systems exchange data using extensible markup language (XML), which consists of tags, or ways of describing the data contained in a software system (Weiss, 2001). The tags facilitate the transfer of data from one system to another, providing the information the receiving system will need to process it. However, not all systems currently use XML, and systems may define data differently; the manager of the integration project must understand these differences for the implementation to succeed.
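As a simple illustration of how tags describe data for a receiving system, consider the sketch below. The element names (TestResult, CandidateId, etc.) are hypothetical and not drawn from any particular schema; a real integration would follow an agreed standard such as those of the HR-XML Consortium.

```python
import xml.etree.ElementTree as ET

# Hypothetical tags describing a candidate's test result. The tags tell
# the receiving system what each value means.
result = ET.Element("TestResult")
ET.SubElement(result, "CandidateId").text = "A12345"
ET.SubElement(result, "TestName").text = "Cognitive Ability Screen"
ET.SubElement(result, "RawScore").text = "42"

# The sending system serializes the data to a string for transfer...
xml_string = ET.tostring(result, encoding="unicode")

# ...and the receiving system parses the same tags to recover the data.
parsed = ET.fromstring(xml_string)
score = int(parsed.find("RawScore").text)
print(score)  # 42
```

The point of the example is that both systems must agree on the tag names and their meanings; when two systems define the same data differently, the integration manager must reconcile those definitions before transfer can work.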
ADMINISTRATION
When administering the system, the organization must pay special attention to the ways in which
information is accessed, processed, and stored.
Access
Different types of users will have varying levels and methods of access to the selection tools and
systems. The organization must have policies detailing when, or if, test information is available to a
user (i.e., applicants, HR staff, and hiring managers). For instance, external applicants might be able
to take tests only after a screening and approval process categorizes them as viable candidates (e.g.,
application blank, phone screen). The system might provide internal applicants, but not externals,
with a feedback report about their results. HR and other administrative staff might have different
levels of access, from an administrator with basic privileges (access to deliver tests and see only
basic results), to an HR manager with purchasing authority (buy tests, view applicant inventory, see
detailed results), to analysts and I-O psychologists (run database queries of raw data, conduct item
analyses). Hiring managers might have another type of access, allowing them to see only those
candidates who are eligible for interviews.
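The tiered access just described can be sketched as a role-to-permission mapping. The role and permission names below are invented for illustration and do not describe any specific product.

```python
# Illustrative role-based access model for a selection system.
ROLE_PERMISSIONS = {
    "applicant":      {"take_test", "view_own_feedback"},
    "administrator":  {"deliver_test", "view_basic_results"},
    "hr_manager":     {"deliver_test", "view_basic_results",
                       "purchase_tests", "view_detailed_results"},
    "analyst":        {"query_raw_data", "run_item_analysis"},
    "hiring_manager": {"view_interview_eligible_candidates"},
}

def can(role: str, permission: str) -> bool:
    """Return True if the given role holds the given permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(can("hr_manager", "purchase_tests"))     # True
print(can("administrator", "query_raw_data"))  # False
```

A mapping like this makes the organization's access policies explicit and auditable, which matters when the same system serves applicants, HR staff, analysts, and hiring managers with very different entitlements.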
Automation
Organizations must decide to what extent the information will be processed by rules and automa-
tion, and to what extent HR experts will be involved to make judgments and carry out transactions.
The system can enable users to do work that formerly had been conducted by others, such as work-
force and adverse impact analyses that business managers and HR generalists could conduct them-
selves rather than requesting from an I-O psychologist. Having this capability for self-service brings
into question the role of the I-O psychologist in the system and the extent to which expertise should
be offloaded to automation. On the one hand, when automation and user training are effectively
combined, the psychologist can be freed from routine, transactional work to become more strategic.
On the other hand, there is a risk of misuse of the system if results are interpreted incorrectly or if
sensitive data fall into the wrong hands.
Data Storage
The method of storage partly determines the administration possibilities. For instance, a data-
base structure that permits data integration across systems, whether through XML technology as
described above or through other means, opens the possibilities for different types of selection tools
and other HR data to be part of a larger transactional database. Thus, test results could be used for
succession planning, individual development plans, etc.
SUPPORT
The selection system will not be a static entity. As technology progresses, the organizational struc-
ture changes, and HR processes are upgraded, there will be a need for consulting and dedicated
staff time in the form of technical support, ongoing maintenance, and user training. When provided
by a vendor, these services should be included in a service-level agreement as part of the contract
for the vendor’s services.
Technical Support
Expert support for the technology-based selection system includes services to each major group
of stakeholders (i.e., candidates, HR staff, and business managers). As with any selection system,
whether traditional or technology-based, candidates may need support to access information about
tests and eligibility for testing and may also require troubleshooting to access and take tests that are
delivered unproctored over the Internet. HR staff may encounter technical difficulties requiring IT
support. Business managers may need guidance from HR, including I-O psychologists, about the
use of the system and interpretation of results.
Maintenance
Ongoing maintenance, updates, and revisions (e.g., content and feature changes to hosted websites)
are another dynamic aspect of the selection system. Some routine downtime is expected for main-
tenance as the system is checked for ongoing quality assurance and upgrades. These activities must
be high priorities for the vendor and the organization’s internal IT department.
Staff Training
There should be multiple methods for training, including live sessions, websites with frequently
asked questions, and self-guided training modules. Retraining should be readily available as the
organization changes, the system advances, and staff turns over. Even with the best-designed inter-
face, failure to provide adequate staff training can derail the selection system through improper
administration of test procedures and policies. Further, as test data become increasingly integrated
into the broader HR strategy of attracting, developing, and retaining talent, the trend will be toward
increasingly complex systems that require more training.
Importantly, the overarching trend that influences development, implementation, and support
is technology integration. Many technology providers are building market share by supporting a
broader array of HR functions across the “talent management lifecycle,” requiring the integra-
tion of data from recruitment and organizational entry through management and career progres-
sion. In the future, it may be rare for a technology-based selection system to be implemented and
administered in isolation from other systems. The facets of talent management will be integrated
into a single enterprise resource planning (ERP) database for use in making strategic decisions
about human capital. As mentioned earlier, there are several implications for I-O psychologists.
Researchers and practitioners will be able to obtain data that are harder to come by today (e.g.,
performance and return-on-investment data for use in validation studies, program evaluation, and
employee-organization linkage research). Psychologists will be able to earn a broader role in strat-
egy-setting and decision-making if they are able to provide an analytical, forward-thinking use of
the information. Most importantly, I-O psychologists will be able to broaden their own expertise,
as well as improve the field’s knowledge and credibility for research and practices that have true
organizational impact.
FUTURE DIRECTIONS
Looking ahead, the practice of employee selection will likely continue to experience rapid change
as a result of technology advancements. Many of the topics described in this chapter are in their
infancy; as long as technology ventures continue to be a potentially high-return (albeit high-risk)
investment, funding will continue to drive innovation into areas that will impact HR tools and pro-
cesses. On the basis of advancements that are already implemented in adjacent industries and trends
already in place, several areas of innovation for selection practices are likely.
Regarding assessment presentation formats, advancements in the construction and animation
of computer graphics will continue to provide assessment developers with new options for stimuli.
Importantly, the technology to construct graphical stimuli will become simpler to use and less
expensive to acquire, thereby allowing for broader application and experimentation. Tools are avail-
able now for constructing animated sequences that can be configured by end users. As these tools
are implemented and experimental data are collected, the potential operational and measurement
benefits will be better known. These foundational tools will continue to be funded from other indus-
tries, but applicants will likely see increased use of these features as a side benefit as these technolo-
gies develop further.
The push toward integrated media components will also continue. Technologies used at work will
be more likely to involve integrated interactive video and audio capability, and as these tools become
more commonplace, assessment delivery systems will also incorporate them to improve the fidelity
of stimulus presentation and to increase the realism of the experience. Many approaches to assess-
ment can benefit from these technologies, but assessment centers and other forms of work simulation
will likely be at the forefront because these techniques emphasize stimulus and response fidelity, and
these in turn create the need for process efficiencies that can be addressed by technology. Several
professional workshops have already been devoted to the application of these technologies in assessment
centers and simulations (e.g., Howard, Landy, & Reynolds, 2002; Rupp & Fritzsche, 2008).
Response fidelity will also increase as technology evolves. Currently, high-fidelity responses (e.g., speech or written text) are often scored by trained assessors in the employment context. High-volume
applications of assessments in education are already deploying handwriting and text recognition
tools and scoring algorithms that relate to human ratings of the subject matter (e.g., Landauer,
2006). These tools should be expected to migrate to employment testing and assessment as these approaches are refined. Speech recognition is also likely to improve to a point where it
can be more readily deployed for assessment purposes. The combination of more accurate speech-
to-text tools and the advancements of semantic recognition algorithms will allow for more complex
assessment to be deployed on a greater scale than we see today.
Technology advancement may also expand the set of variables we are able to measure and use
in research and practice. For example, some research has pointed to the relationship between timed
response latency and trait strength (Siem, 1991), suggesting a possible tool for detecting distortion
and deception in test responses.
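A minimal sketch of how latency data might serve as one distortion signal appears below. The threshold and flagging rule are invented for illustration; any operational use would require validation research of the kind cited above, because fast responding can also reflect legitimate trait strength.

```python
# Flag item responses answered implausibly fast, one possible signal of
# non-effortful or scripted responding. The threshold is illustrative.
FAST_THRESHOLD_MS = 800

def flag_fast_responses(latencies_ms):
    """Return the indices of responses faster than the threshold."""
    return [i for i, t in enumerate(latencies_ms) if t < FAST_THRESHOLD_MS]

latencies = [1200, 450, 2100, 300, 950]  # milliseconds per item
print(flag_fast_responses(latencies))  # [1, 3]
```

In practice, such flags would be one input among many (e.g., consistency indices, verification testing) rather than a standalone decision rule.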
In a broader context, the integration of technologies that support the full range of people pro-
cesses will create some new opportunities for the use of selection data. For example, as selection
tools are integrated into common data frameworks such as those held in learning and performance
management systems, the use of selection data to help shape the on-boarding and career develop-
ment process will be far more operationally viable than it is today. The opportunities to develop
data-leveraged processes to manage talent more effectively will multiply as the number of inte-
grated talent management components expands.
Of course, work itself will also continue to be affected by technology shifts. Organization struc-
tures and operational systems may shift from strict hierarchies toward networks, in part because
technologies allow for natural communication patterns and functional interdependencies to be
quickly established and supported through multifunctional shared workspaces. Through these tools,
global networks for accomplishing work can be established. These changes will need to be reflected
in the selection processes that are used to staff organizations that work in these ways.
This brief discussion touches on only a few of the likely sources for innovation in technology-
based selection, and many of these examples are already possible, but they are currently deployed in
limited and localized circumstances because of their complexity, cost, and primitive stage of devel-
opment. Future innovation will have a broad set of drivers and influences: advancements in technology standards that unify how people and data are represented will better enable integration (e.g., HR-XML Consortium standards); case law and government regulations will have a broader impact on technology application in areas that relate to fairness, privacy, and accessibility; and venture investment will flow toward new technologies that have the potential for business impact.
These facilitators and potential barriers will interplay to shape the availability and distribution of
new approaches, providing a rich test bed for I-O psychologists who are prepared to participate.
As discussed earlier, I-O psychologists must collaborate closely with their IT staff and technology
providers outside of their organizations to meet their stakeholders’ needs, to stay educated about
advances that affect their work, and to obtain and hold a strategic role in their organization’s
decisions and policies. As technology advances, it will be important to continuously look for
opportunities for research and practice improvements, rather than being reactive to these trends,
responding only after other stakeholders have noted the opportunities and begun to demand new
systems and processes.
Getting to know the IT staff in your organization is one way to stay on top of technologi-
cal innovations; another is supplementing the SIOP conference with others that follow technology
in HR, such as the HR Executive HR Technology Conference. Other conferences that cover the
topic include the annual meetings of the International Test Commission and the Association of Test
Publishers. For reference material on trends and business applications written from a practitioner
perspective, I-O psychologists should consult IT-focused business articles in Human Resources
Executive, Harvard Business Review, and the MIT Sloan Management Review, for example. SIOP’s
volume on technology in HR (Gueutal & Stone, 2005) provides practical advice on a range of top-
ics. Readers who are interested in technologies that are currently available for use in organizations
should consult industry analysts and commentators, such as HR.com, Human Resource Executive
Online (HREonline.com), and Rocket-hire.com, in which new technology products are reviewed
frequently.
As described in the section on critical issues and elsewhere in this chapter, technology-based
testing generates questions about the constructs that are being measured and the extent to which
administration methods (such as unproctored testing) affect measurement. Because this literature
is practice-oriented and typically atheoretical, the results often pertain to best practices or research
methods rather than to psychological theory. In fact, it may be the lack of theory associated with
the development and implementation of technology that has held back research in this area. Many
TABLE 8.3
Summary: Areas for Future Research
Equivalence
• Research should better distinguish between computerized and Internet-based assessment to examine Internet-specific
considerations (e.g., bandwidth, Internet connectivity issues, browser constraints).
• More research in the noncognitive domain: Can new technologies limit response distortion?
• Within-group, repeated-measure studies of technology-based tests and assessments.
Assessment Environment
• Examine differences under the most controversial usage conditions: high-stakes testing with clear pass/fail outcomes
under various proctoring conditions.
• When conducting quasi-experimental field studies, research should classify the factors that influence scores so a
taxonomy of influences and acceptable administration protocols can be developed.
New Technologies
• Does adaptive test delivery limit security risks for unproctored testing?
• How might video presentation or other technology mitigate adverse impact?
• Does animation (e.g., animated work scenarios, use of avatars) influence assessment responses and/or reactions?
Are responses consistent with those generated by live role players?
• What assessment value is added by tracking “microbehaviors” such as hover-time and click patterns?
• Can technical advances such as noncognitive CAT approaches limit score inflation in applied settings?
Organizational Impact
• What combinations of technologies (e.g., selection systems, performance management systems) add incremental value
to organizations?
of the issues raised by technology are fundamentally practical and thus may hold less appeal for
full-time researchers. The published literature on computerized and Internet testing is relatively
small. To generate further research interest, we have summarized many of the issues raised in this
chapter in Table 8.3.
REFERENCES
American Educational Research Association, American Psychological Association, and National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Psychological Association.
Bartram, D., & Brown, A. (2004). Online testing: Mode of administration and the stability of OPQ 32i scores.
International Journal of Selection and Assessment, 12, 278–284.
Borman, W. C., Lentz, E. M., Schneider, R. J., Houston, J. S., Bearden, R., & Chen, H. T. (2008). Adaptive
personality scales as job performance predictors: Initial validation results. In M.S. Fetzer & S.E. Lambert
(Eds.), Computer adaptive testing (CAT) and personnel selection. Presented at the annual meeting of the
Society for Industrial and Organizational Psychology, San Francisco, CA.
Cappelli, P. (2001). Making the most of online recruiting. Harvard Business Review, 79, 139–146.
Dwight, S. A., & Feigelson, M. E. (2000). A quantitative review of the effect of computerized testing on the
measurement of social desirability. Educational and Psychological Measurement, 60, 340–360.
Embretson, S. E. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
England, G. W. (1971). Development and use of weighted application blanks. Dubuque, IA: Brown.
Equal Employment Opportunity Commission, U.S. Department of Labor, U.S. Department of Justice, and
Office of Personnel Management. (2004). Agency information collection activities: Adoption of addi-
tional questions and answers to clarify and provide a common interpretation of the Uniform Guidelines
on Employee Selection Procedures as they relate to the Internet and related technologies. Federal Register,
69(43), 10152–10158.
Federal Trade Commission v. Toysmart.com, LLC, 2000 WL 34016434 (D. Mass. July 21, 2000).
Fletcher, P. A. K. (2005). From personnel administration to business-driven human capital management. In
H. Gueutal, D. L. Stone, & E. Salas (Eds.), The brave new world of eHR: Human resources in the digital
age (pp. 54–103). San Francisco, CA: Jossey-Bass.
George, C. E., Lankford, J. S., & Wilson, S. E. (1998). The effects of computerized versus paper-and-pencil
administration of measures of negative affect. Computers in Human Behavior, 8, 203–209.
Gibby, R. E., Biga, A. M., Pratt, A. K., & Irwin, J. L. (2008). Online and unsupervised adaptive cognitive ability
testing: Lessons learned. In R. E. Gibby & R. A. McCloy (Chairs), Benefits and challenges of online and
unsupervised adaptive testing. Symposium presented at the annual meeting of the Society for Industrial
and Organizational Psychology, San Francisco, CA.
Gilliland, S., & Cherry, B. (2000). Customers of selection processes. In J.F. Kehoe (Ed.), Managing selection
strategies in changing organizations (pp. 158–196). San Francisco, CA: Jossey-Bass.
Gueutal, H. G., & Stone, D. L. (Eds.) (2005). The brave new world of eHR: Human resources in the digital age.
San Francisco, CA: Jossey-Bass.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated
model and meta-analysis. Personnel Psychology, 57, 639–683.
Howard, A., Landy, F. J., & Reynolds, D. H. (2002). Marrying assessment and technology: For better and
worse. Workshop conducted at the meeting of the Society for Industrial and Organizational Psychology,
Toronto, Canada.
Kehoe, J. F., Dickter, D. N., Russell, D. P., & Sacco, J. M. (2005). e-Selection. In H. Gueutal, D. L. Stone, &
E. Salas (Eds.), The brave new world of eHR: Human resources in the digital age (pp. 54–103). San Francisco, CA: Jossey-Bass.
Landauer, T. (2006). Reliability and validity of a new automated short answer scoring technology. Paper pre-
sented at the Annual Meeting of the Association of Test Publishers, Orlando, FL.
McCloy, R. A., Heggestad, E. D., & Reeve, C. L. (2005). A silk purse from the sow’s ear: Retrieving nor-
mative information from multidimensional forced-choice items. Organizational Research Methods, 8,
222–248.
Mead, A. D., & Drasgow, F. (1993). Effects of administration medium: A meta-analysis. Psychological Bulletin,
114, 449–458.
Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are Internet and paper-and-pencil personality tests
truly comparable? Organizational Research Methods, 10, 322–345.
Naglieri, J. A., Drasgow, F., Schmitt, M., Handler, L., Prifitera, A., Margolis, A., & Velasquez, R. (2004).
Psychological testing on the Internet: New problems, old issues. American Psychologist, 59, 150–162.
National Telecommunications and Information Administration. (2002). A nation online: How Americans are
expanding their use of the Internet. Washington, DC: U.S. Department of Commerce.
Outtz, J. (1998). Video-based situational testing: Pros and cons. Workshop presented at the annual meeting of
the Society for Industrial and Organizational Psychology. Dallas, TX.
Pew Research Center. (2007). Demographics of Internet users. Retrieved June 15, 2008, from http://www.
pewinternet.org/Data-Tools/Download-Data/Trend-Data.aspx
Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of
applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable?
Personnel Psychology, 56, 733–752.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: Practical considerations and exploratory
empirical findings. Personnel Psychology, 57, 1003–1034.
Reynolds, D. H. & Lin, L. (2003). An unfair platform? Subgroup reactions to Internet selection techniques. In
T. N. Bauer (Chair), Applicant reactions to high-tech recruitment and selection methods. Symposium
conducted at the meeting of the Society for Industrial and Organizational Psychology, Orlando, FL.
Richman, W. L., Kiesler, S., Weisband, S., & Drasgow, F. (1999). A meta-analytic study of social desirability
distortion in computer-administered questionnaires, traditional questionnaires, and interviews. Journal
of Applied Psychology, 84, 754–775.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding
unidimensional polytomous responses. Applied Psychological Measurement, 24, 3–32.
Rupp, D. E., & Fritzsche, B. A. (2008). Using technology to enhance assessment and development programs.
Workshop conducted at the meeting of the Society for Industrial and Organizational Psychology, San
Francisco, CA.
Ryan, A. M., & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and decisions: A critical
review and agenda for the future. Journal of Management, 26, 565–606.
Salgado, J. F., & Moscoso, S. (2003). Internet-based personality testing: Equivalence of measures and asses-
sees’ perceptions and reactions. International Journal of Selection and Assessment, 11, 194–203.
Siem, F. M. (1991). Predictive validity of response latencies from computer administered personality tests.
Paper presented at the 33rd Annual Conference of the Military Testing Association, San Antonio, TX.
Sinar, E. F., Reynolds, D. H., & Paquet, S. L. (2003). Nothing but ‘Net? Corporate image and web-based
testing. International Journal of Selection and Assessment, 11, 150–157.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Standish Group. (1995). The chaos report. West Yarmouth, MA: Author.
Stark, S., Chernyshenko, O. S., Drasgow, F., & Williams, B. A. (2006). Examining assumptions about item
responding in personality assessment: Should ideal point methods be considered for scale development
and scoring? Journal of Applied Psychology, 91, 25–39.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006).
Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
U.S. Department of Labor. (2005). Obligation to solicit race and gender data for agency enforcement purposes:
Final rule. Federal Register, 70 (194), 58947–58961.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25–49.
Weiss, J. R. (2001). Six things you should know about XML. The Industrial Organizational Psychologist, 39,
30–34.
9 Strategy, Selection, and Sustained Competitive Advantage
In the sections that follow, we first discuss the need for selection to take a broader perspective
and show consequences at the business unit level. We then review the dominant strategic HR per-
spectives on how HR practices and human capital resources are linked to a business unit’s strat-
egy. This is followed by a critical examination of whether personnel selection contributes to such
resources and sustained competitive advantage. We then discuss several ways through which selec-
tion may contribute to the unit’s ability to execute its strategy. We conclude by considering selection
in relation to other HR practices (e.g., training, compensation) for creating competitive advantage.
Most of the discussion to follow is based on integrating theoretical perspectives, with less emphasis
on empirical research, simply because such research has not yet been conducted. The hope is that
this chapter can serve as a springboard for empirical research into the effects of selection on strategy
execution and ultimately organizational performance.
One limitation of this research is that it often relies on self-reported assessments of unit-level performance, when we place no such confidence in self-assessments of one’s own performance! A second, more important limitation of this research is that it asks only generally whether
“a valid or systematic selection process is used.” Demonstrating that higher performing firms are
more likely to use systematic selection procedures does not establish causality and is a far cry from
showing that units with the greatest levels of talent perform the best. As DeNisi, Hitt, and Jackson
(2003, p. 12) questioned, “If hiring ‘better’ people results in higher productivity, how exactly does
the selection of individuals translate into improved organizational performance?”
This all leads to the rather uneasy conclusion that the existing literature says relatively little about
whether selection contributes to business unit effectiveness—and hence offers the possibility of
competitive advantage to the firm (Ployhart, 2006). We believe it is critical for selection researchers
to show such value for three reasons. First, to the extent selection is perceived as nonstrategic, many
managers and organizational decision-makers will continue to use suboptimal selection practices.
Second, an inability to demonstrate value may limit the rigorous I-O approach to selection primar-
ily to entry-level hires (or “bottom of the pyramid hires”), where the focus may be on efficiency
and cost-effectiveness rather than added value. Third, the profession of I-O psychology may not
achieve the degree of reputation and visibility it deserves by failing to demonstrate how one of its
core practices, selection, adds value to firms. We do not mean to overstate our points here because
clearly many firms do not suffer from these issues. But it is also apparent that many firms do not
use selection practices as I-O psychologists would advocate (Anderson, 2005; Rynes, Brown, &
Colbert, 2002). We believe part of the reason stems from the challenge associated with showing selection’s unit-level impact (Ployhart, 2006).
SHRM
The field of SHRM has grown rapidly since the late 1980s. Most traditional HR, and especially
I-O, research focuses on predicting, explaining, or controlling individual behavior. As noted above,
selection in particular has treated individual job performance (and related criteria like turnover) as
“the ultimate criterion.” In contrast, the criteria of interest in SHRM scholarship are at the unit level,
and most typically that of the firm (Wright & Boswell, 2002). SHRM scholarship tends to focus on
between-firm (or business unit) differences in HR practices that help explain between-firm (or busi-
ness unit) differences in performance. The typical SHRM study involves an examination of how
business units that use different HR practices perform differently (Becker & Huselid, 2006). Thus,
unlike traditional HR research, which focuses on individual differences in outcomes, SHRM research
focuses on unit differences in outcomes. The other part to the story is that SHRM researchers tend to
focus on predicting financial, market, or accounting criteria (Gerhart, 2005). Hence, SHRM scholar-
ship has attracted a great deal of attention from the business community precisely because it shows
that HR practices can improve the organizational metrics most important to stakeholders.
Broadly speaking, there are three dominant perspectives on how to best use HR practices
(Delery & Doty, 1996). The universalistic perspective suggests that certain HR practices are always useful and relevant. Note that such a belief implies that using the appropriate practices will
always improve a firm’s effectiveness, irrespective of changes in the economy, the firm’s strategy, or
its competition. Colbert (2004) argued:
Research under this perspective has been useful in identifying discrete HR practices that are universally
sensible, but it has not contributed much to HRM in the strategic sense, if we take strategic to mean
practices that differentiate the firm in its industry and that lead to sustainable competitive advantage.
(p. 344).
Yet this universalistic approach is precisely the approach taken by many selection scholars (e.g.,
cognitive ability is always a good predictor of job performance). The contingency perspective
suggests that HR practices will be useful and relevant only when they match with each other and
the firm’s strategy. The contingency perspective is more directly linked to adding value for the firm
because it recognizes that HR practices must support the firm’s strategy and be internally consistent
with other practices. Attempts have been made to link HR strategies to generic business strategies
(e.g., Porter’s cost leadership, product differentiation, and focus; Schuler, Galante, & Jackson, 1987).
Although the idea that the “appropriate” set of HR practices can be deduced from a general business
strategy has obvious appeal, the approach has been widely criticized (Chadwick & Cappelli, 1999).
The configural perspective builds from the contingency approach to further recognize synergies
that exist in patterns of practices that fit with particular strategies. This approach takes the most
holistic view of HR management and suggests specific HR practices cannot be understood (or their
effects decomposed) in isolation from other practices and the firm’s unique strategy. Rather, there appear to be bundles of HR practices that must be used in combination to drive strategic value.
These are most often called high-performance work systems and involve combinations of practices
that include systematic staffing, training, compensation, and related practices. Firms that use these
high performance work systems outperform those that do not (Huselid, 1995).
Although there is now fairly compelling evidence that use of HR practices and high-performance
work systems is related to firm value (Combs, Liu, Hall, & Ketchen, 2006; although see Wright,
Gardner, Moynihan, & Allen, 2005), it should be recognized that most SHRM research only exam-
ines the link between unit HR practices and effectiveness. Intervening explanatory processes, such
as how the HR practice influences the cognitions, affect, and behavior of individuals, are rarely
considered (Gerhart, 2005; Becker & Huselid, 2006; Wright & Haggerty, 2005). These unexamined
intervening processes have been referred to as SHRM’s “black box.” It is within this black box that
I-O psychology in general, and selection specialists in particular, are uniquely situated to demon-
strate value. First, however, it is important to understand the dominant theoretical perspectives that
are invoked to explain why HR practices contribute to firm performance.
collectives of people. For example, in many organizations knowledge is shared informally through
social networks, rather than through more formal organizational structures and processes. As such,
it is quite difficult to replicate the knowledge and knowledge sharing process in other organizations.
A highly effective climate is likewise socially complex because it exists in the shared perceptions
of employees. Southwest Airlines has fended off multiple imitators (remember Continental Lite or
the United Shuttle?). Although these imitators could replicate the business model (e.g., point-to-
point service, single type of aircraft, minimal in-flight service, etc.), they could not duplicate key
elements of the organization’s culture. Time compression diseconomies capture the notion that resources accumulated slowly cannot be built quickly with equal effectiveness (Dierickx & Cool, 1989). For example, firms that have strong brand equity have an advantage that competitors cannot
easily copy because it takes a long time to generate brand recognition and brand loyalty. Causal
ambiguity describes resources that are linked to effective firm performance, but the specific reasons
or paths through which they contribute are not obvious. For example, it may not be apparent which
specific HR practices or combinations of practices contribute to building a more effective workforce.
Because creation of these resources is not easily understood, it is hard for competitors to copy them
with equal effectiveness.
In the RBV, the final condition for creating sustainable competitive advantage is that the resources
be nonsubstitutable. Nonsubstitutable resources are those that are both necessary and sufficient for
effective firm performance. For example, ATMs are an effective technological substitution for most
bank teller transactions, making bank tellers’ talent a substitutable resource. However, suppose that
bank tellers also provide financial advice that adds value and increases the customer base—in this
example bank tellers’ talent would be nonsubstitutable. Thus, only resources that are valuable, rare,
inimitable, and nonsubstitutable can create sustained competitive advantage. Selection practices
may or may not meet these conditions.
[Figure 9.1 appears here. Elements of the diagram: the business unit’s strategy; human capital that is valuable and rare; individual differences (KSAOs) acquired and retained; individual-level job performance; and selection-driven emergence from the individual level.]
FIGURE 9.1 Conceptual model-linking personnel selection with business unit strategy. Notice that the extent
to which unit-level resources are valuable, rare, inimitable, and nonsubstitutable will determine the strength of
relationship between unit-level human and social capital and sustained competitive advantage. However, note
that the contingent relationships found at the strategic level do not moderate individual-level predictor valid-
ity. The dashed line represents the primary relationship examined in personnel selection research. Only the
bold box represents potentially strategic resources. (Adapted from Ployhart, R. E., Journal of Management,
32, 868–897, 2006.)
Strategy, Selection, and Sustained Competitive Advantage 201
Here is where the usual selection story ends, with perhaps the additional effort in estimating utility.
However, considering our review of the SHRM literature earlier, it is apparent that effective selec-
tion practices may or may not support the business unit’s strategy and add value.
One of the main reasons for this ambivalence is that selection practices can be copied and will
not, by themselves, form the basis for sustained competitive advantage. Such practices may lead to
short-lived advantage or contribute to maintaining parity with competing units. Instead, unit-level
competencies (i.e., collective human and social capital; the bold box in Figure 9.1) form the basis
for sustained competitive advantage (Prahalad & Hamel, 1990; Barney & Wright, 1998). Consider
the organization selecting frontline employees on the basis of conscientiousness, agreeableness, and
emotional stability. By themselves, these individual-level KSAOs can be, and are, screened for by many companies. However, when combined with other properly aligned HR practices, a strong
culture, and other management practices, selecting on these individual-level KSAOs can lead to the
emergence of a unit-level competency such as “aggregate customer service.” Thus, unit-level human
and social capital competencies are the source of sustained competitive advantage; they represent
intangible assets and add value to the firm but are created through application of HR practices like
personnel selection. However, the following paragraphs discuss certain conditions that must be met
before unit-level competencies will create sustainable unit-level differences in performance.
First, validity at the individual level generalizes across like jobs, as is consistent with most
selection research (note there is no moderator of the dashed relationship in Figure 9.1). However,
this does not mean the validity of unit-level competencies (human and social capital) generalizes
across contexts. Rather, human capital at the unit level only adds value to the extent it is consistent
with the firm’s strategy. Also known as “external fit,” an organization creating a unit-level compe-
tency such as customer service will benefit only to the extent that the competency is critical to its
business strategy. Whereas one retailer competing on service would benefit from the creation of a
customer service competency, another competing on price would experience less, if any, impact
from the same activities. This is an important implication because it means one can be doing a
wonderful job of selecting at the individual level, yet adding nothing to the sustained competitive
advantage of the firm.
Second, it was previously asserted that selection practices alone are insufficient to create the
unit-level competencies that are the basis of sustained competitive advantage. Selection practices
must be aligned with other HR programs (e.g., training, performance management, and compensa-
tion) to create the desired impact. Selection practices that are inconsistently applied, or conflict with
other HR practices, will not create the type of human capital emergence necessary for a unit to build
a high-quality stock of aggregate human and social capital. For example, a unit may use exemplary
selection practices, but be unable to recruit top talent. Or, the unit may attract and select top
talent, but fail to keep them in sufficient quantities. The unit’s climate, leadership, or compensation
may further enhance or lower the capital’s potential. Also known as “internal fit,” it is important to
recognize that selection is but one lever available to management and that the greatest impact comes
when all levers are aligned and pointing in a common direction.
Third, selection can have significant strategic value if the targeted KSAOs fit externally (with
the business strategy) and internally (with other HR programs) to create unit-level competencies
that drive important firm outcomes. Firms may compete on service, cost, or innovation; they do
not select on these competencies. While complicated enough, this scenario is likely to have several
important boundary conditions. For example, the relationship between unit competencies and unit
outcomes is unlikely to be linear cross-sectionally or over time (see Ployhart, 2004). Figure 9.1
emphasizes a contingency or configural view of selection practice because the value of unit-level
human and social capital is dependent on (moderated by) the unit’s strategy. This is in contrast to the
personnel selection literature, which in many ways argues for a universalistic approach. It may well
be the case that cognitive ability and conscientiousness predict the individual performance of most
jobs (Schmidt & Hunter, 1998). Although true, it does not necessarily follow that selecting on these
attributes will result in sustainable competitive advantage. Further, universalistic findings at the
individual level become contextualized at the unit level, and hence require a configural perspective.
For example, looking cross-sectionally, there is likely a need for a critical mass of human and social
capital (Dierickx & Cool, 1989). Particularly in highly task-interdependent work, hiring only a few
highly talented employees is unlikely to result in unit-level competencies (i.e., ensure sufficient
quantities of human and social capital) that will translate into improved business unit performance.
Although the “tipping point” defining critical mass likely changes by industry, occupation, and
organization, there is always a threshold. The relationship between the level of a unit’s talent and its performance is likely to be nonlinear.
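The critical-mass argument can be made concrete with a simple illustration. The sketch below is our own, not drawn from the SHRM literature: the logistic functional form, tipping point, and steepness values are purely illustrative assumptions about how an S-shaped talent-performance link might behave.

```python
import math

def unit_performance(talent_level, tipping_point=0.5, steepness=10.0):
    """Hypothetical S-shaped (logistic) link between a unit's aggregate
    talent level (scaled 0-1) and its relative performance (0-1).

    Below the tipping point, extra talent buys little; around it, gains
    are steep; well past it, returns diminish again.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (talent_level - tipping_point)))

# Marginal gain from the same 0.1 increase in talent, at three levels:
low_gain = unit_performance(0.2) - unit_performance(0.1)     # below critical mass
mid_gain = unit_performance(0.6) - unit_performance(0.5)     # around the tipping point
high_gain = unit_performance(0.95) - unit_performance(0.85)  # diminishing returns
```

Under these assumptions, the same increment of talent yields its largest performance gain near the critical mass and much smaller gains at either extreme, which is the nonlinearity the text describes.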
Fourth, looking longitudinally, time is a critical element for modeling the value of human and
social capital. It takes time for HR practices to create human and social capital, and more time
for the increasing stock to translate into unit-level outcomes (Ployhart, 2004). Such a temporal
perspective is true for all unit-level processes, with higher-level processes requiring a longer time
perspective than lower-level processes (Kozlowski & Klein, 2000). In short, implementation of a
new selection procedure, regardless of its merits, will take time before its effects can be measured
at the unit level. This, of course, assumes that the other HR practices comprising the “bundle” are
themselves stable. Where other HR practices are evolving, the effects of a selection procedure may
be even more difficult to ascertain (or their effects accelerated). This requisite leap of faith may
explain why some organizations frequently change their selection processes.
One final caveat is that the predictors of individual job performance may or may not be the best
drivers of sustained competitive advantage. Certainly one expects better individual performance
to contribute to better unit performance, but this relationship will not necessarily be linear or
isomorphic (e.g., because of group process losses). Because the criteria are not isomorphic across levels (e.g., adding one additional “high ability” employee to a casino operator employing hundreds is unlikely to measurably impact that unit’s performance), there is a good possibility the predictors will not be either (Bliese, 2000). This means additional KSAOs may be important drivers of
unit effectiveness and sustained competitive advantage (e.g., those contributing to shared collective
performance and social capital, such as personality traits linked to interpersonal and teamwork skills;
Noe, Colquitt, Simmering, & Alvarez, 2003; Oldham, 2003). In short, characteristics unrelated to
individual performance may be important for unit performance (see also Chapter 35, this volume,
regarding team member selection).
If the above propositions are compared to the typical model for utility analysis, similarities and
differences become apparent. In terms of similarities, utility analysis and SHRM scholarship would
predict KSAOs that are more rare (have a lower base rate in the population) will result in higher
utility and competitive advantage. However, there is also an important difference. Utility analysis
assumes that economic value resulting from more valid selection will produce firm-level value;
SHRM scholarship does not. Validity does not equal value, and valuable and rare resources may
only contribute to parity or temporary competitive advantage. Sustained competitive advantage also
requires resources that are inimitable and nonsubstitutable. It is on these latter two points that utility
analysis and the contemporary view of selection fall short.
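For readers comparing the two frameworks, the utility model being contrasted here is, in most treatments, the Brogden-Cronbach-Gleser formulation; the symbols below follow the standard utility-analysis literature rather than anything defined in this chapter:

```latex
% Brogden-Cronbach-Gleser utility of a selection procedure
\Delta U = N_s \, T \, r_{xy} \, SD_y \, \bar{z}_x - C
```

where $N_s$ is the number of employees selected, $T$ is average tenure, $r_{xy}$ is criterion-related validity, $SD_y$ is the dollar value of one standard deviation of job performance, $\bar{z}_x$ is the mean standardized predictor score of those hired (a function of the selection ratio), and $C$ is the total cost of the procedure. Rarer KSAOs and higher validity both raise $\Delta U$, consistent with the similarity noted above; but no term in $\Delta U$ captures inimitability or nonsubstitutability, which is precisely the gap the RBV critique identifies.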
These points illustrate that one cannot simply assume that using more valid selection procedures
will add firm-level value, improve organizational performance, or create an opportunity for sus-
tained competitive advantage. In fact, a sole focus on individual-level selection prohibits such conclusions. Again, consider one of the most important implications of Figure 9.1: an organization can be
using highly valid predictors of job performance at the individual level but not necessarily develop-
ing the kinds of human capital necessary for the firm’s sustained competitive advantage! This may
seem discouraging, but it represents incredible opportunity for selection scholars and practitioners
to better demonstrate their value to organizational decision-makers. There exists a bridge to connect
micro and macro disciplines of scholarship by linking selection (micro) to strategy (macro), and
each discipline has the opportunity to strengthen the other. From a practical perspective, articu-
lating selection’s value will increase the likelihood of effective practices being implemented and
supported.
Consider the fact that staffing impacts the firm’s strategy and helps implement it. In terms of
impact, Figure 9.1 alludes to the expectation that implementing HR practices aligned with the unit’s
strategy will contribute to human and social capital emergence that is valuable, rare, inimitable,
and nonsubstitutable. However, through the process of developing and administering the selection
system, selection also helps to articulate and reinforce the firm’s strategy through all levels of the
organization. In the following section, we consider various means through which personnel selec-
tion can establish business-unit-level value.
(interestingly, compilation models are less-mentioned). They essentially argued for scholars to
focus on how HR practices create human and social capital emergence. Bowen and Ostroff (2004)
developed the concept of HR strength, the idea that stronger (more consistent, greater internal fit) HR
practices create more cohesive climates than do weaker (less consistent) HR practices.
So, how might HR practices, and selection in particular, contribute to the emergence of unit-
level competencies that form the basis for sustained competitive advantage? Lepak and Snell
(1999) suggested there are different types of employee groups that differ in their strategic value
and uniqueness to the firm and that firms should vary their HR practices accordingly. For example,
firms should use commitment-based HR practices for employees who are both highly valuable and
unique (e.g., top managers), but use efficiency-based HR practices for employees who are valuable
but not unique (e.g., entry-level employees). Jobs have similarly been differentiated as “A” jobs and
“C” jobs, with the former being central to the firm’s ability to execute its strategy (Kaplan & Norton,
2003). Selection is most likely to create unit-level impact when applied to A jobs.
As an example, in most banks the loan officer job is an important one. Loan officers are the interface between the bank and its customers and make decisions about whether the bank should lend money. Incumbents in such positions are often evaluated on sales and service criteria. The KSAOs key to selection in this context would be those related to sales (e.g., persistence, energy, persuasion)
and service (agreeableness, conscientiousness). However, much more critical to the organization’s
success are the far smaller number of A jobs responsible for setting the organization’s overall credit
policies and managing its risk. Although small in number, their impact on the firm is enormous
(witness the meltdown in the financial markets from overzealous lending practices). Criteria used to
evaluate these jobs might include various ratios of returns to bad loans. The KSAOs would be sub-
stantially different, focusing on quantitative analysis, forecasting, geopolitical trends, and the like.
Consistent across all this theory is the expectation that HR practices create a composition and/or
compilation of competencies (unit KSAOs) from individual KSAOs. It bears repeating that the stra-
tegic value manifests from the unit-level competencies, not individual KSAOs. HR practices pro-
duce unit differences and may contribute to sustained competitive advantage through the creation
of valuable, rare, inimitable, and nonsubstitutable competencies (human and social capital at the
unit level).
Empirical research is just starting to appear that supports many of these expectations. Lepak and
Snell (2002) found that firms do in fact use different HR practices for different employee groups
varying in terms of value and uniqueness. Takeuchi, Lepak, Wang, and Takeuchi (2007) found that
aggregate manager perceptions of employee competencies were positively related to self-reports of
firm performance. Several studies have found that human capital manifests different forms of com-
position across units, and hence units can be distinguished in terms of aggregate individual KSAOs
(Jordan, Herriot, & Chalmers, 1991; Schaubroeck, Ganster, & Jones, 1998; Schneider, White, &
Paul, 1998). Ployhart, Weekley, and Baughman (2006) extended these findings to show that job- and
organizational-level human capital helped explain between-unit differences in job performance and
job satisfaction.
Considerably more research is needed linking HR practices to specific types of human and social
capital emergence. For example, do firms that use more aligned recruiting, selection, and retention
practices create more valuable and rare forms of human capital more quickly? Do such firms have
greater control over the flow of human and social capital? Do firms that use such practices outper-
form rivals? SHRM scholarship needs to focus squarely on human and social capital emergence,
thereby illuminating the black box between HR practices and unit outcomes. Fortunately, I-O psy-
chologists and selection specialists are the torch that can light the examination of this black box.
Zimmerman, 2005), but this only scratches the surface of retention issues. For example, recent unit-
level research suggests the economic costs associated with turnover are far greater than individual
research might indicate. Unit-level human and social capital losses represent the quantity and quality
of talent lost through turnover, the costs of which may be greater than simply the rate of turnover
(Glebbeek & Bax, 2004; Kacmar, Andrews, Van Rooy, Steilberg, & Cerone, 2006; Shaw, Gupta, &
Delery, 2005). This is an important transition because unit-level turnover captures losses of collective
processes (e.g., coordination and communication) not apparent in individual-level research. Dess
and Shaw (2001) theorized that turnover rates may not fully capture the costs of turnover. For
example, service- and knowledge-based firms are highly dependent on social capital (the collective
social relationships, interactions, and networks within a unit). Effective unit performance in these
settings requires employees to share information about customers, services, products, and best
practices. Unit turnover creates holes in social networks that have negative consequences beyond
simple turnover rates (Shaw et al., 2005).
Human capital theory predicts that the loss of talent through turnover means the investment
made in unit human capital may not be recouped. But all loss is not considered equal. For example,
turnover in A jobs should be more costly than turnover in C jobs. That is, the opportunity costs
of turnover in strategically critical positions should be higher than the same costs in less mission-
critical positions. Similarly, higher quality unit human capital is expected to produce more value,
so turnover among top talent will incur greater opportunity costs than turnover amongst lesser
talent (Becker, 1964). Turnover should also be more damaging when those who leave have more
tacit knowledge (Strober, 1990). From the RBV perspective, units are not equally distributed with
respect to human capital. Turnover may diminish the stock of unit human capital, but turnover
among the highest quality human capital may make that resource less valuable, less rare, and more
easily imitated unless replacements are equally qualified (Lepak & Snell, 1999). Because individual
KSAOs tend to be normally distributed, by definition higher quality talent will be more scarce
and difficult to replace. Empirically, Shaw et al. (2005) found that unit social capital losses were
associated with additional negative effects beyond mere performance losses, indicating the quality
of social capital was an important factor. Research similarly finds a loss of shared tacit knowledge
(Hitt, Bierman, Shimizu, & Kochhar, 2001) and an erosion of communication and coordination
(Sacco & Schmitt, 2005).
Selection experts should take ownership of turnover problems. For example, an HR executive
of a Fortune 500 company told one of the authors: “I tell my (HR) team that turnover means
our hiring process failed.” That is a bold statement (people quit for all kinds of reasons that
have nothing to do with their ability to perform the job), but there is some truth to better
utilizing staffing as a means to control turnover. That said, adequately solving this problem
will require a deeper understanding of turnover than simply identifying the KSAOs necessary
for job performance. For example, it will probably require an understanding of how individuals
fit within the culture of the firm and characteristics of the work group, something that is rarely assessed as part of selection practices. Note that doing so would obviously contextualize selection, an idea that has recently been advocated (Cascio & Aguinis, 2008). This research may help link selection to important internal processes that contribute to the firm’s ability to
retain the high-quality stock of strategically valuable human and social capital necessary for
strategy implementation.
Earlier when discussing Figure 9.1, we noted some similarities and differences between viewing
staffing as a strategic process versus traditional utility analysis. We emphasize that it is the unit-
level human and social capital that creates sustained competitive advantage. Therefore, it is critical
to articulate the economic value of this intangible asset—what Ulrich and Smallwood (2005) called
“return on intangibles.” In some of our own research, we have shown how individual KSAOs emerge
to create meaningful between-unit differences in human capital (Ployhart, Weekley, & Ramsey,
2005). More important, we quantified this unit human capital in a composition model and linked
it to unit-level financial and productivity-based outcomes over time. In a sample of over 100,000
employees and over 1,000 retail chain stores, we showed how store levels of service orientation
contributed to between-store differences in sales and controllable profit. Stores performed better as
they acquired higher levels of human capital and performed more poorly as they lost human capital.
Thus, we showed how this “intangible asset” was in fact tangibly linked to important store financial
criteria.
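The logic of a simple additive composition model can be sketched as follows. The store names, scores, sales figures, and helper functions are entirely hypothetical illustrations of the general approach; they are not data or code from Ployhart, Weekley, and Ramsey (2005), whose analyses used far richer multilevel models.

```python
from collections import defaultdict

# Hypothetical data: (store_id, individual service-orientation score on a 1-5 scale)
employees = [
    ("Store A", 4.2), ("Store A", 3.9), ("Store A", 4.5),
    ("Store B", 3.1), ("Store B", 2.8), ("Store B", 3.4),
    ("Store C", 3.8), ("Store C", 3.6), ("Store C", 4.0),
]
# Hypothetical unit-level outcome: store sales (in $1,000s)
store_sales = {"Store A": 120.0, "Store B": 85.0, "Store C": 102.0}

def store_human_capital(records):
    """Additive composition model: a store's human capital is the mean of
    its employees' individual KSAO scores."""
    by_store = defaultdict(list)
    for store, score in records:
        by_store[store].append(score)
    return {store: sum(s) / len(s) for store, s in by_store.items()}

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

capital = store_human_capital(employees)
stores = sorted(capital)
# Unit-level relationship: aggregated human capital vs. store sales
r = pearson_r([capital[s] for s in stores], [store_sales[s] for s in stores])
```

In practice, researchers would first justify aggregation (e.g., with within-unit agreement indices) before treating the store mean as unit-level human capital, and would model change in the capital stock over time rather than a single cross-sectional correlation.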
called for in different parts of the world, and thus the KSAOs matched to the competencies needed
to execute respective functions will also differ.
Global selection also means that selection methods developed in one part of the world may not
generalize to other countries or cultures (e.g., Ryan, McFarland, Baron, & Page, 1999). Even where
the same competencies are desired, indicating the same KSAOs are required, different or at least
modified methods may be required to identify people who possess them. For example, it is well
established that personality measures are affected by response styles (e.g., extreme responding,
acquiescence), which in turn vary across cultures. Other selection methods like biodata and situ-
ational judgment, which are usually developed and keyed at a local level, may not generalize across
cultural boundaries (Lievens, 2006). The obvious point is that globalization greatly compounds the
challenges facing selection researchers attempting to build the corporate competencies required to
support execution of the business strategy and impact key outcomes. The talent necessary to com-
pete in one local economy (e.g., China, India) may contribute little to supporting the organization’s
overall strategy. For example, suppose a U.S.-based organization opens a manufacturing facility in
China. They may choose to employ local managers because they have expertise in Chinese culture
and business practices, but they may lack some key competencies necessary for them to understand
how their operation adds value to the overall firm. Hence, assuming some autonomy, they may make
choices that run counter to the firm’s strategy even though they are successful locally. The critical
issue is alignment between the firm’s strategy and talent within and across geographic locations.
Although the strategy determines which are A jobs, performance largely determines who is
considered an A player. Selection specialists can play a critical role in a firm’s success to the extent
they can match A players with A jobs. Firms that are successful in placing their most talented indi-
viduals in the most mission-critical jobs should be more successful than those taking a more casual
approach to placement. Although seemingly obvious, we are not aware of any systematic research
examining the issue. However, the role is not limited to selecting talent. It also involves deselecting
marginal performers from critical jobs. If the job is truly essential to the execution of the firm’s
strategy, then rigorous performance management, including the reallocation or separation of weak
performers, becomes an important if unpleasant task.
TABLE 9.1
Summary of Key Implications: Integrating Strategic Human Resources
With Personnel Selection
1. Only unit-level human capital and social capital can offer strategic value to the firm. Personnel selection and selection
on targeted individual-level KSAOs can only contribute to the firm’s strategy insofar as they contribute to the
emergence of strategically valuable unit-level human and social capital.
2. Individual-level criterion-related validity is insufficient evidence to demonstrate the strategic value of selection. It is
necessary, but not sufficient. Simply using selection procedures that predict individual job performance is no guarantee
that personnel selection will contribute to the firm’s sustained competitive advantage.
3. Validity at the individual level generalizes across contexts, but the validity of unit-level human capital and social
capital does not generalize. The extent to which these unit-level competencies have relationships with unit outcomes is
dependent on the firm’s strategy, market, competitors, and related factors. Indeed, if unit-level human and social
capital are to add value to the firm’s competitive advantage, then the firm would not want these relationships to
generalize to other firms!
4. Demonstrating the strategic value of personnel selection will likely require a longitudinal focus, because selection
(and related HR) practices must be implemented appropriately over time to create a critical mass of unit-level human
and social capital emergence.
5. A critical mass of aggregate KSAOs is necessary to influence unit-level consequences. This has three important
implications. First, the development and sustainability of this critical mass is likely to be dynamic and nonlinear.
Second, the relationships between unit-level human and social capital and unit-level outcomes are likely to be dynamic and nonlinear. Third, there is likely a threshold of diminishing returns: a tipping point beyond which adding more talent
is unlikely to provide the same value to the firm.
6. Because performance criteria are not isomorphic between the individual and unit levels, there is the possibility that
different predictors of performance will be present at each level.
(specifically, mean store assessment scores correlated with a store-level customer service index at
r = .19, p < .05). Although not proving that the individual-level assessment was causing customer
service, it was certainly perceived as more important by the participating organization than was yet
another individual-level validation study.
CONCLUSIONS
In this chapter, we have argued that selection scholarship must be expanded to consider how, when,
and why personnel selection practices will contribute to creating business-unit-level value and
sustained competitive advantage. We noted that in contrast to expectations, effective selection prac-
tices will not always translate into firm-level value. Table 9.1 summarizes the main implications of
our chapter.
As a consequence of the issues summarized in Table 9.1, we have also tried to articulate ways
through which selection practices can manifest such value. Much is currently written about how
the HR profession needs to be part of the firm’s strategic decision-making team; this is even more
true for I-O psychology. Being able to demonstrate selection’s contribution to the firm’s strategy
is one way to accomplish this goal. Although the road toward establishing empirical connections
between selection practices and business unit sustained competitive advantage will not be easy
or quick, we believe it is vital for the future of our profession and I-O psychology’s own strategic
direction.
ACKNOWLEDGMENTS
We thank James L. Farr, Nancy T. Tippins, Jerard F. Kehoe, and Janice Molloy for providing sug-
gestions on earlier drafts of this chapter.
REFERENCES
Anderson, N. R. (2005). Relationships between practice and research in personnel selection: Does the left hand
know what the right is doing? In A. Evers, N. R. Anderson, & O. Smit-Voskuyl (Eds.), The Blackwell
handbook of personnel selection (pp. 1–24). Oxford, England: Blackwell.
Barney, J. B. (1991). Firm resources and sustained competitive advantage. Journal of Management, 17,
99–120.
Barney, J. B., & Wright, P. M. (1998). On becoming a strategic partner: The role of human resources in gaining
competitive advantage. Human Resource Management, 37, 31–46.
Barrick, M. R., & Zimmerman, R. D. (2005). Reducing voluntary, avoidable turnover through selection. Journal
of Applied Psychology, 90, 159–166.
Becker, B. E., & Huselid, M. A. (2006). Strategic human resource management: Where do we go from here?
Journal of Management, 32, 898–925.
Becker, G. S. (1964). Human capital: A theoretical and empirical analysis with special reference to education.
New York, NY: National Bureau of Economic Research.
Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data
aggregation and analysis. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and
methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco,
CA: Jossey-Bass.
Bowen, D. E., & Ostroff, C. (2004). Understanding HRM-firm performance linkages: The role of the “strength”
of the HRM system. Academy of Management Review, 29, 203–221.
Cascio, W. F., & Aguinis, H. (2008). Staffing twenty-first-century organizations. Academy of Management
Annals, 2, 133–165.
Chadwick, C., & Cappelli, P. (1999). Alternatives to generic strategy typologies in strategic human resource
management. In P. M. Wright, L. D. Dyer, J. W. Boudreau, & G. T. Milkovich (Eds.), Research in person-
nel and human resources management, (Suppl. 4, pp. 1–29). Stamford, CT: JAI Press.
Colbert, B. A. (2004). The complex resource-based view: Implications for theory and practice in strategic
human resource management. Academy of Management Review, 29, 341–358.
Combs, J., Liu, Y., Hall, A., & Ketchen, D. (2006). How much do high-performance work practices matter?
A meta-analysis of their effects on organizational performance. Personnel Psychology, 59, 501–528.
Delery, J. E., & Doty, D. H. (1996). Modes of theorizing in strategic human resource management: Tests
of universalistic, contingency, and configurational performance predictions. Academy of Management
Journal, 39, 802–835.
DeNisi, A. S., Hitt, M. A., & Jackson, S. E. (2003). The knowledge-based approach to sustainable competi-
tive advantage. In S. E. Jackson, M. A. Hitt, & A. S. DeNisi (Eds.), Managing knowledge for sustained
competitive advantage (pp. 3–33). San Francisco, CA: Jossey-Bass.
Dess, G. G., & Shaw, J. D. (2001). Voluntary turnover, social capital, and organizational performance. Academy
of Management Review, 26, 446–456.
Dierickx, I., & Cool, K. (1989). Asset stock accumulation and sustainability of competitive advantage.
Management Science, 35, 1504–1511.
Gerhart, B. (2005). Human resources and business performance: Findings, unanswered questions, and an
alternative approach. Management Revue, 16, 174–185.
Glebbeek, A. C., & Bax, E. H. (2004). Is high employee turnover really harmful? An empirical test using
company records. Academy of Management Journal, 47, 277–286.
Grant, R. M. (1996). Toward a knowledge-based theory of the firm. Strategic Management Journal, 17,
109–122.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions (2nd ed.). Mahwah, NJ:
Erlbaum.
Hitt, M. A., Bierman, L., Shimizu, K., & Kochhar, R. (2001). Direct and moderating effects of human capital
on strategy and performance in professional service firms: A resource-based perspective. Academy of
Management Journal, 44, 13–28.
Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and
corporate financial performance. Academy of Management Journal, 38, 635–672.
Jordan, M., Herriot, P., & Chalmers, C. (1991). Testing Schneider’s ASA theory. Applied Psychology, 40,
47–53.
Kacmar, K. M., Andrews, M. C., Van Rooy, D. L., Steilberg, R. C., & Cerone, S. (2006). Sure everyone can be
replaced … but at what cost? Turnover as a predictor of unit-level performance. Academy of Management
Journal, 49, 133–144.
Strategy, Selection, and Sustained Competitive Advantage 211
Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations:
Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel
theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90).
San Francisco, CA: Jossey-Bass.
Latham, G. P., & Whyte, G. (1994). The futility of utility analysis. Personnel Psychology, 47, 31–46.
Lepak, D. P., & Snell, S. A. (1999). The human resource architecture: Toward a theory of human capital allocation
and development. The Academy of Management Review, 24, 34–48.
Lepak, D. P., & Snell, S. A. (2002). Examining the human resource architecture: The relationships among
human capital, employment, and human resource configurations. Journal of Management, 28, 517–543.
Lievens, F. (2006). International situational judgment tests. In J. A. Weekley & R. E. Ployhart (Eds.), Situational
judgment tests (pp. 279–300). Mahwah, NJ: Lawrence Erlbaum.
Nahapiet, J., & Ghoshal, S. (1998). Social capital, intellectual capital and the organizational advantage.
Academy of Management Review, 23, 242–266.
Noe, R. A., Colquitt, J. A., Simmering, M. J., & Alvarez, S. A. (2003). Knowledge management: Developing
intellectual and social capital. In S. E. Jackson, M. A. Hitt, & A. S. DeNisi (Eds.), Managing knowledge
for sustained competitive advantage (pp. 209–242). San Francisco, CA: Jossey-Bass.
Oldham, G. R. (2003). Stimulating and supporting creativity in organizations. In S. E. Jackson, M. A. Hitt,
& A. S. DeNisi (Eds.), Managing knowledge for sustained competitive advantage (pp. 274–302). San
Francisco, CA: Jossey-Bass.
Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. In J. Martocchio (Ed.),
Research in personnel and human resources management (Vol. 23, pp. 121–176). Oxford, England: Elsevier.
Ployhart, R. E. (2006). Staffing in the 21st century. Journal of Management, 32, 868–897.
Ployhart, R. E., & Schneider, B. (2002). A multilevel perspective on personnel selection: Implications for
selection system design, assessment, and construct validation. In F. J. Dansereau & F. Yammarino (Eds.),
Research in multi-level issues Volume 1: The many faces of multi-level issues (pp. 95–140). Oxford,
England: Elsevier Science.
Ployhart, R. E., & Schneider, B. (2005). Multilevel selection and prediction: Theories, methods, and models.
In A. Evers, O. Smit-Voskuyl, & N. Anderson (Eds.), Handbook of personnel selection (pp. 495–516).
Oxford, England: Blackwell.
Ployhart, R. E., Weekley, J. A., & Baughman, K. (2006). The structure and function of human capital emergence:
A multilevel examination of the ASA model. Academy of Management Journal, 49, 661–677.
Ployhart, R. E., Weekley, J. A., & Ramsey, J. (2005). Does human capital relate to unit level effectiveness over
time? Paper presented at the 2005 Annual Meeting of the Academy of Management, Honolulu, HI.
Prahalad, C. K., & Hamel, G. (1990). The core competence of the corporation. Harvard Business Review, 68,
79–91.
Ryan, A. M., McFarland, L. A., Baron, H., & Page, R. (1999). An international look at selection practices:
Nation and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391.
Rynes, S. L., Brown, K. G., & Colbert, A. E. (2002). Seven misconceptions about human resource practices:
Research findings versus practitioner beliefs. Academy of Management Executive, 16, 92–103.
Sacco, J. M., & Schmitt, N. (2005). A dynamic multilevel model of demographic diversity and misfit effects.
Journal of Applied Psychology, 90, 203–231.
Saks, A. M. (2005). The impracticality of recruitment research. In A. Evers, O. Smit-Voskuyl, & N. Anderson
(Eds.), Handbook of personnel selection (pp. 47–72). Oxford, England: Blackwell.
Schaubroeck, J., Ganster, D. C., & Jones, J. R. (1998). Organization and occupation influences in the attraction-
selection-attrition process. Journal of Applied Psychology, 83, 869–891.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology:
Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schneider, B. (1987). The people make the place. Personnel Psychology, 40, 437–454.
Schneider, B., Smith, D. B., & Sipe, W. P. (2000). Personnel selection psychology: Multilevel considerations.
In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations:
Foundations, extensions, and new directions (pp. 91–120). San Francisco, CA: Jossey-Bass.
Schneider, B., White, S. S., & Paul, M. C. (1998). Linking service climate to customer perceptions of service
quality: Test of a causal model. Journal of Applied Psychology, 83, 150–163.
Schuler, R. S., Galante, S. P., & Jackson, S. E. (1987). Matching effective HR practices with competitive strategy.
Personnel, 64, 18–27.
Shaw, J. D., Duffy, M. K., Johnson, J. L., & Lockhart, D. E. (2005). Turnover, social-capital losses, and performance.
Academy of Management Journal, 48, 594–606.
Shaw, J. D., Gupta, N., & Delery, J. E. (2005). Alternative conceptualizations of the relationship between vol-
untary turnover and organizational performance. Academy of Management Journal, 48, 50–68.
Strober, M. H. (1990). Human capital theory: Implications for HR managers. Industrial Relations, 23,
214–239.
Takeuchi, R., Lepak, D. P., Wang, H., & Takeuchi, K. (2007). An empirical examination of the mechanisms
mediating between high-performance work systems and the performance of Japanese organizations.
Journal of Applied Psychology, 92, 1069–1083.
Taylor, M. S., & Collins, C. J. (2000). Organizational recruitment: Enhancing the intersection of theory and
practice. In C. L. Cooper & E. A. Locke (Eds.), Industrial and organizational psychology: Linking theory
and practice (pp. 304–334). Oxford, England: Blackwell.
Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices to organizational level measures of
performance. Personnel Psychology, 46, 27–48.
Ulrich, D., & Smallwood, N. (2005). HR’s new ROI: Return on intangibles. Human Resource Management,
44, 137–142.
Whyte, G., & Latham, G. P. (1997). The futility of utility analysis revisited: When even an expert fails. Personnel
Psychology, 50, 601–610.
Wright, P. M., & Boswell, W. R. (2002). Desegregating HRM: A review and synthesis of micro and macro HR
research. Journal of Management, 28, 247–276.
Wright, P. M., Dunford, B. D., & Snell, S. A. (2001). Human resources and the resource-based view of the firm.
Journal of Management, 27, 701–721.
Wright, P. M., Gardner, T. M., Moynihan, L. M., & Allen, M. R. (2005). The relationship between HR practices
and firm performance: Examining causal order. Personnel Psychology, 58, 409–446.
Wright, P. M., & Haggerty, J. J. (2005). Missing variables in theories of strategic human resource management:
Time, cause, and individuals. Management Revue, 16, 164–173.
10. Managing Sustainable Selection Programs
The objective of this chapter is to describe the major features of selection programs that contribute
to their sustainable success. This chapter will focus primarily on the organizational context of selec-
tion programs considering the influence of governance, fit, strategy, and outcomes. The chapter will
not focus on the psychometric technology of selection practices that affect the value of selection
decisions as this content is treated elsewhere in this handbook. However, because the context of per-
sonnel selection science is so directly relevant, the relationship between program sustainability and
the scientific context of validation, research, and data will be explored. Finally, this chapter is the
result of collaboration between psychologists with U.S.- and European-centric professional experience.
The intent is not so much to ensure comprehensive coverage of cultural or national differences between
sustainable selection programs as to ensure that this chapter is relevant to modestly diverse cultural
and national perspectives and contexts.
Two recent chapters (Tippins, 2002; Roe, 2005) and one article (Klehe, 2004) have addressed the
design and implementation of selection programs. This chapter’s focus on the organizational context
for selection programs complements these earlier works. Tippins (2002) and Roe (2005) focused
primarily on the procedural elements of the selection process itself. For example, both addressed
selection decision processes such as the use of cut scores, multiple hurdle methods, and compensa-
tory methods. Roe (2005) placed somewhat greater emphasis on the planning processes such as the
use of a program of requirements (PoR) for the design of selection systems. Indeed, many of Roe’s
suggestions regarding planning processes could be usefully applied to any of several HR processes.
In contrast, Tippins (2002) focused more on the necessary elements of a fully functioning selection
program, including critical elements such as the management of test materials, test administra-
tion processes, test preparation strategies, and test use rules. Finally, Klehe (2004) focused on the
institutional pressures that may help or hinder the adoption of selection procedures that are recom-
mended by academia.
This chapter completes the picture by addressing the organizational context in which the design,
implementation, and management of selection programs takes place. It should also be noted that this
chapter complements Chapter 9, this volume, which also addresses the organizational context for
selection. In contrast to Chapter 9, this chapter treats the organizational context as an independent
variable, if you will, that influences the features of selection programs necessary to be sustainable.
In Chapter 9, Ployhart and Weekley focus on the organization as the dependent variable by consider-
ing the impact of selection as a human resources (HR) management strategy on the organization.
At the most general level, organization purposes create the environment in which the most fundamental
decisions about sourcing employees are made. This chapter focuses on organization purposes served by
selection systems, in particular the multiple layers of purposes that can be served.
At the next level, the HR strategy—if it can be defined—is likely to provide essential direction in
establishing the goals and objectives of a selection program and the manner in which it is integrated
with other HR programs and business processes. At the next level of specificity, governance in the
form of operating principles, policies, and guidelines, establishes the authorities, accountabilities,
boundary conditions, roles, and the like, that enable the selection program to function effectively
and efficiently within the context of other HR processes. Finally, selection process management is
the most specific form of structure within which selection programs operate. The elements of pro-
cess management are highly specific to the functioning of a selection program. They can be com-
mon across all units and jobs or they can vary across units and jobs. Figure 10.1 provides a visual
depiction of these sustainability considerations as well as the specific points underlying each one
that are addressed in this chapter.
Before describing the four layers of organizational context that affect sustainability, we offer our
perspective about the meaning of selection system sustainability.
FIGURE 10.1 The four-part model of organizational factors influencing the sustainability of selection
programs.
to the extent that its purpose, strategy, governance, and management are consistent with these touch-
stones. A vital implication of our perspective is that although an organization may have designed a
sophisticated selection procedure that displays high validity, reliability, and fairness to begin with,
paying insufficient attention to sustainability issues will inevitably result in disuse or, more subtly, a
gradual (or even rapid) decline in the psychometric performance of the system over time.
ORGANIZATION PURPOSES
Traditionally, personnel selection has been conceived as the fit between the person and their immedi-
ate job role (i.e., P-J fit; Chatman, 1991; Ostroff & Rothausen, 1996). As organizations have become
more delayered, flexible, and team-based in their structures, so the imperative to consider person-
nel selection from other, more superordinate levels of fit has gained momentum among researchers
and personnel practitioners active in employee selection. Principally this has meant that issues of
person-team fit (P-T fit) and person-organization fit (P-O fit) have been added to the selection agenda
over recent years, and the need to consider any selection decision from all three levels of analysis—
person-job fit (P-J fit), P-T fit, and P-O fit—has been increasingly recognized (e.g., Ployhart &
Schneider, 2005). In effect, this has resulted in the criterion space under consideration in selec-
tion being substantially extended to include other levels of fit. Yet this expansion of the criterion
space has occurred only recently, and the research base upon which I-O psychologists can make
grounded recommendations to organizations about how best to manage multilevel selection systems remains
underdeveloped. To illustrate this point, two quotes will suffice.
The critical challenge is to expand our conceptual horizon beyond the level of person-job fit and to
incorporate multiple and interactive levels of analysis into selection decision-making. (Herriot &
Anderson, 1997, p. 26)
Reflecting on nearly a century of personnel selection research, it is quite troubling to us that we have
no solid answers … and approaches to answering the questions remain outside of traditional personnel
selection research. We may be able to show how hiring better people contributes to better individual job
performance, but we have hardly examined whether this contributes to better unit-level performance.
(Ployhart & Schneider, 2005, p. 496)
This relative paucity of research into selecting for P-T and P-O fit compared against the mass
of studies into aspects of P-J fit of course leads to problems in making sound recommendations for
the management of selection systems at different levels of analysis (e.g., Anderson, Lievens, van
Dam, & Ryan, 2004; Ployhart, 2007; Schneider, Smith, & Sipe, 2000). Despite this discrepancy in
research coverage, the area of multilevel selection has recently become far more active, and sev-
eral authors internationally have contributed theoretical models (e.g., Ployhart & Schneider, 2002;
Ployhart, 2007; Stevens & Campion, 1994), empirical studies have been published (e.g., LePine,
Hollenbeck, Ilgen, & Hedlund, 1997; Morgeson, Reider, & Campion, 2005), and even validated
measures of P-T fit have appeared in publication (e.g., Burch & Anderson, 2004; Mumford, Van
Iddekinge, Morgeson, & Campion, 2008). In short, the approach to date has been far more speculative
than settled: the focus has been on generating theoretical models and conceptual think-piece papers
rather than on publishing robust empirical studies of multilevel selection effects.
Despite these shortcomings, several important implications for the management of selection sys-
tems can be gleaned from the recent literature. In perhaps the most detailed and directly relevant
contribution to the validation of multilevel selection decisions, Ployhart and Schneider (2005) pro-
posed a ten-stage model for the conduct of any such validation study. The stages are as follows:
2. Articulate relationships between theory and measurement issues, especially with regard to
data aggregation.
3. Articulate predictors: Define predictor methods and specify their predominant level(s)
of analysis.
4. Specify within-level relationships: Operationalize the direction and strength of knowledge,
skills, abilities, and other characteristics (KSAO)-criterion relationships within level
(i.e., P-J, P-T, and P-O).
5. Specify cross-level relationships: Operationalize contextual effects, cross-level effects,
and multiple-level effects.
6. Sample units: Sample a sufficient number of units to test for within- and cross-level
relationships.
7. Measures: Use appropriate measures for each level of analysis.
8. Aggregation: Test for unit-level variance and reliability of aggregation procedures.
9. Analyze data using appropriate procedures.
10. Interpret results giving consideration to within- and cross-level findings.
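Stage 8, the aggregation test, is commonly operationalized with an index such as ICC(1), which compares between-unit to within-unit variance before individual scores are averaged to the unit level. The following is a minimal illustrative sketch, assuming a balanced design (equal unit sizes); the function name and the data are hypothetical, not drawn from Ployhart and Schneider's model.

```python
# Illustrative sketch: ICC(1) from a one-way ANOVA on a balanced design.
# Formula: ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW), where k is the
# number of members per unit. Data and names are hypothetical examples.

def icc1(groups):
    """ICC(1) for a balanced design; `groups` is a list of equal-sized
    lists of individual-level scores, one inner list per unit."""
    n_groups = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]

    # Between-units mean square
    ss_between = k * sum((m - grand_mean) ** 2 for m in group_means)
    ms_between = ss_between / (n_groups - 1)

    # Within-units mean square
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_within = ss_within / (n_groups * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Three hypothetical 4-person units with high within-unit agreement
units = [[4.1, 4.3, 4.0, 4.2],
         [2.9, 3.1, 3.0, 2.8],
         [3.6, 3.5, 3.7, 3.6]]
print(round(icc1(units), 3))  # → 0.964
```

A value near 1 indicates that most variance lies between units, supporting aggregation of individual scores to the unit level; a value near 0 argues against aggregation.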
This procedure for validating multilevel selection decisions is comprehensive, but it is apparent
that only the most statistically versed HR practitioners, supported by an industrial-organizational
(I-O) psychologist able to undertake the relevant analyses, could complete it in full. Rather, it is
far more conceivable in practice that selectors will approach such decisions
in a notably ad hoc manner, will give weights to different within- and cross-level variables on the
basis of some notional “rules of thumb” known only to themselves, and will be prone to a gamut
of errors brought on by information overload, imperfect information processing, and satisficing in
their decision-making strategies. Indeed, this is what we would expect from, say, the vast literature
now accumulated on interviewer and assessor decision-making under conditions of information
overload.
Yet Ployhart and Schneider’s (2005) model for validation, and thus sustainability management,
is highly valuable in pointing up the complexities of the task facing any practitioner or researcher.
Other authors have highlighted other issues of concern, including the likelihood that maximizing fit
at one level of analysis can lead to declines in fit at other levels of analysis. For instance, Anderson
et al. (2004) proposed three types of effects in cross-level selection decision-making: (a) comple-
mentary, (b) neutral, and (c) contradictory fit. That is, KSAOs being sought by an organization at
the level of P-J fit can either be complementary to P-T and P-O fit, neutral in their overlaps, or more
problematically, contradictory in their effects. For example, high extraversion needed for P-J fit can
be complementary for team-level issues of fit; whereas, conversely, the high rule independence needed
for innovation potential in a research and development (R&D) scientist may militate against P-O
climate fit in an organization that possesses a strong climate in which conformity is valued.
The subdomain of multilevel fit in selection promises to generate novel but far more complex
models of decision-making to support organizational efforts to optimize P-J, P-T, and P-O fit.
However, this field remains at a very embryonic stage of development with mostly theoretical and
model-building contributions published to date. Applied research in field study settings is badly
needed to begin to extend and clarify our understanding of these complexities and how best to
advise organizations to deal with the many issues, challenges, and controversies thrown up by mul-
tilevel fit in employee selection.
HR STRATEGY
HR strategy can vary significantly across organizations. For example, very small or highly entrepre-
neurial organizations may have no formalized HR strategy, whereas large organizations are likely
to have an HR strategy that is professionally developed and integrated to some extent into the busi-
ness strategy. Our experience with medium and large organizations points to the importance of five
that information about employees’ current skills be used to make selection decisions about moving
employees to other jobs. In this situation, selection programs may need to focus on two consider-
ations—the relevance of current skills to future work and the relevance of current performance to
future performance in new jobs. In this scenario, the distinction between skills and performance
can be important for a workforce management selection program. This distinction hinges on the
assumption that skills and performance are assessed differently. In our experience, workforce man-
agement strategies that focus on the movement of employees between jobs vary in the extent to
which they rely on records of performance and assessments of skills. Where the movement under
consideration is between two similar jobs, the selection emphasis is often on recent performance
in the current job. Recent performance is usually assessed by referring to administrative records
of recent job performance such as appraisal ratings, salary and bonus awards, progressive work
assignments, and the like. This approach is particularly evident within the realm of expatriate man-
agement, in which selection decisions are typically based on the expatriate’s technical expertise and
domestic track record as opposed to language skills, international adaptability, and other selection
context predictors (Bonache, Brewster, & Suutari, 2001; Harris & Brewster, 1999; Mendenhall &
Oddou, 1985). This may in part be due to the fact that the expatriate position for which the candi-
date is being sought is highly similar to the domestic position this candidate is vacating. In contrast,
where the movement is between dissimilar jobs, the selection focus is more likely to be on skills that
are assessed independently of administrative assessments of performance. Such skill assessments
may include ad hoc supervisor’s ratings of defined skill levels, skill tests, and ad hoc interviews.
Another scenario can be even more problematic for selection programs. Many organizations con-
tract with external organizations to provide workers who perform work in the client organization. If
the client organization does not require contract employees to complete its own selection process to
be assigned to the jobs in question, it almost certainly faces a future dilemma. The dilemma arises
when, as is often the case, the client organization eventually wants to hire, as its own employee, a
contract employee who has demonstrated sufficient job success. In this case, there is the very real and
predictable likelihood that some significant percentage of successful contract employees will fail
to satisfy the client organization’s selection criteria despite their demonstrated success on the very
job for which they are being hired. This conflict between job success and selection failure can cause
serious harm to the credibility and defensibility of the selection program, although it may be entirely
consistent with the level of validity and the de facto selection rate. Within sustainable selection pro-
grams, a strategy will be pursued to avoid this conflict by requiring all contract employees to satisfy
the selection criteria prior to being assigned to work in the client organization or by establishing
some form of selection policy (see below) that allows recent success in the same job to be a surrogate
for satisfying that job’s selection criteria.
This prospect of having two ways of satisfying selection standards for a job may also manifest
itself where a vacancy may be filled by external applicants or by incumbent employees as part of
internal progression programs. For example, the movement of employees from entry-level techni-
cal positions to higher-level technical positions may be governed by a progression program that
specifies requirements for progression from one level to the next. In such progression programs,
progression requirements are selection criteria, and the employee-applicant often has some degree
of control over the process of satisfying the requirements. Internal progression requirements often
consist of various standards including demonstrated skills, training certifications, and/or current
job performance. In contrast, external hiring into the same job may consist of a different profile
of selection criteria such as education degrees, years of experience, interviews, and perhaps even
qualification test results. It is not uncommon for internal progression requirements to be different
than external hiring criteria for the same position simply because more local information is known
about incumbent employees than about external applicants.
Where such differences occur, it is crucial to give careful consideration to the equivalence
of the two paths to the target position. In many cases, it is very difficult, if not impossible, to
define equivalence psychometrically. There may be few, if any, people who have scores on both
sets of selection criteria. The selection criteria in the two paths may be qualitatively different.
For example, internal progression may rely heavily on administrative documentation of local
workplace behavior such as performance and training achievement, whereas external hiring is
likely to rely on indicators such as degrees and test and interview results. One possible empirical
definition of equivalence is that job hires coming from the two paths tend to perform equally well;
that is, they have the same expected performance level. Other definitions of equivalence may be
rational, rather than empirical. One rational definition is that receiving managers agree the two
sets of standards are equivalent. A similar definition is that receiving managers are indifferent with
regards to the qualifications of the two types of applicants. However established, it is crucial that
the organization establishes the equivalence of the two paths for the two sets of selection criteria
to be simultaneously sustained.
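One way to operationalize the empirical definition of equivalence above (hires from the two paths have the same expected performance level) is an equivalence check rather than a conventional difference test. The sketch below is illustrative only: the two-one-sided-tests (TOST) style interval check, the ±0.3 margin, the normal approximation, and the ratings data are assumptions, not prescriptions from this chapter.

```python
# Illustrative sketch: declare two hiring paths equivalent if the ~95%
# interval for the difference in mean performance lies entirely within
# a pre-chosen margin. Margin, data, and names are hypothetical.
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def equivalent(path_a, path_b, margin, z=1.96):
    """TOST-style check using a normal approximation to the
    difference in means of two independent samples."""
    diff = mean(path_a) - mean(path_b)
    se = math.sqrt(variance(path_a) / len(path_a)
                   + variance(path_b) / len(path_b))
    low, high = diff - z * se, diff + z * se
    return -margin < low and high < margin

# Hypothetical first-year performance ratings for the two paths
internal = [3.8, 4.0, 3.9, 4.1, 3.7, 4.0, 3.9, 4.2]
external = [3.9, 3.8, 4.0, 4.1, 3.8, 4.0, 4.1, 3.9]
print(equivalent(internal, external, margin=0.3))  # → True
```

Note the asymmetry with ordinary significance testing: merely failing to find a difference between the two paths is not evidence of equivalence; the whole interval must fall inside the margin.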
Funding
It is trite to say that sustainable selection programs must be adequately funded. Although true, that is
not the point being made here. Rather, the manner in which funding for selection programs derives
from corporate budgets and/or unit budgets has great influence on the organizational pressures
acting on the selection program. Where funding is mostly or entirely from corporate budgets and
is relatively distant from the means by which units fund the corporation, it is likely that corporate
interests in defensibility, fit with HR strategy, and perceived fairness and equivalence across units
will take on more importance in the design and management of selection programs. Where unit-based
funding is substantial or, often, even contingent on unit-level satisfaction with selection programs,
the pressures for unit-specific design and management are likely to be much greater. Our view is
that the latter condition is more difficult to manage for selection program managers because it can
create pressures that are more likely to conflict with the professional values of consistency, validity
across broad job families, and job focus. In general, corporate interests tend to have a convergent
influence supportive of a single, whole, integrative selection program whereas unit interests tend
to have a divergent influence that leads to differentiated, multiple selection practices across units.
Divergence of interests is more likely to create fundamental conflicts with the professional and legal
standards for selection programs.
Approval Roles
Two types of approvals are covered here: (a) the approval to implement or change a selection program
and (b) the approval to waive or exempt individuals from the requirements of a selection program.
(More detail about such approvals is provided below in the discussion of policy.) Where these two
approval roles reside in a corporate organization—perhaps even in the corporate organization that
designed and validated the selection procedures—the interests of the corporation are likely to be more
influential than if either or both approval roles reside in the affected units. In many ways, the impact
of approval roles is the same as the impact of funding source. The organizational entity that funds and
approves has more influence. However, we have seen various combinations of funding and approval
roles that create complexity in their mutual impact on selection programs. Indeed, selection programs
may be most sustainable where funding is corporate but approval is local. (The reverse combination of
local funding with corporate approval is unlikely in our experience except in organizations with very
220 Handbook of Employee Selection
highly centralized authorities.) The relevance of approval roles to sustainable selection programs is
that, at its core, the authority to approve the implementation of, changes to, or exceptions to a selection
program is tantamount to approval authority over the content of the selection program.
It may be difficult to reach agreement to organizationally separate funding and approval roles,
but, when separated, they create a form of checks and balances that may serve to sustain a selec-
tion program across a wider range of circumstances than if both were housed in the same level of
the organization. Corporate funding and local approval give both organizational levels a significant
operational stake in and influence over the selection program that is commensurate with their neces-
sary interests in the program.
The value we place on balancing these interests is rooted in the perspective that the effectiveness
of a selection program (of great local interest) and its defensibility or compliance (of great corporate
interest) are both necessary considerations and both require attention to be optimized. An alterna-
tive perspective we have observed in some organizations is that a selection program’s defensibility
can be quite variable and is assured only by persistent and rigorous attention, whereas the effective-
ness of a selection program can be satisfied more easily by the involvement of professional-level
selection expertise. In effect, this perspective holds that effectiveness can be attained by the exper-
tise of the designer but that defensibility requires continual attention to and alignment among all
processes that comprise a selection system. This latter perspective is less likely to seek a balance
between effectiveness and defensibility and is more likely to place great weight on defensibility.
In addition to the above issues, the sustainability of an expatriate selection system may be further
complicated by the fact that expatriates work in a myriad of different occupations and
countries. The term expatriate may thus be legitimately used to describe a French banker in Hong
Kong and an American geologist working for an oil company in Saudi Arabia. Any standardization
vis-à-vis expatriate selection decision-making is therefore likely to imply the comparison of apples
and oranges. This being the case, Mol (2007) has called for an abandonment of research into expa-
triate selection as such. A multinational bank might be better off selecting expatriate bankers on
the basis of the selection system in place for the selection of domestic bankers rather than trying to
develop a selection system that is tailored specifically to expatriate bankers in Hong Kong.
Expert-Only Strategy
In the expert-only strategy, the professionals who design, develop, and validate selection procedures
do not own the strategic direction of the organization’s selection programs. Although they may cre-
ate selection procedures, they do not determine which organizational needs will be addressed by
selection solutions, they do not determine what selection strategies will be applied across units, they
do not have authority over tradeoffs between cost and value of selection procedures, and so on. This
expert-only strategy can manifest itself in various ways. A recent organizational strategy is to house
selection experts in HR organizations sometimes called COEs or centers of expertise. These COEs
are positioned as technical resources to the business, which may be accessed by business units as
needed—in the judgment of the business units—to develop HR solutions to business problems.
Similarly, selection experts who are described as internal consultants serve in roles very similar
to COEs. COEs are almost certainly an indication of an expert-only approach. Another clear sign
of an expert-only approach is the situation in which selection experts are funded only on a project-
specific basis. This can be the case whether selection experts are located in corporate or local
organizations. A third sign of an expert-only approach is that the selection experts do not report
into an HR department. Being housed outside of HR almost always means that selection budget funding
is closely tied to specific projects rather than an overall strategic purpose for selection. Yet another,
more dramatic version of the COE approach is outsourcing, in which the selection design role is
moved from the home organization to an external organization to provide design services back to
the home organization. In this case, the role and scope of the selection designer’s work is specified
by a services contract, which is typically overseen and approved by nonselection experts. Finally,
the HR strategy of hiring vendors for professional selection expertise almost certainly indicates an
expert-only role.
The expert-only approach is likely to have several typical consequences for selection programs.
First, it will be difficult to develop a long-term plan for the gradual restructuring or introduction of a
comprehensive selection program or to evolve to become an experimenting organization (cf. Binning
& Barrett, 1989) where selection assessment results might become part of a larger workforce plan-
ning strategy. Second, virtually all authority over the administration of the selection program and
over initial decisions about standards and policies is likely to reside in the funding organization or,
possibly, in the organization responsible for the administration of the selection procedures to can-
didates. Third, the development of selection programs that apply in some consistent fashion across
different organizational units will be difficult. The scope of selection design work is more likely
to have a local focus on particular jobs within particular units. Fourth, local business leaders may
provide stronger support to locally focused selection programs than corporately focused programs
if they see the programs as more directly tailored to their specific needs.
Expert-Owner Strategy
Selection experts who also own selection strategy identify strategic directions by analyzing orga-
nizational needs that cut across units or that have the greatest long-term benefits for the organiza-
tion as a whole. A critical strategic activity for selection owners is long-term planning. Strategic
planning can take many forms but almost always includes collaborative planning with HR leaders
across units of the organization. Such planning would typically focus on common interests across
units as well as unique interests of specific units. As mentioned earlier, particular challenges may be
faced in this regard by expert-owners in MNCs in which subsidiary local idiosyncrasies (such as the
host country legal context and the local labor market) may prevent the establishment of a selection
strategy that cuts across the various units of the organization. Here, multinational organizations will
clearly need to be sensitive to local needs rather than merely attempting to impose a standardized
procedure upon multiple units. Indeed, we would argue that this tension between standardization
and country specificity will require active management.
Strategy ownership can manifest itself in various ways. Owners are more likely to have responsi-
bility for neighboring functions such as employment policy, employee research, training of employ-
ment administrative staff, and regulatory compliance that depend on or influence the selection
strategy. Strategy owners may have stronger and longer relationships than expert-only managers
with corporate and unit HR leaders because their roles are more comparable and interrelated. At the
same time, owners may not have strong relationships with unit business leaders because funding is
not as likely to be directly tied to business unit sources. This can be an important consideration for
selection strategy owners. Certainly there is considerable value in well-developed relationships with
HR leaders and with the business leaders they support. Typically, these relationships are not
managed independently. One approach is for the selection owner’s primary unit relationship to be with
the unit’s HR leader, with the business-leader relationship managed under the umbrella of the HR-leader
relationship.
Strategy ownership has other implications for selection programs. Such programs are more likely to be both
coherently integrated across units of the organization and supported by a common set of policies
and practices. Strategic selection programs are more likely to be integrated with other HR programs
such as training and development, workforce management, compliance support functions, and
organization-wide HR information systems. The selection development function is more likely to
Managing Sustainable Selection Programs 223
have policy authority regarding managers’ roles in selection decision-making, even if these roles vary
from unit to unit. One of the most tangible possible indicators of strategy ownership would be that
selection developers would have created a selection strategy document used to communicate and plan
with units and other HR functions and to develop selection budgets.
Overall, strategy ownership can be used to build in several features of selection programs that
promote sustainability. The one cautionary note is that strategic ownership tends to align itself
with corporate interests that are longer term and cross-unit. It is critical that the strategic role not
inadvertently lead to weaker relationships with local units where the actual selection decisions
are made.
factors to respond to changes when they occur. In sharp contrast, developers of selection programs
rarely track quality indicators such as turnover, performance levels, and job requirements on a con-
tinuous basis. One reason is that employee quality data are more difficult to obtain and developers
often spend the effort to gather them only in the context of ad hoc validation studies. Another, per-
haps more fundamental reason is that the quality of selection programs is not viewed in most cases
as highly variable across continuous short time intervals. A third, more subtle reason may be that
selection developers are generally conservative about the “validator’s risk” (M. Tenopyr, personal
communication, 1988). The validator’s risk is the gamble selection developers take with local
validation studies: the risk that a study may not support a conclusion of validity. In countries where the
regulation of employment selection procedures hinges on validation evidence, selection developers view
validation evidence in a legal context. The validator’s risk combined with the legal context often
results in developers being conservative about the conduct of validation studies. It is very unusual
for developers to have a continuous validation process in place. (A major exception to this is the case
in which the developer is a large consulting house with a selection process implemented in many
client organizations. In this case, the developer may have virtually continuous validation efforts
underway within and across client organizations. This ongoing, large-scale validation strategy tends
to minimize the validator’s risk, maximize defensibility, respond to local client desires for impact
measures, and provide external marketing documentation.)
The net result of these factors is that delivery functions are more likely than development func-
tions to have recent and continuous assessments of factors of interest to the owner. Indeed, with the
growing popularity of customer satisfaction measures for HR processes, selection delivery func-
tions may even have ratings of new hire quality (the developer’s domain) from hiring managers.
Independent of any other considerations, the ready availability of speed and cost metrics compared
with quality metrics can cause speed and cost metrics to be given more weight in the process of
rebalancing quality with speed and cost.
Given the availability of information about quality and speed and cost, the third step is to
determine the decision process(es) by which the current pressure to rebalance interests is resolved.
One efficient approach to these decisions is to establish a two-stage decision process. Stage one
is an administrative process designed to handle routine minor or temporary fluctuations without
directly involving the ultimate owner of the competing interests. Policies and practices can be
established with the owner’s concurrence to resolve such pressures. For example, temporary busi-
ness conditions that increase the urgency with which a vacant position must be filled might be
routinely addressed by an administrative approval process for authorizing a temporary change in
the selection standards. The key features of this first stage are that it is an established process the
owner has endorsed and that the developer and/or deliverer manages the process on behalf of the
owner’s interests.
Stage two is reserved for occasions in which the pressure to rebalance is greater in scope, more
important, less routine, and/or has longer-term implications than the simpler administrative process.
The key difference from the first stage is that here the owner is directly involved in the
rebalancing decision. In stage two, the role of the developer and deliverer is to provide information about
the competing factors and to describe the methods and implications of changes to those factors as
well as constraints on what is possible. They may also be well served to make recommendations to
the owner.
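The escalation logic of this two-stage process can be sketched as a simple routing rule. The sketch below is illustrative only: the fields and thresholds (units affected, duration, permanence) are invented placeholders, because the actual stage-one criteria would be whatever the owner has endorsed in policy.

```python
from dataclasses import dataclass

# Schematic illustration of the two-stage rebalancing process described above.
# All criteria here are invented for illustration; real thresholds would be
# set in policy with the owner's concurrence.

@dataclass
class RebalanceRequest:
    units_affected: int
    duration_days: int
    changes_standards_permanently: bool

def route(request: RebalanceRequest) -> str:
    """Stage one handles routine, temporary, local pressure; anything of
    greater scope, duration, or permanence escalates to the owner (stage two)."""
    if (request.units_affected <= 1
            and request.duration_days <= 90
            and not request.changes_standards_permanently):
        return "stage one: administrative process"
    return "stage two: owner decision"

print(route(RebalanceRequest(1, 30, False)))   # routine, temporary, local
print(route(RebalanceRequest(5, 30, False)))   # broader scope escalates
```

The point of the sketch is only that the boundary between the two stages is an explicit, owner-endorsed rule, not an ad hoc judgment made at administration time.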
The underlying principle of this two-stage process is that, above some threshold of importance,
the accountability for balancing competing interests of quality, speed, and cost lies with the ulti-
mate owner of those interests. One of the greatest risks to a selection program’s sustainability is the
disengagement of the ultimate owner from key decisions that impact the value and cost of the selec-
tion program for the owner. An important secondary benefit of an owner-based decision process for
rebalancing competing interests is that it occasionally re-engages the owner with the accumulated
information and decisions that give direction to selection programs and that ultimately impact the
owner’s success.
HR TECHNOLOGY
We briefly note here that the manner in which HR technology is applied to selection programs can
have a significant impact on the sustainability of selection systems. This subject is taken up in con-
siderable detail in Chapter 8, this volume.
GUIDING PRINCIPLES
We describe guiding principles here despite our experience that many organizations do not explic-
itly identify the guiding principles that shape selection programs. Where used, guiding principles
express the implications of an organization’s cultural values for selection programs. One reason
they may be less common is that many organizations have defined and communicated explicit
cultural values that are general in nature (e.g., integrity, respect for others, customer focus, team-
work, and safety), which may be seen as sufficient to provide overall guidance to HR programs,
including selection programs. Also, it is frequently the case that selection programs have a
considerable amount of process-specific governance in the form of policies, systems requirements,
and well-defined practices. Even if guiding principles have a strong influence on the development
of such policies and practices, once those policies and practices are implemented, behavior in
the selection processes may be constrained to the point that guiding principles may have little
incremental value.
Guiding principles apply across all units within an organization. They are not authority-based but,
like general organizational values, are expectations or aspirations. Even where guiding principles
are defined and made explicit for selection programs, there is rarely any enforcement mechanism
put in place other than the general accountability associated with an organization’s code of conduct,
explicitly or implicitly stated.
The benefit of guiding principles is that they provide operationally relevant guidance to selec-
tion programs. The overall purpose of this guidance is to ensure selection programs are designed,
delivered, and used in a manner consistent with the values most fundamental to the organization. In this way,
they enhance the sustainability of selection programs by ensuring a fit with the organization. The
following briefly describes selection program operating principles we have found in various large
organizations.
1. People are accountable for selection decisions: This principle fixes the accountability for
selection decisions on the people who make hiring decisions, rather than on rules or algorithms
that might be built into decision support tools. An implication of this principle is that selection
programs should be designed to inform and guide decision-makers, not replace them.
2. Choose the best: This principle establishes a deliberately high standard for who is
selected.
3. Equal access/opportunity: Many organizations will espouse an operating principle relating
to some aspect of fairness in the selection processes. In cultures that place high value on
performance-based results, such as the Latin American societal cluster (Gupta & Hanges,
2004), this principle is unlikely to refer to equal outcomes and is more likely to refer to
some other equity principle such as equal access or equal opportunity.
4. Compliance with government regulations: An operating principle endorsing compliance
with prevailing laws may seem unnecessary. Nevertheless, organizations that choose to
endorse such a principle may do so to set an important tone for all participants in its selection
programs. Communicating about the importance of compliance can deter risky behavior, and the
language an organization uses to describe a compliance principle can have a significant impact on
the manner in which compliance is addressed.
5. Selection processes do not replace performance management: This principle addresses
the appropriateness of possible uses of assessment results. Our experience has been that,
occasionally, the relative ease and simplicity of assessment-based selection processes lead
managers to consider ways in which assessment could be used to facilitate other person-
nel decisions. This operating principle would discourage the use of ad hoc assessments as
surrogates for current performance information where that information has not been well managed.
6. Selection assessment benefits the individual as well as the organization: Organizations
that embrace an explicit commitment to act in the interests of employees and even appli-
cants may endorse some form of principle that selection assessment results should benefit
the people who are assessed. This can be a strong principle that leads to assessment feed-
back, assessment preparation information, and assessment-based development feedback
that might not be provided otherwise.
In summary, guiding principles are intended to provide values-based guidance to the development
and ongoing administration of selection programs as well as to the individuals who make selection
decisions. They are likely to have less ongoing value where an organization develops policies and
practices that impose relatively “tight” constraints on processes and people. But where such policies
and practices are either nonexistent or impose only “loose” constraints, guiding principles may be
more important.
SELECTION POLICY
In contrast to the guidance offered by operating principles, selection policies are prescriptive. They define
authority and roles, establish rules and requirements, and set limits and boundary conditions.
Because policies have direct and explicit impact on the behavior of virtually all participants in the
selection process, they often are seen as the “face” of the selection program. They are some of the
most tangible aspects of a selection program. Selection policies also are perhaps the best indication
of the ownership of the selection program. Because policy establishes authority, policy ownership is
the clearest indicator of selection program ownership.
It is likely that selection programs are among the most policy-oriented of all HR programs and
practices. There are three primary reasons. First, in many countries, national and/or local laws
regulate employment selection. Where it is regulated, organizations are very likely to establish policies
that enable compliance with those legal requirements. Second, employment selection is a high-
stakes process. Employment decisions have large consequences for people on both sides of the deci-
sion. People care a lot about employment decisions, and their interests sometimes conflict. Policies
are often used to manage these interests and the consequences for people. Third, employment selec-
tion is about a scarce, but valuable, resource. A qualified person selected into one job by one unit is
no longer available to be selected by other units for other jobs. Many organizations have found that
policy is required to govern managers’ access to candidates and candidates’ access to jobs.
that selection policy about access to selection data would need to incorporate HIPAA provisions.
A fundamental principle underlying HIPAA protections is that the patient “owns” her health infor-
mation and that the patient’s authorization is required to release and transmit that information.
Notwithstanding HIPAA in the United States, some organizations may choose to establish a selec-
tion data ownership policy that explicitly establishes ownership of selection data. It is unlikely an
organization would regard the applicant as the “owner” of her selection data for various reasons.
However, the organization may establish access rules that protect candidates’ interest in having
the selection data used only for purposes for which they are intended and valid. For example, as a
practical matter, a policy might prohibit managers from accessing an employee’s selection data for
purposes of performance management feedback or promotion consideration to a job for which the
selection data are not relevant.
The most common example of grandfathering applies to incumbents in a job when new or revised
selection standards are applied to that job.
A second type of standard exemption relies on an equivalency between two different sets of
selection standards. For example, a work simulation assessing telephone-based customer handling
skills in a sales context may be regarded as assessing the same skills, perhaps at a higher level, as
a similar work simulation designed in a service context. An equivalency policy means that candi-
dates who satisfy a particular standard on one selection procedure are treated as having satisfied
a particular standard on another selection procedure. The third common example of a standard
exemption relies less on an equivalency rationale than on a substitution rationale. For example, a
candidate who has demonstrated certain work experience, training, degrees, or other education/
training experience may be exempt from having to meet a test standard designed to be predictive of
those very accomplishments. In effect, the candidate has accomplished the criterion result the test
was designed to predict.
Selection programs are less prone to incremental erosion of confidence and credibility to the
extent that systematic rationales for exemptions can be anticipated, accounted for in the original
application rules, and taken out of the hands of local, ad hoc decision-makers.
A final example is offered of a policy designed to establish authority for an ad hoc decision about
the use of selection standards. This example is different from the examples above, which rely on
some form of equivalence or substitution rationale. In effect, those rationales are all grounded in
the construct relevance of one set of standards to another set of standards. In contrast, this example
is grounded in what might be called a business necessity rationale. The typical situation is one in
which there is an ordinary set of selection standards for a particular job. For the sake of this exam-
ple, assume this ordinary set of standards is typically satisfied by 20% of the applicants. In all likeli-
hood this set of standards was chosen, in part, because the selection ratio yielded by these standards
enabled the hiring organization to meet its ordinary hiring needs at an acceptable cost and speed.
But business circumstances are always changing. From time to time, the hiring organization may
have an urgent need to substantially increase its hiring rate. For example, in the Netherlands,
mandatory military service was lifted in the 1990s, resulting in thousands of unfilled vacancies. In this
case, there can be a compelling business necessity rationale to temporarily or permanently reduce
the selection standards to achieve the increased hire rate necessary to meet the business need. A
policy can be developed to address this special case that allows standards to be temporarily lowered
and may even specify certain conditions that must be satisfied to ensure the need is substantial. At
root, this authority, like the others described above, owns the responsibility to evaluate the ad hoc
tradeoff between the benefits of a faster, easier, less onerous, and possibly fairer-seeming selection
process and the potential loss in expected performance among those selected. However, the policy
assigns authority, and it is important for these exemption processes to rely on input from the affected
business managers about the impact of the tradeoff on their business.
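The arithmetic behind the business-necessity rationale is straightforward: with a fixed applicant flow, the pass rate set by the standards caps the achievable hiring rate. A minimal sketch, with invented numbers built around the 20% figure from the example:

```python
# Back-of-the-envelope arithmetic for the business-necessity example.
# The applicant flow and hiring target are invented for illustration;
# only the 20% pass rate comes from the example in the text.

def hires_per_month(applicants_per_month: float, pass_rate: float) -> float:
    """Hiring rate implied by a given applicant flow and pass rate."""
    return applicants_per_month * pass_rate

applicants = 500          # assumed monthly applicant flow
ordinary_pass = 0.20      # the 20% figure from the example above

print(hires_per_month(applicants, ordinary_pass))  # 100.0 hires/month

# If the business suddenly needs 150 hires/month and applicant flow cannot
# grow, the standards must pass at least this fraction of applicants:
needed = 150
required_pass_rate = needed / applicants
print(required_pass_rate)                          # 0.3
```

This is the tradeoff the policy places before the accountable authority: raising the pass rate from 20% to 30% meets the hiring need, at some expected cost in the quality of those selected.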
Often there are competing interests with respect to policy efforts designed to manage applicants’
access to selection processes. Policies that restrict access often have the direct or indirect effect of
increasing the yield rate among the applicants who do have access under those policies. For exam-
ple, typical retest policies limit applicants’ opportunities to retake selection tests they have previ-
ously failed. Given that retest policies, by their nature, limit the access of people who do relatively
poorly on tests, they are likely to increase the overall yield rate of the selection process by limiting
access to some lower scorers. Of course, the countervailing effect of retesting is that it generally
increases scores by approximately one-fourth of a standard deviation (Hausknecht, Halpert, Di
Paolo, & Moriarty Gerrard, 2007), thus increasing the pass rate among the applicants who retake
tests. However, recent evidence from Lievens, Reeve, and Heggestad (2007) indicated that this score
increase introduces measurement and predictive bias that harms criterion validity. Similarly, policies
that exclude candidates who do not meet basic qualifications such as education and work experience
are, in many cases, more likely to block lower qualified applicants, thus increasing the overall yield
rate. These types of access policies that increase the yield rate will, all else the same, reduce cost
per hire and, possibly, reduce the cycle times for employment processes.
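The interplay between a retest score gain of roughly one-fourth of a standard deviation and the pass rate can be illustrated with a simple normal-distribution model. This is a sketch under assumed conditions (standardized scores, a hypothetical cut score at the 80th percentile, and the retest gain modeled as a pure mean shift), not a claim about any particular test:

```python
from statistics import NormalDist

# Illustrative only: assume standardized test scores ~ N(0, 1) and a
# hypothetical cut score set at the 80th percentile (20% first-attempt pass rate).
scores = NormalDist(mu=0.0, sigma=1.0)
cut = scores.inv_cdf(0.80)

# Pass rate on first attempt.
first_pass = 1.0 - scores.cdf(cut)

# Retest effect: scores rise by ~0.25 SD on average (Hausknecht et al., 2007),
# modeled here as a simple mean shift for the retest population.
retest = NormalDist(mu=0.25, sigma=1.0)
retest_pass = 1.0 - retest.cdf(cut)

print(f"first-attempt pass rate: {first_pass:.1%}")
print(f"retest pass rate:        {retest_pass:.1%}")
```

Under these assumptions, the pass rate among retesters rises from 20% to roughly 28%, which illustrates how retest policy and overall yield rate interact.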
On the other hand, policies that facilitate the access of larger numbers of applicants better ensure
that enough qualified candidates are available at any point in time. Also, they accommodate the
desires of candidates who seek jobs in the organization, thus potentially improving the candidates’
good will toward the organization. In addition, increased access may reduce recruiting expenses, all else
the same.
Legal Compliance
Certain selection policies are directed primarily at the organization’s legal compliance responsibili-
ties. The policies we include in this category are the policies that establish authority for monitoring
selection results for evidence of prohibited discrimination, for owning modifications to selection
procedures to improve compliance, for the decisions about whether any modifications to selection
procedures should be made, and for responding to enforcement agencies’ requests for compliance
information.
This category of policies also relates to the question of the “official” database of selection results
for applicants and employees. It is often the case that selection data are formally and informally
located in various files—paper and electronic. Certain selection data such as hiring manager inter-
view ratings and protocols are often kept in local personnel files or even in hiring manager files.
In contrast, other selection data such as formal assessment results, demographic data, and resume
information are often maintained in corporate or unit HR information system databases. Compliance
support policy should specify what the “official” selection database is, how selection data get into
that database, how they are organized there, and who is responsible for putting them there.
An additional consideration regarding compliance policy is that the compliance issues associated
specifically with selection processes are often part of a larger employment and recruiting context.
Enforcement agencies may be as interested in recruiting and sourcing methods, resume searching
and screening, and an organization’s internal staffing system features as they are in detail about
selection procedures, data, and decisions. This broader context of compliance issues often involves
other roles and organizations beyond the development, validation, and maintenance of selection
programs. In this situation of multiple organizations having a role in employment compliance, selec-
tion policy is best integrated with compliance policies of the other organizations. However this inte-
gration takes place, it is advantageous to have a clearly established role with overarching authority
over responses to enforcement agencies.
policy that determines who authorizes the selection standards for a particular selection procedure.
To help make this point about alignment, suppose a new selection procedure is designed to make
hiring decisions for a call center where account representatives talk to customers about their orders,
bills, and payments. The selection procedure consists of a work sample exercise to assess customer
handling skills and a cognitive ability test to assess information learning and processing skills.
In this example, a policy question is, “Who should have the authority to approve the standards by
which these selection procedures are used to make selection decisions?” The standards can take
many forms, including pass-fail cut scores, score bands, and methods of combining the work
simulation and cognitive test results. The choice of standards will impact the cost and speed of the hiring
process, the performance of the new hires, and the degree of impact on protected groups, if any. In
determining who should have the authority to approve the final set of standards, the question that
should be asked is, “Who has accountability for the outcomes that will be affected by the approved
standards?” In this example, the accountabilities may be divided among different organizations
and roles. For example, the business leader over the call center operation is likely to have ultimate
accountability for the performance of the account representatives. In some organizations, that same
business leader may also have ultimate accountability for the supporting employment process and
its compliance with prevailing regulations. In this situation, a very strong case can be made that the
business leader who is accountable for all of the most important consequences of the selection deci-
sions should have the authority to approve selection standards. This policy would then, presumably,
define the role of the designer of the selection system, the manager of the employment process, and
the compliance manager as expert resources to the business leader’s decision about the standards.
This situation is an example of high alignment between authority and accountability.
The point of this subsection is that selection policies contribute to selection system sustainability
in various ways, but that a paramount requirement of selection policies is that the authority granted
by a policy should be aligned with the accountability for the consequences of the decisions made
under the policy. One implication of this alignment principle is that the selection program designer
may not have the authority over all selection-relevant policy decisions. In particular, the authority
to approve the selection standards that drive key business results is most aligned with the role that
“owns” the same business results.
The remainder of this section examines the impact that the trend toward process management is having on the management of selection programs and offers suggestions about strategies for sustaining selection programs in this changing context.
The distinction between the transaction management work of employment process manage-
ment and the “knowledge management” work of the development and validation of selection
programs is important. Like other HR-oriented “knowledge” work (e.g., compensation and labor
relations), the development and validation of selection programs has historically been managed
as an expertise, not a process. In general, the performance standards for these types of work have
been imprecise and general. Typically, the evaluation of a selection developer’s performance in
the development of new selection procedures does not rely on quantified metrics describing the
development process.
Increasingly, the focus on process management has invited the "customers" of employment processes (hiring managers and business leaders) to require metrics of the employment process as the means by which they evaluate the quality of those services. Common employment process
metrics include (a) cycle time measures such as time from requisition to hire and time between
employment process events; (b) flow rates through each step in the employment process (e.g., the
rate at which people who schedule an employment office interview actually show up, complete
the interview, and move on to the next event); and (c) various cost measures such as cost per hire,
cost per candidate, or cost per event such as cost per assessment test or per interview. Clearly,
these process-oriented metrics are affected by the selection procedures and standards produced
by the selection developer, which may be seen as a root cause of satisfactory or unsatisfactory
process metrics.
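These three families of metrics are straightforward to compute from a hiring event log. The sketch below is illustrative only: the records, field names, and figures are hypothetical, not drawn from any particular employment system.

```python
from datetime import date

# Hypothetical hiring records; field names are invented for illustration
hires = [
    {"requisition": date(2024, 1, 3), "hire": date(2024, 2, 14)},
    {"requisition": date(2024, 1, 10), "hire": date(2024, 3, 1)},
]

def avg_cycle_time(records):
    """(a) Cycle time: mean days from requisition to hire."""
    return sum((r["hire"] - r["requisition"]).days for r in records) / len(records)

def flow_rate(entered, completed):
    """(b) Flow rate: share of candidates completing a process step,
    e.g., showing up for a scheduled employment office interview."""
    return completed / entered

def cost_per_hire(total_process_cost, n_hired):
    """(c) Cost measure: total employment process cost per hire."""
    return total_process_cost / n_hired

print(avg_cycle_time(hires))        # 46.5 days
print(flow_rate(80, 64))            # 0.8
print(cost_per_hire(150_000, 30))   # 5000.0
```

In practice, each of these would be tracked continuously and broken out by process step, which is exactly what allows hiring managers to treat the employment process as a managed process rather than an expertise.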
Beyond these most typical metrics, additional metrics may be included to capture information
about the quality of the selected employees. The two most frequent examples of quality-of-hire met-
rics are early survival rates (e.g., 3-, 6-, and 12-month survival) and hiring manager (i.e., customer)
ratings of early overall satisfaction with the new hires. However, a fundamental problem is that the
process options available to employment process managers may have little effect on quality-of-hire
metrics. Indeed, options such as enhanced job preview processes and more targeted recruiting prac-
tices, which may offer some improvement in quality-of-hire metrics, may do so at a higher cost.
We suggest here an approach that may be helpful for selection program managers facing the new pressure that employment process metrics place on the sustainability of selection procedures. Essentially, this approach is to rethink the potential value of process metrics,
not in terms of research value but in terms of business decision value, and change or supplement the
information available to business leaders to help them continuously monitor the benefit of selection
procedures and accompanying usage standards. The research perspective tends to view a selection
program as a relatively fixed, unchanging manifestation of the basic, stable requirements of job suc-
cess. The business process perspective views selection programs as organizational processes in the
context of real-time business conditions that can change rapidly.
These different perspectives have led to very different approaches to the evaluation of selection
procedures and employment processes. Validation is an episodic, occasional event, needed only every several years to confirm that the causal model has not changed. Process metrics, in contrast, represent a continual process that enables process managers to optimize processes as needed. Business
managers are not trying to confirm scientific conclusions; they are trying to make business decisions
with uncertain data to optimize important outcomes.
Our own view of these divergent perspectives is that, although selection developers cannot surrender the importance they attach to validation, they would be wise to become more open to the prescientific value of continuously gathered data about worker behavior, such as the quality-of-hire data gathered by employment process managers. For many reasons, these types of data do not
have the information value of worker behavior data gathered in research settings. But they do have
value for building a more complete understanding of the possible situational dynamics that impact
worker behavior and a deeper understanding of the relationship between worker behavior and the
business outcomes that are most important to work managers.
Managing Sustainable Selection Programs 233
CONCLUSIONS
This chapter describes the organizational considerations that directly influence the sustainability
of selection programs. The four overarching categories of these organizational considerations are
organization purpose, HR strategy, governance, and process management. Beyond the technical
considerations of validity, utility, bias, and fairness, we make the case that these organizational con-
siderations are critical in designing and implementing a selection program. To the extent that pur-
pose, strategy, governance, and process are deliberately incorporated into the design of the selection
program, the success of that program is better ensured. Inattention to these organizational consider-
ations can undermine the sustainability of a selection program despite its validity.
We also note here that much of this chapter has been written from experience more than research.
The sustainability of selection programs warrants more research attention than has been given in
the past. Psychometric concerns are critical, but any organization that neglects sustainability does so at its own peril and will likely find, in due course, that the psychometric integrity of its selection procedures has been compromised.
11 The Business Value of Employee Selection
Hiring good people is hard. Hiring great people is brutally hard. And yet nothing matters more in
winning than in getting the right people on the field. All the clever strategies and advanced technologies
in the world are nowhere near as effective without great people to put them to work.
Jack Welch
Winning (2005, p. 81)
Industrial and organizational (I-O) psychologists have played major roles in developing selection (or
staffing) tools and implementing selection programs at every level for organizations of every size
and in every industry, domestic and multinational. This chapter focuses on evaluating, monitoring,
and managing the business value of employee selection. We begin by offering some general com-
ments about the traditional model of employee selection, or staffing, its focus, and its components,
with particular emphasis on selection as a dynamic organizational process and the rationale for
evaluating the business value of employee selection. We then consider what managers know about
employee selection, the different perspectives of I-O psychologists and managers, and what both
groups should know about the value of employee selection. Following that discussion, we present a
decision-based framework that illustrates the logic of employee selection, with particular emphasis
on assessing the outcomes of selection efforts. Such outcomes may be expressed in qualitative or
quantitative terms, and we illustrate both. We also illustrate the potential payoffs associated with
valid selection programs over multiple cohorts and time periods, and the assessment of employee
performance in economic terms. We conclude with a set of recommendations for managing and
monitoring the business value of employee selection, including trade-offs among managerial con-
cerns for “better, faster, cheaper, with less adverse impact.”
process as prescreening devices. A set of measures that is consistent with the current staffing
model (Schmidt & Hunter, 1998) may include biographical data collected using application blanks
(Schmitt et al., 2007), integrity tests (Berry, Sackett, & Wiemann, 2007), and drug testing (Haar &
Spell, 2007). Often those job applicants who successfully complete the initial screening stage may
be required to pass a background check (Connerley, Arvey, & Bernardy, 2001) and, if they do,
they may be given paper-and-pencil or computer-administered tests that assess their general mental
abilities (GMAs) and personality traits (Behling, 1998; Frase, 2007; Hogan, Davies, & Hogan, 2007;
Morgeson et al., 2007), followed by an interview (Chapman & Zweig, 2005). Finally, for managerial
and other high-level jobs, there may be an additional stage, including a work-sample test (Roth,
Bobko, & McFarland, 2005) or an assessment center (Lance, Woehr, & Meade, 2007), in which
applicants must demonstrate specific knowledge and skills by performing a limited number of job-
related tasks in a controlled environment. With respect to work-sample tests, we do not mean to
imply that they are appropriate only for high-level jobs. They can also be extremely valid predictors
of performance in other jobs, such as craft jobs (electricians, plumbers, mechanics) and customer
service jobs (Cascio & Aguinis, 2010).
In the past, organizations have benefited from this traditional approach of “pick good people to
get good performance,” but the changing workplace is dramatically redefining personnel selection
(Pearlman & Barney, 2000). The next section describes some of the challenges to the business value
of that approach.
1. Past behavior may not always predict future behavior (the behavioral consistency assumption), particularly if the new job differs in the types of personal characteristics necessary for successful performance. Only past behavior that is relevant to future performance is likely to predict that performance effectively.
2. Selection decisions about people and jobs are not independent events in an organization.
Indeed, the broader business value of selection is often linked to other human resources
(HR) processes, such as training, promotion, special assignments, staff reductions, career
development, and succession planning.
3. Hiring managers do not always hire the best scorers. Validated selection techniques are
rarely the only source of information for selection decision-making.
4. Jobs are changing faster than we can do validation studies.
5. Assessing the business value of selection is complex, because different constituents—
managers, applicants, HR professionals, and those who implement selection systems—value
different outcomes.
6. The social context and social psychological processes of selection decisions are often
ignored in the traditional approach. Interpersonal processes in group decision-making are
extremely important to the implementation of selection systems. For example, a particular
decision-maker’s position power, influence, and interpersonal attraction to another person
may be important to understand in selecting employees.
7. Utility calculations that estimate economic returns on investments for valid selection tech-
niques are not widely accepted or understood by business managers. Managers often do not believe the estimated returns, both because of their sheer magnitude and because of the use of complex formulas with too many estimates and assumptions. To many, the
dollar returns associated with improved performance are not “tangible,” and certainly less
so than the dollars in one’s departmental budget. All of this suggests that few organiza-
tions, if any, view the costs related to selection as investments; rather, they consider them
as expenses. Beyond that, validity coefficients of equal size, say, 0.35, are not necessarily
equally valuable to decision-makers if they reference different criteria. A sales manager,
for example, may or may not view a validity of .35 for predicting organizational citizenship
behaviors as equal in value to a validity of .35 for predicting the dollar volume of sales.1
1 We would like to thank Jerard F. Kehoe for suggesting these last two points.
Pearlman and Barney (2000) offered a contemporary model of work performance with two distinguishable components: (a) task performance of a specific job, and (b) contextual performance, or performance
related to organizational and social performance activities. Contextual performance includes
three levels of analysis: external, organizational, and the immediate work or job context. Table
11.1 describes their model of worker-attribute categories needed to predict success beyond job
performance per se.
The key challenge in predicting performance at any level is that our current selection methods
have demonstrated limited usefulness, despite 80 years of staffing research. Limitations of the cur-
rent approach include the following (Cascio & Aguinis, 2008): a near exclusive focus at the level
of the individual, the assumption of behavioral consistency, a focus on thin slices of behavior and
behavior that may not be representative of actual performance on a job, selection systems that pro-
duce high levels of adverse impact, overestimation of expected economic payoffs from the use of
valid selection procedures, and limited applicability of the traditional model when applied to execu-
tives and expatriates.
TABLE 11.1
Definitions and Examples of Work-Performance Model Worker-Attribute Categories

Aptitude and abilities: Capacity to perform particular classes or categories of mental and physical functions. Examples: cognitive, spatial/perceptual, psychomotor, sensory, and physical abilities.

Workplace basic skills(a): Fundamental developed abilities that are required to at least some degree in virtually all jobs. Examples: reading, writing, and arithmetic or computational skills.

Cross-functional skills: Various types of developed generic skills that are related to the performance of broad categories of work activity and that tend to occur across relatively wide ranges of jobs. Examples: oral communication, problem analysis, interpersonal skills, negotiating, information gathering, organizing, planning, and teamwork skills.

Occupation-specific skills: Developed ability to perform work activities that occur across relatively narrow ranges of jobs or are defined in relatively job- or activity-specific terms. Examples: ability to read blueprints, to repair electrical appliances, to operate a milling machine, to operate a forklift, to do word processing.

Occupation-specific knowledge: Understanding or familiarity with the facts, principles, processes, methods, or techniques related to a particular subject area, discipline, trade, science, or art; includes language proficiency. Examples: knowledge of financial planning and analysis, fire-protection systems, computer graphics, data communication networks, patent law, Spanish, COBOL, spreadsheet software.

Personal qualities (also known as personality traits, temperaments, or dispositions): An individual's characteristic, habitual, or typical manner of thinking, feeling, behaving, or responding with respect to self and others, situations, or events. Examples: adaptability, empathy, conscientiousness, self-esteem, autonomy, sociability, service orientation, emotional stability, integrity, honesty.

Values: Goals, beliefs, or ideals an individual holds as important and that function as the standards or criteria by which he or she evaluates things. Examples: empowerment, cooperation, achievement, initiative, work ethic.

Interests: An individual's characteristic work-related preferences or likes and dislikes regarding specific (or classes of) work activities. Examples: realistic, investigative, artistic, social, enterprising, and conventional.

(a) Workplace basic skills are differentiated from aptitudes and abilities because of their significant knowledge and learning components.

Source: From Pearlman, K., & Barney, M., Selection for a changing workplace, in J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies, Jossey-Bass, San Francisco, CA, 2000. With permission.
The Business Value of Employee Selection 239
Although many existing selection methods perform well, we believe that if selection strategies
are to be more useful in response to rapidly changing organizations, then we need to broaden our
perspectives of the relevant criterion space and criterion constructs, from a focus on predicting task
performance per se, to the kinds of characteristics identified in Table 11.1. Indeed, the construct of
adaptability (Pulakos, Arad, Donovan, & Plamondon, 2000) may well be a key predictor of success
in a rapidly changing organization.
A related consideration for utility researchers is the meaning of the parameter t in the general
utility equation (see Cascio & Boudreau, 2008). Traditionally that parameter represents the average
tenure of individuals in a given job. Perhaps a more nuanced view is to define t as the length of time
that the constructs measured by the current selection system remain relevant. The faster that jobs
and organizations change, the lower the value of t.
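The role of t can be made concrete with a minimal sketch of the Brogden-Cronbach-Gleser form of the general utility equation. The function name and all input values below are our own hypothetical choices, and the sketch ignores the economic adjustments discussed later in this chapter.

```python
from math import exp, pi, sqrt

def ordinate(z):
    """Height of the standard normal density at z."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def bcg_utility(n_hired, t, validity, sd_y, z_cutoff, selection_ratio, cost):
    """Brogden-Cronbach-Gleser utility: dollar gain over random selection.
    t is the parameter discussed in the text: traditionally average tenure,
    alternatively the years the measured constructs remain job relevant.
    For top-down selection, the mean standard predictor score of those
    hired is ordinate(z_cutoff) / selection_ratio."""
    z_bar = ordinate(z_cutoff) / selection_ratio
    return n_hired * t * validity * sd_y * z_bar - cost

# Hypothetical inputs: 50 hires, 2 relevant years, validity .35,
# SDy $20,000, top 20% selected (cutoff z = 0.84), $100,000 program cost
print(round(bcg_utility(50, 2, 0.35, 20_000, 0.8416, 0.20, 100_000)))  # roughly $880,000
```

Because t multiplies the entire benefit term, halving the years the measured constructs remain relevant halves the gross payoff, which is why faster-changing jobs and organizations shrink the return from a given selection procedure.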
TABLE 11.2
HR Processes That Depend Upon Selection

Hiring: Multiple-stage process using various techniques and types of information for mutual selection decision by organization and candidate.
Training: Selection for participation in particular training programs.
Superior performance evaluation: Selection for limited-frequency ratings or for distribution-controlled rankings.
Promotion: Selection for limited promotional opportunities or for job families or job levels with limited population sizes.
Special assignments: Selection for assignments to task forces, committees, special projects.
Career development: Selection for development processes, programs, or mentors.
Succession planning: Selection for inclusion in replacement-planning or succession-planning databases or management-planning sessions.

Source: From Higgs, A. C., Papper, E. M., & Carr, L. S., Integrating selection with other organizational processes and systems, in J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies, Jossey-Bass, San Francisco, CA, 2000. With permission.
As stated earlier, selection is a key component in the overall life cycle of an individual’s employ-
ment with an organization. That life cycle includes changing jobs and changing people.
investments that will generate a stream of benefits over time. We hasten to add that the term “business
value” does not imply that all outcomes must be expressed exclusively in monetary or quantitative
terms. Indeed, as we shall demonstrate, much of the business value of selection may be expressed in
qualitative terms (e.g., improvements in customer service, team dynamics, or innovations).
Utility Analyses
The utility of a selection device is the degree to which its use improves the quality of the individuals
selected beyond what would have occurred had that device not been used (Taylor & Russell, 1939).
Because the technical details of utility analysis have been addressed elsewhere (Boudreau, 1991;
Boudreau & Ramstad, 2003; Cabrera & Raju, 2001; Cascio, 2000, 2007; Cascio & Boudreau, 2008),
we focus here only on the logic of utility analysis as illustrated in Figure 11.1.
At its core, utility analysis considers three important parameters: quantity, quality, and cost.
The top row of Figure 11.1 refers to the characteristics of candidates for employment as they flow
through the various stages of the staffing process. At each stage, the candidate pool can be thought
of in terms of the quantity of candidates, the average and dispersion of the quality of the candidates,
and the cost of employing the candidates.

FIGURE 11.1 The logic of utility analysis. (From Cascio, W. F., & Boudreau, J. W., Investing in people: Financial impact of human resource initiatives, Pearson, Upper Saddle River, NJ, 2008. With permission.)

For example, the "applicant pool" might have a quantity
of 100 candidates, with an average quality value of $50,000 per year and a variability in quality
value that ranges from a low of $25,000 to a high of $100,000. This group of candidates might have
an anticipated cost (salary, benefits, training, and so on) of 70% of their value. After screening and
selection, the “offer candidates” might have a quantity of 50 who receive offers, with an average
quality value of $65,000 per year, ranging from a low of $35,000 to a high of $100,000. Candidates
who receive offers might require employment costs of 80% of their value, because they are highly
qualified and sought-after individuals. Eventually, the organization ends up with a group of “new
hires” (or promoted candidates in the case of internal staffing) that can also be characterized by
quantity, quality, and cost.
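One way to see how the three parameters interact is to tabulate each stage of the pipeline. The net-value calculation below is our own illustration of this logic, not a formula from the source; the numbers repeat the hypothetical example above.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    stage: str
    quantity: int      # number of candidates at this stage
    avg_value: float   # average quality value, dollars per year
    cost_pct: float    # employment cost as a fraction of value

    def net_value(self) -> float:
        """Annual value net of employment costs, summed over the pool."""
        return self.quantity * self.avg_value * (1 - self.cost_pct)

applicants = Pool("applicant pool", 100, 50_000, 0.70)
offers = Pool("offer candidates", 50, 65_000, 0.80)

for pool in (applicants, offers):
    print(f"{pool.stage}: net value ${pool.net_value():,.0f}")
```

Note that screening raised average quality from $50,000 to $65,000 but also raised the cost percentage from 70% to 80%, which is why quantity, quality, and cost must always be read together rather than in isolation.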
Similarly, the bottom row of Figure 11.1 reflects the staffing processes that create the sequen-
tial filtering of candidates. Each of these processes can be thought of in terms of the quantity of
programs and practices used, the quality of the programs and practices, as reflected in their abil-
ity to improve the value of the pool of individuals that survives, and the cost of the programs and
practices in each process. For example, as we have seen, the quality of selection procedures is often
expressed in terms of their validity, or accuracy in forecasting future job performance. Validity may
be increased by including a greater quantity of assessments (e.g., a battery of selection procedures),
each of which focuses on an aspect of KSAOs that has been demonstrated to be important to suc-
cessful performance on a job. Higher levels of validity imply higher levels of future job performance
among those selected or promoted, thereby improving the overall payoff to the organization. As a result, candidates who are predicted to perform poorly never get hired or promoted in the first place. Decision-makers naturally focus on the cost of selection procedures because those costs are so vividly depicted by standard accounting systems; but the cost of errors in selecting, hiring, or
promoting the wrong person is often much more important.
Utility analysis has achieved limited success in translating the value of valid selection pro-
cedures into terms that managers and organizational decision-makers understand (Jayne, 2000).
Unfortunately, in many cases such analyses lack credibility because of complex formulas and dol-
lar-based return-on-investment analyses that seem “too good to be true” (Ashe, 1990; Cascio, 1993;
Schmitt & Borman, 1993). Indeed, one may logically ask, if the return on investment associated
with such programs is so high, then why don’t all companies invest substantial amounts of resources
in them? The answer is that the actual returns are likely to be considerably lower than the estimated
returns, because researchers have tended to make simplifying assumptions with regard to variables
like economic factors that affect payoffs and to omit others that add to an already complex mix of
factors.
Some of those other factors are economic factors (the effects of taxes, discounting, and variable
costs), employee flows into and out of the workforce, probationary periods (the difference in perfor-
mance between the pool of employees hired initially and those who survive a probationary period),
the use of multiple selection devices, and rejected job offers. One study used computer simula-
tion of 10,000 scenarios, each of which comprised various values of these five factors (Sturman,
2000). Utility estimates were then computed using the five adjustments applied independently. The
median effect of the total set of adjustments was -91% (i.e., the adjusted values were, on average,
91% lower than the unadjusted values), with a minimum effect of -71% and negative estimates 16%
of the time. Although most utility estimates for the simulated scenarios remained positive, the five
modifications had sizable and noteworthy practical effects. These results suggest that although valid
selection procedures may often lead to positive payoffs for the organization, actual payoffs depend
significantly on organizational and situational factors that affect the quantity, quality, and cost of
the selection effort.
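A rough sense of how such adjustments shrink an unadjusted estimate can be conveyed with the classic economic corrections for variable costs, taxes, and discounting. This is a simplified sketch with hypothetical inputs; Sturman's (2000) simulation also modeled employee flows, probationary periods, multiple devices, and rejected offers, none of which is shown here.

```python
def adjusted_utility(annual_benefit, years, variable_cost_rate, tax_rate,
                     discount_rate, cost):
    """Apply variable-cost, tax, and discounting adjustments to a stream
    of annual selection benefits (a simplified sketch, not Sturman's model).
    variable_cost_rate is typically negative: service costs rise with output."""
    pv = sum(
        annual_benefit * (1 + variable_cost_rate) * (1 - tax_rate)
        / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )
    return pv - cost * (1 - tax_rate)

# Hypothetical case: $490,000 gross annual benefit over 2 years, $100,000 cost
unadjusted = 490_000 * 2 - 100_000   # naive estimate: $880,000
adjusted = adjusted_utility(490_000, 2, -0.05, 0.30, 0.10, 100_000)
print(round(adjusted))               # well below the naive figure
```

Even these three mild corrections cut the estimate substantially, which makes Sturman's median reduction of 91% across all five adjustments easier to believe.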
Expectancy Charts and Performance Differences Between High and Low Scorers
Expectancy charts allow managers to see graphically, for example, the likelihood that each
quintile of scorers on an assessment procedure will perform successfully on a job. More formally,
organizational or institutional expectancy charts depict the likelihood of successful criterion
performance to be expected from any given level of predictor scores. Individual expectancy charts
depict the likelihood of successful criterion performance to be expected by an individual score
at any given level on an assessment procedure. Figure 11.2 shows these two types of expectancy
charts. Figure 11.2a is an organizational expectancy chart, whereas Figure 11.2b is an individual
expectancy chart.
FIGURE 11.2 Examples of expectancy charts. (a) Organizational expectancy chart: selecting the top 40%, 60%, or 80% of scorers, or all applicants, yields 71%, 65%, 60%, and 53% successful employees, respectively. (b) Individual expectancy chart: for example, scores equal to or above 1852 carry an 82% likelihood of success.
The organizational expectancy chart provides an answer to the question, “Given a selection
ratio of .20, .40, .60, etc., what proportion of successful employees can be expected if the future is
like the past?” Such an approach is useful in attempting to set cutoff scores for future hiring pro-
grams. In similar fashion, the individual expectancy chart illustrates the likelihood of successful
criterion performance for an individual whose score falls within a specified range on the predictor
distribution.
Computational procedures for developing empirical expectancies are straightforward, and theo-
retical expectancy charts are also available (Lawshe & Balma, 1966). In fact, when the correlation
coefficient is used to summarize the degree of predictor-criterion relationship, expectancy charts
are a useful way of illustrating the effect of the validity coefficient on future hiring decisions. In
situations in which tests have only modest validities for predicting job performance, test-score dif-
ferences that appear large will correspond to modest scores on the expectancy distribution, reflect-
ing the modest predictability of job performance from test scores (Hartigan & Wigdor, 1989).
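To illustrate how such empirical expectancies are computed, the sketch below derives an organizational expectancy chart from historical predictor scores and a dichotomous success criterion. The function name, arguments, and selection-ratio cutoffs are our own illustrative choices, not a procedure taken from Lawshe and Balma (1966).

```python
def organizational_expectancy(scores, successful, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """For each 'select the top X% of scorers' policy, return the proportion
    of those selected who met the success criterion (1 = successful, 0 = not)."""
    ranked = sorted(zip(scores, successful), key=lambda pair: pair[0], reverse=True)
    chart = {}
    for frac in fractions:
        k = max(1, round(frac * len(ranked)))   # number selected under this policy
        chart[f"top {int(frac * 100)}%"] = sum(s for _, s in ranked[:k]) / k
    return chart
```

Plotting these proportions as horizontal bars reproduces a chart like Figure 11.2a; with a valid predictor, the proportion of successful hires rises as the selection ratio shrinks.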
Another way to demonstrate the business value of selection—in this case, the value of a testing
program—is to compare performance differences between individuals who score at the top and
bottom of the test-score distribution on job-related criteria. For example, managers can learn that
a bank teller who scored in the top 20% on a test will serve 1,791 customers and refer 22 new customers
in 1 month, compared to a teller in the bottom 20% of test scorers, who will serve only 945 customers
and refer only 10 new customers (People Focus, 1998). Table 11.3 shows performance differences
in job-related criteria across three companies in a Food Marketing Institute study of supermarket
cashiers (Food Marketing Institute, 1985).
TABLE 11.3
Job-Performance Differences Among Supermarket Cashiers
Who Score at the Top and Bottom 50% of a Test Battery
Company  Criterion  Score Group  Average Value
A Amount over or under Top 50% 1.53
Bottom 50% 2.18
Items per minute Top 50% 19.15
Bottom 50% 17.43
Rings per minute Top 50% 18.18
Bottom 50% 17.33
Number of voids Top 50% 7.17
Bottom 50% 9.08
B Amount over or under Top 50% 1.55
Bottom 50% 2.37
Items per minute Top 50% 21.47
Bottom 50% 17.67
Rings per minute Top 50% 18.29
Bottom 50% 16.01
Number of voids Top 50% 6.84
Bottom 50% 10.99
C Amount over or under Top 50% 1.47
Bottom 50% 1.94
Items per minute Top 50% 21.60
Bottom 50% 18.63
Rings per minute Top 50% 15.27
Bottom 50% 15.92
Number of voids Top 50% 5.83
Bottom 50% 5.73
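A comparison of the kind shown in Table 11.3 amounts to a median split on test scores followed by group means on each criterion. The sketch below assumes each criterion list is ordered the same way as the test scores; the function name and data layout are hypothetical.

```python
import statistics

def top_bottom_comparison(test_scores, criteria):
    """Mean criterion values for the top versus bottom 50% of test scorers.
    `criteria` maps a criterion name to a list of values, one per employee,
    in the same order as `test_scores`."""
    cut = statistics.median(test_scores)
    top = [i for i, s in enumerate(test_scores) if s >= cut]
    bottom = [i for i, s in enumerate(test_scores) if s < cut]
    return {
        name: {
            "Top 50%": statistics.mean(values[i] for i in top),
            "Bottom 50%": statistics.mean(values[i] for i in bottom),
        }
        for name, values in criteria.items()
    }
```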
In our opinion, expectancy charts, together with illustrations of performance differences between
high- and low-scoring individuals on an assessment procedure, provide credible, tangible evidence
of the business value of selection.
1. Intelligence predicts job performance better than conscientiousness (Schmidt & Hunter,
1998).
2. Screening for intelligence results in higher job performance than screening for values or
values fit (Schmidt & Hunter, 1998; Meglino & Ravlin, 1998).
3. Being very intelligent is not a disadvantage for performing well on a low-skilled job
(Hunter, 1986; Schmidt & Hunter, 1998).
4. Personality inventories vary considerably in terms of how well they predict applicants’ job
performance (Barrick & Mount, 1991; Gardner & Martinko, 1996).
5. Integrity tests successfully predict whether someone will steal, be absent, or otherwise
take advantage of employers, although individuals can “fake good” on them (Ones,
Viswesvaran, & Schmidt, 1993; Ones, Viswesvaran, & Reiss, 1996).
6. Integrity tests do not have adverse impact on racial minorities (Ones & Viswesvaran,
1998).
Needless to say, these findings are disturbing, for they indicate that HR vice presidents, directors,
and managers live in a very different world from that of I-O psychologists. To bridge that gap, the
Society for Human Resource Management (SHRM) Foundation (half of whose Board of Directors
are I-O psychologists—academics and practitioners) has pursued an aggressive strategy over the
past 5 years. It commissions reviews of the professional literature in key HR areas (e.g., performance
management, employee selection, retention, reward strategies, employee engagement, and commit-
ment) by knowledgeable professionals and has them “translate” the results of published research
into practical guidelines. An academic and a practitioner review each draft of the report to ensure that
it is well organized and jargon-free, that the findings are presented clearly, and that the implications of the
findings for professional practice are highlighted. The name of this initiative is “Effective Practice
Guidelines,” and each report may be downloaded in PDF form from http://www.shrm.org/about/
foundation/Pages/default.aspx.
We know that senior-level managers are extremely aware of the importance of hiring the right people
for their organizations. For example, the 2006 CEO Briefing by the Economist Intelligence Unit (EIU)
highlights significant management challenges that face the world’s corporate leaders on the basis of
responses from 555 senior executives from 68 countries. Two of the most critical ones are as follows:
In fact, by a margin of 55% to 45%, respondents in the 2006 EIU study said that they expect to spend
more on people than on technology over the next 3 years, an important shift that underscores the
importance of staffing issues.
More recently, the SHRM Foundation engaged the Hay Group to conduct research to identify the
most pressing human capital challenges faced by chief human resources officers and other C-suite
executives (Hay Group, 2007). There were two key phases of the research: in-depth interviews with
36 C-suite executives and an online survey with responses from 526 C-suite executives. Sixty-nine
percent of the respondents said that recruiting and selecting talented employees was
one of the top three human capital challenges that could affect the ability of their organizations to
achieve strategic objectives.
Together, these results suggest that, whether an organization is purely domestic or international
in its scope of operations, senior managers recognize the critical importance of employee selection
to the achievement of their strategic objectives. There is therefore a pressing need and a ripe oppor-
tunity for I-O psychologists to have a major impact on organizations by demonstrating the business
value of employee selection. Executives need this information to make informed decisions about
selection tools and processes, and they have never been more receptive to it than now, in light of the
key human capital challenges they are facing.
TABLE 11.4
Historical Best-Practices Study: Most Common Procedures
Used to Select Entry-Level Managers
Selection Tools Percentage of Programs
Site interview 99
Resume/experience/education 71
Drug screen 59
Campus interview 40
Tests 40
Background check 36
Job performance 32
Other 31
References 28
Academics/grade point average 25
Work simulation 13
Job trial/probation 7
Assessment center 7
Source: Data from Cascio, W., & Fogli, L., Talent acquisition: new realities of
attraction, selection, and retention. Preconference workshop presented
at the annual conference of the Society for Industrial Organizational
Psychology, Chicago, IL, April 2004.
Interviews, résumés, and drug screens were the most frequently used methods. Given the
widespread use of the Internet and the kinds of changes in organizations we described earlier, Cascio
and Fogli (2004) described four current organizational themes in selection and assessment: (a) speed
(results of selection procedures should be fast to process; get results to hiring manager), (b) time
(cannot be time-consuming for applicant or hiring manager), (c) cost (vendors of selection instruments
offer volume discounts and licensing/new pricing strategies), and (d) quality (must be seen as essential
to getting better employees). To be sure, there are trade-offs among these four themes, and speed, time,
cost, and quality may be incompatible goals under certain circumstances. An online measure of ability
to work in cross-functional teams (low cost, fast) may not produce the same level of quality (validity)
as a more expensive, time-consuming work-sample test in the context of an assessment center. There
is no singularly “correct” approach, but the optimal approach to use in a given situation requires a
deep understanding of the needs of decision-makers, the interrelationships of selection with other HR
processes, and the relative importance of each theme to a particular organization and its managers.
Indeed, the advent of online applications and prescreening procedures has led to the funnel and
multiple-hurdle hiring processes described in Figure 11.3. In terms of the four themes described
above, note how the faster, less costly, less time-consuming procedures (online application and pre-
screening) precede the slower, more costly and time-consuming ones (assessment, interview). From
start to finish, the overall process is designed to ensure the delivery of high-quality employees.
FIGURE 11.3 A funnel-shaped, multiple-hurdle hiring process: an online application is followed by prescreening (background check; salary requirements and hours; minimum-qualification questions), then assessment (aptitudes, work sample, personality, honesty), and finally an interview.
Assuming an organization is willing to trade off speed and time for quality (validity), Cascio
and Aguinis (2008) recommended, where feasible, the use of internships, contingent work arrangements,
and virtual-reality technology to predict future “in situ” performance. In situ
performance reflects the broad range of effects—situational, contextual, strategic, and environmen-
tal—that may affect individual, team, or organizational performance. Such specification provides
a richer, fuller, context-embedded description of the criterion space that we wish to predict. As an
example of this approach, consider that United Parcel Service routinely hires many of its full-time
employees from the ranks of contingent workers—almost half of its workforce in 2006, with 55%
of them college students (Bolgar, 2007).
We agree with Higgs et al. (2000) that it is important to adopt a broader perspective of successful
selection processes. Those processes should be assessed in terms of criteria that include empirical
validity, face validity, selection ratios, marketability of the selection effort, demonstration of the
business value of selection using quantitative as well as qualitative metrics, the effectiveness of the
management of selection processes, candidate reactions to those processes, overall expense, and the
timeliness with which selection decisions can be made.
College graduate candidates are screened on college campuses with interviews targeted to man-
agement competencies and organizational success factors. Candidates who pass the on-campus
screen are invited to a 1-day assessment at the corporate headquarters. This 1-day assessment
includes (a) learning financial analysis skills, (b) participating in a group-based leadership exercise
to improve retail-store effectiveness, and (c) participating in two panel-group interviews. All assessors
are trained to evaluate candidate performance using behavioral benchmarks and standards. Standards
and guidelines for both independent and consensus ratings are required. An assessor conference is held after the 1-day
assessment. Selection-technique data, including ratings and behavioral observations, are reported to
the assessors, who include managers, executives, incumbents, and HR staff. Guidelines are provided
to establish bands of scores and to make successful decisions.
Observations of why this selection system has been successful in predicting job success and in
becoming integrated into the culture of the business include:
1. The original development and updates to the selection system have involved multiple orga-
nizational participants, including executives, managers, job incumbents, and representa-
tives from recruitment, staffing, and training.
2. Hiring decisions are made in a 1-day session with all key decision-makers involved.
3. Selection techniques are developed in the business context and are updated at least every
3 years. Interviews contain behavior-description questions and situational questions. The
leaderless-group-competition exercise requires candidates to visit company and competi-
tors’ stores and to read consumer information regarding trends and the latest company
strategies for business development.
4. Assessors self-monitor and also monitor each other to evaluate candidates using behaviors/
benchmarks related to competencies and success factors.
In our opinion, the key to successful implementation of a selection system is to involve decision-
makers and stakeholders. Development of selection techniques is therefore a necessary, but not a
sufficient condition, for their successful acceptance and use by decision-makers. Implementation is
an ongoing challenge.
REFERENCES
Anderson, N., Lievens, F., van Dam, K., & Ryan, A. M. (2004). Future perspectives on employee selection: Key
directions for future research and practice. Applied Psychology: An International Review, 53, 487–501.
Ashe, R. L., Jr. (1990, April). The legality and defensibility of assessment centers and in-basket exercises. Paper
presented at the meeting of the Society for Industrial and Organizational Psychology, Miami Beach,
FL.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Behling, O. (1998). Employee selection: Will intelligence and conscientiousness do the job? Academy of
Management Executive, 12, 77–86.
Berry, C. M., Sackett, P. R., & Wiemann, S. (2007). A review of recent developments in integrity test research.
Personnel Psychology, 60, 271–301.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Bolgar, C. (2007, November 15). High performance: How talent creates a competitive edge and powers organi-
zations to success. The Wall Street Journal, p. A6.
Boudreau, J. W. (1991). Utility analysis for decisions in human resource management. In M. D. Dunnette &
L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 2, pp. 621–
745). London, England: SAGE Publications.
Boudreau, J. W., & Ramstad, P. M. (2003). Strategic industrial and organizational psychology and the role of
utility analysis models. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Volume Eds.), Handbook of
psychology, Vol. 12, Industrial and organizational psychology (pp. 193–221). Hoboken, NJ: Wiley.
Boudreau, J. W., Sturman, M. C., & Judge, T. A. (1994). Utility analysis: What are the black boxes, and do they
affect decisions? In N. Anderson & P. Herriot (Eds.), Assessment and selection in organizations: Methods
and practice for recruitment and appraisal (pp. 77–96). New York, NY: John Wiley.
Cabrera, E. F., & Raju, N. S. (2001). Utility analysis: Current trends and future directions. International Journal
of Selection and Assessment, 9, 92–102.
Cascio, W. F. (1993). Assessing the utility of selection decisions: Theoretical and practical considerations. In
N. Schmitt & W.C. Borman (Eds.), Personnel selection in organizations (pp. 310–340). San Francisco,
CA: Jossey-Bass.
Cascio, W. F. (2000). Costing human resources: The financial impact of behavior in organizations (4th ed.).
Cincinnati, OH: South-Western.
Cascio, W. F. (2007). Evidence-based management and the marketplace for ideas. Academy of Management
Journal, 50(5), 1009–1012.
Cascio, W. F., & Aguinis, H. (2008). Staffing 21st-century organizations. Academy of Management Annals, 2,
133–165.
Cascio, W. F., & Aguinis, H. (2010). Applied psychology in human resource management (7th ed.). Upper
Saddle River, NJ: Pearson Prentice Hall.
Cascio, W. F., & Boudreau, J. W. (2008). Investing in people: Financial impact of human resource initiatives.
Upper Saddle River, NJ: Pearson.
Cascio, W. F., & Fogli, L. (2004, April). Talent acquisition: New realities of attraction, selection, and retention.
Pre-conference workshop presented at the annual conference of the Society for Industrial Organizational
Psychology, Chicago, IL.
Chapman, D. S., & Zweig, D. I. (2005). Developing a nomological network for interview structure: Antecedents
and consequences of the structured selection interview. Personnel Psychology, 58, 673–702.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.). New York, NY: Academic
Press.
Connerley, M. L., Arvey, R. D., & Bernardy, C. J. (2001). Criminal background checks for prospective and
current employees: Current practices among municipal agencies. Public Personnel Management, 30,
173–183.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281–302.
Economist Intelligence Unit (2006, January). CEO briefing: Corporate priorities for 2006 and beyond.
Retrieved February 24, 2006, from http://www.eiu.com
Fisher, A. (2005, November 28). For happier customers, call HR. Fortune, p. 272.
Fogli, L. (2006, July). Managing change in turbulent times. Paper presented at the Sixth International
Conference on Knowledge, Culture and Change in Organizations. Monash University Centre, Prato,
Italy.
Food Marketing Institute. (1985). Cashier test battery administrator’s manual. Washington, DC: Author.
Frase, M. J. (2007, December). Smart selections: Intelligence tests are gaining favor in the recruiting world as
reliable predictors of future executive performance. HR Magazine, pp. 63–67.
Gardner, W. L., & Martinko, M. J. (1996). Using the Myers-Briggs type indicator to study managers: A litera-
ture review and research agenda. Journal of Management, 22, 45–83.
Haar, J. M., & Spell, C. S. (2007). Factors affecting employer adoption of drug testing in New Zealand. Asia
Pacific Journal of Human Resources, 45, 200–217.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minor-
ity issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Hay Group. (2007, October 23). SHRM Foundation research on human capital challenges. Washington, DC:
Hay Group.
Higgs, A. C., Papper, E. M., & Carr, L. S. (2000). Integrating selection with other organizational processes and
systems. In J. Kehoe (Ed.), Managing selection in changing organizations: Human resource strategies
(p. 94). San Francisco, CA: Jossey-Bass.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail
(Ed.), Alternative validation strategies (pp. 181–229). San Francisco, CA: Jossey-Bass.
Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of
Vocational Behavior, 29, 340–362.
Jayne, M. E. A., & Rauschenberger, J. M. (2000). Demonstrating the value of selection in organizations. In
J. Kehoe (Ed), Managing selection in changing organizations: Human resource strategies (pp. 123–157).
San Francisco, CA: Jossey-Bass.
Kraut, A. I., & Korman, A. K. (Eds.). (1999). Evolving practices in human resource management: Responses
to a changing world of work. San Francisco, CA: Jossey-Bass.
Lance, C. E., Woehr, D. J., & Meade, A. W. (2007). Case study: A Monte Carlo investigation of assessment
center construct validity models. Organizational Research Methods, 10, 430–448.
Latham, G. P., & Whyte, G. (1994). The futility of utility analysis. Personnel Psychology, 47, 31–46.
Lawshe, C. H., & Balma, M. J. (1966). Principles of personnel testing (2nd ed.). New York, NY: McGraw-
Hill.
Le, H., Oh, I., Shaffer, J., & Schmidt, F. L. (2007). Implications of methodological advances for the prac-
tice of personnel selection: How practitioners benefit from meta-analysis. Academy of Management
Perspectives, 21, 6–15.
Meglino, B. G., & Ravlin, E. C. (1998). Individual values in organizations: Concepts, controversies, and
research. Journal of Management, 24, 351–389.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007).
Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60,
683–729.
Motowidlo, S. J. (2003). Job performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Comprehensive
handbook of psychology. Industrial and organizational psychology (Vol. 12, pp. 39–53). New York, NY:
Wiley.
Murphy, K. R., & Myors, B. (1998). Statistical power analysis. Mahwah, NJ: Lawrence Erlbaum.
Ones, D. S., & Viswesvaran, C. (1998). Gender, age, and race differences on overt integrity tests: Results across
four large-scale job applicant data sets. Journal of Applied Psychology, 83, 35–42.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for per-
sonnel selection: The red herring. Journal of Applied Psychology, 81, 660–679.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities:
Findings and implications for personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 679–703.
Pearlman, K., & Barney, M. (2000). Selection for a changing workplace. In J. Kehoe (Ed.), Managing selection
in changing organizations: Human resource strategies (pp. 3–72). San Francisco, CA: Jossey-Bass.
People Focus (1998). Bank of America new hire assessment predictive validation report. Pleasant Hill, CA:
Author.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace:
Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Ramsay, H., & Scholarios, D. (1999). Selective decisions: Challenging orthodox analyses of the hiring process.
International Journal of Management Reviews, 1, 63–89.
Roth, P. L., Bobko, P., & McFarland, L. A. (2005). A meta-analysis of work sample test validity: Updating and
integrating some classic literature. Personnel Psychology, 58, 1009–1037.
Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). HR professionals’ beliefs about effective human resource
practices: Correspondence between research and practice. Human Resource Management, 41, 149–174.
Rynes, S. L., Giluk, T. L., & Brown, K. G. (2007). The very separate worlds of academic and practitioner
periodicals in human resource management: Implications for evidence-based management. Academy of
Management Journal, 50, 987–1008.
Schippmann, J. S., Ash, R. A., Battista, M., Carr, L., Eyde, L. D., Hesketh, B., et al. (2000). The practice of
competency modeling. Personnel Psychology, 53, 703–740.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychol-
ogy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution, and impact of validity generalization
and meta-analysis methods, 1975–2002. In K. R. Murphy (Ed.), Validity generalization: A critical review
(pp. 31–66). Hillsdale, NJ: Lawrence Erlbaum.
Schmitt, N., & Borman, W. C. (Eds.). (1993). Personnel selection in organizations. San Francisco, CA: Jossey-
Bass.
Schmitt, N., Oswald, F. L., Kim, B. H., Imus, A., Merritt, S., Friede, A., & Shivpuri, S. (2007). The use of
background and ability profiles to predict college student outcomes. Journal of Applied Psychology, 92,
165–179.
Skarlicki, D. P., Latham, G. P., & Whyte, G. (1996). Utility analysis: Its evolution and tenuous role in human
resource management decision making. Canadian Journal of Administrative Sciences, 13, 13–21.
Society for Human Resource Management Foundation. (2007, December 13). Top U.S. executives see rapidly
shrinking talent pool: HR survey finds recruiting and retaining tomorrow’s leaders and skilled employ-
ees is number one priority. Retrieved December 15, 2007, from http://www.shrm.org/about/foundation/
Pages/default.aspx
Sturman, M. C. (2000). Implications of utility analysis adjustments for estimates of human resource interven-
tion value. Journal of Management, 26, 281–299.
Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effectiveness of
tests in selection. Journal of Applied Psychology, 23, 565–578.
Welch, J. (2005). Winning. New York, NY: HarperBusiness.
Whyte, G., & Latham, G. (1997). The futility of utility analysis revisited: When even an expert fails. Personnel
Psychology, 50, 601–610.
Part 3
Categories of Individual Difference
Constructs for Employee Selection
David Chan and Leaetta Hough, Section Editors
12 Cognitive Abilities
Intelligence affects individuals' lives in countless ways. It influences the work lives of employees
perhaps more than any other individual-differences trait. It determines whether an
employee will be able to perform assigned tasks. It is the strongest predictor of learning and acquisi-
tion of job knowledge as well as overall job performance. It is remarkably relevant regardless of the
occupation one holds. It even predicts extrinsic career success (i.e., earnings and promotions). As
such, it is an exceedingly precious trait to include in employee selection systems.
In this chapter, we provide an overview of cognitive ability’s key role in staffing organizations
and provide evidence-based practice recommendations. We first present a brief synopsis of the
history, current usage, and acceptance of cognitive ability tests in employee selection. Second, we
highlight the theoretical underpinnings of cognitive ability as a construct. Third, we discuss devel-
opments in its measurement. Fourth, we present an overview of the criterion-related validity of cog-
nitive ability tests in predicting valued work behaviors and outcomes. Fifth, we discuss the issue of
group differences in cognitive ability test scores both within the United States and internationally.
We conclude by discussing future research and practice challenges.
1 For the 100th anniversary of that article’s publication, the Journal of Personality and Social Psychology published
a special section attesting to the impact of cognitive ability on a multitude of life domains (Deary, Whiteman, Starr,
Whalley, & Fox, 2004; Gottfredson, 2004a; Kuncel, Hezlett, & Ones, 2004; Lubinski, 2004; Plomin & Spinath, 2004;
Schmidt & Hunter, 2004).
displayed a median of 68; see Lent, Aurbach, & Levin, 1971), differences in level of restriction of
range in samples (typically employees in concurrent studies who had already been selected into an
organization), and differences across studies in the unreliability of criterion measurement (typically
supervisory ratings of job performance). These statistical and measurement artifacts were respon-
sible for the differences in results observed across most previous validation studies, discrediting the
theory of situational specificity. The invention of meta-analysis (known as “validity generalization”
in the employee selection literature) paved the way to systematic investigations of predictor validity,
also reaching beyond the domain of cognitive abilities.2
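The artifact corrections mentioned above can be made concrete. The sketch below applies the textbook formulas for direct range restriction (Thorndike's Case II) and for attenuation due to criterion unreliability. It is a simplified illustration: full validity-generalization analyses work with artifact distributions across studies and are sensitive to the order of corrections and to the reliability estimates used, and the example values in the usage note are hypothetical.

```python
import math

def correct_validity(r_obs, u=1.0, ryy=1.0):
    """Correct an observed validity coefficient for direct range restriction
    and criterion unreliability.
    u   = SD of predictor among applicants / SD among selected incumbents
    ryy = reliability of the criterion measure (e.g., supervisory ratings)"""
    # Undo direct range restriction on the predictor (Thorndike Case II).
    r_rr = (r_obs * u) / math.sqrt(1 + r_obs ** 2 * (u ** 2 - 1))
    # Undo attenuation caused by an unreliable criterion measure.
    return r_rr / math.sqrt(ryy)
```

For example, an observed validity of .25 combined with a range-restriction ratio u = 1.5 and a criterion reliability of .52 corrects to roughly .50, about twice the observed value.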
Standardized, objective cognitive ability tests by now have a century-long history. Group tests
of intelligence were developed prior to World War I and used extensively during both World Wars.
European and U.S. armed forces continue to utilize cognitive ability tests for selection and/or place-
ment. Today, cognitive ability measures are used in educational admissions and civilian personnel
staffing. But how widespread is the use of cognitive ability measures in organizational settings in
general, as well as vis-à-vis other tools available for personnel decision-making? During the 1990s,
a patchwork of surveys indicated that cognitive ability tests were being used for employee selection
across the globe, but their usage varied by country (Salgado, Viswesvaran, & Ones, 2001). At the
turn of the century, the most comprehensive survey was conducted by Ryan, McFarland, Baron, and
Page (1999). Ryan and colleagues surveyed 959 organizations from 20 countries by randomly sam-
pling 300 large organizations (with more than 1,000 employees) in each country. The focus of their
study was the examination of national and cultural influences on many selection system features.
The pervasiveness of cognitive ability tests was also surveyed. On the basis of their data, Table 12.1
summarizes how extensively cognitive ability measures were used.
Across the 18 countries for which data were reported, on average, cognitive ability tests were
used between 21% and 50% of the time in employee selection. Organizations in The Netherlands,
Belgium, Portugal, Spain, South Africa, New Zealand, and the United Kingdom reported above
average use, whereas organizations in Germany, Hong Kong, and Italy reported especially low levels
of cognitive test use. Within each country, of 14 selection methods presented to study participants
(cognitive ability tests, physical ability tests, foreign language tests, work samples, personality tests,
integrity tests, interest inventories, simulation exercises, situational judgment tests, video-based
tests, projective techniques, drug tests, medical screens, and graphology), cognitive ability tests
were ranked in the top three most frequently utilized methods in 15 of 18 countries. It is of value
to note that some of the methods listed, such as situational judgment tests (SJTs) or simulations,
can be used to measure a variety of constructs, and thus data on their use are not necessarily
directly comparable to data on construct-specific measures such as standardized tests of cognitive ability
and personality. However, it appears that if objective tests are utilized at all in personnel staffing
decisions, cognitive ability measures are included with some frequency. Unfortunately, data on the
extensiveness of cognitive ability test use come from countries and cultural regions that are not
entirely representative of the world’s cultural regions. Data from eastern Europe (e.g., Ukraine,
Russia) and southern Asia (e.g., India, Pakistan) are meager; systematic, large-scale surveys from
Latin America, the Middle East, and Africa are also lacking. Countries from these world regions
offer a unique opportunity for industrial-organizational (I-O) psychologists to assess cultural
variability in both the use of and reactions to employee selection tools, including
cognitive ability tests.
Prevalence data provide an index of organizational acceptance of selection tools. Another per-
spective on this issue can be gained by examining applicant reactions. Applicant reactions to selection
tests vary by test type (Kluger & Rothstein, 1993). There are a few international comparative studies
of applicant reactions to selection tests (including Belgium, France, Germany, Greece, Portugal,
2 In fact, 30 years after its inception, meta-analysis has changed the nature of epistemological inquiry in all sciences. To
date, there have been 40,318 peer-reviewed publications that have used or discussed meta-analytic methods, garnering
754,430 citations as of January 22, 2007 (Christensen, Beatty, Selzer, & Ones, under review).
Cognitive Abilities 257
TABLE 12.1
Cognitive Ability Test Use in 18 Countries
Country Organizations (N) Mean SD Rank
Australia 66 2.39 1.57 3
Belgium 68 3.85 1.30 1
Canada 84 2.59 1.62 3
France 35 2.29 1.72 5
Germany 35 1.90 1.52 2
Greece 27 2.54 1.56 3
Hong Kong 8 1.83 0.98 4
Ireland 49 2.79 1.42 3
Italy 29 1.33 0.82 5
Netherlands 66 3.76 1.41 2
New Zealand 112 3.37 1.41 2
Portugal 31 3.27 1.75 2
Singapore 16 2.83 1.60 2
South Africa 54 3.25 1.44 3
Spain 24 3.75 1.44 2
Sweden 91 2.86 1.37 3
United Kingdom 108 3.08 1.52 3
United States 52 2.09 1.26 3
All Countries 955 2.98 1.54 3
Responses were scaled as follows: 1 = never, 2 = rarely (1–20%), 3 = occasionally (21–50%), 4 = often (51–80%), and 5 =
almost always or always (81–100%). Rank = within-country rank among selection methods, computed by the authors. For
Hong Kong, rank 4 was a tie.
Source: Adapted from Ryan, A. M., McFarland, L., Baron, H., & Page, R., Personnel Psychology, 52, 359–391, 1999.
Italy, Spain, and the United States, among others, see Bertolino & Steiner, 2007; Moscoso, 2006;
Nikolaou & Judge, 2007), and the results suggest that general mental ability tests are among the
selection procedures rated more favorably. Hausknecht, Day, and Thomas (2004) conducted a meta-
analysis of applicant reactions to employee selection procedures. In ten of the studies they located,
study participants (in laboratory and field settings) rated the favorability of cognitive ability tests
along with other methods of employee selection. Pooled ratings from 1,499 study participants
indicated that the mean favorability ratings for cognitive ability tests were lower than those for
interviews, work samples, resumes, and references, but higher than those for (in descending order)
personality tests, biodata, personal contacts, honesty tests, and graphology. However, caution is
warranted in drawing conclusions about reactions of actual job applicants to cognitive ability tests
on the basis of these results: Participants in the studies contributing to the Hausknecht et al. (2004)
meta-analysis were not necessarily applying for jobs, were not in selection settings, and did not
experience each of the tools they were rating.
On the basis of the available data and our experiences in practice, we would like to offer the
following observations. Applicant reactions to cognitive ability tests are largely determined by
their perceived fairness and their perceived predictive validity (Chan & Schmitt, 2004). Applicant
perceptions of fairness can be improved by enhancing the procedural fairness of selection systems
as a whole, but are likely to present a challenge in employee selection as long as any traditionally
disadvantaged group (broadly defined) scores systematically lower on a given predictor battery.
Furthermore, it has recently been shown that general mental ability is a common antecedent not only
of performance on cognitive ability tests, but also of perceived test fairness and test-taking motivation
(Reeve & Lam, 2007). Regarding cognitive ability test use, tackling the issue of perceived predictive
258 Handbook of Employee Selection
validity (or lack thereof) will be one of the most crucial tasks for our profession in the years to come.
Vocal opponents of high-stakes testing have promulgated myths about the ability of intelligence
tests to predict valued outcomes. These myths, although often entirely unsupported by empirical
evidence or even common logic, are difficult to dispel and, if allowed to inform organizational
decisions on selection tool use, pose a threat to the economic welfare of organizations and, ultimately, societies (Sackett, Borneman, & Connelly, 2008; Schmidt & Hunter, 1981).
STRUCTURE
Although there are numerous specific cognitive abilities, they all share a common construct core:
general mental ability, popularly called intelligence, or g (for general intelligence factor) in many
scientific writings. As Gottfredson (2002) so aptly noted, the multitude of ways to measure g attests to its generality. Although measures of intelligence may look different, using different formats (e.g., individually administered tasks, paper-and-pencil tests, computerized batteries) and item types (e.g., verbal, figural, numerical), this does not mean they assess entirely distinct constructs.
The structure of cognitive abilities has been examined extensively since Spearman’s (1904) dis-
tinction of g and s (specific abilities). A century of research has yielded hundreds of data sets in
which individuals took multiple cognitive ability measures. Carroll (1993) compiled, analyzed, and
summarized the correlation matrices resulting from over 460 of such data sets. The result was his
popular three-stratum model. Cognitive abilities are hierarchically organized. At the apex is g,
or an ability that is general. At the second stratum are group factors or broad abilities (e.g., prior
acquisition of knowledge, visual perception, and production of ideas). At the lowest level of the
hierarchy are specific factors or narrow abilities. It is quite common that individuals of similar
intelligence (i.e., at the same trait level of general mental ability) differ in their standing on specific
abilities because of differential “investment” of their cognitive capacity (guided by other personal
characteristics as well as idiosyncratic developmental and educational experiences) in these narrow
cognitive domains.
The most significant content domains that surface in most ability models are verbal/linguis-
tic, quantitative/numerical, and spatial/mechanical. The distinction between fluid and crystallized
intelligence (termed gf and gc, respectively; see Cattell, 1971) remains a popular conceptual model, but the distinction has been shown to separate g from lower-level abilities rather than to identify two ability factors at the same level of the taxonomic hierarchy. Fluid and crystallized intelligence tend to correlate around .70, and gf has been found to be indistinguishable from g (Gustafsson, 2002).
Today, researchers continue to clarify and refine the structure of intelligence in individual studies
(e.g., see Johnson & Bouchard, 2005; Johnson, te Nijenhuis, & Bouchard, 2007, 2008); however, the
overview provided by Carroll’s (1993) masterpiece remains the broadest and most comprehensive
empirical analysis of the issue to date.
When various cognitive ability tests are administered to test-takers reflecting the entire range of
intelligence from the general population, about 50% of the common variance is due to g, whereas
8–10% of the remaining common variance is attributable to verbal, quantitative, and spatial abilities
(Lubinski, 2004). g is also a major source of predictive power for behavior in many contexts, including on the job (see below).
It has been found that relationships among cognitive ability scales are weaker at higher levels of
the ability spectrum (e.g., see Detterman & Daniel, 1989; Kane, Oakland, & Brand, 2006), imply-
ing a smaller amount of common variance due to g. Theoretically, one implication could be that
there may be more room among high-ability individuals for specific abilities to yield incremental
validities over tests of general mental ability. However, investigations of incremental validity in
highly complex jobs have so far yielded mixed results. Olea and Ree (1994) reported that specific
abilities contributed little beyond g to the prediction of job performance among pilots and naviga-
tors, whereas Ree and Carretta (1996) concluded that some specific abilities had the potential to
add incremental value at least for prediction of military pilot performance. If the common variance
among individual ability tests accounted for by g is consistently found to be smaller in high-ability samples than in samples of broad talent, then it is plausible that specific abilities could add incremental value over general mental ability for such groups. However, direct tests of this hypothesis among high-ability
job applicant samples are still called for.3
When broad job categories and applicants of a wide range of talent are studied, analyses directed
at incremental validities of specific ability measures over g have yielded disappointing results:
Specific abilities do not provide substantial incremental validity over g. However, we should also
note that in some meta-analyses, specific abilities have been shown to be similarly valid for the
prediction of some criteria (see below). In addition, nonvalidity-related considerations may lead practitioners to include specific ability measures in their selection systems, such as the consequences of anticipated group differences or applicant reactions.
MEASUREMENT
The list of cognitive ability measures available to scientists, individual practitioners, and organiza-
tions runs in the hundreds and includes everything from simple, homegrown measures to tests of
wide circulation supported by many decades of empirical research evidence. A discussion of the merits of individual measures cannot be provided in this chapter. However, a brief discussion of commonly used methods, as well as current trends in cognitive ability assessment, is warranted.
3 Such investigations would have to sort out potential range restriction effects in these samples.
Traditional, standardized tests are the most widespread method for measuring all types of cogni-
tive abilities. Their popularity is not because of a lack of alternative methods, but primarily because
of their excellent reliability and ease of administration and scoring. Although validity (including predictive validity) is a property of the psychological construct assessed (i.e., the abilities measured by a test, not the test itself), the reliability of the assessment method provides a ceiling for the validities that can
be obtained in applied settings. From this point of view, standardized tests provide probably the
best solution for organizations looking to assess cognitive ability in a reliable, standardized, and
objective manner.
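This ceiling can be made concrete with a small sketch. The reliability values below are hypothetical, chosen only to illustrate the arithmetic of the classical attenuation bound:

```python
import math

def max_observed_validity(rxx, ryy):
    """Upper bound on an observed validity coefficient under classical test
    theory: r_xy cannot exceed sqrt(r_xx * r_yy), where r_xx and r_yy are
    the reliabilities of the predictor and the criterion measure."""
    return math.sqrt(rxx * ryy)

# Hypothetical values: a highly reliable standardized test (r_xx = .90)
# versus a less reliable assessment method (r_xx = .60), with criterion
# reliability held constant at .52.
print(round(max_observed_validity(0.90, 0.52), 2))  # 0.68
print(round(max_observed_validity(0.60, 0.52), 2))  # 0.56
```

Holding the criterion constant, the less reliable method caps attainable validity well below what the more reliable test permits, which is the practical argument for standardized tests made above.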
The use of standardized tests in employee selection and academic settings is not without con-
troversy. Unfortunately, criticism levied against cognitive ability tests, like that directed at other
standardized testing, often falls victim to “content-format confusion” (Ryan & Greguras, 1998; see
also Chan & Schmitt, 2004). In addition to standardized tests, many other methods can be used
to assess cognitive ability constructs, and a careful investigation of these methods and how they
are typically used can inform decisions on whether they are suitable for a given purpose and set-
ting. Interviews, assessment centers (ACs), and SJTs are all methods that assess cognitive ability
to varying degrees—sometimes by design, sometimes by accident. For example, the most recent
meta-analysis of interview-cognitive ability correlations reported a mean range-restriction-corrected correlation of .27 (N = 11,317, k = 40; Berry, Sackett, & Landers, 2007). Some intriguing
results reported include higher interview-ability test correlations when interview validity is high
and job complexity is low. Interviews with greater cognitive content can be expected to yield higher
criterion-related validities. Also, for low-complexity jobs, interviews may function as more of a
cognitive screen than for higher-complexity jobs.
Relationships between cognitive ability and overall AC ratings have also been examined. A
meta-analysis by Collins et al. (2003) reported that cognitive ability test scores correlated .43 with
overall AC ratings (N = 5,419, k = 34). However, AC dimensions have differential cognitive loads. In
a large-scale study, Dilchert and Ones (2009) reported that the highest correlations were found for
the AC dimension problem solving (r = .32, N = 4,856), providing further evidence that cognitive ability measures capture real-world problem-solving abilities (cf. Gottfredson, 1997).
Increasingly popular SJTs are also correlated with cognitive ability; however, the magnitude of
the correlation depends on the instructions given to participants. Knowledge instructions (e.g., “what
should one do,” “rate the best/worst option”) in completing SJTs produce an observed correlation
of .32 (N = 24,656, k = 69), whereas SJTs with behavioral tendency instructions (e.g., “what would
you do”) correlate .17 (N = 6,203, k = 26) with cognitive ability (McDaniel, Hartman, Whetzel, &
Grubb, 2007). Thus, if job applicants complete SJTs, especially under knowledge instructions,
assessments produce a ranking of job applicants on cognitive ability to a certain degree. This link
between cognitive ability and SJTs is also likely to be affected by other moderators in addition to
the type of instructions.
Many assessment methods increasingly rely on formats other than the traditional paper-and-pencil form, a trend that is also reflected in ability measurement. Earlier research used meta-analysis to
establish the equivalence of computerized and paper-and-pencil versions of cognitive ability tests
(Mead & Drasgow, 1993). Recent trends in web-based assessment and test content delivery build
on the fact that tests of many individual difference predictors (not only cognitive ability) have been
shown to be equivalent between paper-and-pencil and computerized versions. However, the real
challenge arises not from a change in test format, but a change in administration mode.
Web-based, unproctored cognitive ability assessment is aimed at streamlining the application
process for applicants and organizations. Critics argue that this approach requires strong confidence
in the honesty of test-takers (who, at least in selection contexts, presumably have a strong incentive
to cheat). We would like to argue that these trends provide an opportunity for our field to take some
truly large steps in the development of cognitive ability assessments. Some organizations, confronted
by the real-world challenges of having to assess hundreds of thousands of applicants every year,
are already using computerized adaptive testing and constantly updated test materials to conduct
unproctored web-based testing (Gibby, 2008). Commercial test publishers and assessment providers
have also developed several strategies to address issues of cheating and test security, ranging from
regular monitoring for item-piracy and systematic, proctored retesting of test-takers (Burke, 2008)
to remote, video-based proctoring and biometric test-taker identification (Foster, 2008). We are
certain that for large-scale assessments, such trends will soon become the everyday reality, dictated
by demands for more streamlined assessment procedures from applicants and organizations alike
(see also Chapter 8, this volume). The challenges posed by remote, unproctored cognitive ability
assessment will need to be addressed by a more intense collaboration of scientists and practitioners,
as well as by drawing on expertise from outside of the psychological domain. Organizations and
providers that invest in the appropriate know-how and technology will be at the forefront of the next
big development in cognitive ability measurement.
TABLE 12.2
Hypothesized Moderators of Cognitive Ability Test Validities
Yes: Confirmed Moderators
Job complexity
Criterion predicted (training performance, job performance, leadership, turnover, etc.)
Cognitive ability construct assessed (GMA, verbal ability, memory, etc.)
No: Rejected Moderators
Situational variables (organizational setting, military/civilian setting)
Race (African Americans, Hispanics)
Sex
Validation design (concurrent/predictive)
Length of time on the job (up to 5 years)
Method of criterion measurement (e.g., ratings, production quantity, work samples)
?: Moderating Effect Unknown
Time of study (historical age)
Age
Race (Asian Americans, Native Americans)
National setting and culture (except for some European countries)
using meta-analytic approaches. Studies of Asian and Native Americans as well as older adults are
notably absent from the I-O psychology literature (see below for more details).
Although we know a great deal about the validity of cognitive ability tests for predicting training,
task, and overall job performance criteria, knowledge of how cognitive ability relates to other aspects
of work behavior (e.g., organizational citizenship) is limited. Nevertheless, intriguing findings are
being reported. Alonso, Viswesvaran, and Sanchez (2008) found that cognitive ability correlated
more highly with contextual performance than personality factors. Also, the first direct predictive
validity study relating a cognitive ability measure to counterproductive behaviors was published
only recently (Dilchert, Ones, Davis, & Rostow, 2007), indicating that more intelligent individuals are less likely to engage in organizational and interpersonal deviance on the job. More research into correlates of
cognitive ability outside of the traditional job performance domain would be welcome.
So far, there have been no investigations of cognitive ability test validity across time (i.e., has validity
for cognitive ability tests in general changed over time?). Labor force changes paired with progressive
changes in the way work is done in many fields (more complex processes, greater use of technology)
call for renewed inquiries. One hypothesis that we would like to offer is that cognitive ability tests
today have greater validity than they did half a century ago. As job roles and tasks in most sectors
change over time to include more complex tools (e.g., computers), processes (e.g., virtual teamwork),
and requirements (e.g., multiple languages), the power of general mental ability as the basic learning
skill in predicting performance may increase substantially, especially in mid- to high-complexity jobs.
Another change that has placed increasing demands on individuals and organizations over the last two decades or so is internationalization. Organizations, small and large, compete for customers in a global economy. However, at a time when mobility—real and virtual—is greater than ever in human history, organizations also compete internationally for their labor force. The validity of cognitive ability tests has been studied in international contexts, most extensively in Europe (Hülsheger, Maier, & Stumpp, 2007; Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003; Salgado, Anderson, Moscoso, Bertua, de Fruyt et al., 2003). Table 12.3 summarizes
TABLE 12.3
Validity of Cognitive Ability Tests for Predicting Job Performance
in European Countries
k N r ρ
Across countries
High-complexity jobs 14 1,604 .23 .64
Medium-complexity jobs 43 4,744 .27 .53
Low-complexity jobs 12 864 .25 .51
Analyses by country
France 26 1,445 .48 .64
Germany 8 746 .33 .53
Belgium and The Netherlands 15 1,075 .24 .63
Spain 11 1,182 .35 .64
United Kingdom 68 7,725 .26 .56
N = total number of subjects; k = number of studies summarized in meta-analysis; r = sample size weighted mean observed
correlation; ρ = operational validity, corrected for attenuation due to unreliability in the criterion and for range restriction.
Source: Data across countries summarized from Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., &
Rolland, J. P., Journal of Applied Psychology, 88, 1068–1081, 2003. Data for individual countries except Germany
summarized from Salgado, J. F., & Anderson, N., European Journal of Work and Organizational Psychology, 12,
1–17, 2003. Data for Germany summarized from Hülsheger, U. R., Maier, G. W., & Stumpp, T., International
Journal of Selection and Assessment, 15, 3–18, 2007.
the results of European meta-analyses for job performance. Findings are fully parallel to those from
the United States: Cognitive ability tests show substantial validity, and higher validities are found
for higher complexity jobs. Moreover, the highest validities are found for general mental ability rather
than specific abilities (Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003).
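The step from observed r to operational validity ρ in Table 12.3 follows the standard disattenuation formula. The sketch below corrects for criterion unreliability only (the cited meta-analyses additionally corrected for range restriction), and the input values are hypothetical:

```python
import math

def operational_validity(r_obs, ryy):
    """Correct an observed validity coefficient for unreliability in the
    criterion only. Predictor unreliability is deliberately left
    uncorrected when estimating operational validity."""
    return r_obs / math.sqrt(ryy)

# Hypothetical inputs: mean observed validity of .26 and an interrater
# reliability of .52 for supervisory ratings of job performance.
print(round(operational_validity(0.26, 0.52), 2))  # 0.36
```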
There are a few practical issues worth discussing here. First, it is often naively argued that educa-
tional requirements serve as proxies for cognitive ability. This argument suggests that using a cogni-
tive ability test would not be necessary if a screening based on educational credentials were in place.
There are two flaws in this line of reasoning. First, educational qualifications of applicants to the same jobs are very similar, or at least more homogeneous than those of the population at large, so an educational screen differentiates little among them. Second, even among those who hold advanced degrees (e.g., doctoral, medical, and law degrees), there is still substantial variability in cognitive ability (Sackett & Ostgaard, 1994; Wonderlic Inc., 2002), indicating room for utility to be derived from cognitive testing. Findings of the highest predictive validities for
the most complex jobs (e.g., lawyers, physicians) underscore the usefulness of cognitive tests even
when there is homogeneity in the educational level of job applicants. If for some reason individuals
of mixed educational level were to apply for the same job, a cognitive ability test is a more precise,
valid, and efficient selection tool to use (Berry, Gruys, & Sackett, 2006).
It is also often suggested that beyond a certain required level of ability, cognitive capacity does
not contribute to performance. Such arguments essentially suggest a nonlinear relationship between
cognitive ability and performance: Cognitive ability predicts up to a certain point on the trait con-
tinuum, but validity drops off beyond that. All data that have been brought to bear on this question
tell an opposite story (Arneson, 2007; Coward & Sackett, 1990; Ghiselli & Kahneman, 1962; Hawk,
1970; Tiffin & Vincent, 1960). The relationship between ability measures and performance is linear
throughout the range of the ability spectrum. Validity remains equally high even among the most
extremely talented individuals (Wai, Lubinski, & Benbow, 2005).
Group mean-score differences on cognitive ability measures are typically expressed as standardized mean differences (d values), in which group means are compared in standard deviation units. In meta-analyzing these effect sizes, d values from all
individual studies are pooled and averaged to obtain an overall effect size that reflects the magnitude
of group differences in the population at large. Corrections for unreliability in cognitive ability
measures are typically not applied, because selection decisions are based on observed scores.4 In
general, d values of .80 or greater are considered large effects, those around .50 are moderate, and
those below .20 are small (Cohen, 1977). (From a theoretical perspective, d values under .20 are
often trivial; however, under extreme conditions, such as when the majority group selection ratio is
under 1%, even small differences in the .10 to .20 range can lead to violation of the four-fifths rule
and thus constitute adverse impact.)
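The pooling described above reduces to a sample-size-weighted average of study-level effect sizes. A minimal sketch, with hypothetical (d, N) pairs:

```python
def pooled_d(studies):
    """Sample-size-weighted mean of standardized mean differences (d)
    across studies -- the basic pooling step of a meta-analysis of group
    differences, with no corrections for statistical artifacts applied."""
    total_n = sum(n for _, n in studies)
    return sum(d * n for d, n in studies) / total_n

# Three hypothetical studies reporting (d, N):
print(round(pooled_d([(0.95, 400), (1.10, 250), (0.80, 350)]), 3))  # 0.935
```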
In the I-O psychology literature, it is widely believed and reported that sex differences in cogni-
tive ability are nonexistent (e.g., see Table 1 in Ployhart & Holtz, 2008). Table 12.4 offers a more
precise and detailed view in summarizing sex differences on cognitive variables. Differences in
verbal and mathematical abilities are negligible. Women score moderately higher than men on one
particular verbal ability marker—speech production. The largest sex differences are found on visual-
spatial measures such as mental rotation and spatial perception (meta-analytic d values in favor of
men are in the .40 to .70 range). On measures of figural reasoning that incorporate some mental
rotation and spatial perception items, sex differences are about .30 standard deviation units, with
men scoring higher on average. Thus, given selection ratios of 50% or lower for men, cognitive tests
with visual-spatial items could result in adverse impact against women.
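Under the usual assumption of normally distributed scores, a d value translates directly into an adverse-impact (selection) ratio. The sketch below uses d = .30, the figural-reasoning difference noted above, together with a hypothetical 50% majority-group selection ratio:

```python
from statistics import NormalDist

def impact_ratio(d, majority_sr):
    """Selection-rate ratio under a common top-down cut score, assuming
    both groups score normally with unit SD and the lower-scoring group's
    mean sits d standard deviations below the higher-scoring group's."""
    nd = NormalDist()
    cutoff = nd.inv_cdf(1 - majority_sr)     # cut score in majority-group z units
    minority_sr = 1 - nd.cdf(cutoff + d)     # lower-scoring group, mean shifted by -d
    return minority_sr / majority_sr

# d = .30 with a 50% majority selection ratio already violates the
# four-fifths rule (ratio < .80):
print(round(impact_ratio(0.30, 0.50), 2))  # 0.76

# Even a small d (.15) violates the rule when the majority selection
# ratio is extreme (1%), as noted parenthetically above:
print(round(impact_ratio(0.15, 0.01), 2))
```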
More so than sex differences in cognitive ability, race and ethnic group differences have
consumed attention in employee selection research and practice, especially in the North American
context (Hough, Oswald, & Ployhart, 2001). Table 12.5 summarizes race and ethnic group
differences on cognitive ability from the only existing meta-analysis of the literature (Roth, Bevier,
Bobko, Switzer, & Tyler, 2001). On average, Blacks score 1.00 and Hispanics .83 standard deviation
units lower than Whites on general mental ability measures used in employee selection. Group
differences on measures used in military settings are somewhat larger, especially for the White-Black
comparison. One explanation for this finding could be the greater heterogeneity among military job
applicants. Cognitive ability differences between Black and White applicants to high-complexity
jobs are smaller than among applicants to lower complexity jobs, most likely because of severe self-
selection as well as higher minimum requirements with regard to educational credentials. Among
applicants to medium- and low-complexity jobs, White-Black and White-Hispanic differences in
cognitive ability tests are large and would almost certainly result in adverse impact if cognitive ability were the only predictor used in employee selection. This finding is at the root of the validity-
diversity dilemma that most U.S. organizations face today (Kehoe, 2008; Kravitz, 2008; Potosky,
Bobko, & Roth, 2008; Sackett, De Corte, & Lievens, 2008). The situation is slightly better among
applicants to high-complexity jobs, in which group mean-score differences in cognitive ability
are only moderate (d = .63) and thus carry slightly less severe implications for adverse impact.
Data on Asian American-White and Native American-White cognitive ability differences among job applicants are scant to nonexistent, especially when the job applied to is held constant (i.e., within-job
examinations). The broader psychological literature indicates slightly higher scores among Asian
Americans compared with Whites (Gottfredson, 1997), but again, systematic data on job applicants
are scarce. The response categories used for demographic data collection in psychological research
often subsume individuals from very heterogeneous race and ethnic backgrounds in a single
category, which complicates comparisons especially with regard to the White-Asian comparisons.
(Illustrating this fact is the 2000 special census report on Asians in the United States, which lists
11 national and ethnic categories of interest as well as one residual “other” category; see Reeves &
Bennett, 2004.) The educational literature reports disconcertingly large score gaps between Native Americans and Whites (Humphreys, 1988). We were able to locate only one study
4 However, corrections for attenuation due to range restriction and unreliability in the criterion are advisable when compar-
ing results across studies differing in their levels of range restriction and unreliability.
TABLE 12.4
Meta-Analyses of Sex Differences on Cognitive Ability Measures
Cognitive Variable Study k d
Vocabulary Hyde & Linn (1988) 40 −.02
Reading comprehension Hyde & Linn (1988) 18 −.03
Speech production Hyde & Linn (1988) 12 −.33
Mathematics computation Hyde, Fennema, & Lamon (1990) 45 −.14
Mathematics concepts Hyde, Fennema, & Lamon (1990) 41 −.03
Mathematics problem solving Hyde, Fennema, & Lamon (1990) 48 .08
Spatial perception Linn & Petersen (1985) 62 .44
Spatial perception Voyer, Voyer, & Bryden (1995) 92 .44
Mental rotation Linn & Petersen (1985) 29 .73
Mental rotation Voyer, Voyer, & Bryden (1995) 78 .56
Spatial visualization Linn & Petersen (1985) 81 .13
Spatial visualization Voyer, Voyer, & Bryden (1995) 116 .19
Figural reasoning (matrices) Lynn & Irwing (2004) 10 .30
k = number of studies summarized in meta-analysis; d = standardized group mean score difference. Positive effect sizes
indicate males scoring higher on average.
Source: Data from Hyde, J. S., American Psychologist, 60, 581–592, 2005.
that compared Native North Americans and Whites in a job context (Vanderpool & Catano, 2008).
In this study, American Indians (Canadian Aboriginals) scored much lower on verbal ability tests
than on nonverbal tests.
It is important to stress that although race and ethnicity are protected categories in the United
States, the characteristics that define disadvantaged groups elsewhere are diverse (cf. Myors et al.,
2008). Cognitive ability test scores of disadvantaged groups around the globe remain largely
unstudied in employee selection settings, although exceptions can be found in a handful of countries
(e.g., Australia, Israel, New Zealand, South Africa, Taiwan, and The Netherlands). Extant research
appears to point to consistently lower scores of disadvantaged groups (e.g., Aborigines in Australia,
Canada, New Zealand, and Taiwan; Blacks in South Africa; immigrants in The Netherlands;
Sackett & Shen, 2008).
TABLE 12.5
Race and Ethnic Group Mean Score Differences in GMA Among Job Applicants
Group Comparison Setting Job Complexity N k d
White-Black Industrial Across complexity levels 375,307 11 1.00
White-Black Industrial (within-job studies) Low 125,654 64 .86
White-Black Industrial (within-job studies) Moderate 31,990 18 .72
White-Black Industrial (within-job studies) High 4,884 2 .63
White-Black Military Across complexity levels 245,036 1 1.46
White-Hispanic Industrial Across complexity levels 313,635 14 .83
White-Hispanic Military Across complexity levels 221,233 1 .85
k = number of studies summarized in meta-analysis; d = standardized group mean score difference; N = total sample size.
Positive effect sizes indicate Whites scoring higher on average.
Source: Data from Tables 2, 4, and 7 of Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P., Personnel Psychology,
54, 297–330, 2001.
Large-scale international comparisons using the same cognitive ability test in employment
settings are rare, if not nonexistent. The first two authors of this chapter recently developed items
for a computer adaptive figural reasoning test for operational use around the globe (Dilchert &
Ones, 2007). The test was intended to be used in the selection of employees for a Global 100
company regardless of job level, complexity, or location (Gibby & Boyce, 2007). Over 120,000
applicants from 100 countries, speaking a multitude of languages, took the test during the first
several months after it was made available online. The data allow us an extraordinary global look
at group differences on the same, nonverbal reasoning measure. The sampling of job applicants to
the same organization heightens the uniqueness and relevance of this data set for other employment
contexts. Among the nearly 60,000 U.S.-based applicants in this sample, group differences were
in the expected direction, with all minority groups except Asian Americans scoring lower than
Whites. However, typically observed group differences were reduced. We also analyzed national
differences for countries where sample sizes exceeded 100 applicants. The ten countries in which
job applicants scored highest on average were southeast Asian (4) and European (6). An analysis
of the data by cultural clusters revealed Confucian Asia and southern and northern Europe scoring
higher than other cultural regions. Of course, the exact nature of these group differences remains to
be studied (e.g., are they also in part due to differential attraction to jobs across countries and cultural
regions?). Regardless of the underlying mechanisms, observed differences between applicants from
different cultural regions applying to the same organization present organizations with both a challenge and an opportunity to actively shape their workforces on the basis of diversity and talent goals.
One area of group differences that has received little attention in the adverse impact literature
is that of cognitive ability differences between younger and older adults. One large-scale exami-
nation of cognitive abilities across the working life span (Avolio & Waldman, 1994) offers some
insights into the magnitudes of age-related declines in cognitive ability. Avolio and Waldman
(1994) reported mean scores of 25,140 White, Black, and Hispanic job applicants who had taken
the General Aptitude Test Battery (GATB) of the U.S. Employment Service and broke these scores
down by age groups. We used the data presented by Avolio and Waldman to compute standardized
effect sizes (d values) documenting the magnitudes of age-related declines in cognitive abilities
within race and ethnic groups. Table 12.6 presents these findings.
For all race and ethnic groups, there are sizable differences in general mental ability and specific
cognitive abilities between older and younger individuals. Using the 20–34 year olds as a comparison
group, 35–44 year olds score about .30 standard deviation units lower on general mental
ability. Individuals who are between 45 and 54 score about .50 standard deviation units lower, and
55–65 year olds score approximately .80 standard deviation units lower. Trends for Blacks and
Hispanics mimic the data for Whites but are less precise because of smaller sample sizes. When
a general mental ability measure is used in employee selection, younger applicants stand to get
selected at greater rates than older applicants. The disparity in selection ratios would be particularly
severe if applicant pools included individuals from the entire age spectrum of adults. Comparisons
of young, White applicants with older, minority group applicants will lead to even larger observed
differences.
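To make the selection-ratio point concrete, the disparity implied by a given standardized difference (d) can be approximated by assuming (our simplifying assumption, not data from this chapter) normally distributed scores with equal variances in both groups. The cutoff below is hypothetical:

```python
# Hypothetical illustration: how a standardized group difference (d) maps
# onto selection rates at a fixed cutoff, assuming normal score
# distributions with equal variances.
from statistics import NormalDist

def selection_rates(d, cutoff_z):
    """Passing rates for the higher- and lower-scoring groups; cutoff_z is
    expressed in SD units of the higher-scoring group."""
    nd = NormalDist()
    higher = 1 - nd.cdf(cutoff_z)
    lower = 1 - nd.cdf(cutoff_z + d)
    return higher, lower

# d = .80 (roughly the 55-65 vs. 20-34 GMA difference discussed above),
# with a cutoff at the higher-scoring group's mean:
higher, lower = selection_rates(0.80, 0.0)
print(round(higher, 2), round(lower, 2), round(lower / higher, 2))
```

Under these assumptions the older group's passing rate (about .21) is roughly 42% of the younger group's (.50), well below the four-fifths (80%) guideline commonly used to flag adverse impact.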
Focusing on specific abilities, differences between older and younger adults are lowest for verbal
ability. The oldest age group in the sample, 55–65 year olds, scores about .50 standard deviation
units lower than 20–34 year olds. The magnitudes of differences for numerical, spatial, and clerical
ability are similar to those for general mental ability. Enormously large differences are found for
form perception. The 55–65 year olds on average score about 1.53 standard deviation units lower
than 20–34 year olds. The difference between 45–54 year olds and those between 20 and 34 is also
larger than 1 standard deviation unit. Use of form perception measures is certain to result in a dras-
tic reduction of hiring rates for older adults.
TABLE 12.6
Age Differences in Cognitive Ability

                     Age Group                 White            Black           Hispanic
Cognitive            Comparison (Reference
Variable             Group: 20–34 Yearsa)      N      d         N      d        N      d
GMA                  35–44                   2,571   .35      1,374   .32      360    .24
                     45–54                   2,011   .57        643   .48      171    .62
                     55–65                     968   .82        144   .68       49    .70
Verbal ability       35–44                   2,571   .28      1,374   .24      360    .16
                     45–54                   2,011   .36        643   .33      171    .39
                     55–65                     968   .51        144   .39       49    .44
Numerical ability    35–44                   2,571   .35      1,374   .40      360    .26
                     45–54                   2,011   .59        643   .57      171    .68
                     55–65                     968   .82        144   .78       49    .76
Spatial ability      35–44                   2,571   .32      1,374   .27      360    .27
                     45–54                   2,011   .58        643   .46      171    .55
                     55–65                     968   .91        144   .73       49    .78
Form perception      35–44                   2,571   .54      1,374   .59      360    .59
                     45–54                   2,011  1.01        643  1.10      171   1.23
                     55–65                     968  1.53        144  1.43       49   1.72
Clerical ability     35–44                   2,571   .36      1,374   .49      360    .39
                     45–54                   2,011   .62        643   .85      171    .90
                     55–65                     968   .88        144  1.01       49   1.06

a Means and standard deviations for the reference group were obtained by sample-size weighting means and pooling standard deviations for the 20- to 24-year-old and 25- to 34-year-old age groups. N = sample sizes for the 20- to 34-year-old reference groups (7,970, 4,720, and 1,056 for Whites, Blacks, and Hispanics, respectively). d = standardized group mean-score difference. Positive effect sizes indicate younger individuals scoring higher on average.
Source: Based on data presented in Table 3 of Avolio, B. J., & Waldman, D. A., Psychology and Aging, 9, 430–442, 1994.

Aging does result in cognitive ability declines, but certain abilities decline more with age than
others, and knowledge-based (or so-called "crystallized") abilities (e.g., vocabulary and certain
numerical skills) seem to display the slowest rate of decline in the normal population (see Avolio &
Waldman, 1994; Owens, 1966; Schaie, 1994). Other evidence from the individual differences
literature suggests that rates of cognitive decline are slower for those who have higher initial
baseline ability (Deary, MacLennan, & Starr, 1998), higher levels of education (Deary et al., 1998;
Rhodes, 2004), and those employed in complex or enriched jobs that presumably use their cognitive
abilities to a greater extent (Schooler, Mulatu, & Oates, 1999). A recent meta-analysis suggests
higher vocabulary scores for older adults, although this effect appeared confounded by education
(Verhaeghen, 2003). In the future, the aging workforces of most Western countries will certainly
necessitate greater attention to the consequences of cognitive ability test use for workforce diversity
with regard to age.
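The effect-size computation described in the note to Table 12.6 (a sample-size-weighted reference mean, a pooled reference standard deviation, and d as the mean difference over the pooled SD) can be sketched as follows. All numeric inputs below are hypothetical placeholders, not the Avolio and Waldman (1994) values:

```python
# Sketch of the effect-size computation described in the note to Table 12.6.
# The subgroup means, SDs, and sample sizes here are hypothetical
# placeholders, not the Avolio and Waldman (1994) data.
import math

def weighted_mean(means, ns):
    """Sample-size-weighted mean across subgroups."""
    return sum(m * n for m, n in zip(means, ns)) / sum(ns)

def pooled_sd(sds, ns):
    """Pool standard deviations across subgroups, weighting by degrees of freedom."""
    num = sum((n - 1) * sd ** 2 for sd, n in zip(sds, ns))
    return math.sqrt(num / sum(n - 1 for n in ns))

def cohens_d(mean_ref, mean_cmp, sd_ref, sd_cmp, n_ref, n_cmp):
    """Standardized mean difference; positive when the reference
    (younger) group scores higher on average."""
    return (mean_ref - mean_cmp) / pooled_sd([sd_ref, sd_cmp], [n_ref, n_cmp])

# Reference group (20-34) built from the 20-24 and 25-34 subgroups:
ref_mean = weighted_mean([102.0, 101.0], [3000, 4970])
ref_sd = pooled_sd([15.0, 15.2], [3000, 4970])

# d for a hypothetical 35-44 comparison group:
d = cohens_d(ref_mean, 96.5, ref_sd, 14.8, 7970, 2571)
print(round(d, 2))
```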
Thus far, we have discussed only group mean score differences on cognitive ability tests. Another
salient issue is that of differential validity. Hunter, Schmidt, and Hunter (1979) and Schmidt,
Pearlman, and Hunter (1980) have quantitatively summarized dozens of validation studies using the
GATB with Blacks and Hispanics, respectively. Hunter et al.’s (1979) analysis demonstrated that on
average, validities for Whites were .01 correlational points higher than those for Blacks in predict-
ing objective performance criteria and .04 correlational points higher for predicting subjective rat-
ings of job performance (k = 866 validity pairs incorporating data from 120,294 Whites and 65,193
Blacks). Across 1,128 pairs of validity coefficients, Schmidt et al. (1980) showed White validities on
average to be .02 correlational points higher than Hispanic validities. Differential validity of
cognitive ability tests in organizational settings has not been reported for Asian Americans or
Native Americans in the peer-reviewed literature.
Cognitive Abilities 269
Rothstein and McDaniel (1992) reported an examination of differential validity by sex for cogni-
tive ability tests. Using 59 pairs of male and female correlations (N = 5,517 and 9,428, respectively),
they found observed validities to be on average .03 correlational points higher for women (validities
corrected for range restriction and unreliability in the criteria were .05 correlational points higher).
The higher validity for women was more marked in lower-complexity jobs and female-dominated
occupations. In male-dominated occupations, the validity was higher for predicting performance
among men. We were unable to locate differential validity investigations for older versus younger
adults. Future research should examine differential validity for hitherto unexamined groups (Asians,
Native Americans, older adults).
We would like to stress that an update of the existing literature on differential validity would
be valuable, because analyses for Black and Hispanic groups have relied on validation studies
conducted prior to the 1980s, and analyses by sex relied on data collected prior to 1990. Labor
force participation and occupational distributions of women, Blacks, and Hispanics are much
different today than 20–30 years ago. Changes in the nature of many jobs (e.g., greater complexity,
greater technological demands) as well as changes in the social milieu in many organizations
(e.g., emergence of workforce diversity as a core value) may manifest themselves in cognitive
ability-criteria relations. Research must also examine whether differential validity is found for
criteria other than overall job performance. In our opinion, studies on organizational citizenship
behaviors, task performance, and leadership criteria may constitute priorities. The only study yet
to examine Black-White differential validity of a cognitive ability test for predicting nontraditional
performance criteria investigated incidents of detected counterproductive behaviors (interpersonal
and those targeted at the organization) and found no evidence of differential validity (Dilchert et al.,
2007). However, replications of these results, as well as investigations among other minority groups,
are certainly warranted.
While differential validity compares the magnitudes of criterion-related validities between
groups of interest, differential prediction simultaneously compares slopes and intercepts of regres-
sion lines for such groups. A healthy body of literature in employee selection has led to the conclu-
sion that there is no predictive bias against Blacks in the United States (Rotundo & Sackett, 1999;
Schmidt, 1988). Some exceptions notwithstanding (e.g., the Netherlands and South Africa), studies
from other parts of the world (as well as those for other minority groups) are sparse and need to be
conducted.
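The distinction between differential validity and differential prediction can be made concrete with a moderated regression: regress the criterion on the test score, a group indicator, and their product. A nonzero group coefficient signals an intercept difference and a nonzero product coefficient a slope difference. The simulated data below are our own illustration, constructed so the predictor is unbiased by design:

```python
# Illustrative sketch of a differential prediction (moderated regression)
# analysis. The data are simulated -- not from any study cited here -- and
# are generated with identical regression lines in both groups.
import numpy as np

rng = np.random.default_rng(0)
n = 500                                   # applicants per group

group = np.repeat([0.0, 1.0], n)          # 0 = reference, 1 = focal group
score = rng.normal(100, 15, size=2 * n)   # cognitive test scores
# Same intercept (2.0) and slope (.05) in both groups => no predictive bias.
perf = 2.0 + 0.05 * score + rng.normal(0, 1, size=2 * n)

# Columns: intercept, common slope, intercept difference, slope difference.
X = np.column_stack([np.ones(2 * n), score, group, score * group])
beta, *_ = np.linalg.lstsq(X, perf, rcond=None)

# beta[2] (intercept difference) and beta[3] (slope difference) should be
# close to zero here; sizable estimates would indicate differential prediction.
print(beta)
```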
disciplines to evaluate whether some group differences on individual differences traits can be
reduced, and if so, how. In the meantime, I-O psychologists need to face the challenges that these
group differences pose in applied settings. To this end, the responsibility is equally distributed
among scientists, who need to address the areas of concern summarized above; test publishers, who
need to continuously collect and make available data regarding group differences and predictive
fairness of their tests; and individual practitioners, who need to educate themselves on the past and
current research as well as its implications for their specific purpose.
Another, somewhat easier challenge is that of enhancing the acceptability of cognitive ability mea-
sures among applicants to high-complexity jobs. As Lubinski (2004) pointed out, cognitive ability
can be assessed in all shapes and forms: “Variegated conglomerations of information and problem-
solving content, not necessarily tied to an educational program, which may involve fresh as well as old
learning (acquired in or out of school), may be used to assess general intelligence” (p. 98). However, it
is our opinion that when new formats and approaches are used to address issues of applicant reactions
and face validity, intellectual honesty still mandates an acknowledgment of the construct being mea-
sured. The proliferation of “new” abilities and claims that such abilities are independent of traditional
intelligence are insincere and harmful to the professional reputation of our field.
We have also observed that sometimes preconceived notions of cognitive test acceptability can
cloud our judgment. Our work with nonverbal figural reasoning tests, arguably an item type that on
the surface does not appear extraordinarily related to most real-world tasks, yielded some surpris-
ing findings. Data show that such items, especially when compared to those with verbal content,
are received very positively by applicants. Although contextualization is certainly a viable method
of achieving face validity, items need not always be contextualized to elicit positive applicant
reactions.
EPILOGUE
This chapter aimed to offer a broad and forthright overview of cognitive ability tests and their use
in employee selection. Other excellent overviews of the topic may be found in Drasgow (2003, espe-
cially with regard to structural issues); Ree, Carretta, and Steindl (2001, especially with regard to
broader life correlates); Ones, Viswesvaran, and Dilchert (2004, especially with regard to validity
for learning criteria); and Ones et al. (2005, especially with regard to criterion-related validity in
organizational settings). Debates over the use of cognitive ability tests in selection settings can also
be found in a special issue of Human Performance (Viswesvaran & Ones, 2002).
Cognitive ability is the capacity to learn, solve problems, and adapt to environments. Abstract
thinking and logical reasoning determine success in various life domains by allowing us not only to
rely on skills acquired through past experience but also to react to novel situations through knowledge
and insights acquired in mental simulations. Cognitive ability continues to be the single best deter-
minant of work performance. We believe that the benefits associated with cognitive ability test use
in employee selection far outweigh potential concerns. Nonetheless, we have summarized some
challenges that need to be tackled if I-O psychology is to continue developing as a field. We
are certain our profession will be up to those challenges.
REFERENCES
Alonso, A., Viswesvaran, C., & Sanchez, J. I. (2008). The mediating effects of task and contextual performance.
In J. Deller (Ed.), Research contributions to personality at work (pp. 3–17). Munich, Germany: Rainer
Hampp.
Arneson, J. J. (2007). An examination of the linearity of ability—Performance relationships among high scor-
ing applicants. Unpublished doctoral dissertation, University of Minnesota, Minneapolis, MN.
Avolio, B. J., & Waldman, D. A. (1994). Variations in cognitive, perceptual, and psychomotor abilities across
the working life span: Examining the effects of race, sex, experience, education, and occupational type.
Psychology and Aging, 9, 430–442.
Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical
reanalysis. Journal of Applied Psychology, 66, 1–6.
Berry, C. M., Gruys, M. L., & Sackett, P. R. (2006). Educational attainment as a proxy for cognitive ability in selec-
tion: Effects on levels of cognitive ability and adverse impact. Journal of Applied Psychology, 91, 696–705.
Berry, C. M., Sackett, P. R., & Landers, R. N. (2007). Revisiting interview-cognitive ability relationships:
Attending to specific range restriction mechanisms in meta-analysis. Personnel Psychology, 60, 837–874.
Bertolino, M., & Steiner, D. D. (2007). Fairness reactions to selection methods: An Italian study. International
Journal of Selection and Assessment, 15, 197–205.
Burke, E. (2008, April). Remarks. In N. Tippins (Chair), Internet testing: Current issues, research, solutions,
guidelines, and concerns. Symposium conducted at the annual conference of the Society for Industrial
and Organizational Psychology, San Francisco, CA.
Campbell, J. P. (1990). The role of theory in industrial and organizational psychology. In M. D. Dunnette &
L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 1, pp. 39–73). Palo
Alto, CA: Consulting Psychologists Press.
Campbell, J. P. (1996). Group differences and personnel decisions: Validity, fairness, and affirmative action.
Journal of Vocational Behavior, 49, 122–158.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge
University Press.
Cattell, R. B. (1971). Abilities: Their structure, growth, and action. Oxford, England: Houghton Mifflin.
Chan, D., & Schmitt, N. (2004). An agenda for future research on applicant reactions to selection procedures:
A construct-oriented approach. International Journal of Selection and Assessment, 12, 9–23.
Christensen, F. G. W., Beatty, A. S., Selzer, B. K., & Ones, D. S. (2009). Thirty years of meta-analysis:
Psychology’s gift to the sciences. Manuscript submitted for publication.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Collins, J. M., Schmidt, F. L., Sanchez-Ku, M., Thomas, L., McDaniel, M. A., & Le, H. (2003). Can basic indi-
vidual differences shed light on the construct meaning of assessment center evaluations? International
Journal of Selection and Assessment, 11, 17–29.
Coward, W., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation. Journal
of Applied Psychology, 75, 297–300.
Deary, I. J., MacLennan, W. J., & Starr, J. M. (1998). Is age kinder to the initially more able?: Differential age-
ing of a verbal ability in the Healthy Old People in Edinburgh Study. Intelligence, 26, 357–375.
Deary, I. J., Whiteman, M. C., Starr, J. M., Whalley, L. J., & Fox, H. C. (2004). The impact of childhood intel-
ligence on later life: Following up the Scottish mental surveys of 1932 and 1947. Journal of Personality
and Social Psychology, 86, 130–147.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Detterman, D. K., & Daniel, M. H. (1989). Correlations of mental tests with each other and with cognitive
variables are highest for low IQ groups. Intelligence, 13, 349–359.
Dilchert, S., & Ones, D. S. (2007, April). Influence of figural reasoning item characteristics on group mean-score
differences. In A. S. Boyce & R. E. Gibby (Chairs), Global cognitive ability testing: Psychometric issues
and applicant reactions. Symposium conducted at the annual conference of the Society for Industrial and
Organizational Psychology, New York, NY.
Dilchert, S., & Ones, D. S. (2009). Assessment center dimensions: Individual differences correlates and meta-
analytic incremental validity. International Journal of Selection and Assessment, 17, 254–270.
Dilchert, S., Ones, D. S., Davis, R. D., & Rostow, C. D. (2007). Cognitive ability predicts objectively measured
counterproductive work behaviors. Journal of Applied Psychology, 92, 616–627.
Drasgow, F. (2003). Intelligence and the workplace. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.),
Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 107–130). Hoboken,
NJ: John Wiley & Sons.
Foster, D. (2008, April). Remarks. In N. Tippins (Chair), Internet testing: Current issues, research, solutions,
guidelines, and concerns. Symposium conducted at the annual conference of the Society for Industrial
and Organizational Psychology, San Francisco, CA.
Ghiselli, E. E., & Kahneman, D. (1962). Validity and non-linear heteroscedastic models. Personnel Psychology,
15, 1–11.
Gibby, R. E. (2008, April). Online and unsupervised adaptive cognitive ability testing: Lessons learned. In
R. E. Gibby & R. A. McCloy (Chairs), Benefits and challenges of online and unsupervised adaptive
testing. Symposium conducted at the annual conference of the Society for Industrial and Organizational
Psychology, San Francisco, CA.
272 Handbook of Employee Selection
Gibby, R. E., & Boyce, A. S. (2007, April). Global cognitive ability testing: Psychometric issues and applicant
reactions. Symposium conducted at the annual conference of the Society for Industrial and Organizational
Psychology, New York, NY.
Gottfredson, L. S. (1997). Mainstream science on intelligence: An editorial with 52 signatories, history and
bibliography. Intelligence, 24, 13–23.
Gottfredson, L. S. (2002). Where and why g matters: Not a mystery. Human Performance, 15, 25–46.
Gottfredson, L. S. (2004a). Intelligence: Is it the epidemiologists’ elusive “fundamental cause” of social class
inequalities in health? Journal of Personality and Social Psychology, 86, 174–199.
Gottfredson, L. S. (2004b). Life, death, and intelligence. Journal of Cognitive Education and Psychology, 4,
23–46.
Gustafsson, J.-E. (2002). Measurement from a hierarchical point of view. In H. I. Braun, D. N. Jackson, &
D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73–95).
Mahwah, NJ: Lawrence Erlbaum.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated
model and meta-analysis. Personnel Psychology, 57, 639–683.
Hawk, J. (1970). Linearity of criterion-GATB aptitude relationships. Measurement and Evaluation in Guidance,
2, 249–251.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse
impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of
Selection and Assessment, 9, 152–194.
Hull, C. L. (1928). Aptitude testing. Yonkers-on-Hudson, NY: World Book.
Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of
job performance and training success in Germany: A meta-analysis. International Journal of Selection
and Assessment, 15, 3–18.
Humphreys, L. G. (1988). Trends in levels of academic achievement of Blacks and other minorities. Intelligence,
12, 231–260.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Hunter, J. E., & Schmidt, F. L. (1982). Ability tests: Economic benefits versus the issue of test fairness.
Industrial Relations, 21, 122–158.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A compre-
hensive review and analysis. Psychological Bulletin, 86, 721–735.
Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60, 581–592.
Hyde, J. S., Fennema, E., & Lamon, S. J. (1990). Gender differences in mathematics performance: A meta-
analysis. Psychological Bulletin, 107, 139–155.
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin,
104, 53–69.
Johnson, W., & Bouchard, T. J. (2005). The structure of human intelligence: It is verbal, perceptual, and image
rotation (VPR), not fluid and crystallized. Intelligence, 33, 393–416.
Johnson, W., te Nijenhuis, J., & Bouchard, T. J. (2007). Replication of the hierarchical visual-perceptual-image
rotation model in de Wolff and Buiten’s (1963) battery of 46 tests of mental ability. Intelligence, 35, 69–81.
Johnson, W., te Nijenhuis, J., & Bouchard, T. J. (2008). Still just 1 g: Consistent results from five test batteries.
Intelligence, 36, 81–95.
Kane, H. D., Oakland, T. D., & Brand, C. R. (2006). Differentiation at higher levels of cognitive ability:
Evidence from the United States. Journal of Genetic Psychology, 167, 327–341.
Kehoe, J. F. (2008). Commentary on pareto-optimality as a rationale for adverse impact reduction: What would
organizations do? International Journal of Selection and Assessment, 16, 195–200.
Kluger, A. N., & Rothstein, H. R. (1993). The influence of selection test type on applicant reactions to employ-
ment testing. Journal of Business and Psychology, 8, 3–25.
Kravitz, D. A. (2008). The diversity-validity dilemma: Beyond selection—The role of affirmative action.
Personnel Psychology, 61, 173–193.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2004). Academic performance, career potential, creativity, and
job performance: Can one construct predict them all? Journal of Personality and Social Psychology, 86,
148–161.
Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971). Predictors, criteria, and significant results. Personnel
Psychology, 24, 519–533.
Linn, M. C., & Petersen, A. C. (1985). Emergence and characterization of sex-differences in spatial ability: A
meta-analysis. Child Development, 56, 1479–1498.
Lubinski, D. (2004). Introduction to the special section on cognitive abilities: 100 Years after Spearman’s
(1904) “ ‘General Intelligence,’ Objectively Determined and Measured”. Journal of Personality and
Social Psychology, 86, 96–111.
Lynn, R., & Irwing, P. (2004). Sex differences on the progressive matrices: A meta-analysis. Intelligence, 32,
481–498.
McCloy, R. A., Campbell, J. P., & Cudeck, R. (1994). A confirmatory test of a model of performance determinants.
Journal of Applied Psychology, 79, 493–505.
McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. (2007). Situational judgment tests, response
instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests:
A meta-analysis. Psychological Bulletin, 114, 449–458.
Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). International
perspectives on the legal environment for selection. Industrial and Organizational Psychology, 1,
206–246.
Nathan, B. R., & Alexander, R. A. (1988). A comparison of criteria for test validation: A meta-analytic investigation.
Personnel Psychology, 41, 517–535.
Nikolaou, I., & Judge, T. A. (2007). Fairness reactions to personnel selection techniques in Greece: The role of
core self-evaluations. International Journal of Selection and Assessment, 15, 206–219.
Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of
Applied Psychology, 79, 845–851.
Ones, D. S. (2004, October). Validity of cognitive ability tests in selection: Quantitative summaries of
meta-analyses. Cattell Award address given at the annual conference of the Society for Multivariate
Experimental Psychology, Naples, FL.
Ones, D. S., & Dilchert, S. (2004). Practical versus general intelligence in predicting success in work and
educational settings: A first-order and a second-order meta-analysis. Paper presented at the University
of Amsterdam, Amsterdam, The Netherlands.
Ones, D. S., Viswesvaran, C., & Dilchert, S. (2004). Cognitive ability in selection decisions. In O. Wilhelm &
R. W. Engle (Eds.), Handbook of understanding and measuring intelligence (pp. 431–468). Thousand
Oaks, CA: Sage.
Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Cognitive ability in personnel selection decisions. In
A. Evers, O. Voskuijl, & N. Anderson (Eds.), Handbook of selection (pp. 143–173). Oxford, England:
Blackwell.
Owens, W. (1966). Age and mental abilities: A second adult follow-up. Journal of Educational Psychology,
57, 311–325.
Plomin, R., & Spinath, F. M. (2004). Intelligence: Genetics, genes, and genomics. Journal of Personality and
Social Psychology, 86, 112–129.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and
sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Potosky, D., Bobko, P., & Roth, P. L. (2005). Forming composites of cognitive ability and alternative measures
to predict job performance and reduce adverse impact: Corrected estimates and realistic expectations.
International Journal of Selection and Assessment, 13, 304–315.
Potosky, D., Bobko, P., & Roth, P. L. (2008). Some comments on Pareto thinking, test validity, and adverse
impact: When “and” is optimal and “or” is a trade-off. International Journal of Selection and Assessment,
16, 201–205.
Ree, M. J., & Carretta, T. R. (1996). Central role of g in military pilot selection. International Journal of
Aviation Psychology, 6, 111–123.
Ree, M. J., Carretta, T. R., & Steindl, J. R. (2001). Cognitive ability. In N. Anderson, D. S. Ones, H. K. Sinangil,
& C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology: Vol. 1: Personnel
psychology (pp. 219–232). Thousand Oaks, CA: Sage.
Reeve, C. L., & Lam, H. (2007). Consideration of g as a common antecedent for cognitive ability test performance,
test motivation, and perceived fairness. Intelligence, 35, 347–358.
Reeves, T. J., & Bennett, C. E. (2004). We the people: Asians in the United States—Census 2000 special report
(CENSR-17). Washington, DC: U.S. Department of Commerce, Economics and Statistics Administration,
U.S. Census Bureau.
Rhodes, M. G. (2004). Age-related differences in performance on the Wisconsin Card Sorting Test: A meta-
analytic review. Psychology and Aging, 19, 482–494.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive
ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
Rothstein, H. R., & McDaniel, M. A. (1992). Differential validity by sex in employment settings. Journal of
Business and Psychology, 7, 45–62.
Rotundo, M., & Sackett, P. R. (1999). Effect of rater race on conclusions regarding differential prediction in
cognitive ability tests. Journal of Applied Psychology, 84, 815–822.
Ryan, A. M., & Greguras, G. J. (1998). Life is not multiple choice: Reactions to the alternatives. In M. D. Hakel
(Ed.), Beyond multiple choice: Evaluating alternatives to traditional testing for selection (pp. 183–202).
Mahwah, NJ: Lawrence Erlbaum.
Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1999). An international look at selection practices: Nation
and culture as explanations for variability in practice. Personnel Psychology, 52, 359–391.
Sackett, P. R., Borneman, M. J., & Connelly, B. S. (2008). High stakes testing in higher education and employ-
ment: Appraising the evidence for validity and fairness. American Psychologist, 63, 215–227.
Sackett, P. R., De Corte, W., & Lievens, F. (2008). Pareto-optimal predictor composite formation: A comple-
mentary approach to alleviating the selection quality/adverse impact dilemma. International Journal of
Selection and Assessment, 16, 206–209.
Sackett, P. R., & Ostgaard, D. J. (1994). Job-specific applicant pools and national norms for cognitive ability
tests: Implications for range restriction corrections in validation research. Journal of Applied Psychology,
79, 680–684.
Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on
performance and minority hiring. Personnel Psychology, 49, 549–572.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment,
credentialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist,
56, 302–318.
Sackett, P. R., & Shen, W. (2008, April). International perspectives on the legal environment for selec-
tion. Symposium conducted at the annual conference of the Society for Industrial and Organizational
Psychology, San Francisco, CA.
Salgado, J. F., & Anderson, N. (2003). Validity generalization of GMA tests across countries in the European
Community. European Journal of Work and Organizational Psychology, 12, 1–17.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., & de Fruyt, F. (2003). International validity generalization
of GMA and cognitive abilities: A European community meta-analysis. Personnel Psychology, 56, 605.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J. P. (2003). A meta-analytic
study of general mental ability validity for different occupations in the European Community. Journal of
Applied Psychology, 88, 1068–1081.
Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview
of constructs, methods and techniques. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran
(Eds.), Handbook of industrial, work and organizational psychology: Vol. 1: Personnel psychology
(pp. 165–199). London, England: Sage.
Schaie, K. W. (1994). The course of adult intellectual development. American Psychologist, 49, 304–313.
Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment selection. Journal
of Vocational Behavior, 33, 272–292.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization.
Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1981). New research findings in personnel selection: Myths meet realities in
the 1980s. Management, 23, 23–27.
Schmidt, F. L., & Hunter, J. (2004). General mental ability in the world of work: Occupational attainment and
job performance. Journal of Personality and Social Psychology, 86, 162–173.
Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and ability on job
knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied
Psychology, 71, 432–439.
Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Goff, S. (1988). Joint relation of experience and ability with
job performance: Test of three hypotheses. Journal of Applied Psychology, 73, 46–57.
Schmidt, F. L., Pearlman, K., & Hunter, J. E. (1980). The validity and fairness of employment and educational
tests for Hispanic Americans: A review and analysis. Personnel Psychology, 33, 705–724.
Schooler, C., Mulatu, M. S., & Oates, G. (1999). The continuing effects of substantively complex work on the
intellectual functioning of older workers. Psychology and Aging, 14, 483–506.
Spearman, C. (1904). “General intelligence,” objectively determined and measured. American Journal of
Psychology, 15, 201–293.
Stern, W. (1911). Die differentielle Psychologie in ihren methodischen Grundlagen [Differential psychology
and its methodological foundations]. Leipzig, Germany: J. A. Barth.
Tiffin, J., & Vincent, N. L. (1960). Comparison of empirical and theoretical expectancies. Personnel Psychology,
13, 59–64.
Vanderpool, M., & Catano, V. M. (2008). Comparing the performance of Native North Americans and pre-
dominantly White military recruits on verbal and nonverbal measures of cognitive ability. International
Journal of Selection and Assessment, 16, 239–248.
Verhaeghen, P. (2003). Aging and vocabulary score: A meta-analysis. Psychology & Aging, 18, 332–339.
Viswesvaran, C., & Ones, D. S. (2002). Special issue: Role of general mental ability in industrial, work, and
organizational psychology. Human Performance, 15.
Voyer, D., Voyer, S., & Bryden, M. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis
and consideration of critical variables. Psychological Bulletin, 117, 250–270.
Wai, J., Lubinski, D., & Benbow, C. P. (2005). Creativity and occupational accomplishments among intellectu-
ally precocious youths: An age 13 to age 33 longitudinal study. Journal of Educational Psychology, 97,
484–492.
Wonderlic Inc. (2002). Wonderlic Personnel Test and Scholastic Level Exam user’s manual. Libertyville, IL:
Author.
13 Physical Performance Tests
Deborah L. Gebhardt and Todd A. Baker
Assessment of physical performance has a historical base in the exercise science, medical,
psychology, and military fields. It is an interdisciplinary field of study that encompasses work
physiology, biomechanics, industrial engineering, applied psychology, and medicine. Because of
the multidisciplinary aspects and uses of physical testing, it has been influenced by the law and
standards for testing in occupational settings. Hogan (1991a) provided a rich historical background
that highlights many of the factors that led to the types of physical testing used today. Although
our society has become computer-driven, a segment of civilian and military jobs remains arduous
in nature. A 2001 survey indicated that preemployment physical tests were used by fewer than
10% of employers (Salgado, Viswesvaran, & Ones, 2001). However,
physical performance tests remain integral in selection of public safety personnel and have become
popular for evaluating applicants seeking blue collar jobs in the warehouse, manufacturing, long-
shore, telecommunications, railroad, airline, electric, and natural gas industries, as well as for law
enforcement and paramedic personnel (Gebhardt & Baker, 2007).
Physical performance tests are used for selection of applicants into arduous jobs, retention of
incumbents, and evaluation of general levels of physical fitness. Of primary interest in the employment
setting are the uses for selection and retention. Although physical testing is common in the selection
setting, few organizations evaluate incumbent personnel at specified intervals (e.g., annually) because
of union agreements and divergent opinions on the consequences of failing to meet the established
standard. A few public safety agencies require annual physical qualification (e.g., Nuclear Regulatory
Commission, state police agencies) with employment consequences (Gebhardt & Baker, 2006). The
employment consequences range from remedial training to job suspension, denial of promotion and
bonus payments, and job loss.
types and weights of objects). Further, scales that define the amount of time required to complete a
task (e.g., 10 seconds, 5 minutes) and the physical effort involved enable the job analyst to classify
tasks by physical demand. An additional scale, expected to perform, is used for public safety jobs
(e.g., nuclear security officer) that contain tasks with high importance to job performance, but which
are rarely performed (e.g., fire weapon at suspect). For example, nuclear plant security officers are
responsible for protecting the plant from sabotage and terrorist attack. Although an armed attack on
a nuclear plant has not occurred, the officers are tasked with protecting the plant and expected to
perform tasks related to protecting the plant should an attack occur.
Past research found that task ratings can be completed by job incumbents or supervisors (Hogan,
1991a). Because the frequency and time-spent ratings drive the development of the physical tests
and criterion measures, incumbent personnel are typically best suited to complete the job analysis
questionnaire: they perform the tasks and therefore know the frequency and duration parameters.
If supervisors are used, first-line supervisors with previous experience performing the job should
be selected.
Numerous decision rules or algorithms (frequency, importance, and time-spent combinations)
have been used to identify essential tasks from task ratings. There is not one specific algorithm
associated with physically oriented job analysis. Selection of the most appropriate algorithm depends
upon the nature and mission of the job. For example, jobs in which most tasks are performed fre-
quently (e.g., assembly line) may require a larger weighting for frequency than importance. Jobs that
include highly important tasks that are infrequently performed (e.g., fire weapon at suspect) may use
an algorithm in which there is a separate importance or frequency mean cutoff to identify essential
tasks. Selection of the most appropriate algorithm requires knowledge of the job.
An initial overview of the physical demand of the essential tasks can be obtained by using the
physical effort scale (Fleishman, Gebhardt, & Hogan, 1986). Tasks with mean physical effort rat-
ings equal to or above a specified value (e.g., 4 on a 7-point scale) are considered physically demand-
ing. An alternate method to identify the essential physically demanding tasks is to compute, for each
essential task, the product of its mean physical effort and mean frequency ratings. These data assist
the researcher in determining whether the essential tasks have adequate physical demand to require
further investigation.
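As a minimal sketch of the screening logic just described — an importance-or-frequency cutoff rule to flag essential tasks, followed by a physical-effort screen with the effort × frequency product as a demand index — the cutoff values, task names, and ratings below are illustrative assumptions, not values from any cited study:

```python
def essential_tasks(tasks, freq_cut=3.0, imp_cut=4.0):
    """Flag a task as essential if its mean frequency rating OR its mean
    importance rating meets the cutoff -- one of many possible decision
    rules; 'fire weapon at suspect' qualifies on importance alone."""
    return [t for t in tasks if t["freq"] >= freq_cut or t["imp"] >= imp_cut]

def physically_demanding(tasks, effort_cut=4.0):
    """Screen essential tasks by mean physical effort rating (e.g., 4 on
    a 7-point scale), reporting the effort x frequency product as an
    alternate index of physical demand."""
    return [(t["task"], t["effort"] * t["freq"])
            for t in tasks if t["effort"] >= effort_cut]

# Hypothetical mean ratings for two tasks
ratings = [
    {"task": "fire weapon at suspect", "freq": 1.2, "imp": 6.8, "effort": 4.5},
    {"task": "file reports",           "freq": 6.0, "imp": 3.0, "effort": 1.5},
]
core = essential_tasks(ratings)         # both pass under this rule
demanding = physically_demanding(core)  # only the weapon task remains
```

In practice the cutoffs would be set from knowledge of the job, as the text notes.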
ERGONOMIC/BIOMECHANICAL/PHYSIOLOGICAL ANALYSIS
Ergonomic, physiological, and biomechanical data are used to quantify the demands of a job.
These three types of analysis provide direct measures of job demand. The approaches range
from simple measures such as the distance a worker walks when gathering tools, to sophisticated
measures involving oxygen uptake, mathematical modeling, and use of archival engineering
data. For most jobs, simple ergonomic measures such as weights of objects, distances objects
are carried, and heights lifted to and from can be easily gathered at the job site. Where needed,
the forces required to push/pull objects can be obtained using a load-cell device that measures
force production.
Basic physiological and biomechanical data can define the actual demands of job tasks. The type
of data gathered is dependent upon the essential tasks and physical demands of the job. The data can
be gathered using a variety of equipment (e.g., heart rate monitor, accelerometer, oxygen/gas sensor,
mathematical modeling). Heart rate monitors can be used to assess physiological workload for jobs
that require sustained performance of one or more tasks at high intensities for extended time periods
(e.g., order filler, firefighter). The monitor collects the individual’s heart rate while performing physi-
cally demanding tasks to determine the heart rate response to the work and thus, the percentage of
maximum heart rate at which the individual was working. For example, if a 30-year-old warehouse
order filler with a maximum heart rate of 190 beats per minute (bpm) (220 – 30 = 190 bpm) is work-
ing at an average heart rate of 142.5 bpm, he is working at 75% of his maximum (142.5/190 = 0.75).
The American College of Sports Medicine (ACSM) classified the intensity of physical activity in
terms of the percentage of maximum heart rate (Whaley, Brubaker, & Otto, 2006). Table 13.1 lists the
ACSM intensities, which range from very light to maximum (Whaley et al., 2006). Gebhardt, Baker,
and Thune (2006) found that workers in the order-filler job had heart rates of 71–81% of maximum
across a 3- to 4-hour time frame, thus placing the job in the “hard” intensity level. This information
can then be used to determine an estimate of the maximum oxygen uptake (VO2submax) needed to
perform the job tasks.
Past research indicated that to sustain arduous work for an 8-hour period, the individual must be
working at 40–50% of maximum aerobic capacity (VO2 max) (Astrand, Rodahl, Dahl, & Stromme,
2003; McArdle, Katch, & Katch, 2007). Direct measure of oxygen uptake has been performed on
jobs ranging from light industry (e.g., manual materials handling) to firefighting and military jobs
(Bilzon, Scarpello, Smith, Ravenhill, & Rayson, 2001; Sothmann, Gebhardt, Baker, Kastello, &
Sheppard, 2004). For example, studies that gathered VO2 data found that shipboard
firefighting required a VO2 of 3.1 liters/minute (L · min−1), or approximately 36.5 milliliters of
oxygen per kilogram of body weight per minute (mL · kg−1 · min−1), whereas urban and forest firefighters'
minimum VO2 requirements ranged from 33.5 to 45.0 mL · kg−1 · min−1 (Bilzon et al., 2001; Gledhill &
Jamnik, 1992; Sothmann et al., 2004). This type of physiological data, along with heart rate data,
can be used to determine whether a test of aerobic capacity would be useful and establish job-
related passing scores for aerobic capacity tests.
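The unit conversion behind the shipboard-firefighting figures (absolute uptake in L · min−1 to relative uptake in mL · kg−1 · min−1) is a single division by body mass. The 85-kg body mass below is our assumption for illustration, because the text reports only the converted value:

```python
def relative_vo2(absolute_l_per_min, body_mass_kg):
    """Convert absolute oxygen uptake (L/min) to relative uptake
    (mL of oxygen per kg of body mass per minute)."""
    return absolute_l_per_min * 1000.0 / body_mass_kg

# 3.1 L/min for an assumed 85-kg firefighter is roughly 36.5
# mL . kg^-1 . min^-1, close to the shipboard-firefighting figure
rel = relative_vo2(3.1, 85.0)
```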
Biomechanical data, which encompass the use of physics principles to define human movement,
can be gathered by direct measurement (e.g., weight, time, distance) and film/video data. If the
force to move or lift an object cannot be measured with a load cell as described above, the param-
eters related to the movement can be used to calculate the force exerted in a movement. A simple
biomechanical or mathematical model was developed to calculate the force a paramedic needed
to exert to lift either the head or foot end of a patient loaded stretcher (Gebhardt & Crump, 1984).
The length and weight of the stretcher and the height and weight of the most common patient were
used to determine that a force of 152 lb was required to lift the head end of a stretcher carrying a
200-lb patient. Another method requires filming (motion analysis) workers performing a job task.
The motions are captured using optical sensors placed at the subjects’ joints (e.g., elbow). These data
are then mathematically transformed to provide indications of the forces incurred at specific body
joints (e.g., knee, hip), which in turn yield an indication of the forces the worker needs to generate
to complete the work task.
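The stretcher calculation above can be sketched as a static torque balance about the supported end. The stretcher weight, length, and center-of-mass location below are hypothetical illustrations, not the parameters used by Gebhardt and Crump (1984):

```python
def end_lift_force(total_weight_lb, cg_from_supported_end_ft, length_ft):
    """Static torque balance for lifting one end of a rigid, loaded
    stretcher while the other end rests on its support:

        F * length = total_weight * (distance of the combined center
                                     of mass from the supported end)
    """
    return total_weight_lb * cg_from_supported_end_ft / length_ft

# Hypothetical: 200-lb patient plus a 50-lb stretcher, combined center
# of mass 4.5 ft from the supported foot end of a 7.5-ft stretcher
force = end_lift_force(250.0, 4.5, 7.5)  # 150 lb at the head end
```

A head-heavy load (center of mass past the midpoint) makes the head-end force exceed half the total weight, which is why the measured requirement can exceed 50% of patient-plus-stretcher weight.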
In summary, ergonomic, biomechanical, and physiological data provide information that is
important to defining the physical demand of a job and can be used in a validation study to assist in
setting passing scores. These data are not readily available and must be gathered on site.
TABLE 13.1
ACSM’s Categories of Physical Activity Intensity
Intensity Percentage of Maximum Heart Rate
Very light <35
Light 35−54
Moderate 55−69
Hard 70−89
Very hard ≥90
Maximum 100
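The age-predicted maximum heart rate arithmetic and the intensity bands of Table 13.1 can be combined in a short sketch. The 220 − age estimate is the one used in the order-filler example; treating exactly 100% as "maximum" is our choice for the overlapping top bands:

```python
def acsm_intensity(age, avg_hr):
    """Percentage of age-predicted maximum heart rate (220 - age) and
    the corresponding ACSM intensity category from Table 13.1."""
    pct = 100.0 * avg_hr / (220 - age)
    if pct < 35:
        label = "very light"
    elif pct < 55:
        label = "light"
    elif pct < 70:
        label = "moderate"
    elif pct < 90:
        label = "hard"
    elif pct < 100:
        label = "very hard"
    else:
        label = "maximum"
    return pct, label

# The 30-year-old order filler averaging 142.5 bpm works at 75% of
# maximum, in the "hard" band -- matching the example in the text
pct, label = acsm_intensity(30, 142.5)
```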
completion. In either case, these factors should be taken into account when viewing the results of
job task ratings that have unexpected mean values. For example, nuclear power plant workers wear
occlusive clothing to protect them from the radiation. This clothing increases the workers’ core
temperature, which causes excessive sweating and reduces the workers’ ability to perform job tasks.
Research has shown that individuals with higher cardiovascular endurance are more readily able
to adjust to a heated environment (Astrand et al., 2003; Pandolf, Burse, & Goldman, 1977). Thus,
aerobic capacity may be relevant to job success because of the interaction of the physical demand
and the environmental conditions. Defining the demands of the essential tasks may assist in design-
ing the testing procedures, as well as criterion measures used in criterion-related validation studies.
Environmental condition information can be collected through job analysis questionnaires, incum-
bent focus group meetings, company operating procedures, and past weather history.
1. Muscular strength is the ability to exert force to lift, push, pull, or hold objects.
2. Muscular endurance is the ability to exert force continuously over moderate to long time
periods.
3. Aerobic capacity or cardiovascular endurance is the ability of the respiratory and cardio-
vascular systems to provide oxygen to the body systems for medium- to high-intensity
tasks performed over a moderate time period.
4. Anaerobic power is the ability to complete high-intensity, short-duration (e.g., 5–90 sec-
onds) tasks using stored energy (e.g., adenosine triphosphate).
5. Flexibility involves the range of motion at the joints (e.g., knee, shoulders) to bend, stoop,
rotate, and reach in all directions with the arms and legs.
6. Equilibrium is the ability to maintain the center of gravity over the base of support (e.g.,
feet) when outside forces (e.g., gravity, slipping on ice) occur.
7. Coordination is the ability to integrate sight, hearing, and other neurosensory cues to
perform motor activities (e.g., change of direction) in an accurate sequential pattern.
Past research has identified different structures for classifying physical abilities. One of the first
taxonomies of physical abilities was generated by Fleishman through factor analysis and was refined
into nine abilities (e.g., static strength, dynamic strength, coordination, equilibrium) similar to those
listed above (Fleishman & Quaintance, 1984; Fleishman, 1964). A second study using test data
from workers in physically demanding jobs found three components describing physical perfor-
mance: muscular strength, endurance, and movement quality (Hogan, 1991b). A subsequent study
using equal samples of men and women, a feature absent from previous studies, included 25 physical
tests, some of which were found in the Fleishman study (Myers, Gebhardt, Crump, & Fleishman, 1993).
A confirmatory factor analysis found that a six-factor structure best described physical performance
for men and women; it included static or muscular strength, explosive strength or
anaerobic power, aerobic capacity (stamina), trunk strength, and flexibility. A comparison of most
of these studies is presented by Guion (1998) in which muscular strength, muscular endurance, and
muscular power (anaerobic power) were grouped under a single factor (muscular strength) and flex-
ibility, balance, and coordination were grouped under Hogan’s movement quality factor. However,
use of a single strength factor does not correspond to the physiological components that underlie
performance of different types of physical tasks. For example, it takes 5–10 minutes to complete
300 turns when closing large wheel valves. This requires a high level of muscular endurance and
would fall under the strength factor in Hogan’s model. However, a test of muscular strength may
not be the best predictor for performing this task. Rather, a test of muscular endurance would be
better suited. Although each of these structures has scientific merit, a combination of these studies
provides a framework for identifying the physical requirements in the work setting. These abilities
are muscular strength, muscular endurance, aerobic capacity, anaerobic power, flexibility, and
equilibrium, along with a coordination factor.
Performance of physical tasks requires varying levels of the different physical abilities. The
muscular strength required may be as minimal as lifting a spoon or as great as lifting 90-lb cement bags.
Similarly, the energy expenditure may be primarily anaerobic (e.g., drag a victim 50 feet from
a fire) or aerobic (e.g., fill eight warehouse orders totaling 8,900 lbs. in 7 hours) depending on
the duration and intensity of the activity. When identifying the relevant physical abilities for a
position, it is more informative to gather information related to the level of the physical abilities
needed to complete essential job tasks. The Fleishman Job Analysis Survey provides a set of nine
Likert physical ability-rating scales and 10 psychomotor ability scales (Fleishman & Quaintance,
1984). An alternate physical demands inventory that uses seven physical ability rating scales
(e.g., muscular strength, muscular endurance, anaerobic power, aerobic capacity) and behavioral
anchors targeted at work behaviors has also been used to define job demands (Gebhardt, 1984).
Incumbents, supervisors, or job analysts can complete these ratings with the end product being a
profile of the physical demand of the essential job tasks. This approach allows for comparison of
multiple jobs and assists in the selection or design of testing procedures for only relevant abilities
(Gebhardt, Baker, Curry, & McCallum, 2005). Regardless of the method used to determine the
abilities related to job performance, the procedures must be logical and linked back to the essential
job tasks.
McArdle et al., 2007). Tests such as a one-repetition bench press or a dynamic lift to a specified height
are examples of isotonic strength tests. Isotonic tests have been found to be significant predictors for
public safety jobs (Davis, Dotson, & Santa Maria, 1982; Gebhardt & Crump, 1984).
Isokinetic testing assesses the force produced through a specified range of motion at the shoul-
der, back, and knee joints. The equipment used incorporates a force recording device (load cell)
and computer software, which controls the speed (degrees/second) at which a subject can per-
form maximal flexion and extension movements. The force generated by a subject is measured
in units of torque (τ), a vector quantity that represents the force generated when rotating an object
(e.g., lower leg) about an axis (e.g., knee) (McGinnis, 2007). A torque curve is produced for each
joint measured. Scores are generated for each joint and summed to form a strength index. There
is limited published research using isokinetic testing in an occupational setting. However, some
research has shown a relationship between isokinetic test scores and injury reduction (Gilliam
& Lund, 2000; Karwowski & Mital, 1986). Research comparing isokinetic tests with isometric
and isotonic tests found the correlations among the tests to be high (r = .91 to .94; Karwowski &
Mital, 1986).
Muscular endurance (or dynamic strength) tests assess the ability to withstand muscular fatigue.
The duration of these tests varies in relation to the desired outcome and demands of the job. The
arm endurance test, in which a subject pedals an arm ergometer at a set resistance level (e.g., 50 W)
for a specified time period, is an example of a muscular endurance test (Gebhardt et al., 1998). The
test can be scored by counting the number of revolutions in a specified time period or the time until
a subject is unable to maintain a specific cadence. Other tests such as situps and pushups can also
be measured to exhaustion or for a specified duration.
Aerobic capacity tests assess the efficiency of the cardiovascular system (i.e., lungs, blood ves-
sels, heart) to deliver oxygen to the muscles. The tests can be classified as maximal and submaxi-
mal. In a maximal test, the subject typically runs on a treadmill or pedals a bicycle at incremental
workloads (e.g., increased treadmill speed and/or slope) until reaching exhaustion. The test can be
scored as the time to exhaustion or using a regression equation to determine the oxygen uptake value
(i.e., mL · kg−1 · min−1). The Bruce or the Balke treadmill protocols are commonly used (Whaley
et al., 2006). The submaximal assessments include the step test, 1.5-mile run, 1-mile walk, and the
YMCA or Astrand-Rhyming bicycle test. The goal of the step and bicycle tests is to provide an
estimate of VO2max using heart rate response to the workload of the exercise (Astrand et al., 2003;
Golding, 2000). The results are reported in milliliters of oxygen per kilogram of body weight per
minute and expressed as VO2submax. The submaximal tests are used more frequently in employee
selection. Maximal and submaximal tests are considered to be medical assessments as defined by
the Americans with Disabilities Act (ADA) of 1990 and, therefore, should be given after a
conditional offer of the job.
Flexibility and equilibrium tests have been used in employee selection. However, they are rarely
significant predictors of job performance. The correlations between job performance and these tests
range from 0.00 to 0.18 (Gebhardt et al., 2005; Gebhardt, Baker, & Sheppard, 1992). These correlations
may reflect that flexibility is specific to the joints providing the movement (e.g., knee, shoulder).
Similarly, it appears that only when high levels of equilibrium (e.g., leaning out from heights of
40 ft or greater when lashing containers to a ship) are involved, is equilibrium a significant predictor
of job performance (Gebhardt, Schemmer, & Crump, 1985).
Some basic ability tests can be used to assess multiple abilities depending upon the intensity
and duration of the test. For example, the arm endurance test described above can be a muscular
endurance test or can become an anaerobic power test by shortening the duration (e.g., 10 sec) and
increasing the resistance (e.g., 100 W). Finally, most basic ability tests are practical in that they
are easily stored when not in use, have a small footprint, and are easy to transport and set up when
testing at multiple locations. The shortcoming of basic ability tests is that they do not resemble the
job tasks (Hogan, 1991a). Table 13.2 provides a listing of commonly used basic ability tests. For an
overview of these tests, the reader should see Hogan (1991a).
TABLE 13.2
Basic Physical Ability Tests

Muscular strength
  Upper body: Arm lift, Shoulder lift, Handgrip, Static push, Static pull, Chest pull, Dynamic lift, Pushups
  Trunk/core: Situps, Trunk pull
  Lower body: Leg lift, Leg press
Muscular endurance
  Upper body: Arm endurance, Pushups, Arm ergometer (10 seconds)
  Trunk/core: Situps
  Lower body: Stepping platform
Aerobic capacity: Step test, 1.5-mile run, 1-mile walk
Flexibility: Sit and reach, Joint range of motion
Equilibrium: Stabiliometer, Balance beam
Anaerobic power: Shuttle run, 100-yd run, Margaria test, Purdue pegboard test
Job simulations or work sample tests include components of the job (e.g., pursuing a suspect,
lifting and carrying boxes) and are used as predictors or criterion measures. Job simulations require
performance of actual or simulated job tasks during the test and may require use of job-specific
equipment. The primary advantage of job simulations is their resemblance to the job. Further, they
can be developed directly from the essential job tasks and can provide an initial indication of how an
individual handles equipment. Developing a simulation that does not include equipment skills
learned in training or on the job may be difficult. However, other equipment (e.g., weight
vest) can be used to simulate actual equipment (e.g., firefighter bunker gear). When job simulations
consist of a series of tasks, the sequence of performance should replicate the job as closely as
possible. The duration and intensity of a job simulation should reflect job conditions, not a protracted
set of tasks that lasts substantially longer than job events (Baker, Sheppard, & Gebhardt, 2000). It
is paramount that simulated tasks represent the critical physical job behaviors and that the param-
eters selected can be scored in a meaningful way to identify true individual differences (American
Educational Research Association [AERA], American Psychological Association, & National
Council on Measurement in Education, 1999). Many job simulations are scored by completion time,
whereas others require the completion of a task or set of tasks within a specified time limit. Regardless of the
scoring procedure used, it should reflect the job tasks and working conditions (e.g., distance run).
The primary disadvantages of job simulations are equipment size and construction (safety, durability,
storage), applicant safety (e.g., higher injury risk), and scoring the simulation. Despite the
disadvantages, job simulations based on essential tasks and ergonomic parameters (e.g., duration,
intensity) possess content validity. Depending upon the job of interest, the following parameters should
be considered when designing a job simulation: (a) dimensions and weights of objects handled,
(b) frequency of weight category (e.g., 30–50 lb) handled in a time period (e.g., hour, day), (c) heights
to which objects are lifted, (d) distances objects are carried, (e) distances workers walk, (f) speed of task
movement, and (g) fidelity to the job.
Job simulations are commonly used in public safety jobs by law enforcement and fire depart-
ments. Less common are simulations of blue-collar jobs (e.g., warehouse worker, pipefitter, electri-
cian). These simulations involve lifting, pushing, and pulling movements, with lifting and carrying
of weighted objects being the most commonly used assessment.
Another type of lifting test—isoinertial—was developed to assess work capacity in a struc-
tured manner and increase the safety of the lifting tasks. Isoinertial tests involve lifting predeter-
mined weights from floor level to a defined height (e.g., waist, shoulder) at a specified pace (e.g.,
every 5 sec). This differs from the psychophysical approach in which the subject determines the
weight lifted. For isoinertial tests, the weight is increased by 5 or 10 lb every 20 to 40 seconds.
Depending upon the protocol, the weight of the objects is increased until the subject achieves
maximum capacity, or the maximum weight defined by the job analysis (Gebhardt et al., 2006;
Mayer, Gatchel, & Mooney, 1990). Isoinertial tests are used for selection of manual materi-
als handling personnel. Past research has shown that a progressive isoinertial lifting test is a
good assessment of an individual’s ability to perform lifting tasks over a time period (Hazard,
Reeves, & Fenwick, 1992) and an effective baseline test for lifting capacity. Isoinertial tests
were found to be a reliable, safe, and inexpensive method to screen for jobs with frequent lift-
ing (Hattori et al., 1998; Lygren, Dragesund, Joensen, Ask, & Moe-Nilssen, 2005). Two stud-
ies found that the inclusion of an isoinertial lifting evaluation was more predictive of injury
than basic strength tests (Gebhardt et al., 2006; Mayer et al., 1990). Other research found that
individuals with back pain performed poorly on an isoinertial lifting test (Ruan, Haig, Geisser,
Yamakawa, & Buchholz, 2001).
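The progressive protocol described above — fixed weight increments at a set pace, continuing until the subject's capacity or the job-analysis ceiling is reached — can be sketched as follows. The starting weight and ceiling are illustrative choices within the ranges given in the text, not a published protocol:

```python
def isoinertial_progression(start_lb, increment_lb, ceiling_lb):
    """Generate the sequence of loads for a progressive isoinertial
    lifting test: each successful lift raises the load by a fixed
    increment until the ceiling defined by the job analysis is reached
    (in administration, the test also stops at the subject's maximum)."""
    weights = []
    load = start_lb
    while load <= ceiling_lb:
        weights.append(load)
        load += increment_lb
    return weights

# 10-lb increments from a 30-lb start up to a 70-lb job-analysis ceiling
schedule = isoinertial_progression(30, 10, 70)  # [30, 40, 50, 60, 70]
```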
In an analysis of more than 50,000 subjects, Baker (2007) found numerous differences across
White, African-American, and Hispanic groups. To eliminate a gender effect, only men were
included in the study. Significant differences with effect sizes ranging from 0.29 to 0.52 were found,
with White men performing better than African-American men on basic ability and job simulation
tests that required some level of quick and/or continuous movement (e.g., pursuit run, 1.5-mile run,
arm endurance, firefighter evolution). For strength tests, White and African-American men had
significantly better scores than Hispanic men (Baker, 2007; Blakely et al., 1994).
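The effect sizes reported above are standardized mean differences. A textbook pooled-standard-deviation Cohen's d (a generic formula, not the specific computation from Baker, 2007) looks like:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference between two groups, using the
    pooled standard deviation as the denominator."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# With equal SDs, d is simply the mean gap in SD units: here 5/10 = 0.5,
# inside the 0.29-0.52 range of effect sizes the text reports
d = cohens_d(mean1=105.0, sd1=10.0, n1=500, mean2=100.0, sd2=10.0, n2=500)
```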
Although one would ideally select tests with no adverse impact, this is not possible in the physi-
cal testing arena. Rather, one must select and/or design tests that have less adverse impact. Choosing
a job simulation over a basic ability test will not necessarily reduce adverse impact for women or
minorities. Further, if the job demands have a considerable upper body strength component (e.g.,
lineworker), tests that evaluate the ability should be used. Review of past research is the best method
to determine the validity and adverse impact of specific physical tests. When designing new tests,
one must ensure that the pilot or test sample includes an adequate number of women. Because of the
low number of women in physically demanding jobs, organizations should attempt to recruit women
from jobs with similar physical demands to participate in the data-gathering phase. This approach
was used in a statewide firefighter study in which women firefighters in fire suppression jobs were
recruited from neighboring states to participate in a validation study (Gebhardt & Baker, 1999).
Without these data, identification of a fair and sound passing score would not have been possible.
Although the reliability and adverse impact issues are important, one must pay special atten-
tion to the safety and logistics associated with administering physical performance tests. As stated
above, basic ability tests provide a more controlled environment than job simulations. If job simula-
tions are used, issues such as the floor surface, temperature, and general testing environment must
be monitored. Both types of tests can be administered safely.
the selection setting included lift/carry tasks, pursuit and arrest simulation, firefighter tasks, valve
turning, and lashing (Anderson, 2003; Gebhardt, 2007). Sothmann and associates in multiple studies
found performance on a firefighter job simulation to be significantly related (r = .67) to actual heart
rate response and oxygen uptake values during firefighting tasks (Sothmann et al., 1990; Sothmann
et al., 2004). The progression of these studies demonstrated the validity of job simulations and basic
ability tests with criterion measures.
When conducting a criterion-related validity study, creation/selection of the criterion measure(s)
is as important as test selection. The most common criteria used are supervisor and/or peer ratings
of job performance, productivity measures, and work samples. Injury and lost workdays are viable criteria,
but very large samples are required for use of these data. Further, injury data variables may lack
variability and be confounded with other safety initiatives implemented concurrently with the test-
ing by an organization. Regardless of the type of criteria used, the reliability of the measure should
be determined (Shrout & Fleiss, 1979).
Job simulation tests possess content and construct validity in terms of tasks, conditions, duration,
and intensity, whereas basic ability tests possess construct validity (Arvey et al., 1992; Fleishman,
1964; Hogan, 1991b; Myers et al., 1993). However, these types of validity may not be sufficient for
physical performance tests to withstand legal scrutiny (United States v. City of Erie, 1995; EEOC
v. Dial Corporation, 2006).
The difficulty with content and construct models is establishing an accurate passing score for
a test or test battery. This is especially difficult if a job simulation consists of a series of tasks. If
the test consists of one job task with a known performance criterion obtained in the job analysis,
completion of the task can serve as the passing score. For example, in a paramedic test in which the
force to lift the head end of a stretcher (152 lb) was established, the completion of two consecutive
lifts requiring 152 lb of force constituted a passing score (Gebhardt & Crump, 1984). In light of
the Lanning v. SEPTA (1999) litigation, the organization has an added responsibility to gather
empirical data (e.g., arrest rates) to establish a passing score that reflects the minimally acceptable
level of job performance and is consistent with business necessity.
When determining the validity model to use, several factors must be considered, including (a) type
of test desired, (b) availability of job performance information (e.g., ratings, productivity, attrition),
and (c) organizational resources and commitment to the study. The latter two are related primarily
to criterion data collection. When using supervisor or peer ratings, the availability of personnel and
probability of obtaining individual differences must be considered. Finally, if productivity, attrition,
or injury data are available, the data must be reviewed for applicability with the abilities or tasks
being measured (e.g., musculoskeletal injury vs. eye injury), type of data available (e.g., quantitative
versus qualitative), and confounding effects of other organizational programs (e.g., safety). A com-
pendium of criteria for workplace assessment that addresses these issues is found in Performance
Assessment for the Workplace (Wigdor & Green, 1991).
Other statistical procedures used include logistic regression and canonical correlation. Logistic
regression allows for use of multiple predictors but requires a dichotomous criterion measure
(Pedhazur, 1997). This type of analysis is useful if the criterion measure can be used in a dichoto-
mous format such as the level of aerobic capacity required to perform an order selector or firefighter
job, or the likelihood of injury (Gebhardt et al., 2006; Hodgdon & Jackson, 2000; Sothmann et al.,
2004). Canonical correlation yields a correlation of two latent variables, one representing a set of
independent variables, the other a set of dependent variables (Levine, 1977; Tabachnick & Fidell,
1997). This method allows the researcher to investigate a set of dependent variables instead of
one variable. Each of these methods has advantages and disadvantages. Selection of a statistical
technique is dependent on the available data, types of tests and criterion, organizational goals, and
business necessity.
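As an illustration of the logistic approach, the following sketch relates simulated strength and step-test scores to a dichotomous criterion (met/did not meet an aerobic demand). All variable names, values, and coefficients are invented for illustration, not drawn from the studies cited above, and the Newton-Raphson fit is a bare-bones stand-in for a statistics package:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Bare-bones logistic regression via Newton-Raphson (IRLS).
    Returns coefficients with the intercept first; a sketch, not a
    substitute for a statistics package."""
    Xd = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        W = p * (1.0 - p)                       # IRLS weights
        grad = Xd.T @ (y - p)
        hess = (Xd * W[:, None]).T @ Xd
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 200
strength = rng.normal(100, 20, n)   # hypothetical isometric strength score
step = rng.normal(50, 10, n)        # hypothetical step-test score
true_logit = 0.05 * (strength - 100) + 0.10 * (step - 50)
meets_demand = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

beta = fit_logistic(np.column_stack([strength, step]), meets_demand)
# Predicted probability for a hypothetical candidate (strength = 120, step = 60)
prob = 1 / (1 + np.exp(-(beta[0] + beta[1] * 120 + beta[2] * 60)))
print(f"P(meets demand | strength=120, step=60) = {prob:.2f}")
```

The fitted model yields a probability of meeting the demand for any candidate's predictor profile, which can then be dichotomized at a cutoff tied to business necessity.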
Similar to other employment assessments, the fairness of the test across protected groups should
be established. Because physical performance tests commonly demonstrate adverse impact against
women and older individuals in terms of test scores and passing rates, it is important to establish
test fairness. Regression analysis can be used to examine subgroup differences (Bartlett, Bobko,
Mosier, & Hannan, 1978; Cleary, 1968). Research using moderated regression analysis to evaluate
test fairness has typically found physical tests to be fair across sex and age subgroups (Gebhardt et al.,
1998; Sothmann et al., 2004).
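A minimal sketch of the moderated (Cleary-model) regression check follows, using simulated data in which the two groups differ in mean test score but share a common test-criterion regression line; all numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 2, n)            # 0 = reference group, 1 = focal group
# Groups differ in mean test score, but not in the test-criterion relationship
test = rng.normal(60 - 10 * group, 8, n)
criterion = 0.6 * test + rng.normal(0, 5, n)

# Moderated regression: criterion ~ test + group + (test x group)
X = np.column_stack([np.ones(n), test, group, test * group])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
b0, b_test, b_group, b_inter = beta
# A near-zero interaction (slope difference) and group term (intercept
# difference) are consistent with a fair test under the Cleary model
print(f"common slope = {b_test:.2f}, slope difference = {b_inter:.3f}")
```

In practice the group and interaction terms would be tested for statistical significance rather than eyeballed; a significant term would indicate differential prediction.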
When combining physical and cognitive test scores, several questions arise. First, are the cognitive and
physical performance components of a job independent or dependent? Can an individual compensate for
lesser physical capabilities with higher levels of cognitive capabilities? Because these components
are independent, higher levels of either factor cannot compensate for lower levels of the other factor.
Second, how will the weightings be determined? Job analysis data [e.g., essential task frequency;
time spent; knowledge, skills, and abilities (KSAs)] can be used to develop weightings that reflect
the proportion of the job each component represents. Alternatively, cognitive and physical KSA importance ratings were used to
establish a 50%-50% weighting scheme for a firefighter test (Brunet v. City of Columbus, 1995).
Although this weighting was upheld in the district and appellate courts, this approach may result in
greater adverse impact (e.g., sex, ethnic), especially if ranking is required.
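One common way to apply such weights, sketched here with hypothetical scores, is to standardize each component before combining so that differing score scales do not distort the intended 50%-50% (or job-analysis-based) weighting:

```python
import numpy as np

def weighted_composite(cognitive, physical, w_cog=0.5, w_phys=0.5):
    """Standardize each component, then combine with job-analysis weights."""
    def z(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std(ddof=1)
    return w_cog * z(cognitive) + w_phys * z(physical)

# Hypothetical candidate scores on a cognitive test and a physical test battery
cog = np.array([70, 85, 90, 60, 75])
phys = np.array([50, 40, 65, 55, 45])
composite = weighted_composite(cog, phys)
print(np.round(composite, 2))
```

Because each standardized component has a mean of zero, the composite is centered as well, and candidates can be rank-ordered on it; as noted above, such ranking may increase adverse impact.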
Empirical criterion data have also been used to establish passing scores. Sothmann and colleagues used data from multiple
studies and pacing data to establish the passing score for a firefighter test battery (Sothmann et al.,
1990; Sothmann et al., 2004). Six video-pacing tapes of a firefighter suppression evolution criterion
measure were generated based on the validation data from incumbent firefighters (Sothmann et al.,
2004). The paces ranged from very fast (i.e., 1 standard deviation faster than the mean time) to very
slow (i.e., 3 standard deviations slower than the mean time). A random sample of 41 incumbent
firefighters (31 men, 10 women) was selected to determine which pace was acceptable or unac-
ceptable. On the basis of the incumbent responses, the slowest pace viewed as acceptable by most
incumbents was identified and served as the minimally acceptable level of job performance for the
firefighter work sample criterion measure. If pacing or other types of video-based ratings are used,
the individual criteria for judging the behaviors must be established, raters must be trained, and rat-
ings should be independently derived.
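The decision rule described above (take the slowest pace that a majority of incumbents still judges acceptable) can be sketched as follows; the pace labels and acceptability proportions are hypothetical, not Sothmann's actual data:

```python
# Hypothetical fraction of incumbent raters judging each video pace "acceptable",
# ordered from the fastest pace to the slowest pace of the work-sample measure
acceptable = {
    "1.0 SD faster than mean": 1.00,
    "mean time":               0.98,
    "1.0 SD slower":           0.90,
    "1.5 SD slower":           0.75,
    "2.0 SD slower":           0.40,
    "3.0 SD slower":           0.10,
}
# The slowest pace still endorsed by a majority of incumbents sets the
# minimally acceptable level of performance (i.e., the passing score)
passing_pace = [pace for pace, frac in acceptable.items() if frac > 0.5][-1]
print(passing_pace)  # prints "1.5 SD slower"
```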
Physical performance tests in a selection setting typically use a single passing score that is
applicable to all candidates. However, a few law enforcement agencies utilize sex and/or age nor-
mative data to identify multiple passing scores in their selection procedures. For example, men age
20–29 years must complete 40 sit-ups, whereas women age 20–29 years must complete 35. This
practice appears to be a clear violation of the Civil Rights Act of 1991, which states that passing
scores cannot vary on the basis of gender, ethnic group, or age. The rationale for using normative
gender and/or age data as passing scores is based on the premise that the agency is measuring
physical fitness and not job performance. This type of testing in which test scores are compared
to previously gathered data is called norm-referenced testing (Safrit & Wood, 1989). Most norma-
tive physical performance test data were gathered on the “normal population” and are not specific
to a job. These data allow an agency to classify individuals in terms of percentile or an overall
category (e.g., good, fair, poor). From an employment perspective, these classifications by age and/
or gender do not provide an indication of acceptable or unacceptable job performance. Individuals
supporting norm-referenced testing believe that different passing scores for men and women (or
age groups) are, in fact, the same because they represent the same percentile rank in the norm-referenced
tables. However, these same individuals recommend a single test passing score when job
simulations are used.
In the 1980s and 1990s, the Employment Litigation Section (ELS) of the Department of Justice
urged law enforcement agencies to use norm-referenced selection tests (Ugelow, 2005). ELS took
the position that sex-normed passing scores could be used to make selection decisions because the
candidates would receive physical training in the law enforcement academy. One court supported
the use of sex- and age-normed passing scores for selection and retention (Peanick v. Reno, 1995).
Another court upheld use of norm-referenced tests on the basis that the tests were assessing fitness
and not job requirements (Alspaugh v. Michigan Law Enforcement Officers Training Council, 2001).
In Lanning v. Southeastern Pennsylvania Transportation Authority (SEPTA) (1999), the court ruled
against the plaintiff’s recommendation to use sex-normed passing scores for a 1.5-mile run because
the percentile scores offered no evidence of how the times related to job performance. In these cases
that supported normed passing scores, little or no mention was made of how the percentile passing
scores were related to job performance.
Norm-referenced testing has not been used in firefighter or blue-collar job selection. Two pri-
mary issues must be considered prior to using a norm-referenced approach. First, multiple passing
scores will be used, thus violating a tenet of the Civil Rights Act of 1991. Second, the Uniform
Guidelines (1978) specified that a test’s passing score must represent the minimally acceptable level
of job performance. Use of percentiles does not establish the evidence needed to demonstrate that
the passing score represents or predicts minimally acceptable job performance. The Lanning v.
SEPTA (1999) decision rejected the use of percentiles because of this issue. It is the opinion of the
authors that the use of sex- and/or age-based norm-referenced test scoring does not comply with
employment laws and statutes in that multiple passing scores for a single test have not been shown to reflect
minimally acceptable job performance.
290 Handbook of Employee Selection
ADMINISTRATION OF TESTS
Physical performance test administration differs from cognitive testing in several respects. First,
test instructions must provide adequate detail to ensure the examinee understands the purpose and
goal of the test (e.g., complete maximum number of revolutions). Test instructions should include
information related to committing errors (e.g., running when only walking is permitted) and the
consequence of the errors (e.g., repeat trial, fail test). Having the test administrator demonstrate
correct and incorrect test performance enhances the instructions. Second, test administrators must
be trained to administer the test, recognize testing errors, and score the tests (e.g., time, count).
A video/DVD of a test can enhance the examinee’s understanding of the test but cannot replace the
test administrator.
Further, administrators and others should not provide encouragement (e.g., cheering) to
examinees because external motivation can alter performance. Examinees should be self-
motivated. Testing examinees separately removes the possibility of external motivation and
prevents subsequent examinees (e.g., second, third) from gaining test insight (e.g., pace) not
available to the first examinee, thus allowing for equal treatment of all examinees. Finally, if
a job simulation is used, test administrators must practice cuing the examinee to the next test
component because improper timing of test cues can impact examinee performance. Tests that are
timed and/or require counting the number of event completions involve coordination of a timing
device (e.g., stopwatch) or counter with occurrence of an event (e.g., cross finish line, completion
of a sequence).
Placement of physical performance tests in the selection continuum and retest policies vary in
relation to business necessity and type of test used. Typically, physical tests have been used at three
points in the selection process: (a) first selection assessment, (b) after completion and passing of the
cognitive test, and (c) after conditional offer of the job. If testing occurs at points a and b, candidates
who pass the test are placed in a pool and move onto the next assessment phase (e.g., medical exam,
interview). If one or more of the tests in a battery use any physiological parameter (e.g., heart rate)
to score a test, it is construed as a medical assessment and must be given after the conditional job
offer to be in compliance with ADA (1990). If retesting is allowed for those who fail the test bat-
tery, the time between the original and retest should be determined by the time needed to alter the
physiological state (e.g., muscular strength) and the organization’s needs. The literature is replete
with research demonstrating the ability of women and men to increase their physical capabilities
(Knapik et al., 2001; Kraemer et al., 2001; Nindl et al., 2007). From a physiological standpoint,
2–3 months of sustained exercise are required to realize substantial gains in strength and aerobic
capacity (McArdle et al., 2007). Although retesting can effectively take place 3 months after ini-
tial testing, an organization may determine that the logistics for retesting are difficult or the pool
of qualified applicants is sufficient. Conversely, the “shelf life” of test results may be impacted by
inactivity, injury, or aging for individuals not initially selected for the job. Therefore, a retest may
be appropriate prior to entry into the job.
LEGAL ISSUES
As women were denied the opportunity to enter higher paying trades and public safety jobs,
greater scrutiny was placed on the selection procedures and decision models used by organizations.
Litigation in the physical testing area focused mainly on adverse impact in the selection setting,
with a few cases related to job retention. As stated above, the physiological sex differences led to
test score differences and, in turn, disproportionately lower hiring rates for women. These test differences are not
due to test bias, because corresponding differences are found for the criterion measure of interest
(Hogan, 1991a). In fact, almost all physical tests violate the four-fifths rule, which defines adverse
impact as the passing rate of a protected group (e.g., women, minorities) being less than 80% (4/5)
of the majority group (e.g., men) (EEOC, 1978). Although almost all physical tests have an adverse
impact on women, they have been upheld when the validity evidence demonstrates the relationship
of the test and passing score(s) to the job (e.g., Porch v. Union Pacific Railroad, 1997). However,
when job analysis and/or validity evidence is lacking, the tests have not been upheld (e.g., Teresa
D. Varden v. Alabaster, AL et al., 2005). Prior papers have reviewed the physical testing litigation
(Hogan & Quigley, 1986; Terpstra, Mohamed, & Kethley, 1999). The review provided in this
section focuses on recent physical testing litigation and is organized by the laws under which the cases
were filed. The two laws, the Civil Rights Act (1964, 1991) and the ADA (1990), are similar, but they have
one difference that impacts physical testing. Both laws require showing the job-relatedness of a
selection procedure, but the ADA also requires determining whether a reasonable accommodation
was available.
ADA OF 1990
The ADA (1990) was designed to protect individuals with mental and physical disabilities in the
private sector workforce. In the federal sector, the Rehabilitation Act of 1973 (1973) is a corollary
to the ADA. Title I of the ADA states that health/medical status inquiries must follow a conditional offer
of employment and physical tests can be given prior to conditional job offer [42 U.S.C. § 12112
(b) (5-6) (d) (2-4)]. However, these stipulations impact the type of test used for pre-job offer testing.
For example, submaximal aerobic capacity tests (e.g., step, bicycle, treadmill) require monitoring of
heart rate and cannot be given during the pre-offer stage. Physical performance testing has inherent
safety issues when assessing individuals ranging in age from 20 to more than 50 years. The ACSM
provides recommendations for screening (e.g., heart rate, blood pressure) prior to participation in
exercise/testing (Whaley et al., 2006). Because of the ADA medical test restrictions (e.g., blood
pressure and heart rate taken to ensure safety when completing tests), employers have used waiver
forms and medical certification by a physician for pre-job offer testing and medical examinations
for the post-offer testing. It should be noted that use of a waiver or any prescreening approach does
not absolve the employer of responsibility for the safety of the applicant, as seen in White v. Village
of Homewood (1993), in which the court ruled that a signed waiver was not enforceable.
Most ADA litigation has dealt with medical issues (e.g., vision, diabetes, bipolar disorder) and
incumbent personnel, rather than physical performance issues (Rothstein, Carver, Schroeder, &
Shoben, 1999). However, three court cases filed by incumbents included physical testing. In Andrews
v. State of Ohio (1997), officers failed to meet the Ohio State Highway Patrol strength, aerobic, and/
or weight standards and filed suit. The court dismissed the case and stated that the plaintiffs were
not disabled or regarded as disabled. Similarly, in Smith v. Des Moines (1996) the court ruled in
summary judgment that a firefighter who failed to meet the city’s aerobic standard was not disabled,
just unfit for firefighter work. In Belk v. Southwestern Bell Telephone Company (1999), the plain-
tiff, who wore leg braces because of residual effects of polio, desired to move from a sedentary to
a physically demanding job. The plaintiff requested and received a physical test modification but
failed the test. The court ruled in favor of the plaintiff for reasons other than failing the physical
tests. On appeal, the decision was vacated and settled out of court.
Physical performance tests have also been used as a requirement for continued employment with the Massachusetts State Police. Enlisted members (i.e., troopers) who
do not pass the test (after repeated trials and remedial training) will be dismissed from the state
police regardless of age (Gebhardt & Baker, 2006).
The courts have also ruled on issues related to the development and use of passing scores for
physical performance tests. The Standards for Educational and Psychological Testing (AERA et
al., 1999) and SIOP Principles (2003) state that passing scores should be set using empirical data
that identify the relationship of the test to “relevant criteria.” The use of passing scores has more
recently been linked to business necessity and minimum qualifications (Kehoe & Olson, 2005).
One of the cases that articulated use of business necessity and minimum qualifications was Lanning
v. SEPTA (1999). Both Lanning v. SEPTA (1999, 2002) cases and their impact on passing scores
have received a great deal of attention (Gutman, 2003; Sharf, 1999, 2003). Sharf (2003) provides
an excellent overview and interpretation of the Lanning outcomes. In addition, the article provides
insight from employment attorneys pertaining to how the Lanning decision will impact the defense
of employment tests. A brief description of the Lanning cases is provided here, followed by the
impact the decisions have on physical performance testing.
In an attempt to improve the law enforcement capabilities of their police force, SEPTA imple-
mented a physical performance test battery that included a 1.5-mile run. SEPTA’s consultant
established the passing score for this test at 12 minutes, which equated to an aerobic capacity of
42.5 mL·kg⁻¹·min⁻¹. This passing score led to disparate passing rates of 55.6% for men and
6.7% for women. Plaintiffs (women who failed the test and the Department of Justice) filed suit
under Title VII against SEPTA on the basis of sex discrimination. The District Court for the first
Lanning case (1999) ruled in favor of SEPTA and concluded the test was job-related and of business
necessity. This decision was appealed and the Third Circuit remanded the case back to the District
Court. Using the instructions for business necessity provided by the Circuit Court, the District
Court convened in 2002 and again ruled in favor of SEPTA. The second appeal to the Third Circuit
affirmed the District Court ruling.
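Given the passing rates reported in the Lanning record (55.6% for men, 6.7% for women), the four-fifths adverse impact ratio discussed earlier in this chapter is easily computed:

```python
def impact_ratio(protected_rate, majority_rate):
    """Four-fifths rule: a ratio below 0.80 indicates adverse impact (EEOC, 1978)."""
    return protected_rate / majority_rate

# Passing rates from the Lanning v. SEPTA record: 6.7% of women, 55.6% of men
ratio = impact_ratio(0.067, 0.556)
print(f"impact ratio = {ratio:.2f}, adverse impact: {ratio < 0.80}")
```

The resulting ratio of roughly 0.12 falls far below the 0.80 threshold, which is why the litigation turned on job-relatedness and business necessity rather than on whether adverse impact existed.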
Some decisions from the Lanning 2002 ruling are perplexing and have been questioned in other
District Courts. The Lanning 2002 ruling suggested that the SIOP Principles (2003) are not relevant
in establishing the job-relatedness and business necessity of employment tests. In addition, the
Lanning decision stated that a successful legal defense was not predicated on compliance with all
sections of the EEOC Uniform Guidelines (1978). However, in United States v. City of Erie (2005),
a different District Court concluded that the SIOP Principles and EEOC Uniform Guidelines were
relevant in the defense of a test.
The Lanning ruling applied a stricter burden to prove job-relatedness and business necessity of
a physical performance test. To meet these requirements, there must be evidence that (a) the test
is related to job performance, (b) alternative tests and methods have been considered, and (c) the
passing score reflects the minimally acceptable level of job performance. The minimally accept-
able level of performance was defined as “likely to do the job,” not “some chance of doing the job”
(Sharf, 2003). In addition, the minimal level of performance did not have to be defined by the fitness
level of current employees. This case indicated that an employer has the right to improve the physi-
cal capabilities of their workforce. Thus, if a minimum acceptable level of test performance can be
established using empirical criteria (e.g., arrest records, archival data), it is irrelevant if incumbents
cannot meet the minimal levels. The Lanning case showed that various information sources can be
used to defend a test and establish a minimum acceptable passing score.
In EEOC v. Dial Corp (2006), the EEOC charged that a physical performance test involving
moving/lifting 35-lb bars to heights of 30 and 60 in. discriminated against women and was more
difficult than the job. Further, some women who completed the test received a fail status on the basis of
comments made by the test administrators. Dial used a business necessity defense and claimed the
test resulted in lower on-the-job injuries and that the test accurately measured the job requirements.
The EEOC noted that the reduction in injuries began 2 years prior to implementation of the test and thus
was attributable to earlier measures (e.g., job rotation, job redesign) instituted by Dial. The court
ruled in favor of the plaintiff, stating that Dial did not adequately demonstrate that the test was valid
or a business necessity. This ruling was later upheld on appeal.
The recurring theme evident in the EEOC v. Dial Corp ruling is the need for accurate job analysis
to develop an acceptable test. Dial defended the test with two pieces of flawed information: (a) the
test reflected actual job tasks and (b) use of the test resulted in a decline in on-the-job injuries.
Further, subjective judgment of an individual’s performance on physical tests should be avoided.
REFERENCES
Age Discrimination in Employment Act of 1967, 29 U.S.C. Sec. 621, et. seq. (1967).
Alspaugh v. Michigan Law Enforcement Officers Training Council, 634 N.W.2d 161 (Mich. App. 2001).
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
Americans with Disabilities Act of 1990, 42 U.S.C.A.
Anderson, C. K. (2003). Physical ability testing for employment decision purposes. In W. Karwowski & W. S.
Marras (Eds.). The occupational ergonomics handbook. New York, NY: Routledge.
Andrews v. State of Ohio, 104 F.3d 803 (6th Cir. 1997).
Arvey, R. D., Landon, T. E., Nutting, S. M., & Maxwell, S. E. (1992). Development of physical ability tests for
police officers: A construct validation approach. Journal of Applied Psychology, 77, 996–1009.
Physical Performance Tests 295
Astrand, P., Rodahl, K., Dahl, H. A., & Stromme, S. G. (2003). Textbook of work physiology (4th ed.). Champaign,
IL: Human Kinetics.
Baker, T. A. (2007, April). Physical performance test results across ethnic groups: Does the type of test have
an impact? Paper presented at the annual meeting of the Society for Industrial and Organizational
Psychology. New York, NY.
Baker, T. A., & Gebhardt, D. L. (2001). Utility of physical performance tests in reduction of days lost and
injuries in railroad train service positions. Beltsville, MD: Human Performance Systems.
Baker, T. A., & Gebhardt, D. L. (2005a). Development and validation of selection assessments for Energy
Northwest nuclear security officers. Beltsville, MD: Human Performance Systems.
Baker, T. A., & Gebhardt, D. L. (2005b). Examination of revised passing scores for state police physical per-
formance selection tests. Beltsville, MD: Human Performance Systems.
Baker, T. A., Sheppard, V. A., & Gebhardt, D. L. (2000). Development and validation of physical performance
test for selection of City of Lubbock police officer. Beltsville, MD: Human Performance Systems.
Bartlett, C. J., Bobko, P., Mosier, S. B., & Hannan, R. (1978). Testing for fairness with a moderated multiple
regression strategy: An alternative to differential analysis. Personnel Psychology, 31, 233–241.
Belk v. Southwestern Bell Telephone Company 194 F.3d 946 (8th Cir. 1999).
Berkman v. City of New York, 536 F. Supp. 177, 30 Empl. Prac. Dec. (CCH) 33320 (E.D.N.Y. 1982).
Bilzon, J. L., Scarpello, E. G., Smith, C. V., Ravenhill, N. A., & Rayson, M. P. (2001). Characterization of the
metabolic demands of simulated shipboard Royal Navy fire-fighting tasks. Ergonomics, 44, 766–780.
Blakely, B. R., Quinones, M. A., Crawford, M. S., & Jago, I. A. (1994). The validity of isometric strength tests.
Personnel Psychology, 47, 247–274.
Brunet v. City of Columbus, 58 F.3d 251 (6th Cir. 1995).
Cascio, W. F., Alexander, R. A., & Barrett, G. V. (1988). Setting cutoff scores: Legal, psychometric, and profes-
sional issues and guidelines. Personnel Psychology, 41, 1–24.
Cascio, W. F., Outtz, J. L., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test
score use in personnel selection. Human Performance, 4, 233–264.
Chaffin, D. B., Herrin, G. D., Keyserling, W. M., & Foulke, J. A. (1977). Pre-employment strength testing in
selecting workers for materials handling jobs (Report CDC-99-74-62). Cincinnati, OH: National Institute
for Occupational Safety and Health, Physiology, and Ergonomics Branch.
Chenoweth, D. (1994). Positioning health promotion to make an economic impact. In J. P. Opatz (Ed.),
Economic impact of worksite health promotion. Champaign, IL: Human Kinetics.
Civil Rights Act of 1964 (Title VII), 42 U.S.C. §2000e-2, et seq., (1964).
Civil Rights Act of 1991, S. 1745, 102nd Congress, (1991).
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal
of Educational Measurement, 5, 115–124.
Craig, B. N., Congleton, J. J., Kerk, C. J., Amendola, A. A., & Gaines, W. G. (2006). Personal and
non-occupational risk factors and occupational injury/illness. American Journal of Industrial Medicine,
49, 249–260.
Davis, P. O., Dotson, C. O., & Santa Maria, D. L. (1982). Relationship between simulated fire fighting tasks and
physical performance measures. Medicine and Science in Sports and Exercise, 14, 65–71.
Equal Employment Opportunity Commission v. Dial Corp, No. 05-4183/4311 (8th Cir. 2006).
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department
of Justice. (1978). Uniform Guidelines on Employee Selection Procedures. Washington, DC: Bureau of
National Affairs, Inc.
Fleishman, E. A. (1964). Structure and measurement of physical fitness. Englewood, NJ: Prentice Hall.
Fleishman, E. A., Gebhardt, D. L., & Hogan, J. C. (1986). The perception of physical effort in job tasks. In
G. Borg & D. Ottoson (Eds.), The perception of exertion in physical work (pp. 225–242). Stockholm,
Sweden: Macmillan Press.
Fleishman, E. A., & Quaintance, M. K. (1984). Taxonomies of human performance. New York, NY: Academic Press.
Fraternal Order of Police Local 101 v. Butler County Sheriff’s Department, #05-ULP-09-0509, 23 OPER 30
(Ohio SERB, 2006).
Gately v. Massachusetts, 92-CV-13018-MA (D. Mass. Dec. 30, 1992).
Gately v. Massachusetts, No. 92-13018 (D. Mass. Sept. 26, 1996).
Gebhardt, D. L. (1984). Revision of physical ability scales. Bethesda, MD: Advanced Research Resources
Organization.
Gebhardt, D. L. (2000). Establishing performance standards. In S. Constable & B. Palmer (Eds.), The pro-
cess of physical standards development (pp. 179–197). Wright-Patterson Air Force Base, OH: Human
Systems Information Analysis Center.
Gebhardt, D. L. (2007, April). Physical performance testing: What is the true impact? Paper presented at the
annual meeting of the Society for Industrial and Organizational Psychology. New York, NY.
Gebhardt, D. L., & Baker, T. A. (1999). Validation of physical performance tests for the selection of firefighters
in the State of New Jersey. Beltsville, MD: Human Performance Systems.
Gebhardt, D. L., & Baker, T. A. (2006). Determination of incumbent passing scores for the Massachusetts State
Police physical performance test. Beltsville, MD: Human Performance Systems.
Gebhardt, D. L., & Baker, T. A. (2007). Physical performance assessment. In S. G. Rogelberg (Ed.), Encyclopedia
of industrial and organizational psychology (pp. 179–199). Thousand Oaks, CA: Sage.
Gebhardt, D. L., Baker, T. A., Curry, J. E., & McCallum, K. (2005). Development and validation of medical
guidelines and physical performance tests for U. S. Senate Sergeant at Arms positions: Vol. I and II.
Beltsville, MD: Human Performance Systems.
Gebhardt, D. L., Baker, T. A., & Sheppard, V. A. (1992). Development and validation of physical performance
tests dockworker, hostler, and driver jobs in the freight industry. Hyattsville, MD: Human Performance
Systems.
Gebhardt, D. L., Baker, T. A., & Sheppard, V. A. (1998). Development and validation of physical performance
tests for BellSouth physically demanding jobs. Hyattsville, MD: Human Performance Systems.
Gebhardt, D. L., Baker, T. A., & Thune, A. (2006). Development and validation of physical performance, cognitive,
and personality assessments for selectors and delivery drivers. Beltsville, MD: Human Performance
Systems.
Gebhardt, D. L., & Crump, C. E. (1984). Validation of physical performance selection tests for paramedics.
Bethesda, MD: Advanced Research Resources Organization.
Gebhardt, D. L., & Crump, C. E. (1990). Employee fitness and wellness programs in the workplace. American
Psychologist, 45, 262–272.
Gebhardt, D. L., Schemmer, F. M., & Crump, C. E. (1985). Development and validation of selection tests for
longshoremen and marine clerks. Bethesda, MD: Advanced Research Resources Organization.
Gilliam, T., & Lund, S. J. (2000). Injury reduction in truck driver/dock workers through physical capability new
hire screening. Medicine and Science in Sports and Exercise, 32, S126.
Gledhill, N., & Jamnik, V. K. (1992). Characterization of the physical demands of firefighting. Canadian
Journal of Sport Science, 17, 207–213.
Golding, L. A. (2000). YMCA fitness testing and assessment manual (4th ed.). Champaign, IL: Human
Kinetics.
14 Personality: Its Measurement and Validity
Personality variables have had a roller-coaster-like ride in employee selection during the 20th and
21st centuries. They have been denounced and rescued several times over the years. We foresee
a more stable and sanguine future as evidence continues to mount documenting the importance
of personality variables as determinants of individual- and team-level performance. Indeed, one of
the most important advances in our field can be attributed to the recognition of the importance of
personality variables in determining and explaining performance. With the addition of personality
variables to our models of job performance, we are now able to explain significantly more variation
in behavior and performance than ever before (Hough, 2001).
In this chapter, we review the issues, document the evidence, and describe the consensus emerg-
ing about the usefulness of personality variables in employee selection. We describe factors that
hinder our understanding and those that help increase our knowledge of personality variables and
their role in more accurately predicting work-related criteria. We address issues related to taxo-
nomic structure, measurement methods, level of measurement, validity, and factors that threaten
and enhance the validity of personality measures.
Nonetheless, Hough and her colleagues (Hough, 1992, 1997, 1998; Hough & Oswald, 2000,
2005, 2008; Hough & Schneider, 1996; Schneider & Hough, 1995; Schneider, Hough, & Dunnette,
1996) have steadfastly criticized the FFM, concluding that it is an inadequate taxonomy of personality variables on which I-O psychology can build knowledge and understanding of the determinants of work behavior
and performance. They and others (especially Block, 1995) argued that the FFM is not comprehen-
sive, combines variables into factors that are too heterogeneous, and is method-bound, dependent
upon factor analysis of lexical terms. Variables that are not well represented at the factor level of
the FFM include autonomy versus dependency and succorance (Costa & McCrae, 1988), social
competence and insight (Davies, Stankov, & Roberts, 1998; Gough, 1968; Hogan, 1969; Schneider,
Ackerman, & Kanfer, 1996), risk taking (Paunonen & Jackson, 1996; Zuckerman, Kuhlman,
Joireman, Teta, & Kraft, 1993), masculinity/femininity (Hough, 1992; Kamp & Gough, 1986),
moral development and ego development (Kohlberg, Levine, & Hewer, 1994), tolerance for contradiction (Chan, 2004), ego depletion (Baumeister, Gailliot, DeWall, & Oaten, 2006), and emotionality (Averill, 1997; Russell & Carroll, 1999) (see Hough & Furnham, 2003, for other missing
variables).
Although some of these traits are included as lower-order facets in inventory-specific conceptu-
alizations of the FFM, they are not necessarily measuring the same trait, nor are they necessarily
narrow or homogenous enough to constitute personality facets. Compound traits such as integrity,
managerial potential, or customer service orientation (cf. Ones & Viswesvaran, 2001) are made up
of several homogeneous traits that do not necessarily covary, but all relate to a criterion of interest
(Hough & Schneider, 1996). Hough and Ones (2001) have offered a working taxonomy of personal-
ity compound traits including scales available to measure them.
In addition, a lack of generally accepted facet-level taxonomies for the Big Five domains and
the resulting reliance on inventory-specific, lower-level trait descriptions has impeded research and
practice of personality measurement relating to prediction of behaviors and performance in work
settings. This issue is only now being addressed as researchers are developing empirically derived,
facet-level taxonomies for Big Five domains (see Birkland & Ones, 2006, for emotional stability;
Connelly, Davies, Ones, & Birkland, 2008a, for agreeableness; Connelly et al., 2008b, for openness;
Connelly & Ones, 2007, as well as Roberts, Chernyshenko, Stark, & Goldberg, 2005, for conscien-
tiousness; Davies, Connelly, Ones, & Birkland, 2008, for extraversion).
Although most meta-analyses have utilized the FFM to summarize the relationships between
personality variables and job-related criteria, summaries of relationships at this broad a level
can mask relationships that emerge between narrower facets and performance constructs. Hough
and her colleagues (Hough, 1992, 1997, 1998; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990;
Hough & Oswald, 2005; Schneider, Hough et al., 1996) have long argued that focusing exclusively
on factor-level personality traits in the prediction of heterogeneous work-related criteria can be
counterproductive for a science aiming to explain the relationships between personality constructs
and work-related constructs. The predictive validity of a personality variable depends on (a) the
criterion content domain being predicted (Bartram, 2005; Hogan & Holland, 2003; Hough, 1992;
Ones, Dilchert, Viswesvaran, & Judge, 2007) and (b) the hierarchical match between the predictor
and criterion measures (Hogan & Roberts, 1996; Ones & Viswesvaran, 1996; Schneider, Hough
et al., 1996).
This is not to conclude that measurement at levels even more narrow than the facet level of the
FFM is better. Overly narrow personality constructs can impede the growth of knowledge just as
overly broad constructs can impede our science and practice (Hough, 1997; Ones, Viswesvaran, &
Dilchert, 2005; Oswald & Hough, 2008). Although it is appropriate to summarize the relationships
between narrow constructs and various criteria, it is difficult to build a science without learning
about the extent to which information and conclusions generalize at a broader construct level as
well, including the Big Five and even higher order factors (cf. Digman, 1997). Combining variables
into compound variables (such as integrity and customer service orientation) that consist of multiple
Big Five domains, such as conscientiousness, agreeableness, and emotional stability, can increase
the predictive accuracy of personality variables (Hough & Ones, 2001; Ones et al., 2007; Ones &
Viswesvaran, 2001).
Meta-analytic evidence summarizing personality-criterion relationships at various levels,
including Big Five factors, facets, and compound scales, indicates that validity varies as a function of the theoretical relevance of the predictor to the criterion, which includes similarity of
bandwidth (Hogan & Roberts, 1996; Hough, 1992; Hough et al., 1990; Rothstein & Goffin, 2000;
Schneider, Hough et al., 1996; Tett, Jackson, Rothstein, & Reddon, 1999). Personality variables
that are a priori identified as theoretically relevant to a criterion correlate more highly with the cri-
terion, and the overall predictor and criterion should be similarly heterogeneous/homogeneous.
The FFM provides an important organizing function for I-O psychology and has helped con-
nect our science to our sister sciences. At the same time, we encourage the search for personality
constructs that consist of variables with similar nomological nets to improve our understanding of
personality structure. Hough (see Hough & Ones, 2001) proposed a nomological-web clustering
approach to develop personality constructs that are conceptually and empirically similar. She pro-
posed that cluster analysis of the patterns of relationships of target variables with other variables,
including variables other than personality variables, is needed to identify homogeneous personality constructs that are characterized as having similar nomological nets. Using this logic, Hough
and Ones (2001) conducted a qualitative cluster analysis of such personality variable profiles and
recommended that others use their taxons and further improve their taxonomy. Some quantitative summaries have used this taxonomy to aggregate results across personality scales (e.g., Dudley,
Orvis, Lebiecki, & Cortina, 2006; Foldes, Duehr, & Ones, 2008). Dudley et al. (2006) examined the
criterion-related validities of four of the facets of conscientiousness defined by Hough and Ones,
and found that these facets (a) have only low to moderate correlations with each other, (b) correlate differentially with the broad conscientiousness factor, and (c) depending upon the occupation and criterion, can correlate more highly with the criterion than does global conscientiousness. Foldes et al. (2008)
used the Hough and Ones taxonomy to summarize mean score differences between Whites and dif-
ferent ethnic groups, finding that (a) facet-level mean score differences varied although the facets
all belonged to the same domain and (b) factor-level differences varied from their facet-level mean
score differences. Taken together, these summaries of very different types of information provide
construct validity evidence for the Hough and Ones personality taxonomy as well as its usefulness
for the science and practice of personnel selection. We urge others to report validities according to
Hough and Ones’ proposed structure, as well as to refine their structure to increase our understand-
ing of the pattern of relationships between personality constructs and other constructs.
MEASUREMENT METHODS
Although personality variables are typically measured with self-report, Likert-type items and scales,
other assessment methods can be used. In this section we describe reliability, construct validity,
and criterion-related validity evidence and discuss practical issues such as development cost and
cost and ease of administration of self-report measures [including Likert-type, forced choice, item
response theory (IRT), and other recent innovations] as well as several other methods of measuring
personality [i.e., biodata, interviews, situational judgment tests (SJTs), simulations, and assessment
centers]. For each of these different methods of measuring personality variables we summarize the
effects of response distortion and measurement mode (i.e., paper-and-pencil administered, com-
puter administered, and Internet administered).
being other-rated or recorded, and in most cases captured in a less standardized fashion. Personality
questionnaires elicit an individual’s responses to items and use these responses (assuming Likert-
type scaling) to express the individual’s trait standing in comparison to a normative group. What
distinguishes them from most other self-report methods is the degree of standardization they pro-
vide in eliciting test-taker responses, allowing the user to reliably compare an individual’s scores to
those of other test-takers.
Decades of research, hundreds of primary studies, and dozens of quantitative summaries have
shown that such standardized, self-report tests of personality traits provide (a) reliable assessments
(cf. Viswesvaran & Ones, 2000) and (b) scores that correlate at highly useful levels with valued
organizational outcomes and criteria. Hough and Furnham (2003), Hough and Oswald (2008), Ones
et al. (2005), and Ones et al. (2007) provided comprehensive overviews of the validity of personality
measures for predicting various valued behaviors and outcomes in organizational settings, and we
refer interested readers to these summaries.
Despite the strong empirical evidence for their validity (see the section “Validity of Personality
Constructs and Other Factors that Affect Their Usefulness,” p. 308, for details), self-report mea-
sures of personality are often criticized when used in employee selection because of the possibility
of intentional response distortion. Much of the research addressing the issue of response distortion
has focused on standardized tests (rather than other ways of assessing personality; see below) and
much of the basis of criticism of self-report measures is, as Chan (2008) concluded, more urban legend than reality. Dilchert, Ones, Viswesvaran, and Deller (2006) summarized suggested palliatives for dealing with intentional distortion and socially desirable responding on such measures and evaluated their merit, concluding that approaches such as score corrections or exclusion of
test-takers on the basis of social desirability scale scores have little merit and that future improve-
ments are more likely to come from the use of new and innovative item formats. Oswald and Hough
(2008) and Hough and Oswald (2008) also summarized the literature, cautioning that results may
differ depending on (a) item transparency (i.e., subtle vs. obvious items), (b) research setting (i.e.,
experimental vs. real-life applicant selection setting), and (c) research design [i.e., concurrent vs.
predictive (longitudinal) design]. They concluded that (a) validities for both types of scales remain
essentially intact in real-life applicant selection situations using concurrent validation studies and
(b) subtle-item scales also retain their validities in predictive designs. Below, we review new devel-
opments in this area and evaluate their promise for addressing concerns about response distortion
in typical, Likert-type self-report personality scales.
high and low scoring individuals and thus potentially lead to better selection decisions (although
criterion-related validity is unlikely to be affected because the correlation coefficient is relatively
insensitive to minor changes in rank order) (Chernyshenko et al., 2007; Stark et al., 2006).
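The point about the correlation coefficient's insensitivity to minor rank-order changes is easy to demonstrate. The sketch below uses invented scores (not data from the studies cited) and plain Python:

```python
# Sketch with invented data: Pearson r barely moves when two adjacent
# candidates trade places in the predictor rank order.

def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

predictor = [10, 12, 13, 15, 16, 18, 20, 21, 23, 25]  # e.g., trait scores
criterion = [3, 4, 4, 5, 6, 6, 7, 8, 8, 9]            # e.g., performance ratings

r_before = pearson(predictor, criterion)

# Swap two adjacent candidates on the predictor (a minor rank-order change)
swapped = predictor[:]
swapped[3], swapped[4] = swapped[4], swapped[3]
r_after = pearson(swapped, criterion)

print(round(r_before, 3), round(r_after, 3))  # values differ only slightly
```

Larger reshufflings would, of course, move r more; the point is only that the small rank-order refinements promised by improved item formats need not register in the validity coefficient.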
OTHER-REPORTS
Organizations frequently use 360º feedback measures (a form of other-reports) and observers fre-
quently rate personality characteristics of participants in simulations and assessment center exer-
cises (see below). Yet organizations rarely use other-reports of individuals’ personality that are
assessed with standardized personality measures.
A meta-analysis summarizing data published before 1996 shows that true-score correlations between self-reports and other-reports for the Big Five range from .46 to .62, suggesting that
they can increment criterion-related validity beyond that of self-reports (Connolly, Kavanagh, &
Viswesvaran, 2007). The cumulative evidence also shows that although close relatives’ descriptions
of a target person are more similar to that of the target’s self-description than are peer ratings, even
strangers can provide valuable insight into a target individual’s personality on easily observed traits,
such as extraversion (Connolly et al., 2007). Not surprisingly, traits that are more easily observed in
short-acquaintance situations (such as extraversion) also exhibit higher levels of interobserver agree-
ment (Kenny, Albright, Malloy, & Kashy, 1994).
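How modest self-other overlap creates room for incremental validity can be sketched with the standard two-predictor multiple correlation formula. The validity and overlap values below are illustrative assumptions, not estimates from the meta-analyses cited:

```python
# Sketch with assumed values: increment in validity from adding an
# other-rating to a self-rating of the same trait.
# Standard two-predictor formula:
#   R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2)

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors whose
    criterion validities are r1 and r2 and whose intercorrelation is r12."""
    r_sq = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return r_sq ** 0.5

r_self = 0.20   # assumed validity of a self-rating
r_other = 0.25  # assumed validity of an other-rating
r_so = 0.45     # assumed self-other correlation on the trait

R = multiple_r(r_self, r_other, r_so)
increment = R - max(r_self, r_other)
print(round(R, 3), round(increment, 3))
```

Because the two rating sources are far from redundant (their intercorrelation is well below unity), their combination outpredicts the better single source.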
Interestingly, a more recent meta-analysis shows that work colleagues’ perceptions of individuals’
personality traits were only slightly more accurate than strangers’ perceptions (as indexed by self-
other and other-other correlations) and consistently less accurate than those of family members,
friends, and cohabitators (Connelly, 2008). These results held regardless of whether the work
colleague was a supervisor, coworker, or reference. Despite these low self-other and other-other
correlations, a single coworker’s description of a target’s personality predicts job performance better
than does a self-rating of personality (Connelly, 2008; Mount, Barrick, & Strauss, 1994; Nilsen,
1995). Parallel results have been found for classmates’ and friends’ personality ratings predicting
grades and academic achievement (Connelly, 2008). However, only a handful of studies in Connelly’s
meta-analysis have examined the predictive validity of others’ ratings of personality, and further
research on this topic is warranted.
Studies investigating potential response distortion in standardized other-reports of personality
are also scarce. It is safe to assume that if such measures were used to elicit information from
candidates’ acquaintances, the choice of rating source would influence the degree of distortion to be
expected. However, it is unlikely that organizations are willing to rely on ratings obtained from
spouses or friends in selecting among job applicants. We see the potential for other-ratings of
personality for applications in which the source of the ratings can be standardized and verified
(e.g., personality ratings made by the last two supervisors). Other tools used in employee selection
already employ a similar rationale (e.g., letters of reference) but do not provide the benefit that
standardized ratings of personality could provide: wide distributions of scores that can be used to select candidates in, rather than merely to flag negative indicators that screen candidates out.
We encourage researchers and practitioners to explore the potential for standardized other-ratings
of personality and to conduct additional studies investigating their criterion-related validity and
potential for incrementing validity.
BIODATA
Biodata measures (also known as biographical data and autobiographical information) focus on
previous life experiences and have a long history in I-O psychology as useful predictors of work-
related criteria (see reviews by Barge & Hough, 1988; Ghiselli, 1966; Griffeth, Hom, & Gaertner,
2000; Hough, in press; Hunter & Hunter, 1984; Reilly & Chao, 1982; Schmitt, Gooding, Noe, &
Kirsch, 1984). The premise of their success is the old adage: Past behavior is the best predictor
of future behavior (consistency principle). Using Wernimont and Campbell’s (1968) terminology,
they are samples of relevant behavior rather than signs. When scale development is construct-
oriented, biodata represent another method of measuring individual differences such as personality
constructs (Hough, 1989; Hough & Paullin, 1994). As Tenopyr (1994) hypothesized, biodata scales
developed to measure personality constructs (e.g., Big Five factors and their facets) correlate
appropriately with each other and with work-related criteria (cf. Kilcullen, White, Mumford, &
Mack, 1995; Manley, Benavidez, & Dunn, 2007; Oviedo-Garcia, 2007; Sisco & Reilly, 2007a;
Stokes & Cooper, 2001).
Although intentional distortion occurs on biodata and on traditional personality scales, the evi-
dence on the extent of distortion compared to traditional personality scales is mixed. Some studies
report less distortion on biodata scales (e.g., Kilcullen et al., 1995; Sisco & Reilly, 2007b; Stokes,
Hogan, & Snell, 1993), whereas other research suggests little difference in the amount of faking
on biodata versus standard personality scales (e.g., McFarland & Ryan, 2000; White et al., 2001).
Evidence suggests that one fruitful approach to reduce distortion is to require respondents to elabo-
rate on their responses to biodata items (Schmitt & Kunce, 2002; Schmitt et al., 2003). Moreover,
both verifiable and subtle items (where the construct measured is less apparent) appear to retain
their validity when used in real-life applicant settings (Alliger, Lilienfeld, & Mitchell, 1996; Gough,
1994; Harold, McFarland, & Weekley, 2006; White, Young, Hunter, & Rumsey, 2008). Given
the advantages of biodata, it is surprising that biodata measures are not used more frequently for
employee selection (Stokes & Cooper, 2004).
INTERVIEW
Around the world the interview is probably the most frequently used employee selection assessment
method (Moscoso, 2000; Ryan, McFarland, Baron, & Page, 1999) and is most often intended
to measure personality characteristics (Huffcutt, Conway, Roth, & Stone, 2001). Huffcutt and
colleagues developed a comprehensive taxonomy of possible interview constructs that interview
questions might measure. The seven constructs were (a) mental ability, (b) knowledge and skills,
(c) basic personality characteristics (such as the Big Five), (d) applied social skills such as social
intelligence and social competence, (e) interests and preferences, (f) organizational fit, and (g)
physical attributes. They sorted 338 interview questions from 47 actual employment interviews
into the seven constructs. They found that interview questions were most often intended to
measure personality characteristics (35% of the questions), followed by applied social skills
(28%), mental ability (16%), knowledge and skills (10%), interest and preferences (4%), physical
attributes (4%), and organizational fit (3%). Sixteen percent of all questions were intended to
measure conscientiousness or its facets.1 Their study did not address the construct validity of the interview ratings but did find that interview ratings of personality correlate well with overall
job performance in various jobs. The correlations (corrected for range restriction in interview
scores and measurement error in performance evaluations) with overall job performance were .33
for extraversion, .33 for conscientiousness, .51 for agreeableness, and .47 for emotional stability.
This study and meta-analyses of the criterion-related validity of interviews in general (Huffcutt &
Arthur, 1994; McDaniel, Whetzel, Schmidt, & Maurer, 1994; Wiesner & Cronshaw, 1988) leave
little doubt about the criterion-related validity of the interview.
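For readers tracing how such corrected coefficients are produced, the sketch below applies the two classical psychometric corrections in one common sequence. The observed validity, restriction ratio, and criterion reliability are assumed values for illustration, not figures reported by Huffcutt and colleagues:

```python
# Sketch with assumed inputs: classical corrections applied to an
# observed interview validity coefficient.

def correct_for_range_restriction(r_obs, u):
    """Thorndike Case II correction for direct range restriction on the
    predictor; u = unrestricted SD / restricted SD (u > 1)."""
    return (r_obs * u) / (1 + r_obs ** 2 * (u ** 2 - 1)) ** 0.5

def correct_for_criterion_unreliability(r_obs, r_yy):
    """Disattenuation for measurement error in the criterion:
    r_c = r_obs / sqrt(r_yy)."""
    return r_obs / r_yy ** 0.5

r = 0.22                                   # assumed observed validity
r = correct_for_range_restriction(r, 1.5)  # interview scores were range restricted
r = correct_for_criterion_unreliability(r, 0.52)  # assumed rating reliability
print(round(r, 2))
```

Both corrections raise the observed coefficient, which is why the operational validities reported in these meta-analyses exceed the raw correlations observed in any single study.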
Studies that investigate the construct validity of the interview often investigate external cor-
relates, answering questions about the trait saturation of the interview ratings but often leaving
unanswered whether or not the interview measured the construct(s) intended. For example, meta-
analytic evidence indicates that interview ratings in general correlate .40 with measures of cognitive
ability (Huffcutt, Roth, & McDaniel, 1996) and .26 with measures of conscientiousness (corrected
1 The results of differential observer agreement for personality traits reviewed above can provide helpful information about
which traits are best assessed with traditional employment interviews, at least with regard to issues of reliability.
correlations, Cortina, Goldstein, Payne, Davison, & Gilliland, 2000). Barrick, Patton, and Haugland
(2000) had interviewers assess interviewee personality characteristics and correlated those ratings
with interviewee personality self-reports. Corrected correlations were .19 for conscientiousness, .24
for emotional stability, .49 for extraversion, .44 for openness, and .37 for agreeableness. As would
be expected, the overlap between self-reports and interview scores was slightly lower than typically
reported self-other correlations using standardized personality tests for the same dimensions. More
disconcerting: the two best personality predictors of job performance, conscientiousness and emo-
tional stability, were clearly not well measured in the interview—at least if self-reports are used as
the criterion. However, in this study, the interview questions were not designed specifically to mea-
sure personality characteristics; interviewers’ ratings were based on their general impression of the
interviewee personality characteristics. If assessment of personality characteristics via the interview
is the goal, questions should be developed specifically for this purpose. However, the few studies
that have examined the construct validity of personality scores obtained from interviews designed
specifically to measure personality variables do not provide much support for the construct validity
of such interview scores (e.g., Van Iddekinge, Raymark, Eidson, & Attenweiler, 2004).
A second study by Van Iddekinge, Raymark, and Roth (2005) examined the construct valid-
ity of an interview for assessing the NEO Personality Inventory (Costa & McCrae, 1992) facets
of altruism, self-discipline, and vulnerability. Interviewees described themselves using the NEO
facet scales, and experienced interviewers interviewed the mock candidates, asking them ques-
tions intended to measure the three characteristics. The interviewers provided interviewer ratings
of the personality constructs, and they completed the NEO facet scales to describe the candidates.
The study included an honest as well as an applicant-like condition. In the honest condition, con-
vergent validities of interviewer-based NEO ratings with the self-report NEO ratings averaged .32
(discriminant validities averaged .20); convergent validities of the interview ratings with self-report
NEO ratings averaged .24 (discriminant validities averaged .16). Neither type of interviewer-based
assessment of personality showed good convergent validity with the target constructs. Convergent
validities were even lower in the applicant-like condition (Van Iddekinge et al., 2005), possibly
because of the effect of response distortion in self-reports and interview scores in this condition.
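The convergent and discriminant validity averages reported in such studies follow a multitrait-multimethod logic that can be sketched simply. The correlation block below is invented for illustration, not Van Iddekinge et al.'s data:

```python
# Sketch with invented correlations between interviewer ratings (rows)
# and self-report NEO facet scores (columns) for three traits.
traits = ["altruism", "self-discipline", "vulnerability"]
r_block = [
    [0.35, 0.18, 0.15],
    [0.21, 0.30, 0.22],
    [0.17, 0.24, 0.31],
]

n = len(traits)
# Convergent validity: same trait measured by two methods (diagonal)
convergent = sum(r_block[i][i] for i in range(n)) / n
# Discriminant validity: different traits across methods (off-diagonal)
discriminant = sum(
    r_block[i][j] for i in range(n) for j in range(n) if i != j
) / (n * (n - 1))

print(round(convergent, 2), round(discriminant, 2))
```

Construct validity requires the diagonal to clearly dominate the off-diagonal; averages of roughly .32 versus .20, as in the honest condition described above, constitute only weak evidence of that.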
The somewhat disappointing results for construct validity can perhaps be improved with atten-
tion to four variables that moderate the accuracy of personality judgments: the judge, the target
individual, the trait, and information obtained (Funder, 1995). Research suggests that (a) unstruc-
tured interviews are better measures of personality variables than structured interviews (although
criterion-related validity may suffer), (b) visible traits such as extraversion and agreeableness are
better measured than less visible traits, and (c) accuracy increases with more information about the
target individual (Blackman & Funder, 2002).
SJTS
SJTs present test-takers with a scenario (in written, audio, or video format) and several response
options describing possible courses of action. For employee selection purposes, SJTs are most often
contextualized for specific occupational domains (e.g., law enforcement or customer service) and
designed to measure personality traits deemed particularly relevant (e.g., conscientiousness or extra-
version). To capture personality-relevant variance in test-taker responses, the development of sce-
narios and response options must be theory- and data-driven, and SJT scores hypothesized (based
on item content) to measure a certain personality trait should relate to external measures of the same
construct (Chan & Schmitt, 2005).
As with all individual difference measures, reliability is an important factor when evaluating
the usefulness of SJTs for employee selection purposes. SJTs are often multidimensional (McDaniel,
Hartman, Whetzel, & Grubb, 2007), rendering internal consistency estimates of little value (just as
an internal consistency reliability estimate computed across items of different scales on a traditional
personality test would be). In these circumstances, parallel form
Personality 307
reliability (Chan & Schmitt, 2002) and test-retest correlations (over a short time period; Schmidt &
Hunter, 1996) are appropriate methods of estimating reliability. However, for SJTs measuring
personality variables, both types of estimates are rarely reported in the literature, and we encourage
scientists and practitioners alike to investigate and report them to further improve our knowledge
of personality measurement using situational judgment approaches.
Another important issue concerns the distribution of constructs assessed. Providing response
options that load on conceptually and empirically different traits complicates score interpretation
and makes interindividual comparisons difficult. This is especially true in the case of personal-
ity assessment. The challenge lies in developing different response options for SJT scenarios that
are all expressions of the same personality trait, albeit at different trait levels. Making test-takers
choose between response options loading on different personality dimensions will result in ipsative
or partially ipsative scores, limiting their usefulness for employee selection purposes (see earlier
discussion on ipsativity).
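The ipsativity problem can be made concrete with a small sketch. The forced-choice items, trait names, and scoring below are illustrative, not drawn from any instrument discussed in the chapter:

```python
# Hypothetical forced-choice SJT scoring: each chosen option awards a point
# to the trait it loads on, so every respondent's trait totals sum to the
# number of items -- the hallmark of ipsative scores.
TRAITS = ["conscientiousness", "extraversion", "agreeableness", "openness"]

def score_forced_choice(responses):
    """responses: list of chosen trait names, one per item."""
    totals = {t: 0 for t in TRAITS}
    for choice in responses:
        totals[choice] += 1
    return totals

person_a = score_forced_choice(["conscientiousness"] * 8 + ["extraversion"] * 2)
person_b = score_forced_choice(["conscientiousness"] * 5 + ["openness"] * 5)

# Both profiles sum to 10 regardless of absolute trait standing:
assert sum(person_a.values()) == sum(person_b.values()) == 10
```

Because every profile sums to the same constant, a higher conscientiousness score for one candidate than another may reflect the first candidate's other traits being low rather than genuinely higher conscientiousness, which is why such scores hamper interindividual comparison in top-down selection.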
Another important issue in assessing personality with the use of SJTs is that of response instruc-
tions. A major distinction is between behavioral tendency versus knowledge instructions, sometimes
conceptualized as “would do” versus “should do.” Although there are variations on these themes
(e.g., asking respondents what they “would do,” what they “would most and least likely do,” and
what they “have done in the past”), different behavioral tendency instructions produce scores that
are highly correlated (Ployhart & Ehrhart, 2003). Conceptually, SJTs administered with behavioral
tendency instructions are more likely to elicit responses that resemble future behavior on the job,
rather than mere knowledge of appropriate responses to a given scenario.
A recent meta-analysis helps shed light on the constructs (including personality traits) typically
assessed using SJTs under varying response instructions (McDaniel et al., 2007). We used the meta-
analytic true-score correlations provided by McDaniel et al. in combination with Big Five intercor-
relations to estimate the amount of personality variance typically observed in SJTs. We obtained
meta-analytic Big Five intercorrelations from the Ones (1993) meta-analysis2 and attenuated them to
reflect observed relationships (using meta-analytic reliability estimates from Viswesvaran & Ones,
2000). A multiple regression of SJT scores on the Big Five indicates that at the construct level, 25%
of the variance assessed by SJTs with behavioral tendency instructions (“what would you do?”)
is personality (Big Five) variance. If SJTs are administered with knowledge instructions (“what
should you do?”), less than 10% of the variance is explained by the Big Five. This suggests that SJTs
with behavioral tendency instructions are better suited to measure personality traits. “Would do”
instructions elicit more personality-saturated responses and thus predict outcomes in organizational
settings that are often predicted by personality variables (e.g., overall job performance, contextual
performance, leadership, job satisfaction). Unfortunately, McDaniel et al.’s (2007) meta-analyses
of SJT validity by response instructions could only investigate validity for predicting overall job
performance, not for performance facets (e.g., task vs. contextual performance). For overall job
performance, the operational validity for SJTs with behavioral tendency instructions (those most
strongly related to the Big Five domains of personality) was .26 (corrected for sampling error and
attenuation due to criterion unreliability).
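The construct-level computation described above, regressing SJT scores on the Big Five using only correlations, can be sketched as follows. With standardized variables, the squared multiple correlation is R² = r′R⁻¹r, where R is the predictor intercorrelation matrix and r the vector of SJT-Big Five correlations. Every numeric value below is a hypothetical placeholder, not a value from McDaniel et al. (2007) or Ones (1993):

```python
# Sketch: proportion of SJT variance explained by the Big Five, computed
# entirely from correlations (all values hypothetical).

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Hypothetical observed Big Five intercorrelations (symmetric, unit diagonal)
Rxx = [
    [1.00, 0.20, 0.25, 0.15, 0.10],
    [0.20, 1.00, 0.30, 0.20, 0.15],
    [0.25, 0.30, 1.00, 0.25, 0.20],
    [0.15, 0.20, 0.25, 1.00, 0.10],
    [0.10, 0.15, 0.20, 0.10, 1.00],
]
# Hypothetical SJT-Big Five correlations ("would do" instructions)
rxy = [0.30, 0.25, 0.20, 0.25, 0.10]

beta = solve(Rxx, rxy)  # standardized regression weights
r_squared = sum(b * r for b, r in zip(beta, rxy))
print(f"Big Five variance in SJT scores: {r_squared:.2f}")
```

The same two-step logic (invert the predictor intercorrelation matrix, weight the validities) underlies the 25% and sub-10% variance estimates reported above for behavioral tendency versus knowledge instructions.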
or screening tools in high-risk jobs). This is also true for what can be considered their building
blocks—single exercises or simulations that can be administered individually to assess personal
characteristics.
Motowidlo, Dunnette, and Carter (1990) described a simulation as any situation that “present[s]
applicants with a task stimulus that mimics an actual job situation” (p. 640). Simulations are now
considered situational tests with greater fidelity than paper-and-pencil tests (Thornton & Rupp,
2003). Most often, these simulations are designed to assess constructs that are supposedly hard to
measure with paper-and-pencil tests. However, this claim is rarely substantiated, and construct-
validity evidence for the traits underlying performance on simulations is often sparse, making it dif-
ficult to evaluate whether the same traits could be assessed by other methods. A systematic review
of the available literature reveals that many dimensions assessed in ACs are at least conceptually
related to personality dimensions (Arthur, Day, McNelly, & Edens, 2003).
Arthur et al.’s (2003) construct-based meta-analysis of the AC method has shown that personality-
based AC dimensions, especially influencing others (a facet of extraversion), possess predictive
validity that rivals that of cognitive-ability-based dimensions such as problem solving. A recent
survey of AC practices among 97 organizations in western Europe and North America (Krause &
Thornton, 2009) shows that personality-based, extraversion-related dimensions are among those
most commonly assessed in ACs, in addition to interpersonal ones conceptually related to agree-
ableness (e.g., consideration of others).
In ACs, personality-relevant variance is captured using simulations and exercises such as role-
plays, group discussions, or in-baskets. An early meta-analysis by Scholz and Schuler (1993) revealed
an interesting pattern of findings, indicating that scores obtained in group discussion exercises
mainly captured openness to experience, dominance, and self-confidence (ρ = .46, .34, and .39,
respectively, N = 236–318), whereas in-basket exercises only reflected dominance (ρ = .23, N = 273).
A recent large-scale investigation in two primary samples (N = 3,748–4,770) showed that scores on
many simulations correlate only negligibly with personality characteristics, with the exception of
extraversion (Ones & Dilchert, 2008).
Simulations and exercises are often tailored to a given job context to make them more realistic
and face-valid. However, design features can impact the nature of the construct measured and the
quality of the measurement (e.g., reliability). For example, a leaderless group discussion that is
competitive (e.g., framed in a negotiation scenario) is likely to elicit behaviors indicative of differ-
ent personality traits than a discussion that is cooperative (e.g., framed in a team problem-solving
context). The selection of simulations and exercises for the prediction of specific criteria should
take such issues into account. Factors such as observability also affect the reliability and validity
of scores. The survey by Krause and Thornton (2009) indicated that in about 50% of organizations
surveyed, most (>75%) AC exercises are specifically developed for an organization. Customization
is costly. If customization elicits behavior indicative of traits particularly valued in a given context,
the cost is likely worthwhile. However, if customization is simply to increase face validity, it is a
missed opportunity.
Mount, and Judge (2001); Dudley et al. (2006); Hogan and Holland (2003); Hogan and Ones (1997);
Hough and Furnham (2003); Hough and Ones (2001); Hough and Oswald (2008); Ones et al. (2007);
Ones, Viswesvaran, and Schmidt (1993); Roberts, Kuncel, Shiner, Caspi, and Goldberg (2007); and
Rothstein and Goffin (2006). These summaries indicate that personality constructs predict many
important criteria, including major life outcomes. The list of criteria that are well predicted by per-
sonality variables includes, among others:
• Overall job performance: Conscientiousness, r_true = .23 (Barrick & Mount, 1991) and
r_operational = .20 (Hurtz & Donovan, 2000); personality-based integrity tests, r_operational = .37
(Ones et al., 1993)
• Organizational citizenship: Conscientiousness, r_observed = .19; positive affectivity, r_observed =
.16 (Borman, Penner, Allen, & Motowidlo, 2001)
• Counterproductive work behavior: Conscientiousness, r_operational = −.26 (Salgado, 2002);
dependability, r_true = −.34 (Dudley et al., 2006); personality-based integrity tests, r_operational =
−.32 (Ones et al., 1993)
• Managerial effectiveness: Dominance, r_operational = .27; energy level, r_operational = .20; achieve-
ment orientation, r_operational = .17 (Hough, Ones, & Viswesvaran, 1998); conscientiousness,
r_true = .22 (Barrick & Mount, 1991)
• Customer service: Customer service scales, r_operational = .34 (Ones & Viswesvaran, 2008)
• Job satisfaction: Emotional stability, r_true = .29; conscientiousness, r_true = .26; extraversion,
r_true = .25; agreeableness, r_true = .17 (Judge et al., 2002)
• Divorce: Conscientiousness, r_observed = −.13; emotional stability, r_observed = −.17; agreeable-
ness, r_observed = −.18 (Roberts et al., 2007)
• Mortality: Conscientiousness, r_observed = −.09; extraversion/positive emotion, r_observed = −.07;
emotional stability, r_observed = −.05; agreeableness/lack of hostility, r_observed = −.04; each
greater than the effects of socioeconomic status and IQ (Roberts et al., 2007)
Ones et al. (2005) have summarized the meta-analytic evidence for compound personality scales
in predicting work-related criteria and shown that these scales have high validity in predicting the
specific criteria they were developed for and for overall job performance. We readily acknowl-
edge that it is not necessarily Big Five factors that predict valued outcomes. Indeed, we argue that
(a) more specific criteria are predicted by more narrow personality traits; (b) complex criteria are
predicted by theoretically appropriately matched predictors; and (c) for some of the criteria listed
above, the highest predictive validities are not necessarily obtained at the factor level.
We do not want to underestimate the importance of the FFM. It has provided a structure for
thinking about personality variables. Prior to its acceptance, personality and I-O psychology had
little basis for generalization: personality measures and variables numbered in the hundreds, and
there were different names for the same or similar constructs, or the same name for different
constructs. We are not advocating a return to the “good old daze” (Hough, 1997). We
applaud the interest and evidence coming from studies that examine facet-level variables of the
FFM. We urge more such research, especially research based on empirically derived, generalizable,
facet-level personality taxonomies.
The Hough and Furnham (2003) summary of meta-analyses according to predictor and criterion
construct provided excellent examples of how predictor-criterion relevance affects the relation-
ship between the two. In addition, validities are typically higher in concurrent validation studies
compared to longitudinal validity studies (see Lievens, Dilchert, & Ones, 2005, for exceptions).
Validities are also higher in “weak” situations in which people have more autonomy and control
compared with “strong” situations in which people have few options. Trait activation theory (Tett &
Burnett, 2003; Tett & Guterman, 2000) provides an integrated framework for understanding how
the situation can explain variability in the magnitude of the relationships between personality and
behavior and performance.
INCREMENTAL VALIDITY
Personality variables can increment criterion-related validity in at least two ways. One way
is in combination with other relevant personality variables. A second way is in combination with
other individual variables such as measures of cognitive ability. If the predictor variables have low
intercorrelations but do correlate with the criterion variable, criterion-related validity will increase
when they are used in combination.3 Personality variables generally have low correlations with
cognitive ability measures and do increment validity when jointly used (Bartram, 2005; McHenry,
Hough, Toquam, Hanson, & Ashworth, 1990; White et al., 2008). When used in combination with
other measures such as the interview, biodata, and situational judgment, personality variables also
increment validity (DeGroot & Kluemper, 2007; McManus & Kelly, 1999).
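The logic above can be illustrated with the standard two-predictor formula: for standardized predictors, R² = (r1² + r2² − 2·r1·r2·r12) / (1 − r12²), so the lower the predictor intercorrelation r12, the larger the increment. The validity values below are illustrative placeholders, not figures from the studies cited:

```python
# Sketch: incremental validity of a second predictor depends on how weakly
# the two predictors correlate with each other (all values hypothetical).
import math

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two standardized predictors."""
    return math.sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

r_gma = 0.50    # hypothetical cognitive ability-performance validity
r_consc = 0.25  # hypothetical conscientiousness-performance validity

# Near-zero predictor intercorrelation: conscientiousness adds validity
low = multiple_r(r_gma, r_consc, 0.05)
# Highly redundant predictors: little incremental validity
high = multiple_r(r_gma, r_consc, 0.60)

print(f"R with r12 = .05: {low:.3f} (increment {low - r_gma:+.3f})")
print(f"R with r12 = .60: {high:.3f} (increment {high - r_gma:+.3f})")
```

With r12 = .05 the composite gains roughly .05 in validity over cognitive ability alone; with r12 = .60 the gain nearly vanishes, which is why the low personality-ability correlations noted above make joint use attractive.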
ADVERSE IMPACT
Group mean score differences on measures used in employee selection are one of the major factors
determining adverse impact against protected groups, in addition to the selection ratio and score
variability. Hough, Oswald, and Ployhart (2001) summarized studies that examined mean score
differences between Whites and various ethnic minorities, between men and women, and between
older and younger people on personality traits, cognitive ability, and physical abilities. They found
essentially no differences between Whites and ethnic minorities for most personality variables.
They also examined mean score differences between groups at the facet-level of the Big Five with
some unexpected findings: For some facets, mean-score differences differed from those of their
respective Big Five factor (e.g., a Black-White difference of d = −.10 on global extraversion, but .12
on surgency/dominance). A recent meta-analysis of race and ethnic group differences on personal-
ity measures also showed modest differences between Whites and ethnic minority groups on other
facets of the Big Five (Foldes et al., 2008) and again established that differential patterns may exist
for Big Five factors and facets (e.g., a Black-White difference of −.12 on global emotional stability
measures but .17 on self-esteem, a facet of emotional stability). Table 8 of Foldes et al. (2008) also
provides a summary of scenarios based on majority/minority group selection ratios under which
these observed group differences can result in adverse impact.
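The interplay of group mean differences and selection ratios noted above can be sketched under a simple normal model: given a standardized difference d and a common top-down cutoff, each group's selection rate follows from the normal distribution, and their ratio indexes adverse impact. The d value and selection ratio below are hypothetical, not values from Foldes et al. (2008):

```python
# Sketch: how a standardized group difference (d) plus a common cutoff
# yields different selection rates and an adverse impact ratio. Assumes
# normal score distributions with equal SDs; all inputs are hypothetical.
from statistics import NormalDist

def selection_rates(d, majority_selection_ratio):
    """Minority mean sits d SDs below the majority mean; same cutoff for both."""
    majority = NormalDist(mu=0.0, sigma=1.0)
    minority = NormalDist(mu=-d, sigma=1.0)
    cutoff = majority.inv_cdf(1.0 - majority_selection_ratio)
    majority_rate = 1.0 - majority.cdf(cutoff)
    minority_rate = 1.0 - minority.cdf(cutoff)
    return majority_rate, minority_rate, minority_rate / majority_rate

maj, mino, ai_ratio = selection_rates(d=0.17, majority_selection_ratio=0.30)
print(f"majority {maj:.2f}, minority {mino:.2f}, AI ratio {ai_ratio:.2f}")
# The four-fifths rule flags adverse impact when this ratio falls below .80.
```

Even a modest d in the range reported for some facets pushes the ratio toward the four-fifths threshold once the selection ratio tightens, which is the pattern the Foldes et al. (2008) scenario table summarizes.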
3 Suppressor variables (i.e., variables that are uncorrelated with the criterion but correlated with another predictor) can
increment criterion-related validity; such cases are very rare among personality variables.
linearity, the same assumption that is critical to traditional utility approaches (cf. Schmidt, Hunter,
McKenzie, & Muldrow, 1979).
Nonlinearity has been postulated in other predictor domains (such as cognitive abilities), but few
if any such predictor-criterion relationships have been substantiated based on empirical evidence
(cf. Coward & Sackett, 1990). At least two plausible scenarios of nonlinearity between personality
traits and criterion variables that would affect employee selection can be envisioned. The first is an
asymptotic relationship, in which predictor scores cease to matter beyond a certain level (e.g., beyond
a certain point, all conscientious individuals keep their workplace similarly tidy, and differences in
orderliness do not translate into performance differences). The second is a U-shaped (or inverted
U-shaped) function, in which the direction of the relationship actually reverses beyond a certain
level of predictor scores. Other types of nonlinear relationships are also possible
(see Guion, 1991), but for most selection systems, especially those that use rigorous top-down selec-
tion, the two relationships described above are most detrimental to the utility of a hiring strategy
that had assumed a linear relationship between independent and dependent variables.
In the case of an asymptotic relationship between personality and criterion scores, there is still
potential utility in using personality as part of a selection process. Higher scoring applicants have
higher predicted performance than many low-scoring individuals, even if the highest scoring can-
didates are indistinguishable from one another in terms of their performance once hired. Many
organizations using minimum standards or defined cut-off scores do so because of the implicit
assumption that predictor scores do not matter after a certain cutoff (Coward & Sackett, 1990).
Nonetheless, the utility of selecting for a given characteristic depends on the make-up of the appli-
cant pool (e.g., how many individuals score above a certain level on the predictor) and factors such
as the number of applicants hired.
If, however, personality-performance relationships are described by an inverted U-shaped func-
tion, the detrimental effect on overall utility of a selection system is significant. In cases where
predictor-criterion relationships reverse direction, top-down hiring could result in the acceptance of
applicants who display high predictor scores but actually perform worse than some lower scoring
candidates. Such systematically documented nonlinearity would pose a threat to the usual opera-
tional use of personality measures.
In general, nonlinear relationships occur when the independent or dependent variable is non-nor-
mally distributed. Most personality traits used in employee selection (rather than screening) are nor-
mally distributed; thus, there is little concern. Nonlinearity may be more of an issue when measures
are used that assess only a narrow, extreme range of the trait continuum. For example, a measure of
abnormal personality designed to detect infrequently occurring psychopathological characteristics
will have a nonnormal predictor score distribution in most applicant populations. However, most of
these measures are not suitable for pre-offer employee selection and are most commonly employed
for screening out extreme cases after a conditional job offer has been made. Similarly, some par-
ticular performance indicators, most notably those of distal performance outcomes (e.g., salary,
frequency of citations to published works), are known to be nonnormally distributed and could thus
lead to nonlinear relationships.
Very few studies have investigated the issue of nonlinear personality-criterion relationships. An
early study reported nonlinear relationships between impulsive expression and two narrow per-
formance measures in a small sample of accountants (Day & Silverman, 1989). A study involving
several larger samples found no evidence of nonlinearity for conscientiousness predicting overall
job performance (Robie & Ryan, 1999). Other research reported nonlinear relationships between
conscientiousness and job performance in two samples (one involved biodata and an SJT, another
involved a traditional personality test); however, the nature of the nonlinear relationships differed
(inverted U vs. asymptote) (LaHuis, Martin, & Avis, 2005). Benson and Campbell (2007) reported
nonlinear relationships between composites of “dark-side” personality traits and leadership as
assessed in AC dimensions and supervisory ratings. Large-scale investigations that include per-
formance-relevant personality traits, diverse samples of occupations, and various measures of job
performance are needed to determine whether nonlinearity occurs in organizational settings, and if
so, what its impact is on employee selection.
CONCLUSIONS
We understand personality and its role in determining work behavior and performance better than
ever before. Although the FFM has provided an important framework to organize our research and
systematically cumulate evidence, understanding personality and personality-criterion relationships
requires more variables, including broader and narrower variables. Current research examining
the taxonomic structure at the facet-level of the FFM will benefit science and practice as generally
accepted models emerge. Such models allow us to move beyond inventory-specific investigations of
limited generalizability to cumulating results across studies and settings, thus enabling systematic
investigations of moderator variables. Such models also enhance our theory building and theory
testing. As our knowledge of personality-criterion relationships grows for different hierarchical lev-
els of predictor and criterion variables, we learn how to combine predictor variables into criterion-
appropriate variables that will enhance the prediction of valued outcomes in applied settings.
The prospects of better understanding the determinants of work behavior and performance are
exciting. Primary studies, meta-analyses, and second-order meta-analyses already provide ample
evidence that traditional self-report questionnaires of personality are among the most powerful
predictors of behavior in work settings. New developments in assessment and scoring methods
show promise for further improvements in measurement and prediction. Although initial optimism
regarding alternate response formats (e.g., fully ipsative forced-choice scales) proved unjustified,
other innovations (e.g., ideal point response methods and adaptive testing based on IRT) are prom-
ising ways to address concerns about traditional self-reports of personality on Likert-type scales.
Moreover, I-O psychologists have several other assessment tools at their disposal to measure per-
sonality (e.g., biodata, interviews, other-reports, SJTs, and ACs).
In addition to improving measurement using self-report personality measures, we encourage
researchers to thoroughly investigate the value of standardized other-reports in relation to occu-
pational criteria. The few studies that have investigated their criterion-related validity suggest that
other-reports may be even more valid for certain criteria than are self-report measures of personal-
ity. Other-reports can reliably capture personality variance that improves construct coverage and
thus have the potential to increment criterion-related validity. More evidence for the validity of
other reports must be established and moderator variables (such as rating source) more systemati-
cally investigated before organizations will be persuaded to implement such measures more fully
in employee selection.
Personality variables add significant explanatory and predictive power beyond other variables
(e.g., educational credentials, cognitive ability, work experience) often assessed during employment
decision-making. With better understanding of the structure of personality and criterion variables
and better measurement of both, personality will be more fully recognized for its very important
role in determining work behavior and performance.
REFERENCES
Alliger, G. M., Lilienfeld, S. O., & Mitchell, K. E. (1996). The susceptibility of overt and covert integrity tests
to coaching and faking. Psychological Science, 7, 32–39.
Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity
of assessment center dimensions. Personnel Psychology, 56, 125–154.
Averill, J. R. (1997). The emotions: An integrative approach. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.),
Handbook of personality psychology (pp. 513–541). San Diego, CA: Academic Press.
Barge, B. N., & Hough, L. M. (1988). Utility of biographical data for the prediction of job performance. In
L. M. Hough (Ed.), Literature review: Utility of temperament, biodata, and interest assessment for predicting
job performance (ARI Research Note 88-020). Alexandria, VA: U.S. Army Research Institute.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Barrick, M. R., & Mount, M. K. (2005). Yes, personality matters: Moving on to more important matters. Human
Performance, 18, 359–372.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the
new millennium: What do we know and where do we go next? International Journal of Selection and
Assessment, 9, 9–30.
Barrick, M. R., Patton, G. K., & Haugland, S. N. (2000). Accuracy of interviewer judgments of job applicant
personality traits. Personnel Psychology, 53, 925–951.
Bartram, D. (2005). The Great Eight competencies: A criterion-centric approach to validation. Journal of
Applied Psychology, 90, 1185–1203.
Baumeister, R. F., Gailliot, M., DeWall, C., & Oaten, M. (2006). Self-regulation and personality: How interven-
tions increase regulatory success, and how depletion moderates the effects of traits on behavior. Journal
of Personality, 74, 1773–1801.
Benson, M. J., & Campbell, J. P. (2007). To be, or not to be, linear: An expanded representation of personality
and its relationship to leadership performance. International Journal of Selection and Assessment, 15,
232–249.
Birkland, A. S., & Ones, D. S. (2006). The structure of emotional stability: A meta-analytic investigation. Paper
presented at the International Congress of Applied Psychology, Athens, Greece.
Blackman, M. C., & Funder, D. C. (2002). Effective interview practices for accurately assessing counterpro-
ductive traits. International Journal of Selection and Assessment, 10, 109–116.
Block, J. (1995). A contrarian view of the five-factor approach to personality description. Psychological
Bulletin, 117, 187–215.
Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001). Personality predictors of citizenship
performance. International Journal of Selection and Assessment, 9, 52–69.
Chan, D. (2004). Individual differences in tolerance for contradiction. Human Performance, 17, 297–324.
Chan, D. (2008). So why ask me? Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg
(Eds.), Statistical and methodological myths and urban legends: Received doctrine, verity, and fable in
the organizational and social sciences. Hillsdale, NJ: Lawrence Erlbaum.
Chan, D., & Schmitt, N. (2002). Situational judgment and job performance. Human Performance, 15,
233–254.
Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Voskuijl, & N. Anderson (Eds.),
Handbook of selection (pp. 219–242). Oxford, England: Blackwell.
Chernyshenko, O. S., Stark, S., Drasgow, F., & Roberts, B. W. (2007). Constructing personality scales under the
assumptions of an ideal point response process: Toward increasing the flexibility of personality measures.
Psychological Assessment, 19, 88–106.
Christiansen, N. D. (1997). The development and validation of a job-related choice method of personality
assessment. Unpublished doctoral dissertation, Northern Illinois University, DeKalb, IL.
Connelly, B. S. (2008). The reliability, convergence, and predictive validity of personality ratings: Another
perspective. Unpublished doctoral dissertation, University of Minnesota, Minneapolis, MN.
Connelly, B. S., Davies, S. E., Ones, D. S., & Birkland, A. S. (2008a, April). Agreeableness: A meta-analytic
review of structure, convergence, and predictive validity. In J. P. Thomas & C. Viswesvaran (Chairs),
Personality in the workplace: Advances in measurement and assessment. Symposium conducted at the
annual conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.
Connelly, B. S., Davies, S. E., Ones, D. S., & Birkland, A. S. (2008b, January). Opening up openness: A meta-
analytic review of measures of the personality construct. Poster session presented at the annual confer-
ence of the Society for Personality and Social Psychology, Albuquerque, NM.
Connelly, B. S., & Ones, D. S. (2007, April). Combining conscientiousness scales: Can’t get enough of the trait,
baby. Poster session presented at the annual conference of the Society for Industrial and Organizational
Psychology, New York, NY.
Connolly, J. J., Kavanagh, E. J., & Viswesvaran, C. (2007). The convergent validity between self and observer
ratings of personality: A meta-analytic review. International Journal of Selection and Assessment, 15,
110–117.
Cortina, J. M., Goldstein, N. B., Payne, S. C., Davison, H., & Gilliland, S. W. (2000). The incremental validity
of interview scores over and above cognitive ability and conscientiousness scores. Personnel Psychology,
53, 325–351.
Costa, P. T., & McCrae, R. R. (1988). From catalog to classification: Murray’s needs and the Five-Factor
Model. Journal of Personality and Social Psychology, 55, 258–265.
Costa, P. T., & McCrae, R. R. (1992). The revised NEO Personality Inventory (NEO PI-R) and NEO Five-
Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
15 Values, Styles, and
Motivational Constructs
David Chan
For several decades now, cognitive ability and personality traits have been the two major types of predictors examined in employee selection research. Construct-oriented studies have focused on the structure and taxonomy of cognitive ability (see Chapter 12, this volume) and personality traits (see Chapter 14, this volume), as well as the validity evidence for these two types of constructs. In contrast, selection researchers have paid little attention to other types of individual difference predictors, such as those in the domains of values, cognitive styles, and motivational constructs.
To the extent that individual differences in values, cognitive styles, and motivations are distinct
from cognitive ability and personality constructs, and to the extent that these individual differences
predict work-relevant attitudes, perceptions, and behaviors, there is a need in selection research to
direct more attention to these “nontraditional” predictor constructs. The purpose of this chapter is
to provide an overview of the major values, cognitive styles, and motivational constructs that are
likely relevant in employee selection research. In the following sections, I discuss each of these three construct domains with three objectives: (a) to explain the basic conceptualizations of the focal constructs and their potential value in employee selection research and practice; (b) to illustrate the variety of constructs and present the theory and research associated with their structure and validity; and (c) to discuss current concerns and emerging issues in the conceptualization and measurement of these constructs. I end the chapter with a discussion of practical considerations in the use of these constructs in employee selection and a proposed strategic agenda for future research.
VALUES
Psychological research interest in the concept of values may be traced back to the publication of Rokeach's (1973) influential book The Nature of Human Values and the Rokeach Value Survey, which he developed to measure the various value constructs described in the book. Subsequent researchers who have examined the structure of values or the criterion-related validities of values have tended to rely on Rokeach's conceptual definition, which refers to the individual's "enduring belief that a specific mode of conduct or end-state of existence is personally or socially preferable to an opposite or converse mode of conduct or end-state of existence" (Rokeach, 1973, p. 5). Although researchers have defined values in different ways, there appears to be a consensus across their conceptual definitions that values are the individual's stable beliefs that serve as general standards by which he or she evaluates specific things, including people, behaviors, activities, and issues. These standards of evaluation are also considered abstract goals, which serve as important guiding principles in the individual's life. There is also agreement that values are more general than attitudes, in that the latter are more referent-specific. Values are further differentiated from interests in that the former are scaled on relative importance, whereas the latter are scaled on relative liking.
STRUCTURE OF VALUES
Following his conceptual definition of values, Rokeach (1973) made two useful distinctions in the structure of values. The first distinction is between instrumental values and terminal values. Instrumental values concern modes of conduct: they refer to the subjective desirability of actions or conduct, such as being honest, obedient, or courageous, which are presumed to be means to certain desirable outcomes. Terminal values concern end-states of existence: they refer to the subjective desirability of life outcomes such as equality or a peaceful world. The second distinction is between values about the well-being of the self and values about the well-being of others. On the basis of these two distinctions, Rokeach produced a useful taxonomy of four major types of values by factorially crossing the two independent distinctions. Instrumental values that are self-oriented are called competence values (e.g., being ambitious, independent), whereas instrumental values that are other-oriented are called moral values (e.g., being altruistic, forgiving). Terminal values that are self-oriented are called personal values (e.g., a materially comfortable life, being a well-respected person), whereas terminal values that are other-oriented are called social values (e.g., a peaceful world, a society with little or no inequality).
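Because Rokeach's taxonomy is a factorial crossing of two binary distinctions, it can be summarized as a simple lookup table. The sketch below is illustrative only; the example values in the comments are drawn from the text, and the function name is an assumption introduced here, not part of Rokeach's work.

```python
# Rokeach's (1973) 2 x 2 taxonomy: crossing value type (instrumental vs.
# terminal) with value focus (self vs. other) yields four value categories.
ROKEACH_TAXONOMY = {
    ("instrumental", "self"): "competence values",   # e.g., being ambitious
    ("instrumental", "other"): "moral values",       # e.g., being forgiving
    ("terminal", "self"): "personal values",         # e.g., a comfortable life
    ("terminal", "other"): "social values",          # e.g., a peaceful world
}

def classify_value(value_type: str, focus: str) -> str:
    """Return the Rokeach category for a (type, focus) pair."""
    return ROKEACH_TAXONOMY[(value_type, focus)]
```

For example, `classify_value("terminal", "other")` returns `"social values"`, mirroring how "a peaceful world" is categorized in the taxonomy.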
Schwartz (1992) argued that the conceptual distinction between instrumental and terminal values, although intuitively attractive, may not be necessary and may in fact create confusion, because in many cases the same value may be construed as both a means and an end. For example, pleasure may be construed as a terminal value, but it may also serve as an instrumental value in promoting other terminal values such as happiness. Likewise, an instrumental value such as being honest could also be seen as a terminal value to be promoted by other instrumental values, such as being courageous.
Dissatisfied with the typology of values provided by Rokeach (1973), Schwartz and his colleagues (Schwartz, 1992; Schwartz & Bilsky, 1990) proposed a new framework, or structure, of values that they believed to have universal content applicable across cultures. Schwartz and colleagues presented respondents with items representing specific values and asked them to rate the importance of each value in their lives. On the basis of these importance ratings, obtained from large and diverse samples of respondents, Schwartz organized the large variety of individuals' specific values into ten value types (e.g., power, achievement, hedonism, self-direction). Schwartz further
proposed that the ten value types may be organized at a higher level into two bipolar value dimensions: openness to change versus conservation, and self-enhancement versus self-transcendence. However, research using Schwartz's framework has focused almost exclusively on the ten value types, probably because of the generic (and hence less useful) nature of the two bipolar value dimensions. In addition to the ten value types at the individual level, Schwartz proposed seven value dimensions at the cultural level to allow for cross-cultural comparisons in value research. Examples of these cultural-level value dimensions are prosocial (active protection or enhancement of the welfare of others), restrictive conformity (restraint of actions likely to harm others or violate norms), and security (safety, harmony, and stability of the society or groups with whom one identifies).
A major contribution of Schwartz's framework is that, in addition to providing a categorization of values at the individual level, it offers a conceptual guide for understanding and comparing cultures in terms of value dimensions. There is considerable empirical evidence that the framework, including the ten value types and seven cultural dimensions, can be used on a global basis to identify and understand the content and structure of values across diverse cultures (e.g., Schwartz & Sagiv, 1995). To date, Schwartz's framework represents the most comprehensive typology of values at the individual and cultural levels of analysis, and there is also a relatively large research literature on the results of the Schwartz Value Survey administered in diverse cultures.
Another large-scale value survey project is the well-known World Values Survey, which was developed from the original European Value Survey. The first World Values Survey, conducted in 1981, included only 22 countries, 14 of them outside of Europe. The second wave, which included 42 countries, was conducted 10 years later. Subsequent waves, covering increasingly more countries, were conducted at approximately 5-year intervals. Results of the World Values Survey are freely available at http://www.worldvaluessurvey.com. One of the best-known interpretations of the results of the World Values Survey is that the many values across countries may be factor-analytically summarized into two global dimensions of cultural variation, labeled "traditional versus secular-rational" and "survival versus self-expression." Given the large-scale results on diverse cultures available from the World Values Survey and the Schwartz Value Survey, the utility of these two value frameworks is likely to continue for many years.
1. An adequate structure of values clarifying taxonomy and typology issues (i.e., number,
level, and type of values) is fundamental for the study of values to contribute to the science
and practice of employee selection. In employee selection, the criterion constructs of inter-
est are primarily work-relevant attitudes and behaviors. The structure of values is impor-
tant because it provides the conceptual organizing principles to relate these work-relevant
attitudes and behaviors to value constructs. Although we now have several conceptual
frameworks that provide researchers a working structure of values, it remains unclear what
degree of comprehensiveness and level of specificity we would require of a values structure
for the purpose of employee selection research and practice. A structure is nonparsimoni-
ous and impractical if it specifies a large variety of specific values organized into many
different types, domains, and levels of conceptualization. On the other hand, a structure
with a few generic values is likely to lead to studies with misspecified models because of
omitted value variables. There has been a proliferation of value measures that are rarely
reconciled with earlier measures in the literature, and this makes comparison of studies difficult.
324 Handbook of Employee Selection
COGNITIVE STYLES
Cognitive styles refer to the characteristic modes, typical methods, habitual patterns, or preferred
ways of processing information that are consistent over time and across many areas of activity. So,
we can speak of cognitive styles in terms of thinking styles, problem-solving styles, learning styles,
and so forth.
As noted by Sternberg and Grigorenko (1997), cognitive styles should be distinguished from
strategies. The latter refer to the operations that individuals use or follow to minimize errors in
problem-solving and decision-making. The use of strategies involves the conscious choice of alter-
native operations, whereas cognitive styles typically function without the individual’s awareness.
In addition, strategies are used in task- or context-specific situations, whereas cognitive styles refer
to more stable characteristic modes of information processing that the individual uses consistently
across a large variety of task situations or contexts.
Cognitive styles refer to a set of preferences or habits and hence should be distinguished from
cognitive abilities. We can construe cognitive abilities as the “can do” aspect of cognition and
cognitive styles as the “tend to do” aspect of cognition. Because styles are not abilities, they should
not be inherently better or worse in an absolute or context-free sense. Instead, cognitive styles may
differ in their goodness of fit to different environments or situations, and the degree of fit can
lead to varying degrees of positive or negative consequences. Cognitive styles are also distinct from
personality traits. Although personality traits such as conscientiousness and extraversion also refer
to individual differences in stable characteristic modes of behaviors, they tend to be construed as
generic behavioral tendencies or predispositions, whereas cognitive styles refer to typical modes of
information processing.
VARIETIES OF STYLES
The idea of cognitive styles as an interface between cognitive ability and personality was very
popular in the 1950s and 1960s, and it was during that period that numerous types and measures of
cognitive styles were developed. However, not all of the purported cognitive style constructs are in
fact assessing cognitive styles. For example, Witkin and colleagues introduced the style construct
called field independence to refer to the degree to which individuals are dependent on or independent
of the structure of the surrounding visual field when perceiving objects. The Rod and Frame Test
(Witkin, Dyke, Faterson, Goodenough, & Karp, 1962) and the Embedded Figures Test (Witkin,
Oltman, Raskin, & Karp, 1971) are the two most widely used measures of field independence. In
these measures, the individual’s task is to locate a true vertical (in the Rod and Frame Test) or an
object/figure (Embedded Figures Test) that can be accomplished only by ignoring the surrounding
visual field. The problem with the purported style construct of field independence is that it most
likely represents a cognitive ability as opposed to a cognitive style. The way the construct is con-
ceptualized and measured clearly involves objectively right and wrong answers and it assesses the
ability to objectively obtain the right answer. Contrary to the conceptualization of a cognitive style
construct as not inherently adaptive or maladaptive, high field independence appears to be inher-
ently more adaptive than low field independence. It is difficult to think of situations in which field
dependence is better than field independence. Rather than a preferred way of processing information
(i.e., a style), high field independence refers to a specific type of information processing ability.
Whereas some measures of cognitive styles are in fact assessing cognitive abilities, others are
probably assessing personality traits or multidimensional constructs that are composites of styles
and personality traits. For example, Myers built on Jung’s (1923) theory of psychological types and
developed the Myers-Briggs Type Indicator (MBTI; Myers & McCaulley, 1985) as a cognitive style
measure consisting of four factors, each containing two categories (i.e., thinking vs. feeling, extra-
version vs. introversion, intuition vs. sensing, judgment vs. perception) that are combined to form
16 possible types of individuals. Although the MBTI is widely used in business and education settings,
it has numerous validity problems (e.g., Druckman & Bjork, 1991, 1994). Moreover,
conceptually and empirically, each of the 16 types in the MBTI is clearly a composite of personality
traits (extraversion-introversion) and other individual difference constructs that may be cognitive
styles (e.g., intuition-sensing) or the degree to which personal values versus impersonal logic are
used as the basis for making judgments and decisions (thinking-feeling).
There are, however, legitimate cognitive style constructs. For example, several researchers introduced
(differently labeled) constructs that all refer to the degree to which individuals see things as similar
or different. These include constructs such as categorizing behavior (Gardner, 1953), conceptual
differentiation (Gardner & Schoen, 1962), and compartmentalization (Messick & Kogan, 1963).
These constructs refer to the tendency to separate ideas or objects into discrete categories. Clearly,
any two ideas or objects are similar in some ways and different in other ways. Depending on the
problem or situation, the similarities (or differences) may be task-relevant or task-irrelevant. Hence,
consistent with the conceptualization of a cognitive style construct, the tendency to see things as
similar or different is not inherently adaptive or maladaptive—the adaptive value of any given level
on the construct is dependent on its fit with the problem situation.
Other examples of legitimate cognitive style constructs are the preference for abstract versus
concrete information (Harvey, Hunt, & Schroder, 1961), the adaption versus innovation cognitive
style (Kirton, 1976), and the tolerance for contradiction cognitive style (Chan, 2004). In two dif-
ferent studies, Chan demonstrated that a cognitive style is not inherently adaptive or maladaptive
and that it may interact disordinally with the style demands of the work context (Chan, 1996) or
practical intelligence (Chan, 2004) to produce positive or negative consequences. Using Kirton’s
(1976) conceptualization of adaption versus innovation approach to problem solving, Chan (1996)
showed that the degree of cognitive style mismatch between the individual’s problem-solving style
and the style demands of the work context predicted actual turnover over and above the prediction
provided by job performance. In Chan (2004), construct validity evidence for the cognitive style
construct of tolerance for contradiction was provided in terms of convergent and discriminant validity with an
established set of external constructs. Using a sample different from the validation sample, Chan
(2004) then showed that tolerance for contradiction positively predicted job performance among
individuals with high practical intelligence but negatively predicted job performance among those
with low practical intelligence.
1. Unlike the structure of values, there are no widely used or commonly accepted frame-
works/taxonomies of cognitive styles. Sternberg and Grigorenko (1997) classified some
of the styles available in the literature into three broad categories: cognition-centered,
personality-centered, and activity-centered. However, this classification is not very useful
for various reasons. First, only a few examples are given in each category. Second, several
of the constructs are very closely related conceptually and may even be identical. For
example, it is unclear if cognitive complexity, compartmentalization, conceptual differen-
tiation, and conceptual integration are four distinct styles or if some of these are simply
different labels for the same construct. Third, the cognition-centered category includes
some cognitive styles that are clearly cognitive abilities and others that more closely fit
the conceptualization of cognitive styles. Fourth, the only two examples [the MBTI and
Gregorc’s (1985) Energic Model] given in the personality-centered category are models or
typologies in which individuals are classified into composite types simply obtained from
a combination of several factors that appear to include cognitive styles, personality traits,
and other types of individual difference constructs. Fifth, the activity-centered category,
which consisted of learning and teaching styles, is simply a description of the learning or
teaching contexts in which various types of cognitive styles, personality traits, and motiva-
tional constructs may be applicable. An adequate taxonomy of typology of cognitive styles
is needed to organize the extant style constructs and measures; reduce the proliferation of
different construct labels, which in fact represent the same construct; provide meaningful
comparisons of results across studies; and aid the meta-analysis of cognitive styles.
2. Although cognitive styles are conceptually distinct from cognitive abilities and personality
traits, the literature on cognitive styles contains numerous conceptualizations and mea-
sures of styles that are highly related to or even indistinguishable from cognitive abilities or
personality traits. On the other hand, there are examples of cognitive styles with empirical
evidence suggesting that they are distinct from cognitive ability and personality traits [e.g.,
Chan’s (2004) tolerance for contradiction style; Harvey et al.’s (1961) abstract-concrete
preference; Kirton’s (1976) adaption-innovation style]. When studying a cognitive style in
the context of employee selection, it is important to provide clear theoretical arguments and
empirical evidence for the cognitive style vis-à-vis the traditional predictor space contain-
ing cognitive ability and personality traits (an adequate taxonomy of cognitive styles will
provide a useful conceptual basis). When carefully studied, cognitive styles could provide
important contributions in terms of incremental validity or interaction effects involving
other individual difference constructs or situational variables (e.g., Chan, 1996, 2004).
3. Given the basic definition that cognitive styles are not inherently adaptive or maladaptive,
it is important to validate new cognitive style constructs by identifying and showing, in
theory-driven ways, the boundary conditions under which the cognitive style is adaptive
and those under which it is maladaptive.
4. Cognitive style constructs are often conceptualized, and probably correctly so, as continu-
ous variables. However, many studies measure and analyze cognitive styles as categorical
variables in which individuals are classified into discrete types. This is not merely an issue
of a loss of statistical power to detect an effect due to artificial categorization of a continuous variable. It concerns a mismatch among theory, measurement, and analysis, which is likely
to lead to erroneous substantive inferences. For example, dichotomizing the abstract-concrete style continuum into the abstract type or concrete type (hence ignoring the degree of
abstraction) makes it impossible to conceptualize and empirically test the hypothesis that
degree of abstraction is curvilinearly related to a criterion variable of interest, such as task
performance.
MOTIVATIONAL CONSTRUCTS
Motivation is often defined in terms of three features: it directs behavior (i.e., goal orientation), it
energizes behavior (i.e., activation and activity), and it sustains behavior (i.e., effort and persistence). Clearly,
motivation is necessary for accomplishing many tasks. Many researchers would agree with the conceptualization of job performance as a
function of ability and motivation (e.g., Campbell & Pritchard, 1976). Yet, in terms of the nonability
predictor construct space, the past three decades of employee selection research have largely focused
on personality traits rather than motivational constructs, such as trait goal orientations and need for
achievement. Some personality traits (e.g., conscientiousness) are more easily construed as motiva-
tional constructs than others (e.g., extraversion and neuroticism). Given that personality may overlap
with motivation, and even if we assume that personality is a subset of motivation (and I suspect not
many of us would make this assumption), there remains a large part of the motivational construct
space that is not captured by personality traits.
Although motivational constructs may be captured in selection methods such as interviews,
accomplishment records, biodata measures, and situational judgment tests, we must not confound
these methods with constructs (see Chan & Schmitt, 2005). These selection methods may be used
to assess a wide range of constructs including cognitive ability, personality traits, and motivational
constructs. Employee selection has focused much on cognitive ability and personality constructs
but paid relatively little explicit attention to motivational constructs, although some motivational
constructs may in fact be assessed together with ability and personality in the variety of selection
methods used. The purpose of this section is to highlight the fact that many established motivational
constructs are available in the literature and that they deserve more attention from employee selection
researchers than they currently receive.
Trait Goal Orientations
Goal orientation refers to dispositional differences in the goals that individuals pursue in achievement situations. Individuals who are high in learning goal orientation are motivated to develop competence by acquiring new skills and mastering new situations.
Individuals who are high in performance goal orientation are motivated to seek favorable or avoid
unfavorable evaluations of their performance or competence. They tend to attribute performance
errors and failures to low competence and hence avoid challenges or difficult situations that are
“error-prone.”
The bulk of the research on goal orientation is found in the educational literature. In the 1990s,
several researchers noted that goal orientation is potentially useful in organizational research, including
studies on design and implementation of training programs, performance appraisal systems, and
task performance in general (e.g., Bobko & Colella, 1994; Farr, Hofmann, & Ringenbach, 1993).
Consequently, there has been strong interest in applying goal orientation in several areas within the
employee selection and organizational behavior domains (e.g., Van de Walle, Brown, Cron, & Slocum,
1999). Because of the potential applied value of goal orientation in organizational research, it is likely
that the interest in trait goal orientations will continue.
Fundamental issues of construct validation need to be better addressed to guide substantive
studies of goal orientation in organizational settings. The works of Dweck and colleagues appear
to treat goal orientation as a single bipolar continuum with learning goal orientation at one end and
performance goal orientation at the other. However, subsequent researchers have argued that learn-
ing goal and performance goal orientation are distinct factors. Button, Mathieu, and Zajac (1996)
reviewed the conceptualizations of goal orientation and argued for an uncorrelated two-factor
model in which learning goal and performance goal orientations are distinct and independent.
Although there is agreement with the conceptualization of learning goal orientation (LGO),
previous research has not distinguished or paid sufficient attention to two important, distinct,
and relatively independent dimensions of performance goal orientation. As noted by Elliot and
Harackiewicz (1996) and Van de Walle (1997), goal orientations can be conceptualized as a three-
factor model because performance goal orientation can be construed (and assessed) in terms of
either an avoid performance goal orientation (APGO) or a prove performance goal orientation
(PPGO). Individuals high on APGO strive to avoid unfavorable judgments about their ability.
Given this conceptualization, APGO individuals are less likely to be high on LGO because they
tend to perceive error-prone and difficult situations as threats that expose them to negative
evaluation rather than as learning opportunities for increasing job performance. Individuals high on
PPGO strive to gain favorable judgments by demonstrating their ability and competence to others
through their performance. Unlike APGO, which is conceptualized as negatively associated with
LGO, PPGO is conceptually independent of LGO.
Previous research has produced mixed findings on the association between LGO and performance
goal orientation, with studies reporting zero, positive, and negative correlations (see Button et al.,
1996). The failure to distinguish the notion of performance goal orientation into its two relatively
independent dimensions (APGO vs. PPGO) may be one reason for the apparently mixed findings in
previous research. Conceptually, we would expect LGO to be negatively and substantially related
with APGO but unrelated with PPGO. Given this differential pattern of associations across the two
performance goal orientations, the “mixed findings” in research may not be surprising because the
magnitude and direction of the correlation between LGO and performance goal orientation would
be dependent on the relative extent to which the performance goal orientation measure was loaded
with APGO and PPGO. Because previous performance goal orientation items were not designed to
assess two independent dimensions, some of the items are likely to be bi- or multidimensional rather
than pure markers of APGO or PPGO.
Achievement Motivations
The most well-known construct of achievement motivation is McClelland’s (1961) Need for
Achievement. Individuals with high need for achievement have a strong desire for significant accomplishments. They tend to be approach-oriented, working harder and expending substantial effort
in striving to achieve success. In addition, they tend to be moderate risk takers who select tasks of
intermediate difficulty so that they have more than a 50% chance of achieving success
(McClelland, 1985). According to McClelland, individuals high in need for achievement have a
greater need to achieve success and, conversely, to avoid failure. That is, individuals high in need for
achievement also tend to have a high fear of failure and therefore tend to be avoidance-oriented when it
comes to tasks with high risks of failure.
McClelland's conceptualization of need for achievement dominated the study of motivational constructs
from the 1960s to the 1980s. Since the 1980s, the concept of need for achievement has evolved in
important ways with regard to the way the concept of achievement is construed. A major advance-
ment came from researchers in cross-cultural social psychology. These researchers distinguish
between the individualistic notion of achievement, which is based on an independent view of the
self as originally conceived by McClelland, and a different notion of achievement that is based on
an interdependent view of the self and more characteristic of individuals from collectivistic cultures
(e.g., East Asian) in which group harmony, interconnectedness, and social relationships are empha-
sized (e.g., Markus & Kitayama, 1991). Cultural models of self and need for achievement provide
important conceptual bases for addressing challenging cross-cultural issues of construct equiva-
lence, measurement invariance of responses to measures, and comparisons of criterion-related
validity involving achievement motivational constructs and achievement-related criterion contexts.
Advances in these areas will directly contribute to the employee selection research on issues related
to staffing cross-cultural teams and expatriate assignment. Another major advancement in the con-
strual of need for achievement is the distinction of different achievement domains in terms of the
type of goal striving. Trait goal orientation, as described above, is essentially a multidimensional
view of need for achievement according to the type of goals that one is striving to achieve.
Interests
Interest measures have been used more frequently in vocational guidance situations than in
employee selection, but the goal of selection, most broadly, is to find a person whose characteristics best fit those of the job position, organization, or occupation. Interest in certain types of
work or careers certainly could be one of these characteristics, and it is therefore relevant
to employee selection. The primary reason for considering and measuring interests in employee
selection lies in the assumption that a person will be happiest and most productive when he or she
is working in a job or occupation in which he or she is interested (Schmitt & Chan, 1998). Dawis
(1991) summarized research indicating that interest and personality measures are only weakly
correlated. Although there are no reviews examining the correlations between interests and values
or cognitive styles, the notion of interests is conceptually distinct from the values and cognitive
styles. Interests are scaled in terms of liking, whereas values are scaled in terms of importance and
cognitive styles are scaled in terms of preference in information processing. Interests may be con-
strued as primarily motivational constructs insofar as interests tend to have the three motivational
features; namely, goal-orientation, activation and activity, and effort.
Holland (1985) focused on the similarity between an individual’s interests and the degree to
which an environment provides for engagement in activities of interest to the individual. According
to Holland’s (1985) framework, which is the most well-known taxonomy of interests, individuals and
environments could be characterized along six major dimensions: social, enterprising, conventional,
realistic, investigative, and artistic. For example, high scorers on the realistic dimension are usually
interested in dealing with concrete things and relatively structured tasks, and they are drawn to such
occupations as engineer, farmer, and carpenter. Individuals who score high on the social dimension
are interested in working with and helping others, and they are attracted to such occupations as
teacher, social worker, flight attendant, and mental health worker.
According to Holland, the interest patterns are organized in a hexagonal structure.
Interest areas next to an area of primary interest are also likely to be of interest to an individual,
whereas those interests opposite to a primary area on the hexagon are unlikely to be of much
interest. Holland's structure of interests, measured by the Strong Vocational Interest Blank, has
received considerable corroborative support (see Tracey & Rounds, 1993).
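The hexagonal ordering can be expressed in a few lines of code. The step count below is the standard RIASEC adjacency, used here only to show the predicted similarity ordering (adjacent types most alike, opposite types least alike):

```python
# A minimal sketch of the hexagonal (circumplex) logic: the six RIASEC
# types sit at the vertices of a hexagon in the order below, and the
# structural prediction is that similarity between two interest areas
# falls off with the number of steps separating them on the hexagon.
RIASEC = ["Realistic", "Investigative", "Artistic",
          "Social", "Enterprising", "Conventional"]

def hexagon_steps(type_a, type_b):
    """Steps (0-3) separating two interest types around the hexagon."""
    i, j = RIASEC.index(type_a), RIASEC.index(type_b)
    d = abs(i - j) % 6
    return min(d, 6 - d)

# Realistic is adjacent to Investigative (1 step), two steps from
# Artistic, and opposite Social (3 steps) on the hexagon.
print(hexagon_steps("Realistic", "Investigative"),
      hexagon_steps("Realistic", "Artistic"),
      hexagon_steps("Realistic", "Social"))
```

The empirical test of the structure (e.g., Tracey & Rounds, 1993) amounts to checking that observed correlations among interest scales decrease monotonically with this step count.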
Holland’s framework for the structure and understanding of interest dominates the field of
counseling and vocational guidance, and it has potential for employee selection, although its
direct use is surprisingly limited. However, some of Holland’s interest dimensions are probably
captured in biodata measures.
1. With the emergence of new motivational constructs, basic construct validation efforts are
necessary. Specifically, clarifying the dimensionality of a motivational construct is critical
because it affects our theorizing and directs our hypothesis formulation and our interpreta-
tion of findings regarding the motivational construct. Consider the research on trait goal
orientations. If a three-factor model is correct, then future meta-analytic studies have to
take into account the type of performance goal orientation being assessed when coding
each primary study. The research on dimensionality of trait goal orientations also high-
lights the importance of explicating the role of goals in a motivational construct, including
the content and structure of goals and the goal striving process.
2. An important issue in the conceptualization and hence measurement of motivational
constructs concerns the level of specificity. Although the appropriateness of the level of
specificity of a motivational construct is likely to be dependent on the particular research
question or practical use, we need to ensure conceptual clarity as we move up or down the
ladder of specificity. For example, when a motivational construct is conceptualized at a
very general level, it is likely to be multidimensional and made up of multiple constructs
that may be motivational or nonmotivational constructs. This is best illustrated in the study
of interests. Although the concept of interest has the elements of motivational constructs,
the interest dimensions in Holland’s structure are descriptive categories of individuals or
environments rather than unitary individual difference motivational constructs. In fact,
each interest dimension probably reflects multiple personality traits and cognitive styles,
in addition to motivational constructs. For example, the artistic dimension describes a
category of individuals who are likely to also score high on personality traits such as
openness to experience and cognitive style constructs such as preference for abstraction.
In addition, individuals’ knowledge and skills (e.g., artistic “talent”) as well as their
education, opportunities, and experiences are likely to shape their interests. In short,
interests are probably better understood in terms of descriptions of individuals or
environments in composite terms reflecting motivational constructs but also a variety of
knowledge, skills, abilities, and other characteristics (KSAOs), such as personality traits
and cognitive styles.
3. Motivation is a process. Hence, to understand how motivational constructs affect behav-
iors, we may require conceptualizations of motivational constructs that are more dynamic
than the static conceptualizations that are typical of personality traits, values, and cog-
nitive styles. To begin, studies on motivational constructs need to relate the individual
differences in motivation to the larger literature on motivation, particularly the literature
on theoretical models of work motivation (for review, see Mitchell & Daniels, 2003). A
theoretical model of work motivation specifies the motivational processes or mechanisms
by which motivated individuals select specific goals and pursue them through allocating
effort, monitoring progress, and responding to obstacles and feedback. In each of the estab-
lished models in the work motivation literature, what is the role of individual differences in
motivational constructs? Specifically, where in the work motivational model do we locate
the motivational construct(s)? A theory-driven framework for including motivational
constructs in employee selection would require us to specify the appropriate direct effects
and interaction effects linking motivational constructs and the focal variables in the par-
ticular work motivation model.
4. As illustrated in the above discussion on cultural models of need for achievement, studies
on motivational constructs need to be sensitive to cultural differences in the conceptual
definition of the motivational construct. Even if construct equivalence exists across cul-
tures, culture effects may operate in other ways. For example, it is possible that culture
may moderate the relationship between a motivational construct and a criterion variable.
Consider the motivational construct of APGO, which has almost always been construed
and empirically demonstrated to be negatively associated with job performance in western
samples. It may be possible that in cultures (or task settings) in which there is low toler-
ance for performance errors and high emphasis on speed and accuracy, individuals high on
APGO may not necessarily be rated as poorer performers than those low on APGO, and
they may even be rated as better performers.
1. Legal and social issues: We need to consider the legal and social constraints when
recommending the use of individual difference measures of values, cognitive styles, or
motivations for the purpose of making employee selection decisions. Virtually all of the
legal and social issues involving the use of cognitive ability and personality tests (see
Chapters 29 and 30, this volume) are applicable to the use of values, cognitive styles,
and motivational measures, although the importance of each issue is dependent on
the specific measure and situation of use. Examples of these issues include the legal
determination of job-relevance, which may or may not overlap with psychometric validity;
allegations of discriminatory hiring practices; affirmative action and equal employment
opportunities; applicant reactions; and the distinction between psychometric test bias
and nonpsychometric fairness perceptions (for review, see Schmitt & Chan, 1998). In
practice, legal and social issues are often closely related, as evident in the issue of adverse
impact. In addition, the extent to which it is appropriate or acceptable (whether legally
or socially) to assess a construct for employee selection decisions may be tied to the
selection procedures used and the extent to which the construct is explicitly assessed.
For example, values may be assessed in some biodata items and interviewers’ assessment
of applicants’ values is probably captured, although mostly not in an explicit manner, in
the interview scores. The measurement of values as a component of biodata or interview
scores may not attract as much legal or social attention as the use of an inventory designed
Values, Styles, and Motivational Constructs 333
specifically to measure values. Over the past two decades, employee selection research has focused much attention on applicant reactions, including their importance and the various ways to engender favorable reactions. When adequately developed, measures of values,
cognitive styles, and motivational constructs can lead to positive applicant reactions (see
Chan & Schmitt, 2004; Schmitt & Chan, 1997).
2. Subgroup differences: A practical problem faced by many organizations is the use of
selection tests (particularly cognitive ability tests) that are valid predictors of job per-
formance for majority and minority applicants but show large subgroup differences in
mean test scores favoring the majority subgroup. This situation leads to a conflict between
the organization’s need to use a valid test and the goal to hire a diverse workforce for
legal and social reasons. In general, measures of values, cognitive styles, and motiva-
tional constructs are probably more similar to personality tests than cognitive ability
tests in that there is no evidence of substantial subgroup differences between majority
and minority applicants. However, this may not be true of some specific measures even if
the measures do not assess cognitive ability constructs. For example, it has been argued, and there is some empirical evidence showing, that Black Americans, compared with White Americans, tend to perform better on a test loaded with socially interactive and visual information than on a test loaded with written and verbal information (Chan & Schmitt, 1997). Hence, using a cognitive style measure to assess the preference for processing visual versus verbal information is likely to produce Black-White subgroup differences in test scores, leading to adverse impact problems in employee selection. But
in general, adding measures of values, cognitive styles, and motivational constructs to
cognitive ability tests is likely to reduce subgroup difference in the composite test scores
and hence adverse impact. Including these nonability measures also increases criterion-
related validity to the extent that the criterion space is expanded from the narrow focus
on ability-based maximum and technical job performance to the nonability-based typical
and contextual job performance.
3. Cultural differences: Issues of possible cultural differences in test validity (in terms of
content, criterion-related, and construct validity evidence) need to be considered whenever
we use a selection measure in a culture different from the culture in which the measure is
developed and validated. Discussions on methodological issues in cross-cultural measure-
ment, such as response sets and measurement invariance, are readily available in the lit-
erature and will not be repeated here (for a recent review, see Chan, 2008a). Note, however,
that cultural differences may affect the conceptualization and measurement of constructs
in substantive ways that go beyond the technical issues of cross-cultural measurement. This
is particularly relevant to values and motivational constructs given that cultures may differ
qualitatively in their conceptualizations of certain values (e.g., freedom, happiness) and
motivations (e.g., need for achievement).
4. Problems with self-report data: In the assessment of values, cognitive styles, and
motivational constructs, the large majority of the measures used are in self-report
format. Similar to the use of personality inventories, issues related to the validity
problems of self-report data are relevant when self-report measures of these three types
of constructs are used in employee selection, especially given the high stakes involved
in actual employee selection contexts. Some values (e.g., honesty) and motivational
constructs (e.g., LGO), given the evaluative nature of their content, may be particularly
susceptible to social desirability responding problems. In general, cognitive styles
are probably less likely than values and motivational constructs to suffer from social
desirability responding given the nonevaluative nature of cognitive style items. Finally,
although self-report data problems do occur in the measurement of values, cognitive
styles, and motivational constructs, many of the purported problems are often overstated
(see Chan, 2008b).
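The claim in the subgroup differences discussion above, that adding nonability measures with small subgroup differences to a cognitive ability test tends to shrink the subgroup difference on the composite, follows from a standard psychometric result for unit-weighted composites of standardized predictors (assuming equal subgroup variances). The sketch below uses illustrative values, not estimates from this chapter:

```python
import math

def composite_d(ds, r):
    """Cohen's d for a unit-weighted composite of standardized predictors,
    given each predictor's subgroup difference in `ds` and a common predictor
    intercorrelation `r` (equal subgroup variances assumed)."""
    k = len(ds)
    return sum(ds) / math.sqrt(k + k * (k - 1) * r)

# Illustrative values only: a cognitive test with d = 1.0 favoring the
# majority group, plus a values/motivation measure with d = 0.0 that
# correlates .20 with the cognitive test.
print(round(composite_d([1.0, 0.0], 0.20), 2))  # prints 0.65
```

With more low-d predictors, or lower predictor intercorrelations, the composite d shrinks further, which is the mechanism behind the adverse impact reduction described above.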
334 Handbook of Employee Selection
1. Dimensionality: Construct validation efforts that specify and test the dimensionality of a
construct are fundamental when examining a value, cognitive style, or motivational con-
struct. Specifically, it is important to determine if the construct of interest under study is
a single, “pure” factor or a composite construct consisting of multiple factors. Composite
constructs are particularly difficult to deal with. First, we will need to identify the number
and nature of the various factors. Second, we will need to establish the different contribu-
tions of the various factors to the composite construct. Third, failing to accurately identify
the number, nature, and weights of the factors making up the composite construct will
result in substantive inferential errors about values, cognitive styles, or motivational con-
structs. For example, a purportedly motivational construct may in fact be a composite label
reflecting not only multiple motivational constructs but also various nonmotivational con-
structs such as knowledge, skills, abilities, personality traits, values, and cognitive styles.
2. Level of specificity: Closely related to the issue of dimensionality and composite constructs
is the issue of level of specificity of a construct. Depending on the particular research
question, researchers need to ensure that the level of specificity of the value, cognitive style,
or motivational construct is appropriate, and this requires clear conceptual definitions of the
constructs and appropriate matching between predictor and criterion constructs. Broader
constructs (e.g., individualistic vs. collectivistic values, need for achievement) may be more
useful for obtaining better prediction of a general criterion (e.g., organizational commitment,
overall job performance) in a parsimonious and generalizable manner. More narrowly
defined constructs may be more useful for increasing understanding of the criterion space
and the predictor-criterion relationships, including possible mediating mechanisms (e.g.,
linking specific trait goal orientations to specific dimensions of job performance). Note
that the issue here is not about any inherently optimal level of specificity of the construct.
Any general statement on the relative value of broad versus narrowly defined constructs
is unlikely to be useful because it is the clarity of conceptual definition of constructs and
appropriate matching between the predictor and criterion spaces that will lead to higher
validities and better explanations.
3. Adaptive value: Studies on the three types of constructs, particularly with respect to their
use in employee selection, need to explicate and test the adaptive value of the construct.
As noted in this chapter, cognitive styles are not inherently adaptive or maladaptive in an
absolute and context-free sense. Consider a measure that was designed to assess a cogni-
tive style. Let us suppose high scorers on this measure perform better in tasks across
many domains and it is very difficult or impossible to conceive of two different situa-
tions in which the high scorers are adaptive in one and maladaptive in the other. In this
scenario, the measure is likely to be assessing cognitive abilities or some other construct
that is inherently adaptive rather than a cognitive style. On the other hand, motivational
constructs are inherently adaptive in nature in that they should correlate positively rather
than negatively with a criterion in which higher scores represent higher adaptive value. It
is difficult to think of generalizable situations in which more highly motivated individuals, compared with less motivated individuals, will experience less positive or more negative
consequences. Whether or not a value construct is adaptive or maladaptive is dependent
on the nature of the construct. Values with higher evaluative content such as honesty and
fairness are likely to be adaptive in many situations, whereas those with lower evalua-
tive content such as individualism-collectivism may be adaptive or maladaptive depending
on the nature of the situational demands. In addressing the adaptive value of a predictor
EPILOGUE
Given the modest amount of criterion variance typically accounted for in employee selection, it is
understandable that researchers and practitioners seek to expand the predictor construct space by
going beyond cognitive abilities and personality traits to include values, cognitive styles, and moti-
vational constructs. I have provided an overview of the nature of these three types of constructs, their
potential usefulness, issues relating to conceptualization and measurement, practical considerations
to take into account in the use of these constructs for employee selection, and a strategic agenda
for future research directions. It is hoped that this chapter will provide an effective springboard for
fruitful construct-oriented research on values, cognitive styles, and motivational constructs.
REFERENCES
Bobko, P., & Colella, A. (1994). Employee reactions to performance standards: A review and research proposi-
tions. Personnel Psychology, 47, 1–29.
Button, S. B., Mathieu, J. E., & Zajac, D. M. (1996). Goal orientation in organizational research: A conceptual
and empirical foundation. Organizational Behavior and Human Decision Processes, 67, 26–48.
Campbell, J. P., & Pritchard, R. D. (1976). Motivation theory in industrial and organizational psychology. In M.
D. Dunnette (Ed.), Handbook of industrial and organizational psychology (1st ed., pp. 63–130). Chicago,
IL: Rand McNally.
Chan, D. (1998a). The conceptualization and analysis of change over time: An integrative approach incorporat-
ing longitudinal means and covariance structures analysis (LMACS) and multiple indicator latent growth
modeling (MLGM). Organizational Research Methods, 1, 421–483.
Chan, D. (1998b). Functional relations among constructs in the same content domain at different levels of
analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.
Chan, D. (2002). Longitudinal modeling. In S. Rogelberg (Ed.), Handbook of research methods in industrial
and organizational psychology (pp. 412–430). Malden, MA: Blackwell.
Chan, D. (2004). Individual differences in tolerance for contradiction. Human Performance, 17, 297–325.
Chan, D. (2005a). Multilevel research. In F. T. L. Leong & J. T. Austin (Eds.), The psychology research hand-
book (2nd ed., pp. 401–418). Thousand Oaks, CA: Sage.
Chan, D. (2005b). Current directions in employee selection. Current Directions in Psychological Science, 14,
220–223.
Chan, D. (2008a). Methodological issues in international Human Resource Management. In M. M. Harris
(Ed.), Handbook of research in international human resources management, (pp. 53–76). Mahwah, NJ:
Lawrence Erlbaum.
Chan, D. (2008b). So why ask me?—Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg
(Eds.), Statistical and methodological myths and urban legends: Received doctrine, verity, and fable in
the organizational and social sciences (pp. 309–336). Hillsdale, NJ: Lawrence Erlbaum.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judg-
ment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied
Psychology, 82, 143–159.
Chan, D., & Schmitt, N. (2004). An agenda for future research on applicant reactions to selection procedures:
A construct-oriented approach. International Journal of Selection and Assessment, 12, 9–23.
Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Smit-Voskuijl, & N. Anderson
(Eds.), Handbook of personnel selection (pp. 219–242). Oxford, England: Blackwell.
Dawis, R. V. (1991). Vocational interests, values, and preferences. In M. D. Dunnette & L. M. Hough (Eds.),
Handbook of industrial and organizational psychology (Vol. 2, pp. 833–872). Palo Alto, CA: Consulting
Psychologists Press.
Druckman, D., & Bjork, R. A. (1991). In the mind’s eye. Washington, DC: National Academy Press.
Druckman, D., & Bjork, R. A. (Eds.). (1994). Learning, remembering, believing: Enhancing individual and
team performance. Washington, DC: National Academy Press.
Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41, 1040–1048.
Dweck, C. S., & Leggett, E. L. (1988). A social-cognitive approach to motivation and personality. Psychological
Review, 95, 256–273.
Elliot, A. J., & Harackiewicz, J. M. (1996). Approach and avoidance achievement goals and intrinsic motiva-
tion: A mediational analysis. Journal of Personality and Social Psychology, 70, 461–475.
Elliott, E. S., & Dweck, C. S. (1988). Goals: An approach to motivation and achievement. Journal of Personality
and Social Psychology, 54, 5–12.
England, G. W., & Lee, R. (1974). The relationship between managerial values and managerial success in the
United States, Japan, England, and Australia. Journal of Applied Psychology, 59, 411–419.
Farr, J. L., Hofmann, D. A., & Ringenbach, K. L. (1993). Goal orientation and action control theory: Implications
for industrial and organizational psychology. In C. L. Cooper & I. T. Robertson (Eds.), International
review of industrial and organizational psychology (Vol. 8, pp. 191–232). New York, NY: Wiley.
Gardner, R. W. (1953). Cognitive styles in categorizing behavior. Journal of Personality, 22, 214–233.
Gardner, R. W., & Schoen, R. A. (1962). Differentiation and abstraction in concept formation. Psychological
Monographs, 76 (41, Whole No. 560).
Harvey, O. J., Hunt, D. E., & Schroder, H. M. (1961). Conceptual systems and personality organization. New
York, NY: Wiley.
Holland, J. L. (1985). Making vocational choices: A theory of careers (2nd ed.). Englewood Cliffs, NJ: Prentice
Hall.
Jung, C. (1923). Psychological types. New York, NY: Harcourt Brace.
Kidron, A. (1978). Work values and organizational commitment. Academy of Management Journal, 21, 239–247.
Kirton, M. J. (1976). Adaptors and innovators: A description and measure. Journal of Applied Psychology, 61,
622–629.
Maio, G. R., Roese, N., Seligman, C., & Katz, A. (1996). Ratings, rankings, and the measurement of values:
Evidence for the superior validity of ratings. Basic and Applied Social Psychology, 18, 171–181.
Markus, H., & Kitayama, S. (1991). Culture and self: Implications for cognition, emotion, and motivation.
Psychological Review, 98, 224–253.
McClelland, D. C. (1961). The achieving society. Princeton, NJ: Van Nostrand.
McClelland, D. C. (1985). Human motivation. Glenview, IL: Scott, Foresman.
Meglino, B. M., Ravlin, E. C., & Adkins, C. L. (1989). A work values approach to corporate culture: A field
test of the value congruence process and its relationship to individual outcomes. Journal of Applied
Psychology, 74, 424–432.
Messick, S., & Kogan, N. (1963). Differentiation and compartmentalization in object-sorting measures of
categorizing style. Perceptual and Motor Skills, 16, 47–51.
Mitchell, T. R., & Daniels, D. (2003). Motivation. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.),
Handbook of psychology (Vol. 12, pp. 225–254). New York, NY: Wiley.
Myers, I. B., & McCaulley, M. H. (1985). Manual: A guide to the development and use of the Myers-Briggs
Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
Payne, S. C., Youngcourt, S. S., & Beaubien, J. M. (2007). A meta-analytic examination of the goal orientation
nomological net. Journal of Applied Psychology, 92, 128–150.
Posner, B. Z., Kouzes, J. M., & Schmidt, W. H. (1985). Shared values make a difference: An empirical test of
corporate culture. Human Resource Management, 24, 293–309.
Ravlin, E. C., & Meglino, B. M. (1987). Effects of values on perception and decision making: A study of
alternative work values measures. Journal of Applied Psychology, 72, 666–673.
Rokeach, M. (1973). The nature of human values. New York, NY: Free Press.
Schwartz, S. H. (1992). Universals in the content and structure of values: Theoretical advances and empirical tests
in 20 countries. In M. P. Zanna (Ed.), Advances in Experimental Social Psychology (Vol. 25, pp. 1–65). San
Diego, CA: Academic Press.
Schwartz, S. H., & Bilsky, W. (1990). Toward a theory of the universal content and structure of values:
Extensions and cross-cultural replications. Journal of Personality and Social Psychology, 58, 878–891.
Schwartz, S. H., & Sagiv, L. (1995). Identifying culture-specifics in the content and structure of values. Journal
of Cross-Cultural Psychology, 26, 92–116.
Schmitt, N., & Chan, D. (1998). Employee selection: A theoretical approach. Thousand Oaks, CA: Sage.
Sternberg, R. J., & Grigorenko, E. L. (1997). Are cognitive styles still in style? American Psychologist, 52,
700–712.
Tracey, T. J., & Rounds, J. B. (1993). Evaluating Holland’s and Gati’s vocational interest models: A structural
meta-analysis. Psychological Bulletin, 113, 229–246.
Van de Walle, D. (1997). Development and validation of a work domain goal orientation instrument. Educational
and Psychological Measurement, 57, 995–1015.
Van de Walle, D., Brown, S. P., Cron, W. L., & Slocum, J. W., Jr. (1999). The influence of goal orientation and
self-regulation tactics on sales performance: A longitudinal field test. Journal of Applied Psychology, 84,
249–259.
Weiss, H. M. (1978). Social learning of work values in organizations. Journal of Applied Psychology, 63, 711–718.
Witkin, H. A., Dyke, R. B., Faterson, H. F., Goodenough, D. R., & Karp, S. A. (1962). Psychological differen-
tiation. New York, NY: Wiley.
Witkin, H. A., Oltman, P. K., Raskin, E., & Karp, S. A. (1971). Embedded figures test, children’s embedded
figures test, group embedded figures test: Manual. Palo Alto, CA: Consulting Psychologists Press.
16 Practical Intelligence,
Emotional Intelligence,
and Social Intelligence
Over the years, practical intelligence, social intelligence, and especially emotional intelligence have received substantial attention in the academic and practitioner literatures. At the same time, however, these individual difference "constructs" have also fueled controversies and criticisms, including criticisms of their applications to employee selection. Without doubt, their definition, dimensionality, and operationalization (measurement) have been questioned far more than those of the more traditional or established constructs (i.e., cognitive ability, personality) covered in this section of the book.
This chapter has two main objectives. The first objective is to review and clarify the conceptual-
ization and measurement of these three constructs (or categories of constructs). In doing so, we aim
to identify commonalities and differences among the three constructs. The second objective is to
advance research on practical, social, and emotional intelligence. We aim to achieve both objectives
by placing the three intelligence constructs in an integrative conceptual framework that relates them
to traditional individual difference constructs and critical criterion constructs. We end by proposing
five strategies for future research.
PRACTICAL INTELLIGENCE
Sternberg and colleagues introduced the construct of practical intelligence in the mid- to late-1980s
(Sternberg, 1988; Wagner & Sternberg, 1985). The common thread running through the various definitions is that practical intelligence refers to an individual's ability to deal with the problems and situations of everyday life (Bowman, Markham, & Roberts, 2001). In lay terms, it can be characterized as "intuition" or "common sense," and it is often called "street smarts," in contrast to the "book smarts" of traditional analytical or academic intelligence.
A central element in practical intelligence is tacit knowledge. Sternberg, Wagner, Williams, and
Horvath (1995) defined tacit knowledge as “action-orientated knowledge, acquired without direct
help from others, that allows individuals to achieve goals they personally value” (p. 916). This
definition encompasses the key characteristics of tacit knowledge (see Hedlund et al., 2003). First,
tacit knowledge is difficult to articulate because it is not formalized in explicit procedures and rules.
Second, tacit knowledge is typically procedural knowledge, telling people how to act in various
situations. Third, individuals acquire tacit knowledge on the basis of their own everyday experience
related to a specific domain. Thus, tacit knowledge is not formally taught. Fourth, tacit knowledge
is practical because it enables individuals to obtain the goals that they value in life. These charac-
teristics exemplify the claim of practical intelligence and tacit knowledge being constructs that are
conceptually distinct from academic intelligence, technical job knowledge, or personality.
Research by Sternberg and colleagues, as well as by others, has found some support for, or at least produced findings consistent with, some of these claims. First, tacit knowledge seems to increase with experience. For example, business managers received higher tacit knowledge scores than business graduate students, who in turn outperformed undergraduate students, although sample sizes in these groups were often small (Wagner, 1987). Second, scores on tacit knowledge inventories
showed low correlations (below .20) with measures of fluid and crystallized intelligence (Legree,
Heffner, Psotka, Martin, & Medsker, 2003; Tan & Libby, 1997). Finally, Bowman et al. (2001)
reviewed research on tacit knowledge in organizational, educational, and military settings and concluded that the assessment of tacit knowledge shows some promise for predicting performance in these real-world environments, although the level of prediction does not reach the values obtained
with g (see also Van Rooy, Dilchert, Viswesvaran, & Ones, 2006).
Bowman et al. (2001) leveled various criticisms at the construct of practical intelligence. From a conceptual point of view, questions have been raised as to whether practical intelligence (tacit knowledge) exists at all as a single construct distinct from other types of intelligence, job knowledge, and personality (see also Gottfredson, 2003; McDaniel & Whetzel, 2005). In
particular, McDaniel and Whetzel (2005) put various claims related to practical intelligence (tacit
knowledge) to the test. To this end, they used research related to situational judgment tests (SJTs),
a measurement method that is closely related to tacit knowledge inventories (see below). Consistent
with research by Sternberg and colleagues, McDaniel and Whetzel concluded that such tests predict
job performance and have incremental validity over more common selection procedures. However,
they argued that there was no support for the other claims. Specifically, they cited factor analytic studies showing that SJTs of practical intelligence were factorially complex and could not be represented by a general factor, as well as studies showing that these test scores were significantly related to scores on established constructs such as g, conscientiousness, emotional stability, and agreeableness.
Later in this chapter, we argue that such criticisms are right and wrong—they are right that practical
intelligence is not a unitary construct, but they are wrong to conclude that the factorially complex
results and significant correlations with established constructs imply that practical intelligence is
not a distinct and valid construct.
EMOTIONAL INTELLIGENCE
Since the mid-1990s, emotional intelligence has probably been the psychological construct receiving the greatest attention in the practitioner and academic literatures. Generally, a distinction is
made between two conceptualizations of emotional intelligence; namely, an ability emotional intel-
ligence model and a trait emotional intelligence model (e.g., Matthews, Zeidner, & Roberts, 2007).
The first model conceptualizes emotional intelligence as an ability akin to cognitive ability
and measures it via performance-based tests. In this paradigm, emotional intelligence is viewed
as another legitimate type of intelligence. Hence, this model is also referred to as emotional cogni-
tive ability or information processing emotional intelligence. Emotional intelligence is then defined
as “the ability to monitor one’s own and others’ emotions, to discriminate among them, and to
use the information to guide one’s thinking and actions” (Salovey & Mayer, 1990, p. 189). This
definition shows that the higher order construct of emotional intelligence is broken down into four
branches. The first branch—emotional identification, perception, and expression—deals with the
ability to accurately perceive emotions in others’ verbal and nonverbal behavior. Emotional facilita-
tion of thought is the second branch, referring to the ability to use emotions to assist thinking and
problem-solving. Third, emotional understanding denotes the ability to analyze feelings, discrimi-
nate among emotions, and think about their outcomes. Finally, emotional management deals with
abilities related to maintaining or changing emotions.
The second model, the trait (or mixed) emotional intelligence model, views emotional intelligence as akin to personality and
assesses it via self-report. In this model, emotional intelligence is defined as “an array of non-
cognitive capabilities, competencies, and skills that influence one’s ability to succeed in coping
with environmental demands and pressures” (Bar-On, 1997, p. 16). As the name suggests, this
model uses a broad definition of emotional intelligence. Abilities such as emotion perception are
typically combined with noncognitive competencies, skills, and personality traits. For example,
one of the most popular mixed models (Bar-On, 1997) measures five broad factors and 15 facets:
(a) intrapersonal (self-regard, emotional self-awareness, assertiveness, independence, and self-actualization), (b) interpersonal (empathy, social responsibility, and interpersonal relationships), (c) stress management (stress tolerance and impulse control), (d) adaptability (reality testing, flexibility, and
problem solving), and (e) general mood (optimism and happiness). In the Goleman (1995) model,
a similar expanded definition of emotional intelligence is used, referring to emotional intelligence
as a set of learned competencies. Emotional intelligence competence is then defined as “an ability
to recognize, understand, and use emotional information about oneself or others that leads to or
causes effective or superior performance” (Boyatzis & Sala, 2004, p. 149). A distinction is further
made between five main competency clusters (with various subcompetencies): self-awareness, self-
regulation, motivation, empathy, and social skills. Given the trait-like nature of the mixed model,
some researchers have suggested using terms such as “trait emotional intelligence,” “emotional self-
efficacy” (Petrides & Furnham, 2003), or “emotional self-confidence” (Roberts, Schulze, Zeidner, & Matthews, 2005).
Recent meta-analytic research (Van Rooy, Viswesvaran, & Pluta, 2005) has demonstrated
that these two models are not measuring the same constructs. Measures based on the two models
correlated only .14 with one another. In addition, these two models had different correlates.
Emotional intelligence measures based on the mixed model overlapped considerably with personality
trait scores but not with cognitive ability. Conversely, emotional intelligence measures developed
according to an emotional intelligence ability model correlated more with cognitive ability and less
with personality. Other research has clarified that ability model measures correlate especially with
verbal (crystallized) ability, with correlations typically between .30 and .40 (Mayer, Roberts, &
Barsade, 2008). Hence, some have posited that the term “emotional intelligence” should be replaced
by the term “emotional knowledge” (Zeidner, Matthews, & Roberts, 2004).
In addition to the construct validity of emotional intelligence, the criterion-related validity has
also been scrutinized. Van Rooy and Viswesvaran (2004) conducted a meta-analysis of emotional
intelligence measures (collapsing both models) for predicting performance. Their analysis of
59 independent empirical samples obtained a mean corrected correlation of .23. The validity of
emotional intelligence measures was .24, .10, and .24 for predicting performance in occupational,
academic, and life settings, respectively. However, a caveat is in order when interpreting the results
of this meta-analysis as it included only a small number of studies using ability-based emotional
intelligence instruments and a sizable number of studies using self-report measures of performance.
Thus, we are still far from being at the point of rendering a decision as to the incremental value of
emotional intelligence for selection purposes. However, in recent years, more positive conclusions
regarding the validity of emotional intelligence for predicting performance have been drawn. For
instance, Druskat and Jordan (2007) reviewed 26 studies that examined the validity of emotional
intelligence (both models) for predicting performance at the individual, team, and leadership level.
Importantly, all of the studies reviewed were published in peer-reviewed journals. The overall
conclusion was that “emotional intelligence predicts work performance over and above measures of
personality and general mental ability” (p. 2).
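The "mean corrected correlation" reported by Van Rooy and Viswesvaran (2004) refers to observed validities adjusted for measurement error. The classical disattenuation formula behind such corrections can be sketched as follows; the reliability and validity values here are illustrative assumptions, not figures from that meta-analysis:

```python
import math

def correct_for_attenuation(r_obs, rel_x, rel_y):
    """Classical disattenuation: estimate of the correlation between
    predictor and criterion after correcting for unreliability in both."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Illustrative values only: observed validity .19, predictor reliability
# .85, criterion (job performance rating) reliability .70.
print(round(correct_for_attenuation(0.19, 0.85, 0.70), 2))  # prints 0.25
```

Because corrected values depend directly on the reliability estimates assumed, the same observed validity can yield noticeably different "true" validities across meta-analyses.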
Despite this recent optimism, there are conceptual and methodological problems associated with
the research on emotional intelligence. Most criticisms were directed at the mixed model (Mayer
et al., 2008). First, the ambiguous (all-encompassing) definition and the very broad content of the
mixed model have been criticized (e.g., Landy, 2005; Locke, 2005; Matthews, Roberts, & Zeidner,
2004). For example, Landy (2005) succinctly noted: “The construct [of emotional intelligence] and
the operational definitions of the construct (i.e., the actual measurement instruments) are moving
targets” (p. 419). Similarly, Locke (2005) posited that “The concept of EI [emotional intelligence]
has now become so broad and the components so variegated that no one concept could possibly
encompass or integrate all of them, no matter what the concept was called; it is no longer even an
intelligible concept” (p. 426).
Another criticism relates to redundancy of the mixed model with Big Five personality traits.
For instance, De Raad (2005) explored to what extent emotional intelligence (mixed model) can be
expressed in terms of personality traits. To this end, he gathered a total of 437 items from emotional
intelligence inventories. Sixty-six percent of the emotional intelligence descriptors could be classi-
fied in a well-known Big Five framework (The Abridged Big Five-Dimensional Circumplex). The
lion's share of the terms was categorized under agreeableness and emotional stability. The main reason for items not being classifiable was that they were ambiguous, being related to various Big Five factors. In other studies, the multiple correlation between Big Five scores and scores on
mixed model emotional intelligence measures ranged between .75 and .79 (Brackett & Mayer, 2003;
Grubb & McDaniel, 2007). However, other studies found incremental validity of the mixed model
over and above personality (Law, Wong, & Song, 2004; Tett, Fox, & Wang, 2005). Nonetheless, in
the scientific community, there have been calls to give up the mixed model (which is very popular in
practice) and focus solely on the ability model (Daus & Ashkanasy, 2005).
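The incremental validity findings cited above (Law et al., 2004; Tett et al., 2005) rest on hierarchical regression: the R² obtained from personality alone is compared with the R² obtained after adding the emotional intelligence score. The sketch below illustrates the computation on simulated data with invented effect sizes; it is not a reanalysis of any cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated predictors: five personality scores and one mixed model EI score.
# The EI score is built to overlap heavily with personality, mirroring the
# redundancy findings discussed in the text (all values are illustrative).
big_five = rng.normal(size=(n, 5))
ei = big_five @ np.array([0.15, 0.10, 0.45, 0.45, 0.10]) + rng.normal(scale=0.6, size=n)
criterion = big_five @ np.array([0.2, 0.0, 0.3, 0.2, 0.1]) + 0.15 * ei + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept column."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(big_five, criterion)                          # personality only
r2_full = r_squared(np.column_stack([big_five, ei]), criterion)   # personality + EI
delta_r2 = r2_full - r2_base                                      # incremental validity
print(f"Delta R^2 for EI over the Big Five: {delta_r2:.3f}")
```

In published studies, the size of this ΔR² (and its significance test) is what the incremental validity debate turns on.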
The ability model is not without limitations either. For example, a large-scale examination of
many emotional intelligence, cognitive intelligence, and personality measures showed that emotion
perception (as represented by measures of perception of emotions in faces and pictures) was the only
branch of the four branches of the ability model that could not be classified under established mea-
sures (Davies, Stankov, & Roberts, 1998). But even the emotion perception construct has drawbacks
because the construct does not seem to have generalizability across different measures (Gohm,
2004). That is, existing emotion perception measures correlate only weakly with one another (Roberts
et al., 2006).
In comparing the findings from the ability and the trait models, a major methodological prob-
lem exists because of a method-construct confound resulting from the fact that the ability model
is often measured using performance-based tests whereas the trait model is often measured using
self-reports. To advance research on the comparison of ability and trait models of emotional intel-
ligence (and also on the comparison of these models when applied to practical intelligence or social
intelligence), rigorous designs that allow us to clearly isolate construct and method variances are
needed (Chan & Schmitt, 2005).
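The designs called for by Chan and Schmitt (2005) are usually operationalized as a multitrait-multimethod (MTMM) matrix: convergent correlations (same trait, different methods) should exceed heterotrait-monomethod correlations (different traits, same method), which mostly reflect shared method variance. A hypothetical simulation of that logic follows; the traits, methods, and variance proportions are invented for illustration.

```python
import numpy as np

# Hypothetical scores: 2 traits (emotional, social intelligence) x 2 methods
# (performance-based test, self-report) for the same respondents.
rng = np.random.default_rng(1)
n = 300
trait_e, trait_s = rng.normal(size=(2, n))          # latent trait scores
method_test, method_self = rng.normal(size=(2, n))  # shared method factors

measures = {
    ("emotional", "test"): trait_e + 0.5 * method_test + rng.normal(scale=0.5, size=n),
    ("emotional", "self"): trait_e + 0.5 * method_self + rng.normal(scale=0.5, size=n),
    ("social", "test"):    trait_s + 0.5 * method_test + rng.normal(scale=0.5, size=n),
    ("social", "self"):    trait_s + 0.5 * method_self + rng.normal(scale=0.5, size=n),
}

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Convergent validity: same trait measured by different methods (should be high).
conv = corr(measures[("emotional", "test")], measures[("emotional", "self")])
# Discriminant check: different traits sharing one method (method variance only).
disc = corr(measures[("emotional", "test")], measures[("social", "test")])
print(f"convergent r = {conv:.2f}, heterotrait-monomethod r = {disc:.2f}")
```

In practice, confirmatory factor analysis of such a matrix yields separate estimates of trait and method variance, which is what allows construct and method to be disentangled.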
SOCIAL INTELLIGENCE
Of the three intelligence constructs, social intelligence has the longest history. The idea goes back
to Thorndike (1920), who defined social intelligence as “the ability to understand and manage men
and women, boys and girls, to act wisely in human relations” (p. 228). As noted by Landy (2005),
Thorndike did not build a theory of social intelligence; he only used the notion of social intelligence
to clarify that intelligence could manifest itself in different facets (e.g., abstract, mechanical,
social).
Social intelligence has a checkered history. Early studies tried to distinguish social intelligence
from academic intelligence (e.g., Hoepfner & O’Sullivan, 1968; Keating, 1978). However, these
research efforts were unsuccessful. The problem was that measures of social intelligence did not
correlate highly among themselves and that academic intelligence and social intelligence formed one
factor. Methodologically, it was troublesome that both intelligences were measured with the same
method (paper-and-pencil measures). The early research led to the conclusion that the “putative
Practical Intelligence, Emotional Intelligence, and Social Intelligence 343
domain of social intelligence lacks empirical coherency, at least as it is represented by the measures
used here” (Keating, 1978, p. 221).
Two advancements led to more optimism. The first was the distinction between cognitive social
intelligence (e.g., social perception or the ability to understand or decode verbal and nonverbal
behaviors of other persons) and behavioral social intelligence (effectiveness in social situations).
Using this multidimensional definition of social intelligence and multiple measures (self, teacher,
and peer ratings), Ford and Tisak (1983) were able to distinguish social intelligence from academic
intelligence. In addition, social intelligence predicted social behavior better than academic intel-
ligence (see also Marlowe, 1986). The second advancement was the use of multitrait-multimethod
designs (and confirmatory factor analysis) to obtain separate and unconfounded estimates of trait
and method variance (Jones & Day, 1997; Wong, Day, Maxwell, & Meara, 1995).
These more sophisticated multitrait-multimethod designs have brought further evidence for
the multidimensionality of social intelligence and for its discriminability vis-à-vis academic
intelligence. For example, the aforementioned distinction made between cognitive social intelligence
and behavioral social intelligence has been confirmed (e.g., Wong et al., 1995). Similarly, a
distinction is often made between fluid and crystallized social intelligence. The fluid form of social
intelligence refers to social-cognitive flexibility (the ability to flexibly apply social knowledge in
novel situations) or social inference. Conversely, a term such as social knowledge (knowledge of
social etiquette, procedural and declarative social knowledge about social events) denotes the more
crystallized component of social intelligence (Jones & Day, 1997). Despite these common findings,
the dimensions, definitions, and measures of social intelligence still vary considerably across studies. Along these lines, Weis and Süss (2005) recently gave an excellent overview of the different facets of social intelligence that have been examined. This overview might form the basis for a more uniform terminology when describing social intelligence subdimensions.
In recent years, interest in social intelligence has also experienced a renaissance under the general
term of social effectiveness constructs. According to Ferris, Perrewé, and Douglas (2002), social
effectiveness is a “broad, higher-order, umbrella term, which groups a number of moderately-
related, yet conceptually-distinctive, manifestations of social understanding and competence”
(p. 50). Examples are social competence, self-monitoring, emotional intelligence, social skill, social
deftness, practical intelligence, etc. The value of social skills has been especially scrutinized.
Similar to social intelligence, social skills are posited to have a cognitive component (interpersonal
perceptiveness) and a behavioral component (behavioral flexibility; Riggio, 1986; Schneider,
Ackerman, & Kanfer, 1996).
A key distinction between social skills and personality traits is that the former are learned (i.e.,
an ability), whereas the latter are relatively stable. Research has found that they are only moderately
(.20) correlated (Ferris, Witt, & Hochwarter, 2001). However, both constructs are also related in
that social skills enable personality traits to show their effects (Ferris et al., 2001; Hogan & Shelton,
1998). Research has indeed confirmed that social skills moderate the effects of personality traits
(conscientiousness) on job performance (Witt & Ferris, 2003). Social skills were also found to have
direct effects on managerial job performance, although personality and cognitive ability were not
controlled for in most studies (Semadar, Robins, & Ferris, 2006).
CONCLUSIONS
Our review of practical, social, and emotional intelligence highlights that these three constructs
share remarkable similarities. Specifically, we see at least three parallels. First, the origins and
rationale behind each of the constructs can be summarized as “going beyond g.” Cognitively
oriented measures of ability and achievement have been traditionally used in employment and
educational contexts. However, at the same time, there has always been substantial interest in
exploring possible supplemental (“alternative”) predictors for broadening the constructs measured
and reducing possible adverse impact. Supplementing cognitive with alternative predictors is seen
as a mechanism for accomplishing this (Sackett, Schmitt, Ellingson, & Kabin, 2001). Whereas
social intelligence is the oldest construct, practical intelligence came into fashion at the end of
the 1980s. Since Goleman’s (1995) book, emotional intelligence has been the newest fashion. Each time, the construct was introduced as a panacea for the problem of an exclusive reliance on g. We agree
that there is a need to go beyond g and identify new and non-g constructs, but a new construct
has little scientific explanatory value or practical utility if it is defined solely by negation (i.e., as non-g).
Hence, good construct validity evidence for the three constructs is needed and the current state
of research indicates to us that more rigorous construct validation studies are needed. Second, the
conceptualizations of these three constructs have salient parallels. Each of these three constructs
has various definitions, is multidimensional, and there exists debate about their different dimen-
sions. Third, for each of these constructs, investigations of incremental validity over and above
more established constructs, such as cognitive ability and personality, have been the focus of
debate and research.
So, are there conceptual differences between the three constructs? According to Landy (2005),
emotional intelligence as a so-called new construct has simply replaced the older notion of social
intelligence. Similarly, Bowman et al. (2001) posited that “it is not certain to what extent tacit knowl-
edge, social, and EQ measures are structurally independent” (p. 148). Although our review shows
that these three constructs are definitely overlapping, it is possible to make at least some subtle
distinctions. On the one hand, emotional intelligence might be somewhat narrower than social intel-
ligence because it focuses on emotional problems embedded in social problems (Mayer & Salovey,
1993). That is probably why Salovey and Mayer (1990) defined emotional intelligence as a subset of
social intelligence (p. 189). Conversely, one might also posit that emotional intelligence is broader
than social intelligence because internal regulatory processes/emotions are also taken into account,
something that is not the case in social intelligence. Practical intelligence with its emphasis on
real-world problems is more distinct than the other two constructs because it makes no reference to
interpersonal skills (Austin & Saklofske, 2005). Domain specificity is another aspect of tacit knowledge that contrasts with the more generic nature of social and emotional intelligence. In any case, these
conceptual distinctions are open to investigation because no study has explicitly examined the three
constructs together (Weis & Süss, 2005).
MEASUREMENT APPROACHES
In the previous section, we showed that the conceptual debate around practical, social, and emotional
intelligence shared many parallels. The same can be said about the measurement approaches used
because the similarities in how practical intelligence, social intelligence, and emotional intelligence
are measured are striking. Generally, six measurement approaches might be distinguished: (a) self-
reports, (b) other-reports, (c) interviews, (d) tests, (e) SJTs, and (f) assessment center exercises. The
following sections discuss each of these approaches including their advantages and disadvantages.
Examples of instruments are also given, and these are summarized in Table 16.1.
SELF-REPORTS
The self-report approach presents respondents with descriptive statements and asks them to use a rating scale to indicate the extent to which they agree or disagree with each statement. An important advantage of self-report measures is that they can be administered inexpensively and quickly to large groups of respondents.
Examples of the self-report approach are many. In fact, most examples of self-report emotional
intelligence measures are based on the popular mixed model approach to emotional intelligence.
Examples are the Emotional Competence Inventory (ECI; Sala, 2002), the Trait Meta-Mood Scale
(TMMS; Salovey, Mayer, Goldman, Turvey, & Palfai, 1995), EQ-I (Bar-On, 1997), and the Trait
Emotional Intelligence Questionnaire (TEIQue; Petrides & Furnham, 2003). Other emotional
TABLE 16.1
Overview of Methods (Including Some Examples) for Measuring
Practical, Emotional, and Social Intelligence
Self-reports
• Ability emotional intelligence model: WLEIS; SREIT; MEIA; SUEIT
• Trait emotional intelligence model: EQ-I; ECI; TMMS; TEIQue
• Practical intelligence: self-reports of people’s behavior in everyday situations
• Social intelligence: social skills inventories

Other-reports
• Ability emotional intelligence model: same scales as self-reports; Workgroup Emotional Intelligence Profile
• Trait emotional intelligence model: same scales as self-reports
• Practical intelligence: other-reports of people’s behavior in everyday situations
• Social intelligence: same scales as self-reports

Performance-based tests
• Ability emotional intelligence model: MSCEIT; DANVA2; PONS; JACBART; EARS; VOCAL-I; MSFDE
• Trait emotional intelligence model: no known examples
• Practical intelligence: basic skills tests
• Social intelligence: LEAS; IPT-15; four-/six-factor tests of social intelligence

Interviews
• Ability emotional intelligence model: interview rating on components of the four-branch model of Mayer, Salovey, and Caruso
• Trait emotional intelligence model: interview rating on mixed model emotional intelligence competencies (interpersonal sensitivity, stress tolerance, etc.)
• Practical intelligence: interview rating on people’s reported behavior in everyday situations
• Social intelligence: interview rating on applied social skills

SJTs
• Ability emotional intelligence model: STEU; STEM
• Trait emotional intelligence model: SJTs that aim to measure mixed model emotional intelligence competencies
• Practical intelligence: Tacit Knowledge Inventories
• Social intelligence: George Washington Social Intelligence Test (Judgment in Social Situations subtest)

ACs
• Ability emotional intelligence model: AC rating on components of the four-branch model of Mayer, Salovey, and Caruso
• Trait emotional intelligence model: AC rating on mixed model emotional intelligence competencies
• Practical intelligence: case situational problems
• Social intelligence: AC rating on applied social skills
intelligence measures are based on the four-branch model (or its predecessors) (Salovey & Mayer,
1990) but use a self-report methodology (instead of performance-based tests) for measuring it.
Examples are the Wong Law Emotional Intelligence Scale (WLEIS; Law et al., 2004; Wong &
Law, 2002), the Multidimensional Emotional Intelligence Assessment (MEIA; Tett et al., 2005),
the Swinburne University Emotional Intelligence Test (SUEIT; Palmer & Stough, 2001), or the
Schutte Self-Report Emotional Intelligence Test (SREIT; Schutte et al., 1998). We refer to Pérez,
Petrides, and Furnham (2005) for a comprehensive list of trait EQ measures. There are also self-
report inventories of social intelligence/social skills (e.g., Ferris et al., 2001; Riggio, 1986; Schneider
et al., 1996). We are not aware of self-report instruments (excluding SJTs as self-report measures)
that assess tacit knowledge.
In the personality domain, there is a long history of using self-report measures and an equally
long debate over its use. Clearly, the debate and issues concerning the use of self-report measures
in personality research are readily generalizable to the use of self-report measures in assessing social
and emotional intelligence. A detailed review of the pros and cons of self-report measures is beyond
the scope of this chapter. Suffice it to say that self-report data are by no means perfect and are in prin-
ciple susceptible to various validity problems such as faking and inflation of correlations because
of common method variance. However, it is noteworthy that the severity of many of the purported
problems of self-report data may be overstated (for details, see Chan, 2008).
OTHER-REPORTS
Other-reports (or informant reports) have also been used for measuring emotional and social intel-
ligence. One reason is that knowledgeable others might provide less lenient and more reliable
measurement. Another reason is that multidimensional constructs such as emotional and social
intelligence inherently have an important interpersonal component. Hence, it makes sense that in
other-reports the same emotional and social intelligence scales as listed above are used, with others
(peers, colleagues, teachers, parents, friends) now rating the focal person on descriptive statements.
For example, peers or supervisors can also complete the ECI of Goleman. There also exist emo-
tional intelligence measures that were specifically developed for use in team settings. For instance,
Jordan, Ashkanasy, Hartel, and Hooper (2002) developed a specific work group emotional intel-
ligence measure, namely the Workgroup Emotional Intelligence Profile.
Although there exists much research supporting the use of peers in the personality domain
(e.g., Borkenau & Liebler, 1993; Funder, 1987; Kenny, 1991), research with other-based emotional
intelligence measures is relatively scarce. Van der Zee, Thijs, and Schakel (2002) confirmed that
peer ratings of emotional intelligence were more reliable than self-ratings. However, they also found that these peer
ratings suffered from leniency. Law et al. (2004) reported that peer-reports of a trait-based emotional
intelligence measure had substantial incremental validity over self-reports of emotional intelligence and
personality. So, it seems beneficial to use peers for mixed model emotional intelligence measures.
PERFORMANCE-BASED TESTS
Whereas both self-reports and peer-reports are assumed to be measures of typical performance,
performance-based tests are posited to measure maximal performance. The rationale behind these
tests parallels the one behind cognitive ability tests because these tests present people with social
or emotion-based problem solving items. For example, in popular tests of emotion perception, indi-
viduals are presented with faces, voices, or pictures and are then asked to describe the emotions.
Historically, performance-based tests have been used for measuring social intelligence. An often-
cited example is O'Sullivan, Guilford, & deMille’s (1965) tests of social intelligence (see Landy,
2006, for other older examples). A more modern example is the Levels of Emotional Awareness
Scale (LEAS; Lane, Quinlan, Schwartz, Walker, & Zeitlin, 1990), although this test has also been
used as a measure of emotional intelligence (e.g., Barchard, 2003). Similarly, the Interpersonal
Perception Task-15 (IPT-15; Costanzo & Archer, 1993) is a performance-based measure that pres-
ents videotapes to participants.
Recently, these tests have experienced a renaissance in the context of the ability model of emo-
tional intelligence, with the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) as the
best-known example. Other well-known examples are the Japanese and Caucasian Brief Affect
Recognition Test (JACBART; Matsumoto et al., 2000), the Diagnostic Analysis of Nonverbal
Accuracy (DANVA2; Nowicki, 2004), the Profile of Nonverbal Sensitivity (PONS; Rosenthal, Hall,
DiMatteo, Rogers, & Archer, 1979), the Emotional Accuracy Scale (EARS; Mayer & Geher, 1996),
the Montreal Set of Facial Displays of Emotion (MSFDE; Beaupré, Cheung, & Hess, 2000), and the
Index of Vocal Emotion Recognition (Vocal-I; Scherer, Banse, & Wallbott, 2001).
As noted by Spector and Johnson (2006), there is a difference between knowledge about emotions and the actual skill: knowing how to regulate one’s emotions in the face of problems does not guarantee that one will do so in an actual context. With regard to practical intelligence, this
problem has been circumvented by using basic skills tests (Diehl, Willis, & Schaie, 1995). These tests
measure, among others, the ability to perform daily tasks such as cooking or using a bus schedule.
Scoring constitutes another problem of performance-based tests. In contrast to cognitive ability
tests, emotional intelligence tests using the ability model, for instance, do not have objectively
correct answers.
INTERVIEWS
Interviews constitute another possible method for measuring practical, social, and emotional intel-
ligence. In the past, especially social skills (social intelligence) have been frequently measured
in interviews. This is demonstrated by the meta-analysis of Huffcutt, Conway, Roth, and Stone
(2001), who reviewed the type of constructs most frequently targeted by interviews in 47 studies.
Specifically, social skills were measured in 27.8% of the interviews. Moreover, applied skills were
twice as frequently rated in high-structure interviews (behavior description interviews and situ-
ational interviews) as compared with low-structure interviews (34.1% vs. 17.7%).
Essentially, interviews are measurement methods that can be used to assess a wide variety of con-
structs. On the basis of multiple job-related questions, interviewees are asked to describe behavior that
is relevant for constructs deemed important. Therefore, interviews could also be used for measuring
practical intelligence (Fox & Spector, 2000) and emotional intelligence (mixed model; Schmit, 2006).
Schmit noted how interview questions can elicit descriptions of situations in which interviewees had
to recognize the emotions of others and of how they dealt with them. Yet, in interviews, observable samples of
behavior can be observed only for specific dimensions (e.g., interpersonal skills or oral communication
skills; Van Iddekinge, Raymark, & Roth, 2005). For other dimensions, candidates report past behavior
(in behavior description interviews) or intended behavior (in situational interviews).
SJTS
SJTs might be another approach for measuring practical, social, and emotional intelligence (Chan,
2000, 2006; O’Sullivan, 2007; Schulze, Wilhelm, & Kyllonen, 2007). SJTs are measurement meth-
ods that present respondents with job-related situations and sets of alternate courses of action to
these situations. For each situation, respondents either select the best and worst options or rate each
of the alternative actions in terms of its effectiveness. Because respondents have to respond to real-
istic (written and especially video-based) scenarios, SJTs might constitute a more contextualized
(ecologically valid) way of measuring practical, social, and emotional intelligence. This judgment
in a realistic context contrasts to the decontextualized nature of standardized tests. Technological
advancements make it possible to develop interactive SJTs that present different video fragments on
the basis of responses to earlier video fragments. This allows the SJT to simulate the dynamics of
interaction. Similar to emotional intelligence tests (ability model), multiple-choice SJTs are scored
using expert (excellent employees) or empirical (large pilot samples) grounds.
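The expert-based scoring mentioned above can be made concrete: experts (e.g., excellent employees) rate the effectiveness of each response option, and a respondent's item score is the mean expert rating of the option he or she endorses. The ratings and options below are invented purely for illustration.

```python
import numpy as np

# Hypothetical SJT item with four response options (A-D). Each expert rated
# every option's effectiveness on a 1-7 scale.
expert_ratings = np.array([
    [6, 7, 6, 5],   # expert 1's ratings for options A-D
    [7, 6, 5, 2],   # expert 2
    [6, 6, 4, 3],   # expert 3
], dtype=float)
scoring_key = expert_ratings.mean(axis=0)   # expert-consensus key per option

def score_responses(chosen_options):
    """Score each respondent's chosen option (index 0-3) against the key."""
    return scoring_key[np.asarray(chosen_options)]

scores = score_responses([0, 3, 1])  # three respondents' choices
print(scores)
```

Empirical keying works the same way mechanically, except the key is derived from response distributions in a large pilot sample rather than from expert judgments.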
Over the years, SJTs have been developed for measuring each of the three constructs. First, as
noted by McDaniel, Morgeson, Finnegan, Campion, and Braverman (2001), the first SJTs were
social intelligence tests, namely the Judgment in Social Situations subtest of the George Washington
Social Intelligence Test. Second, instruments very similar to SJTs are used under the label “tacit
knowledge tests” for measuring practical intelligence (Sternberg et al., 1995). Examples are the Tacit-
Knowledge Inventory for Managers or the Tacit-Knowledge Inventory for Military Leaders. Third,
recent research has explored the use of SJTs for measuring two branches of Mayer and Salovey’s
emotional intelligence model. Specifically, MacCann and Roberts (2008) developed the Situational
Test of Emotional Understanding (STEU) and the Situational Test of Emotion Management (STEM).
There have also been recent attempts to develop video-based SJTs for measuring emotional intel-
ligence (Bedwell & Chuah, 2007).
SJTs are also referred to as low-fidelity simulations. Although they aim to provide a more eco-
logically valid approach for measuring practical, social, and emotional intelligence, they do not
require candidates to actually show how they would handle a specific situation. Candidates have
to pick the “correct” answer from a limited set of predetermined response options. Nevertheless,
a meta-analysis of McDaniel et al. (2001) found a corrected correlation between SJTs and job
performance in employment settings of .34. In addition, recent research (Chan & Schmitt, 2002;
McDaniel, Hartman, Whetzel, & Grubb, 2007) provided evidence for the incremental validity of
SJTs in predicting job performance over and above the prediction provided by cognitive ability and
personality. Other validity research also found that video-based SJTs are more valid than written
ones (Lievens & Sackett, 2006).
An interesting aspect of SJTs is that differences in mean SJT scores between racial subgroups
are typically smaller than those reported for cognitive ability tests. The meta-analysis of Nguyen,
Biderman, and McDaniel (2005) found a difference in mean SJT scores between Whites and Blacks
of about .30 SD in favor of White candidates, which is much smaller than the 1.00 SD typically
found for cognitive ability tests (Jensen, 1998). A key determinant of whether SJTs show adverse
impact is the correlation of SJTs with cognitive ability. Yet, it should be noted that the lower reli-
ability of SJTs might also partially explain the lower subgroup differences found.
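The reliability point follows from the standard attenuation result: an observed standardized subgroup difference shrinks by roughly the square root of the measure's reliability. The d values and reliability coefficients below are illustrative only, not estimates from the cited meta-analyses.

```python
import math

def cohens_d(mean_a, mean_b, pooled_sd):
    """Standardized mean difference between two subgroups."""
    return (mean_a - mean_b) / pooled_sd

def observed_d(true_d, reliability):
    """Observed subgroup difference after attenuation by unreliability:
    d_observed = d_true * sqrt(reliability)."""
    return true_d * math.sqrt(reliability)

# Illustrative comparison: the same true 0.45 SD difference measured with an
# SJT of modest reliability (.60) versus a highly reliable test (.90).
print(observed_d(0.45, 0.60))  # smaller observed gap for the less reliable SJT
print(observed_d(0.45, 0.90))
```

The comparison shows why a less reliable SJT can display a smaller observed subgroup gap even when the underlying difference is identical.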
SJTs are inherently multidimensional because SJT items may refer to a range of situations and
include different types of content to which applicants attend when making a decision. In addition,
responses to SJT items with multiple options are the result of a combination of ability, experi-
ence, and personality (McDaniel et al., 2001; McDaniel & Whetzel, 2005). The multidimensional
nature of SJTs often makes it difficult to assess exactly what they measure. For instance, factor
analytic research on SJTs typically reveals a plethora of factors that are difficult to interpret (Chan &
Schmitt, 2005).
CONCLUSIONS
Our review of measurement approaches suggests parallels in how the three constructs are measured.
Although it is often thought that the three constructs are primarily measured with self-reports and
performance tests, this section highlighted that a wide array of other options is available.
Specifically, interviews, peer-reports, and instruments with somewhat more fidelity (e.g., SJTs and
AC exercises) are viable measurement approaches. Future research should explore these alternative
measurement methods.
FIGURE 16.1 Conceptual framework for examining practical, emotional, and social intelligence. (Adapted
from Chan, D., & Schmitt, N., Situational judgment tests, in A. Evers, O. Smit-Voskuijl, & N. Anderson, Eds.,
Handbook of personnel selection, 219–242, Blackwell, Oxford, United Kingdom, 2005.)
The relationships linking each type of intelligence and the various individual difference constructs explain the consistent findings from factor analytic studies that the intelligence measure is factorially complex and that the data from the measure do not fit a single-factor model well. These relationships also
explain the significant and sometimes substantial correlations between the intelligence measure and
the established measures of traditional KSAOs such as cognitive ability and personality traits. In
addition, these relationships provide the conceptual bases for examining ability models, trait mod-
els, and mixed models of emotional (as well as practical or social) intelligence.
The findings on the substantial zero-order validities and incremental validities of practical
intelligence in predicting job performance over the prediction provided by cognitive ability and
personality traits (e.g., Chan & Schmitt, 2002) are consistent with the proximal status of practical
intelligence competencies (relative to the distal status of KSAOs) in the prediction of job perfor-
mance. Similarly, the proximal status of emotional and social intelligence also explains the find-
ings from studies that showed zero-order and incremental validities of these intelligence measures
in the prediction of job performance and other criteria (for meta-analytic review of studies, see
Druskat & Jordan, 2007). Interestingly, Figure 16.1 may also explain why SJTs and ACs, which
are multidimensional measures, do better than factorially pure measures of single unitary con-
structs (e.g., cognitive ability, personality) in predicting job-relevant performance criteria, which
are often multidimensional in nature. That is, much of what SJTs and ACs are assessing may well
be multidimensional competencies similar, if not identical, to practical, emotional, and social
intelligence.
We believe the conceptual framework in Figure 16.1 is consistent with existing findings and
reconciles much of the debate on the validity of practical, emotional, and social intelligence,
but more direct empirical support of the framework is certainly needed. We reiterate the call in
Chan and Schmitt (2005) that to obtain more direct evidence for a framework that construes the
intelligence competencies as multidimensional mediators in the relationship between KSAOs and
job performance (and other criteria), we would need to specify and test hypothesized and alternative
structural equation models (on the basis of primary data from a single study or cumulation of
results from past studies using meta-analyses) linking KSAOs, intelligence competencies, and job
performance or other criterion outcomes. Future research could derive theory-driven specific models
from the general framework depicted in Figure 16.1 to empirically examine the validity of one or
more of the three intelligence constructs that would facilitate the interpretation of the correlations
between the intelligence construct and more established individual difference KSAOs, as well as the
zero-order and incremental validities of the intelligence construct in predicting different criterion
outcomes. In the following section, we suggest various strategies for formulating theory-driven
testable models that are likely to advance research in ways that make conceptual and practical
contributions to the study of practical, emotional, and social intelligence.
An important distinction in this regard is that between constructs and methods (e.g., Bobko, Roth, & Potosky, 1999; Chan & Schmitt, 1997, 2005; Lievens, Harris, Van Keer, & Bisqueret,
2003). Constructs refer to the substantive conceptual variables (e.g., conscientiousness, cognitive
ability, finger dexterity, field dependence-independence, reaction time, visual attention, emotional
intelligence) that the measures were designed to assess. Conversely, methods refer to the tests,
techniques, or procedures (e.g., paper-and-pencil tests, computer-administered tests, video-based
tests, interviews, ACs, self-reports, peer reports) used to assess the intended constructs. This
distinction between constructs and methods is especially crucial for multidimensional predictors
(Bobko et al., 1999). Conceptual and methodological issues of variance partition associated with the
construct-method distinction and their applications to constructs such as practical intelligence are
available in Chan and Schmitt (2005).
Given the multidimensional nature of practical, social, and emotional intelligence, clarity of the
method-construct distinction is critical. As shown in Table 16.1, practical, social, and emotional
intelligence might be measured in multiple ways. As noted above, recent research on social intelligence has adopted such multitrait-multimethod designs and cleared up some of the confusion around
this construct. For example, social intelligence constructs (e.g., social understanding, memory, and
knowledge) were operationalized in a multitrait-multimethod design applying verbal, pictorial, and
video-based performance measures.
A similar strategy could be followed for clarifying some of the confusion related to emotional
intelligence. So far, research mainly compared self-reports of ability-based emotional intelligence
or mixed model emotional intelligence to personality inventories (see Roberts et al., 2006, for
an exception). However, many more strategies are possible. One possibility is to operationalize a
specific branch of the emotional intelligence ability model via different measurement approaches
(Wilhelm, 2005). For example, the emotion understanding branch of the ability model might be
measured via the MSCEIT and an SJT. Similarly, the emotion perception branch might be mea-
sured via faces, pictures, movies, voices, etc. As another example, people might complete an ability
emotional intelligence test, they might provide self-reports of their emotional intelligence, and they
might be rated by trained assessors on emotional intelligence (or conceptually similar competencies
such as interpersonal sensitivity) in AC exercises. Such research designs (see also Landy, 2006)
focus on convergent validity and enable one to answer key questions such as the following:
Identifying mediators would also suggest interventions that influence individuals’ scores on the criteria (by understanding what might affect
the mediator). Research could attempt to explicate the precise affective, cognitive, motivational, and
behavioral mechanisms that mediate the effects of practical, emotional, or social intelligence on the
criterion and directly measure and test these hypothesized mediation mechanisms. For example,
cognitions and motivations (expectancy and instrumentality beliefs), or more subtle mediators (like-
ability), may mediate the intelligence effects on criteria such as job satisfaction and performance.
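The mediation logic sketched here corresponds to the standard two-equation model (a generic illustration, not the authors' notation; X denotes the intelligence measure, M the hypothesized mediator such as likeability, and Y the criterion):

```latex
\begin{aligned}
M &= a_0 + aX + e_M \\
Y &= b_0 + c'X + bM + e_Y
\end{aligned}
```

The indirect (mediated) effect is the product ab, and the total effect decomposes as c = c' + ab; a direct test of a hypothesized mechanism is a test of whether ab differs from zero.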
When an intelligence construct interacts with another predictor (e.g., personality trait) to affect
the criterion, the interaction effect is mathematically equivalent whether we select intelligence or
the other predictor as the moderator. However, conceptually, the predictor selected as the moderator
reflects different research questions. Identifying moderators that affect the magnitude and even nature
of the relationship between the intelligence and criterion constructs is important as the moderator
effect clarifies the range and boundary conditions of the predictive validity of the intelligence
construct. There has been increasing research examining moderator effects in the predictive validity
of personality traits (e.g., Barrick, Parks, & Mount, 2005). However, in the domain of practical,
emotional, and social intelligence, research on moderator effects on their validity is still scarce. For
instance, Côté and Miners (2006) found that emotional intelligence was linked to task performance
and organizational citizenship behavior (OCB) toward the organization only for people low on
cognitive ability. Another rare example is Ferris et al. (2001), who reported that the relationship
between social intelligence and job performance was stronger among workers high rather than low
in cognitive ability. On the other hand, when the intelligence construct is the moderator affecting
the relationship between another predictor and the criterion, the importance of the intelligence
construct is demonstrated not in terms of its bivariate predictive validity for the criterion, but in terms
of its role in determining the range and boundary conditions of the bivariate predictive validity
of another predictor. Several studies have demonstrated important moderator roles of practical,
emotional, and social intelligence constructs. For example, Witt and Ferris (2003) found that the
conscientiousness-performance relationship was moderated by social intelligence in that high
levels of conscientiousness together with poor social intelligence led to lower performance. Chan
(2006) found that proactive personality predicts work perceptions (procedural justice perception,
perceived supervisor support, social integration) and work outcomes (job satisfaction, affective
organizational commitment, job performance) positively among individuals with high practical
intelligence (construed in terms of situational judgment effectiveness) but negatively among those
with low practical intelligence. The findings on the disordinal interaction effects show that high
levels of proactive personality may be either adaptive or maladaptive depending on the individual’s
level of practical intelligence and caution against direct interpretations of bivariate associations
between proactive personality and work-relevant criteria. In short, fruitful future research could be
conducted by adopting a strategy that goes beyond bivariate relationships to examine the mediators
that link the intelligence construct to the criterion construct, the moderators that affect the nature
of the intelligence-criterion relationship, and the role of the intelligence construct as a moderator
affecting the nature of a predictor-criterion relationship.
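The mathematical equivalence noted above follows from the standard moderated regression model (a generic sketch; X is the intelligence measure, Z the other predictor, and Y the criterion):

```latex
Y = b_0 + b_1 X + b_2 Z + b_3 XZ + e
```

The interaction coefficient b_3 is symmetric in X and Z, so the same estimate is obtained whichever variable is labeled the moderator; the conceptual choice lies in whether one examines the simple slope of X at levels of Z, (b_1 + b_3 Z), or the simple slope of Z at levels of X, (b_2 + b_3 X). A disordinal interaction such as Chan's (2006) arises when a simple slope changes sign across the observed range of the moderator.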
The transitional job stage, in which employees must learn new things, is typically contrasted with the more routine maintenance job stage (Murphy, 1989). For
instance, Thoresen, Bradley, Bliese, and Thoresen (2004) found that openness was related to per-
formance and performance trends in the transition stage but not to performance at the maintenance
stage. As another example, Jansen and Stoop (2001) discovered that the AC dimension of interper-
sonal effectiveness showed validity only after several years on the job.
We believe that future studies on practical, social, and emotional intelligence should also adopt a
longitudinal design where possible. As with personality, the validity of these intelligence constructs
for predicting job performance might well differ over time. For example, the transitional job stage
typically involves more adaptive demands than the routine maintenance job stage, so practical
intelligence might predict job performance more strongly in the transitional stage than in the
maintenance stage.
A construct-oriented approach to the study of practical, emotional, and social intelligence that
locates the constructs in the framework presented in Figure 16.1 would provide the conceptual basis
to hypothesize, test, and interpret performance changes over time. Using appropriate longitudinal
designs and change assessment techniques allows one to draw practical implications for key issues
such as changes in test validities, changes in mean performance, changes in rank order of individu-
als’ performance, and changes in dimensionality (i.e., number and/or nature of dimensions) of per-
formance (see Chan, 1998a, 2005a).
EPILOGUE
We have, under the constraint of a relatively short chapter length, critically reviewed the vast litera-
ture on practical, emotional, and social intelligence constructs. We have proposed a conceptual
Practical Intelligence, Emotional Intelligence, and Social Intelligence 355
framework, adapted from Chan and Schmitt (2005), which provides a way to organize the conceptu-
alizations of the intelligence constructs and their relationships with other individual difference and
criterion constructs. We believe that this framework also reconciles some, if not most, of the findings
and debates in the literature on the intelligence constructs. Finally, by explicating several strategies
for future research, we hope that more scientifically rigorous studies will be conducted on practical,
emotional, and social intelligence, providing practitioners in personnel selection and other HR
functions with a stronger evidence base for the use of these intelligence constructs and measures.
REFERENCES
Amelang, M., & Steinmayr, R. (2006). Is there a validity increment for tests of emotional intelligence in
explaining the variance of performance criteria? Intelligence, 34, 459–468.
Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity
of assessment center dimensions. Personnel Psychology, 56, 125–154.
Arthur, W., & Villado, A. (2008). The importance of distinguishing between constructs and methods when
comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93,
435–442.
Austin, E. J., & Saklofske, D. H. (2005). Far too many intelligences? On the communalities and differences
between social, practical, and emotional intelligences. In R. Schulze & R. D. Roberts (Eds.), Emotional
intelligence—An international handbook (pp. 107–128). Cambridge, MA: Hogrefe & Huber.
Barchard, K. A. (2003). Does emotional intelligence assist in the prediction of academic success? Educational
and Psychological Measurement, 63, 840–858.
Bar-On, R. (1997). Bar-On emotional quotient inventory: A measure of emotional intelligence. Toronto,
Ontario: Multi-Health Systems.
Barrick, M. R., Parks, L., & Mount, M. K. (2005). Self-monitoring as a moderator of the relationships between
personality traits and performance. Personnel Psychology, 58, 745–767.
Beaupré, M. G., Cheung, N., & Hess, U. (2000). The Montreal Set of Facial Displays of Emotion (Slides).
Available from Ursula Hess, Department of Psychology, University of Quebec at Montreal, P.O. Box
8888, Station “Centre-ville,” Montreal, Quebec H3C 3P8.
Bedwell, S., & Chuah, S. C. (2007, April). Video-based assessment of emotion perception: Toward high fidelity.
Paper presented at the 22nd Annual Conference of the Society for Industrial and Organizational
Psychology, New York, NY.
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporat-
ing cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–589.
Borkenau, P., & Liebler, A. (1993). Convergence of stranger ratings of personality and intelligence with self-
ratings, partner ratings, and measures of intelligence. Journal of Personality and Social Psychology, 65,
546–553.
Bowman, D. B., Markham, P. M., & Roberts, R. D. (2001). Expanding the frontier of human cognitive abilities:
So much more than (plain) g! Learning and Individual Differences, 13, 127–158.
Boyatzis, R. E., & Sala, F. (2004). Assessing emotional intelligence competencies. In G. Geher (Ed.), The
measurement of emotional intelligence: Common ground and controversy (pp. 147–180). Hauppauge,
NY: Nova Science.
Brackett, M. A., & Mayer, J. D. (2003). Convergent, discriminant, and incremental validity of competing mea-
sures of emotional intelligence. Personality and Social Psychology Bulletin, 29, 1147–1158.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt &
W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-
Bass.
Chan, D. (1998a). The conceptualization of change over time: An integrative approach incorporating longitu-
dinal means and covariance structures analysis (LMACS) and multiple indicator latent growth modeling
(MLGM). Organizational Research Methods, 1, 421–483.
Chan, D. (1998b). Functional relations among constructs in the same content domain at different levels of
analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.
Chan, D. (2000). Understanding adaptation to changes in the work environment: Integrating individual difference
and learning perspectives. Research in Personnel and Human Resources Management, 18, 1–42.
Chan, D. (2005a). Current directions in personnel selection. Current Directions in Psychological Science, 14,
220–223.
356 Handbook of Employee Selection
Chan, D. (2005b). Multilevel research. In F. T. L. Leong & J. T. Austin (Eds.), The psychology research hand-
book (2nd ed., pp. 401–418). Thousand Oaks, CA: Sage.
Chan, D. (2006). Interactive effects of situational judgment effectiveness and proactive personality on work
perceptions and work outcomes. Journal of Applied Psychology, 91, 475–481.
Chan, D. (2008). So why ask me?—Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg
(Eds.), Statistical and methodological myths and urban legends: Received doctrine, verity, and fable in
the organizational and social sciences (pp. 309–336). Hillsdale, NJ: Lawrence Erlbaum.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judg-
ment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied
Psychology, 82, 143–159.
Chan, D., & Schmitt, N. (2002). Situational judgment and job performance. Human Performance, 15, 233–254.
Chan, D., & Schmitt, N. (2005). Situational judgment tests. In A. Evers, O. Smit-Voskuijl, & N. Anderson
(Eds.), Handbook of personnel selection (pp. 219–242). Oxford, England: Blackwell.
Costanzo, M., & Archer, D. (1993). The Interpersonal Perception Task-15 (Videotape). Berkeley, CA: University
of California Extension Media Center.
Côté, S., & Miners, C. (2006). Emotional intelligence, cognitive intelligence, and job performance.
Administrative Science Quarterly, 51, 1–28.
Daus, C. S., & Ashkanasy, N. M. (2005). The case for the ability-based model of emotional intelligence in
organizational behavior. Journal of Organizational Behavior, 26, 453–466.
Davies, M., Stankov, L., & Roberts, R. D. (1998). Emotional intelligence: In search of an elusive construct.
Journal of Personality and Social Psychology, 75, 989–1015.
De Raad, B. (2005). The trait-coverage of emotional intelligence. Personality and Individual Differences, 38,
673–687.
Diehl, M., Willis, S. L., & Schaie, K. W. (1995). Everyday problem-solving in older adults observational
assessment and cognitive correlates. Psychology and Aging, 10, 478–491.
Druskat, V., & Jordan, P. (2007). Emotional intelligence and performance at work. Manuscript submitted for
publication.
Ferris, G. R., Perrewé, P. M., & Douglas, C. (2002). Social effectiveness in organizations: Construct validity
and research directions. Journal of Leadership & Organizational Studies, 9, 49–63.
Ferris, G. R., Witt, L. A., & Hochwarter, W. A. (2001). Interaction of social skill and general mental ability on
job performance and salary. Journal of Applied Psychology, 86, 1075–1082.
Ford, M. E., & Tisak, M. S. (1983). A further search for social intelligence. Journal of Educational Psychology,
75, 196–206.
Fox, S., & Spector, P. E. (2000). Relations of emotional intelligence, practical intelligence, general intelligence,
and trait affectivity with interview outcomes: It’s not all just ‘G’. Journal of Organizational Behavior,
21, 203–220.
Funder, D. C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin,
101, 75–90.
Gohm, C. L. (2004). Moving forward with emotional intelligence. Psychological Inquiry, 15, 222–227.
Goleman, D. (1995). Emotional intelligence: Why it can matter more than IQ. New York, NY: Bantam.
Gottfredson, L. S. (2003). Dissecting practical intelligence theory: Its claims and evidence. Intelligence, 31,
343–397.
Gowing, M. K. (2001). Measures of individual emotional competencies. In C. Cherniss & D. Goleman (Eds.),
The emotionally intelligent workplace (pp. 83–131). San Francisco, CA: Jossey-Bass.
Grubb, W. L., & McDaniel, M. A. (2007). The fakability of Bar-On’s Emotional Quotient Inventory Short
Form: Catch me if you can. Human Performance, 20, 43–59.
Hedlund, J., Forsythe, G. B., Horvath, J. A., Williams, W. M., Snook, S., & Sternberg, R. J. (2003). Identifying
and assessing tacit knowledge: Understanding the practical intelligence of military leaders. Leadership
Quarterly, 14, 117–140.
Hedlund, J., Wilt, J. M., Nebel, K. L., Ashford, S. J., & Sternberg, R. J. (2006). Assessing practical intelligence
in business school admissions: A supplement to the graduate management admissions test. Learning and
Individual Differences, 16, 101–127.
Hoepfner, R., & O’Sullivan, M. (1968). Social intelligence and IQ. Educational and Psychological Measurement,
28, 339–344.
Hogan, R., & Shelton, D. (1998). A socioanalytic perspective on job performance. Human Performance, 11, 129–144.
Huffcutt, A. I., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment
of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86,
897–913.
Jaeger, A. J. (2003). Job competencies and the curriculum: An inquiry into emotional intelligence in graduate
professional education. Research in Higher Education, 44, 615–639.
Jansen, P. G. W., & Stoop, B. A. M. (2001). The dynamics of assessment center validity: Results of a 7-year
study. Journal of Applied Psychology, 86, 741–753.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Jones, K., & Day, J. D. (1997). Discrimination of two aspects of cognitive-social intelligence from academic
intelligence. Journal of Educational Psychology, 89, 486–497.
Jordan, P. J., Ashkanasy, N. M., Hartel, C. E., & Hooper, G. S. (2002). Workgroup emotional intelligence.
Scale development and relationship to team process effectiveness and goal focus. Human Resource
Management Review, 12, 195–214.
Keating, D. P. (1978). Search for social intelligence. Journal of Educational Psychology, 70, 218–223.
Kenny, D. A. (1991). A general-model of consensus and accuracy in interpersonal perception. Psychological
Review, 98, 155–163.
Lance, C. E., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimen-
sion and exercise variance components in assessment center postexercise dimension ratings. Journal of
Applied Psychology, 89, 377–385.
Landy, F. J. (2005). Some historical and scientific issues related to research on emotional intelligence. Journal
of Organizational Behavior, 26, 411–424.
Landy, F. J. (2006). The long, frustrating, and fruitless search for social intelligence: A cautionary tale. In K.
R. Murphy (Ed.), A critique of emotional intelligence: What are the problems and how can they be fixed?
(pp. 81–123). Mahwah, NJ: Lawrence Erlbaum.
Lane, R. D., Quinlan, D. M., Schwartz, G. E., Walker, P. A., & Zeitlin, S. B. (1990). The levels of emotional
awareness scale—A cognitive-developmental measure of emotion. Journal of Personality Assessment,
55, 124–134.
Law, K. S., Wong, C. S., & Song, L. J. (2004). The construct and criterion validity of emotional intelligence and
its potential utility for management studies. Journal of Applied Psychology, 89, 483–496.
Legree, P. J., Heffner, T. S., Psotka, J., Martin, D. E., & Medsker, G. J. (2003). Traffic crash involvement:
Experiential driving knowledge and stressful contextual antecedents. Journal of Applied Psychology, 88,
15–26.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). The operational validity of a video-based situational judgment
test for medical college admissions: Illustrating the importance of matching predictor and criterion con-
struct domains. Journal of Applied Psychology, 90, 442–452.
Lievens, F., Harris, M. M., Van Keer, E., & Bisqueret, C. (2003). Predicting cross-cultural training perfor-
mance: The validity of personality, cognitive ability, and dimensions measured by an assessment center
and a behavior description interview. Journal of Applied Psychology, 88, 476–489.
Lievens, F., & Sackett, P.R. (2006). Video-based versus written situational judgment tests: A comparison in
terms of predictive validity. Journal of Applied Psychology, 91, 1181–1188.
Locke, E. A. (2005). Why emotional intelligence is an invalid concept. Journal of Organizational Behavior,
26, 425–431.
MacCann, C., & Roberts, R. D. (2008). New paradigms for assessing emotional intelligence: Theory and data.
Emotion, 8, 540–551.
Marlowe, H. A. (1986). Social intelligence—Evidence for multidimensionality and construct independence.
Journal of Educational Psychology, 78, 52–58.
Matsumoto, D., LeRoux, J., Wilson-Cohn, C., Raroque, J., Kooken, K., Ekman, P., et al. (2000). A new test
to measure emotion recognition ability: Matsumoto and Ekman’s Japanese and Caucasian Brief Affect
Recognition Test (JACBART). Journal of Nonverbal Behavior, 24, 179–209.
Matthews, G., Roberts, R. D., & Zeidner, M. (2004). Seven myths about emotional intelligence. Psychological
Inquiry, 15, 179–196.
Matthews, G., Zeidner, M., & Roberts, R. D. (2007). Emotional intelligence: Consensus, controversies, and
questions. In G. Matthews, M. Zeidner, & R. D. Roberts (Eds.), The science of emotional intelligence—
Knowns and unknowns (pp. 3–46). New York, NY: Oxford University Press.
Mayer, J. D., & Geher, G. (1996). Emotional intelligence and the identification of emotion. Intelligence, 22,
89–113.
Mayer, J. D., Roberts, R. D., & Barsade, S. G. (2008). Human abilities: Emotional intelligence. Annual Review
of Psychology, 59, 507–536.
Mayer, J. D., & Salovey, P. (1993). The intelligence of emotional intelligence. Intelligence, 17, 433–442.
McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L. (2007). Situational judgment tests, response
instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91.
McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situ-
ational judgment tests to predict job performance: A clarification of the literature. Journal of Applied
Psychology, 86, 730–740.
McDaniel, M. A., & Whetzel, D. L. (2005). Situational judgment test research: Informing the debate on practi-
cal intelligence theory. Intelligence, 33, 515–525.
McHenry, J. J., Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). Project A validity results—
The relationship between predictor and criterion domains. Personnel Psychology, 43, 335–354.
Murphy, K. R. (1989). Is the relationship between cognitive ability and job performance stable over time?
Human Performance, 2, 183–200.
Newsome, S., Day, A. L., & Catano, V. M. (2000). Assessing the predictive validity of emotional intelligence.
Personality and Individual Differences, 29, 1005–1016.
Nguyen, N. T., Biderman, M. D., & McDaniel, M. A. (2005). Effects of response instructions on faking a situ-
ational judgment test. International Journal of Selection and Assessment, 13, 250–260.
Nowicki, S. (2004). A manual for the Diagnostic Analysis of Nonverbal Accuracy tests (DANVA). Atlanta, GA:
Department of Psychology, Emory University.
O’Sullivan, M. (2007). Trolling for trout, trawling for tuna. In G. Matthews, M. Zeidner, & R. D. Roberts
(Eds.), The science of emotional intelligence—Knowns and unknowns (pp. 258–287). New York, NY:
Oxford University Press.
O’Sullivan, M., Guilford, J. P., & deMille, R. (1965). The measurement of social intelligence (Psychological
Laboratory Report No. 34). Los Angeles, CA: University of Southern California.
Palmer, B., & Stough, C. (2001). The measurement of emotional intelligence. Australian Journal of Psychology,
53, 85–85.
Parker, J. D., Hogan, M. J., Eastabrook, J. M., Oke, A., & Wood, L. M. (2006). Emotional intelligence and
student retention: Predicting the successful transition from high school to university. Personality and
Individual Differences, 41, 1329–1336.
Pérez, J. C., Petrides, K. V., & Furnham, A. (2005). Measuring trait emotional intelligence. In R. Schulze &
R. D. Roberts (Eds.), Emotional intelligence—An international handbook (pp. 181–201). Cambridge,
England: Hogrefe & Huber.
Petrides, K. V., & Furnham, A. (2003). Trait emotional intelligence: Behavioural validation in two studies of
emotion recognition and reactivity to mood induction. European Journal of Personality, 17, 39–57.
Riggio, R. E. (1986). Assessment of basic social skills. Journal of Personality and Social Psychology, 51, 649–660.
Roberts, R. D., Schulze, R., O’Brien, K., MacCann, C., Reid, J., & Maul, A. (2006). Exploring the validity of
the Mayer-Salovey-Caruso emotional intelligence test (MSCEIT) with established emotions measures.
Emotion, 6, 663–669.
Roberts, R. D., Schulze, R., Zeidner, M., & Matthews, G. (2005). Understanding, measuring, and applying
emotional intelligence. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—An international
handbook (pp. 311–336). Cambridge, England: Hogrefe & Huber.
Rosenthal, R., Hall, J. A., DiMatteo, M. R., Rogers, P. L., & Archer, D. (1979). Sensitivity to nonverbal com-
munication: The PONS test. Baltimore, MD: Johns Hopkins University Press.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment,
credentialing, and higher education—Prospects in a post-affirmative-action world. American
Psychologist, 56, 302–318.
Sala, F. (2002). Emotional competence inventory: Technical manual. Philadelphia, PA: McClelland Center For
Research, HayGroup.
Salovey, P., & Mayer, J. D. (1990). Emotional intelligence. Imagination, Cognition, and Personality, 9,
185–211.
Salovey, P., Mayer, J., Goldman, S., Turvey, C., & Palfai, T. (1995). Emotional attention, clarity and repair:
Exploring emotional intelligence using the Trait Meta-Mood Scale. In J. W. Pennebaker (Ed.), Emotion,
disclosure, and health (pp. 125–154). Washington, DC: American Psychological Association.
Scherer, K. R., Banse, R., & Wallbott, H. G. (2001). Emotion inferences from vocal expression correlate across
languages and cultures. Journal of Cross-Cultural Psychology, 32, 76–92.
Schmit, M. J. (2006). EI in the business world. In K. R. Murphy (Ed.), A critique of emotional intelligence:
What are the problems and how can they be fixed? Mahwah, NJ: Lawrence Erlbaum.
Schneider, R. J., Ackerman, P. L., & Kanfer, R. (1996). To “act wisely in human relations”: Exploring the
dimensions of social competence. Personality and Individual Differences, 21, 469–481.
Schulze, R., Wilhelm, O., & Kyllonen, P. C. (2007). Approaches to the assessment of emotional intelligence. In
G. Matthews, M. Zeidner, & R. D. Roberts (Eds.), The science of emotional intelligence—Knowns and
unknowns (pp. 199–229). New York, NY: Oxford University Press.
Schutte, N. S., Malouff, J. M., Hall, L. E., Haggerty, D. J., Cooper, J. T., Golden, C. J., et al. (1998). Development
and validation of a measure of emotional intelligence. Personality and Individual Differences, 25,
167–177.
Semadar, A., Robins, G., & Ferris, G. R. (2006). Comparing the validity of multiple social effectiveness
constructs in the prediction of managerial job performance. Journal of Organizational Behavior, 27,
443–461.
Spector, P. E., & Johnson, H. M. (2006). Improving the definition, measurement, and application of emotional
intelligence. In K. R. Murphy (Ed.), A critique of emotional intelligence: What are the problems and how
can they be fixed? (pp. 325–344). Mahwah, NJ: Lawrence Erlbaum.
Sternberg, R. J. (1988). The triarchic mind: A new theory of human intelligence. New York, NY: Penguin
Books.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common-sense. American
Psychologist, 50, 912–927.
Stricker, L. J., & Rock, D. A. (1990). Interpersonal competence, social intelligence, and general ability.
Personality and Individual Differences, 11, 833–839.
Tan, H. T., & Libby, R. (1997). Tacit managerial versus technical knowledge as determinants of audit expertise
in the field. Journal of Accounting Research, 35, 97–113.
Terpstra, D. E., Mohamed, A. A., & Kethley, R. B. (1999). An analysis of federal court cases involving nine
selection devices. International Journal of Selection and Assessment, 7, 26–34.
Tett, R. P., Fox, K. E., & Wang, A. (2005). Development and validation of a self-report measure of emo-
tional intelligence as a multidimensional trait domain. Personality and Social Psychology Bulletin, 31,
859–888.
Thoresen, C. J., Bradley, J. C., Bliese, P. D., & Thoresen, J. D. (2004). The big five personality traits and indi-
vidual job performance growth trajectories in maintenance and transitional job stages. Journal of Applied
Psychology, 89, 835–853.
Thorndike, E. L. (1920). Intelligence and its uses. Harper’s Magazine, 140, 227–235.
Van der Zee, K., Thijs, M., & Schakel, L. (2002). The relationship of emotional intelligence with academic
intelligence and the Big Five. European Journal of Personality, 16, 103–125.
Van Iddekinge, C. H., Raymark, P. H., & Roth, P. L. (2005). Assessing personality with a structured employ-
ment interview: Construct-related validity and susceptibility to response inflation. Journal of Applied
Psychology, 90, 536–552.
Van Rooy, D. L., Dilchert, S., Viswesvaran, C., & Ones, D. S. (2006). Multiplying intelligences: Are general,
emotional, and practical intelligences equal? In K. R. Murphy (Ed.), A critique of emotional intelligence:
What are the problems and how can they be fixed? (pp. 235–262). Mahwah, NJ: Lawrence Erlbaum.
Van Rooy, D. L., & Viswesvaran, C. (2004). Emotional intelligence: A meta-analytic investigation of predictive
validity and nomological net. Journal of Vocational Behavior, 65, 71–95.
Van Rooy, D. L., Viswesvaran, C., & Pluta, P. (2005). An evaluation of construct validity: What is this thing
called emotional intelligence? Human Performance, 18, 445–462.
Wagner, R. K. (1987). Tacit knowledge in everyday intelligent behavior. Journal of Personality and Social
Psychology, 52, 1236–1247.
Wagner, R. K., & Sternberg, R. J. (1985). Practical intelligence in real-world pursuits—The role of tacit knowl-
edge. Journal of Personality and Social Psychology, 49, 436–458.
Weis, S., & Süss, H.-M. (2005). Social intelligence—A review and critical discussion of measurement
concepts. In R. Schulze & R. D. Roberts (Eds.), Emotional intelligence—An international handbook
(pp. 203–230). Cambridge, England: Hogrefe & Huber.
Wilhelm, O. (2005). Measures of emotional intelligence: Practice and standards. In R. Schulze & R. D. Roberts
(Eds.), Emotional intelligence—An international handbook (pp. 131–154). Cambridge, England: Hogrefe
& Huber.
Witt, L. A., & Ferris, G. R. (2003). Social skill as moderator of the conscientiousness—Performance relation-
ship: Convergent results across four studies. Journal of Applied Psychology, 88, 809–820.
Wong, C. M., Day, J. D., Maxwell, S. E., & Meara, N. M. (1995). A multitrait-multimethod study of academic
and social intelligence in college-students. Journal of Educational Psychology, 87, 117–133.
Wong, C. S., & Law, K. S. (2002). The effects of leader and follower emotional intelligence on performance
and attitude: An exploratory study. Leadership Quarterly, 13, 243–274.
Zeidner, M., Matthews, G., & Roberts, R. D. (2004). Emotional intelligence in the workplace: A critical review.
Applied Psychology—an International Review, 53, 371–399.
Part 4
Decisions in Developing, Selecting,
Using, and Evaluating Predictors
Ann Marie Ryan and Neal W. Schmitt,
Section Editors
17 Decisions in Developing and Selecting Assessment Tools
INTRODUCTION
Typically, employers embrace testing for several reasons: identifying capable applicants; enhancing
the productivity of their workforce; minimizing error, waste, and accidents; complying with Equal
Employment Opportunity (EEO) regulations; and minimizing staffing costs. Once employers have
decided that formal selection procedures will be useful, they then must determine what measures
best meet the organization’s needs. This chapter reviews the decisions that must be made in
choosing selection procedures, identifies the criteria that must be considered, and examines the
potential ramifications of those decisions.
There are four basic decisions associated with developing a selection system: (a) What con-
structs should be measured in the selection system? (b) How should the chosen constructs be
measured? (c) How should validity evidence be gathered? and (d) How should the resulting
scores be used? Answers to each of these questions are dependent upon several criteria. The
choice of constructs to be measured must take into account criteria such as the feasibility of
measuring a construct; the number of knowledge, skills, abilities, and other personal charac-
teristics (KSAOs) to be measured; the utility of measuring multiple constructs; and the goals
of the organization. The decision regarding how to measure the chosen constructs depends on
factors such as the anticipated validity and utility of various kinds of tests, the appropriateness
and feasibility of different validation strategies, the potential group differences in test scores and
adverse impact, the availability of alternative selection procedures, possible legal ramifications,
applicant reactions, and a host of administrative concerns including costs of development and
administration, time requirements for development and administration, and feasibility within
the staffing context. Resolving the question of score use must take into account the form of the
test score, the ways in which test scores can be combined and weighted, and the procedures for
choosing cutoff scores and bands.
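The score-combination and banding decisions mentioned above can be sketched in code. In this illustration the predictor names, the weights, and the reliability value are hypothetical, and the standard-error-of-difference banding approach is one of several banding methods discussed in the selection literature, not a prescription from this chapter:

```python
from math import sqrt
from statistics import mean, stdev

def weighted_composite(scores, weights):
    """Combine standardized predictor scores into a single composite.
    The weights are illustrative; an organization might instead derive
    them from regression results or judged KSAO importance."""
    z = {}
    for test, values in scores.items():
        m, s = mean(values), stdev(values)
        z[test] = [(v - m) / s for v in values]
    n = len(next(iter(z.values())))
    return [sum(weights[t] * z[t][i] for t in z) for i in range(n)]

def band_lower_bound(top_score, sd, reliability, z=1.96):
    """Lower edge of a score band based on the standard error of the
    difference: scores within z * SD * sqrt(2 * (1 - rxx)) of the top
    score are treated as effectively equivalent."""
    return top_score - z * sd * sqrt(2 * (1 - reliability))

# Hypothetical example: equal weights on two standardized predictors,
# and a band below a top score of 100 with SD = 10 and rxx = .90
composite = weighted_composite(
    {"cognitive": [1.0, 2.0, 3.0], "sjt": [3.0, 2.0, 1.0]},
    {"cognitive": 0.5, "sjt": 0.5})
lower = band_lower_bound(100.0, 10.0, 0.90)
```

With a reliability of .90 and an SD of 10, scores within roughly 8.8 points of the top score fall inside the band; raising the reliability narrows the band, which is one reason the banding decision interacts with the choice of measure.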
Although each criterion will be discussed independently, none can be viewed in isolation. Nor
can the criteria be considered in strict sequence. One has an effect on another, and decisions may
need to be revisited as additional criteria are considered. Similarly, many of the criteria must be
considered simultaneously; optimization of one may have a direct effect on another. For example,
most employers cannot simply choose the test with the highest validity without consideration of the
cost and feasibility of administration in the staffing context.
Organizations pursue different goals when developing and implementing a selection procedure,
and these goals define which criteria are most relevant to the decisions that must be made. Some
common goals include minimizing costs, maximizing prediction of performance, avoiding legal
challenges, achieving a diverse workforce, and engendering positive applicant reactions toward
the job and organization. When the goal is maximizing performance, criteria related to valid-
ity may be paramount; when the goal is cost control, criteria related to costs and time may be
364 Handbook of Employee Selection
most important. Often these goals are overlapping. For example, positive applicant reactions often
increase an applicant’s propensity to remain in the application process and thus reduce the costs
of recruiting. A diverse workforce may increase the productivity of the organization or enhance
its ability to serve its customers. Avoidance of challenges to the selection system certainly reduces
overall organizational costs in the long term. In addition, the choice of one goal may heighten
the importance of another. When controlling costs takes precedence over other goals, criteria
related to the efficiency of administration may become more important and applicant reactions
less important.
The value placed on each of the goals differs by organization. Some organizations may place a
greater value on accuracy of prediction whereas others may focus on creating a diverse workforce
and engendering positive applicant reactions. Thus, an important step in any selection project is to
establish a clear understanding of the goals that the organization hopes to achieve.
The remainder of the chapter reviews the basic decisions and the criteria used in making those
decisions. It also highlights which issues are most important for the achievement of the organiza-
tion’s particular goals.
delineation between the KSAOs to be measured (that are important to job performance) and the
KSAOs required to take the test a critical part of deciding the type of test to use.
Another dimension of the issue of how much of the job domain to cover is the utility gained by
measuring multiple KSAOs. Many organizations attempt to balance their needs for accurate evalu-
ation of candidates’ skills, cost-effective staffing procedures, and legal compliance. Consequently,
they question the addition of a test that adds little incremental validity although additional KSAOs
are being measured.
TIMING
An important organizational consideration in the choice of selection tests is timing: How quickly
can the tests be deployed? Some employers have the luxury of time to develop a new selection
program that might be validated locally while continuing the use of an existing program; however,
many organizations lack an existing selection process and need to rapidly develop and install one.
The immediate need for a selection process may guide an organization toward off-the-shelf tests that
can be quickly validated through a validity generalization study or a content-oriented validity strategy.
GROUP DIFFERENCES IN TEST SCORE MEANS AND ADVERSE IMPACT IN SELECTION DECISIONS
Regardless of the requirements of the applicable EEO laws, all organizations that desire a diverse
workforce and want to avoid unnecessarily eliminating members of protected groups are concerned
about group differences in test score means. Thus, another criterion that often informs decisions
about the choice of tests is group differences in test performance.
Group differences may be assessed in several ways. In addition to statistically analyzing mean
score differences, “adverse impact,” a legal term in the United States, can be evaluated in several
ways. Perhaps the most common method, which is described in the Uniform Guidelines, is to deter-
mine whether the selection rate of a protected group (e.g., female applicants, Hispanic applicants) is
less than 80% of the selection rate for the majority group (e.g., male applicants, White applicants).
Another way to calculate adverse impact is to compare the two selection rates using a statistical test
(e.g., Fisher’s exact test). Regardless of how group mean differences or adverse impact are assessed,
organizations must decide whether to avoid, reduce, or eliminate them through their choice of tests,
decisions on cutoff scores, or some other strategy such as alternate recruitment strategies.
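Both methods of evaluating adverse impact described above can be computed directly. The sketch below uses only the standard library and entirely hypothetical applicant counts; the two-by-two table lists selected and rejected applicants for each group:

```python
from math import comb

def impact_ratio(sel_protected, n_protected, sel_majority, n_majority):
    """Uniform Guidelines four-fifths rule: the ratio of the protected
    group's selection rate to the majority group's selection rate.
    Values below 0.80 are taken as evidence of adverse impact."""
    return (sel_protected / n_protected) / (sel_majority / n_majority)

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table
    [[selected, rejected], [selected, rejected]] (rows are groups)."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def hyper(k):  # P(k selected in row 1, given fixed margins)
        return comb(row1, k) * comb(n - row1, col1 - k) / comb(n, col1)

    p_obs = hyper(a)
    k_lo, k_hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(hyper(k) for k in range(k_lo, k_hi + 1)
               if hyper(k) <= p_obs * (1 + 1e-9))

# Hypothetical: 30 of 100 female applicants hired vs. 50 of 100 male
ratio = impact_ratio(30, 100, 50, 100)  # 0.60, below the 0.80 threshold
p = fisher_exact_p(30, 70, 50, 50)      # small p: rates differ reliably
```

Note that the two criteria can disagree: with small applicant pools a ratio below 0.80 may not be statistically significant, and with very large pools a trivial rate difference may be, which is one reason organizations often examine both.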
Decisions in Developing and Selecting Assessment Tools 367
CONSIDERATION OF ALTERNATIVES
Another important factor when deciding what constructs to measure, how to measure them, and
how to combine the scores is consideration of alternate selection procedures. During this process,
an organization may compare various measures, combinations of measures, or methods of making
selection decisions (e.g., setting pass/fail cutoff scores, banding scores, using top-down selection)
to determine which ones result in the highest validities with the least amount of group differences.
In the United States, this quest for alternate selection procedures is an expectation first set by the
Uniform Guidelines.
It is not clear what was envisioned when the Uniform Guidelines or APA Standards were
written in terms of the steps that should be taken to comply with the consideration of alternatives
provision. From the practitioner’s perspective, many unanswered questions remain regarding the
effort that must be exerted to find acceptable alternatives. What is a legitimate alternative—a
different measure of the same construct, a measure of a different construct, different ways of
combining measures, different cutoff scores? In addition, there are practical issues to consider
when attempting to compare alternatives. For example, a comparison of the validity and adverse
impact of two alternatives assumes the availability of criterion-oriented validity data and adverse
impact data. Content-oriented validity strategies do not yield validity coefficients, and many do
not produce estimates of adverse impact unless the tests have been administered in a pilot study.
Often, validity data from criterion-oriented studies and adverse impact data are not available, or
such information has not been collected in similar situations (e.g., for similar jobs, using a similar
criterion).
If studies containing validity coefficients, group means, and adverse impact data can be obtained
for tests under consideration, the next step is to make comparisons among the tests. The Uniform
Guidelines indicate:
Where two or more selection procedures are available which serve the user’s legitimate interest in
efficient and trustworthy workmanship, and which are substantially equally valid for a given purpose,
the user should use the procedure which has been demonstrated to have the lesser adverse impact.
(Section 3B)
However, there is no guidance on what is “substantially equally valid” nor on what has “lesser
impact.” Thus, even if two tests have comparable validity coefficients and adverse impact data avail-
able, there are no common methods in the field for determining how much of a difference in valid-
ity coefficients or adverse impact findings is sufficient to warrant the use of one test over the other.
The test user must make this judgment. These difficulties, both in locating the information and in
determining how to make the comparisons, may explain why, in practice, some test users have not
made thorough reviews, as Guion (1998) pointed out.
Broad reviews of alternative selection procedures may be limited in some circumstances. For
example, it is not uncommon in consulting firms that have a set of proprietary tests with robust
criterion-oriented validity evidence to choose from among those tests when developing a selection
process. Thus, the consideration of alternatives occurs by making decisions only on the basis of the
firm’s own tests. In fact, an organization that hires a particular consulting firm because of its tests
might question the consulting firm if tests from other firms are recommended.
Further, if a human resources professional or operating manager is choosing the tests without
the assistance of a qualified testing professional, he or she may not have the background necessary
to make nuanced judgments about the validity of different tests (Murphy & Davidshofer, 1988).
Further, decision-makers lacking a testing background may be easily swayed by test advertisements
that broadly claim the “validity” of a test, suggest a lack of adverse impact, or extol the ease of
administration. Test users who fail to understand that validity is not a characteristic of a test, but
rather a description of how well test scores predict performance or other relevant criteria, are not
in a position to critically evaluate such claims.
Regardless of the availability of validity and adverse impact data and the difficulty in interpret-
ing observed differences, the presence of a “better” alternative will have direct implications for the
choice of measure.
ADMINISTRATIVE CONCERNS
In addition to the considerations of validity, adverse impact, and costs and timing of test development
and validation discussed above, administrative concerns are critical factors in the decision regarding
what test to use (see Chapter 7, this volume, for additional information on administrative concerns
associated with employee testing). The use of a particular test that is valid and complies with
legal guidelines and professional standards may not be feasible in the context of an organization’s
staffing environment. For example, a multiple-choice test of cognitive ability that must be proctored
is not practical in an organization that processes applications via the Internet and does not have
the facilities and staff available to administer tests in proctored settings. Similarly, an assessment
center designed to evaluate customer service skills may not be cost-effective when the hiring rate
of customer contact personnel is high because of turnover. Some common administrative concerns
are discussed below.
Administrative Personnel
The test user must consider whether the personnel required to administer and score the test are
available and affordable. For example, a work sample test measuring knowledge and skill in the
repair of electronic equipment may require that an electronics technician score the test. In this case,
a work sample may not be feasible simply because the personnel needed to administer and score the
work sample are not available.
Cost is a related factor. Even if personnel with the prerequisite skills are available to admin-
ister assessments, an organization may find the cost of using many of these employees prohibi-
tive. For example, an organization using a structured interview as a first screen for a high-volume,
high-turnover position may well find the cost of the interviewer exceeds the value of the interview.
Organizations must take into account not only the cost of the administrative personnel’s time but
also the cost of time for training the administrator and scorer. The added cost of training inter-
viewers may place a heavy financial burden on the organization. Moreover, additional costs are
incurred by conscientious organizations that also re-train, calibrate, and monitor personnel to main-
tain proficiency.
Time Requirements
The time to administer various types of tests is also a factor often considered in choosing a test.
The time spent testing has costs associated with the personnel required for administration and the
use of facilities. Instruments like situational judgment inventories (SJIs) can be good predictors that
also provide a realistic preview of the work. However, because of the reading demand, the SJIs may
require more time to administer than other forms of tests. In approximately the same amount of
time, an employer could administer either a 20-item SJI or a 100-item biodata form.
Another concern about lengthy tests is their impact on applicant flow. Anecdotally at least, many
recruiters believe that applicants have a low tolerance for lengthy evaluations, especially when they
are administered in unproctored Internet settings. Even when the applicant is asked to take a test at
an employer’s site, the amount of time spent testing can be a deterrent to maintaining the applicant’s
interest. When applicants are currently employed, convincing them to take significant amounts of
leave from their jobs may be difficult. Similarly, the number of tests administered can pose a chal-
lenge. When the testing program is based on multiple hurdles and the applicant is asked to return
multiple times to take tests in the sequence, the number of tests and the time required become major
factors in the applicant’s decision to remain in the hiring process.
ORGANIZATION REACTIONS
Most testing professionals who have spent time inside of organizations have learned that organiza-
tions have idiosyncratic preferences for certain types of tests and strong dislikes for others. For
example, some organizations promote the idea that with some hard work and a little coaching, an
individual can become anything he or she wants. Thus, tests that measure relatively immutable traits
(e.g., personality tests) or are based on past experiences (e.g., biodata) are not acceptable. Instead,
tests measuring developable skills and abilities (e.g., skills tests, achievement tests) or knowledge
that can be acquired are preferred. Other organizations promote the idea that they hire the best
by hiring individuals who have proven their skills by graduating from top schools. Consequently,
asking individuals to demonstrate their mental prowess through measures of cognitive ability is
anathema.
APPLICANT REACTIONS
When designing selection systems, applicants’ reactions should be considered for several reasons.
For example, applicants who are unhappy with a selection system may withdraw from the selection
process, may have an increased potential for filing lawsuits, and may spread a negative reputation
for the organization.
In recent years, several research studies have investigated the role of various factors such as
test type, administration format, procedural characteristics, and personal variables on applicant
reactions to the testing event (see Gilliland, 1993 for a theoretical model of applicant reactions).
In general, the research indicates that tests are perceived more positively when the relationship
between the content of the test and the duties of the job is clear to the applicants.
However, more research is needed before concluding that applicants’ reactions to selection pro-
cedures actually predict applicant behaviors (e.g., withdrawal from the selection process, job accep-
tance, job performance). Moreover, it is important to remember that an applicant’s performance
on a test may override other variables associated with the test or the testing context. As with other
criteria, an applicant’s reaction is not the only factor in deciding which test to use. For example,
if a job requires cognitive ability, the finding that cognitive tests are not perceived as favorably by
applicants as interviews and work samples may be irrelevant.
interview questions and the KSAOs may be easier to demonstrate with a content-oriented strategy
than that between abstract reasoning test items and the KSAOs. Moreover, relevant court cases
regarding test validity and concerns regarding future legal defensibility may guide the employer
toward one strategy over another. Many factors impact the feasibility of different kinds of validity
studies. Organizations with a small incumbent population and low hiring volume are not likely to
have samples large enough to meet the necessary statistical assumptions in a study using a criterion-
oriented strategy. In these cases, the primary strategy for establishing validity may be a content-ori-
ented one. Other possible validation strategies include the use of a validity generalization strategy
such as a transportability study (a study conducted to justify the use of a selection procedure based
on the validity results of another study) or reliance on the results of meta-analyses of validity studies
involving relevant instruments, criteria, and jobs. Additionally, some organizations may address the
problem of validation studies with small incumbent populations by using a synthetic or job compo-
nent validity approach in which validity inferences are based on the relationship between scores on
a test and performance on a component of the job.
New jobs can pose special problems for content- and criterion-oriented validation. It may not be
possible to obtain local subject matter expert (SME) input for a job that currently has no incumbents
or supervisors, so a traditional content-oriented study cannot be conducted. In addition, the lack of
incumbents makes a concurrent criterion study impossible. In some cases, archival information on
the validity of a test for a particular job and criterion may be used on an interim basis to justify the
use of the test. In other cases, the test user may gather organizational information about the impetus
for the newly created job, proposed minimum qualifications, jobs from which current employees
will be promoted, and proposed training; information about similar jobs from external sources (e.g.,
O*NET™, the literature); as well as input from those within the organization who are designing the
job about the work processes that will be covered by the job, the tasks that need to be performed,
the equipment that will be used, etc. This information can then be used to make linkages between
the KSAOs of the new job and the proposed tests.
The presence or absence of an internal I-O psychologist or other testing professional may deter-
mine the feasibility of certain kinds of validity studies or test development efforts. An organization
without qualified personnel may be overwhelmed by the demands of developing tests or executing a validation study.
Thus, the lack of an internal testing professional may force the organization to limit its consider-
ation of tests to those offered by test publishers and consulting firms or those that can be easily
validated without extensive internal guidance.
Another complicating factor for content- and criterion-oriented validation strategies can be the
consequences of the participation of employees. When the need for test security is high (e.g., police
and fire exams), the involvement in the test design process or validation effort of incumbents may
draw into question whether the test content has remained confidential or has been compromised.
It merits noting that although some organizations will adapt their choice of selection tests to the
type of validation effort they are willing and able to make, others will simply choose their selection
tests and eschew the validation effort altogether. Just as some managers discount the information on
validity of inferences, some will also ignore practice guidelines that indicate the conditions under
which various strategies are appropriate. Consequently, small-sample criterion-oriented validity
studies are not infrequent.
There are many sources of costs associated with developing and validating a testing program. Validation
efforts can be expensive in terms of the time of employees who participate in job analysis activi-
ties, take experimental tests, or provide criterion data. Even retrieving archival performance data
can be difficult and time-consuming. The cost of equipment and supplies is another consideration.
Occasionally, special facilities for the validity study must be arranged and paid for. Moreover, the
internal and/or external consultants who guide the validation effort add to the cost of developing
and validating a selection program. Many organizations have limited funds for the validation and
operational use of test programs. They must consider the cost of various kinds of validation efforts
when choosing the most appropriate test. For many, a less costly validation effort such as a trans-
portability study may dictate the use of an off-the-shelf test.
The source of the budget for these expenditures can become an important factor. For example, in
some organizations, test development and validation expenses are paid from a limited, centralized
human resources budget whereas administration costs come from richer, distributed operational
budgets. Thus, there is less money for development and validation and more for administration. In
these situations, an organization might be driven to select tools that are commercially available, less
costly to develop (e.g., interviews, work samples), or less costly to validate (e.g., those that can be
justified through a transportability study or a content-oriented validity strategy). An organization
that is less cost sensitive might prefer to develop and conduct a criterion-oriented validity study of a
proprietary test tailored to its industry, core values, or culture rather than buy one off-the-shelf.
When an organization has confidence in the value of its selection program, an important factor
can be the ownership of the selection tools. For some organizations, the volume of test use dictates
proprietary ownership. In others, the competitive advantage that accrues from the sole use of a
selection process may direct the employer to a proprietary test. When the value of a business’s ser-
vices is derived from something other than its employees (e.g., natural resources), a test shared with
others may be sufficient for its needs. In a few situations (e.g., utility companies), one organization
dominates a geographical area, and applicants come primarily from regional pools. Tests shared
with others have little effect on the organization’s competitive advantage.
Perhaps the ultimate yardstick for determining a test’s value to the organization is its utility,
which takes into account not only its costs but also its benefit. A test that is expensive to develop,
validate, and administer may be worth more to the organization than a less expensive test if its abil-
ity to predict future job performance is also higher.
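The utility logic in this paragraph is commonly formalized with the Brogden-Cronbach-Gleser model (see Schmidt & Hunter, 1998, for a review). The sketch below uses that standard formula with entirely hypothetical dollar figures:

```python
def bcg_utility(n_selected, tenure_years, validity, sdy,
                mean_z_selected, n_tested, cost_per_applicant):
    """Brogden-Cronbach-Gleser utility estimate: the dollar gain from
    using a valid predictor, net of total testing cost. sdy is the
    standard deviation of job performance in dollars; mean_z_selected
    is the average predictor z-score of those hired."""
    gain = n_selected * tenure_years * validity * sdy * mean_z_selected
    cost = n_tested * cost_per_applicant
    return gain - cost

# Hypothetical: 10 hires staying 2 years, validity .50, SDy $10,000,
# average hire 1 SD above the applicant mean, 100 applicants at $50 each
delta_u = bcg_utility(10, 2, 0.50, 10_000, 1.0, 100, 50)  # 95000.0
```

Even with these made-up numbers, the structure of the formula illustrates the point in the text: a test that costs more per applicant can still yield greater net utility if its validity, and hence the performance gain among those selected, is sufficiently higher.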
testing professionals may gravitate toward more complex forms that convey more information about
the individual.
also pass the personality inventory—so there is little value in retaining the other two tests. Thus, the
decision on the cutoff score question negates the decision by the organization of which constructs
to measure and what tests to use.
CONCLUSIONS
This chapter has reviewed the issues test users consider in selecting or developing a test for valida-
tion or for operational use on an interim basis. As described, the issues are many, and hard and fast
answers are few. As noted earlier in the chapter, none of these factors can be evaluated without
consideration of the others. For example, the feasibility of test development and validation and their
costs are significant factors in the choice of tests. An organization with few incumbents in a job for
which tests are being considered may not be able to supply enough SMEs to complete job analysis
questionnaires and develop job-relevant tests, test-takers for a concurrent study, or SMEs for a con-
tent study. Even enterprises with many incumbents may not be able to relieve employees from their
job duties for the time needed to assist with test development and validation and maintain smooth
operations.
In addition, the answer to the first question (Which constructs should I measure?) may need
to be revisited depending on the answers to the next three questions (How should the constructs
be measured? How should evidence of validity be gathered? How should scores be used?). An
organization that decides to measure only problem-solving ability because it was the most important
KSAO for a particular job and then decides to use a work sample test may find that the work
sample can measure a broader array of KSAOs than just problem-solving abilities. Conversely, an
organization that decides to measure all of its important KSAOs may find that the number of tests
required is so large that testing requires 3 days and consequently is unaffordable to the organization
and intolerable to applicants.
A particularly difficult, overarching concern is how to arrive at one decision in the face of many
competing demands on the organization. Optimization of all factors would be challenging, if not
impossible. For example, increasing validity while minimizing adverse impact and meeting orga-
nizational constraints of time and cost associated with validation and administration remains a bal-
ancing act rather than a series of discrete decisions. Minimally, it is imperative that those who are
tasked with identifying or developing successful selection systems are familiar with the many deci-
sion points in the process. The test user responsible for designing selection systems must consider
these issues and their ramifications, weigh the tradeoffs, and make fully informed final decisions.
REFERENCES
American Educational Research Association, American Psychological Association, and National Council on
Measurement in Education (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department
of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43,
38290–38315.
Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective.
Academy of Management Review, 18, 694–734.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process.
In N. Schmitt & W. C. Borman (Eds.) Personnel selection in organizations. San Francisco, CA: Jossey-
Bass.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Hunter, J. E., Schmidt, F. L., & Judiesch, M. K. (1990). Individual differences in output variability as a function
of job complexity. Journal of Applied Psychology, 75, 28–42.
Murphy, K. R., & Davidshofer, C. O. (1988). Psychological testing: Principles and applications. Englewood
Cliffs, NJ: Prentice-Hall.
Rynes, S. L., Colbert, A. E., & Brown, K. G. (2002). Human resource professionals’ beliefs about effec-
tive human resource practices: Correspondence between research and practice. Human Resource
Management, 41, 149–174.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychol-
ogy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Society for Industrial and Organizational Psychology (2003). Principles for the validation and use of personnel
selection procedures (4th ed.). Bowling Green, OH: Author.
Tippins, N. T., Beatty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., et al. (2006). Unproctored
Internet testing in employment settings. Personnel Psychology, 59, 189–225.
18 Administering Assessments and Decision-Making
There are innumerable topics that could be covered under the heading “administering assessments
and decision-making.” Rather than attempt a superficial coverage of a wide range of topics, we have
chosen to focus our attention on two that are of particular relevance to the applied use of assess-
ments in the workplace. The first of these—unproctored Internet testing (UIT)—is a relatively
recent phenomenon and one that not only has the potential to greatly expand the reach of valid
professionally designed assessment but also to destroy valid assessment as we have known it in the
pre-Internet world. The second issue we cover is not new. It involves the aggregation of assessment
information for use by decision-makers. More specifically, it involves the knotty question of how to
maximize accuracy/validity, yet not steal from the authority of decision-makers to make decisions.
There is a thread that ties both issues together, and that is how the nexus between scientifically
tested practices and the requirements of managerial preference and decision-making can be merged
to yield sound selection solutions.
forces opposed. We do not see a critical mass of any of these stakeholder groups clamoring to
curtail UIT.
A good deal of literature has been produced on this topic, and we will not attempt to duplicate it.
We refer the reader to the International Test Commission (2005); Naglieri et al. (2004); and Tippins
et al. (2006) for excellent overviews of the major issues involved in UIT. For our purposes, four
issues are most relevant:
This fourth issue has not, to our knowledge, been addressed in the research: Do decision-makers
believe assessments provide sufficient value to merit efforts to protect them from compromise?
out that its advantages should not be oversold. Discussion of an alternate use of verification testing
that might be more practical is presented later in this chapter.
TEST SECURITY
If examinee authentication is problematic, protecting test security under UIT is doubly so. Solutions
to the security problem can include hardware and software solutions (e.g., disabling the print screen
key, “locking down” the web browser), use of item banks, use of forensic analysis to detect suspi-
cious responding indicative of theft of test content, and monitoring of Internet activity and sites
where “pirated” test materials are trafficked (Burke, 2006).
All are worthy steps to consider in helping protect test security. Yet, it is clear that they can also
be ineffective and can lead to a sense of false security. For instance, safeguards such as disabled
hardware and locked-down browsers can easily be defeated by simply taking pictures of test screens
with a digital snapshot camera. Forensic analysis can only measure the likelihood of content theft
by a test-taker; it cannot provide proof. And, unlike in the case of cheating detection, in which
there is some potential remedy such as proctored verification testing, when test security is breached
the damage is done. The same is true of scanning websites for stolen test content. It may very well
locate pirated test materials, but by that point it is too late to do anything except begin planning the
replacement test. Moreover, as test security deteriorates, it would not be unreasonable to expect that
wide access to test contents might also increase the incidence of cheating on the test itself.
One solution is to use “honesty certificates” in which the candidate certifies before he/she begins
the test that he/she will not cheat and will not pilfer test content. Such an approach clearly com-
municates to the candidate the employer’s expectations regarding honesty (Burke, 2006; Cizek,
2001; Hughes & Tate, 2007; McCabe, Treviño, & Butterfield, 2001). This step is virtually without
cost and there is some indication it actually has some deterrent effect on cheaters and thieves. Yet,
Nathanson, Paulhus, and Williams’ (2006) finding that students high in subclinical psychopathy
were more likely to cheat than others reminds us that there are distinct limits on what we can expect
from honesty certificates. Rotating items from a large item bank will help by preventing the loss
of a test in toto. Yet, even here we know that merely having access to a parallel form of a test is
invaluable to a job candidate who wants a leg up on the competition. Nor should it be ignored that
developing an item bank of the size needed can be extremely costly.
We admit that we have painted a grim picture of the advisability of UIT from the standpoints of
candidate verification (cheating) and test security. In actuality, the picture may not be quite as dire
as the limited effectiveness or impracticality of many of the suggested remedies may imply. That is
because the simple fact of having some of these precautions in place and letting the examinees know
they are in place is likely to deter some of the cheating and test content theft.
However, in the end we agree with Pearlman, in Tippins et al. (2006), who wrote, “… it can be
safely assumed that for high-stakes applications, all test content available for viewing on a [UIT]
computer screen will eventually be compromised, the only questions being how quickly and to what
extent” (p. 205). It may be, at least with the tools available to us today, that eventual loss of test
security is, in the final analysis, an inherent cost of doing business in the UIT world.
What if decision-makers see assessment's main value as a cheap way to screen many candidates down to a smaller number so that the "real" evaluation can then begin? If so, it is quite logical that they would believe the care with which assessments are administered could be greatly compromised without incurring any meaningful damage. The answer to this question is important because
it will dictate how we address UIT going forward. If there is a pervasive belief within our clientele
that assessment does not really do much other than provide a cheap way to cull out candidates, then
we will probably not achieve a mutually acceptable solution to the issues surrounding UIT because
the I-O psychologists will be pursuing one goal and the clients another.
There is some evidence that suggests clients, even in organizations where assessment is well
established, derogate its importance in identifying top-quality job candidates. Ryan and Tippins
(2004) observed that:
Even when presented with evidence of strong relationships between test scores and job performance,
some managers are unwilling to concede that structured selection programs are not just good ways to
select employees; they are better than other less structured alternatives. The power of “gut instinct” and
“chemistry” seems to override the hard data and rational arguments. (p. 306)
An early theoretical paper by Shrivastava and Mitroff (1984) anticipated some of the Ryan and
Tippins (2004) conclusions. Shrivastava and Mitroff compared the “assumptions organizational
decision-makers use in making decisions with those of researchers” (p. 19). Among their propositions
were that decision-makers have a “preference for intuitive images of the problem” (p. 20) and that
“inquiry [by decision-makers] is conducted through informal, personal, nonstandard procedures”
(p. 20). If correct, these propositions are consistent with those of Ryan and Tippins (2004), and it
should therefore not come as a surprise when managerial decision-makers are not all that concerned
with the safeguards that come from proctoring.
Lievens, Highhouse, and De Corte (2005) found that managers placed more value on general
mental ability and extraversion as useful selection constructs when they were measured via unstruc-
tured interview than when they were measured by test. Terpstra and Rozell (1997) reported that the
leading reason cited by human resources (HR) managers for not using cognitive tests or biodata
(other types of noncognitive tests were not included in the study) was that these managers were either
uncertain of the usefulness of the tests or felt the tests were not effective. Perhaps most telling was a
study by Terpstra (1996). He surveyed 201 HR executives, asking them to rate nine different classes
of selection procedures in terms of “how predictive of future job performance” (p. 16) each was (i.e.,
their perceived validity). Four of the classes involved tests (i.e., specific aptitude tests, personality
tests, general cognitive ability tests, and biographical information inventories) whereas the other five
are generally not regarded as being tests in the lay usage of the term (i.e., work samples, references,
unstructured interviews, structured interviews, and assessment centers). Of the nine classes of selec-
tion procedures rated, these HR executives put the four classes of tests at the very bottom of the list.
Note that these beliefs were expressed despite well-documented evidence to the contrary. Given a fundamental lack of confidence that tests work in the first place, it would not be surprising if there were a willingness to sacrifice the assurance afforded by proctoring.
Although the studies cited above are, to be sure, only a smattering of the literature that could be
brought to bear on the issue of the value placed by managers on tests, they suggest that there is not as
much confidence in the efficacy of tests as might be supposed. And, if confidence is lacking, it fol-
lows that not much is lost by compromising the standards under which the tests are administered.
However, one might wonder why the organization would use testing at all if it has such weak
confidence in the value of testing. We can only speculate, but in the real world of organizations the decision-maker who put the testing program in place is rarely the one who later wants to compromise it. Sometimes this is because the decision-makers sit in different parts of the organization (e.g., HR vs. sales); sometimes it is the simple passage of time (e.g., HR executive A installed the testing program; years later HR executive B is open to compromising it). In any event, much more
382 Handbook of Employee Selection
research is needed to better understand the orientations of managers (HR and line) toward testing
and its value and how those orientations may relate to managers’ willingness to compromise quality
assurance steps in employment testing.
Estimates of cheating come from methods ranging from self-reports and reports of observed cheating to computer program "cheating detectors" (Nathanson et al., 2006). Timeframes for the cheating to "count" vary widely. To further muddy the water, cheating behavior can range from stealing answers to a test from someone else to using unauthorized aids (e.g., calculators on a math exam). For noncognitive measures, the most likely type of cheating would be illicit collaboration (i.e., soliciting the unauthorized aid of another person in completing the assessment), and only some of the studies in the literature address this variety of cheating. With respect to assessment itself, several other variables come into play that prevent the current literature from
converging: whether the assessment is proctored or unproctored, whether it is cognitive or noncog-
nitive, whether high stakes or lower stakes are involved, and demographics (virtually all studies
involve student populations).
Our immediate interest in our simulation work was in unproctored, noncognitive, high-stakes
assessment involving adults. The literature that most influenced our decision on cheating rate
included studies where there was evidence reported of the two specific kinds of cheating of greatest
interest to us: cheating on a test (as opposed to, say, on homework) and willingness to engage in illicit,
deceptive collaboration with another. We would like to have focused on noncognitive measures and
adult populations, but the literature was virtually devoid of information in those areas. We also
would have liked to focus on situations involving only unproctored testing, but the vast majority of
the literature involved classroom testing, which we presumed to be at least to some degree proctored.
So, we were left with a literature mostly comprised of college student populations involved in the
kinds of test cheating/illicit collaboration found in the academic environment, that is, on cognitive
(achievement) tests taken under at least nominally proctored conditions rather than on noncognitive
tests taken by adults under unproctored conditions. The literature pertinent to estimating the third
parameter in our simulation (incidence of cheating) is summarized in Table 18.1.
In reviewing this literature, our first area of interest was the threshold question of how frequently
people cheat under UIT conditions. Hughes and Tate (2007) found that 12% of students in the United
Kingdom reported they had cheated under UIT conditions involving high-stakes (selection), ability
(cognitive) testing (time period covered was unspecified). One thing that is apparent in scanning the
studies summarized in Table 18.1 is that many capture cheating over extended periods of time (e.g.,
a year, a college career) and that all else being equal, longer time periods will inflate the reported
incidence of cheating. For that reason, we were particularly interested in the incidence of cheating
on a single, discrete event (i.e., on a given pre-employment test administration) or as close to it as we
could come. Nathanson et al. (2006) used a methodology that matched patterns of item responses
to identify pairs of students whose response patterns on an in-class exam were suspiciously similar.
The suspicions were corroborated by close proximity of the seats of the pairs of students in virtually
every case. The inferred cheating behavior was not limited to the single incident category, but it
was fairly close to it, being limited to two to five in-class exams. As might be expected, inferred
cheat rates were quite low—around 4%. However, this study involved only one category of cheating
(copying) and the tests were proctored, both of which would also drive down cheat rates. Blackburn
(1998) limited the observation period for the cheating behavior to count to the “first [college] class
[of] each week during one semester” (p. 173). Here too, self-reported cheat rates were comparatively
low, ranging from 2% (looking up answers in a closed-book test) to 17% (copying answers from
another student during a test). Although not explicitly stated, it could be assumed that where tests
were involved, they would be proctored, suppressing some cheating.
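The pairwise response-matching logic behind studies such as Nathanson et al. (2006) can be illustrated with a deliberately simplified index: counting identical wrong answers between pairs of examinees, since matching correct answers carry little information about copying. The function, answer key, and examinee data below are hypothetical illustrations of the general idea, not the authors' actual statistic:

```python
from itertools import combinations

def identical_wrong_answers(responses, key):
    """For each pair of examinees, count items where both chose the
    same wrong option -- shared correct answers are uninformative."""
    scores = {}
    for (id_a, ans_a), (id_b, ans_b) in combinations(responses.items(), 2):
        shared = sum(1 for a, b, k in zip(ans_a, ans_b, key)
                     if a == b and a != k)
        scores[(id_a, id_b)] = shared
    return scores

key = "ABCDA"                 # hypothetical 5-item answer key
responses = {"s1": "ABCDA",   # all correct
             "s2": "BBCDC",   # wrong on items 1 and 5
             "s3": "BBCDC"}   # identical wrong pattern to s2 -- suspicious
flags = identical_wrong_answers(responses, key)
print(flags[("s2", "s3")])    # prints 2
```

In practice an index like this would be compared against its chance distribution, and, as the text notes, corroborating evidence (e.g., seating proximity) would still be needed.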
The next question had to do with the specific kind of cheating. In noncognitive UIT, we would
expect that the greatest source of cheating behavior would involve illicit collaboration with another
person to “figure out” the test. The studies we reviewed were rich in evidence of illicit collaboration.
We defined illicit collaboration as receiving assistance from another person with that person’s
acquiescence for the purpose of falsely improving one’s test score. This excluded such behaviors as
copying from another person during an exam without his/her knowledge. Eight studies (Table 18.1)
produced 17 estimates of the incidence of illicit collaboration, ranging from 1% (self-report of
taking a college exam for someone else or vice versa; Newstead, Franklyn-Stokes, & Armstead,
TABLE 18.1
Summary of Incidence of Cheating Studies

Blackburn (1998). Self-report; college students (N = 225). Stakes: Moderate.
Measured 27 cheating behaviors—4 of interest in UIT; cheating behavior reported if
it occurred "during the first class of each week over the course of a semester."
• 17%—copying answers from another student during a test or quiz
• 8%—using a "cheat sheet" during a test
• 2%—looking up answers in a closed-book test
• 5%—having another person write a paper or assignment handed in as own work

Hughes & McCabe (2006). Self-report; Canadian college students (N = 13,644
undergraduates; 1,318 graduate students). Stakes: Moderate. Cheating behavior
reported if it occurred "at least once in the past year."
• 18% (undergraduates); 10% (graduate students)—received unpermitted help on an exam
• 6% (undergraduates); 3% (graduate students)—copying from another student during a test with his or her knowledge
• 9% (undergraduates); 4% (graduate students)—turned in work done by someone else

Hughes & Tate (2007). Self-report; U.K. university students (N = 534; 319 were
involved in high-stakes UIT). Stakes: High. Cheating behavior reported if it ever
occurred under UIT conditions; time period covered not specified.
• 12%—used any of the four cheating methods below
• 5%—prior practice with test under a pseudonym
• 4–5%—illicit collaboration with another
• 4–5%—advance access to test content
• 2%—circumvention of UIT technology to gain unfair advantage

Klein, Levenburg, McKendall, & Mothersell (2007). Self-report; college students
(N = 268). Stakes: Moderate. Measured 13 potential cheating behaviors—4 pertinent
to UIT; time period covered: sometime during college career.
• 27%—collaborating on take-home exam
• 5%—using "cheat sheet" on an exam
• 14%—looking at someone else's exam during a test
• 29%—programming extra help into a calculator to use during an exam

McCabe, Butterfield, & Treviño (2006). Self-report; Canadian and U.S. graduate
students (N = 5,331). Stakes: Moderate. Cheating behavior reported if it occurred
at least once within the last academic year.
• 10% (business students); 8% (nonbusiness students)—engaged in "serious test cheating"
• 28% (business students); 23% (nonbusiness students)—engaged in improper collaboration
Administering Assessments and Decision-Making 385
1996) to 61% (self-report of getting test items or answers from someone who has already taken the
test; Robinson, Amburgey, Swank, & Faulkner, 2004), with a median of 10%. In the closest of these
studies to our primary interest, illicit collaboration under unproctored conditions, Klein, Levenburg,
McKendall, and Mothersell (2007) reported that 27% of college students said they had collaborated
on a take-home (i.e., unproctored) exam, whereas Hughes and Tate (2007) reported that 4–5% self-
reported having engaged in illicit collaboration with another person in a UIT setting.
The results of these studies were so divergent it would be tempting to conclude they are of little
use in arriving at informed assumptions for entry into a simulation model. On the contrary, we
found these results to still be useful, and we concluded that we could bracket the most likely inci-
dence of cheating under these conditions by entering our simulation with three different assump-
tions of the incidence of cheating: 10%, 15%, and 30%. To be sure, these percentages were not as
high as many found in the self-report literature. However, we were focusing on a single event (a
given pre-employment test), not over an extended period of time. We were focusing on noncognitive
testing, which would be expected to have a lower cheating rate. Age is a commonly cited factor
in rates of cheating. Our examinees were all adults. Taking these into account, our cheating rate
estimates for the simulation were lower than those often seen in the literature. Because most of the
literature is based on self-report, it undoubtedly reflects underreporting of the true incidence of
cheating, and this was the main basis for choosing 30% for our highest-level estimate. All three of
our estimates, then, were considered conservative in the sense that they did not “stack the deck”
against UIT, something we were sensitive to in preparing a balanced case.
A key feature of this study was that it compared scores collected under (a) instructions to respond honestly, (b) instructions to fake good, and (c) instructions to fake good combined with some degree of
coaching. We were particularly interested in the difference between faking alone and faking plus
coaching because it provided some sense of the incremental score gain we might expect via coaching
(illicit collaboration in our case). Means under the coaching conditions were roughly in the range of
one-third to over one-half SD higher than means under the fake-good-only condition. This suggests
that coaching could indeed boost personality scores. These results on one kind of noncognitive
test (personality) are tempered by results on another kind (biodata) reported by Kluger, Reilly, and
Russell (1991). In their study of the fakeability of biodata, they found that the keying method used
made a substantial difference. When item keying was used (i.e., where the response scale has a
monotonic relationship with the criterion measure) fakeability was significant. We would assume
that if fakeability were possible, collaboration with another person could have similar effects.
However, when the keying method used was option-keying (i.e., where each response option is keyed
individually, often nonmonotonically) fakeability was stymied. In fact, the fake-good condition in
this study actually produced lower scores than the honest condition. This illustrated the need to be
conservative in assigning score gains because of cheating, especially if the noncognitive instrument
at issue is option-keyed biodata.
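The item-keying versus option-keying distinction can be made concrete with a small sketch. All item content, weights, and key values below are invented for illustration; real biodata keys are derived empirically from criterion data:

```python
# Item keying: one weight per item, response scale scored monotonically
# (a higher option always earns proportionally more credit).
def item_keyed_score(responses, item_weights):
    return sum(r * w for r, w in zip(responses, item_weights))

# Option keying: each response option carries its own empirically derived
# key value, which need not rise monotonically across the scale.
def option_keyed_score(responses, option_keys):
    return sum(keys[r] for r, keys in zip(responses, option_keys))

responses = [4, 4]                   # "fake good": endorse the top option
item_weights = [1, 1]                # transparent -- extremes always pay off
option_keys = [{0: 0, 1: 1, 2: 2, 3: 1, 4: 0},   # mid-range options keyed highest
               {0: 0, 1: 2, 2: 3, 3: 1, 4: 0}]

print(item_keyed_score(responses, item_weights))    # prints 8
print(option_keyed_score(responses, option_keys))   # prints 0
```

Under item keying the faked-good response pattern maximizes the score; under option keying, where the keyed-highest options sit mid-scale, the same pattern actually produces the lowest possible score, mirroring the Kluger et al. (1991) finding.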
It was clear that much guesswork would be necessary to arrive at credible assumptions for this
element of our simulation model. Taking the results of the two studies above into consideration and
then weighing the fact that our testing would be done under high-stakes conditions, we elected to
bracket score gains in the range of 0.25 SD to 0.75 SD. We acknowledge that an optimal simulation
model would actually have a distribution of different-sized score gains through cheating, including
some with actual reduction in scores (because of the cheater’s picking a less-able collaborator or
because option-keyed biodata were used). However, we concluded that the state of the art was not
yet sufficiently developed to incorporate this level of sophistication into our model. Clearly, more
research is needed in this area.
1 We made the point earlier that cheating and response distortion are related, but different, classes of behavior.
TABLE 18.2
Estimated Percentage of Cases Moving From “Fail” to “Pass”
Because of Cheating
Assumed Cheat Rate
Assumed Score Gain (SD) 10% 15% 30%
0.25 1.1 1.7 3.4
0.50 2.3 3.4 6.5
0.75 4.1 6.1 11.5
Entries were calculated as n(f→p) / [n(f→p) + n(p)], where n(f→p) is the number of cases
that went from "fail" to "pass" because of the score gain obtained due to cheating and
n(p) is the number who passed the test before the adjustment for cheating gains was made.
We then counted how many moved from fail to pass as a result of the adjustment to their scores for cheating. This process was iterated 1,000 times, and the results were the percentage of all passing cases who went from fail to pass because of cheating, averaged across the 1,000 iterations. It should be emphasized that we
approached this from a very pragmatic point of view: We were only interested in those cases which
moved from “fail” to “pass” because of estimated cheating. Cases that “cheated” but were already
above the cutoff score before having their score adjusted or cases that were designated as cheaters
but remained below the cutoff score after their scores were adjusted were not of interest. This pro-
cess was repeated for all remaining combinations of assumed cheat rate and score gain. As Table
18.2 shows, the estimated percentages of people who passed because of cheating would be minimal
under the most conservative of conditions (10% cheat rate, 0.25 SD score gain), with only about 1%
of the cases who were above the cutoff score getting there because of cheating. To the surprise of the
researchers, the effects of cheating were still quite modest even under the moderately conservative
condition (15% cheat rate, 0.50 SD score gain), rising to only 3.4%. Only at the upper realistic limit (30% cheat rate, 0.75 SD score gain) did more than 10% of the candidates who passed the test get there by cheating.
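The simulation logic just described can be sketched in a few lines. This is a loose reconstruction under assumed parameters — standard-normal scores rather than the authors' negatively skewed proctored distribution, a cutoff giving roughly a 50% pass rate, and fewer iterations — so its numbers will not reproduce Table 18.2 exactly:

```python
import statistics
from random import Random

def simulate_pass_inflation(cheat_rate, gain_sd, n=5_000, iterations=200,
                            cutoff=0.0, seed=0):
    """Estimate the fraction of the passing pool that got above the cutoff
    only because of cheating. Scores are standard-normal; cutoff=0.0 gives
    a ~50% pass rate, roughly matching the rate described in the text."""
    rng = Random(seed)
    results = []
    for _ in range(iterations):
        scores = [rng.gauss(0, 1) for _ in range(n)]
        n_pass = sum(s >= cutoff for s in scores)          # passed honestly
        n_fail_to_pass = sum(
            1 for s in scores
            if rng.random() < cheat_rate                   # designated a cheater
            and s < cutoff and s + gain_sd >= cutoff)      # gain pushes over cutoff
        results.append(n_fail_to_pass / (n_fail_to_pass + n_pass))
    return statistics.mean(results)

# Most conservative cell of Table 18.2 (10% cheat rate, 0.25-SD gain)
print(round(100 * simulate_pass_inflation(0.10, 0.25), 1))
```

Swapping in an empirical proctored-score distribution and the operational cutoff would bring such a sketch closer to the analysis reported in the chapter.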
A very large caveat is in order here. The results shown in Table 18.2 should be taken only as a
demonstration of the value of simulation as an aid in the decision of whether or not to engage in
UIT. Specific results depend not only on the assumptions discussed above, but also the shape of the
proctored test distribution (ours was somewhat negatively skewed) and the cutoff score in use (ours
generally yielded a 45–55% pass rate). Beyond that, the assumptions themselves will produce widely
varying results. Although beyond the scope of this chapter, it is readily apparent that if the test at
issue were heavily cognitive in content, the temptation to cheat and the average score gains realized
from cheating would increase substantially, with the result being a possibly dramatic increase in the
negative effects of UIT over the example we present here.
gains would be necessary before significant mean score differences appeared and/or validity were
impaired.2
UIT is increasingly being portrayed as inevitable. The simulation approach can assist in the decision-making process by helping to evaluate how serious the consequences of dropping the safeguard of proctoring will be. Of course, research is needed to help hone the assumptions that go into the
simulation model. Better estimates of the incidence of cheating, especially that which distinguishes
high- from low-stakes assessment and noncognitive measures from cognitive measures, would help
immensely. A more sophisticated understanding of the score changes likely to be realized under
UIT is also needed. Our simulation assigned the same score gain to every “cheater” case. A better
simulation would assign differing amounts of score change to individual cases, including, especially
in the case of noncognitive testing, the real possibility that some scores might actually go down.
Finally, we need to have a much better understanding of where along the score continuum cheaters
are most likely to reside. Our simulation used essentially a random assignment, but depending
on the type of test, whether it is high stakes or low, and even the characteristics of the examinee
population, the distribution of cheaters could be considerably different from random. For example,
the literature provides information that could be used to fine-tune the distributional assumptions
used in a simulation. Gender (men report more cheating), perceived or actual ability (low-ability
people report more cheating), and age (younger people report more cheating) are reported with
some reliability (Klein et al., 2007; McCabe et al., 2001; Nathanson et al., 2006; Newstead et al.,
1996). Instead of a random selection of cases to be designated as cheaters in the simulation, younger
cases, male cases, and/or cases at the lower end of the test distribution could be oversampled to
more realistically represent the distribution of cheaters. We were not ready to go to this level of
sophistication in our simulation, but the potential to move beyond simple random distribution of
cheaters is clearly there.
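Such oversampling can be implemented by weighting the designation of simulated cheaters by the covariates the literature flags, instead of selecting uniformly at random. The risk multipliers below are illustrative placeholders, not empirical estimates:

```python
import random

def pick_cheaters(candidates, cheat_rate, seed=0):
    """Designate simulated cheaters with probability proportional to risk
    weights rather than uniformly at random."""
    rng = random.Random(seed)

    def weight(c):
        w = 1.0
        if c["age"] < 25:
            w *= 2.0       # younger examinees report more cheating
        if c["sex"] == "M":
            w *= 1.5       # men report more cheating
        if c["score_pct"] < 30:
            w *= 2.0       # low scorers report more cheating
        return w

    k = round(cheat_rate * len(candidates))
    weights = [weight(c) for c in candidates]
    pool = list(range(len(candidates)))
    chosen = []
    for _ in range(k):     # weighted sampling without replacement
        r = rng.uniform(0, sum(weights[i] for i in pool))
        for j, i in enumerate(pool):
            r -= weights[i]
            if r <= 0:
                chosen.append(pool.pop(j))
                break
        else:
            chosen.append(pool.pop())   # guard against float round-off
    return chosen

# High-risk cases should be designated cheaters more often than low-risk ones.
candidates_demo = ([{"age": 22, "sex": "M", "score_pct": 20}] * 50 +
                   [{"age": 40, "sex": "F", "score_pct": 70}] * 50)
cheaters = pick_cheaters(candidates_demo, 0.20)
print(sum(1 for i in cheaters if i < 50), "of 20 designated cheaters are high-risk")
```

The rest of the simulation pipeline would then apply score gains only to the indices returned here, replacing the simple random designation used in the original model.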
Assuming that UIT is here to stay, we need much better capability to assess its likely conse-
quences in specific situations, and the simulation tools will improve only to the extent that the
assumptions entered into the simulations are themselves more accurate.
• Use honesty certificates, requiring the examinee to certify that he/she will not cheat and
will not take test content.
• Forensics, hardware, and software solutions may not, at present, objectively prevent
cheating or content theft, but the known presence of such things can act as a deterrent.
Publicizing to job candidates that such deterrents are in place will dissuade some from
cheating and/or theft.
• If proctored confirmation testing is feasible, use it. However, for many organizations
administering a proctored test to everyone who passes the UIT is not going to be feasible
or even desirable. An alternative might be to use validated forensic analysis to detect cases
of suspected cheating from among those who passed the test and then subject only those
candidates to proctored confirmation testing.
• Use item banks to create multiple forms or presentations of the test, where feasible.
• Instruct candidates to set aside a specified number of uninterrupted minutes needed to start
and finish the test in one sitting. If a candidate knows that once he/she starts a test it must
be finished, the likelihood that he/she will ensure that distractions and interruptions are
minimized during that limited amount of time may be increased.
2 Simulation of the effects on validity would, of course, also require criterion data.
• Reconcile oneself to the fact that putting a test “out there” in people’s homes will eventu-
ally compromise it. Periodic replacement of the assessment should be regarded as a normal
cost of using UIT.
It is virtually certain that the push for UIT will intensify. A rational first step before deciding on
a course of action in any given testing situation would be to construct a simulation that takes into
account likely scenarios. This will help determine the potential seriousness of the consequences
of going to UIT. If the consequences are likely to be very serious (i.e., many candidates passing
because they cheated), it would make sense to hold the line on the requirement for proctoring, either
in all cases or in verification testing of all who passed a preliminary UIT screen, whichever made
more economic and practical sense. If the consequences—as in our example—are not projected to
be serious, UIT may be fully appropriate. In any event, simulation can bring the salient factors on
which the decision should be made more clearly into focus.
The six suggestions we made above will help mitigate, if not eliminate, any ill effects of UIT.
In the future, we fully expect that technological aids such as web cameras will become as standard
on home computers as color monitors are now. At that point, verifying candidates' identities and
monitoring them as they take the test will become much easier. We expect forensics to become more
effective. It is not hard to imagine at some future point that we retire the term “unproctored Internet
testing” because technology will truly provide a measure of proctoring that may even surpass dili-
gent human proctors in their effectiveness.
We now turn from an issue of where the pressure is to exclude the human element from the assess-
ment process to an issue where the pressure is in the opposite direction—to include the human ele-
ment even where research clearly indicates there are “mechanical” methods that surpass the human
in arriving at good decisions.
• A structured panel interview is completed and the ratings across dimensions and inter-
viewers need to be aggregated to help arrive at the decision of whether to accept or reject
the candidate.
• Test scores must be combined with interview ratings to make an employment decision.
• Assessment center assessors’ evaluations must be combined before a decision to promote
or not can be made.
• Three tests make up a battery. The scores from the three must be used in some way to make
hiring decisions.
These are all familiar issues to I-O psychologists, but how they are handled is typically not a uni-
lateral decision made by the psychologist. Hiring and HR managers also typically weigh in on these
topics. For example, in a “high-touch” organization, interviewers may resist averaging across their
individual ratings but would rather discuss the candidates until consensus is reached. In another
organization, managers may have much more faith in interviews than in selection tests and so want
to give more emphasis in the decision-making process to the interview. And, in a third organization,
a hiring manager may have become enamored of a particular test, thus tending to defer to its results
in preference to other sources of candidate data.
In fact, a rich literature exists to help us optimize the ways we combine and use assessment
data on the way to making employment-related decisions. Meehl’s (1954) and later Dawes’ (1979)
seminal work comparing the predictive accuracy of mechanical (statistical, actuarial) combination
of data with clinical judgment set in motion a controversy within the psychological discipline when
they empirically supported the claim that mechanical combination of information was superior to
clinical combination. In the intervening years, the Meehl/Dawes hypothesis has been repeatedly
substantiated, and its main contention is no longer considered controversial within psychology (e.g.,
Dana & Thomas, 2006; Dawes, 1979; Grove & Meehl, 1996; Grove, Zald, Lebow, Snitz, & Nelson,
2000; Kleinmuntz, 1990).
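For concreteness, the "mechanical" pole of this contrast can be as simple as a fixed formula applied identically to every candidate — for example, a weighted sum of standardized predictor scores. The three-test battery, scores, and unit weights below are hypothetical:

```python
import statistics

def standardize(scores):
    """Convert raw scores to z-scores."""
    mu = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mu) / sd for s in scores]

def mechanical_composite(predictor_columns, weights):
    """Weighted sum of standardized predictors -- a 'mechanical' combination:
    every candidate is scored by the same formula, no judgment involved."""
    z_cols = [standardize(col) for col in predictor_columns]
    return [sum(w * z[i] for w, z in zip(weights, z_cols))
            for i in range(len(predictor_columns[0]))]

# Hypothetical battery: three tests, four candidates, unit weights.
tests = [[52, 61, 47, 58],
         [30, 25, 35, 28],
         [ 7,  9,  6,  8]]
composite = mechanical_composite(tests, [1, 1, 1])
ranked = sorted(range(4), key=lambda i: composite[i], reverse=True)
print(ranked)  # → [1, 3, 0, 2]
```

Even unit weights, as here, are "mechanical" in the Meehl/Dawes sense; the point of the literature is that a fixed rule of this kind tends to outpredict holistic clinical judgment, not that the weights must be optimal.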
There is one little fly in the ointment: Psychologists do not typically run the organizations we work in or consult to. Managers do, and it is here that mechanical combination of
data runs up against some strong opposition. Not only is the proposition that a formulaic combina-
tion of data “untouched by human hands” could be superior to human judgment extremely coun-
terintuitive for most decision-makers, it also impinges on an often jealously-guarded management
prerogative: to decide who is best suited to be hired or promoted.
So, what should the I-O practitioner do? Options include insisting on mechanical combination
and, frankly, running the real risk of being ignored by the organization. Or, the I-O practitioner can
step aside and just let nature take its course. This course of action, unfortunately, cheats the organi-
zation out of the benefit of the most effective selection procedures available.
We think the answer lies in looking for clues that help guide the role of objective aggregation
of data so it can do the most good in the process without preempting management’s role in the
process.
Johns (1993) examined I-O psychology’s penchant for supplying technical solutions to prob-
lems when he discussed the more general issue of why organizations so consistently adopt sub-
optimal solutions to HR problems when the organizational research clearly points to better
solutions. He noted that adoption of new practices is considered innovative, and then went on
to make the distinction between technical innovation and administrative innovation. Technical
innovations are those that are central to the organization’s products and services and are judged
by their technical merit. Administrative innovations, by contrast, are central to the organiza-
tion itself and whether they are adopted or not depends more on the preferences of manage-
rial decision-makers. In comparison with technical innovations, the organization’s judgment
of administrative innovations is heavily influenced by factors outside of technical merit, such
as political interests, imitation of other organizations, legal/regulatory issues, and the like.
According to Johns (1993), where the trouble arises is that I-Os see our innovations—such
as mechanical weighting of selection information or utility analysis—as technical innovation,
but organizational management sees them as administrative innovation. Macan and Highhouse
(1994) examined I-O psychologists' use of dollar utility analysis in communicating the value of HR programs to managers. Although they did not work from the Johns (1993) technical versus
administrative innovation framework, their results were completely consistent with it: Although
46% of their I-O respondents indicated they had used utility analysis (a technical argument) to
communicate the value of selection tests to decision-makers, when asked (a) if the managers
perceived the utility estimates to be credible and (b) if the utility-based communication was
more effective than other information in its persuasive value, these same I-O psychologists
produced mean ratings that fell somewhere between the rating scale anchors of “occasionally”
and “hardly ever.” Hazer and Highhouse (1997) tested various methods of estimating utility for
its persuasive value with managers. Among their conclusions were “… the degree to which man-
agers accept the notion that the effectiveness of HR programs can be measured in terms of dol-
lars remains an open question” (p. 110). As Latham (1988) observed, “The reason why Brogden
and Taylor’s (1950) work has been largely ignored for 38 years may be that the customer has not
asked for it” (p. 561).
All of these studies are compatible with Johns’ (1993) hypothesis that I-O arguments such as util-
ity or mechanical combination of information tend to be technical, which would be fine if decision-
makers considered employee selection (or training) to be in the class of technical innovation. But,
such arguments will likely fall on deaf ears if the decision-makers in reality consider them to be
administrative innovations. If Johns (1993) is right, then it is no surprise that mounting ever more
powerful technical arguments for the superiority of mechanical combination meets with so little
success.
Returning to the Shrivastava and Mitroff (1984) paper referenced earlier, managerial decision-makers' preference for "subjective, experiential data" (p. 20), along with the recognition that "the analytical devices, methodological rules of inquiry, and the decision heuristics adopted by decision-makers are highly personalized, informal and non-standard" (p. 20), are considerations worth taking into account when trying to find appropriate roles for mechanical combination of data and for decision-makers. In our view, we need to find a happy meeting point between the two if we are to progress. Finding that point will not simply be a matter of "better education" for the decision-makers so they can begin to see the error of their ways.
We can start by looking closely at the literature on clinical versus mechanical combination. The
original research was stimulated by a firmly entrenched belief within the psychotherapeutic and
medical diagnostic communities that the predictions “in the heads” of clinical practitioners consti-
tuted the gold standard. Further, that clinical gold standard was expensive, in that it often involved pooling the judgments of multiple highly compensated clinicians in case conferences. So, when
empirical research began to find that mechanical combination of data was equal to or better than
clinical combination in the vast majority of cases, the cost benefit of mechanical combination tipped
the scale in favor of conclusions of the almost universal superiority of mechanical combination—if
not in terms of predictive accuracy at least in terms of economic superiority. In the Grove et al.
(2000) meta-analysis, 136 studies were examined. In 63 (46%) of the studies, mechanical combina-
tion was superior in predictive validity. In only eight (6%) of the studies was clinical combination
more valid. But, in 65 (48%) of the studies there was no substantial difference. Ganzach, Kluger,
and Klayman (2000) reported the usual superiority of mechanical to clinical prediction in a study
of military recruits’ later performance. However, they indicated that the best prediction involved
both. These studies have profound implications for practice in organizations. Taken to their logical conclusion, they suggest roughly even odds that a process with a significant clinical element in the way assessment information is combined will do about as well as a purely mechanical combination model. And, if we pool clinical and mechanical prediction, we could possibly do even better. In other words, as I-O practitioners we need not be doctrinaire about the value of mechanical combination to have an effective assessment program that is still palatable to decision-makers and their needs.
As Johns (1993) put it:

The point is not so much that products of I-O science are too complex or "scientized." Rather, it means that they are out of sync with the way organizations make decisions about the class of problems for which I-O innovations are thought to be solutions. (p. 573)
Applying this advice means putting less reliance on promoting the technical merits of mechanical
combination of pre-employment data and more on finding what drives managers’ decisions. It
means building persuasive cases that start by putting oneself in the mind of the decision-maker.
We might, for instance, capitalize on something Johns (1993) pointed out has great influence on
many managers: the persuasive power of imitation. Those who work inside organizations cannot
help but notice the legitimizing effect that framing an innovation as a “best practice” or through
“benchmarking” can have. If the I-O scientist-practitioner can find other, respected organizations
that have adopted less purely clinical ways of combining pre-employment predictors, he/she can
harness the power of imitation as an invaluable aid in helping mold decision-making.
It may help to capitalize on managerial values, some of which are the same things that cause
decision-makers to be attracted to UIT; namely, increased speed and reduced cost. Mechanical
combination generally beats clinical combination on both counts. However, both considerations
might be trumped by the felt need for “high touch” in some organizations. In these cases, it might
be worth experimenting with using mechanical combination to present the decision-makers not
with a single “right answer” but with a palette of options (or candidates) from which they can select.
Or, mechanical procedures could be used to supply the decision-makers with a suggested solution
(i.e., the candidate(s) proposed to be hired) but with the added step of asking the decision-makers
to ratify these recommendations. Of course, the decision-makers may choose to override these
recommendations, but this process would help retain the best of mechanical combination without
stripping managerial decision-makers of their latitude to make decisions.
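The "palette of options" approach just described can be sketched in a few lines of Python. Everything here is illustrative: the predictor names, weights, and standardized scores are hypothetical, and in practice the weights would come from a validated prediction model rather than being invented.

```python
# Sketch of using mechanical combination to produce a slate of candidates
# for decision-makers to choose from, rather than a single "right answer."
# All names, weights, and scores below are hypothetical illustrations.

def mechanical_score(candidate, weights):
    """Weighted sum of standardized predictor scores (mechanical combination)."""
    return sum(weights[p] * candidate[p] for p in weights)

def slate(candidates, weights, k=3):
    """Return the top-k candidates by mechanical score; the final choice
    (and any ratification or override) is left to the decision-maker."""
    ranked = sorted(candidates, key=lambda c: mechanical_score(c, weights),
                    reverse=True)
    return ranked[:k]

weights = {"cognitive": 0.5, "structured_interview": 0.3, "biodata": 0.2}
candidates = [
    {"name": "A", "cognitive": 1.2, "structured_interview": 0.4, "biodata": 0.9},
    {"name": "B", "cognitive": 0.3, "structured_interview": 1.5, "biodata": 0.2},
    {"name": "C", "cognitive": -0.5, "structured_interview": 0.1, "biodata": 0.0},
    {"name": "D", "cognitive": 0.9, "structured_interview": 0.8, "biodata": 1.1},
]
print([c["name"] for c in slate(candidates, weights, k=3)])
```

The ranking stays mechanical, but the decision-maker retains the latitude to choose among the slate.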
Consider, for example, a selection procedure built around an accomplishment record, the score on the accomplishment record being a mechanical combination of the points assigned. However, a decision-maker notes from a particular candidate's resume and references that she put herself through school while working two jobs and completed a very successful internship, all while tending to her elderly grandmother. Clearly, there was no time left in the day for campus leadership positions for this otherwise high-achieving student. This might be a "broken leg" case of the sort Meehl (1954) described, in which it would make sense to temper the mechanical prediction of success through the intervention of the decision-maker. To be sure, given the tendency of many decision-makers to clinicalize decision-making, there is a real risk of progressively defining down what constitutes a metaphorical broken leg. But if well-defined guidelines can be developed for what constitutes a legitimate broken leg and what does not, the decision-makers could play a valuable role in preventing mispredictions. Of course, the guidelines themselves would have to have buy-in from the decision-makers. Better yet, decision-makers could be involved in developing the guidelines. In this way, decision-makers could be given a role that is complementary to, rather than competitive with, the formula.
Kleinmuntz (1990) offered a couple of additional suggestions. He noted that researchers as far back as the 1960s have written that a combination of expert judgment with mechanical combination of results can actually outperform either clinical prediction alone or statistical prediction alone. This, of course, raises the question of what manner of synthesis between clinical and mechanical prediction improves results. It appears that when the experts (decision-makers, managers) are used to
collect the data and evaluate their subtleties and quirks, encode (rate) the information in a way that
is amenable to quantitative analysis, and then hand it over for statistical combination, the predictive
results are optimized. This approach assigns appropriate roles to the decision-maker/manager and
to the formulaic prediction model. Neither is poaching on the other’s territory; everyone is happy. In
the context of selection an obvious example is in interviewing. Here, interviewers, especially panels
of interviewers, could conduct a structured interview, judging from the candidate’s responses how
he/she will likely perform on the job. Then, the interviewer(s) encode their complex judgments as
ratings. At this point, if they are smart, the interviewers stop and let the formula combine their rat-
ings to make the prediction. Then, the decision-makers review the prediction, looking especially for
any broken legs. In most cases they will affirm the prediction. But, in some cases they will override
it, and provided they can clearly articulate job-related reasons why (“gut feel” would not qualify),
the override would usually be appropriate.
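The workflow just described, in which interviewers rate, a formula combines, and decision-makers ratify or override only with an articulated job-related reason, can be sketched as follows. The cutoff, ratings, and function names are hypothetical illustrations, not a prescribed implementation.

```python
# Sketch of the division of labor described above: interviewers supply
# encoded ratings, a fixed formula combines them, and a decision-maker may
# override only with a documented, job-related "broken leg" reason.

def combined_rating(panel_ratings):
    """Mechanical combination: average the panel's encoded ratings."""
    return sum(panel_ratings) / len(panel_ratings)

def final_decision(panel_ratings, cutoff, override=None, reason=None):
    """Return the formula's prediction unless the decision-maker supplies
    both an override and an articulated job-related reason ("gut feel"
    would not qualify, so an override with no reason is ignored)."""
    prediction = "hire" if combined_rating(panel_ratings) >= cutoff else "reject"
    if override is not None and reason:
        return override
    return prediction

print(final_decision([4, 5, 4], cutoff=3.5))            # formula affirmed
print(final_decision([3, 2, 3], cutoff=3.5,
                     override="hire",
                     reason="documented broken-leg circumstances"))
```

Requiring the written reason is what keeps the override channel from quietly re-clinicalizing the whole process.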
If decision-makers can be convinced to limit their role to what they do best, letting statistical
models do their part, prediction can be improved. However, there are impediments to this, a discussion
of which goes beyond the scope of this chapter. Suffice it to say that such things as overestimates by
decision-makers of their expertise (called “deluded self-confidence” by Kleinmuntz, 1990, p. 302)
can wreck the plan to apportion appropriate roles for clinical and statistical prediction. Steps beyond
just declaring that clinical and mechanical prediction must share the stage are usually going to be
necessary. This is discussed in the next section.
Capitalize on Self-Discovery
Self-discovery has long been recognized as an effective method of attitude and behavior change.
If decision-makers can be provided circumstances that allow them to discover on their own the
benefits of mechanical combination of data, the path will be paved for better use of the strengths
of data gathering (a clinical process) and data combination and prediction (a mechanical process).
Self-discovery could be helped along by borrowing from cognitive dissonance theory (Festinger,
1957; Wood, 2000). Decision-makers could be led to adopt mechanical combination more
extensively by dissonance created by the discrepancy between their clinical predictions’ results
and corresponding results from mechanical predictions. As noted in Wood (2000), one theory
of cognitive dissonance proposes that dissonance is increased “when people fail to behave in a
manner consistent with some valued self-standard” (p. 546). If, as seems likely, decision-makers
value hiring or promoting the right people, a large “miss rate” could create dissonance. Switching
behavior to a more mechanical approach would increase their hit rate and thus reduce dissonance.
Administering Assessments and Decision-Making 395
This would require convincing decision-makers to formalize their predictions about candidates
and then track the degree of success of the candidates who were hired or promoted, especially
those who were hired/promoted in conflict with the predictions of a mechanical model. This
would provide the data necessary to create dissonance. Greater reliance on objective combination
of predictive information would reduce it. This approach may be facilitated by the current
trend toward evidence-based management (e.g., Pfeffer & Sutton, 2006), which has legitimized
quantitative analysis in many places where pure intuition previously ruled. In any event, it would be
important to have the tracking process be transparent to the decision-makers, even as the data are
being collected. To “ambush” them with results showing the superiority of mechanical decision-
making would be counterproductive if the goal is to rationalize the way data are combined and
used in employment decisions. Such transparency may not lead to the most rigorous science, but
if the real goal is informed decisions about decision-making processes, this compromise would
be well worth it.
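One minimal way to build such a transparent track record is simply to log both sets of predictions and compare hit rates once outcomes are known. The sketch below uses hypothetical data and a deliberately simple two-category outcome; a real tracking system would of course record richer criteria.

```python
# Sketch of the "track record" described above: record each decision-maker's
# clinical prediction and the mechanical prediction for every candidate,
# then compare hit rates once outcomes are known. Data are hypothetical.

def hit_rate(predictions, outcomes):
    """Proportion of predictions that matched the later outcome."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)

clinical   = ["hire", "hire", "reject", "hire", "reject", "hire"]
mechanical = ["hire", "reject", "reject", "hire", "hire", "hire"]
# Later observed outcome ("hire" = the candidate would have succeeded):
outcomes   = ["hire", "reject", "reject", "hire", "hire", "reject"]

print(f"clinical hit rate:   {hit_rate(clinical, outcomes):.2f}")
print(f"mechanical hit rate: {hit_rate(mechanical, outcomes):.2f}")
```

A visible gap between the two rates is the raw material for the dissonance, and the eventual self-discovery, described above.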
IN SUMMARY
To reiterate, specific steps that can be taken to find a healthy balance between quantitative rigor and
managerial decision-making rights include:
• Volunteer a meaningful role for decision-makers to play in the process, even if it means
compromising somewhat on methodological rigor and accuracy. Do not put decision-
makers in a position where they feel their judgment must compete with a formula. The
formula will always eventually lose.
• Have the mechanical formula produce a slate of candidates, not just one. Charge the deci-
sion-makers with selecting the one to receive the job offer or promotion. This is not a new
concept; post-World War II era civil service statutes stipulated that a “rule of 3” be used,
calling for three candidates to be presented to the hiring authority for each opening to be
filled.
• Involve decision-makers in the mechanical combination process by letting them set the
weights for various pieces of information to be combined. It will incorporate the intuitive
element decision-makers prefer for administrative innovations, and although it may not
actually enhance the mechanical combination of information, neither will it do it damage.
• Specifically charge decision-makers with identifying Meehl’s broken leg situations in
which the mechanical, formulaic results should be overridden by human judgment. At the
same time, be sure there are objective rules in place to prevent what defines a broken leg
from becoming diluted.
• Create situations of self-discovery for decision-makers. Most managers want to make good
decisions, but conventional education will only go so far, especially when what it teaches
is counterintuitive. Engage them in developing “track records” comparing the success
rates of the formulaic approach with the decision-makers’ preferred judgmental approach.
Presuming the formulaic approach proves itself, the dissonance created should help in self-discovery of "a better way."
• Utilize the imitative tendencies of most organizations by finding best practices in other,
respected organizations that have successfully merged mechanical combination with man-
agerial clinical judgment. Likewise, capitalize on the things that are meaningful to manag-
ers (e.g., increased speed, reduced cost) to let them discover how mechanical combination
can be supportive of the objectives they value.
• Capitalize on general changes in thinking within the organization. Many organizations are
just now discovering the wisdom of quantitative, objective analysis—often referred to as
“evidence-based management.” If this is the case, use this as a springboard for legitimizing
the benefits of mechanical combination of data to improve selection decisions.
CONCLUSIONS
In this chapter, we covered two important topics, one that occurs at the beginning of the assessment
process, unproctored testing, and the other that occurs at the end of the process, combining data for
decision-making. With regard to the former, the Internet has created great opportunity and great risk
for assessment. In this chapter we suggested that simulation of the most likely consequences of UIT
is a useful first step in the process of determining whether the benefit is worth the risk. With regard
to employment-related decision-making, like most research-oriented I-O practitioners, we would
normally prefer to use algorithms that combine information objectively. Yet, we also acknowledge
that approaches that strip organizational decision-makers of their capability to make those decisions
likely do not have a rosy future in most organizations. So, the trick should become one of finding a
solution that balances accuracy of prediction with managers’ rights to make employment decisions.
Some ways in which this might be accomplished were suggested.
Although unproctored testing and combination of data for prediction might seem at first blush to have little in common, there is in fact considerable commonality between them: Both represent areas where the rules and values of science and research often collide with those of management. There were many issues we could have chosen from the very broad area of administration,
scoring, and interpretation of assessments where this difference might crop up. The most pressing
issues of today, such as UIT, will eventually find a resolution only to be replaced by other issues
that involve the interface between the opinions of stakeholders in organizations and those of the I-O
scientist-practitioner. Whether it is a manager who wants to go to UIT, a manager who believes she
has an innate ability to assess people, a manager who has too little trust in pre-employment tests, or
one who has too much, I-O psychologists must play a lead role in helping determine where the right
balance is. This means that I-O psychologists should have in their role statement an obligation to
help resolve all kinds of misconceptions about assessment. Moreover, we must know how to do it in
ways that touch upon managers’/decision-makers’ value systems. This means we must be prepared
to act not only as educators for our organizations, but we may also be called upon to act as applied
social psychologists in creating situations for participation and self-discovery that lead to better-
informed, more effective practices within our organizations.
REFERENCES
Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2008). Magnitude and extent of cheating and response
distortion effects on unproctored Internet-based tests of cognitive ability and personality. Manuscript
submitted for publication.
Blackburn, M. A. (1998). Cheating and motivation: An examination of the relationships among cheating
behaviors, motivational goals, cognitive engagement, and perceptions of classroom goal structures.
Unpublished doctoral dissertation, The University of Oklahoma, Norman, OK.
Brogden, H. E., & Taylor, E. K. (1950). The dollar criterion—Applying the cost accounting concept to criterion
construction. Personnel Psychology, 3, 133–154.
Burke, E. (2006). Better practice for unsupervised online assessment. Surrey, England: SHL Group.
Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum.
Cizek, G. J. (2001, April). An overview of issues concerning cheating on large-scale tests. Paper presented at
the annual meeting of the National Council on Measurement in Education, Seattle, WA.
Dana, J., & Thomas, R. (2006). In defense of clinical judgment...and mechanical prediction. Journal of
Behavioral Decision Making, 19, 413–428.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist,
34, 571–582.
Ethics Resource Center. (2007). National business ethics survey: An inside view of private sector ethics.
Arlington, VA: Author.
Fallon, J., Janovics, J., Lamazor, A., & Gust, J. (2006, May). An early adopter’s view: Data from a one-year
unproctored Internet testing study. Paper presented at the 21st annual conference of the Society for
Industrial and Organizational Psychology, Dallas, TX.
Festinger, L. (1957). A theory of cognitive dissonance. Evanston, IL: Row Peterson.
Foster, D. F. (2008, April). Secure, online, high-stakes testing: Science fiction or business reality? Paper pre-
sented at the 23rd annual conference of the Society for Industrial and Organizational Psychology, San
Francisco, CA.
Ganzach, Y., Kluger, A. N., & Klayman, N. (2000). Making decisions from an interview: Expert measurement
and mechanical combination. Personnel Psychology, 53, 1–20.
Grauer, E., Beaty, J., Davis, J., & Meyer, J. (2006, May). Unproctored Internet testing: Important questions
and empirical answers. Paper presented at the 21st annual conference of the Society for Industrial and
Organizational Psychology, Dallas, TX.
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and
formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy. Psychology,
Public Policy, and Law, 2, 293–323.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical predic-
tion: A meta-analysis. Psychological Assessment, 12, 19–30.
Hazer, J. T., & Highhouse, S. (1997). Factors influencing managers’ reactions to utility analysis: Effects of SDy
method, information frame, and focal intervention. Journal of Applied Psychology, 82, 104–112.
Hough, L. M. (1984). Development and evaluation of the “accomplishment record” method of selecting and
promoting professionals. Journal of Applied Psychology, 69, 135–146.
Huff, K. (2006). The effects of mode of administration on timed cognitive ability tests. Unpublished doctoral dissertation, North Carolina State University, Raleigh, NC.
Hughes, D., & Tate, L. (2007). To cheat or not to cheat: Candidates’ perceptions and experiences of unsuper-
vised computer-based testing. Selection & Development Review, 23, 13–18.
Hughes, J. M. C., & McCabe, D. L. (2006). Academic misconduct within higher education in Canada. Canadian
Journal of Higher Education, 36, 1–21.
International Test Commission. (2005). International guidelines on computer-based and Internet delivered
testing. Author.
Johns, G. (1993). Constraints on the adoption of psychology-based personnel practices: Lessons from organi-
zational innovation. Personnel Psychology, 46, 569–592.
Josephson Institute of Ethics. (2006). Report card on the ethics of American youth. Los Angeles, CA. Retrieved
May 1, 2008, from http://charactercounts.org/programs/reportcard/2006/data-tables.html
Junior Achievement/Deloitte. (2007). 2007 JA Worldwide/Deloitte teen ethics survey. Author.
Klein, H. A., Levenburg, N. M., McKendall, M., & Mothersell, W. (2007). Cheating during the college
years: How do business school students compare? Journal of Business Ethics, 72, 197–206.
Kleinmuntz, B. (1990). Why we still use our heads instead of formulas: Toward an integrative approach.
Psychological Bulletin, 107, 296–310.
Kluger, A. N., Reilly, R. R., & Russell, C. J. (1991). Faking biodata tests: Are option-keyed instruments more
resistant? Journal of Applied Psychology, 76, 889–896.
Latham, G. P. (1988). Human resource training and development. Annual Review of Psychology, 39, 545–582.
Lievens, F., Highhouse, S., & DeCorte, W. (2005). The importance of traits and abilities in supervisors’
hirability decisions as a function of method of assessment. Journal of Occupational and Organizational
Psychology, 78, 453–470.
Macan, T. H., & Highhouse, S. (1994). Communicating the utility of human resource activities: A survey of I/O
and HR professionals. Journal of Business and Psychology, 8, 425–436.
McCabe, D. L., Butterfield, K. D., & Treviño, L. K. (2006). Academic dishonesty in graduate business pro-
grams: Prevalence, causes, and proposed action. Academy of Management Learning and Education, 5,
295–305.
McCabe, D. L., Treviño, L. K., & Butterfield, K. D. (2001). Cheating in academic institutions: A decade of
research. Ethics and Behavior, 11, 219–232.
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence.
Minneapolis, MN: University of Minnesota Press.
Naglieri, J. A., Drasgow, F., Schmit, M., Handler, L., Prifitera, A., Margolis, A., et al. (2004). Psychological
testing on the Internet: New problems, old issues. American Psychologist, 59, 150–162.
Nathanson, C., Paulhus, D. L., & Williams, K. M. (2006). Predictors of a behavioral measure of scholastic
cheating: Personality and competence but not demographics. Contemporary Educational Psychology,
31, 97–122.
Newstead, S. E., Franklyn-Stokes, A., & Armstead, P. (1996). Individual differences in student cheating.
Journal of Educational Psychology, 88, 229–241.
Nye, C. D., Do, B.-R., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation
a problem? International Journal of Selection and Assessment, 16, 112–120.
Pfeffer, J., & Sutton, R. I. (2006). Hard facts, dangerous half-truths, and total nonsense: Profiting from
evidence-based management. Boston, MA: Harvard Business School Press.
Robinson, E., Amburgey, R., Swank, E., & Faulkner, C. (2004). Test cheating in a rural college: Studying the
importance of individual and situational factors. College Student Journal, 38, 380–395.
Ryan, A. M., & Tippins, N. T. (2004). Attracting and selecting: What psychological research tells us. Human
Resource Management, 43, 305–318.
Schmitt, N., & Oswald, F. L. (2006). The impact of corrections on faking on the validity of noncognitive mea-
sures in selection settings. Journal of Applied Psychology, 91, 613–621.
Shrivastava, P., & Mitroff, I. I. (1984). Enhancing organizational research utilization: The role of decision mak-
ers’ assumptions. Academy of Management Review, 9, 18–26.
Templer, K. J., & Lange, S. R. (2008). Internet testing: Equivalence between proctored lab and unproctored
field conditions. Computers in Human Behavior, 24, 1216–1228.
Terpstra, D. E. (1996). The search for effective methods. HRFocus, 73, 16–17.
Terpstra, D. E., & Rozell, E. J. (1997). Why some potentially effective staffing practices are seldom used. Public Personnel Management, 26, 483–495.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., et al. (2006). Unproctored
Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Wood, W. (2000). Attitude change: Persuasion and social influence. Annual Review of Psychology, 51,
539–570.
Zickar, M. J., & Robie, C. (1999). Modeling faking good on personality items: An item level analysis. Journal
of Applied Psychology, 84, 551–563.
19 Evaluation of Measures
Sources of Error, Sufficiency,
and Contamination
Michael J. Zickar, Jose M. Cortina, and Nathan T. Carter
Suppose that we wish to develop a measure of cognitive ability. After receiving advice from an
“expert” neighborhood psychologist, we will ask each participant to throw a baseball as far as he/
she can, and the distance, in feet, will be our measure of cognitive ability. A shrewd observer might
remark that a given respondent might vary from one trial to the next in the distance that he/she
throws the ball. Wanting to improve upon our measure, we decide to use the average of ten throws.
Whereas there may be considerable intraindividual variability in the distance of a single throw,
there will be little variability in the average distance over ten throws.
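The statistical reason the average of ten throws is so much more stable is that the standard deviation of a mean of n independent trials shrinks by a factor of the square root of n. A small simulation, with hypothetical numbers for the thrower, illustrates this:

```python
# Numerical illustration of why averaging ten throws stabilizes the score:
# the standard deviation of a mean of n independent trials is sigma / sqrt(n).
# The 200-ft mean and 15-ft spread are hypothetical.
import random
import statistics

random.seed(1)
SIGMA = 15.0  # trial-to-trial spread, in feet, for a single throw

def observed(n_trials):
    """One observed score: the mean of n_trials noisy throws around 200 ft."""
    return statistics.mean(random.gauss(200, SIGMA) for _ in range(n_trials))

single = [observed(1) for _ in range(2000)]
averaged = [observed(10) for _ in range(2000)]
print(statistics.stdev(single))    # close to SIGMA
print(statistics.stdev(averaged))  # close to SIGMA / sqrt(10)
```

The ten-throw scores vary across people for real reasons, but far less across repeated trials for the same person, which is exactly the intraindividual consistency the example is after.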
A shrewder observer might now express concern over the possibility that men, being physically
larger than women on average, might have an unfair advantage. In addition, direction and strength of
the wind might play a role, as would previous experience throwing baseballs, age, physical strength,
etc. We therefore endeavor to hold wind constant and to adjust for any and all of the confounding
variables that have been mentioned.
We are now satisfied that we have intraindividual consistency and that confounding factors have
been eliminated. We now discover that, although we have no intraindividual variability, we also
have no interindividual variability. What to do …. Aha! After we have collected the average dis-
tance thrown for each participant, we will administer the most recent version of the Wechsler Adult
Intelligence Scale (WAIS; see Wechsler, 1939) and add that score to the distance score. This sum
will then serve as our measure of cognitive ability. To our great joy, we find in subsequent studies
that our measure predicts those things that should, rationally, relate to cognitive ability, and that it
does not predict those things that should not relate to cognitive ability.
This was a strange test development process to be sure. The initial result was off the mark (as was
Goddard’s initial attempt to screen immigrants at Ellis Island for cognitive ability with questions
requiring knowledge about aspects of American life, including what Crisco was; see Zenderland,
1998); however, our procedure did not stop there. It eventually went through all of the stages of test
development. We began by establishing the reliability of the measure. We focused only on test-retest
reliability because we were not sampling items from a domain of items (i.e., internal consistency),
and we were not using a measure for which scoring required judgment (i.e., interrater reliability).
Having established the reliability of the measure, we then set about establishing its validity. We
began by eliminating the influence of those factors that were not relevant but that might have exer-
cised influence over scores. When it then became obvious that the scores with which we were left
did not allow for appropriate inferences, we went searching for what was missing. In the end, we
had a measure with acceptably low levels of relevant forms of random error; although there was
some irrelevant variance in the measure, we deemed it to have acceptably low levels of contamina-
tion; finally, we deemed the measure to have acceptably high levels of sufficiency (i.e., the measure
covered the main sources of variance in the theoretical construct of interest).
Although we hope that most psychologists would not start off with such an absurd first attempt
at a psychological measure, our example details the main processes involved in evaluating a psychological measure.
Reliability
Although reliability theory is one of the first topics covered in graduate measurement courses, it is
one of the most misunderstood topics. Most students learn about reliability in the context of classi-
cal test theory (CTT) and are deceived by the simple formula, X = T + E, where an observed score
is mysteriously parsed into a true score, T, and error, E. Students who delve a little deeper into reli-
ability theory realize that there is little “true” about the true score, and often what they think is error
is not. What novice researchers often miss is that the type of reliability coefficient calculated dictates the source of error that is identified in a particular measure. In this chapter, we focus on three common types of error that are often present in psychological measures: error associated with different items, error associated with different raters, and error due to momentary, time-limited phenomena.
When judging the same spoken or written behavior, observers interpret information differently.
more stringent. Some interviewers might give preference to blondes whereas others may uncon-
sciously give high ratings to people who wear blue ties. Differences in rater behavior can sometimes
be reduced by providing training, although given the different ways individuals view the world,
these differences are unlikely to be completely eliminated.
There is a large amount of error related to raters when judging job-related variables. For exam-
ple, Woehr, Sheehan, and Bennett (2005) found that unique, idiosyncratic source-specific factors
were responsible for two-thirds of the variance in performance ratings. Employment interview
researchers have also demonstrated that the interrater reliability of interview ratings is typically fairly low (see Conway, Jako, & Goodman, 1995).
There are many ways to reduce the amount of error related to raters. If at all possible, it is
important to standardize the nature of information that different raters observe. In addition, provid-
ing frame-of-reference training that attempts to provide common standards of comparison might
help improve interrater reliability. Computerized scoring algorithms are used by large-scale testing
companies to interpret and score written essays in the GRE and other certification tests, thereby
eliminating the possibility of rater unreliability. If you cannot reduce error by standardizing infor-
mation, the best way to reduce it is to increase the number of raters, thereby reducing the amount
of error through aggregation in the same way that increasing the number of items reduces internal
inconsistency. Taking an average of many raters will cancel out the positive and negative errors
associated with individual raters.
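The payoff from aggregating raters can be quantified with the Spearman–Brown prophecy formula, which gives the reliability of the average of k parallel raters from the reliability of one rater. A minimal Python sketch (the single-rater reliability of .40 is a hypothetical value, chosen only for illustration):

```python
def spearman_brown(single_rater_r, k):
    """Reliability of the average of k parallel raters, given the
    reliability (e.g., an intraclass correlation) of a single rater."""
    return k * single_rater_r / (1 + (k - 1) * single_rater_r)

# With a single-rater reliability of .40, averaging raters pays off quickly:
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.40, k), 2))  # 0.4, 0.57, 0.73, 0.84
```

The same formula describes why adding items improves internal consistency: aggregation lets the idiosyncratic errors of individual raters (or items) cancel.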
Measures of reliability sensitive to time-limited factors (e.g., test-retest reliability) rest on the
assumption that all score differences across two separate testing administrations are due to momen-
tary time-limited errors. Of course, differences in scores across two administrations can be due not
only to time-limited errors such as the ones mentioned but also to true change in the underlying
construct. For example, dramatic changes in vocabulary test scores across 2 years' time are
more likely due to true growth in vocabulary than to momentary, time-limited errors.
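The logic of a test-retest coefficient can be illustrated with a small simulation in which each observed score combines a stable true score with occasion-specific error (all variance values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_score = rng.normal(0, 1, n)          # stable underlying construct
t1 = true_score + rng.normal(0, 0.5, n)   # administration 1 (occasion error)
t2 = true_score + rng.normal(0, 0.5, n)   # administration 2 (occasion error)

retest_r = np.corrcoef(t1, t2)[0, 1]
# In expectation, r = true variance / total variance = 1 / 1.25 = .80
print(round(retest_r, 2))
```

If the construct itself had grown between administrations, as with vocabulary over 2 years, the same coefficient would confound true change with time-limited error.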
Validity
Our review of validity focuses on sufficiency and contamination, two concepts that are deemed
critical for demonstrating evidence of validity. There are several approaches to validation, including
criterion-related validation, content validation, and construct validation, with each of the methods
having different techniques and methods for evaluating the amount of validity. We believe that
all approaches to validation are related to each other and that the concepts of sufficiency and
contamination, although most often used in discussions of content validation, are relevant to all forms
of validation [see Landy, 1986; the Society for Industrial and Organizational Psychology (SIOP)
Principles, SIOP, 2003]. However, as we describe below, the issue is complicated by confusion
over the role of sufficiency in unidimensional scales and by an implied quasi-isomorphism between
sufficiency and content validity. For these reasons, we focus most of our attention on the role of suf-
ficiency in content validation.
Sufficiency
In discussions of validity, it is often asked whether the test in question covers all the ground
that it should. For example, measures of job performance have been expanded to accommodate
dimensions that have been added to models of job performance (e.g., adaptive performance;
Pulakos, Arad, Donovan, & Plamondon, 2000). This change was made because of criticism that
existing measures of job performance focused too restrictively on dimensions related to produc-
ing while ignoring dimensions related to supporting the organization and its members. Similarly,
intelligence measures have been expanded to accommodate dimensions that have been added
to models of intelligence (e.g., practical intelligence; Sternberg, Wagner, Williams, & Horvath,
1995). These additions were made because of criticisms that existing measures of intelligence
focused too much on types of cognitive skills related to academic success compared with other
domains in life.
Evaluation of Measures 403
One might conclude from these criticisms that existing measures were insufficient. However, it
would be more appropriate to say that existing models of performance were insufficient, and that
the measures merely reflected the inferior models on which they were based. If we assume that a
measure is unidimensional, then insufficiency can only indicate factorial complexity at the model
level. It seems more parsimonious, then, to stipulate that sufficiency is a property of conceptual
models rather than measures. Once a model has been deemed to cover the full breadth of its domain
(e.g., a performance model composed of technical performance, contextual/citizenship performance,
adaptive performance, interpersonal performance, etc.), then unidimensional scales measuring each
factor can be developed. Reliability then reflects the proportion of true score variance,
and validity represents lack of contamination.
This position may seem at odds with the American Psychological Association (APA), National
Council for Measurement in Education (NCME), and American Educational Research Association
(AERA) Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999). In
the section on content-related evidence, it is stated that:
There are several observations to be made about this passage. The first is that sufficiency is inex-
tricably intertwined with content-related validity evidence. Evidence of insufficiency comes from
a comparison of test content to the “content domain.” Omissions suggest insufficiency. Second, the
solution that is offered in the passage has to do with contamination rather than sufficiency. This may
have been incidental, but it may also have been due to an inability to refer to insufficiency without
also referring to deficiencies in the definitions of the construct of interest and of the domain of items
that apply to it. Third, this passage is representative of the Standards (AERA, APA, & NCME,
1999) as a whole in that nowhere in the Standards are issues of sufficiency raised without reference
to content.
Although the term “sufficiency” does not appear in the index of the Standards (AERA, APA, &
NCME, 1999) or in any of the relevant standards themselves (e.g., 1.6, 1.7, 3.2, 3.6, 14.8, 14.11),
issues related to sufficiency appear in every section that deals with content-related evidence. Issues
relating to contamination, on the other hand, appear in every section that deals with evidentiary
bases of inferences.
Our position is that content-related evidence has the potential to expose insufficiency only if the
construct is poorly specified. If the construct is well specified, then insufficiency is not possible in
the absence of egregious oversight. Therefore, we recommend that to ensure sufficiency, researchers
devote additional effort to better explaining the conceptual foundations of their measure or model.
From our experience, many scale development efforts jump straight into item writing with little
attention paid to careful explication of the construct that those items are supposedly measuring.
Engaging in more “upfront” thinking about the target construct will help ensure sufficiency. Finally,
it should be noted that in some cases, organizations may purposely use measures that are insuffi-
cient for purposes of expediency or cost. In these cases, organizations should be careful to address
whether their insufficiency results in differences in selection rates for minority groups compared to
measures without insufficiency.
Contamination
As noted in the introduction, measurement contamination implies that a particular measure is influ-
enced by unwanted sources of variance, different from the construct of interest. Confirmatory factor
analytic frameworks are helpful in understanding the complex multidimensional nature of contami-
nation by isolating different sources of variance. As will be noted throughout this section, concern
for contamination is motivated not only by the psychometric goal of creating a “pure” measure, but
also by a desire to minimize sources of irrelevant variance that covary with membership in protected
demographic classes that are accorded special protection under employment law. Therefore, all
industrial-organizational (I-O) psychologists should be concerned with the contamination of their
instruments. Given the complexity of the analyses that can be used to quantify contamination, we
devote more space to this topic than to reliability and sufficiency.
an appropriate way of assessing how the structure of constructs differs across groups. It should
be noted that item response theory (IRT) is another way to determine measurement equivalence,
although we have chosen to exclude discussion of IRT for parsimony’s sake. The interested reader
should consult Raju and Ellis (2002).
x = Λξ + δ (19.1)
defines the vector of observed variables. This can be expressed as a matrix formula in which the
values of x are determined by the strength of the relationships between the variables and the traits
(i.e., factor loadings), Λ; the values of the traits themselves, ξ; and a vector containing the residuals,
or uniquenesses, δ. To elaborate, consider a person’s responses to a test with four items or observa-
tions and two proposed dimensions or factors, which are clarified in Equation 19.2.
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} \lambda_{11} & 0 \\ \lambda_{21} & 0 \\ 0 & \lambda_{12} \\ 0 & \lambda_{22} \end{bmatrix}
\begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix}
+
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{bmatrix}
\qquad (19.2)
\]
The researcher has specified that the first two items belong to the first of the j factors and the remain-
ing two to the second dimension by setting the factor loadings, λij , to 0 for the factors’ excluded
items, and leaving the others free to vary. Thus, the structure of the response-trait relationship is
specified a priori. The CFA model represented in Equation 19.2 is depicted in Figure 19.1.
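The measurement model of Equation 19.2 can be made concrete by simulating responses from it. A sketch in Python; the loading, factor covariance, and uniqueness values are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Lambda from Equation 19.2: items 1-2 load on xi_1, items 3-4 on xi_2
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.6],
                   [0.0, 0.9]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                          # correlated latent traits

xi = rng.multivariate_normal([0.0, 0.0], Phi, size=n)  # latent trait scores
delta = rng.normal(0.0, 0.5, size=(n, 4))              # uniquenesses

X = xi @ Lambda.T + delta   # x = Lambda xi + delta, one row per respondent
print(X.shape)              # (1000, 4)
```

Items sharing a factor (x1 and x2) correlate substantially in the generated data, whereas items on different factors correlate only through Φ.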
As can be seen in the figure, we are working not only with observed measures, but also with unob-
servables. Thus, we cannot work with Equation 19.1 alone. Instead, we must turn to the covariance
matrix of the observed variables, xi. To utilize the covariance matrix, it must be assumed that the
factors, ξ, and the uniquenesses, δ, are random variables. Secondly, the residuals, or uniquenesses,
should not systematically vary across factors. Finally, observations must be measured in deviations
FIGURE 19.1 Path diagram of the two-factor, four-item CFA model specified in Equation 19.2. (Diagram not reproduced.)
from their mean. If these assumptions hold, we can conclude that the covariance matrix takes the
form
Σ = Λx Φ Λx′ + Θ (19.3)
Equation 19.3 states that the covariance matrix of the observed variables is equal to the product of
the factor loading matrix; its transpose; and the variance-covariance matrix, Φ, of the latent traits,
ξ, plus the covariance matrix, Θ, of the uniqueness terms, δ. By specifying the expected form of
the variables as shown with Equation 19.2, its covariance matrix (Equation 19.3) can be used to
determine if the data are likely given the hypothesized model. The resulting χ2 statistic reflects the
deviance of the predictions made by the hypothesized model from the observed data. This statistic
is used to evaluate the goodness-of-fit of the CFA model.
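Equation 19.3 is easy to verify numerically. With hypothetical parameter values for the four-item, two-factor model, the model-implied covariance matrix is:

```python
import numpy as np

Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.6],
                   [0.0, 0.9]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                  # factor variance-covariance matrix
Theta = np.diag([0.36, 0.51, 0.64, 0.19])     # uniqueness variances

Sigma = Lambda @ Phi @ Lambda.T + Theta       # Equation 19.3
print(np.round(Sigma, 3))
```

Fitting the model amounts to choosing Λ, Φ, and Θ so that this implied Σ comes as close as possible to the observed covariance matrix; the χ2 statistic summarizes the discrepancy that remains.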
(CIs)1 around the RMSEAs for each model and comparing each succeeding model’s interval to the
prior model’s interval. If the two do not overlap, there is a significant difference in fit, and thus a
violation of measurement invariance (see Cheung & Rensvold, 2002; Kline, 2005, for an excellent
introduction to issues involving fit).
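The interval-overlap decision rule can be written down directly. A minimal sketch, using the Model 1 and Model 2 intervals from Table 19.1:

```python
def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) RMSEA confidence intervals overlap.
    Non-overlap is read as a significant decrement in fit, i.e.,
    a violation of measurement invariance."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Model 1 (.042-.051) vs. Model 2 (.043-.052) from Table 19.1:
print(intervals_overlap((0.042, 0.051), (0.043, 0.052)))  # True: invariance retained
```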
In the following section, we focus on the steps taken in the case in which the covariance matrices
exhibit MI/E. The test of the above hypothesis is carried out in the same fashion as the tests discussed
here. If the test is rejected, we are concerned with uncovering the source of the problem through
sequential model comparisons, which are the focus of this chapter. The four forms of MI/E presented
here are: (a) configural invariance, (b) metric invariance, (c) factor variance-covariance (FVC)
invariance, and (d) uniqueness invariance. We now present an examination of each form of MI/E,
how they are tested, and their implications for evaluating test scores across groups. Additionally, the
hypotheses are presented in the order in which they are conducted. These hypotheses provide infor-
mation relevant for evaluating contamination. They are presented utilizing an adaptation of Wang
and Russell’s (2005) examination of measurement equivalence between Chinese and American
respondents on the Job Descriptive Index (JDI).
1 Constructing a 90% CI around models and comparing them for overlap is conceptually equivalent to a significance
test for ΔRMSEA at the .10 level; if the intervals overlap, the test is nonsignificant, whereas if they do not, the test is
significant.
2 It should be noted that principal components analysis is not appropriate here because we are dealing with latent traits,
and principal components do not discriminate between common and unique variance (see Fabrigar et al., 1999).
FIGURE 19.2 The baseline (configural) model: the two-factor CFA of Figure 19.1 estimated simultaneously in two groups, with all parameters free to vary across groups. (Diagram not reproduced.)
By confirming the hypothesis associated with configural invariance, we have only begun to
demonstrate MI/E, and only to a small degree. However, this is an important step in that (a) it is the
prerequisite to more accurately determining the “problem,” or source of MI/E; and (b) it allows us
to confirm that the same number and pattern of factors are found in different groups, suggesting that
the latent structure is not systematically impacted by group membership. By examining Table 19.1,
which was adapted from Wang and Russell’s (2005) examination of the JDI across American and
Chinese samples, we see that the RMSEA value is below .08, which suggests that configural invari-
ance holds for these groups. In the context of the JDI, this means that there is equivalent fit for the
data of both groups (Chinese and American) combined to a five-factor model including the trait fac-
tors pay, work, coworker, supervision, and promotion. By finding adequate fit in the baseline model,
we have only shown evidence that, across groups, the number and pattern of factors in the proposed
measurement model are adequate in describing the data. However, had the hypothesis been rejected,
stopping here would have left other important questions unasked. Thus, we continue the sequence
of hypothesis testing. Not surprisingly, we will see the other forms of invariance supported, as was
suggested by the omnibus test.
TABLE 19.1
Simultaneous CFA Across American (n = 1,664) and Chinese (n = 1,319) Subpopulations

Model Specified Across Sample    Type of Invariance    df      χ2      Comparison             Δdf    Δχ2     RMSEA (90% CI)
Model 1: free Λx, Φ, Θδ          Configural            4,948   20,925  None                   —      —       .047 (.042–.051)
Model 2: equal Λx; free Φ, Θδ    Metric                5,015   22,286  Model 1 vs. Model 2    67     1,361   .048 (.043–.052)
Model 3: equal Λx, Φ; free Θδ    FVC                   5,030   22,917  Model 2 vs. Model 3    15     630     .049 (.044–.053)
Model 4: equal Λx, Φ, Θδ         Uniqueness            5,102   25,078  Model 3 vs. Model 4    72     2,160   .051 (.046–.055)

Source: Adapted from Wang, M., & Russell, S. S., Educational and Psychological Measurement, 65, 709–732, 2005.
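The Δχ2 values in Table 19.1 can also be tested directly against the chi-square distribution. With samples this large, every step is statistically significant, which is one reason analysts often prefer the RMSEA interval comparison described above; a brief sketch:

```python
from scipy.stats import chi2

# (delta chi-square, delta df) for the nested comparisons in Table 19.1
comparisons = {
    "Model 1 vs. Model 2 (metric)":     (1361, 67),
    "Model 2 vs. Model 3 (FVC)":        (630, 15),
    "Model 3 vs. Model 4 (uniqueness)": (2160, 72),
}
for label, (d_chi2, d_df) in comparisons.items():
    p = chi2.sf(d_chi2, d_df)   # upper-tail probability of the chi-square
    print(f"{label}: p = {p:.2e}")
```

The extreme sensitivity of Δχ2 to sample size is exactly why supplementary fit indices such as RMSEA play a role in invariance testing.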
Metric Invariance
Whereas configural invariance is a weak test of MI/E, the test of metric invariance is considered
a “strong” test of MI/E (Vandenberg & Lance, 2000, p. 12). Configural invariance does not allow
us to draw the conclusion that the relationship between latent variables and observed item scores is
similar across groups. Where the configural test asks if the hypothesized latent structure is similar
across groups, the metric invariance test examines whether the relationships between the observations
(or item responses), xi, and the latent traits, ξ (i.e., the factor loadings, Λ), are equally strong across groups, suggesting
freedom from contamination by group membership. Stated formally, it is the hypothesis that there
is no significant difference between the factor loadings of the groups. By placing the constraint of
equality, Λg = Λg′, on the baseline model we can test this hypothesis.
Consider the model in Figure 19.3. It is identical to that of Figure 19.2 except that loadings for
Group 2 have the same subscripts as the loadings for Group 1, indicating that the loading for each
item on its latent variable is constrained to be equal between groups.
Such a constraint also supplies us with Model 2 from Table 19.1. We may now compare Model
2 to the configural or baseline model (Model 1). As can be seen in Table 19.1, the 90% CIs for
RMSEA in the two models overlap. Therefore, we can say that evidence for metric invariance has
been observed, or that the relationship between the trait or factor and the observed response is
equivalent across Chinese and American respondents for the JDI.
The test of metric invariance is deemed strong because a finding of invariance allows us to make much stronger
assertions about the comparability of items across groups or forms. If metric invariance holds for
the measure, it means that “People in different [groups] respond to items the same way” and may
therefore be compared across groups (Steenkamp & Baumgartner, 1998, p. 80). This is because
we have evidence that there is a similarly strong relationship between our observations and the
unobservable traits or abilities we are measuring. Further, this means that, for both groups, given a
certain increase in the latent trait, a particular magnitude of change is observed in scores. Without
showing evidence of metric invariance, comparing the two groups’ test scores would be analogous
to comparing temperatures between Arizona and Alaska when the researcher in Arizona has taken
measurements in Celsius and the Alaskan in Fahrenheit.
In a selection context, metric invariance is an important consideration. If the relationships between
the trait(s) and the observed responses are contaminated, resulting in lower scores for members of
one group in an applicant pool, an organization may unknowingly make employment decisions on
the basis of subgroups defined by race, sex, or national origin. Members of some of these groups are
protected by laws in several countries (see Myors et al., 2008 for a review). Additionally, Meredith
(1993) has shown that in a selection procedure (defining “fairness” as the case where two individu-
als with the same “true” score have the same probability of being selected from the applicant pool;
this is a scientific definition, not a legal one) metric invariance is necessary for fairness to exist.
Uniqueness Invariance
Measurement error lies at the heart of every test development effort. In considering the use of
a test of subpopulations, it is important that the scores resulting from observations or responses
have equivalent residuals in both groups. If an assessment shows smaller residuals for one par-
ticular group, then scores in that group have increased measurement precision compared with
the other group. This would imply that our estimates of uniqueness are contaminated by group
membership.
The matrices we constrain to create Model 4 contain what are known as uniquenesses; each
uniqueness is the sum of (a) random error and (b) systematic error not accounted for by the latent factor
(Mulaik, 1972). We can determine the invariance of the uniquenesses of indicators by imposing the
constraint Θδg = Θδg′; unique variances are now constrained to be equal. Regarding Table 19.1,
such a constraint creates Model 4, which we in turn compare to Model 3. Examining Table 19.1,
we see that the hypothesis of equal residuals was not rejected. We can take this to mean that there
is no more measurement error for one group than the other. Several authors (e.g., Steenkamp
& Baumgartner, 1998; Wang & Russell, 2005) have used this interpretation. It should be noted
that attention has been drawn to the importance of establishing equal factor variances before the
interpretation of this test as one of equivalent uniquenesses, and that corrections must be made if
FIGURE 19.3 The metric invariance model: identical to Figure 19.2 except that the factor loadings are constrained to be equal across the two groups. (Diagram not reproduced.)
they are not (see Vandenberg & Lance, 2000, p. 34). Additionally, it is important to understand
that this test is not only indicative of differences in random error across groups, but also irrelevant
systematic error, which would be a source of contamination in scores. Thus, in the present case
(i.e., Table 19.1), we would conclude that similar levels of measurement precision (or inversely,
error) are obtained when fitting the five-factor JDI model to the American as compared with the
Chinese sample.
In evaluating employee selection measures, we would like to know that both groups’ scores are
equally precise. The amount of residual variance is indicative of (although not analogous to) the
lower bound of its validity (Crocker & Algina, 1986). Thus, tests (e.g., those in a selection battery)
whose scores are contaminated by group membership could have problems with validity in the form
of differential prediction or relationships with external variables.
the selection context, organized into two broad categories: studies focused on (a) predictors or
(b) criteria. However, it should be noted that this is not meant to be a comprehensive review
of such studies.
Much interest in MI/E studies in the selection literature is focused on the equivalence of rating
sources. Facteau and Craig (2001) showed an interesting approach to this question by estimating a
baseline model across self, peer, supervisor, and subordinate performance ratings and compared it
to a metric invariant model. Evidence for metric invariance was found across all sources of ratings,
suggesting that these sources respond to items in the same fashion. This finding has repercussions
for how we view multisource rating data.
Diefendorff, Silverman, and Greguras (2005) filled this gap by examining self, peer, and super-
visor ratings using the sequence of hypothesis testing suggested by Vandenberg and Lance. Their
data showed a high degree of MI/E using various measures and these three rating sources. In the
same issue of the Journal of Business and Psychology, Greguras (2005) showed how MI/E analyses of
performance ratings can be used to address theoretical questions about the rating process. His analysis showed that
the rating process of managers does not appear to be disturbed by job experience. Thus, although
MI/E analyses are more common on the predictor side, they can also be useful in efforts designed
to evaluate the criterion side.
Conclusions
In this chapter, we have discussed various measurement criteria and ways to evaluate measures
all with the goal of determining whether an instrument has minimized various sources of error,
minimized contamination, and maximized sufficiency. Instead of rehashing the main points made
throughout this chapter, we thought it best to end with three summary points.
1. Statistics should inform scale development but not dominate the process. These days,
many scale evaluation efforts could be conducted by a computer program that was pro-
grammed to select items merely on the basis of maximizing some statistical criterion, usu-
ally internal consistency or goodness-of-fit. Such analyses are easy to do given the ease of
programs such as SPSS that report the alpha-if-item-deleted statistic. People often have to
do strange things to get goodness-of-fit statistics to reach acceptable levels, such as letting
error variances covary. Statistics should certainly inform the process, because it is a good
thing, in general, to have models that fit and measures that reflect high internal consistency.
But relying on these statistics alone might result in narrow measures that fail to correlate
with any external variable.
2. Error is much more than internal consistency. Researchers are obsessed with internal con-
sistency measures for evaluating reliability. This fascination is undoubtedly due not to the
appropriateness of internal consistency but to the ease with which it can be computed and
the frequency with which it is accepted as the sole indicator of scale quality. Relying solely
on internal consistency may provide researchers with a false sense of comfort, given that
these indices ignore other sources of error that might be important. Other assessments of
reliability require multiple data collections or multiple raters, whereas internal consistency
merely requires multiple items.
3. Scale evaluation is hard work. Most researchers involved in scale development and evalu-
ation efforts are interested primarily in determining whether measure X is a good measure
of construct X for the purposes of conducting their substantive research. Scale evaluation
work is ancillary to their main interests and may warrant only a few sentences in the meth-
ods or results section. Scale development and evaluation should be considered a continuous
process that unfolds over time, rather than an objective that is achieved in a single study or
publication. Of course, the need to continuously improve a scale should be balanced by the
need for comparability across studies and time. Ideally, a scale should be scrutinized using
various methods and should be examined by various researchers. Too often, a scale gets
published and then other researchers use it assuming that, because it has been published, the
psychometric work has been done. We wish things were that easy!
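The internal consistency statistics criticized in point 2, including the alpha-if-item-deleted value mentioned in point 1, are simple to compute directly. A minimal sketch with a hypothetical item-score matrix:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item dropped in turn (as SPSS reports)."""
    items = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(items, j, axis=1))
            for j in range(items.shape[1])]

scores = np.array([[4, 5, 4, 1],    # hypothetical 4-item responses;
                   [2, 2, 3, 5],    # item 4 runs against the other three
                   [5, 4, 5, 2],
                   [1, 2, 1, 4],
                   [3, 3, 3, 3]])
print(round(cronbach_alpha(scores), 2))
print([round(a, 2) for a in alpha_if_item_deleted(scores)])
```

An item whose deletion raises alpha is flagged as inconsistent with the rest of the scale, but, per point 1, that statistic alone should not drive item selection.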
References
Age Discrimination in Employment Act of 1967, 29 U.S.C. § 621 (1967).
American Psychological Association, National Council for Measurement in Education, and American
Educational Research Association. (1999). Standards for educational and psychological testing.
Washington, DC: American Psychological Association.
Americans with Disabilities Act of 1990, 42 U.S.C. § 12101 (1990).
Bagozzi, R. P., & Edwards, J. R. (1998). A general approach for representing constructs in organizational
research. Organizational Research Methods, 1, 45–87.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: John Wiley.
Byrne, B. (1989). A primer of LISREL: Basic applications and programming for confirmatory factor analytic
models. New York, NY: Springer Verlag.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invari-
ance. Structural Equation Modeling, 9, 233–255.
Civil Rights Act of 1964, 42 U.S.C. § 253 (1964).
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth-
Thompson Learning.
Diefendorff, J. M., Silverman, S. B., & Greguras, G. J. (2005). Measurement equivalence and multisource rat-
ings for non-managerial positions: Recommendations for research and practice. Journal of Business and
Psychology, 19, 399–425.
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with
external variables are central issues. Psychological Bulletin, 95, 134–135.
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied
Psychology, 72, 19–29.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory
factor analysis in psychological research. Psychological Methods, 4, 272–299.
Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources compa-
rable? Journal of Applied Psychology, 86, 215–227.
Gerbing, D. W., & Hamilton, J. G. (1996). Viability of exploratory factor analysis as a precursor to confirmatory
factor analysis. Structural Equation Modeling, 3, 62–72.
Gilliland, S. W., & Steiner, D. D. (1999). Applicant reactions. In R.W. Eder & M. M. Harris (Eds.), The employ-
ment interview handbook (pp. 69–82). London, England: Sage.
Greguras, G. J. (2005). Managerial experience and the measurement equivalence of performance ratings.
Journal of Business and Psychology, 19, 383–397.
Haig, B. D. (2005). Exploratory factor analysis, theory generation, and the scientific method. Multivariate
Behavioral Research, 40, 303–329.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging
research. Experimental Aging Research, 18, 117–144.
Hurley, A. E., Scandura, T. A., Schriesheim, C. A., Brannick, M. T., Seers, A., Vandenberg, R. J., et al.
(1997). Exploratory and confirmatory factor analysis: Guidelines, issues, and alternatives. Journal of
Organizational Behavior, 18, 667–683.
Jingo, M., & Kurosawa, A. (1950). Rashomon [Motion picture]. Japan: Daiei.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User's reference guide. Chicago, IL: Scientific Software
International.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York, NY: Guilford.
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist,
41, 1183–1192.
Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are Internet and paper-and-pencil personality
tests truly comparable? An experimental design measurement invariance study. Organizational Research
Methods, 10, 322–345.
Meredith, W. (1993). Measurement invariance: Factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Mulaik, S. A. (1972). The foundations of factor analysis. New York, NY: McGraw-Hill.
Myors, B., Lievens, F., Schollaert, E., van Hoye, G., Cronshaw, S.F., Mladinic, A., et al. (2008). International
perspectives on the legal environment for selection. Industrial and Organizational Psychology:
Perspectives on Science and Practice, 1, 206–246.
Offerman, L. R., & Gowing, M. K. (1993). Personnel selection in the future: The impact of changing
demographics and the nature of work. In N. Schmitt & W.C. Borman (Eds.), Personnel selection in
organizations (pp. 385–417). San Francisco, CA: Jossey-Bass.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace:
Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Raju, N. S., & Ellis, B. B. (2002). Differential item and test functioning. In F. Drasgow & N. Schmitt (Eds.),
Measuring and analyzing behavior in organizations: Advances in measurement and data analysis
(pp. 156–188). San Francisco, CA: Jossey-Bass.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based
on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two
approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Rensvold, R. B., & Cheung, G. W. (1998). Testing measurement models for factorial invariance: A systematic
approach. Educational and Psychological Measurement, 58, 1017–1034.
Robie, C., Schmit, M. J., Ryan, A. M., & Zickar, M. J. (2000). Effects of item context specificity on the mea-
surement equivalence of a personality inventory. Organizational Research Methods, 3, 348–365.
Roznowski, M., & Hanisch, K.A. (1990). Building systematic heterogeneity into work attitudes and behavior
measures. Journal of Vocational Behavior, 36, 361–375.
Schmitt, N., & Mills, A. E. (2001). Traditional tests and job simulations: Minority and majority performance
and test validities. Journal of Applied Psychology, 86, 451–458.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks, CA: Sage.
Society for Industrial-Organizational Psychology. (2003). Principles for the validation and use of personnel
selection procedures (4th ed.). Bowling Green, OH: Author.
Steenkamp, J. E., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer
research. Journal of Consumer Research, 25, 78–90.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. American
Psychologist, 50, 912–927.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature:
Suggestions, practices, and recommendations for organizational research. Organizational Research
Methods, 3, 4–70.
Wang, M., & Russell, S. S. (2005). Measurement equivalence of the job descriptive index across Chinese and
American workers: Results from confirmatory factor analysis and item response theory. Educational and
Psychological Measurement, 65, 709–732.
Wechsler, D. (1939). The measurement of adult intelligence. Baltimore, MD: Williams & Wilkins.
Williamson, L. G., Campion, J. E., Malos, S. B., Roehling, M. V., & Campion, M. A. (1997). Employment
interview on trial: Linking interview structure with litigation outcomes. Journal of Applied Psychology,
82, 900–912.
Woehr, D. J., Sheehan, M. K., & Bennett, W. (2005). Assessing measurement equivalence across rating sources:
A multitrait-multirater approach. Journal of Applied Psychology, 90, 592–600.
Zenderland, L. (1998). Measuring minds: Henry Herbert Goddard and the origins of American intelligence
testing. Cambridge, England: Cambridge University Press.
20 Assessment Feedback
Manuel London and Lynn A. McFarland
This chapter explores the role of feedback in the assessment process. The content of assessment
feedback can range anywhere from a pass/fail statement to a detailed, competency-based report
delivered in-person and followed up with suggestions for development. Unlike feedback in other con-
texts, such as performance appraisal, selection feedback is not primarily about how applicants can
improve their performance in the future. Indeed, many tools are used in selection to measure areas
where we do not believe people can easily improve, such as personality characteristics and general
cognitive ability. The purpose of feedback may be only to explain the selection decision. However,
feedback may influence applicants’ perceptions of fairness, their self-image, and their reactions to
the organization. This in turn may affect applicants’ behavior, such as whether or not to accept a job
or recommend the organization to another prospective employee. Also, although not necessarily the
intended purpose, selection feedback may be useful to guide applicants’ development—whether to
help them next time they apply for a position, repeat a similar test for the same job, or accept a job
offer and need or want further training to enhance their job performance. Further, organizations and
human resource (HR) professionals who are responsible for selection may view giving feedback as a
professional and ethical obligation. Assessment feedback also can affect the organization’s reputa-
tion, for instance, as a respectful, development-oriented employer.
Here we consider the benefits and drawbacks of providing feedback from the standpoint of the
organization and the candidate. We review the literature on test-givers’ obligations to provide feed-
back, candidates’ reactions to feedback, and the potential costs and benefits of feedback to the
recipients and the organization. Finally, we consider implications for practice and areas for research
to better understand the role of feedback from individual and organizational perspectives.
AC FEEDBACK
A large, national management consulting firm is hiring 15 recent MBA graduates for entry-level
support positions. The jobs require analytic skills, client relations, high work standards, ability to
work well in teams, and motivation. Organizational psychologists on the HR staff develop a selec-
tion method that includes a battery of psychological tests, a measure of cognitive ability, a biodata
form, two business exercises to produce work samples (simulations that ask candidates to set pri-
orities, write e-mails, and respond to phone calls), and a test that asks how the candidate would
417
418 Handbook of Employee Selection
handle certain situations. The assessment is conducted online. Participants can log in from remote
locations. The firm’s recruiters conduct on-campus interviews and identify top candidates to invite
to participate in the assessment. The assessment process provides a wealth of information that helps
the firm decide who to hire and, for those hired, what development and job experiences they need
to improve their chances for success. The information would also be valuable to candidates who are
not selected to help them guide their own early career decisions.
The HR staff faced a dilemma: the online assessment made it possible to deliver feedback immediately
after testing, quickly and at little additional cost. In the past, such an assessment would have been
conducted in person with small groups of participants, and specific, individualized feedback would have
been time-consuming and costly to deliver because applicants would need to return to the testing site.
Computerization of assessments has removed these practical barriers, so the decision not to deliver
feedback is no longer as justifiable on economic or practical grounds. So, the HR staff wondered, should
they give all candidates feedback immediately after the assessment? Should they wait until a selec-
tion decision is made, inform the candidates, and then invite them to return for feedback? Does this
risk divulging proprietary information about the testing process and its validity? Does this put the
firm at risk if candidates question the fairness or accuracy of the selection process? Should only those
who are hired be offered feedback? Should the firm require that those who are hired receive feed-
back and use the information to establish a development plan? Who should present the feedback to
the candidates? Should a psychologist meet with each candidate to explain the results? This would
be expensive. Should a report be prepared for each candidate and sent to a new hire's immediate
supervisor to review with the candidate?
TEST FEEDBACK
A restaurant is hiring food service personnel. It uses a biodata form, an integrity test to assess
self-reports of honesty, and an interview. Hundreds of people take the test each year, administered
by the managers of the company’s restaurants across the country. What type of feedback should
the restaurant provide to applicants who are selected and those who are not, other than to inform
them of the decision?
The questions about feedback in these examples deal with how feedback fits within the selection
process. Should applicants be told more than whether or not they were selected? If they were not
chosen, should they receive more specific information that might help them in the future? More gen-
erally, how does feedback fit within the assessment process? We can begin to answer these questions
by turning to professional testing standards for guidance. The APA's ethics code, for example, states:
Regardless of whether the scoring and interpretation are done by psychologists, by employees or
assistants, or by automated or other outside services, psychologists take reasonable steps to ensure that
explanations of results are given to the individual or designated representative unless the nature of the
relationship precludes provision of an explanation of results (such as in some organizational consulting,
pre-employment or security screenings, and forensic evaluations) and this has been clearly explained to
the person being assessed in advance. (Principle 9.10, p. 14)
The APA’s The Rights and Responsibilities of Test Takers: Guidelines for Testing Professionals
specifies that applicants receive a written or oral explanation of their test results within a reason-
able amount of time after testing and in commonly understood terms (APA, 1998). The document
emphasizes that the “rights and responsibilities” are neither legally based nor inalienable rights, but
they represent good professional practice.
Pope (1992) summarized psychologists’ responsibilities in providing psychological test feedback
to the organization and to job candidates. He viewed feedback as a process that includes clarifying
tasks and roles of the test-giver and taker, ensuring that the test-taker has given informed consent
(or refusal) before taking the test, framing the feedback, acknowledging its fallibility, guarding
against misuse and misinterpretation of the information, guarding the test results, and assessing and
understanding important reactions. He emphasized that applicants and the organization have a right
to understand the purpose and use for the assessment, the procedures involved, and the feedback
they can expect and from whom.
The rights of test-takers were controversial when they were first created because they seemed to ignore
some of the realities of large-scale employment testing. For instance, the right expressed in the
APA guidelines (which are not APA policy) to "have their test administered and their results interpreted
by appropriately trained individuals who follow professional codes of ethics" (pt. 6.0) may not
be possible when test administration personnel have no code of ethics or when feedback is deliv-
ered electronically to many people. These guidelines need to be contrasted with practice. In our
experience, most organizations do not follow the guidelines when it comes to selection tests. When
it comes to preemployment tests of cognitive ability and personality or interviews, applicants are
likely to be told little more than whether they were offered the job.
Feedback tends to be integral to the operation of assessment centers (ACs) used for selection. The Guidelines and
Ethical Considerations for Assessment Center Operations were developed and endorsed by practi-
tioners to delineate the key components and activities of an AC, including feedback (International
Task Force on Assessment Center Guidelines, 2000). ACs combine a host of qualitative and quan-
titative data about candidates and are used for selection, promotion, and development for general
management positions and for specific functions and various industries, such as manufacturing,
banking, and sales (Spychalski, Quiñones, Gaugler, & Pohley, 1997). Often, line managers are
trained as assessors and may also be charged with giving feedback to candidates. The Guidelines
defines AC feedback as “information comparing actual performance to a standard or desired level
of performance” (International Task Force on Assessment Center Guidelines, 2000, p. 10), indicat-
ing that applicants must receive information to help them understand their results. The organization
using the AC should establish and publish a policy statement about the use of the data (e.g., who
will receive reports, restrictions on access to information, planned uses for research and program
evaluation, and feedback procedures). Assessor training should include “thorough knowledge and
understanding of feedback procedures” as well as a “demonstrated ability to give accurate oral and
written feedback, when the assessor's role is to give feedback" (p. 10). Further, the Guidelines offer
suggestions for appropriate use of the data.
The various guidelines reviewed above recognize selection feedback as valuable to the applicant,
and indeed couch it in ethical terms—that applicants deserve to know the meaning of the informa-
tion collected about them and how the information was used to make decisions about them. There is
a growing body of research on applicants’ reactions to feedback that suggests the potential value of
feedback to the applicant and the organization. We examine this literature next.
TABLE 20.1
Applicants’ Reactions to Feedback and Implications for Practice
Examples of applicants’ perceptions and reactions
• Applicants’ reactions to the test and the testing process may change after receiving feedback about whether they
passed or failed the test (Marcus, 2003); feedback reactions are influenced by the context of the process (e.g.,
information about competition for the position; Ryan & Ployhart, 2000).
• Applicants who pass a test evaluate the organization and testing process more favorably than those who fail
(Bauer, Maertz, Dolen, & Campion, 1998; Kluger & Rothstein, 1993). Receiving negative feedback leads applicants
to perceive that the testing process was unfair (Lounsbury, Bobrow, & Jensen, 1989; Smither, Reilly, Millsap,
Pearlman, & Stoffey, 1993; Schinkel, van Dierendonck, & Anderson, 2004).
• Failing leads to lower self-image and self-perceptions of incompetence (Anderson & Goltsi, 2006; Fletcher, 1991;
Maertz et al., 2005; McFarland & Ross, 1982; Ployhart, Ryan, & Bennett, 1999; Ployhart & Harold, 2004).
• Feedback helps applicants have an accurate perception of their performance and benefit from the selection process
regardless of whether or not they are selected (Nicol & Macfarlane-Dick, 2006).
• E-mail produces fewer social context cues than voicemail and filters out the affect of the sender and
receiver, which can increase the negativity of feedback (Watts, 2007; Mishra, 2006).
Implications for selection system design
• Provide information about the fairness of the testing and decision-making process.
• Focus on implications of the results for behaviors, not personal characteristics that threaten self-image.
• Tie feedback to job requirements and developmental opportunities.
• Precede feedback with information about the selection test to explain its job relevance and fairness.
• Accompany feedback with behavior-based standards in relation to job knowledge, abilities, and experiences
required for success on the job.
• Explain why the test was fair to avoid applicants creating their own judgments about fairness.
• Convey personal and procedural information sensitively to generate more positive perceptions of the test and
organization, whether the applicant was hired or not.
• Explain reasons for test methods and results to promote accurate attributions and judgments of test fairness and
validity.
• Protect applicants’ self-image (e.g., remind them of the difficulty of the test, the tough competition, and the value of
knowing now that the job is not right for them).
• Recognize that machine-delivered feedback may require more explanation than face-to-face feedback to
convey its meaning and intent.
relatedness of the selection methods), and organizational context (e.g., selection ratio/competition).
These affect applicants’ perceptions about the selection process and outcome, depending on
applicants’ expectations, the perceived desirability of the job and organization, available alternative
jobs, and social support. Applicants’ reactions to feedback need to be understood in relation to
their perceptions and feelings about the context and process before, during, and after testing and
feedback of results. The effects of assessment feedback may depend on conditions, such as applicants
perceiving that there was strong competition or applicants already having excellent jobs, thereby
providing an attribution beyond their own ability (Ryan & Ployhart, 2000). Recognizing that the
selection process involves evaluation, Ryan and Ployhart (2000) commented that reactions to tests
are not merely reactions to the process but reactions to being evaluated.
Research supports this line of thinking. Job candidates’ reactions to a test, what Schuler (1993)
called “social validity,” depend, in part, on the feedback they receive about their test performance.
Thus, how test results are presented will influence how candidates interpret the meaning of the
results and what they decide to do as a result, if anything (Marcus, 2003; Pope, 1992). Candidates’
reactions to a test and the testing process may change after receiving feedback about whether they
passed or failed the test. Bauer, Maertz, Dolen, and Campion (1998) assessed applicants’ reac-
tions before testing, after testing, and again after feedback about whether they passed or failed.
They found that applicants who passed the test evaluated the organization and testing process more
favorably than those who failed. Wiechmann and Ryan (2003) found that feedback about the selec-
tion decision had a significant effect on reactions to the test, consistent with a self-serving bias.
Candidates who were selected preferred to think that the test was fair and valid because this helped
them maintain their self-image. Those who were not selected expressed a negative perception about
the test because this diverted blame for the outcome from their own ability to the situation. In a
laboratory study, applicants who failed to meet hiring standards had a more negative view of the
organization than those who had passed (Kluger & Rothstein, 1993). In a study of actual applicants,
those who performed more poorly rated the entire selection process more negatively than those who
performed well (Macan, Avedon, Paese, & Smith, 1994). Thus, those who receive negative feedback
on test performance may automatically blame the test to protect their self-perceptions. However, if
a test is perceived to be procedurally fair before receiving feedback, but applicants receive feedback
that they did not perform well on the test, this may cause applicants to feel incompetent and thereby
lower their expectations of being able to perform on such tests in the future (Ployhart, Ryan, &
Bennett, 1999).
Ryan, Brutus, Greguras, and Hakel (2000) examined managers’ reactions to feedback from a
battery of tests and surveys collected for development purposes. Although this was not for selec-
tion, the study tells us how people react to assessment feedback. In this study, feedback content and
feedback sessions were standardized and assessors received extensive training for giving feedback.
In addition, the feedback was both positive and negative. Receptivity was measured by self-report,
observers’ ratings, and behavioral manifestations of receptivity (e.g., whether there were displays
of anger or denial). Individuals who were high in self-awareness and who were acquainted with the
feedback provider were more receptive to developmental feedback; older individuals and those who
were demographically different from the feedback provider were less receptive to it.
outside of their control or because they resolved to do better in the future as a way of compensating
for their lower performance.
In a laboratory study, Gilliland (2004) found that student applicants who were selected on the
basis of their test results reported greater fairness in the selection process than the rejected students,
especially when they had high expectations of being hired. Applicants who were rejected were more
likely to recommend application to others when they were offered an explanation for the use of
the selection procedure. Feedback influenced the factors that were important in perceiving that the
process was fair (Van Vianen et al., 2004). Explaining the selection decision to applicants increased
their perceptions of a fair selection process (Gilliland et al., 2001; Ployhart, Ryan, & Bennett, 1999).
Other research found that applicants are likely to perceive the selection process and its results as
fairer when they are given information about the process before being evaluated (Truxillo, Bauer,
Campion, & Paronto, 2002).
Because feedback can contribute to unfavorable perceptions of the selection process and out-
comes, understanding the effects of such perceptions is important. Truxillo, Steiner, and Gilliland
(2004) reviewed the literature on the effects of selection fairness on organizational outcomes. They
found that feelings of unfair treatment affect such “soft” outcomes as satisfaction with the selection
process and “hard” outcomes such as applicant withdrawal. Fairness perceptions matter particularly
in contexts of high uncertainty. Uncertainty is threatening, and applicants may construct fairness
perceptions to make things seem more certain, comprehensible, or tolerable. Hiring is by its nature an
uncertain situation because applicants do not know their chances of succeeding, and they have little
information about how well the job would work out if they were selected (Gilliland & Chan, 2001).
Organizations will want to ensure that applicants are clear about the purpose of the test and
that all applicants are treated in a friendly and respectful manner, including during the feedback process.
Procedural justice perceptions predicted organizational attractiveness and intentions toward the
organization prior to receiving pass-fail feedback; however, this effect was diminished after con-
trolling for the pass-fail result (Maertz, Bauer, Mosley, Posthuma, & Campion, 2004). Procedural
justice perceptions seemed to be influenced by how well applicants thought they performed on the
test, and this in turn influenced their orientation toward the organization. However, when control-
ling for initial outcome levels, information about the test predicted organizational attractiveness,
and treatment at the testing site predicted intention toward the organization. Applicants who felt
they were informed about the test saw the organization as more attractive, at least temporarily.
Similarly, feeling well treated at the test site increased applicants’ intentions to recommend the
organization to other applicants.
In summary, passing or failing a test contributes strongly to applicants’ subsequent reactions (Bauer
et al., 1998; Thorsteinson & Ryan, 1997): the more positive the results, the more favorable the reaction.
Conveying personal and procedural information sensitively contributes to more positive perceptions
of the test and organization, whether the applicant was hired or not. Understanding the procedures
may limit the negative effects of failing a test on self-perceptions.
attributing poor test results to factors outside of their control, such as an unfair testing procedure
(Ployhart & Harold, 2004). In contrast, as we noted earlier, applicants may lower their self-image
when they receive a low score on a test that they perceive to be fair (Ployhart, Ryan, & Bennett,
1999). Nicol and Macfarlane-Dick (2006) suggested that candidates evaluate their own perfor-
mance during the testing process and create their own feedback in line with their self-image when
feedback is not given. For instance, if they do not receive specific information about the results
of a test and they were not chosen, they will rationalize that their rejection was not because of
their test performance but because of other factors beyond their control. This allows them to main-
tain their self-image, but it may also create erroneous impressions that could be damaging to the
organization—for instance, that the organization has unfair hiring or promotion practices. When
the candidate is offered the job, the candidate is likely to attribute the cause to his or her good test
performance, but not necessarily without feedback. The candidate with low self-esteem may errone-
ously conclude that the positive outcome was due to luck or other factors beyond his or her ability.
In addition, the successful candidate could benefit from test results that suggest ways to improve
his or her skills and knowledge to be even more valuable to the organization and increase his or
her chances of job and career success. Hence, feedback should be provided to influence applicants’
perceptions about the testing process and organization.
Social identity theory suggests how applicants’ social identities interact with their perceptions
of selection experiences to predict their withdrawal from the process. Herriot (2004) argued that
applicants are likely to withdraw from the selection process when there is incongruence between
their current perceptions of the organization’s identity and their own self-identities that are salient
during specific elements of the selection process. Social identities are individuals’ beliefs about
their membership in social categories (e.g., their gender, ethnicity, occupation, family, and reli-
gion), in contrast to their personal identities, which are beliefs about their own characteristics (e.g.,
their strengths and weaknesses). Social identities are associated with a range of beliefs, values, and
norms of behavior, and may incorporate prototypes, or beliefs about the typical or ideal member of
a category. Organizational identities are subsets of social identities. Applicants develop perceptions
about their organizational identity as they participate in the selection process and receive feedback.
The effects of degree of congruence on leaving the process may be moderated by the applicants’
perceptions of the probability of obtaining another job. People who believe there are plenty of other
opportunities will have a lower threshold for incongruence. Those who believe that job opportuni-
ties are scarce are more likely to tolerate a higher level of incongruence.
Schinkel, van Dierendonck, and Anderson (2004) studied the role of feedback in minimizing
the psychological effect of a negative selection decision on job applicants. Student subjects com-
pleted two tests and then received a rejection message. Half received just the rejection message and
half received the rejection message and bogus performance feedback (the percentile, how they did
relative to others). Core self-evaluations and affective well-being of the rejected students receiving
performance feedback significantly decreased from before to after the testing and feedback com-
pared with that of students in the rejection message-alone condition. Core self-evaluations actually
increased for those who were rejected but were not given performance feedback, particularly if they
saw the procedure as unfair. Procedural fairness (candidates’ perceptions that they had a chance to
demonstrate their performance and that the test was related to the job function) interacted with feed-
back to affect core self-evaluation; distributive fairness (the perception that the selection decision
was correct) interacted with feedback to affect well-being. The authors suggested an attribution-theoretic
explanation for these results. The students may have shown a self-serving bias after receiving
negative outcomes, attributing them to external causes (e.g., unfair procedures that were not under
their control) and thereby maintaining their self-perceptions. Consistent with DeNisi and Kluger's
(2000) concept that feedback entails a comparison, the comparative percentile feedback made reducing
the negative state following rejection even more important.
Maertz, Bauer, Mosley, Posthuma, and Campion (2005) measured applicants’ self-efficacy for
cognitive ability testing before and immediately after the test and again after pass/fail feedback.
The applicants were applying for a position at a utility company. Self-efficacy for the test prior to
actually taking it was higher for men, applicants who had prior successful experiences with cogni-
tive ability tests, those who perceived the test to be valid and fair, and those with higher general
self-efficacy. Self-efficacy for the test increased for those who passed it and decreased for those
who failed. Failing had a greater negative impact on subsequent self-efficacy for the test for women
and Whites and a smaller negative effect for those who had previously been hired based on ability
tests. An implication is that those who fail the test and have significant decreases in self-efficacy
for cognitive tests might tell others that the test was particularly difficult and discourage possible
applicants.
Maertz et al. (2005) suggested that organizations consider attribution-related or other interven-
tions to bolster self-efficacy or to increase perceptions of fairness and test validity. Also, test admin-
istrators should follow procedural justice rules and emphasize in pretest preparation sessions the
proven validity of the test for predicting job performance. These interventions may also enhance
applicants’ attraction to the organization. Derous, Born, and de Witte (2004) found that applicants
valued and expected feedback. They argued that although applicants should not be deprived of their
right to performance scores, perhaps performance measures after selection should not be provided
or should be provided in ways that protect applicants' self-image; for instance, by reminding them of
the low selection ratio or that the position is not right for everyone and that knowing now is better
than being unhappy later.
Anderson and Goltsi (2006) formulated the construct of negative psychological effects (NPEs) of
selection and assessment methods upon applicants. They defined NPE as follows:
Declines in applicant psychological well-being, general mental health, or core self-esteem that are
inveterate, measurable, and statistically demonstrable, and that occur as a result of exposure to rejection
decisions, violations of applicant rights, or unprofessional feedback given to applicants by recruiters, at
any stage during organizational selection or promotion assessment procedures. (p. 237)
They also defined positive psychological effects (PPEs) as increases in applicant psychological
well-being, general mental health, or core self-esteem that result from acceptance decisions,
perceived respect for applicant rights, or complimentary feedback. Previous research had found
that applicants participating in an AC experienced negative psychological effects (Fletcher, 1991).
Anderson and Goltsi (2006) suggested that NPEs may be present for several weeks and months
after receiving a negative selection decision. They noted that much of the research on applicant
reactions to selection methods has focused on applicants’ immediate reactions and preference
perceptions to different predictors. This study investigated the long-term effects and outcomes of
applicants’ exposure to specific selection methods on candidate decision-making, attitudes toward
the organization, and psychological health and well-being. For instance, measures included self-
esteem, mental health, positive and negative affect, and career exploration behavior. One hundred
seven applicants participating in an AC completed measures just before participating in the center,
immediately afterward but before they were told the outcome, and 6 months after the assessment.
All applicants received detailed feedback regardless of whether they were accepted or not. Rejected
applicants did not differ significantly from accepted applicants on the indices of NPEs. Accepted
applicants rated feedback dimensions more favorably than rejected applicants. Rejected applicants
seemed to attribute the negative decision to a lack of accuracy in the assessment process. The
authors thought that one reason why NPEs did not emerge for unsuccessful candidates may have
been that the selection ratio in this organization was so competitive that it moderated applicants'
negative feelings about rejection. An implication is that providing detailed feedback to
unsuccessful candidates may be dysfunctional. This is a particular concern for internal applicants, who
remain in the organization after rejection, where NPEs could affect their job performance. This could also apply to successful
applicants who received inappropriately negative feedback or felt that they were not treated fairly
in some way.
ORGANIZATION’S PERSPECTIVE
Potential benefits of feedback for the organization include the following:
• Feedback informs the candidates’ decision-making. Candidates who are selected will have
a better understanding of why and the value the organization believes they will bring to
their positions.
• Knowing that candidates will receive feedback requires the organization to maintain focus
on the skills and knowledge needed by the successful candidate(s). These characteristics
may reflect skills needed to do the job today and/or needed for future career advancement
in the organization. Maintaining this focus is especially important when decisions are
based on qualitative information, such as interviews and supervisory opinions.
• Feedback is a way to maximize value from assessment dollars. Considerable time and
money may be spent evaluating candidates. This information can be useful not only for
making the decision but also for guiding candidates' future development.
• Feedback may guard against illegal discrimination and help defend against such claims, in
that feedback recipients will know and understand the reason for the selection process and
decision, see its job relevance, and recognize that it was not arbitrary or based on some-
thing other than bona fide job requirements.
Regarding potential costs of feedback, the organization needs to consider the following:
• Of course, the organization incurs the cost of delivering feedback. This may include the
cost of personnel, such as a coach, who meets with the candidates after the selection pro-
cess is concluded to explain the decision and provide the feedback.
• Feedback may create difficulties in maintaining the reliability and validity of the assess-
ment methods. For instance, if candidates know the test content, they may communicate
this to others, giving future candidates an unfair advantage, or causing them to respond in
certain ways to create impressions they feel the organization wants, and thereby limiting
the accuracy of the information.
• Test security is costly, and feedback that goes beyond test outcomes may make security
difficult. Having alternate forms of the selection process may eliminate this worry but adds
a further cost to create and validate these forms.
Assessment Feedback 427
• The organization may be obliged to go beyond feedback to include advice for development.
Such advice needs to be given in a way that does not lead the candidates to have expecta-
tions about future opportunities; for instance, implying that if they follow the advice, they
will be promoted.
• Guarding candidates’ confidentiality is also a cost. Candidates should be told about who
has access to the assessment results, how the information will be used by the organization,
how long it will be retained, and how identifying information will be secured.
• Giving feedback imposes obligations to follow up, especially with internal employees. The
employees may want career advice in the future. Such feedback and career counseling
can be linked to a career development function in the organization. Such a function may
be valuable but imposes costs and commitment to maintain such a process, including the
professional personnel and systems to track individuals. Organizations may want to do this
in any case, and as suggested above, will thereby be taking maximum advantage of the
assessment data for talent management.
• The issue of longevity of assessment results needs to be examined by the organization.
Feedback indicates the value of the information for development, implying that the orga-
nization recognizes that people grow and develop over time. However, reassessment for
future career opportunities is costly. The organization will want to study changes in assess-
ment performance over time when varying degrees of development have occurred; for
instance, differences in performance between candidates who have received feedback and
those who have not and differences in performance between candidates who subsequently
participated in various developmental experiences compared to those who did not. Such
research is an additional but worthwhile cost in increasing the value of the assessment.
CANDIDATE’S PERSPECTIVE
Potential benefits of feedback for the candidate include the following:
Potential costs of feedback for the individual candidate include the following:
• Perhaps the major cost of feedback from the individual’s perspective, particularly for those
who are not offered positions, is losing face and declining self-confidence. Generally, less
specific information will be less threatening but of less value.
• Another potential cost is processing the information. This takes time, energy, and moti-
vation—sometimes more than the individual cares to give. Perhaps the applicant was not
highly motivated to apply for the position to begin with. Or perhaps the individual knew
from the start that this was a long shot. This person does not need to hear more than he or
she was not offered the job. However, perhaps more information may redirect the candi-
date’s time and attention toward more fruitful opportunities in the future.
• The results may lead the candidate to make a disappointing career decision. Candidates
who get positive results may be flattered by the job offer and not pay attention to details
of the feedback—or hear just about their strengths and ignore or give less attention to any
weaknesses. Also, they may not pay sufficient attention to the nature of the job, instead
focusing on their own competence. As a result, they may accept a position that may have
requirements (e.g., travel) that they really did not want. Conversely, candidates who are
rejected may focus on the rejection decision and not be able to process feedback about their
personal characteristics mindfully.
• Test batteries: Test batteries produce complex results and may require professional input
to integrate and summarize, although automatic scoring and prewritten interpretations can
be developed for electronic feedback.
• Biodata: Past behavior is an excellent predictor of future behavior. Biodata results can be
justified in terms of experience needed. This method may be valuable in guiding failed
candidates to better preparatory experiences.
• AC measures of management and leadership skills: ACs produce complex results and
allow in-depth feedback, focusing candidates’ attention on weaknesses that can be cor-
rected and strengths that suggest alternative career directions. These can be delivered to
the candidate online immediately or soon after an online AC.
• Multisource (360 degree) feedback survey results: This method of collecting performance
ratings from subordinates, supervisors, peers, and/or customers as well as self-ratings
is often used alone for development. When used for decision-making, the results can be
related to other performance data and/or to changes in performance ratings from different
sources over time. An executive coach may assist in providing the feedback and encourag-
ing the recipient to interpret it accurately.
• Supervisor nominations and evaluations: When used as sole input for a placement or pro-
motion decision, candidates may perceive nominations and performance evaluations as
unfair. This may increase their political behavior and impression management in hopes of
winning favor for future positive decisions.
• Performance in a management/leadership development program: Participation in a man-
agement development program may indicate performance capabilities in areas in which
the candidates may not have had a chance to demonstrate their abilities on the job. The
participants would likely perceive the use of such results for making decisions about them
as unfair unless this purpose was evident before they participated in the program. Also,
the experiential situations in such a program may be viewed as artificial and not a good
representation of actual job performance.
feedback. This was not the case for voicemail senders. However, voicemail senders, but not e-mail
senders, were less comfortable with the negative feedback than receivers. Media conditions did not
influence feedback effectiveness (e.g., the perception that the feedback motivated the receiver to
work harder next time).
Mishra (2006) addressed whether people respond similarly to computer feedback about their
abilities as they do to human feedback. Students participated in a laboratory study in which they
were assigned randomly to one of four experimental conditions: scored test versus nonscored test,
crossed with praise for success on an easy task and no blame for failure on a difficult task versus
no praise for success on an easy task and blame for failure on a difficult task. Mishra reported
that participants took feedback at “face value” and seemed unwilling to commit to the same
level of “deep psychological processing” about the intention of the feedback, as they seemed to
do with face-to-face feedback. This was contrary to the position that people learn to respond to
computers as social actors with whom they interact in real time using natural language, fulfilling
traditional social roles. When feedback comes from humans, people are more interpretive and try
to comprehend the context of the feedback. They do not do this when working with computers.
Psychological aspects of person-machine interaction are complex and difficult to understand.
Although students responded to the affective computer feedback, they seemed to disregard the context
within which the feedback was offered. They saw praise from the computer as being positive,
regardless of whether or not they thought that their ability level was known or whether the task
was easy or difficult.
People may not be willing to process inferences as deeply when they receive computer feedback
as when they receive feedback from other people (Mishra, 2006). When receiving feedback from a
human, praise for success on an easy task is discounted and may even reduce motivation, especially
when they believe that their ability level is known. Receiving blame for failure on a difficult task
leads to positive affect because the recipient believes that the evaluator thinks highly of his or her
ability and that he or she can do better. If people respond more mindlessly to
computer feedback about their test scores, this could thwart the goals of the feedback; for instance,
to motivate higher achievement next time or to take the difficulty of the task into account when
receiving praise. However, research is needed to determine whether these reactions change if
feedback recipients are given a social script suggesting that the computer is capable of sophisticated
inferences; for instance, that the computer “respects” the subject’s intelligence because the computer
has a record of information about the subject and takes that into account in providing feedback. Ferdig and Mishra (2004)
explored the technology as a social actor, finding that people exhibited emotional responses (e.g.,
anger and spite) when they felt that the computer had treated them unfairly in an ultimatum bargain-
ing game.
not be as important in their current situation. They may behave as if they were promoted or were
working on a different job and ignore current job requirements.
feedback and its joint effects with other interventions, such as coaching, training, career paths,
online developmental resources, etc.
Returning to the three cases that we introduced at the outset of this chapter, here is how the
organizations answered the questions about whether to provide feedback, how much, and in what
form.
AC FEEDBACK
The consulting firm that used an AC to help select recent MBA graduates for entry-level positions
decided to inform the candidates that feedback would be given 1 week after the online assess-
ment. Although some test results would be available to the firm immediately, on the basis of com-
puter scoring, the exercises would provide qualitative results that the firm wanted to examine and
integrate with all the information about the candidates. Observers who reviewed the transactions
would record some feedback on each exercise. The feedback would be available to candidates who
passed the assessment and those who did not, although the nature of the feedback and tenor of the
conversation would differ. Those who passed and were offered jobs were told about developmental
experiences that would help them use the feedback for development in relation to the nature of the
work they would be doing and the organization’s performance expectations—essentially, a realistic
job preview. Those who were not offered positions were given the information in a way that did not
damage their self-image (e.g., suggested that this job may not have been right for them) and that
pointed to behaviors they could develop.
The firm asked the candidates not to discuss the details of the selection process with others,
although they were free to share the results with those who could be helpful in their job search (e.g.,
an academic advisor, friend, or career mentor). The feedback included information about how
the data would be retained and protected. For selected candidates who accepted a position, this
included making the assessment results available to the individual’s new manager and the HR direc-
tor who would track the individual’s career and developmental assignments. For candidates who
failed the assessment, the data would be retained for a year under lock and key in case the individual
wanted to apply for another position with the firm.
TEST FEEDBACK
The restaurant hiring food service personnel using several evaluation sources (a biodata form, an
integrity test, and an interview) provided a written report for individuals who were offered jobs.
Those who were not hired were given information about the nature of the assessment to help them
realize that the selection methods were fair and related to the position. They were also given a sum-
mary statement of results written in relation to job requirements (e.g., “This job requires a person
to keep records accurately.”) without providing the individual’s actual results. Pains were taken not
to divulge the nature or purpose of specific aspects of the selection process, such as the integrity
test, or to provide results that would impugn the candidate in any way (e.g., indicate the potential
for dishonesty).
ongoing developmental coaching experience with the psychologist or another external coach at
the discretion of the new vice president.
CONCLUSIONS
Feedback of assessment results is a complex process involving issues of information security and
professional accountability as well as development. Professional standards suggest that developers
of selection tests and other assessment methods are obligated to provide some feedback, if only to
explain the selection method and rationale for the outcome. Feedback also affects candidates’ reac-
tions to the selection process, which in turn may affect their decisions about the organization and
their development goals. Feedback can benefit the candidates and the organization, but precautions
must be taken to guard the confidentiality of the information and protect the self-image of the feed-
back recipient. The organization must determine the level of feedback specificity that is in the best
interests of the organization and the candidates. Internal and external candidates may be treated
differently, providing more details and developmental support to internal candidates and offering
external candidates optional feedback and help interpreting the results. To be constructive, feedback
can focus on implications of the results for behaviors, not personal characteristics that threaten
self-image. Moreover, feedback can be tied to job requirements and developmental opportunities.
Feedback can be preceded by information about the selection test when possible to explain its job
relevance and fairness. Furthermore, feedback can be accompanied by behavior-based standards,
not merely comparisons to others but standards in relation to job knowledge, abilities, and experi-
ences that are required for the job and that predict success on the job.
Feedback should not be given without adequate support for using the information and ensuring
that the candidate was not harmed by the information. Although assessment feedback has potential
costs to the candidate and the organization, the benefit of assessment results can be maximized by
recognizing its value for selection and development.
REFERENCES
American Psychological Association. (1998). The rights and responsibilities of test takers: Guidelines and
expectations. Test Taker Rights and Responsibilities Working Group of the Joint Committee on Testing
Practices. Washington, DC: Author. http://www.apa.org/science/ttrr.html
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct.
Washington, DC: Author.
Anderson, N., & Goltsi, V. (2006). Negative psychological effects of selection methods: Construct formula-
tion and an empirical investigation into an assessment center. International Journal of Selection and
Assessment, 14, 236–255.
Bangert-Drowns, R. L., Kulik, C. L. C., Kulik, J. A., & Morgan, M. T. (1991). The instructional effect of feed-
back in test-like events. Review of Educational Research, 61, 213–238.
Bauer, T. N., Maertz, C. P., Jr., Dolen, M. R., & Campion, M. A. (1998). Longitudinal assessment of appli-
cant reactions to employment testing and test outcome feedback. Journal of Applied Psychology, 83,
892–903.
Bauer, T. N., Truxillo, D. M., Sanchez, R. J., Craig, J. M., Ferrara, P., & Campion, M. A. (2001). Applicant
reactions to selection: Development of the selection procedural justice scale. Personnel Psychology, 54,
387–418.
Brockner, J., & Wiesenfeld, B. M. (1996). An integrative framework for explaining reactions to decisions:
Interactive effects of outcomes and procedures. Psychological Bulletin, 120, 189–208.
Chan, D., & Schmitt, N. (2004). An agenda for future research on applicant reactions to selection procedures:
A construct-oriented approach. International Journal of Selection and Assessment, 12, 9–23.
DeNisi, A. S., & Kluger, A. N. (2000). Feedback effectiveness: Can 360-degree appraisals be improved?
Academy of Management Executive, 14, 129–139.
Derous, E., Born, M. P., & DeWitte, K. (2004). How applicants want and expect to be treated: Applicants’
selection treatment beliefs and the development of the social process questionnaire on selection.
International Journal of Selection and Assessment, 12, 99–119.
Drasgow, F., Olson, J. B., Keenan, P. A., Moberg, P., & Mead, A. D. (1993). Computerized assessment. Research
in Personnel and Human Resources Management, 11, 163–206.
Ferdig, R. E., & Mishra, P. (2004). Emotional responses to computers: Experiences in unfairness, anger, and
spite. Journal of Educational Multimedia and Hypermedia, 13, 143–161.
Fletcher, C. (1991). Candidates’ reactions to assessment centres and their outcomes: A longitudinal study.
Journal of Occupational Psychology, 64, 117–127.
Folger, R., & Greenberg, J. (1985). Procedural justice: An interpretive analysis of personnel systems. Research
in Personnel and Human Resources Management, 3, 141–183.
Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective.
Academy of Management Review, 18, 694–734.
Gilliland, S. W. (1994). Effects of procedural and distributive justice on reactions to a selection system. Journal
of Applied Psychology, 79, 691–701.
Gilliland, S. W., & Chan, D. (2001). Justice in organizations: Theory, methods, and applications. In
N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), International Handbook of Work and
Organizational Psychology (pp. 143–165). London, England: Sage.
Gilliland, S. W., Groth, M., Baker, R. C., Dew, A. F., Polly, L. M., & Langdon, J. C. (2001). Improving
applicants’ reactions to rejection letters: An application of fairness theory. Personnel Psychology,
54, 669–703.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-manager, self-peer, and peer-manager ratings.
Personnel Psychology, 41, 43–62.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated
model and meta-analysis. Personnel Psychology, 57, 639–683.
Herriot, P. (2004). Social identities and applicant reactions. International Journal of Selection and Assessment,
12, 75–83.
Hough, L. M., & Oswald, F. L. (2000). Personnel selection: Looking toward the future—Remembering the
past. Annual Review of Psychology, 51, 631–664.
International Task Force on Assessment Center Guidelines (1989). Guidelines and ethical considerations for
assessment center operations. Public Personnel Management, 18, 457–470.
International Task Force on Assessment Center Guidelines (2000). Guidelines and ethical considerations for
assessment center operations. Bridgeville, PA: Development Dimensions International.
Kluger, A. N., & Rothstein, H. R. (1993). The influence of selection test type on applicant reactions to employment
testing. Journal of Business and Psychology, 8, 3–25.
London, M. (2003). Job feedback: Giving, seeking, and using feedback for performance improvement (2nd ed.).
Mahwah, NJ: Lawrence Erlbaum.
Macan, T. H., Avedon, M. J., Paese, M., & Smith, D. E. (1994). The effects of applicants’ reactions to cognitive
ability tests and an assessment center. Personnel Psychology, 47, 715–738.
Maertz, C. P., Jr., Bauer, T. N., Mosley, D. C., Jr., Posthuma, R. A., & Campion, M. A. (2004). Do procedural
justice perceptions in a selection testing context predict applicant attraction and intention toward the
organization? Journal of Applied Social Psychology, 34, 125–145.
Maertz, C. P., Jr., Bauer, T. N., Mosley, D. C., Jr., Posthuma, R. A., & Campion, M. A. (2005). Predictors of
self-efficacy for cognitive ability employment testing. Journal of Business Research, 58, 160–167.
Marcus, B. (2003). Attitudes towards personnel selection methods: A partial replication and extension in a
German sample. Applied Psychology: An International Review, 52, 515–532.
McFarland, C., & Ross, M. (1982). Impact of causal attributions on affective reactions to success and failure.
Journal of Personality and Social Psychology, 43, 937–946.
Mishra, P. (2006). Affective feedback from computers and its effect on perceived ability and affect: A test
of the computers as social actor hypothesis. Journal of Educational Multimedia and Hypermedia, 15,
107–131.
Nicol, D. J., & Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and
seven principles of good feedback practice. Studies in Higher Education, 31, 199–218.
Ployhart, R. E., & Harold, C. M. (2004). The applicant attribution-reaction theory (AART): An integrative
theory of applicant attributional processing. International Journal of Selection and Assessment, 12,
84–98.
Ployhart, R. E., Ryan, A. M., & Bennett, M. (1999). Explanations for selection decisions: Applicants’ reactions to
informational and sensitivity features of explanations. Journal of Applied Psychology, 84, 87–106.
Pope, K. S. (1992). Responsibilities in providing psychological test feedback to clients. Psychological
Assessment: Clinical & Forensic, 4, 268–271.
Ryan, A. M., Brutus, S., Greguras, G. J., & Hakel, M. D. (2000). Receptivity to assessment-based feedback for
management development. Journal of Management Development, 19, 252–276.
Ryan, A. M., & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and decisions: A critical
review and agenda for the future. Journal of Management, 26, 565–606.
Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
Schinkel, S., van Dierendonck, D., & Anderson, N. (2004). The impact of selection encounters on applicants:
An experimental study into feedback effects after a negative selection decision. International Journal of
Selection and Assessment, 12, 197–205.
Schleicher, D. J., Venkataramani, V., Morgeson, F. P., & Campion, M. A. (2006). So you didn’t get the job …
now what do you think? Examining opportunity-to-perform fairness perceptions. Personnel Psychology,
59, 559–590.
Schuler, H. (1993). Social validity of selection situations: A concept and some empirical results. In H. Schuler,
J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational perspectives
(pp. 11–26). Hillsdale, NJ: Lawrence Erlbaum.
Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selection
procedures. Personnel Psychology, 46, 49–76.
Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, K. (1997). A survey of assessment center practices
in organizations in the United States. Personnel Psychology, 50, 71–90.
Thibaut, J., & Walker, L. (1975). Procedural justice: A psychological analysis. Hillsdale, NJ: Lawrence
Erlbaum.
Thorsteinson, T. J., & Ryan, A. M. (1997). The effect of selection ratio on perceptions of the fairness of a selec-
tion test battery. International Journal of Selection and Assessment, 5, 159–168.
Truxillo, D. M., Bauer, T. N., Campion, M. A., & Paronto, M. E. (2002). Selection fairness information and appli-
cant reactions: A longitudinal field study. Journal of Applied Psychology, 87, 1020–1031.
Truxillo, D. M., Steiner, D. D., & Gilliland, S. (2004). The importance of organizational justice in person-
nel selection: Defining when selection fairness really matters. International Journal of Selection and
Assessment, 12, 39–53.
Van Vianen, A. E. M., Taris, R., Scholten, E., & Schinkel, S. (2004). Perceived fairness in personnel selection:
Determinants and outcomes in different stages of the assessment procedure. International Journal of
Selection and Assessment, 12, 149–159.
Watts, S. A. (2007). Evaluative feedback perspectives on media effects. Journal of Computer-Mediated
Communication, 12, 384–411.
Wiechmann, D., & Ryan, A. M. (2003). Reactions to computerized testing in selection contexts. International
Journal of Selection and Assessment, 11, 215–229.
Part 5
Criterion Constructs in
Employee Selection
Kevin R. Murphy and Elaine D. Pulakos,
Section Editors
21 The Measurement of Task
Performance as Criteria
in Selection Research
This chapter is about measuring task performance (i.e., the technical proficiency part of job performance)
in personnel selection research. The accuracy of validity estimates for selection tests and
procedures depends in turn on the accuracy of criterion performance measurement. Accordingly,
there is considerable motivation in selection research to
obtain reliable and accurate criterion scores for job incumbents participating in the research. Our
chapter covers the task performance criterion “space.” Chapter 22, this volume, describes citizen-
ship performance criteria. Specific topics covered in this chapter are (a) relatively objective criterion
measures, such as work samples, job knowledge tests, and production rates; (b) subjective mea-
sures (i.e., ratings of performance), including different rating formats and rater training strategies;
(c) dimensionality of job performance; and (d) validity estimates against task performance for sev-
eral predictor constructs (e.g., ability, personality, etc.).
OBJECTIVE CRITERIA
At first glance, we might assume that objective criterion measures should be preferred over subjec-
tive ratings of performance. However, “objective” may be at least in part a misnomer in that judg-
ment often enters into the use of objective criteria. Also, objective measures are notoriously deficient
as criteria because they usually tap into only a small part of the total criterion space. Contamination
can, as well, be a serious problem with objective measures. For example, factors beyond the con-
trol of the job incumbent can influence objective criterion measures. Nonetheless, when they are
relevant to important performance requirements and are reasonably reliable and uncontaminated
(or when corrections can be made to reduce contamination), objective measures can be useful for
measuring some criterion dimensions. In other words, the deficiency issue may not be a problem
with objective criteria if we measure well the task performance part of job performance and have
other criterion measures evaluating performance in other aspects of the job.
PRODUCTION RATES
For jobs that have observable, countable products that result from individual performance (e.g.,
military recruiters or patrol officers who are assigned traffic enforcement duties), a production rate
criterion is a compelling bottom-line index of performance. However, as often noted (e.g., Borman,
1991; Guion, 1965), considerable care must be taken in gathering and interpreting production data.
For example, work-related dependencies on other employees or on equipment for determining
production rates may create bias in these rates. Also, production standards and quota systems (e.g.,
in call center jobs) can create problems for criterion measurement.
Instability of production rates is another potential problem. Rothe’s (1978) extensive research
on production workers showed that week-to-week production rates are only moderately reliable.
Correlations between successive weeks’ production average .75 with incentives and .53 without
incentives (Rothe, 1978). Longer periods for data collection may be necessary to ensure stable
criterion production rates. Most importantly, researchers attempting to derive production criteria
should pay special attention to possible contaminating influences whereby employees have unequal
opportunities to produce at the same rate.
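The link between aggregation period and criterion stability can be made concrete with the Spearman-Brown prophecy formula, which projects the reliability of a score averaged over k comparable periods from a single-period reliability estimate. The formula itself is standard psychometrics; applying it to Rothe’s figures, as below, is our own illustrative sketch rather than Rothe’s analysis.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Projected reliability of a criterion averaged over k comparable
    measurement periods, given single-period reliability r_single
    (here, the week-to-week correlation of production rates)."""
    return (k * r_single) / (1 + (k - 1) * r_single)

# Using Rothe's (1978) .53 week-to-week figure for work without incentives:
# averaging over roughly four weeks already yields a reliability above .80.
for weeks in (1, 4, 8):
    print(weeks, round(spearman_brown(0.53, weeks), 2))
```

By this projection, a production criterion averaged over about a month, rather than a single week, would reach the reliability levels typically demanded of criterion measures.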
SALES
Initially, sales jobs may seem ideally suited for the use of objective criteria as performance measures.
Number of sales per unit time, or some similar index of bottom-line sales volume, appears
compelling as a global, overall performance measure. However, upon closer inspection, significant
criterion contamination issues are evident for objective sales criteria.
First, summary sales volume measures are a function of both individual skill and effort and environmental
factors beyond the control of the salesperson. In the context of the Campbell, Dunnette,
Lawler, and Weick (1970) behavior-performance-effectiveness model, objective sales volume is an
effectiveness measure. Where environmental influences are substantial and unequal in their effect
on sales, criterion measurement will be contaminated.
One way to remove contamination is to adjust sales data for factors such as market potential (e.g.,
Kanfer & Borman, 1987). A practical strategy for making these adjustments is to create norms for
stores, sales territories, or for whatever organizational unit provides the appropriate comparison.
Then criterion scores for each salesperson can be compared to scores for other salespersons with
roughly the same selling-related environment and thus similar opportunities to produce sales.
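One simple version of this adjustment standardizes each salesperson’s volume within his or her own territory. The sketch below is purely illustrative (the function and data layout are ours, not a prescribed procedure); it returns within-territory z-scores so that salespersons are compared only with peers facing a similar selling environment.

```python
from statistics import mean, pstdev

def adjusted_sales(records):
    """records: list of (salesperson, territory, sales_volume) tuples.
    Returns {salesperson: z-score of sales volume within that person's
    own territory}, making scores comparable across territories that
    differ in market potential."""
    by_territory = {}
    for person, territory, volume in records:
        by_territory.setdefault(territory, []).append(volume)
    # Precompute each territory's mean and standard deviation.
    stats = {t: (mean(v), pstdev(v)) for t, v in by_territory.items()}
    adjusted = {}
    for person, territory, volume in records:
        m, s = stats[territory]
        adjusted[person] = 0.0 if s == 0 else (volume - m) / s
    return adjusted
```

A salesperson one standard deviation above his or her territory’s mean receives the same adjusted score whether that territory’s raw volumes are high or low, which is the intended effect of norming. The instability problem discussed next arises because the territory means and standard deviations are themselves estimated from few salespersons.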
Unfortunately, an inherent problem with this approach has to do with the norming process itself. For
example, if large sales territories with many salespersons are used to accomplish the norming, there may be
meaningful differences within territories with respect to opportunity to perform. If smaller territories are
used, then the norms tend to be unstable because the mean sales performance comparison indices are based
on too few salespersons. Thus, how one does the adjusting may be as important as whether or not to adjust.
As with most other objective performance measures, sales criteria suffer from problems of defi-
ciency in that global measures of sales volume will often fail to tap important parts of the job. For
example, identifying new customers and maintaining good relations with existing customers are
important aspects of sales but would not be directly indexed by objective sales measures.
WORK SAMPLES
Work sample or performance tests are sometimes developed to provide criteria for selection research.
Some argue that work sample tests have the highest fidelity for measuring criterion performance. In
a sense, the argument is compelling: What could be fairer than to assess employees’ performance
on a job by having them actually perform some of the most important tasks associated with it? Yet,
evaluation of work samples as criteria is not quite so simple, and their use involves several issues.
One issue in test scoring is whether to evaluate products or processes in work sample performance.
In general, tasks associated with products (e.g., troubleshooting a problem with a radio)
can be oriented toward either product or process; tasks with no resulting products (e.g., interviewing
a job candidate) must be scored according to process considerations. An advantage to scoring prod-
ucts over process is that assessment is typically more objective. However, if the procedures taken to
arrive at the product are also important, process assessment is clearly necessary.
Several other issues bear on scoring work samples. Difficult-to-score process steps
are to be avoided. For example, checking and inspecting steps are difficult, if not impossible, to
observe. Ill-defined steps and complex steps where an employee can do well on one part of the step
but poorly on another should also be avoided.
Still another issue with scoring work samples is the relative merits of pass-fail marks versus per-
formance level ratings on test steps. Guion (1978) argued for test step performance ratings because
they provide more information. Indeed, many steps seem amenable to a continuous performance
scale where such judgments as “more skillful,” “faster,” and “more efficient” may have meaning for
evaluating performance. For certain very simple task steps, pass-fail may suffice, but it will usually
be desirable to develop continuous performance scales for use in work sample testing.
A major issue with work sample tests is that researchers may treat them as ultimate criteria; that
is, these tests are sometimes considered the criterion of choice for accurately assessing performance
in certain jobs, especially those that require complex motor skills. Work samples should not be
thought of in this light. First, they are clearly maximum performance rather than typical perfor-
mance measures. As such, they tap the “can-do” more than the “will-do” performance-over-time
aspects of effectiveness. Yet will-do longer-term performance is certainly important for assessing
effectiveness in jobs. Accordingly, these measures are deficient when used exclusively in measur-
ing performance. In sum, inherent shortcomings of work samples for measuring some aspects of
performance, as well as practical limitations such as time and equipment constraints, argue against
relying on such tests to provide a comprehensive index of overall performance.
SUBJECTIVE CRITERIA
Subjective criteria will typically be synonymous with performance ratings. The notion of supervisors
or peers providing numerical scores for employees on job-relevant performance areas is an interest-
ing idea. Ideally, it provides well-informed observers with a means of quantifying their perceptions
of individuals’ job performance. This is preferable to verbal descriptions of performance because
individuals can now be compared in a reasonably straightforward way. The notion can be viewed as
analogous to developing structured job analysis questionnaires to take the place of verbal job descrip-
tions for purposes of comparing jobs (McCormick, 1976). In each case, quantification of perceptions
clears the way for scientific study of an area that could not be previously studied in this manner.
The emphasis in this section will be on ratings gathered for research purposes only, as criteria in
selection research applications. Although ratings can be generated for purposes of salary administration,
promotion decisions, or employee feedback and development, and although performance appraisal
systems to address these management functions are extremely important to individual and orga-
nizational effectiveness (e.g., DeVries, Morrison, Shullman, & Gerlach, 1981), they are not very
relevant to personnel selection research.
442 Handbook of Employee Selection
Performance ratings are indeed the most often used criterion measure in industrial and organi-
zational psychology. Landy and Farr (1980) referred to several surveys intended to assess how fre-
quently ratings are used as criterion measures in research reports. The percentages reach 75% and
higher, suggesting that considerable attention should be paid to this criterion measurement method.
Issues in using ratings as performance criteria include (a) design of the rating form to be used and
(b) type of training to provide to raters.
RATING FORMATS
Behaviorally Anchored Rating Scales
Smith and Kendall (1963) extended the notion of critical incidents (Flanagan, 1954) by designing
a rating format they referred to as behavioral expectation scales, now generally labeled behavior-
ally anchored rating scales (BARS). Smith and Kendall reasoned that different effectiveness lev-
els on job performance rating scales might be anchored using behavioral examples of incumbent
performance. Accordingly, they developed performance rating dimensions, with scaled behavioral
examples anchoring the appropriate effectiveness levels on the dimensions (see Table 21.1 for
examples of all of the formats discussed here).
TABLE 21.1
Sample Rating Formats
Graphic rating scale (Administrative skills): planning ahead; organizing time efficiently; completing
paperwork accurately and on time; keeping track of appointments; not wasting time. Ratings are made
on a numerical scale anchored “Effective.”
Behaviorally anchored rating scale (Organizational skills): a good constructional order of material slides
smoothly from one topic to another; design of course optimizes interest; students can easily follow
organizational strategy; course outline followed. Sample behavioral anchor at scale value 6 of 10: “This
instructor could be expected to assimilate the previous lecture into the present one before beginning
the lecture.”
Note: Statements 4–10 in the behavior observation scale were omitted because of space restrictions.
Graphic rating scale from Berk, R. A. (Ed.), Performance Assessment: Methods and Applications, The
Johns Hopkins University Press, Baltimore, 1986; behaviorally anchored rating scale from Bernardin,
H. J., & Beatty, R. W., Performance Appraisal: Assessing Human Behavior at Work, Kent, Boston, 1984;
behavior observation scale from Gatewood, R. D., & Field, H. S., Human Resource Selection (5th ed.),
Harcourt College, Fort Worth, TX, 2001.
Essentially, the rater’s task is to compare observed job behaviors of the ratee with the behavioral
anchors on the scale to assign a rating on that dimension. This was seen as preferable to evaluating
a ratee without guidance regarding the effectiveness levels of different scale points. The BARS idea
is more than a format; it is a system, or even a philosophy (Bernardin & Smith, 1981). For example,
ideally raters should record examples of employee work behavior in preparation for assigning per-
formance ratings.
Another positive feature of BARS is that users of the system typically participate in scale devel-
opment, enhancing the credibility of the format. Further, from a domain sampling perspective,
BARS development steps provide an excellent methodology to aid in identifying all important
dimensions for a job.
Behavior Summary Scales
A related format, behavior summary scales (BSS), replaces the individual behavioral incidents of
BARS with summary scale anchors. These summary anchors typically represent four or five
effectiveness levels, but the main point is that rather than the BARS practice of having individual
incidents as scale anchors, BSS has summary anchors capturing the behavioral content of several
individual anchors at each level of effectiveness for each dimension.
Regarding raters’ use of the BSS, the most important potential advantage is that the behavioral
construct underlying each aspect of job performance is made more evident to the rater. Raters do
not need to infer the dimensionality from a series of highly specific incidents. The inferential step is
accomplished in scale development, in which the behavioral essence from several specific incidents
is distilled in each behavior summary statement.
Accordingly, this approach should increase the probability that raters can match observed ratee
behavior directly with scaled behavior. That is, by increasing the scope of behavior representing
various performance levels on a scale, chances are greater that one of the anchors will accurately
describe a ratee’s performance on that dimension.
This argument makes good conceptual sense, but in the one format comparison study pitting
BARS against a BSS format, there were no consistent differences between these format types with
respect to psychometric error or accuracy (Borman, 1979). Thus, the seeming conceptual advantage
of BSS may not make any difference in the actual use of the scale.
One possible idea in the direction of a different rating measurement method started with consid-
eration of Thurstone’s (1927) law of comparative judgment in the context of the performance rating
process. Thurstone developed a method for scaling stimuli on the basis of paired-comparison judg-
ments. Arguably, his approach places stimuli on an interval scale. In the context of rating job per-
formance, Borman et al. (2001a) reasoned that pairs of behavioral statements might be presented to
the rater with instructions to pick the statement that is more descriptive of the ratee. If interval-scale
judgments of ratee performance levels can be achieved with this method, the paired-comparison
judgments may provide ratings that are more precise than those generated by other rating formats
that use a linear numerical scale, which arguably provide only ordinal-level measurement. Another
idea that might make the paired-comparison format even more effective is to apply an item response
theory (IRT) adaptive testing orientation to the method. For example, the rater could be presented
with a series of behavioral statement pairs such that responses to each successive pair provide a
more precise estimate of ratee performance.
Accordingly, our notion for computerized adaptive rating scales (CARS) was to develop a paired-
comparison rating task that used adaptive testing principles to help raters estimate a ratee’s performance
level through an iterative paired-comparison rating process. The idea was to initially present two behav-
ioral statements associated with a dimension—one reflecting somewhat below average performance
and the other reflecting somewhat above average performance. Depending on which statement the rater
indicated was more descriptive of the ratee, the rating algorithm, developed subsequently by Stark and
Drasgow (2002), selected two additional behavioral statements—one with a scaled effectiveness level
somewhat above the effectiveness value of the statement picked first as the more descriptive, and the
other with a scaled effectiveness level somewhat below the effectiveness value of that initially chosen
statement. The rater’s selection of the more descriptive statement for the second paired comparison
then revised the initial estimated ratee effectiveness level, and, as before, the algorithm selected two
more statements with effectiveness values bracketing the revised estimated performance level. Thus,
analogous to adaptive testing, a ratee’s “true” effectiveness level was to be estimated in an IRT sense
by this iterative paired-comparison rating task that presents in sequence item pairs that maximize the
amount of information about performance derived from each choice of an item.
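The iterative bracketing logic described above can be sketched as follows. This is a deliberately simplified illustration: the operational CARS algorithm (Stark & Drasgow, 2002) selects statement pairs to maximize IRT information, whereas this toy version merely brackets a running estimate; the statement effectiveness values, the function names, and the simulated rater are all hypothetical.

```python
# Toy sketch of the CARS iterative paired-comparison logic (simplified;
# the operational algorithm uses IRT-based pair selection). Statement
# effectiveness values below are hypothetical, on a 1-7 scale.

def run_cars(statement_values, pick_more_descriptive, n_rounds=3):
    """Iteratively present bracketing pairs and revise the estimate.

    statement_values: scaled effectiveness values for one dimension.
    pick_more_descriptive: callable simulating the rater's choice
    between two statement values; returns the chosen value.
    """
    values = sorted(statement_values)
    current = sum(values) / len(values)       # start at the scale midpoint
    offset = (max(values) - min(values)) / 4  # initial bracket half-width
    for _ in range(n_rounds):
        # Pick one statement somewhat below and one somewhat above
        # the current estimated effectiveness level.
        low = min(values, key=lambda v: abs(v - (current - offset)))
        high = min(values, key=lambda v: abs(v - (current + offset)))
        chosen = pick_more_descriptive(low, high)
        # Revise the estimate toward the chosen statement and narrow
        # the bracket, analogous to adaptive testing.
        current = (current + chosen) / 2
        offset /= 2
    return current

# Simulated rater for a ratee whose "true" effectiveness is 5.8:
# always picks the statement closer to the true level.
TRUE_LEVEL = 5.8
rater = lambda a, b: a if abs(a - TRUE_LEVEL) < abs(b - TRUE_LEVEL) else b

estimate = run_cars([1, 2, 3, 4, 5, 6, 7], rater)
```

With each round the presented pair straddles the revised estimate more tightly, so the estimate climbs toward the above-average ratee's level, mirroring the bracketing sequence described in the text.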
In a laboratory study to evaluate selected psychometric properties of CARS compared to two
other formats, videotapes of six office workers were prepared depicting prescripted levels of per-
formance on three dimensions, and subjects rated these vignettes using the CARS format and one
of the other competing formats (graphic or behaviorally anchored rating scales). Results showed
23–37% lower standard errors of measurement for the CARS format. In addition, validity was sig-
nificantly higher for the CARS format (d = .18). Accordingly, in a laboratory study, CARS showed
promising results (Borman et al., 2001a).
One last point about formats—although different format ideas may not make very large differ-
ences related to psychometric properties, well-articulated performance standards for communicating
expectations and providing feedback in operational performance management systems can be quite
useful. Thus, especially the BARS and BSS can serve this important purpose in organizations.
RATER TRAINING
Rater training provides a promising approach to improving the quality of performance ratings.
Two general kinds of training programs have emerged to help raters generate more error-free and
accurate ratings (Bernardin & Buckley, 1981; Smith, 1986; Woehr & Huffcutt, 1994). Rater error
training seeks simply to alert raters to certain psychometric or perceptual errors such as leniency/
severity, halo, restriction-in-range, and similar-to-me effects. Training often takes the form of a
brief lecture on or demonstration of each error and training to avoid such errors when making per-
formance ratings (Bernardin & Buckley, 1981).
Frame-of-reference training (Bernardin & Pence, 1980) attempts to convey to raters that
performance is multidimensional and to thoroughly familiarize them with the actual content of each
performance dimension.
HUNT (1996)
Hunt’s (1996) intention was to develop a dimension set that reflected important behavioral dimen-
sions of the performance requirements for entry-level jobs. Using a critical incidents approach, Hunt
(1996) derived an eight-dimension taxonomy of generic work behaviors focusing on non-job-specific
aspects of performance. In this investigation, Hunt used factor analysis to empirically derive dimen-
sions of generic work behaviors from supervisor ratings of employee behaviors. However, contrary
to the typical approach in which a single job family or single organization is used, Hunt obtained
supervisory ratings for nearly 19,000 employees in 52 different job settings across 36 different
companies. Because of the nature of these data (i.e., each data set included a slightly different com-
bination of the behaviors assessed), multiple factor analyses of the dimension structure could be
conducted. First, a sample of data sets was subjected to factor analysis, resulting in several initial
dimension structures. Similarities across these initial taxonomies were then cross-validated through
the use of additional factor analyses (using hold-out data sets) and subject matter expert (SME) rat-
ings of agreement.
These analyses resulted in an eight-dimension taxonomy of generic work behavior, including
task and citizenship behaviors. Specifically, the dimension structure consisted of two higher-order
dimensions and eight second-order dimensions. The higher-order dimension of required perfor-
mance behaviors included those behaviors required of an employee for continued employment,
including the second-order dimensions of attendance, off-task behavior (i.e., effort expended toward
non-job-related activities while at work; e.g., goofing off), and employee deviance (a combination
of unruliness, theft, and drug misuse). The second higher-order dimension, organizational
citizenship behaviors, comprises the second-order dimensions of schedule flexibility and work ethic
(a combination of industriousness and thoroughness). Although Hunt identified a ninth specific
dimension, adherence to confrontational rules, this factor was posited to be primarily relevant to
cash register work, so Hunt omitted it from his model of generic work behavior. In addition to
dimensions generated from the supervisory ratings, Hunt also identified four dimensions of generic
work behavior through a review of the literature, including teamwork, problem solving, safety, and
personal appearance.
VISWESVARAN (1993)
Viswesvaran (1993) built upon the lexical hypothesis of Galton (Goldberg, 1993) to develop a taxon-
omy of general job performance. In this investigation, Viswesvaran compiled and sorted measures
of job performance from the literature into summary categories, resulting in 25 conceptually dis-
tinct dimensions. Next, correlations were obtained from studies utilizing these dimensions. These
correlations were used in a meta-analysis to determine the true score correlations between the
dimensions. Finally, factor analysis was used to analyze these true score correlations and to derive
a set of ten job performance categories.
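The true score estimation step can be illustrated in bare-bones form. The function below is a minimal sketch of a psychometric meta-analysis (sample-size-weighted mean correlation, disattenuated for unreliability in both measures); the study inputs and the reliability values are illustrative assumptions, not Viswesvaran's actual artifact distributions.

```python
import math

def true_score_correlation(studies, rxx, ryy):
    """Estimate the true score correlation between two performance
    dimensions from observed study correlations.

    studies: list of (sample_size, observed_r) pairs.
    rxx, ryy: mean reliabilities of the two dimension measures.
    """
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n  # weighted mean r
    return mean_r / math.sqrt(rxx * ryy)               # disattenuate

# Hypothetical inputs: three studies correlating two rating dimensions,
# with an illustrative interrater reliability of .52 for each measure.
rho = true_score_correlation([(100, 0.30), (200, 0.45), (50, 0.20)],
                             rxx=0.52, ryy=0.52)
```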
The dimensions identified in this investigation were intended to summarize overall job perfor-
mance. Dimensions identified by Viswesvaran (1993) included interpersonal competence, adminis-
trative competence, quality, productivity, effort, job knowledge, leadership, compliance/acceptance
of authority, communications competence, and an overall job performance dimension.
BORMAN AND BRUSH (1993)
Borman and Brush (1993) identified 18 megadimensions of managerial job performance divided
into four groupings. The first grouping, interpersonal dealing and communication, consists of those
megadimensions involving communication skills, maintaining good interpersonal relationships at
work, representing the organization to others, and selling/influencing behaviors. Second, the leader-
ship and supervision grouping includes those megadimensions related to guiding, directing, moti-
vating, training, coaching, developing, and coordinating subordinates, as well as providing feedback
as needed. Third, technical activities and the “mechanics of management” involve megadimensions
pertaining to the technical proficiency required for a job, but also those related to managerial tasks
such as planning, organizing, decision-making, staffing, monitoring, and delegating. Finally, the
last grouping, useful personal behavior and skills, consists of megadimensions involving persis-
tence, handling stress, and organizational commitment. This system, at the 18 “megadimension”
level, is at a relatively high level of specificity, especially because it covers only management jobs.
CAMPBELL ET AL. (1993)
Campbell, McCloy, Oppler, and Sager (1993) proposed an eight-dimension taxonomy of job
performance, positing that for any job, some subset of the eight dimensions (or all eight) is
sufficient for describing its performance requirements. Several of these constructs
emerged in factor analyses of the performance measures administered in the Project A research (a
large-scale selection and classification study conducted in the U.S. Army; Campbell, 1990a) across
the many jobs studied in that program. As examples, for first-tour soldiers in 19 different jobs, tech-
nical proficiency, personal discipline, and effort were consistently represented in factor analyses of
performance criterion measures. For second-tour noncommissioned officer jobs, a leadership factor
was added to the mix of factors emerging. Accordingly, the Campbell et al. (1993) taxonomy has
been largely confirmed for several categories using data that are highly appropriate for testing its
generality across a wide variety of supervisory and nonsupervisory jobs.
Thus, the Campbell et al. (1993) dimension system includes these eight dimensions: (a) job-
specific technical proficiency, (b) non-job-specific technical proficiency, (c) written and oral com-
munication, (d) demonstrating effort, (e) maintaining personal discipline, (f) facilitating peer and
team performance, (g) supervision/leadership, and (h) management/administration. Parts or all of
Dimensions 1, 2, 4–7, and 8 were confirmed in Project A using multiple methods to measure job
performance, including hands-on performance tests; job knowledge tests; supervisor and peer rat-
ings on multiple dimensions of performance; administrative measures such as disciplinary cases,
awards and commendations, etc.; and a supervisory situational judgment test (SJT) criterion mea-
sure. Importantly, all of these criterion measures were developed and then administered to super-
visory and nonsupervisory people. The wide variety of criterion measures used to evaluate job
performance constructs ensured the criterion space was comprehensively reflected in the coverage
of the performance domain. Accordingly, the system seems quite generalizable across different
types of jobs and supervisory and nonsupervisory jobs.
TABLE 21.2 (continued)
Problem solving: decision-making/problem solving; judgment and problem solving; reasoning and
decision-making. Organizing and planning: administrative competence; administration and
paperwork; organization; planning and organizing; monitoring and controlling resources; staffing.
In Table 21.2, we array the six dimension sets to show commonalities and differences across the
systems. Then, we provide a summary column, reflecting
the common content where it is evident. Also, the rows of Table 21.2 are ordered such that the first
row represents the most commonality across systems, the second row has the next most commonal-
ity, and so on.
All six dimension sets have content involving communicating and interacting with others,
although Hunt (1996) has only a teamwork dimension, so communicating is not explicitly
represented in his framework. In Viswesvaran (1993), Borman and Brush (1993), and Borman et al.
(1994), communicating and the interpersonal component are represented separately; in Peterson
et al. (1999), at the nine-dimension level, the two constructs are combined in a single dimension.
Productivity and proficiency are likewise reflected in all six dimension sets, although the config-
uration of performance dimension content for this construct is somewhat different across the dimen-
sion sets. For example, Viswesvaran (1993) has four of his nine dimensions related to this construct
(productivity, job knowledge, quality, and effort), Borman et al. (1994) have 3 of their 12 dimensions
(effort and productivity, job knowledge, and task proficiency) in this category; and Peterson et al.
(1999) divided the construct into performing complex/technical and physical/manual activities.
The third summary construct, useful personal qualities and skills, is more heterogeneous, but
five of the six dimension sets are represented in some fashion. The content varies from Hunt’s (1996)
industriousness and adherence to rules to Borman and Brush’s (1993) persisting to reach goals and
handling crises, to Borman et al.’s (1994) initiative, adaptability, and safety and Campbell et al.’s
(1993) personal discipline and effort (this effort dimension is defined less like productivity and
more like a personal quality compared with the other two effort dimensions). Viswesvaran’s (1993)
compliance dimension is similar to Hunt’s adherence to rules.
Problem solving draws on content from four of the six dimension sets. Three of these four (Borman
& Brush, 1993; Borman et al., 1994; and Peterson et al., 1999) include elements of decision-making
in addition to problem solving; Hunt’s (1996) system defined problem solving more narrowly.
The fifth construct, organizing and planning, also has representation by four of the dimension
sets. Because this construct can be seen as in part management-oriented, it is not surprising that
Borman and Brush’s (1993) managerial taxonomy has several dimensions in this category (i.e.,
administration and paperwork, planning and organizing, monitoring and controlling resources,
and staffing). Viswesvaran (1993), Borman et al. (1994), and Peterson et al. (1999) have a single
administering or organizing dimension. Finally, Campbell et al.’s (1993) management/administra-
tion dimension is broader than organizing and planning but does contain elements relevant to this
construct.
The sixth summary construct is leadership and supervision and is also represented in four of the
dimension sets. Again, as might be expected, the Borman and Brush (1993) managerial taxonomy
has multiple dimensions in this category (coordinating subordinates; guiding, directing, and moti-
vating subordinates; training, coaching, and developing subordinates; and a delegating dimension).
Campbell et al. (1993) have two leadership-related dimensions (supervision/leadership and at least
part of management/administration). We should note that the Hunt (1996) and Borman et al. (1994)
taxonomies were intended for entry-level and nonmanagement jobs, respectively, and thus would
not be expected to contain supervisory or managerial dimensions.
A seventh construct, information processing, had representation from only two systems: (a) infor-
mation and data processing, identifying and evaluating job-relevant information, and looking for
and receiving job-related information from Peterson et al. (1999) and (b) collecting and interpreting
data from Borman and Brush (1993). Hunt’s (1996) dimension set had several dimensions that could
be classified as counterproductive work behaviors. These included off-task behavior, unruliness,
drug misuse, and theft.
We are not advocating our six (or seven or eight) summary construct system as a preferred
dimension set. All of the six dimension sets reviewed have important strengths. What we are advo-
cating, following Campbell et al. (1993) and Campbell, Gasser, and Oswald (1996), is that the field
move toward some performance taxonomy that can be used in personnel selection research to more
systematically study empirical links between individual differences and individual performance
constructs as represented by a job performance taxonomy. A science of personnel selection could
benefit greatly from research using a common set of performance constructs to map individual
difference (e.g., abilities, personality, vocational interests)–job performance relations (Borman,
Hanson, & Hedge, 1997; Campbell et al., 1996). This approach gets us beyond studying correlations
between individual differences and overall job performance. These efforts to conceptually
differentiate criterion constructs and, most importantly, to measure them separately are prerequisites
for systematically studying links between individual predictor and criterion variables in a selection
context.
critical incidents technique, and a clear explication of the task domain. Performance was measured
using multiple indices, including hands-on job sample tests, multiple-choice knowledge tests, and
supervisor/peer ratings of performance on BSS. Army-wide and military occupational specialty
(MOS)-specific scales were developed, and administrative/archival records were also examined. On the basis of exploratory and
confirmatory factor analytic results, five dimensions of performance were specified: core technical
proficiency, general soldiering proficiency, effort and leadership, personal discipline, and physical
fitness and military bearing. The first two factors—core technical proficiency and general soldier-
ing proficiency—clearly represent task performance constructs; substantial loadings were evident
for hands-on performance tests, the job knowledge tests, and supervisor/peer ratings on some of the
technical performance dimensions. Thus, this section focuses on these two performance dimen-
sions. By differentiating between task and citizenship aspects of performance, measuring repre-
sentative and carefully developed predictor variables, and utilizing a large sample size, Project A
provides relatively accurate point estimates of predictor construct-task performance relationships.
The remainder of the chapter will summarize the correlations obtained in the concurrent valida-
tion study of Project A (Campbell & Zook, 1990). Mean validities are based on data from 4,039
incumbents in nine diverse MOS, including infantryman, cannon crewmember, armor crewman,
single channel radio operator, light wheel vehicle mechanic, motor transport operator, administra-
tive specialist, medical specialist, and military police. Validity estimates were corrected for range
restriction and criterion unreliability. Despite the exemplary aspects of Project A, one might ques-
tion the generalizability of the results, given that an Army sample was used. Thus, for each type of
predictor, relevant research conducted with other samples is also discussed.
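The two statistical corrections mentioned here are standard in validation research; the following sketch applies the conventional formulas (Thorndike's Case II for direct range restriction, and disattenuation for criterion unreliability). The function names and numeric values are purely illustrative, not Project A figures.

```python
from math import sqrt

def correct_range_restriction(r, u):
    """Thorndike Case II correction for direct range restriction.

    r: observed validity in the restricted (incumbent) sample.
    u: ratio of unrestricted to restricted predictor SDs (u > 1).
    """
    return (r * u) / sqrt(1 + r**2 * (u**2 - 1))

def correct_criterion_unreliability(r, ryy):
    """Disattenuate a validity coefficient for criterion unreliability ryy."""
    return r / sqrt(ryy)

# Illustrative example (hypothetical values): observed r = .30 in the
# incumbent sample, SD ratio u = 1.5, criterion reliability .80.
r_rr = correct_range_restriction(0.30, 1.5)
r_corrected = correct_criterion_unreliability(r_rr, 0.80)
```

Applying the restriction correction first and then disattenuating for criterion unreliability follows the usual sequence for concurrent validation designs.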
Borman and Motowidlo (1993) suggested that the relationship between cognitive ability and performance is
likely to be stronger for task versus citizenship performance. Thus, because the Project A validity
estimates against the task proficiency criteria represent task performance alone, whereas the validity
estimates reported in other meta-analyses represent validity against overall performance, it follows
that the Project A validities might be higher. In support of this prediction, the relationships between
general cognitive ability and the other three performance factors examined in Project A (which are
more aptly described as indices of citizenship performance) were significantly lower. Specifically,
general cognitive ability correlated .25 with effort and leadership, .16 with personal discipline, and
.20 with physical fitness and military bearing. Although this explanation is somewhat speculative,
one conclusion is clear: General cognitive ability is a robust predictor of task performance.
Researchers have noted that specific aptitudes overlap considerably with
general cognitive ability and that this overlap is one potential explanation for the moderate relation-
ships between spatial, perceptual, and psychomotor abilities and performance criteria. Additionally,
in comparison to general cognitive ability, the validity coefficients for specific aptitudes tend to be
lower and much more variable (Schmidt, 2002), highlighting the utility of using general cognitive
ability rather than specific aptitudes in selection research.
PERSONALITY
Over the past 2 decades, research on the utility of personality in the selection field has received a
great deal of attention (see Chapter 14, this volume). Although early estimates of the relationship
between personality and performance were quite low, more recent results have been more optimis-
tic. Personality researchers generally credit the advent of a well-accepted taxonomy (i.e., the Big
Five) and the increased use of validity generalization techniques (e.g., Barrick & Mount, 1991) for
the recent positive findings.
Although Project A did not utilize the Five-Factor Model in the measurement of personality, it
did find moderate correlations between personality and task performance. Using the Assessment of
Background Life Experiences (ABLE), soldiers completed 11 scales (emotional stability, self-esteem,
cooperativeness, conscientiousness, nondelinquency, traditional values, work orientation, internal
control, energy level, dominance, and physical condition). Seven of the scales combined to form
four composite scores: adjustment, dependability, achievement orientation, and physical condition.
Overall, the mean validities for personality were .25 for both dimensions of task performance. This
is similar to the relationship of .27 reported by Barrick, Mount, and Judge (2001) between conscien-
tiousness and overall job performance in their meta-analysis of 15 earlier meta-analyses, including
such seminal articles as Barrick and Mount (1991) and Tett, Jackson, and Rothstein (1991). Project
A also examined relationships between personality dimensions and performance on the other three
more citizenship oriented criteria, reporting mean correlations of .33, .32, and .37, respectively, for
effort and leadership, personal discipline, and physical fitness and military bearing.
Hurtz and Donovan (2000) and Borman et al. (2001b) conducted other meta-analyses worthy
of mention. Hurtz and Donovan partitioned the criterion domain into three dimensions: task per-
formance, job dedication, and interpersonal facilitation. Their findings indicate the following rela-
tionships between the Big Five and task performance: .15 for conscientiousness, .13 for emotional
stability, -.01 for openness to experience, .07 for agreeableness, and .06 for extraversion. Furthermore,
they found that partitioning the criterion domain into various dimensions had minimal impact on
the magnitude of the validity coefficients. This contrasts with the results of Project A, which found
higher mean validities for the citizenship performance dimensions (.32 to .37) compared to the task
performance dimensions (.25). Similarly, in a review of 20 studies, Borman et al. (2001b) reported
that the relationship between conscientiousness and citizenship performance was higher than the
relationship between conscientiousness and task performance. Despite such findings, researchers
tend to agree that dispositional factors other than conscientiousness are not strongly related to citi-
zenship performance after controlling for common method variance (Podsakoff et al., 2000). Thus,
despite Borman and Motowidlo’s (1993) proposition that personality variables are more likely to
relate to contextual performance than task performance, researchers examining this question have
reported mixed findings.
VOCATIONAL INTERESTS
Another set of predictors investigated in Project A was vocational interests. On the basis of Holland’s
(1966) Basic Interests Constructs, as well as six different organizational climate scales, Project A
researchers developed the Army Vocational Interest Career Examination (AVOICE). Twenty-two
scales make up AVOICE, forming six composite scores: skilled technical, structural/machines,
combat-related, audiovisual arts, food service, and protective services. Across the six composites,
458 Handbook of Employee Selection
vocational interests related to core technical proficiency (.35) and general soldiering proficiency
(.34). In contrast, Schmidt and Hunter (1998), citing Holland (1986), noted that there is generally
no relationship between interests and job performance. Although they considered this finding somewhat
surprising, they hypothesized that interests may affect one's choice of job but, once a job is chosen,
have little effect on performance.
More recently, Morris (2003) conducted a meta-analysis of 93 studies, reporting a mean cor-
rected correlation of .29 between vocational interests and job performance. Interestingly, larger
effect sizes were observed when studies used task performance as the criterion. The reason for this
finding is unclear, but it does mirror the Project A results. Specifically, the correlations between
vocational interests and performance were higher for task performance dimensions (.34 to .35)
compared with citizenship performance dimensions (.12 to .24).
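As a point of reference for readers less familiar with meta-analytic conventions, "corrected" correlations such as Morris's .29 are disattenuated estimates. A minimal illustration of the standard correction for criterion unreliability (one of several artifacts that Hunter-Schmidt procedures adjust for; the specific corrections applied in any given meta-analysis may differ) is:

```latex
% Estimated operational validity, correcting the observed
% predictor-criterion correlation for unreliability in the criterion:
%   r_obs : observed predictor-criterion correlation
%   r_yy  : reliability of the criterion measure (e.g., supervisor ratings)
\[
  \hat{\rho} = \frac{r_{\mathrm{obs}}}{\sqrt{r_{yy}}}
\]
% Illustrative values: with r_obs = .20 and r_yy = .52,
% the corrected estimate is .20 / sqrt(.52), approximately .28.
```

The values in the comment are hypothetical and chosen only to show the direction of the adjustment: because criterion unreliability attenuates observed correlations, corrected validities are always at least as large as observed ones.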
Overall, Project A research supports quite strong relationships between task performance and general
cognitive ability, spatial ability, and perceptual/psychomotor ability. Importantly, these correlations
are higher when task performance is the criterion than when overall performance is the criterion, even
though overall performance is the criterion almost always used in meta-analyses of these predictors' validities.
The explanation we offered for this finding is the consistent trend in the literature of combining
task and citizenship performance into one overall factor, thus reducing the validities of these pre-
dictors. Relations for personality and vocational interests with task performance are more modest
but still far from trivial (mid .20s for personality and mid .30s for vocational interests). Thus, the
largest mean validities were observed for the maximum performance can-do factors (i.e., GMA,
spatial ability, and perceptual/psychomotor ability), whereas the typical performance or will-do
factors (i.e., personality and vocational interests) exhibited more modest validities against task
performance.
Because of the careful explication of the predictor and criterion variables, as well as the large
sample size, we can be confident that Project A provides reasonable point estimates of the relation-
ships between these constructs. Although the generalizability of the results is questionable, meta-
analyses examining more diverse samples also support the significance of these relationships, albeit
with generally smaller magnitudes.
REFERENCES
Adams-Webber, J. (1979). Intersubject agreement concerning relationships between the positive and negative
poles of constructs in repertory grid tests. British Journal of Medical Psychology, 52, 197–199.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77,
836–874.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new
millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9–30.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston,
MA: Kent.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review,
6, 205–212.
Bernardin, H. J., & Pence, E. C. (1980). Effects of rater training: Creating new response sets and decreasing
accuracy. Journal of Applied Psychology, 65, 60–66.
Bernardin, H. J., & Smith, P. C. (1981). A clarification of some issues regarding the development and use of
behaviorally anchored rating scales (BARS). Journal of Applied Psychology, 66, 458–463.
Bertua, C., Anderson, N., & Salgado, J. F. (2005). The predictive validity of cognitive ability tests: A UK meta-
analysis. Journal of Occupational and Organizational Psychology, 78, 387–409.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied
Psychology, 64, 410–421.
Borman, W. C. (1987). Personal constructs, performance schemata, and “folk theories” of subordinate effective-
ness: Explorations in an Army officer sample. Organizational Behavior and Human Decision Processes,
40, 307–322.
The Measurement of Task Performance as Criteria in Selection Research 459
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.),
Handbook of industrial and organizational psychology (Vol. 2, pp. 271–326). Palo Alto, CA: Consulting
Psychologists Press.
Borman, W. C., Ackerman, L. D., & Kubisiak, U. C. (1994). Development of a performance rating program in
support of Department of Labor test validation research. Unpublished manuscript.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of managerial performance require-
ments. Human Performance, 6, 1–21.
Borman, W. C., Buck, D. E., Hanson, M. A., Motowidlo, S. J., Stark, S., & Drasgow, F. (2001a). An examina-
tion of the comparative reliability, validity, and accuracy of performance ratings made using computer-
ized adaptive rating scales. Journal of Applied Psychology, 86, 965–973.
Borman, W. C., Hanson, M. A., & Hedge, J. W. (1997). Personnel selection. In J. T. Spence, J. M. Darley,
& D. J. Foss (Eds.), Annual review of psychology (Vol. 48, pp. 299–337). Palo Alto, CA: Annual Reviews.
Borman, W. C., Jeanneret, P. R., Kubisiak, U. C., & Hanson, M. A. (1996). Generalized work activities:
Evidence for the reliability and validity of the measures. In N. G. Peterson, M. D. Mumford, W. C.
Borman, P. R. Jeanneret, & E. A. Fleishman (Eds.), O*NET final technical report. Salt Lake City, UT:
Utah Department of Employment Security.
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual
performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98).
San Francisco, CA: Jossey-Bass.
Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001b). Personality predictors of citizenship
performance. International Journal of Selection and Assessment, 9, 52–69.
Borman, W. C., White, L. A., & Dorsey, D. W. (1995). Effects of ratee task performance and interpersonal fac-
tors on supervisor and peer performance ratings. Journal of Applied Psychology, 80, 168–177.
Campbell, J. P. (1990a). An overview of the Army Selection and Classification Project (Project A). Personnel
Psychology, 43, 231–239.
Campbell, J. P. (1990b). Modeling the performance prediction problem in industrial and organizational psy-
chology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychol-
ogy (2nd ed., Vol. 1, pp. 687–732). Palo Alto, CA: Consulting Psychologists Press.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., & Weick, K. E. (1970). Managerial behavior, performance, and
effectiveness. New York, NY: McGraw-Hill.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of performance variability. In
K. R. Murphy (Ed.), Individual differences and behavior in organizations. San Francisco, CA: Jossey-
Bass.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt &
W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass.
Campbell, J. P., & Zook, L. M. (Eds.). (1990). Improving the selection, classification, and utilization of Army
enlisted personnel: Final report on Project A (ARI Research Report 1597). Alexandria, VA: U.S. Army
Research Institute for the Behavioral and Social Sciences.
DeVries, D. L., Morrison, A. M., Shullman, S. L., & Gerlach, M. L. (1981). Performance appraisal on the line.
New York, NY: Wiley.
Farr, J. L., & Levy, P. E. (2007). Performance appraisal. In L. L. Koppes (Ed.), Historical perspectives in indus-
trial and organizational psychology (pp. 311–327). Mahwah, NJ: Lawrence Erlbaum.
Flanagan, J. C. (1954). The critical incident technique. Psychological Bulletin, 51, 327–358.
Gatewood, R. D., & Feild, H. S. (2001). Human resource selection (5th ed.). Fort Worth, TX: Harcourt College
Publishers.
Ghiselli, E. E. (1956). Dimensional problems of criteria. Journal of Applied Psychology, 40, 1–4.
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34.
Guion, R. M. (1965). Personnel testing. New York, NY: McGraw-Hill.
Guion, R. M. (1978). Principles of work sample testing: III. Construction and evaluation of work sample tests.
Alexandria, VA: U.S. Army Research Institute.
Holland, J. L. (1986). New directions for interest testing. In B. S. Plake & J. C. Witt (Eds.), The future of testing
(pp. 245–267). Hillsdale, NJ: Lawrence Erlbaum.
Holland, J. L. (1966). The psychology of vocational choice: A theory of personality types and model environ-
ments. Oxford, England: Blaisdell.
Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of
job performance and training success in Germany: A meta-analysis. International Journal of Selection
and Assessment, 15, 3–18.
Hunt, S. T. (1996). Generic work behavior: An investigation into the dimensions of entry-level, hourly job
performance. Personnel Psychology, 49, 51–83.
Hunter, D. R., & Burke, E. F. (1994). Predicting aircraft pilot training success: A meta-analysis of published
research. The International Journal of Aviation Psychology, 4, 297–313.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity
generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Department of Labor,
Employment Service.
Hunter, J. E. (1986). Cognitive ability, cognitive aptitudes, job knowledge, and job performance. Journal of
Vocational Behavior, 29, 340–362.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Hunter, J. E., & Schmidt, F. L. (1996). Intelligence and job performance: Economic and social implications.
Psychology, Public Policy, and Law, 2, 447–472.
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of
Applied Psychology, 85, 869–879.
Jensen, A. R. (1986). g: Artifact or reality? Journal of Vocational Behavior, 29, 301–331.
Kanfer, R., & Borman, W. C. (1987). Predicting salesperson performance: A review of the literature (Research
Note 87-13). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, theory and applications.
New York, NY: Academic Press.
Latham, G. P., & Wexley, K. N. (1981). Increasing productivity through performance appraisal. Reading, MA:
Addison-Wesley.
Levine, E. L., Spector, P. E., Menon, S., Narayanan, L., & Cannon-Bowers, J. (1996). Validity generalization
for cognitive, psychomotor, and perceptual tests for craft jobs in the utility industry. Human Performance,
9, 1–22.
Martinussen, M. (1996). Psychological measures as predictors of pilot performance: A meta-analysis. The
International Journal of Aviation Psychology, 6, 1–20.
McCormick, E. J. (1976). Job and task analysis. In M. D. Dunnette (Ed.), Handbook of industrial and organi-
zational psychology (pp. 651–696). Chicago, IL: Rand McNally.
McCormick, E. J., Jeanneret, P. R., & Mecham, R. C. (1972). A study of job characteristics and job dimensions
as based on the position analysis questionnaire (PAQ). Journal of Applied Psychology, 56, 347–368.
Morris, M. A. (2003). A meta-analytic investigation of vocational interest-based job fit, and its relationship to
job satisfaction, performance, and turnover. Unpublished doctoral dissertation, University of Houston,
Houston, TX.
Motowidlo, S. J., Borman, W. C., & Schmit, M. J. (1997). A theory of individual differences in task and contextual
performance. Human Performance, 10, 71–83.
Motowidlo, S. J., & Van Scotter, J. R. (1994). Evidence that task performance should be distinguished from
contextual performance. Journal of Applied Psychology, 79, 475–480.
Mount, M. K., Oh, I., & Burns, M. (2008). Incremental validity of perceptual speed and accuracy over general
mental ability. Personnel Psychology, 61, 113–139.
Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation?
Journal of Applied Psychology, 67, 562–567.
Noonan, L., & Sulsky, L. M. (2001). Examination of frame-of-reference and behavioral observation training on
alternative training effectiveness criteria in a Canadian military sample. Human Performance, 14, 3–26.
Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of
Applied Psychology, 79, 845–851.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., & Fleishman, E. A. (Eds.). (1999). An occupational
information network (O*NET). Washington, DC: American Psychological Association.
Podsakoff, P. M., MacKenzie, S. B., & Hui, C. (1993). Organizational citizenship behaviors and managerial
evaluations of employee performance: A review and suggestions for future research. In G. R. Ferris &
K. M. Rowland (Eds.), Research in Personnel and Human Resources Management (Vol. 11, pp. 1–40).
Greenwich, CT: JAI Press.
Podsakoff, P. M., MacKenzie, S. B., Paine, J. B., & Bachrach, D. G. (2000). Organizational citizenship behaviors:
A critical review of the theoretical and empirical literature and suggestions for future research. Journal of
Management, 26, 513–563.
Pulakos, E. D. (1984). A comparison of rater training programs: Error training and accuracy training. Journal
of Applied Psychology, 69, 581–588.
Ree, M. J., & Earles, J. A. (1992). Intelligence is the best predictor of job performance. Current Directions in
Psychological Science, 1, 86–89.
Ree, M. J., Earles, J. A., & Teachout, M. (1994). Predicting job performance: Not much more than g. Journal of
Applied Psychology, 79, 518–524.
Rosenberg, S., & Sedlak, A. (1972). Structural representations of perceived personality trait relationships. In
A. K. Romney, R. N. Shepard, & S. B. Nerlove (Eds.), Multidimensional scaling (pp. 134–162). New
York, NY: Seminar.
Rothe, H. F. (1978). Output rates among industrial employees. Journal of Applied Psychology, 63, 40–46.
Schmidt, F. L. (2002). The role of general cognitive ability in job performance: Why there cannot be a debate.
Human Performance, 15, 187–210.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American
Psychologist, 36, 1128–1137.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology:
Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43,
627–670.
Schwab, D. P., Heneman, H. G., & DeCotiis, T. (1975). Behaviorally anchored rating scales: A review of the
literature. Personnel Psychology, 28, 549–562.
Smith, D. E. (1986). Training programs for performance appraisal: A review. Academy of Management Review,
11, 22–40.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of
unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155.
Stark, S., & Drasgow, F. (2002). An EM approach to parameter estimation for the Zinnes and Griggs paired
comparison IRT model. Applied Psychological Measurement, 26, 208–227.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A
meta-analytic review. Personnel Psychology, 44, 703–742.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.
Viswesvaran, C. (1993). Modeling job performance: Is there a general factor? Unpublished doctoral dissertation,
University of Iowa, Iowa City, IA.
Werner, J. M. (1994). Dimensions that make a difference: Examining the impact of in-role and extra-role
behaviors on supervisory ratings. Journal of Applied Psychology, 79, 98–107.
Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal
of Occupational and Organizational Psychology, 67, 189–205.
22 Adaptive and Citizenship-Related Behaviors at Work
CONCEPTUALIZATION
Macro-level trends such as globalization, technology, demographic shifts, and alternative work
structures have led researchers and practitioners to challenge traditional definitions of individual
work performance (Ilgen & Pulakos, 1999). These definitions have shifted in two major ways: toward
performance in interdependent work contexts and toward performance in uncertain work contexts
(Griffin, Neal, & Parker, 2007). Thus, a complete understanding of performance must address how individuals adapt
to ever-changing situations and how individuals perform within given social settings and social
systems. In this chapter, we explore such expanded definitions of work performance by considering
what we know (and what we do not know) about adaptive and organizational citizenship-related
behaviors and how this knowledge might be used to inform selection.
Implicit in our effort to highlight adaptive and citizenship behavior is the assumption that such
behaviors are in some ways unique from traditional task performance. Although we argue in vari-
ous ways throughout this chapter that this is true, we acknowledge that the boundaries among such
performance components are fuzzy. It has been argued that neither adaptive nor citizenship performance
is mutually exclusive of task performance, and some conceptual and empirical overlap should be
expected (Griffin et al., 2007; Johnson, 2003; Schmitt, Cortina, Ingerick, & Wiechmann, 2003).
Moreover, it has been demonstrated that differences in specific job requirements can
drive the relative importance (and profile) of various performance components (Pulakos, Arad,
Donovan, & Plamondon, 2000). See Chapter 21, this volume, for a detailed discussion about task
performance.
For the purposes of a selection volume, it is sufficient to observe that one of the reasons for dis-
tinguishing adaptive performance and citizenship performance from task performance is that they
have different determinants.
Interest in adaptability at the individual level has generated several important streams of research.
First, researchers have proposed that adaptive performance is a unique performance construct,
distinguishable from the well-documented components of task and contextual performance
(Campbell, 1999; Hesketh & Neal, 1999; Pulakos et al., 2000). Second, in attempting to
understand (and predict) adaptive performance, researchers have begun investigating the individual
differences that enable various types of adaptive proficiency and performance (Pulakos, Schmitt,
Dorsey, Hedge, & Borman, 2002). Such research has produced an eight-dimension taxonomy of
adaptive performance (Pulakos et al., 2000) and some initial candidates for important individual
difference determinants (Kozlowski et al., 2001; LePine, Colquitt, & Erez, 2000; Pulakos et al.,
2002; Stewart & Nandkeolyar, 2006). However, this body of research is still in its infancy, and a
host of critical unanswered questions remains. For example, it is unclear whether all dimensions
of adaptive performance are in fact distinct from elements of task and contextual performance
(Johnson, 2001).
For the purpose of this chapter, we broadly define adaptive work-related behavior as an effective
change in response to an altered situation (White et al., 2005). Consequently, adaptive performance
entails the capability both to react effectively when the situation changes and to proactively alter the
situation when current circumstances would otherwise prevent goal attainment.
We note briefly two points associated with this definition. First, in the case of proactive adapta-
tion, the altered situation is perceptual rather than environmental. A person perceives that desired
goals could not be met given current circumstances, so the environment must be shaped to achieve
intended objectives. Second, in our view, performing adaptively, in a manner consistent with this
definition, is meaningful at levels of analysis beyond individual performers. As suggested through-
out the literature (e.g., Burke, Pierce, & Salas, 2006; Chen, Thomas, & Wallace, 2005; Griffin
et al., 2007), the capacity to perform adaptively is a multilevel organizational process or phenom-
enon, which is enabled or facilitated by a host of organizational factors and interventions. Here, we
focus on the critical importance of selecting (hiring) individuals with the necessary knowledge,
skills, abilities, and other attributes (KSAOs) that enable adaptive performance.
In terms of criterion definition, Pulakos et al. (2000) developed a taxonomy of adaptive job
performance that expands upon previous models of the performance domain (e.g., Campbell, McCloy,
Oppler, & Sager, 1993). Two studies were conducted to develop and refine this taxonomy. First,
more than 1,000 critical incidents from 21 different jobs (military and nonmilitary) were content-
analyzed, yielding an eight-dimension taxonomy of adaptive performance. Second, this taxonomy
was investigated empirically via the development and administration of a Job Adaptability Inventory
(JAI)—an instrument designed to describe the adaptability requirements of jobs. Exploratory factor
analyses of JAI data from 1,619 respondents across 24 jobs yielded an eight-factor solution that
mirrored the hypothesized eight-dimension taxonomy. Subsequent confirmatory factor analysis
(using a separate subsample) indicated a good fit for the eight-factor model. The eight dimensions of
adaptive performance are as follows (this research is further highlighted in Pulakos et al., 2000):
handling emergencies or crisis situations; handling work stress; solving problems creatively; dealing
with uncertain and unpredictable work situations; learning work tasks, technologies, and procedures;
demonstrating interpersonal adaptability; demonstrating cultural adaptability; and demonstrating
physically oriented adaptability.
This initial research highlighted two particularly important points, namely that (a) adaptive perfor-
mance is a multidimensional construct and (b) individual jobs or organizational roles have unique
profiles of adaptability requirements, which vary predictably along the eight adaptability dimen-
sions. Both of these findings hold implications for personnel selection.
Note also that the eight dimensions subsume other constructs such as creativity/innovation, learn-
ing, and handling stress. Consistent with our overarching definition of adaptability (an effective
change in response to an altered situation), the eight dimensions in essence point out categories of
responses (e.g., learning) and categories of situations (e.g., intercultural dilemmas) that may appear
in specific job requirements or criterion definitions.
Some authors have argued that most of the dimensions in the Pulakos et al. (2000) eight-dimension
model are, in essence, dimensions of technical and contextual performance (Johnson, 2003). Thus,
some researchers have narrowed the definition of adaptive performance to focus primarily on
Pulakos et al.’s (2000) concept of “dealing with uncertain work situations” as a core element of this
construct (Griffin et al., 2007; Johnson, 2003). The primary definition that we provide above fits
with this line of thinking; dealing with an uncertain situation is roughly the same as our overarching
definition of an effective change in response to an altered situation.
Such debates about the substantive nature of adaptive performance raise the question "How,
specifically, is adaptive performance different from typical task performance?" Although some
aspects of adaptability may look similar to routine technical performance, adaptation may involve
doing the same activity to a greater degree, with greater intensity, or in a substantially different way.
Thus, distinctions between adaptive and technical performance may center on definitions of typical
versus maximal performance (DuBois, Sackett, Zedeck, & Fogli, 1993; Ployhart & Bliese, 2006).
In addition, as pointed out above, we view the eight dimensions as characterizing types of responses
and/or types of situations that may be relevant to various performance component definitions (which
may be technical or contextual in nature). In line with this view, Ployhart and Bliese (2006) suggested
that adaptability is a characteristic driven by the environment, not inherent in the criterion construct
domain, and that any task can have an adaptive component. The question is whether existing criterion
measures sufficiently capture such dynamic and adaptive performance requirements.
CITIZENSHIP DEFINED
Citizenship performance traces its conceptual lineage back to Barnard (1938), Katz (1964), and,
more recently, Organ and colleagues, who first coined the term “organizational citizenship behav-
ior,” or OCB (Bateman & Organ, 1983; Smith, Organ, & Near, 1983). Since its introduction, over 30
potential variants of OCB have arisen (Podsakoff, MacKenzie, Paine, & Bachrach, 2000), including
umbrella terms such as "contextual performance" (Borman & Motowidlo, 1993), recently renamed
"citizenship performance" (Borman & Penner, 2001; Borman, Penner, Allen, & Motowidlo, 2001;
Coleman & Borman, 2000); prosocial organizational behavior (Brief & Motowidlo, 1986); and
organizational participation (Graham, 1991). Many other variables, although
not variants of citizenship, are clearly linked to its dimensions. For example, organizational iden-
tification overlaps with the representing, loyalty, and compliance dimensions of citizenship, and
socialization is linked with helping and compliance.
Although most studies of citizenship refer to citizenship behavior, we prefer the term citizenship
performance because it emphasizes that citizenship behavior has a quality dimension: some
citizenship behaviors are more successful than others. Recognizing quality in citizenship is
necessary to appreciate the importance of knowledge and skill in predicting citizenship.
Consistent with Borman and Motowidlo (1993), we define citizenship performance as activi-
ties that support the broader environment in which an organization’s technical core must func-
tion. Citizenship performance has many subdimensions, and there have been varied attempts
to identify them (e.g., Borman & Motowidlo, 1993; Smith et al., 1983; Williams & Anderson,
1991). In this chapter, we use the most detailed of these—that of Borman et al. (2001a). We
made this choice recognizing that there is considerable overlap among some of the subdimen-
sions in this model. However, we felt that reliance on a model with subtle distinctions would be
more appropriate because it would allow us to suggest potential differences in phenomenology.
If subsequent research suggests that there is little value in these distinctions, then future reviews
can collapse them.
The Borman et al. (2001a) model contains three subdimensions of citizenship performance,
each of which can be broken down further into facets. Table 22.1 contains detailed definitions of
the facets of the model. The three subdimensions are Personal Support, Organizational Support,
and Conscientious Initiative.

TABLE 22.1
Model Facets of Citizenship Performance

Personal Support
Helping: Helping others by offering suggestions about their work, showing them how to accomplish difficult tasks, teaching them useful knowledge or skills, directly performing some of their tasks, and providing emotional support for personal problems
Cooperating: Cooperating with others by accepting their suggestions, following their lead, putting team objectives over personal interests, and informing others of events or requirements that are likely to affect them
Courtesy: Showing consideration, courtesy, and tact in relations with others
Motivating: Motivating others by applauding their achievements and successes, cheering them on in times of adversity, showing confidence in their ability to succeed, and helping them overcome setbacks

Organizational Support
Representing: Representing one's organization favorably to outsiders by defending it when others criticize it, promoting its achievements and positive attributes, and expressing one's own satisfaction with the organization
Loyalty: Showing loyalty by staying with one's organization despite temporary hardships, tolerating occasional difficulties, handling adversity cheerfully and without complaining, and publicly endorsing and supporting the organization's mission and objectives
Compliance: Complying with organizational rules and procedures, encouraging others to comply with organizational rules and procedures, and suggesting procedural, administrative, or organizational improvements

Conscientious Initiative
Self-development: Developing one's own knowledge and skills by taking courses on one's own time, volunteering for training and development opportunities offered within the organization, and trying to learn new knowledge and skills on the job from others or through new job assignments
Initiative: Taking the initiative to do all that is necessary to accomplish team or organizational objectives even if not typically a part of one's own duties, correcting nonstandard conditions whenever encountered, and finding additional work to perform when one's own duties are completed
Persistence: Persisting with extra effort despite difficult conditions and setbacks, accomplishing goals that are more difficult and challenging than normal, completing work on time despite unusually short deadlines, and performing at a level of excellence that is significantly beyond normal expectations

Source: Adapted from Borman, W. C., Buck, D. E., Hanson, M. A., Motowidlo, S. J., Stark, S., & Drasgow, F., Journal of Applied Psychology, 86, 965–973, 2001.

The first subdimension, Personal Support, has similarities with OCB-individual (Williams & Anderson, 1991), social participation (Van Dyne, Graham, & Dienesch,
1994), interpersonal facilitation (Van Scotter & Motowidlo, 1996), and incorporates Organ’s (1988)
altruism and courtesy dimensions (Coleman & Borman, 2000). Personal Support comprises
four facets: helping, cooperating, showing courtesy and consideration, and motivating others.
The second dimension, Organizational Support, is similar to OCB-organizational (Williams &
Anderson, 1991) and incorporates sportsmanship and civic virtue (Organ, 1988) and loyalty and
obedience (Van Dyne et al., 1994). It comprises three facets: representing the organization
favorably, showing loyalty, and complying with organizational rules and procedures.
The final dimension, Conscientious Initiative, is similar to functional participation (Van Dyne
et al., 1994) and job dedication (Van Scotter & Motowidlo, 1996). It comprises three facets:
persisting with extra effort, showing initiative, and engaging in self-development.
Clearly, there is overlap among these facets. Indeed, a recent meta-analysis by Hoffman, Blair,
Meriac, and Woehr (2007) suggested that the citizenship domain is best characterized as a single
higher-order factor. Specifically, they suggested that the subdimensions of citizenship (i.e., coop-
eration, courtesy, etc.) are indicators of a more general citizenship construct. On the other hand,
these dimensions seem to be conceptually distinct (see Table 22.1). For example, Machiavellianism
often involves being courteous without being helpful or cooperative. These facets can be further
distinguished by the fact that they have different consequences. However, most important for our
purposes is the fact that they have different individual and situational determinants. It remains to
be seen whether the covariance between subdimensions reported by Hoffman et al. (2007) results from
halo or common method effects or from the true nature of citizenship dimensions as reflections of a
higher-order construct (see also LePine, Erez, & Johnson, 2002). Before discussing determinants and conse-
quences, a discussion of the nature of citizenship is in order.
Most empirical studies of citizenship, and of performance generally, assign a single score to each
participant in the same way that studies of cognitive ability assign a single ability score to each par-
ticipant. The implication is that a single value can represent the standing of a person on the stable
construct, citizenship.
Ilies, Scott, and Judge (2006) used event sampling to generate daily reports of job attitudes
and citizenship and found that much of the variance in citizenship was within person. As will be
explained later, these authors found that this within-person variance was explained by other within-
person factors. For the moment, we merely wish to point out that the Ilies et al. (2006) findings cast
doubt on the practice of using a single value to represent citizenship and on the conceptualization
that this practice implies.
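The within-person variance described by Ilies et al. (2006) can be quantified with a simple one-way variance decomposition. The sketch below uses entirely hypothetical event-sampling data (the sample sizes, means, and variances are illustrative assumptions, not values from the study) to estimate the proportion of variance in daily citizenship reports that lies within rather than between persons:

```python
import numpy as np

def within_person_share(scores_by_person):
    """Proportion of total variance in repeated ratings that lies
    within persons (day to day) rather than between persons."""
    all_scores = np.concatenate(scores_by_person)
    grand_mean = all_scores.mean()
    # Between-person sum of squares: each person's mean vs. the grand mean
    ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in scores_by_person)
    # Within-person sum of squares: daily scores vs. that person's own mean
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in scores_by_person)
    return ss_within / (ss_between + ss_within)

rng = np.random.default_rng(0)
# Hypothetical event-sampling data: 50 employees, 10 daily citizenship
# ratings each; modest trait differences plus larger daily fluctuation
daily = [rng.normal(loc=rng.normal(3.0, 0.5), scale=1.0, size=10) for _ in range(50)]
print(round(within_person_share(daily), 2))
```

A large value of this share, as in the simulated data above, is the pattern that casts doubt on assigning each person a single stable citizenship score.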
Predictors of performance often do not have simple cause-effect relationships with performance.
Rather, some predictors serve as "boundary conditions" for other relationships, whereas others "act
through" additional variables. These two patterns are known as "moderation" (boundary condition)
and "mediation" (acting through).
As we discuss in a later section, Ilies et al. (2006) found that stable individual differences “mod-
erated” within-person variability in citizenship behavior. In this case, levels of a stable trait act as
boundary conditions such that within-person variability is only important when an employee does
not have high levels of this trait. Practically speaking, this would suggest that hiring individuals
with higher levels of this trait should make for more consistent citizenship performance (as opposed
to simply more citizenship performance). This is because at high levels of the trait, within-person
causes (e.g., affect, attitudes) are less relevant.
Mediation, on the other hand, is somewhat simpler. In mediation, a third variable essentially
"speaks for" the relation between the other two. For instance, a relationship between age
and performance may exist not because age causes performance, but because age is related to relevant
job experiences, which in turn cause performance. Thus, the relationship of age with performance
acts through its relationship with experience. For practice, this suggests that one can use variables
that are mediated by others as proxies for the mediator variable of interest. In so doing, one should
remember that proxies are not the variable of direct interest, and their use may have legal implica-
tions (i.e., one would be ill advised to select for age, presuming it is a sufficient proxy for relevant
job experience).
TABLE 22.2
Predictors of Adaptability and Citizenship

Adaptability: Distal predictors

Cognitive ability. High-g people have more cognitive resources to devote to adaptive situations, allowing them to determine appropriate courses of action. (Pulakos et al., 2002)

Conscientiousness/resiliency. High-conscientiousness people are prepared to struggle through uncertainty to achieve and are more likely to engage in task pursuit. On the other hand, they may be less willing to revise their approach. (LePine et al., 2000; Pulakos et al., 2002; Stewart & Nandkeolyar, 2006)

Openness to experience. High-openness people are better prepared to revise initial approaches on the basis of experience. On the other hand, they may be less inclined to pursue formal objectives in the first place. (LePine et al., 2000; Pulakos et al., 2002; Stewart & Nandkeolyar, 2006)

Age. The role of age is unclear. Although physical adaptability may be more difficult for older workers, their experience may help them to appreciate the need for other kinds of adaptability. (DeArmond et al., 2006)

Citizenship: Distal predictors

Conscientiousness. High-conscientiousness people are more likely to recognize the need for and follow through on citizenship behaviors. "Duty" is positively related to "taking charge," whereas achievement striving is negatively related. (Borman et al., 2001b; Moon et al., 2008; Organ & Ryan, 1995)

Prosocial personality. Prosocial people are more empathic and more interested in engaging in activities that benefit others or the organization. (Borman et al., 2001b)

Collectivism. People high in collectivism feel a shared fate with their groups and are more concerned with group outcomes. (Jackson et al., 2006)

Narcissism. Narcissism leads one to engage in less citizenship and to inflate self-ratings of citizenship. (Judge et al., 2006)

Motives. Prosocial value motives and empathy predict OCB-I, whereas organizational concern motives and reciprocation wariness predict OCB-O. (Kamdar et al., 2006; Rioux & Penner, 2001)

Concern over future consequences. People high in CFC are less likely to engage in OCB if the term of employment is unclear. (Joireman et al., 2006)
Other variables may also be implicated in effective adaptive performance. For example, the broader literature around coping
may suggest a host of variables that play a role in adaptive behavior (Ployhart & Bliese, 2006).
It is also noteworthy that some authors have posited an entirely different view of adaptability-
related individual differences. For example, Ployhart and Bliese (2006) offered a theory of adapt-
ability in which adaptability is viewed as a higher-order compound trait, determined by distal
KSAOs. To date, there is limited research on these issues.
Before moving on, one factor worth noting is the role of demographic factors such as age.
Although we are unaware of any large-scale studies that definitively link age to adaptive performance,
perceptions exist that older workers are less adaptable (e.g., see DeArmond et al., 2006). Although
there are certainly some task environments in which adaptability requirements may bound adaptive
performance among older workers (e.g., physical adaptability requirements), it is equally likely that
older workers excel on other dimensions of adaptive performance because of the wealth of experi-
ence that they have acquired. Possible reasons for increased adaptability among older workers on
some adaptive performance dimensions include (a) an increased probability of having acquired
coping skills, styles, and mechanisms to deal with stress, frustration, and change; and (b) a larger
store of acquired experiences that forms a set of possible adaptive responses from which one can draw. As with most of the
areas of predictor research mentioned above, more research is needed to further our understanding
of the determinants of adaptive performance.
Similarly, Kamdar, McAllister, and Turban (2006) found that empathy (defined as empathetic
concern and perspective taking) was a better predictor of individual-level citizenship behaviors,
whereas reciprocation wariness was a better predictor of citizenship toward the organization. These
latter findings reflect an effort to understand citizenship through motives related to social exchange
theory (SET; Blau, 1964; Cropanzano & Mitchell, 2005). In SET, an individual's motives are thought
to be contingent upon relational obligations and duties formed through exchanges of resources with
other individuals or groups. SET posits that two exchanging parties are likely to build high-quality
relationships characterized by trust and commitment to one another. From this perspective, citizen-
ship is a form of resource exchange with an organization or other individual. Accordingly, Kamdar
et al. suggested that empathetic individuals tend to define their job role in terms of citizenship and
are willing to exchange their citizenship with coworkers and the organization alike, whereas workers
who are more concerned about reciprocation from their exchange partner will only exchange with
the more powerful and resource-rich organization where there is a higher likelihood of reward.
Also from a social exchange perspective, Joireman, Kamdar, Daniels, and Duell (2006) found
that "concern over future consequences" (CFC) also predicts citizenship. Because the benefits of
engaging in citizenship behavior accrue over the longer term, individuals who are unconcerned with
long-term benefits are likely to engage in citizenship behavior irrespective of their expected length
of tenure with an employer. Conversely, individuals who are higher in CFC will withhold citizenship
in cases where their expected tenure is short because of the lack of perceived benefits. The importance
of this finding for selection is that screening for citizenship will be ineffective for high-CFC
applicants if the term of employment is unclear or known to be short.
Bowler and Brass (2006) found that individuals were more likely to both give and receive OCB
from those to whom they had a relational tie or were connected through a friend (third-party ties)
in a social network. Workers in powerful structural positions (e.g., individuals who mediated links
between groups of individuals) were also more likely to receive OCB from others. These findings
were bolstered by a later study by Venkataramani and Dalal (2007), who found that having a direct or
indirect (third-party) positive affective tie led to increased instances of interpersonal helping. This
relationship was also affected by the amount of contact. Thus, the more frequently the individuals
were exposed to one another, the more likely they were to help. Taken together, these studies sug-
gest that friendship, direct and indirect, can increase the likelihood of giving and receiving helping
behavior from coworkers. In fostering a genial workplace, an organization may then reap the ben-
efits of citizenship performance.
Informational privacy has also been related to increased citizenship behavior (Alge, Ballinger,
Tangirala, & Oakley, 2006). Increased privacy of personal information leads to psychological
empowerment (or intrinsic motivation) because information control is expected to lead to feelings
of self-determination and value of membership in the organization. Informational privacy is then
related to citizenship, but empowerment mediates this relationship. Thus, establishing policy that
maintains confidentiality of sensitive employee information is a way in which an organization can
increase feelings of employee empowerment.
IMMEDIATE/PROXIMAL DETERMINANTS—ADAPTABILITY
If basic distal individual characteristics such as abilities, personality traits, and past experience
serve as determinants of adaptive performance, one can ask “What are more immediate or proxi-
mal determinants?” In contrast to distal predictors, we would expect proximal predictors to have
a more direct influence on performance behavior. Here, extant literature suggests that factors such
as knowledge, self-efficacy, and skill all predict adaptive performance (e.g., Kozlowski et al., 2001).
To understand selection of those who are high in adaptive performance and good citizens, we must
understand these more proximal predictors. Moreover, there is evidence that such proximal pre-
dictors produce incremental validity over distal factors (Ford, Smith, Weissbein, Gully, & Salas,
1998; Kozlowski et al., 2001). Thus, including variables that are “closer” on the causal chain to
performance with distal predictors can improve our predictions of performance. This is noteworthy
in that the distal predictors themselves are thought to be causes of the proximal predictors (e.g.,
cognitive ability causes knowledge acquisition, which in turn drives performance).
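The incremental validity of proximal over distal predictors can be illustrated with a small simulation. The causal chain, variable names, and effect sizes below are hypothetical assumptions chosen to mirror the ability-to-knowledge-to-performance example: regressing performance on the distal predictor alone, then adding the proximal mediator, shows the gain in explained variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Hypothetical causal chain: cognitive ability -> job knowledge -> performance
ability = rng.normal(size=n)
knowledge = 0.6 * ability + rng.normal(size=n)
performance = 0.7 * knowledge + rng.normal(size=n)

def r_squared(X, y):
    """R-squared from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_distal = r_squared(ability.reshape(-1, 1), performance)
r2_both = r_squared(np.column_stack([ability, knowledge]), performance)
print(f"distal only R^2: {r2_distal:.2f}")
print(f"distal + proximal R^2: {r2_both:.2f}")
```

In this sketch the proximal predictor adds substantial R-squared over the distal one precisely because it sits closer to performance on the causal chain, even though the distal predictor is its cause.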
In addition to skills and attitudinal changes (e.g., self-efficacy), it is likely that knowledge acqui-
sition plays a key role in adaptive performance. In particular, for individuals to perform adaptively,
they must adequately perceive and interpret various situations and possess knowledge concerning
how to operate adaptively in the given contexts. This assumption is supported by several theoretical
propositions and empirical findings from existing research, as described throughout this chapter. For
example, structural models of job performance have shown that cognitive ability exhibits a strong
indirect effect on ratings of overall performance through the mediating variables of job knowl-
edge (Borman, White, Pulakos, & Oppler, 1991; Borman, White, & Dorsey, 1995; Hunter, 1983;
Schmidt, Hunter, & Outerbridge, 1986) and technical proficiency (Borman et al., 1995; Schmidt
et al., 1986). Moreover, given the situational nature of adaptability, it is likely that some type of situ-
ational knowledge (conceptually close to the concept of practical intelligence, see Schmitt & Chan,
2006) plays an important role.
Another important avenue of current research is the literature regarding self-regulatory skills.
Regulatory processes have been implicated as mediational mechanisms that lead to better adaptive
performance (e.g., Chen et al., 2005). Regulatory mechanisms include motivational components
such as goal choice and goal striving (Kanfer, 1990; Mitchell & Daniels, 2003). Consistent with the
literature on goal-setting, goal choice and goal striving affect performance through the allocation
of attention to and persistence in achieving goals (Locke & Latham, 1990; Chen et al., 2005). The
notion of goal choice is similar to other proximal predictors such as strategy selection that have
been proposed as determinants of adaptive performance (Ployhart & Bliese, 2006). In summarizing
possible predictors of adaptive performance, it is important to highlight that not all of these factors
are strictly malleable (in the sense of being trainable). Thus, from a selection perspective, choosing
predictors that are more likely to be relatively fixed in individuals may be beneficial. Also, careful
consideration of which skills or attributes are needed at entry is warranted. However, we should also
mention that trainability does not necessarily imply that training is superior to selection. Training
programs can be very expensive to develop and administer. As a result, it can be more cost-effective
to buy, rather than build, even the most malleable characteristics.
IMMEDIATE/PROXIMAL DETERMINANTS—CITIZENSHIP
Although stable individual differences are clearly important for the prediction of citizenship perfor-
mance, the better part of its variance remains unexplained. One likely reason for the modest cor-
relations between stable characteristics and citizenship is that their influence is transmitted by more
proximal variables. Three categories of predictors that hold great promise are attitudes, knowledge,
and skills.
In their seminal meta-analysis, Organ and Ryan (1995) found that, with the exception of consci-
entiousness, attitudinal variables (job satisfaction, perceived fairness, organizational commitment,
and leader supportiveness) were stronger predictors of OCB than dispositional variables, with mean
uncorrected correlations that were 5–15 points higher. Ilies et al. (2006) found that within-person
variance in citizenship could be explained by within-person variance in positive and negative affect
and job satisfaction.
Until recently, researchers had only hinted at knowledge and skills that might predict citizen-
ship performance. Motowidlo et al. (1997) provided general examples of citizenship performance
knowledge and skills when they stated that, “Examples of contextual knowledge include knowing
how to cooperate with a diverse group of people; knowing how to calm an upset worker; knowing
how to work productively with difficult peers, supervisors, and subordinates, and so forth” (p. 80).
Similarly, they suggested that examples of contextual skill include “skill in actually carrying out
actions known to be effective for handling situations that call for helping and coordinating with
others; following organizational rules and procedures; endorsing, supporting, and defending orga-
nizational objectives; persisting; and volunteering” (p. 81).
Much of the empirical work has asserted a particular knowledge and/or skill as relevant for effec-
tive performance in a specific domain (e.g., Bergman, Donovan, & Drasgow, 2001, interpersonal
skill predicting leadership; Morgeson, Reider, & Campion, 2005, interpersonal and teamwork skills
predicting team performance; Motowidlo, Brownless, & Schmit, 1998, and Schmit, Motowidlo,
Degroot, Cross, & Kiker, 1996, customer service knowledge and skill predicting customer service
performance). Other research has worked backward to determine knowledge and skills (e.g., Bess,
2001; Schneider & Johnson, 2001). For example, participants are presented with an interpersonal
situation that may be encountered on the job for which they are applying. Subject matter experts
(SMEs) then decide which items measure overall job knowledge relevant to citizenship performance
dimensions. Thus, no particular citizenship performance knowledge or skill is identified.
To our knowledge, only two studies have tested specific knowledge in the prediction of citizen-
ship performance. Bettencourt, Gwinner, and Meuter (2001) found that two customer-knowledge
antecedents (trait richness, or knowledge of customer characteristics and types, and strategy rich-
ness, or knowledge of strategies for dealing with customer needs and situations) explained unique
variance in service-oriented OCBs after controlling for attitudinal and personality variables.
Dudley and Cortina (2009) examined knowledge and skill predictors of helping performance
(one of the facets of Personal Support) and found that interpersonal construct knowledge, strategy
richness, means-end knowledge, emotional support skill, and emotion perception skill predicted
supervisor ratings of person-related and task-related helping performance over and above various
dispositional variables.
Although relatively little empirical work has been conducted to uncover knowledge and skill pre-
dictors of citizenship, indirect evidence has been found for the effect of interpersonal skills on the
fostering of helping in creative tasks. Porath and Erez (2007) found that incidents of rudeness were
related to decreased levels of helping both toward the rude person and toward unrelated others. This suggests
that rudeness has wide-ranging effects in interdependent contexts, such as team performance, where
helping can contribute substantially to performance.
Dudley and Cortina (2008) developed a conceptual model linking specific knowledge and skill
variables to the Personal Support facets of citizenship performance. Among the most prominent
knowledge-based predictors were strategy richness, emotional knowledge, knowledge of influence
tactics, and organizational norm knowledge. Among the most prominent skill-based predictors were
emotion support skills, emotion management skills, facework skills, behavioral flexibility, social
perceptiveness, and perspective-taking skills. We refer the reader to Dudley and Cortina (2008) for
the bases of the linkages themselves.
Leader characteristics and practices have been found to predict citizenship behavior in ways
consistent with social exchange. For instance, guidance from leaders to their followers on
appropriate types of behaviors (i.e., citizenship), when bolstered by behavioral integrity (word-
behavior consistency on the part of the leader), leads to follower OCB (Dineen, Lewicki, & Tomlinson, 2006).
Abusive supervision has also been linked to lower perceptions of organizational justice, which in
turn mediate its effects on citizenship (Aryee, Chen, Sun, & Debrah, 2007). Positive leader-
member exchange (LMX) has also been found to relate to several positive organizational outcomes,
one of which is supervisor-oriented OCB-I (Ilies, Nahrgang, & Morgeson, 2007).
Transformational leadership has also been linked to citizenship behaviors. However, a recent
study found that Hackman and Oldham’s (1974) job characteristics theory constructs mediated the
transformational leadership-citizenship relation (Piccolo & Colquitt, 2006). Specifically, transformational
leadership was found to influence the "meaning" ascribed by employees to elements of their
jobs (such as their perceptions of skill variety and identity). This in turn led to intrinsic motivation,
which mediated the relation between job characteristics constructs and citizenship behavior. Others
suggest that the transformational leadership-citizenship relation is mediated by LMX (Wang, Law,
Hackett, Wang, & Chen, 2006). In this case, transformational leadership is a cue to a follower that
the leader is a trustworthy exchange partner. This leads to greater investment by the follower in the
relationship and then stronger positive LMX perceptions.
Although not framed in SET terms, Rosen, Levy, and Hall's (2006) study can also be
understood from a social exchange perspective. Rosen et al. found that fostering a climate in
which supervisors provide feedback to their subordinates reduces perceptions of organizational politics and
increases employee morale. High-feedback environments signal that the outcomes enjoyed by indi-
viduals are determined in a procedurally fair manner and are based not on political skill but on con-
tributions to organizational success, which leads to citizenship. This suggests that resources devoted
to selecting good citizens may not be well spent unless the environment is conducive to citizenship.
Additionally, complementarity in control-related personality characteristics (leader having
more and subordinates having less controlling personalities) was posited to lead to positive social
exchanges (Glomb & Welsh, 2005). The complementarity hypothesis was not supported. Rather,
there was a main effect for subordinate control, suggesting that individuals need a sense of control
over their work (consistent with some of the evidence outlined above) to exhibit OCB.
In closing, we should mention that most research on predictors of citizenship and adaptive
performance has used job incumbents as participants. To increase confidence that the
predictors described above are in fact causes of the observed effects on performance,
predictive validity studies using applicant populations need to be conducted.
MEASUREMENT
MEASURING ADAPTIVE PERFORMANCE
In considering the measurement of adaptive performance, we highlight two primary measurement
issues and two primary methods. The first issue is the importance of clearly defining the types of
adaptive behavior or performance of interest. As suggested above, there is a host of ways of defining
and conceptualizing adaptive behavior. Following the overarching definition of adaptability offered
above, measures of adaptive performance must focus on the way(s) in which task requirements have
shifted in a manner as to require a change in response. The specific nature of the changing task
requirements and the required responses will drive the relative predictive power (and weighting)
of various distal and proximal predictors (e.g., interpersonal adaptability requirements will relate
more to interpersonal predictors such as communication, negotiation, and persuasion skills). Thus,
in practice, it is essential that a job analysis is conducted to establish appropriate predictor-criterion
linkages. Assuming that all dimensions of adaptive or citizenship performance are de facto rel-
evant (or irrelevant) can lead to measures that are either deficient or contaminated.
This leads to a second important point: Given the possible “maximal” nature of adaptive perfor-
mance, in some contexts it may be hard to measure adaptive behavior under typical work conditions
or in laboratory settings. For example, handling severe crises, tolerating extremes of stress, or meet-
ing extreme physical demands requires observing behavior under nonroutine, possibly low-base-
rate, and possibly hard-to-replicate conditions.
Given the challenges in measuring adaptive performance, how would one go about it? First,
most of the existing techniques for gathering ratings-based performance measures apply to adaptive
performance. Thus, one can gather supervisor, peer, subordinate, or customer ratings using various
behavior-based rating formats. Pulakos et al. (2002) used this approach. Specifically, Pulakos et al.
gathered two types of ratings-based measures. The first measure was a set of eight behavior-based
rating scales constructed to reflect each of the eight adaptability dimensions highlighted earlier.
A second ratings measure asked supervisors to rate how effective each subordinate was in handling
each of 24 situations (3 situations for each of the eight dimensions); each of the situations was based
on job-related critical incidents that had been identified as requiring adaptive responses.
In other situations in which ratings-based measures might be deemed insufficient, simulations
or work samples might be used to measure adaptive behavior. As just one example, the U.S. Army
Special Forces created a computer game simulation as part of the Adaptive Thinking & Leadership
(ATL) program to train Special Forces team leaders (Raybourn, Deagle, Mendini, & Heneghan,
2005). This simulation challenges trainees to function effectively in an immersive multicultural
environment where adaptability-related skills in negotiation, consensus building, communication,
judgment, self-awareness, and innovation are required. Simulation-based measures are also popular
in applications in which adaptive performance is conceptualized in terms of training transfer. For
a good example of this approach see Chen, Thomas, and Wallace (2005), in which training simula-
tions are used to measure adaptive performance at the individual and team levels.
Brooks-Shesler, LaPort, Dudley, and Cortina (2009) used items such as the following to measure
cooperation:
• Tries to talk fellow cadets through personal problems, helping them come up with solutions
• Tries to help fellow cadets with difficult assignments, even when assistance is not directly requested
These are not items from the original list, but rather heavily modified versions. The original
helping and cooperation items were each developed not only to be consistent with the conceptual
nature of their respective constructs, but also to be distinct from one another, among other things.
Despite these efforts, a group of SMEs had
difficulty in categorizing these items accurately. This is consistent with the Hoffman et al. (2007)
finding that raters typically have difficulty in discriminating between these subdimensions without
the aid of “frame-of-reference” training. In addition, the SMEs in the Dudley and Cortina (2008)
and Brooks-Shesler et al. (2009) studies had difficulty in distinguishing citizenship items not only
from one another, but also from items measuring knowledge and skills. In these studies, the items
were modified so that their distinctiveness was more apparent. This sort of iterative process is not
uncommon in scale development. Our point is that the process is particularly important and dif-
ficult in measuring the facets of citizenship because of the difficulty in demonstrating discriminant
validity.
It should also be mentioned that, just as there are unique difficulties in the measurement of citi-
zenship, there are unique difficulties in the measurement of the knowledge and skills that predict
citizenship. We mentioned in the previous paragraph that dimensions of citizenship are difficult to
measure in part because they are difficult to distinguish from the knowledge and skills that predict
them. This problem cuts both ways in that knowledge and skills are difficult to measure because
they are difficult to distinguish from the dimension that they are intended to predict.
Research on teams has identified a host of group process factors that impact adaptability at team
and individual levels (Burke, Stagl, Salas, Pierce, & Kendall, 2006; Chen et al., 2005; Kozlowski
et al., 2001). In some circumstances,
team members may adapt team roles or internal structures to align with the environment (Moon
et al., 2004).
In addition to the nature of tasks and possible team or group factors, the larger organizational
environment also likely moderates predictors of adaptability. As W. Edwards Deming once observed,
when you put a good person in a bad system, the bad system wins every time. Thus, organizational norms,
cultures, climates, and systems can constrain or enhance adaptability. For example, organizational
communication patterns might enable more or less adaptive performance (Entin, Diedrich, &
Rubineau, 2003). Moreover, organizational phenomena such as mergers and acquisitions may
require additional adaptation on the part of individuals (Griffin et al., 2007). These "boundary con-
ditions" must be taken into consideration when an organization is attempting to enhance adaptation
in its employees or when diagnosing why adaptive performance interventions may not be having
the intended effect.
be more strongly related to "taking charge" citizenship (beneficial, organizational change-related
behaviors) when perceived discretion over the demonstration of these behaviors is low (i.e., when they
are role prescribed; McAllister, Kamdar, Morrison, & Turban, 2007).
Interestingly, this same study found that altruistic/interpersonal citizenship was more strongly
related to perceived discretion over citizenship when procedural justice was low (McAllister
et al., 2007). This finding is consistent with research by Halbesleben and Bowler (2007), in which
interpersonal citizenship was found to serve as a social support coping mechanism when condi-
tions at work are stressful.
Findings related to procedural justice climate (PJC) are consistent with some of the research
outlined above. For instance, Yang, Mossholder, and Peng (2007) found that average group levels
of “power distance,” or the extent to which individuals defer to decisions of powerful individuals
and accept power imbalances, moderates the relationship between PJC and citizenship. Groups
with high average levels of power distance will not “withhold” citizenship toward the organization
when faced with unfair treatment at the group level because they do not feel that arbitrary decisions
made by leaders justify such a reaction. Other instances of multilevel research suggest that attitude
targets moderate relationships across levels. Thus, group-level justice perceptions are more strongly
related to "higher-level" recipients of citizenship. For instance, procedural and informational justice
climate (created by independently averaging individual-level perceptions of each type of justice)
became stronger predictors as the target of the citizenship "moved up" in level toward "the organization"
(Liao & Rupp, 2005).
The relationship between perceived union support (the union analog of perceived organizational
support, an indicator of exchange relationship quality) and citizenship directed toward labor unions
was shown to become more negative with increases in exchange ideology (the extent to which the
benefits of an exchange shape behavior; i.e., sensitivity to exchange quality) (Redman & Snape, 2005).
This was posited to result from a lack of salience of individual exchanges relative to collective
solidarity (group-level concerns).
Other leader characteristics and practices have also been found to moderate relations between personality and contextual variables and citizenship behavior. For instance, charismatic leadership interacts with followers’ feelings of belongingness to the group: charisma matters less when follower belongingness is already high, because charismatic leaders can compensate for low belongingness by making group identity salient and thereby increasing citizenship (Den Hartog, De Hoogh, & Keegan, 2007). The effect of leaders’ influence tactics on subordinates’ citizenship performance has also been found to be contingent on the quality of the leader-follower relationship (i.e., leader-member exchange, or LMX). For instance, inspirational appeals are negatively related to citizenship for followers in poor-quality LMX relationships because these appeals reinforce value incongruence between the leader and the follower (Sparrowe, Soetjipto, & Kraimer, 2006). In contrast, exchange appeals, in which resources are exchanged between leader and subordinate, encouraged more citizenship among followers higher in LMX, likely because such appeals were construed as “just another exchange” among the many already positive exchanges between leader and follower (Sparrowe et al., 2006).
As a whole, the justice and SET-related research above suggests that when some other individual difference predictor (a personal disposition toward citizenship, defining one’s job role as including helping, etc.) is not driving an employee to engage in citizenship, being treated well by the organization can compensate. Thus, high levels of certain individual differences bound the ability of justice and SET to predict citizenship (i.e., some people do not need procedural justice to help the organization).
In addition to SET as an explanation of citizenship behavior, researchers are beginning to rec-
ognize the role of self-enhancement as a motive for citizenship (e.g., Bowler & Brass, 2006; Yun,
Takeuchi, & Liu, 2007). Research has shown that when an employee’s role is ambiguous, employees engage in more citizenship performance toward the organization to make up for their inability to determine which behaviors are valued (Yun et al., 2007). This relationship holds only for employees perceived as having high levels of affective organizational commitment; otherwise, their citizenship motives are transparent and recognized as self-interested (Yun et al., 2007).
Adaptive and Citizenship-Related Behaviors at Work 479
One recent study has suggested that overall commitment may be less predictive than configurations of differing types of commitment (Sinclair, Tucker, Cullen, & Wright, 2005). The purpose of this study was to tease apart how different within-person profiles of affective and continuance commitment predicted citizenship performance between persons. “Devoted” employees (high affective and continuance commitment) showed consistently higher citizenship than other profiles, and “free agents” (moderate continuance, low affective commitment) showed consistently low citizenship.
Finally, we return once again to Ilies et al. (2006). Using an event sampling approach, these authors first demonstrated that there is meaningful within-person variance in citizenship: a sizable percentage of the total variance in citizenship was within-person. They then showed that this within-person variance overlapped considerably with within-person variance in job attitudes such as job satisfaction. Finally, they found that stable individual difference variables such as agreeableness moderated this within-person relationship (i.e., agreeableness was a significant predictor in a level-2 random coefficient model equation). By treating citizenship as a within-person variable, Ilies et al. cast doubt on much of the previous work on the topic.
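The within- versus between-person variance decomposition at the heart of the Ilies et al. (2006) argument can be illustrated with a small simulation; the sample sizes and variance components below are arbitrary choices for illustration, not estimates from their study:

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_days = 200, 10

# Stable between-person differences (e.g., trait-driven citizenship baselines)...
person_means = rng.normal(loc=3.0, scale=0.5, size=n_persons)

# ...plus day-to-day within-person fluctuation (e.g., state job satisfaction),
# yielding one simulated citizenship score per person per day.
daily_ocb = person_means[:, None] + rng.normal(0.0, 0.7, size=(n_persons, n_days))

# Decompose total variance into between-person (variance of person averages)
# and within-person (variance of deviations from each person's own average).
between_var = daily_ocb.mean(axis=1).var()
within_var = (daily_ocb - daily_ocb.mean(axis=1, keepdims=True)).var()
within_share = within_var / (between_var + within_var)

print(f"Within-person share of total variance: {within_share:.2f}")
```

With repeated daily observations per person, a majority of the variance in this simulation is within person, mirroring the qualitative point that citizenship fluctuates meaningfully from day to day rather than being a purely stable individual difference.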
In short, there are task and organizational variables that suppress adaptability or citizenship, change their components, muffle the influence of determinants, or transmit the effects of those determinants. If one is to maximize the utility of an adaptability- or citizenship-based selection system, these variables and their impact must be recognized.
IMPACT OF CITIZENSHIP
Citizenship performance does indeed contribute to organizational effectiveness (e.g., George &
Bettenhausen, 1990; Karambayya, 1990, as cited in Podsakoff et al., 2000; Koys, 2001; Podsakoff,
Ahearne, & MacKenzie, 1997; Podsakoff & MacKenzie, 1994; Podsakoff et al., 2000). This is especially true when work tasks are interdependent, as highly interdependent tasks are facilitated by helping behaviors and cooperation (Bachrach, Powell, Collins, & Richey, 2006b). Bachrach et al. (2006b) found that high levels of interdependence led to higher ratings of group-level performance when moderate levels of OCB were observed. Other research has confirmed that fostering citizenship leads to positive organizational outcomes. Specifically, service-related citizenship partially mediated the relationship between social-exchange-informed HR practices and organizational productivity and turnover (Sun, Aryee, & Law, 2007).
Payne and Webber (2006) found that service-oriented citizenship (i.e., citizenship directed toward customers) is related to customer attitudes such as satisfaction, loyalty intentions, relationship tenure, positive word-of-mouth, and reduced complaining. Consistent with Sun et al. (2007), these results suggest that positive social exchanges and the resultant attitudes (i.e., employee job satisfaction) predict citizenship.
Research has demonstrated that citizenship performance is important in supervisory ratings of
performance (Borman et al., 1995; Conway, 1996; MacKenzie, Podsakoff, & Fetter, 1993; Rotundo & Sackett, 2002; Werner, 1994). Group task interdependence has been found to moderate the effects of OCB on performance appraisals (Bachrach, Powell, Bendoly, & Richey, 2006a). Although
supervisors may typically ignore or deemphasize the centrality of OCB to overall job performance,
when tasks are highly interdependent, the need for cooperation and helping is more difficult to
disregard. Thus, with higher levels of interdependence, the influence of citizenship on performance
appraisal is more pronounced. However, other research suggests that the influence of citizenship on
performance appraisal is affected by social comparison. If an employee’s workgroup exhibits high
average levels of citizenship, any given employee’s levels are comparatively lower and tend to have
weaker associations with appraisal outcomes than employees in workgroups with lower average
levels of citizenship (Bommer, Dierdorff, & Rubin, 2007).
Whiting, Podsakoff, and Pierce (2008) decomposed the citizenship domain into “helping” or
altruistic citizenship, “voice” (similar to “taking charge”), and “loyalty” to gauge their independent
effects on performance appraisal outcomes. This study found that independent of task performance,
all three citizenship dimensions predicted appraisals, with loyalty having the strongest association.
Interestingly, a three-way interaction was found such that when helping is low and task performance
high, voice loses its association with positive appraisals. Because voice is more confrontational than
the other forms of citizenship, when an employee is not contributing in other areas of their job, their
challenges to organizational routine are undervalued.
Ferrin, Dirks, and Shah (2006) found that OCB-Is, or interpersonal citizenship behaviors, were predictive of ratings of interpersonal trust, especially when the social networks of the rater and ratee were similar. This is because interpersonal citizenship behaviors are explicitly altruistic in nature and are therefore taken as indicative of the trustworthiness of the person engaging in them.
Although little research directly addresses the boundary conditions under which adaptive performance becomes too adaptive, this possibility should be one consideration in studying adaptive performance. In addition, there is little extant evidence regarding subgroup differences in adaptive performance. Future research should address this gap.
CONCLUSIONS
The purpose of this chapter was to review recent research on two performance dimensions that
represent departures from traditional job performance models: adaptive performance and citizen-
ship performance. For each of these dimensions, we began by offering definitions that clarify their
nature and their distinctiveness. We then reviewed research on distal and proximal predictors of
these dimensions. Next, we discussed variables that might moderate the relationships between these
dimensions and other variables. Finally, we discussed the consequences of these dimensions.
This chapter should make clear that adaptive and citizenship performance are important. One
reason that they are important is that they relate to variables that are of great interest to organi-
zations, such as performance appraisal, group effectiveness, change management, and stressors.
Another reason is that they are distinct from alternative dimensions, conceptually and nomologi-
cally. If, as seems to be the case, these dimensions are underrepresented in performance appraisals,
then the weighting of dimensions in those systems is suboptimal. In addition to the obvious conse-
quences for organizational productivity, this would also result in discrimination against protected
groups to the degree that these dimensions create smaller subgroup differences than do the dimen-
sions that are commonly included in appraisal instruments.
Assuming that these dimensions are important, more work must be done to determine how
important they are (i.e., optimal weighting) and how relative importance varies with situational
or organizational characteristics. More research must also be done to identify the determinants of
these dimensions. Because many of these determinants are malleable (e.g., skills, leader behaviors),
research must also evaluate interventions designed to increase adaptive performance and citizen-
ship through the improvement of these determinants.
In closing, we would point out that there is no reason to stop here. If performance is to be under-
stood, then adaptive performance and citizenship must receive specific attention. However, there
are bound to be other dimensions of performance that should also be added to the mix. If it can be
demonstrated that a new dimension is conceptually distinct from existing dimensions, has impor-
tant consequences, and has a set of determinants that differ from those of other dimensions, then
that new dimension should also be added to our models of performance.
REFERENCES
Alge, B. J., Ballinger, G. A., Tangirala, S., & Oakley, J. L. (2006). Information privacy in organizations:
Empowering creative and extrarole performance. Journal of Applied Psychology, 91, 221–232.
Aryee, S., Chen, Z. X., Sun, L.-Y., & Debrah, Y. A. (2007). Antecedents and outcomes of abusive supervision:
Test of a trickle-down model. Journal of Applied Psychology, 92, 191–201.
Bachrach, D. G., Powell, B. C., Bendoly, E., & Richey, R. G. (2006a). Organizational citizenship behavior and
performance evaluations: Exploring the impact of task interdependence. Journal of Applied Psychology,
91, 193–201.
Bachrach, D. G., Powell, B. C., Collins, B. J., & Richey, R. G. (2006b). Effects of task interdependence on
the relationship between helping behavior and group performance. Journal of Applied Psychology, 91,
1396–1405.
Barnard, C. I. (1938). The functions of the executive. Cambridge, MA: Harvard University Press.
Bateman, T. S., & Organ, D. W. (1983). Job satisfaction and the good soldier: The relationship between affect
and employee “citizenship.” Academy of Management Journal, 26, 587–595.
Bergman, M. E., Donovan, M. A., & Drasgow, F. (2001, April). A framework for assessing contextual performance. Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Bettencourt, L. A., Gwinner, K. P., & Meuter, M. L. (2001). A comparison of attitude, personality, and knowledge
predictors of service-oriented organizational citizenship behaviors. Journal of Applied Psychology, 86,
29–41.
Blau, P. M. (1964). Exchange and power in social life. New York, NY: Wiley.
Bolino, M. C., & Turnley, W. H. (2005). The personal costs of citizenship behavior: The relationship between
individual initiative and role overload, job stress, and work-family conflict. Journal of Applied Psychology,
90, 740–748.
Bommer, W. H., Dierdorff, E. C., & Rubin, R. S. (2007). Does prevalence mitigate relevance? The mod-
erating effect of group-level OCB on employee performance. Academy of Management Journal, 50,
1481–1494.
Borman, W. C., Buck, D. E., Hanson, M. A., Motowidlo, S. J., Stark, S., & Drasgow, F. (2001a). An examina-
tion of the comparative reliability, validity, and accuracy of performance ratings made using computer-
ized adaptive rating scales. Journal of Applied Psychology, 86, 965–973.
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual
performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98).
San Francisco, CA: Jossey-Bass.
Borman, W. C., & Penner, L. A. (2001). Citizenship performance: Its nature, antecedents, and motives. In
B. W. Roberts & R. Hogan (Eds.), Personality psychology in the workplace. Decade of behavior (pp.
45–61). Washington, DC: American Psychological Association.
Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001b). Personality predictors of citizenship
performance. International Journal of Selection and Assessment, 9, 52–69.
Borman, W. C., White, L. A., & Dorsey, D. W. (1995). Effects of ratee task performance and interpersonal factors on supervisor and peer performance ratings. Journal of Applied Psychology, 80, 168–177.
Borman, W. C., White, L. A., Pulakos, E. D., & Oppler, S. H. (1991). Models of supervisory job performance
ratings. Journal of Applied Psychology, 76, 863–872.
Bowler, W. M., & Brass, D. J. (2006). Relational correlates of interpersonal citizenship behavior: A social net-
work perspective. Journal of Applied Psychology, 91, 70–82.
Brief, A. P., & Motowidlo, S. J. (1986). Prosocial organizational behaviors. Academy of Management Review,
11, 710–725.
Brooks-Shesler, L., LaPort, K., Dudley, N. M., & Cortina, J. M. (2009). Knowing how to get along: Knowledge
and skill predictors of cooperative performance. Manuscript submitted for publication.
Burke, S., Pierce, L., & Salas, E. (Eds.). (2006). Understanding adaptability: A prerequisite for effective performance within complex environments. Cambridge, MA: Elsevier Science.
Burke, C. S., Stagl, K. C., Salas, E., Pierce, L., & Kendall, L. (2006). Understanding team adaptation: A con-
ceptual analysis and model. Journal of Applied Psychology, 91, 1189–1207.
Campbell, J. P. (1999). The definition and measurement of performance in the new age. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and development (pp. 21–55). San Francisco, CA: Jossey-Bass.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-Bass.
Chen, G., Thomas, B. A., & Wallace, J. C. (2005). A multilevel examination of the relationships among training
outcomes, mediating regulatory processes, and adaptive performance. Journal of Applied Psychology,
90, 827–841.
Colbert, A. E., Mount, M. K., Harter, J. K., Witt, L. A., & Barrick, M. R. (2004). Interactive effects of person-
ality and perceptions of the work situation on workplace deviance. Journal of Applied Psychology, 89,
599–609.
Coleman, V. I., & Borman, W. C. (2000). Investigating the underlying structure of the citizenship performance
domain. Human Resource Management Review, 10, 25–44.
Conway, J. M. (1996). Additional construct validity evidence for the task-contextual performance distinction.
Human Performance, 9, 309–329.
Cropanzano, R., & Mitchell, M. S. (2005). Social exchange theory: An interdisciplinary review. Journal of
Management, 31, 874–900.
DeArmond, S., Tye, M., Chen, P. Y., Krauss, A., Rogers, D. A., & Sintek, E. (2006). Age and gender stereo-
types: New challenges in a changing workplace and workforce. Journal of Applied Social Psychology,
36, 2184–2214.
Den Hartog, D. N., De Hoogh, A. H. B., & Keegan, A. E. (2007). The interactive effects of belongingness and
charisma on helping and compliance. Journal of Applied Psychology, 92, 1131–1139.
Dineen, B. R., Lewicki, R. J., & Tomlinson, E. C. (2006). Supervisory guidance and behavioral integrity:
Relationships with employee citizenship and deviant behavior. Journal of Applied Psychology, 91,
622–635.
DuBois, C. L. Z., Sackett, P. R., Zedeck, S., & Fogli, L. (1993). Further exploration of typical and maximum
performance criteria: Definitional issues, prediction, and White-Black differences. Journal of Applied
Psychology, 78, 205–211.
Dudley, N. M., & Cortina, J. M. (2008). Knowledge and skills that facilitate the personal support dimension of
citizenship. Journal of Applied Psychology, 93, 1249–1270.
Dudley, N. M., & Cortina, J. M. (2009). Knowledge and skills that predict helping. Manuscript submitted for
publication.
Entin, E. E., Diedrich, F. J., & Rubineau, B. (2003). Adaptive communication patterns in different organiza-
tional structures. In Proceedings of the Human Factors and Ergonomics Society (HFES) 47th Annual
Meeting (pp. 405–409). Santa Monica, CA: HFES.
Ferrin, D. L., Dirks, K. T., & Shah, P. P. (2006). Direct and indirect effects of third-party relationships on inter-
personal trust. Journal of Applied Psychology, 91, 870–883.
Ford, J. K., Smith, E. M., Weissbein, D. A., Gully, S. M., & Salas, E. (1998). Relationships of goal orientation,
metacognitive activity, and practice strategies with learning outcomes and transfer. Journal of Applied
Psychology, 83, 218–233.
George, J. M., & Bettenhausen, K. (1990). Understanding prosocial behaviour, sales performance, and turnover: A group-level analysis in a service context. Journal of Applied Psychology, 75, 698–709.
Glomb, T. M., & Welsh, E. T. (2005). Can opposites attract? Personality heterogeneity in supervisor-subordi-
nate dyads as a predictor of subordinate outcomes. Journal of Applied Psychology, 90, 749–757.
Graham, J. W. (1991). An essay on organizational citizenship behavior. Employee Responsibilities and Rights Journal, 4, 249–270.
Griffin, M. A., Neal, A., & Parker, S. K. (2007). A new model of work role performance: Positive behavior in
uncertain and interdependent contexts. Academy of Management Journal, 50, 327–347.
Hackman, J. R., & Oldham, G. R. (1980). Work redesign. Reading, MA: Addison-Wesley.
Halbesleben, J. R. B., & Bowler, W. M. (2007). Emotional exhaustion and job performance: The mediating role
of motivation. Journal of Applied Psychology, 92, 93–106.
Heilman, M. E., & Chen, J. J. (2005). Same behavior, different consequences: Reactions to men’s and women’s
altruistic citizenship behavior. Journal of Applied Psychology, 90, 431–441.
Hesketh, B., & Neal, A. (1999). Technology and performance. In D. R. Ilgen & E. D. Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and development (pp. 21–55). San Francisco, CA: Jossey-Bass.
Hoffman, B. J., Blair, C. A., Meriac, J. P., & Woehr, D. J. (2007). Expanding the criterion domain? A quantita-
tive review of the OCB literature. Journal of Applied Psychology, 92, 555–566.
Hunter, J. E. (1983). A causal model of cognitive ability, job knowledge, job performance, and supervisor
ratings. In F. J. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance measurement and theory (pp.
257–266). Hillsdale, NJ: Lawrence Erlbaum.
Ilgen, D. R., & Pulakos, E. D. (1999). Employee performance in today’s organizations. In D. R. Ilgen & E. D.
Pulakos (Eds.), The changing nature of performance: Implications for staffing, motivation, and develop-
ment. San Francisco, CA: Jossey-Bass.
Ilies, R., Nahrgang, J. D., & Morgeson, F. P. (2007). Leader-member exchange and citizenship behaviors: A
meta-analysis. Journal of Applied Psychology, 92, 269–277.
Ilies, R., Scott, B. A., & Judge, T. A. (2006). The interactive effects of personal traits and experienced states on
intraindividual patterns of citizenship behavior. Academy of Management Journal, 49, 561–575.
Jackson, C. L., Colquitt, J. A., Wesson, M. J., & Zapata-Phelan, C. P. (2006). Psychological collectivism: A
measurement validation and linkage to group member performance. Journal of Applied Psychology, 91,
884–899.
Johnson, J. W. (2001). The relative importance of task and contextual performance dimensions to supervisor
judgments of overall performance. Journal of Applied Psychology, 86, 984–996.
Johnson, J. W. (2003). Toward a better understanding of the relationship between personality and individual
job performance. In M. R. Barrick & A. M. Ryan (Eds.), Personality and work: Reconsidering the role of
personality in organizations (pp. 83–120). San Francisco, CA: Jossey-Bass.
Joireman, J., Kamdar, D., Daniels, D., & Duell, B. (2006). Good citizens to the end? It depends: Empathy and
concern with future consequences moderate the impact of a short-term time horizon on organizational
citizenship behaviors. Journal of Applied Psychology, 91, 1307–1320.
Judge, T. A., LePine, J. A., & Rich, B. L. (2006). Loving yourself abundantly: Relationship of the narcissistic
personality to self- and other perceptions of workplace deviance, leadership, and task and contextual
performance. Journal of Applied Psychology, 91, 762–776.
Kamdar, D., McAllister, D. J., & Turban, D. B. (2006). “All in a day’s work”: How follower individual differ-
ences and justice perceptions predict OCB role definitions and behavior. Journal of Applied Psychology,
91, 841–855.
Kamdar, D., & Van Dyne, L. (2007). The joint effects of personality and workplace social exchange relation-
ships in predicting task performance and citizenship performance. Journal of Applied Psychology, 92,
1286–1298.
Kanfer, R. (1990). Motivation theory and industrial and organizational psychology. In M. D. Dunnette & L. M.
Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1, pp. 75–170). Palo
Alto, CA: Consulting Psychologists Press.
Karambayya, R. (1990). Contexts for organizational citizenship behavior: Do high performing and satisfying units have better ‘citizens’? Working paper, York University. (As cited in Podsakoff, MacKenzie, Paine, & Bachrach, 2000.)
Katz, D. (1964). Motivational basis of organizational behavior. Behavioral Science, 9, 131–146.
Koys, D. J. (2001). The effects of employee satisfaction, organizational citizenship behavior, and turnover on
organizational effectiveness: A unit-level, longitudinal study. Personnel Psychology, 54, 101–114.
Kozlowski, S. W. J., Gully, S. M., Brown, K. G., Salas, E., Smith, E. M., & Nason, E. R. (2001). Effects of
training goals and goal orientation traits on multi-dimensional training outcomes and performance adapt-
ability. Organizational Behavior and Human Decision Processes, 85, 1–31.
Kozlowski, S. W. J., Watola, D., Nowakowski, J. M., Kim, B., & Botero, I. (2008). Developing adaptive teams:
A theory of dynamic team leadership. In E. Salas, G. F. Goodwin, & C. S. Burke (Eds.), Team effective-
ness in complex organizations: Cross-disciplinary perspectives and approaches (pp. 113–155). Mahwah,
NJ: Lawrence Erlbaum.
LePine, J. A., Colquitt, J. A., & Erez, A. (2000). Adaptability to changing task contexts: Effects of general cog-
nitive ability, conscientiousness, and openness to experience. Personnel Psychology, 53, 563–593.
LePine, J. A., Erez, A., & Johnson, D. E. (2002). The nature and dimensionality of organizational citizenship
behavior: A critical review and meta-analysis. Journal of Applied Psychology, 87, 52–65.
Liao, H., & Rupp, D. E. (2005). The impact of justice climate and justice orientation on work outcomes:
A cross-level multifoci framework. Journal of Applied Psychology, 90, 242–256.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ:
Prentice Hall.
MacKenzie, S. B., Podsakoff, P. M., & Fetter, R. (1993). The impact of organizational citizenship behavior on
evaluations of salesperson performance. Journal of Marketing, 57, 70–80.
McAllister, D. J., Kamdar, D., Morrison, E. W., & Turban, D. B. (2007). Disentangling role perceptions: How
perceived role breadth, discretion, instrumentality, and efficacy relate to helping and taking charge.
Journal of Applied Psychology, 92, 1200–1211.
Mitchell, T. R., & Daniels, E. (2003). Motivation. In W. C. Borman, D. R., Ilgen, & R. J. Klimoski (Eds.),
Comprehensive handbook of psychology: Vol. 12. Industrial and organizational psychology (pp. 225–
254). New York, NY: Wiley.
Moon, H., Hollenbeck, J. R., Humphrey, S. E., Ilgen, D. R., West, B. J., Ellis, A. P. J., & Porter, C. O. L. H.
(2004). Asymmetric adaptability: Dynamic team structures as one-way streets. Academy of Management
Journal, 47, 681–695.
Moon, H., Kamdar, D., Mayer, D. M., & Takeuchi, R. (2008). Me or we? The role of personality and justice as
other-centered antecedents to innovative citizenship behaviors within organizations. Journal of Applied
Psychology, 93, 84–94.
Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The impor-
tance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58,
583–611.
Motowidlo, S. J., Borman, W. C., & Schmitt, M. J. (1997). A theory of individual differences in task and con-
textual performance. Human Performance, 10, 71–83.
Mueller-Hanson, R. A., White, S. S., Dorsey, D. W., & Pulakos, E. D. (2005). Training adaptable leaders:
Lessons from research and practice (ARI Research Report 1844). Arlington, VA: U.S. Army Research
Institute for the Behavioral and Social Sciences.
Organ, D. W. (1988). A restatement of the satisfaction-performance hypothesis. Journal of Management, 14,
547–557.
Organ, D. W., & Ryan, K. (1995). A meta-analytic review of attitudinal and dispositional predictors of organi-
zational citizenship behavior. Personnel Psychology, 48, 775–802.
Payne, S. C., & Webber, S. S. (2006). Effects of service provider attitudes and employment status on citi-
zenship behaviors and customers’ attitudes and loyalty behavior. Journal of Applied Psychology, 91,
365–378.
Piccolo, R. F., & Colquitt, J. A. (2006). Transformational leadership and job behaviors: The mediating role of
core job characteristics. Academy of Management Journal, 49, 327–340.
Ployhart, R. E., & Bliese, P. D. (2006). Individual adaptability (I-ADAPT) theory: Conceptualizing the antecedents, consequences, and measurement of individual differences in adaptability. In S. Burke, L. Pierce, & E. Salas (Eds.), Understanding adaptability: A prerequisite for effective performance within complex environments (pp. 3–39). New York, NY: Elsevier.
Podsakoff, P. M., Ahearne, M., & MacKenzie, S. B. (1997). Organizational citizenship behavior and the quan-
tity and quality of work group performance. Journal of Applied Psychology, 82, 262–270.
Podsakoff, P. M., & MacKenzie, S. B. (1997). Impact of organizational citizenship behavior on organizational
performance: A review and suggestions for future research. Human Performance, 10, 133–151.
Podsakoff, P. M., MacKenzie, S. B., Paine, J. B., & Bachrach, D. G. (2000). Organizational citizenship behav-
iors: A critical review of the theoretical and empirical literature and suggestions for future research.
Journal of Management, 26, 513–563.
Porath, C. L., & Erez, A. (2007). Does rudeness really matter? The effects of rudeness on task performance and
helpfulness. Academy of Management Journal, 50, 1181–1197.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the work
place: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85,
612–624.
Pulakos, E. D., Schmitt, N., Dorsey, D. W., Hedge, J. W., & Borman, W. C. (2002). Predicting adaptive perfor-
mance: Further tests of a model of adaptability. Human Performance, 15, 299–323.
Raybourn, E. M., Deagle, E., Mendini, K., & Heneghan, J. (2005). Adaptive thinking and leadership simulation game training for Special Forces officers. In Proceedings of the Interservice/Industry Training, Simulation and Education Conference (I/ITSEC 2005), November 28–December 1, Orlando, FL.
Redding, J. (1997). Hardwiring the learning organization. Training & Development, 15, 61–67.
Redman, T., & Snape, E. (2005). Exchange ideology and member-union relationships: An evaluation of mod-
eration effects. Journal of Applied Psychology, 90, 765–773.
Rioux, S. M., & Penner, L. A. (2001). The causes of organizational citizenship behavior: A motivational analy-
sis. Journal of Applied Psychology, 86, 1306–1314.
Rosen, C. C., Levy, P. E., & Hall, R. J. (2006). Placing perceptions of politics in the context of the feed-
back environment, employee attitudes, and job performance. Journal of Applied Psychology, 91,
211–220.
Rotundo, M., & Sackett, P. R. (2002). The relative importance of task, citizenship, and counterproductive
performance to global ratings of job performance: A policy-capturing approach. Journal of Applied
Psychology, 87, 66–80.
Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and ability on job
knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied
Psychology, 71, 432–439.
Schmit, M. J., Motowidlo, S. J., Degroot, T., Cross, T., & Kiker, D. S. (1996, April). Explaining the relation-
ship between personality and job performance. Paper presented at the annual meeting of the Society for
Industrial and Organizational Psychology, San Diego, CA.
Schmitt, N., & Chan, D. (2006). Situational judgment tests: Method or construct? In J. Weekley & R. E. Ployhart (Eds.), Situational judgment tests (pp. 135–156). Mahwah, NJ: Lawrence Erlbaum.
Schmitt, N., Cortina, J. M., Ingerick, M. J., & Wiechmann, D. (2003). Personnel selection and employee performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 77–105). New York, NY: Wiley.
Sinclair, R. R., Tucker, J. S., Cullen, J. C., & Wright, C. (2005). Performance differences among four organiza-
tional commitment profiles. Journal of Applied Psychology, 90, 1280–1287.
Smith, C. A., Organ, D. W., & Near, J. P. (1983). Organizational citizenship behavior: Its nature and anteced-
ents. Journal of Applied Psychology, 68, 653–663.
Sparrowe, R. T., Soetjipto, B. W., & Kraimer, M. L. (2006). Do leaders’ influence tactics relate to members’ helping behavior? It depends on the quality of the relationship. Academy of Management Journal, 49, 1194–1208.
Stewart, G. L., & Nandkeolyar, A. (2006). Adaptability and intraindividual variation in sales outcomes:
Exploring the interactive effects of personality and environmental opportunity. Personnel Psychology,
59, 307–332.
Sun, L.-Y., Aryee, S., & Law, K. S. (2007). High-performance human resource practices, citizenship behav-
ior, and organizational performance: A relational perspective. Academy of Management Journal, 50,
558–577.
Van Scotter, J. R., & Motowidlo, S. J. (1996). Interpersonal facilitation and job dedication as separate facets of
contextual performance. Journal of Applied Psychology, 81, 525–531.
Venkataramani, V., & Dalal, R. S. (2007). Who helps and harms whom? Relational antecedents of interpersonal
helping and harming in organizations. Journal of Applied Psychology, 92, 952–966.
Wanberg, C. R., & Banas, J. (2000). Predictors and outcomes of openness to changes in a reorganizing workplace. Journal of Applied Psychology, 85, 132–142.
Wang, H., Law, K. S., Hackett, R. D., Wang, D., & Chen, Z. X. (2005). Leader-member exchange as a mediator
of the relationship between transformational leadership and followers’ performance and organizational
citizenship behavior. Academy of Management Journal, 48, 420–432.
Werner, J. M. (1994). Dimensions that make a difference: Examining the impact of in-role and extra-role
behaviors on supervisory ratings. Journal of Applied Psychology, 79, 98–107.
White, S. S., Mueller-Hanson, R. A., Dorsey, D. W., Pulakos, E. D., Wisecarver, M. M., Deagle, E. A., &
Mendini, K. G. (2005). Developing adaptive proficiency in Special Forces officers. (ARI Research Report
1831). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Whiting, S. W., Podsakoff, P. M., & Pierce, J. R. (2008). Effects of task performance, helping, voice, and orga-
nizational loyalty on performance appraisal ratings. Journal of Applied Psychology, 93, 125–139.
Williams, L. J., & Anderson, S. E. (1991). Job satisfaction and organizational commitment as predictors of
organizational citizenship and in-role behaviors. Journal of Management, 17, 601–617.
Wood, R. E., & Locke, E. A. (1990). Goal setting and strategy effects on complex tasks. In B. Staw & L. L.
Cummings (Eds.), Research in organizational behavior (Vol. 12, pp. 73–110). Greenwich, CT: JAI
Press.
Yang, J., Mossholder, K. W., & Peng, T. K. (2007). Procedural justice climate and group power distance: An
examination of cross-level interaction effects. Journal of Applied Psychology, 92, 681–692.
Adaptive and Citizenship-Related Behaviors at Work 487
Yun, S., Takeuchi, R., & Liu, W. (2007). Employee self-enhancement motives and job performance behaviors:
Investigating the moderating effects of employee role ambiguity and managerial perceptions of employee
commitment. Journal of Applied Psychology, 92, 745–756.
Zaccaro, S. J., & Banks, D. (2004). Leader Visioning and Adaptability: Bridging the gap between research and
practice on developing the ability to manage change. Human Resource Management, 43, 367–380.
This page intentionally left blank
23 Counterproductive Work Behavior and Withdrawal
Counterproductive work behavior (CWB) is an umbrella term that refers to a wide range of acts
conducted by employees that are harmful to organizations and their stakeholders. Whereas some
specific acts of CWB, most notably withdrawal behaviors such as absence and turnover, have been
investigated for decades, the emergence of the study of CWB as a broad class of behaviors is a
recent development. Unfortunately, the literature on behaviors that can be classified as CWB is
broad and disjointed and in need of better integration. As we will note, several terms have been
used to refer to conceptually distinct but operationally overlapping, if not identical, constructs
that are often studied in isolation from one another.
Our goal in this handbook chapter is to provide an integrative overview that links the various
forms of CWB that have been studied in the literature. We will begin with a historical overview
of the different concepts that can be subsumed under the CWB term.
Second, we will discuss measurement issues and how CWB has been studied. Third, we will dis-
cuss potential antecedents of CWB that arise from the work environment and the person. Fourth,
we will discuss potential consequences of CWB to organizations and stakeholders, including indi-
vidual employees, groups, and customers/clients. Finally, we will take a look forward and suggest
areas that need attention by CWB researchers.
NATURE OF CWB
DEFINITIONS
CWB is defined as intentional behavior by employees that harms or intends to harm organizations
and people in organizations, including employees and customers/clients (Spector & Fox, 2005).
Excluded are behaviors that are accidental or not under a person’s control. Having an accident would
not be CWB unless it occurred because of willful failure to follow required work protocols (e.g.,
failure to wear safety equipment), nor would poor performance be CWB if it were due to lack of
ability to perform job tasks despite efforts to do so. This distinguishes CWB from task performance
itself.
The CWB term reflects a category of disparate behaviors that share the characteristic of being
harmful. This includes nasty interpersonal behavior (e.g., insults and yelling at someone), behaviors
directed toward inanimate objects (e.g., sabotage and theft), purposely doing work incorrectly, and
withdrawal from the organization (e.g., absence or lateness). These varied behaviors fit within the
framework of CWB, but they are not all manifestations of the same thing. Furthermore, these vari-
ous classes of behavior have been studied in isolation and have their own literatures distinct from
CWB and related terms.
Research interest in these various behaviors was sporadic until at least 1980, with relatively few
empirical articles being found before that time. A keyword check of the PsycINFO database from
1865 until 2007 for the earliest empirical journal article on withdrawal finds a turnover study in
1925 (Bills) and an absence study in 1971 (Chadwick-Jones, Brown, Nicholson, & Sheppard). The
earliest theft study in the journal literature was Rosenbaum (1976). Entering the keyword "employee
absenteeism" returned 43 hits between 1865 and 1974 but 1,115 hits from 1975 to 2007; for turnover,
the numbers of hits were 70 and 1,910, respectively, for these time periods. Clearly, interest in these
topics has accelerated since the mid-1970s.
The idea of combining disparate acts into a single construct is fairly recent. Perhaps the earliest
empirical example was Spector (1975), who created a checklist of CWBs that he termed "organizational
aggression." Drawing on social psychological research on aggression, he argued
that these behaviors were a response to frustration at work. Hollinger and Clark (1982) applied the
sociological concept of deviance to the workplace, studying behavior that represented deviations
from the norms of organizations or society. Included were many of the same behaviors we define as
CWB. Few other researchers showed interest in these behaviors until the mid-1990s. Of particular
relevance is Robinson and Bennett's (1995) expanded discussion of the deviance concept, which
suggested that it is meaningful to divide these behaviors by target: the organization (subsequently
referred to as CWBO), which includes behaviors such as calling in sick when not ill or destroying
organizational property, versus people (subsequently referred to as CWBP), which includes behaviors
such as making nasty comments to coworkers or physical violence. Several researchers following Robinson and Bennett
adopted the deviance terminology in their work on behaviors that we classify here as CWB.
From 1995 on, there was an explosion of interest in harmful acts by employees and a proliferation
of terms and perspectives (see Pearson, Andersson, & Porath, 2000, and Spector & Fox,
2005, for contrasts among the various terms). Furthermore, whereas the early researchers focused
on actors, some of the new generation of researchers focused on the targets of these behaviors.
On the actor side, of particular note are the concepts of retaliation (Skarlicki & Folger, 1997),
which includes harmful acts in response to injustice, and revenge (Bies & Tripp, 1996), which
includes provoked behaviors intended to even the score. Other terms that can be found in the lit-
erature include antisocial behavior (Giacalone & Greenberg, 1997), counterproductive job perfor-
mance (Collins & Griffin, 1998), counterproductive employee behavior (Duffy, Ganster, & Shaw,
1998), counterproductive performance (Rotundo & Sackett, 2002), dysfunctional behavior (Griffin,
O’Leary-Kelly, & Collins, 1998), employee delinquency (Hogan & Hogan, 1989), and misbehavior
(Vardi & Weitz, 2002). A similar phenomenon has been studied as employee dishonesty (Jones,
1981) or integrity (Brown, Jones, Terris, & Steffy, 1987) from the perspective of employee selection
and for the purposes of weeding out job applicants who have the potential to engage in a variety of
CWBs. The term CWB itself seems to be catching on as a generic term that encompasses most of
these other terms. The term first appeared at about the same time in two independent lines of work.
Sackett and DeVore (2001) referred to counterproductive workplace behavior, which they viewed
as an aspect of job performance; specifically, it consists of intentional acts that run counter to
an organization's legitimate interests. Fox, Spector, and Miles (2001) took an employee perspective
in defining CWB as acts that harm organizations or people, whether or not those acts have
implications for performance.
On the target side, work in this area first began in Europe with research concerning abusive and
harmful behaviors that individuals endure, including bullying (Einarsen, Raknes, Matthiesen, &
Hellesoy, 1996) and mobbing (Knorz & Zapf, 1996). Although there are similarities in both con-
structs, the major difference is that bullying concerns the acts of a single actor directed toward one
or more targets, whereas mobbing concerns the acts of one or more actors directed toward a single
target (Zapf & Einarsen, 2005). In the United States, Tepper (2000) was among the first to study
abusive supervision, whereas at the same time Pearson, Andersson, and Porath (2000) introduced
the concept of workplace incivility, which is a milder and more ambiguous form of nasty interper-
sonal behavior.
Although the various terms for CWB are conceptually distinct, those distinctions
are not always maintained in researchers' operationalizations. Studies taking various perspectives
provide respondents with checklists of behaviors that they have performed or endured. Absent from
these checklists in most cases is reference to intentions, motives, or norms. A notable exception is
Skarlicki and Folger (1997), who specifically asked about acts of retaliation. Spector and Fox (2005,
p. 157) illustrated the overlap in behaviors by providing example items from measures of aggres-
sion, deviance, and retaliation that were highly similar. For example, each scale had an item about
stealing, starting and/or spreading rumors, speaking poorly about the employer, and calling in sick
when not ill. The extent to which individual acts might represent aggression, deviance, or retaliation
is not always clear.
DIMENSIONS OF CWB
Most studies of CWB and related global constructs (e.g., deviance or retaliation) have used mea-
sures of overall CWB. A handful of studies have followed Robinson and Bennett’s (1995) lead in
dividing these behaviors according to the target of organization versus person (e.g., Fox & Spector,
1999; Penney & Spector, 2005). On the target side, most studies of abuse, bullying, incivility, and
mobbing have used a single index rather than devising subdimensions. More recently Spector, Fox,
Penney, Bruursema, Goh, and Kessler (2006) argued that the individual behaviors subsumed by
CWB were too diverse to be reduced to a single construct. They divided an overall CWB scale
into five dimensions of abuse (nasty acts directed toward people), production deviance (purposely
doing work incorrectly or working slowly), sabotage (physical destruction/defacing of property),
theft (stealing property from organizations or people), and withdrawal (working fewer hours than
required). They showed that the various dimensions had different patterns of relationships with
other variables, suggesting they should be studied independently.
Sackett and DeVore (2001) listed an even finer-grained and expanded 11-category classification
of CWB identified by Gruys (1999): theft, property destruction, misuse of information,
misuse of resources, unsafe behavior, withdrawal, poor-quality work, alcohol use, drug use, inap-
propriate verbal actions, and inappropriate physical actions.
ASSESSMENT OF CWB
Except for studies of specific withdrawal behaviors such as absence or turnover, the predominant
method for studying CWB has been the self-report survey. Researchers have devised checklists of
behaviors that ask participants to indicate if or how often they engage in each of several specific
acts, or how often they have either observed or endured a list of acts. In most cases, individual
groups of researchers have devised their own checklists, although the specific items are quite
overlapping (cf. Spector & Fox, 2005, p. 157). Some articles list the items used, making it possible to
recreate the measures (e.g., Skarlicki & Folger, 1997). Bennett and Robinson (2000) described the
development of their scale to assess CWB (which they termed deviance) directed toward organiza-
tions versus people. Spector et al. (2006) provided a scale to assess the five dimensions of CWB
noted in the prior section.
Several researchers have used alternative sources of data on CWB, typically coworkers. For
example, Skarlicki and Folger (1997) asked coworkers to report on the participant’s CWB. Spector
and colleagues have done a series of studies in which participant self-reports were contrasted with
coworker reports of the participant’s CWB divided into organization versus person target (e.g.,
Bruk-Lee & Spector, 2006; Penney & Spector, 2005; Fox, Spector, Goh, & Bruursema, 2007). In all
but one case (organization-targeted CWB in Fox et al., 2007), correlations were significant between
the two sources, with a mean across studies of .29.
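Many of the meta-analytic values cited in this chapter, and throughout Tables 23.1 and 23.2, are mean sample-size-weighted correlations, in which each study's observed r is weighted by its sample size n before averaging so that larger studies count more. A minimal sketch of that aggregation, using hypothetical (r, n) values rather than data from any study reviewed here:

```python
# Bare-bones sample-size-weighted mean correlation: each study's
# observed r is weighted by its sample size n before averaging.
# The (r, n) pairs below are hypothetical, for illustration only.

def weighted_mean_r(studies):
    """studies: iterable of (r, n) tuples; returns the n-weighted mean r."""
    total_n = sum(n for _, n in studies)
    return sum(r * n for r, n in studies) / total_n

studies = [(0.32, 120), (0.25, 80), (0.29, 100)]  # hypothetical studies
print(round(weighted_mean_r(studies), 3))  # → 0.291
```

Fuller psychometric meta-analytic procedures additionally correct each r for artifacts such as measurement unreliability and range restriction, which is why some tabled values are labeled "corrected" rather than observed or uncorrected.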
For withdrawal behaviors studied separately, it is common to use organizational records on
absence and turnover, although some researchers have used self-reports. There is some evidence
suggesting that, at least for absence, there is reasonable convergence between self-reports and
records under some conditions (Johns, 1994). Self-reports become more attractive if absence is divided
into types, such as absence due to illness versus other reasons. When absence and other forms of
withdrawal are studied as forms of CWB, self-reports are typically used so individuals can report
on instances in which they purposely avoided work, such as calling in sick when not ill or taking
overly long breaks.
INDIVIDUAL DIFFERENCES
There has been considerable work showing a link between individual difference variables and
CWB. One stream of research has concerned the development and validation of integrity tests that
can be used to screen out CWB-prone individuals before they are hired. This work shows a link between
integrity, as defined by the tests, and CWB, but most of this literature does not identify specific
personality traits that might give insights into personality processes. Rather, test developers choose
disparate items based empirically on their relationships with CWB criteria. The conclusion from
this literature is that individual differences can predict CWB. A second stream is work linking spe-
cific personality characteristics to CWB. One substream links the five categories of the Five-Factor
Model to CWB, whereas another investigates individual personality traits. Finally, some studies
have included demographic characteristics to determine their relationship to CWB. A summary of
effect sizes reported in meta-analyses of antecedents can be found in Table 23.1 for CWB and Table
23.2 for absence and withdrawal behaviors.
FIGURE 23.1 General framework depicting various antecedents of CWB and withdrawal. [Figure placeholder: environment, attitudes and emotion, and individual differences shown as antecedents of CWB.]
TABLE 23.1
Summary of Effect Sizes Reported in Meta-Analyses of Antecedents of CWBs

Variable                      CWB     CWBP    CWBO    Author(s)                             Comment
Integrity tests
  Overt                       .39                     Ones, Viswesvaran, & Schmidt (1993)   Mean observed correlation
  Personality-based           .22                     Ones, Viswesvaran, & Schmidt (1993)   Mean observed correlation
Personality
  Conscientiousness           −.18                    Salgado (2000)                        Mean observed correlation
                              −.29                    Dalal (2005)                          Mean sample-size-weighted correlation
                                      −.19    −.34    Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
  Agreeableness               −.13                    Salgado (2000)                        Mean observed correlation
                                      −.36    −.25    Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
  Emotional stability         −.04                    Salgado (2000)                        Mean observed correlation
                                      −.20    −.19    Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
  Openness                    −.10                    Salgado (2000)                        Mean observed correlation
                                      −.07    −.03    Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
  Extraversion                −.01                    Salgado (2000)                        Mean observed correlation
                                      .02     −.07    Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
Attitudes
  Job satisfaction            −.29                    Dalal (2005)                          Mean sample-size-weighted correlation
                                      −.14    −.31    Hershcovis et al. (2007)              Mean uncorrected correlation
  Organizational commitment   −.28                    Dalal (2005)                          Mean sample-size-weighted correlation
  Negative affect             .34                     Dalal (2005)                          Mean sample-size-weighted correlation
                                      .22     .24     Hershcovis et al. (2007)              Mean uncorrected correlation
  Trait anger                         .37     .28     Hershcovis et al. (2007)              Mean uncorrected correlation
  Positive affect             −.28                    Dalal (2005)                          Mean sample-size-weighted correlation
Stressors
  Organizational constraints          .26     .31     Hershcovis et al. (2007)              Mean uncorrected correlation
  Interpersonal conflict              .40     .33     Hershcovis et al. (2007)              Mean uncorrected correlation
Justice
  Organizational              −.25                    Dalal (2005)                          Mean sample-size-weighted correlation
  Distributive                −.22                    Cohen-Charash & Spector (2001)        Mean sample-size-weighted correlation
                                      .12     .12     Hershcovis et al. (2007)              Mean uncorrected correlation
                                      .12     .10     Berry, Ones, & Sackett (2007)         Mean sample-size-weighted correlation
(continued)
Integrity Tests
There are several key papers/resources on the ability of integrity tests to predict CWB and with-
drawal behaviors. Ones, Viswesvaran, and Schmidt (1993) conducted the most comprehensive
meta-analysis of integrity test validities for predicting CWB and job performance, and Ones,
Viswesvaran, and Schmidt’s (2003) meta-analysis focused on predicting absenteeism. Sackett and
colleagues reviewed the state of research on integrity tests in two separate reviews and a book chap-
ter (Berry, Sackett, & Wiemann, 2007; Sackett & DeVore, 2001; Sackett & Wanek, 1996). As noted
in these reviews, larger validity coefficients for integrity tests are reported with self-report measures
of CWB and in concurrent rather than predictive validation designs (Sackett & DeVore, 2001). Integrity
tests can be designed to assess personality traits that relate to various CWBs (labeled personality-based
integrity tests) or to assess attitudes toward various forms of CWB, such as theft, along with
admissions of past CWB (labeled overt integrity tests; Berry et al., 2007).
Personality-based (.22) and overt (.39) integrity tests show meaningful mean observed correlations
with the broad construct of CWB (Ones et al., 1993; Sackett & DeVore, 2001) and absence criteria
(.25 for personality-based tests) (Ones et al., 2003). A summary of these relationships can be found
in Tables 23.1 and 23.2.
New research relating integrity tests to CWB involves laboratory work in settings where the CWB
of participants can be observed and recorded (Berry et al., 2007). This approach seeks to improve
on some of the limitations of traditional methods that involve self-reports of CWB and the potential
bias and unwillingness to admit to behaviors associated with these approaches (Berry et al., 2007).
As noted in Berry et al. (2007), the results of this research are mixed in that some studies reported
significant relationships between integrity test scores and CWB (Mikulay & Goffin, 1998; Nicol &
Paunonen, 2002), whereas other studies found no relationship (Hollwitz, 1998; Horn, Nelson, &
Brannick, 2004). Concerns have also been raised about the external validity of these research designs
and about the fidelity of the behaviors and settings involved (Berry et al., 2007).
TABLE 23.2
Summary of Effect Sizes Reported in Meta-Analyses of Antecedents of Withdrawal Behaviors

Variable              Turnover   Absence    Withdrawal   Author(s)                                       Comment
Integrity tests
  All tests                      .14                     Ones, Viswesvaran, & Schmidt (2003)             Mean observed correlation
  Overt                          .06                     Ones, Viswesvaran, & Schmidt (2003)             Mean observed correlation
  Personality-based              .25                     Ones, Viswesvaran, & Schmidt (2003)             Mean observed correlation
Stressors
  Hindrance           .18                   .17          Podsakoff, LePine, & LePine (2007)              Mean sample-size-weighted correlation
  Challenge           .04 (ns)              .06 (ns)     Podsakoff, LePine, & LePine (2007)              Mean sample-size-weighted correlation
Organizational commitment
  Affective           −.17       −.15       −.56         Meyer et al. (2002)                             Weighted average corrected correlation
  Normative           −.16       .05 (ns)   −.33         Meyer et al. (2002)                             Weighted average corrected correlation
  Continuance         −.10       .06        −.18         Meyer et al. (2002)                             Weighted average corrected correlation
Justice
  Distributive                              −.41         Colquitt, Conlon, Wesson, Porter, & Ng (2001)   Mean uncorrected population correlation
  Procedural                                −.36         Colquitt, Conlon, Wesson, Porter, & Ng (2001)   Mean uncorrected population correlation
Personality
Conscientiousness demonstrated a stronger relationship (mean sample-size-weighted correlation)
with CWBO (−.34) than with CWBP (−.19), whereas agreeableness demonstrated the opposite pattern,
relating more strongly to CWBP (−.36) than to CWBO (−.25) (Berry et al., 2007). The relationships
for emotional stability did not differ for CWBP (−.20) and CWBO (−.19). Less support was found
for openness or extraversion. Thus, from several meta-analytic findings, it can be concluded that
conscientiousness, agreeableness, and emotional stability tend to demonstrate the largest correlations
with CWB.
It is also useful to note that research has shown nonnegligible correlations between integrity test
scores and each of conscientiousness, agreeableness, and emotional stability (of the FFM personality
traits), leading some researchers to suggest that integrity test scores reflect a compound variable
composed of conscientiousness, agreeableness, and emotional stability (Ones & Viswesvaran,
1998). Hence, it is not surprising that integrity tests and these three personality traits correlate
significantly with CWB.
Although the preponderance of personality research on CWB over the years has focused on the
FFM, Lee et al. (2005a, 2005b) recently considered the role of another personality factor labeled
"honesty-humility." This factor is not part of the original FFM but has emerged as a sixth factor
in some lexical studies of personality. Honesty-humility includes adjectives such as honest, frank,
and truthful versus sly and calculating (Ashton, Lee, & Son, 2000). Research indicates that honesty-humility relates to CWB
with observed correlations ranging from −.34 to −.55 and explains incremental variance over the
FFM and integrity tests in predicting antisocial behavior and workplace delinquency (Lee, Ashton, &
de Vries, 2005; Lee, Ashton, & Shin, 2005; Marcus, Lee, & Ashton, 2007).
Cognitive Ability
Although the category of individual differences has been dominated by research on personality,
researchers have recently considered cognitive ability as a correlate. In a predictive validity study
involving law enforcement job applicants, Dilchert, Ones, Davis, and Rostow (2007) reported negative
relationships ranging from r = −.11 to r = −.20. On the other hand, Marcus and Schuler
(2004) were unable to find a significant correlation between cognitive ability and CWB.
Research in criminology has long indicated that men engage in more criminal behavior than
women, although the gender difference is smaller for less serious forms of crime (Steffensmeier &
Allan, 1996). Recent meta-analyses in the psychology and workplace deviance literatures found
support for some of these relationships and indicate that various demographic characteristics
relate to CWB. Specifically, older individuals, females, and individuals with more tenure and work
experience engage in less CWB (Berry et al., 2007). However, the relationships are not strong.
Furthermore, Hershcovis et al. (2007) found that females engaged in significantly less CWBP (mean
r = −.19) than males, but not less CWBO (mean r = −.11). Research also indicates that factors assessed
during adolescence, such as childhood conduct disorder (.17), predict CWB in adulthood (Roberts,
Harms, Caspi, & Moffitt, 2007).
Some research has also considered the fit between individuals' current jobs and their career
goals and how this fit, or lack of fit, may help explain individuals' tendencies toward CWB.
For example, Huiras, Uggen, and McMorris (2000) found support for the idea that the greater the fit
between a person's current job and his or her long-term career goals, the lower the CWB. Marcus and
Wagner (2007) reported that being in a preferred vocation has a negative and significant relationship
with CWB (r = −.19).
Research in the criminology and psychology literatures has shown that prior experience engag-
ing in criminal, aggressive, or deviant activities relates positively and significantly to CWB and
workplace errors. More specifically, Greenberg and Barling (1999) found that employees who had
a history of aggressive behavior are more likely to act aggressively toward coworkers. In an assess-
ment of 281 workplace violence incidents at a police department (ranging from obscene phone calls
to physical assaults), Scalora, Washington, Casady, and Newell (2003) reported that perpetrators in
the reported incidents who had prior criminal records were more likely to assault another employee.
Lastly, in a study of alcohol abuse among airline pilots, McFadden (1997) reported that pilots with
prior convictions for driving while intoxicated had a greater risk of pilot-error accidents.
Evidence showing a positive relationship between prior history of deviant behavior and sub-
sequent CWB would support the use of background checks as part of employee selection sys-
tems, especially because the cost to organizations of negligent hiring is high. Organizations
can be sued for negligent hiring if they hire someone who has a record of criminal or violent
behavior into a position that might put the public (especially those who are vulnerable such
as children or elderly), other employees, or the organization at risk (e.g., of theft or assault;
Connerley, Arvey, & Bernardy, 2001). The scope of background checks can vary from confirming
prior employment or education to drug testing, criminal record checks, or motor vehicle
record checks. Research on the use of background checks in personnel selection is limited, and
the research that does exist raises some questions about their reliability and validity. For example,
there is variability across states in the content of criminal records (e.g., whether or not less serious
offenses are recorded) and in how complete and up to date the records are (Harris & Keller, 2005).
Hence, more research is needed on the psychometric properties of background checks.
On the other hand, those high in trait anger are likely to respond emotionally to even minor
provocations and may have trouble controlling associated impulses to strike out at the organiza-
tion or other people.
Demographic characteristics have a smaller relationship with CWB, whereas prior history of
engaging in various forms of deviance or criminal activity shows positive and significant
relationships with CWB. Furthermore, the fit between a person's current job and career
goals can be helpful in explaining CWB. Moving forward, researchers should move beyond assessing
simple linear relationships between these individual difference characteristics and CWB; more
fruitful research would investigate the conditions under which individuals who are predisposed to
engage in CWB and withdrawal actually do engage in these behaviors.
ENVIRONMENT
There are several environmental conditions that have been linked to CWB. Some of the original
work on human aggression suggested that frustrating situations, or things that interfere with
ongoing behavior and future goals, are important precursors to aggression, although more recent
work has expanded to include various other experiences that are stressful, represent attacks on the
individual, or are violations of norms (Neuman & Baron, 1997). In the following sections we
discuss environmental conditions that have been linked to CWB, including job stressors, justice, and
leadership.
STRESSORS
Stress has become an increasingly important topic of research over the years because of its poten-
tially negative consequences for employees and organizations. It has also been considered as a fac-
tor that explains CWB and withdrawal behaviors (Neuman & Baron, 1997; Spector & Fox, 2005).
Included are interpersonal stressors such as conflict, poor relationships with supervisors/coworkers,
and workplace abuse and incivility. Also important are organizational or task-related stressors such
as perceptions of job insecurity, organizational change, poor working conditions, organizational
constraints, role conflict, role ambiguity, and work overload. The preponderance of research has
reported that various stressors do relate to increased CWB (see Table 23.1 for a summary).
The most studied interpersonal stressor in relation to CWB has been interpersonal conflict among
employees. Hershcovis et al. (2007) conducted a meta-analysis that showed a .33 correlation (uncor-
rected) between conflict and CWBO and a significantly larger .40 correlation between conflict and
CWBP. The latter relationship should not be surprising because some of the behaviors people often
engage in during conflicts are forms of CWBP, such as insulting or undermining others. However,
measures of conflict used in these studies focus more on what the person has experienced from
others rather than what they have done themselves. More surprising is the possibility that conflict
at work results in CWB directed toward the organization itself. A study by Bruk-Lee and Spector
(2006) may help explain this phenomenon. They found that the person with whom employees have
conflict might determine the target of their CWB. Conflict with coworkers related more strongly
with CWBP than CWBO (r = .45 vs. r = .23, respectively), whereas the opposite was the case for
conflict with supervisors (r = .14 vs. r = .21, respectively). These authors suggested that employees
are likely to direct their CWB toward organizations when having conflicts with supervisors because
the supervisor is seen as an agent of the organization. It is also likely that inanimate targets are seen
as safer, because CWB directed toward the supervisor might not go unpunished.
Whereas conflicts involve two-way interactions among individuals, abusive behavior and inci-
vility are one-way exchanges in which the employee is subject to stressful mistreatment by others.
At times this might result in conflict, but in many cases, the employee is the almost passive target
of mistreatment. Abusive supervisors might yell, scream, or use power to mistreat and intimidate
employees (Ashforth, 1997; Keashly, 1998). Research has shown that abusive supervision has a more
pronounced effect on supervisor-directed CWB, particularly for individuals who score high on negative
reciprocity (Mitchell & Ambrose, 2007). Other research found that abusive supervision and managerial
oversight increase CWB (Detert, Treviño, Burris, & Andiappan, 2007). Incivility, which involves
rude and unpleasant behavior that might in some cases be unintentional, has also been linked to
CWBO and CWBP (Goh, 2007; Penney & Spector, 2005).
The organizational stressor that comes closest to the classic definition of environmental frustra-
tion is organizational constraints, which are things in the work environment that interfere with task
performance. Hershcovis et al.’s (2007) meta-analysis found that organizational constraints corre-
lated significantly more strongly with CWBO (mean r = .31) than CWBP (mean r = .26). Workload
has been shown to correlate significantly with overall CWB (Goh, 2007; Miles et al., 2002) but
not CWBP (Chen & Spector, 1992). However, Galperin and Burke (2006) found that employees
who experienced workaholism but were highly involved in their jobs and felt driven to perform
actually engaged in less CWB (they used the term deviance). Role ambiguity and role conflict are
also significantly related to CWB (Chen & Spector, 1992).
Organizational restructuring and mergers appear to define the organizational landscape, making
perceptions of job insecurity and organizational politics common organizational stressors. Research
has shown a negative relationship between job insecurity and CWB (−.21): the more people perceive
their job to be at risk, the less likely they are to engage in CWB (Probst, Stewart, Gruys, &
Tierney, 2007). Although this
finding may appear counterintuitive, it could be the case that individuals who fear that they may
Counterproductive Work Behavior and Withdrawal 501
lose their job as a result of being downsized or outsourced may avoid engaging in negative behaviors
that might increase their chances of being the ones who lose their jobs. A study of part-time work-
ers found that organizational politics related positively to psychological withdrawal (.26; daydream,
chat with coworker about non-work-related subjects) and antagonistic work behaviors (.23; argue
with coworker, gossip) (Cropanzano, Howes, Grandey, & Toth, 1997). However, politics was no
longer significant after controlling for organizational support.
ORGANIZATIONAL JUSTICE
Perceptions of injustice have also been prominent in research on CWB. The negative consequences
of employee perceptions of injustice have been the focus of a great deal of justice research leading
to four meta-analyses in which organizational justice-CWB/withdrawal coefficients were analyzed
(Berry et al., 2007; Cohen-Charash & Spector, 2001; Dalal, 2005; Hershcovis et al., 2007). Strong
support was reported for organizational justice (−.25, sample-size-weighted correlation; Dalal,
2005), procedural (−.28), and distributive justice (−.22) in explaining CWB (Cohen-Charash &
Spector, 2001). The relationships for procedural and distributive justice on CWB did not differ for
CWBP and CWBO. However, the relationship for interpersonal justice was lower for CWBO (−.06)
compared with CWBP (−.17), as was the relationship for interactional justice and CWBO (−.18)
compared with CWBP (−.22) (Berry et al., 2007; Hershcovis et al., 2007). Meta-analytic findings
also indicate that distributive (r = −.41) and procedural justice (r = −.36) relate to withdrawal behav-
iors (Colquitt, Conlon, Wesson, Porter, & Ng, 2001). These results are summarized in Tables 23.1
and 23.2.
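Several of the coefficients above (e.g., Dalal, 2005) are sample-size-weighted mean correlations, in which each study's r is weighted by its sample size before averaging. A minimal sketch of that computation, using made-up study values rather than data from any of the cited meta-analyses:

```python
# Sample-size-weighted mean correlation (the "bare-bones" meta-analytic
# average): each study's r is weighted by its sample size n.
# The (n, r) pairs below are illustrative only, not from the cited studies.
def weighted_mean_r(studies):
    """studies: iterable of (n, r) pairs -> sample-size-weighted mean r."""
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

studies = [(120, -0.30), (80, -0.20), (200, -0.25)]
print(round(weighted_mean_r(studies), 3))  # -0.255
```

Larger samples thus pull the average toward their estimates, which is why meta-analyses report the weighted rather than the simple mean of study correlations.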
Some researchers have also studied mediators or moderators of the relationship between stres-
sors, organizational justice, and CWB. In a laboratory study, Colquitt, Scott, Judge, and Shaw
(2006) considered trust propensity, risk aversion, and morality as moderators of the relationship
between justice and CWB and reported significant interaction effects for trust propensity and risk
aversion. Cohen-Charash and Mueller (2007) considered envy and self-esteem as moderators and
reported that high perceived unfairness combined with high envy triggered more CWB, whereas
envy did not seem to matter in conditions of low perceived unfairness. In a second study, they found
support for a three-way interaction in which more CWB occurred when perceived unfairness, envy,
and self-esteem were all high.
LEADERSHIP
One might expect that the behavior of leaders would have an impact on their followers’ CWB. The
most studied leadership variable in relation to CWB has been charisma. Most studies have shown
that leaders who are perceived to be high on charisma have subordinates who report low frequencies
of CWB (e.g., Brown & Treviño, 2006; Hepworth & Towler, 2004; Judge et al., 2006). One
exception is Marcus and Schuler (2004), who failed to find this relationship. Bruursema and Spector
(2005) found that transformational and transactional leadership related oppositely to CWB, with
transactional leadership being associated with more frequent CWB. They also found that cowork-
er-reported transactional but not transformational leadership was significantly related to CWB of
the study’s subjects. Kwok, Au, and Ho (2005) found that when supervisors reacted negatively to
employee CWB (which they labeled formal normative control), employee CWB decreased even
when the number of employees in the group was large.
GROUP INFLUENCE
Individual behavior can be greatly affected by the behavior of coworkers, often transmitted through
norms for appropriate and inappropriate behavior. This possibility is supported by Robinson and
O’Leary-Kelly (1998), who showed that individuals in workgroups were more likely to engage in
502 Handbook of Employee Selection
CWB if coworkers were engaging in those behaviors. Dunlop and Lee (2004) investigated the fre-
quency of CWB in workgroups, finding more between- than within-workgroup variance, suggesting
that CWB can be a group-level phenomenon.
Finally, locus of control was shown to moderate the relationship between experienced frustration
and CWB (Storms & Spector, 1987).
CONSEQUENCES
Very limited research has considered the consequences of CWB, probably because it is assumed
that the outcomes of CWB are negative regardless of the type of CWB that occurs. Nevertheless, it
is useful to study these outcomes because CWBs can impact various stakeholders, including organi-
zations, employees, coworkers, and customers, and the consequences can be severe. For example, in
the food industry it has been shown that employee CWB can affect organization-level outcomes. In
a study of staff and supervisors from an Australian fast-food chain, Dunlop and Lee (2004) reported
that CWBs engaged in by staff members resulted in lower supervisor-rated business unit performance,
longer drive-through service time, and higher unexplained food figures. Detert, Treviño,
Burris, and Andiappan (2007) found that CWB in restaurants (measured as food loss) negatively
impacted restaurant profitability and customer satisfaction.
Employee theft continues to cost organizations and even consumers billions of dollars. For some
organizations it may even result in bankruptcy, and for consumers it may result in higher prices on
goods and services (Langton & Hollinger, 2005; Payne & Gainey, 2004). Popular strategies that are
believed to reduce theft are not always effective. Research indicates that firms with higher shrinkage
are more likely than firms with lower shrinkage to use security tags, closed-circuit television,
observation mirrors, newsletters, anonymous telephone lines, and loss-prevention surveys, even
though these devices are believed to deter theft. Firms with low shrinkage, in contrast, are more likely
to employ criminal background checks, driving history checks, and drug screening (Langton &
Hollinger, 2005). These firms also report lower turnover among sales associates and management
and employ fewer part-time workers (Langton & Hollinger, 2005). Thus, strategies that
seek to prevent high-risk employees from joining the firm appear to be associated with low shrinkage
rates, whereas asset control strategies appear to be less effective.
Service sabotage is a form of sabotage committed by employees who work in service settings.
It can consist of deliberate negative behavior directed at customers such as taking revenge, slow-
ing down the service, or ignoring rules (Harris & Ogbonna, 2002). Although it would appear that
this type of sabotage is likely to negatively impact the quality of the service that customers receive
and eventually firm performance, it has been shown that service sabotage can have positive effects
on the employees who engage in it because it raises their status, self-esteem, and job satisfaction
and even lowers their stress (Harris & Ogbonna, 2002). Research found support for these relation-
ships and also showed that employees who engage in this type of sabotage report lower perceptions
of customer service, lower perceptions of employee-customer rapport, and lower perceptions of
company performance (Harris & Ogbonna, 2006). Hence, future research may want to study more
directly what employees perceive to be the benefits of engaging in various forms of CWB. Whereas
in the case of merchandise theft the rewards may be obvious, other forms of CWB may present
psychological or emotional benefits. For example, the research on revenge and retaliation (Bies &
Tripp, 1996; Skarlicki & Folger, 1997) suggests that there are some benefits (e.g., evening the score),
but there may be others to consider.
As noted earlier, withdrawal can take many forms, including psychological withdrawal, late-
ness, absence, and actual turnover. Traditionally, measuring the consequences of withdrawal
focuses primarily on the financial costs associated with the loss in productivity of the absent
employee. However, recent research has considered a wider range of consequences spanning
economic, financial, and noneconomic domains. More specifically, the indirect costs associated with how
the withdrawal behavior affects coworkers and their propensity to engage in similar forms of
withdrawal behaviors have been considered along with the indirect costs that arise when the
withdrawal behavior escalates to other more serious forms of withdrawal and eventually turnover
(Birati & Tziner, 1996). Sagie, Birati, and Tziner (2002) computed the total combined cost of dif-
ferent forms of withdrawal across employees in a medium-sized, Israeli high-tech company and
reported an estimate of $2.8 million (U.S.). This estimate includes the costs of turnover, absence,
lateness, replacement costs, and withholding effort, as well as psychological withdrawal in the
form of low performance.
Several of the antecedents of CWB that were summarized in the preceding sections may also
become consequences of CWB. As noted earlier, the causal direction of the relationship between
various antecedents of CWB and outcomes is not always clear. Organizational commitment, job
satisfaction, and perceptions of injustice may not only influence the extent to which individuals
engage in negative behaviors but these attitudes and perceptions may also change as a result of the
CWBs experienced by employees. That is, a highly satisfied and committed employee may become
less satisfied and committed if he or she suddenly or continuously is a victim of or observes CWB
by others, especially if the CWB is not addressed or dealt with by the leadership. Research has only
begun to consider these potential outcomes and indicates that employees who have been physically
attacked, been threatened, or experienced abusive supervision in the workplace report lower levels of
job or life satisfaction, lower commitment, greater job stress, lower physical and psychological well-
being, and are more likely to consider changing jobs or leaving the organization (Budd, Arvey, &
Lawless, 1996; Rogers & Kelloway, 1997; Tepper, 2000). Lapierre, Spector, and Leck (2005) con-
sidered the impact of sexual and nonsexual aggression on job satisfaction in a meta-analysis and
reported sizeable mean sample-size-weighted correlations ranging from −.29 (nonsexual aggression
male-only sample) to −.41 (nonsexual aggression female-only sample). Employees who reported
experiencing a personal offense were more likely to seek revenge if they blamed the offender, if the
offender’s status was lower than their own, and in general for lower- compared with higher-status
employees (Aquino, Tripp, & Bies, 2001).
In addition to negatively impacting the job attitudes of coworkers who are victims of their CWB,
employees who engage in CWB also run the risk of receiving lower ratings of overall job performance
from their managers. Rotundo and Sackett (2002) reported that employee CWBs lowered
managers’ global ratings of their job performance, explaining on average 30% of the variation in these
ratings. These results generalized partially to Chinese managers, for whom on average 17% of the
variation in ratings of overall job performance was explained by the employee’s CWB (Rotundo &
Xie, 2008).
Employees who work for institutions that are suspected of and fined for white-collar crime
(as a result of the CWB of key executives or leaders) suffer numerous negative consequences. In
a case study of the closure of a bank that was suspected of false accounting, money laundering,
fraud, and conspiracy, employees who worked at the bank reported significant financial, emotional,
and psychological effects of the bank’s closure (Spalek, 2001). For example, employees
lost their jobs and access to their own funds and mortgage accounts, and they experienced depression,
anxiety, and anger as a result of the crimes and subsequent bank closure (Spalek, 2001). Some
employees even found it difficult to find employment because of the stigma of being associated
with the bank (Spalek, 2001).
Individuals may have a history of engaging in negative behaviors over the course of their lives,
and a question that arises is the extent to which early-life rule-breaking behavior predicts
outcomes later in life. In a longitudinal twin study, Avolio, Rotundo, and Walumbwa (2009)
found that serious rule-breaking behavior (e.g., being picked up by the police, theft, drug use)
engaged in before high school had a significant negative relationship with holding leadership
roles in adulthood, whereas modest rule-breaking (e.g., intentionally breaking a window, skipping
school without parental permission) had a significant positive relationship with leadership roles.
These preliminary results suggest that individuals who question and challenge institutional
boundaries early in life, to the point of breaking rules without going beyond the limits of the law,
may not deprive themselves of important career opportunities or leadership positions later in life.
FUTURE DIRECTIONS
Interest in the study of CWB is a relatively recent phenomenon, with absenteeism, turnover, and
theft attracting the earliest attention. More recently, aggression, deviance, and dishonesty along
with various related behaviors and constructs have been added to the mix and can be grouped under
the broad umbrella of CWB. The previous sections indicate that a significant amount of research
has emerged that seeks to clarify the definition and dimensionality of this construct as well as envi-
ronmental and individual antecedents of CWB. Less research has focused on understanding the
impact that these behaviors have on organizations, employees, coworkers, and customers and the
mechanisms through which CWBs have this effect. Almost all of this research is cross-sectional
in nature, in which antecedents and outcomes are assessed at the same time (with the exception of
some integrity test validation studies), limiting conclusions about the causal order of these variables.
It is well known that employees’ perceptions, attitudes, and behaviors are continuously evolving and
are influenced by daily events at work and their history with the employer. Hence, future research
needs to capture these dynamics in the measurement of CWB and its various antecedents and
consequences.
The global economy adds diversity of cultures to the workplace and raises questions about the
generalizability of many of the relationships summarized in this chapter that come almost exclu-
sively from North America. Some research has begun to study CWBs in other cultures (e.g., Lee
et al., 2005a; Lee et al., 2005b; Marcus et al., 2007; Rotundo & Xie, 2008). More specifically, Rotundo
and Xie (2008) investigated the dimensionality of CWBs in China using samples of Chinese manag-
ers and professionals. They reported a two-dimensional solution, in which one dimension is defined
by the task relevance of the CWB and the second dimension by the interpersonal/organizational
target of the CWB. These results are similar to what has been reported in prior research conducted
in Western cultures (Bennett & Robinson, 2000; Gruys & Sackett, 2003).
As noted throughout, the targets of CWB that have been studied so far include individuals
(in the form of CWBP) or organizations (in the form of CWBO). However, groups serve an
important function in the workplace, and a question that arises is whether workgroups (or their
individual members) can themselves become targets of CWB, which we subsequently label
“intergroup CWB.” An important avenue for future research on CWB is to consider factors that
influence group members to engage in CWB directed at another workgroup or its members. Enns
(2007) provided a theoretical account of intergroup CWB engaged in by employees on the basis of
their identification with and membership in a workgroup. Enns and Rotundo (2007) reported that
participants who are members of workgroups express a greater willingness to engage in intergroup
CWB in conditions of realistic conflict and relative deprivation and that self-categorization moder-
ates this relationship. Future research can continue to study the antecedents and consequences of
intergroup CWB.
CONCLUSIONS
Given the paucity of research on CWB prior to the 1980s, it is difficult to know the extent to which it
is a growing problem. However, what is certain is that awareness of CWB has grown in the research
literature and popular press. Terms referring to subsets of CWB, such as “desk rage,” can be seen
in the media today. Articles on the topic can be found in all of the major journals with increasing
frequency.
What we have learned is that CWB is associated with many negative experiences at work,
including injustice and job stressors such as interpersonal conflict and organizational constraints.
Furthermore, there are considerable individual differences in employee tendencies to engage in
such behavior. For example, individuals high on hostile attribution bias, trait anger, and trait anxi-
ety are prone to CWB. The limited work that has been done shows connections between CWB
and negative consequences for employees and organizations. The body of work that has emerged
over the decades provides important insight into likely antecedents and consequences of CWB.
Nevertheless, numerous questions remain unanswered and the behaviors still occur. Thus, the need
for research continues.
REFERENCES
Ackerman, P. L., & Heggestad, E. D. (1997). Intelligence, personality, and interests: Evidence for overlapping
traits. Psychological Bulletin, 121, 219–245.
Aquino, K., Tripp, T. M., & Bies, R. J. (2001). How employees respond to personal offense: The effects of
blame attribution, victim status, and offender status on revenge and reconciliation in the workplace.
Journal of Applied Psychology, 86, 52–59.
Ashforth, B. E. (1997). Petty tyranny in organizations: A preliminary examination of antecedents and conse-
quences. Canadian Journal of Administrative Sciences, 14, 126–140.
Ashton, M. C., Lee, K., & Son, C. (2000). Honesty as the sixth factor of personality: Correlations with
Machiavellianism, primary psychopathy, and social adroitness. European Journal of Personality, 14,
359–369.
Avolio, B. J., Rotundo, M., & Walumbwa, F. (2009). Early life experiences as determinants of leadership role occu-
pancy: The importance of parental influence and rule breaking behavior. Leadership Quarterly, 20, 329–342.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis.
Personnel Psychology, 44, 1–26.
Bennett, R. J., & Robinson, S. L. (2000). Development of a measure of workplace deviance. Journal of Applied
Psychology, 85, 349–360.
Berry, C. M., Ones, D. S., & Sackett, P. R. (2007). Interpersonal deviance, organizational deviance, and their
common correlates: A review and meta-analysis. Journal of Applied Psychology, 92, 410–424.
Berry, C. M., Sackett, P. R., & Wiemann, S. (2007). A review of recent developments in integrity test research.
Personnel Psychology, 60, 271–301.
Bies, R. J., & Tripp, T. M. (1996). Beyond distrust: “Getting even” and the need for revenge. In R. M. Kramer &
T. R. Tyler (Eds.), Trust in organizations: Frontiers of theory and research (pp. 246–260). Thousand
Oaks, CA: Sage.
Bills, M. A. (1925). Social status of the clerical worker and his permanence on the job. Journal of Applied
Psychology, 9, 424–427.
Bing, M. N., Stewart, S. M., Davison, H. K., Green, P. D., McIntyre, M. D., & James, L. R. (2007). An integra-
tive typology of personality assessment for aggression: Implications for predicting counterproductive
workplace behavior. Journal of Applied Psychology, 92, 722–744.
Birati, A., & Tziner, A. (1996). Withdrawal behavior and withholding efforts at work (WBWEW): Assessing
the financial cost. Human Resource Management Review, 6, 305–314.
Brown, M. E., & Treviño, L. K. (2006). Socialized charismatic leadership, values congruence, and deviance in
work groups. Journal of Applied Psychology, 91, 954–962.
Brown, T. S., Jones, J. W., Terris, W., & Steffy, B. D. (1987). The impact of pre-employment integrity testing on
employee turnover and inventory shrinkage losses. Journal of Business and Psychology, 2, 136–149.
Bruk-Lee, V., & Spector, P. E. (2006). The social stressors-counterproductive work behaviors link: Are conflicts
with supervisors and coworkers the same? Journal of Occupational Health Psychology, 11, 145–156.
Bruursema, K. (2007). How individual values and trait boredom interface with job characteristics and job bore-
dom in their effects on counterproductive work behavior. ProQuest Information & Learning. Dissertation
Abstracts International: Section B: The Sciences and Engineering, 68 (4-B) (Electronic; Print).
Bruursema, K., & Spector, P. E. (2005, April). Leadership style and the link with counterproductive work
behavior (CWB). Paper presented at the 20th Annual Conference of the Society for Industrial and
Organizational Psychology, Los Angeles, CA.
Budd, J. W., Arvey, R. D., & Lawless, P. (1996). Correlates and consequences of workplace violence. Journal
of Occupational Health Psychology, 1, 197–210.
Chadwick-Jones, J. K., Brown, C. A., Nicholson, N., & Sheppard, C. (1971). Absence measures: Their reliability
and stability in an industrial setting. Personnel Psychology, 24, 463–470.
Chen, P. Y., & Spector, P. E. (1992). Relationships of work stressors with aggression, withdrawal, theft and
substance use: An exploratory study. Journal of Occupational and Organizational Psychology, 65(3),
177–184.
Cohen-Charash, Y., & Mueller, J. S. (2007). Does perceived unfairness exacerbate or mitigate interpersonal
counterproductive work behaviors related to envy? Journal of Applied Psychology, 92, 666–680.
Cohen-Charash, Y., & Spector, P. E. (2001). The role of justice in organizations: A meta-analysis. Organizational
Behavior and Human Decision Processes, 86, 278–321.
Colbert, A. E., Mount, M. K., Harter, J. K., Witt, L. A., & Barrick, M. R. (2004). Interactive effects of person-
ality and perceptions of the work situation on workplace deviance. Journal of Applied Psychology, 89,
599–609.
Collins, J. M., & Griffin, R. W. (1998). The psychology of counterproductive job performance. In R. W. Griffin,
A. O’Leary-Kelly, & J. M. Collins (Eds.), Dysfunctional behavior in organizations: Violent and deviant
behavior (pp. 219–242). New York, NY: Elsevier Science/JAI Press.
Colquitt, J. A., Conlon, D. E., Wesson, M. J., Porter, C. O. L. H., & Ng, K. Y. (2001). Justice at the millennium:
A meta-analytic review of 25 years of organizational justice research. Journal of Applied Psychology,
86, 425–445.
Colquitt, J. A., Scott, B. A., Judge, T. A., & Shaw, J. C. (2006). Justice and personality: Using integrative theories
to derive moderators of justice effects. Organizational Behavior and Human Decision Processes, 100,
110–127.
Connerley, M. L., Arvey, R. D., & Bernardy, C. J. (2001). Criminal background checks for prospective and
current employees: Current practices among municipal agencies. Public Personnel Management, 30,
173–183.
Cropanzano, R., Howes, J. C., Grandey, A. A., & Toth, P. (1997). The relationship of organizational politics and
support to work behaviors, attitudes, and stress. Journal of Organizational Behavior, 18, 159–180.
Dalal, R. S. (2005). A meta-analysis of the relationship between organizational citizenship behavior and
counterproductive work behavior. Journal of Applied Psychology. Special Section: Theoretical Models
and Conceptual Analyses—Second Installment, 90, 1241–1255.
Detert, J. R., Treviño, L. K., Burris, E. R., & Andiappan, M. (2007). Managerial modes of influence and
counterproductivity in organizations: A longitudinal business-unit-level investigation. Journal of Applied
Psychology, 92, 993–1005.
Dilchert, S., Ones, D. S., Davis, R. D., & Rostow, C. D. (2007). Cognitive ability predicts objectively measured
counterproductive work behaviors. Journal of Applied Psychology, 92, 616–627.
Douglas, S. C., & Martinko, M. J. (2001). Exploring the role of individual differences in the prediction of
workplace aggression. Journal of Applied Psychology, 86, 547–559.
Duffy, M. K., Ganster, D. C., & Shaw, J. D. (1998). Positive affectivity and negative outcomes: The role of
tenure and job satisfaction. Journal of Applied Psychology, 83, 950–959.
Dunlop, P. D., & Lee, K. (2004). Workplace deviance, organizational citizenship behavior, and business unit
performance: The bad apples do spoil the whole barrel. Journal of Organizational Behavior, 25, 67–80.
Einarsen, S., Raknes, B. I., Matthiesen, S. B., & Hellesoy, O. H. (1996). Helsemessige aspekter ved mobbing
i arbeidslivet. modererende effekter av sosial stotte og personlighet. [The health-related aspects of
bullying in the workplace: The moderating effects of social support and personality.] Nordisk Psykologi,
48, 116–137.
Enns, J. R. (2007). The roles of realistic conflict and relative deprivation in explaining counterproductive
work behavior. ProQuest Information & Learning. Dissertation Abstracts International: Section B: The
Sciences and Engineering, 68 (1-B) (Electronic; Print).
Enns, J. R., & Rotundo, M. (2007). Intergroup counterproductive work behavior: Effects of conflict and
self-categorization. Paper presented at the 67th Annual Meeting of the Academy of Management, Phil-
adelphia, PA.
Fox, S., & Spector, P. E. (1999). A model of work frustration-aggression. Journal of Organizational Behavior,
20, 915–931.
Fox, S., & Spector, P. E. (Eds.). (2005). Counterproductive work behavior: Investigations of actors and targets.
Washington, DC: American Psychological Association.
Fox, S., Spector, P. E., Goh, A., & Bruursema, K. (2007). Does your coworker know what you’re doing?
Convergence of self- and peer-reports of counterproductive work behavior. International Journal of
Stress Management, 14, 41–60.
Fox, S., Spector, P. E., & Miles, D. (2001). Counterproductive work behavior (CWB) in response to job stressors
and organizational justice: Some mediator and moderator tests for autonomy and emotions. Journal of
Vocational Behavior, 59, 291–309.
Galperin, B. L., & Burke, R. J. (2006). Uncovering the relationship between workaholism and workplace
destructive and constructive deviance: An exploratory study. International Journal of Human Resource
Management, 17, 331–347.
Giacalone, R. A., & Greenberg, J. (Eds.). (1997). Antisocial behavior in organizations. Thousand Oaks, CA:
Sage.
Goh, A. (2007). An attributional analysis of counterproductive work behavior (CWB) in response to occupa-
tional stress. ProQuest Information & Learning. Dissertation Abstracts International: Section B: The
Sciences and Engineering, 68 (4-B) (Electronic; Print).
Greenberg, L., & Barling, J. (1999). Predicting employee aggression against coworkers, subordinates and
supervisors: The roles of person behaviors and perceived workplace factors. Journal of Organizational
Behavior, 20, 897–913.
Griffin, R. W., O’Leary-Kelly, A., & Collins, J. M. (Eds.). (1998). Dysfunctional behavior in organizations:
Violent and deviant behavior. New York, NY: Elsevier Science/JAI Press.
Gruys, M. L. (2000). The dimensionality of deviant employee behavior in the workplace. ProQuest
Information & Learning. Dissertation Abstracts International: Section B: The Sciences and Engineering,
61 (5-B) (Electronic; Print).
Gruys, M. L., & Sackett, P. R. (2003). Investigating the dimensionality of counterproductive work behavior.
International Journal of Selection and Assessment, 11, 30–42.
Harris, L. C., & Ogbonna, E. (2002). Exploring service sabotage: The antecedents, types and consequences of
frontline, deviant, antiservice behaviors. Journal of Service Research, 4(3), 163–183.
Harris, L. C., & Ogbonna, E. (2006). Service sabotage: A study of antecedents and consequences. Journal of
the Academy of Marketing Science, 34, 543–558.
Harris, P. M., & Keller, K. S. (2005). Ex-offenders need not apply: The criminal background check in hiring
decisions. Journal of Contemporary Criminal Justice, 21, 6–30.
Hepworth, W., & Towler, A. (2004). The effects of individual differences and charismatic leadership on workplace
aggression. Journal of Occupational Health Psychology, 9, 176–185.
Hershcovis, M. S., Turner, N., Barling, J., Arnold, K. A., Dupré, K. E., Inness, M., et al. (2007). Predicting
workplace aggression: A meta-analysis. Journal of Applied Psychology, 92, 228–238.
Hickman, M., & Piquero, A. (2001). Exploring the relationships between gender, control balance, and deviance.
Deviant Behavior, 22, 323–351.
Hogan, J., & Hogan, R. (1989). How to measure employee reliability. Journal of Applied Psychology, 74,
273–279.
Hollinger, R. C., & Clark, J. P. (1982). Formal and informal social controls of employee deviance. Sociological
Quarterly, 23, 333–343.
Hollinger, R. C., Slora, K. B., & Terris, W. (1992). Deviance in the fast-food restaurant: Correlates of employee
theft, altruism, and counterproductivity. Deviant Behavior, 13, 155–184.
Hollwitz, J. C. (1999). Investigations of a structured interview for pre-employment integrity screening. ProQuest
Information & Learning. Dissertation Abstracts International: Section B: The Sciences and Engineering,
60 (1-B).
Horn, J., Nelson, C. E., & Brannick, M. T. (2004). Integrity, conscientiousness, and honesty. Psychological
Reports, 95, 27–38.
Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validi-
ties of personality constructs and the effect of response distortion on those validities. Journal of Applied
Psychology, 75, 581–595.
Huiras, J., Uggen, C., & McMorris, B. (2000). Career jobs, survival jobs, and employee deviance: A social
investment model of workplace misconduct. Sociological Quarterly, 41, 245–263.
Johns, G. (1994). How often were you absent? A review of the use of self-reported absence data. Journal of
Applied Psychology, 79, 574–591.
Jones, J. W. (1981). Dishonesty, burnout, and unauthorized work break extensions. Personality and Social
Psychology Bulletin, 7, 406–409.
Judge, T. A., LePine, J. A., & Rich, B. L. (2006). Loving yourself abundantly: Relationship of the narcissistic
personality to self- and other perceptions of workplace deviance, leadership, and task and contextual
performance. Journal of Applied Psychology, 91, 762–776.
Keashly, L. (1998). Emotional abuse in the workplace: Conceptual and empirical issues. Journal of Emotional
Abuse, 1, 85–117.
Knorz, C., & Zapf, D. (1996). Mobbing—eine extreme form sozialer stressoren am arbeitsplatz. [Mobbing:
A severe form of social stressors at work.] Zeitschrift Für Arbeits- Und Organisationspsychologie, 40,
12–21.
Kwok, C., Au, W. T., & Ho, J. M. C. (2005). Normative controls and self-reported counterproductive behaviors
in the workplace in china. Applied Psychology: An International Review, 54, 456–475.
Langton, L., & Hollinger, R. C. (2005). Correlates of crime losses in the retail industry. Security Journal, 18(3),
27–44.
Lapierre, L. M., Spector, P. E., & Leck, J. D. (2005). Sexual versus nonsexual workplace aggression and
victims’ overall job satisfaction: A meta-analysis. Journal of Occupational Health Psychology, 10,
155–169.
Lee, K., Ashton, M. C., & de Vries, R. E. (2005a). Predicting workplace delinquency and integrity with the
HEXACO and five-factor models of personality structure. Human Performance, 18(2), 179–197.
Lee, K., Ashton, M. C., & Shin, K. (2005b). Personality correlates of workplace anti-social behavior. Applied
Psychology: An International Review, 54, 81–98.
Marcus, B. (2001). Approaches to the explanation of counterproductive behaviors in organizations. In R.K.
Silbereisen (Ed.), Psychologie 2000 (pp. 414–425). Lengerich, Germany: Pabst.
Marcus, B., Lee, K., & Ashton, M. C. (2007). Personality dimensions explaining relationships between integ-
rity tests and counterproductive behavior: Big five, or one in addition? Personnel Psychology, 60, 1–34.
Marcus, B., & Schuler, H. (2004). Antecedents of counterproductive behavior at work: A general perspective.
Journal of Applied Psychology, 89, 647–660.
Marcus, B., & Wagner, U. (2007). Combining dispositions and evaluations of vocation and job to account
for counterproductive work behavior in adolescent job apprentices. Journal of Occupational Health
Psychology, 12(2), 161–176.
Martinko, M. J., Gundlach, M. J., & Douglas, S. C. (2002). Toward an integrative theory of counterproductive
workplace behavior: A causal reasoning perspective. International Journal of Selection and Assessment.
Special Issue: Counterproductive Behaviors at Work, 10, 36–50.
McFadden, K. L. (1997). Policy improvements for prevention of alcohol misuse by airline pilots. Human
Factors, 39, 1–8.
Meyer, J. P., Stanley, D. J., Herscovitch, L., & Topolnytsky, L. (2002). Affective, continuance, and normative
commitment to the organization: A meta-analysis of antecedents, correlates, and consequences. Journal
of Vocational Behavior, 61, 20–52.
Mikulay, S. M., & Goffin, R. D. (1998). Measuring and predicting counterproductivity in the laboratory using
integrity and personality testing. Educational and Psychological Measurement, 58, 768–790.
Miles, D. E., Borman, W. C., Spector, P. E., & Fox, S. (2002). Building an integrative model of extra role work
behaviors: A comparison of counterproductive work behavior with organizational citizenship behavior.
International Journal of Selection and Assessment. Special Issue: Counterproductive Behaviors at Work,
10, 51–57.
Mitchell, M. S., & Ambrose, M. L. (2007). Abusive supervision and workplace deviance and the moderating
effects of negative reciprocity beliefs. Journal of Applied Psychology, 92, 1159–1168.
Mount, M., Ilies, R., & Johnson, E. (2006). Relationship of personality traits and counterproductive work
behaviors: The mediating effects of job satisfaction. Personnel Psychology, 59, 591–622.
Neuman, J. H., & Baron, R. A. (1997). Aggression in the workplace. In R. A. Giacalone, & J. Greenberg (Eds.),
Antisocial behavior in organizations (pp. 37–67). Thousand Oaks, CA: Sage.
Nicol, A. A. M., & Paunonen, S. V. (2002). Overt honesty measures predicting admissions: An index of validity
or reliability. Psychological Reports, 90, 105–115.
Ones, D. S., & Viswesvaran, C. (1998). Integrity testing in organizations. In R. W. Griffin, A. O’Leary-Kelly
& J. M. Collins (Eds.), Dysfunctional behavior in organizations: Violent and deviant behavior (pp. 243–
276). New York, NY: Elsevier Science/JAI Press.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities:
Findings and implications for personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 679–703.
510 Handbook of Employee Selection
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2003). Personality and absenteeism: A meta-analysis of
integrity tests. European Journal of Personality. Special Issue: Personality and Industrial, Work and
Organizational Applications, 17(Suppl. 1), S19-S38.
Payne, B. K., & Gainey, R. R. (2004). Ancillary consequences of employee theft. Journal of Criminal Justice,
32, 63–73.
Pearson, C. M., Andersson, L. M., & Porath, C. L. (2000). Assessing and attacking workplace incivility.
Organizational Dynamics, 29(2), 123–137.
Penney, L. M., & Spector, P. E. (2002). Narcissism and counterproductive work behavior: Do bigger egos mean
bigger problems? International Journal of Selection and Assessment. Special Issue: Counterproductive
Behaviors at Work, 10, 126–134.
Penney, L. M., & Spector, P. E. (2005). Job stress, incivility, and counterproductive work behavior (CWB): The
moderating role of negative affectivity. Journal of Organizational Behavior, 26, 777–796.
Probst, T. M., Stewart, S. M., Gruys, M. L., & Tierney, B. W. (2007). Productivity, counterproductivity and
creativity: The ups and downs of job insecurity. Journal of Occupational and Organizational Psychology,
80, 479–497.
Roberts, B. W., Harms, P. D., Caspi, A., & Moffitt, T. E. (2007). Predicting the counterproductive employee in
a child-to-adult prospective study. Journal of Applied Psychology, 92, 1427–1436.
Robinson, S. L., & Bennett, R. J. (1995). A typology of deviant workplace behaviors: A multidimensional
scaling study. Academy of Management Journal, 38, 555–572.
Robinson, S. L., & O’Leary-Kelly, A. M. (1998). Monkey see, monkey do: The influence of work groups on the
antisocial behavior of employees. Academy of Management Journal, 41, 658–672.
Rogers, K., & Kelloway, E. K. (1997). Violence at work: Personal and organizational outcomes. Journal of
Occupational Health Psychology, 2, 63–71.
Rosenbaum, R. W. (1976). Predictability of employee theft using weighted application blanks. Journal of
Applied Psychology, 61, 94–98.
Rotundo, M., & Sackett, P. R. (2002). The relative importance of task, citizenship, and counterproductive
performance to global ratings of job performance: A policy-capturing approach. Journal of Applied
Psychology, 87, 66–80.
Rotundo, M., & Xie, J. L. (2008). Understanding the domain of counterproductive work behavior in China.
International Journal of Human Resource Management, 19, 856–877.
Sackett, P. R., & DeVore, C. J. (2002). Counterproductive behaviors at work. In N. Anderson, D. S. Ones, H. K.
Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work and organizational psychology, volume
1: Personnel psychology (pp. 145–164). Thousand Oaks, CA: Sage.
Sackett, P. R., & Wanek, J. E. (1996). New developments in the use of measures of honesty, integrity, conscien-
tiousness, dependability, trustworthiness, and reliability for personnel selection. Personnel Psychology,
49, 787–829.
Sagie, A., Birati, A., & Tziner, A. (2002). Assessing the costs of behavioral and psychological withdrawal: A
new model and an empirical illustration. Applied Psychology: An International Review. Special Issue:
Challenges of Applied Psychology for the Third Millennium, 51, 67–89.
Salgado, J. F. (2000). The Big Five personality dimensions as predictors of alternative criteria. Paper presented at the
15th Annual Conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2002). Predictors used for personnel selection: An overview
of constructs, methods and techniques. In N. Anderson, D. S. Ones, H. K. Sinangil, & C. Viswesvaran
(Eds.), Handbook of industrial, work and organizational psychology, volume 1: Personnel psychology
(pp. 165–199). Thousand Oaks, CA: Sage.
Scalora, M. J., Washington, D. O., Casady, T., & Newell, S. P. (2003). Nonfatal workplace violence risk factors:
Data from a police contact sample. Journal of Interpersonal Violence, 18, 310–327.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psycho-
logy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Skarlicki, D. P., & Folger, R. (1997). Retaliation in the workplace: The roles of distributive, procedural, and
interactional justice. Journal of Applied Psychology, 82(3), 434–443.
Smith, D. A., & Visher, C. A. (1980). Sex and involvement in deviance/crime: A quantitative review of the empiri-
cal literature. American Sociological Review, 45, 691–701.
Spalek, B. (2001). Regulation, white-collar crime and the Bank of Credit and Commerce International. The
Howard Journal, 40, 166–179.
Spector, P. E. (1975). Relationships of organizational frustration with reported behavioral reactions of employees.
Journal of Applied Psychology, 60, 635–637.
Counterproductive Work Behavior and Withdrawal 511
Spector, P. E. (1988). Development of the work locus of control scale. Journal of Occupational Psychology,
61(4), 335–340.
Spector, P. E., & Fox, S. (2002). An emotion-centered model of voluntary work behavior: Some parallels
between counterproductive work behavior and organizational citizenship behavior. Human Resource
Management Review, 12, 269–292.
Spector, P. E., & Fox, S. (2005). The stressor-emotion model of counterproductive work behavior. In S. Fox &
P. E. Spector (Eds.). Counterproductive workplace behavior: Investigations of actors and targets
(pp. 151–174). Washington, DC: American Psychological Association.
Spector, P. E., Fox, S., Penney, L. M., Bruursema, K., Goh, A., & Kessler, S. (2006). The dimensionality of
counterproductivity: Are all counterproductive behaviors created equal? Journal of Vocational Behavior,
68, 446–460.
Steffensmeier, D., & Allan, E. (1996). Gender and crime: Toward a gendered theory of female offending.
Annual Review of Sociology, 22, 459–487.
Storms, P. L., & Spector, P. E. (1987). Relationships of organizational frustration with reported behavioural
reactions: The moderating effect of locus of control. Journal of Occupational Psychology, 60(3),
227–234.
Tepper, B. J. (2000). Consequences of abusive supervision. Academy of Management Journal, 43, 178–190.
Vardi, Y., & Weitz, E. (2002). Using the theory of reasoned action to predict organizational misbehavior.
Psychological Reports, 91, 1027–1040.
Zapf, D., & Einarsen, S. (2005). Mobbing at work: Escalated conflicts in organizations. In S. Fox, & P. E.
Spector (Eds.), Counterproductive work behavior: Investigations of actors and targets (pp. 237–270).
Washington, DC: American Psychological Association.
This page intentionally left blank
24  Defining and Measuring Results of Workplace Behavior
The previous chapters in this section focused on the measurement of task performance, construc-
tive personal behavior (citizenship and adaptability), and counterproductive behavior and how
these fit in the context of conducting selection research. Each of these represents a conceptually
distinct content area within the performance domain, and all consist of reasonably well-defined
constructs that have been reliably and validly measured in the past and successfully used as cri-
teria in validation research, albeit some more so than others. In contrast, this chapter focuses
on measuring results—the actual end products, outcomes, or deliverables one produces on a job.
Unlike other criterion constructs, discussions of a “results” construct are relatively rare in the
industrial and organizational (I-O) psychology literature. Likewise, results measures have not
been as well defined and researched as other types of performance measures (e.g., task, citizen-
ship, adaptive, etc.). Thus, we know less about their reliability, validity, accuracy, and fairness
compared with other, more commonly used performance measures. We also know less about
how to develop effective results measures that will possess adequate psychometric properties and
validity.
Given that we already have several conceptually distinct, well-defined, and psychometrically
sound performance measures that appear to comprehensively cover the criterion domain, one might
reasonably question why we should bother adding results measures to the mix. The answer is that
many organizations today are focusing on defining work in terms of the results employees are
expected to achieve, and likewise, they are evaluating and rewarding staff on the extent to which
they have delivered tangible outcomes that are important to the organization’s success. Thus, if a
situation arises in which we must conduct validation research using criterion measures that are
available, chances are that we will increasingly encounter measures of results. In addition, opera-
tional performance measures are sometimes used as predictors in making selection decisions; for
example, performance ratings are often one input used in making promotion decisions. Here, again,
such predictors are increasingly likely to include measures of results.
Many of the performance measures used in validation research have focused on measuring
work behavior, which is important to ensure job relevance. Behavioral measures have also been
used extensively in the past as a basis for performance management. These measures focus on how
employees get the job done; for example, how they contribute to a team, communicate, plan and
organize work, and so forth. Irrespective of how productive employees may be, we are all familiar
with the problems and disruptions they can cause if they are difficult to work with, unhelpful, or
exhibit maladaptive behavior. Thus, evaluating workplace behavior is important. We are all also
familiar with employees who are extremely helpful, communicate well, and are nice to everyone;
yet never seem to get anything done. This is why considering the results that an employee achieves
is also an important part of overall performance measurement, and as mentioned, it is one that orga-
nizations are increasingly emphasizing.
Although a choice can be made to assess results or behavior, it may be important to include both
types of measures when comprehensive performance measurement is the goal (Landy & Trumbo,
1980; Pulakos, 2008), as would be the case when the measures are used as criteria for validation
research or as predictors in selection processes. Because earlier chapters have discussed behavioral
performance measurement in detail, our focus here is on how to obtain useful and meaningful
measures of results. However, because relatively little research attention has been directed to mea-
suring results, there is not an extensive literature to draw on that speaks directly to the quality and
utility of results measures or how they relate to other, more commonly used predictors and criteria.
Accordingly, we draw on related research to propose methods for developing results measures that
should maximize their reliability, validity, and fairness.
We begin below by reviewing the debate that has surrounded measuring workplace behavior
versus results and discuss why the measurement of results has become increasingly popular
today. We then propose methods for developing results measures and challenges associated with
these. We review the concept of cascading goals and provide guidelines for developing individual
objectives, which are thought to be an important precursor to achieving organizationally relevant
results. We then discuss evaluation methods that should facilitate accurate and fair measure-
ment of the results employees achieve, using a combination of objective and subjective mea-
sures. Finally, we discuss individual difference constructs that are likely to predict performance
results.
Although there are clearly challenges inherent in measuring results, the evaluation of behavior
is not without issues of its own. First and foremost, the common practice of using subjective ratings
to assess behavioral performance (see Chapter 21, this volume) yields measures with notoriously
attenuated variance. This is particularly true when these ratings are collected for operational pur-
poses (e.g., pay, promotion), circumstances in which a large proportion of employees are rated at
the highest levels of the rating scale (Pulakos, 2004). This lack of discrimination among employees
renders the measures virtually useless for validation research or for use as selection measures.
Although for-research-only ratings collected in validation studies tend to be more variable, lack of
discrimination is a chronic problem with subjective ratings, undermining their reliability, validity,
and utility.
Second, advocates of results-based assessment argue that a focus exclusively on
behavior misses what is most important, namely whether or not an employee actually delivered
important bottom-line results. Although an employee can engage in highly effective behaviors, they
are of little value if they do not result in organization-relevant outcomes (Bernardin et al., 1998).
To that end, it has been suggested that behaviors should be measured only if they can be linked to
outcomes that drive organizational success. In addition, research has shown that employees perform
more effectively when they have specific goals and expectations so that they know what they are
accountable for delivering (e.g., Locke, Shaw, Saari, & Latham, 1981). Defining and measuring the
results each employee is expected to achieve and aligning those to organizational performance helps
everyone work toward a common set of important goals.
Despite the difficulties associated with measuring results (e.g., criterion contamination and defi-
ciency), there has been an increasingly popular trend over the last decade for organizations to adopt
a results-focus in the measurement of job performance. This is largely because business leaders and
organizational consultants have become convinced that an exclusive focus on behaviors is remiss
in not sufficiently emphasizing the importance of delivering meaningful results that are critical to
organizational success. This orientation has likely been driven by intensified pressure from stock-
holders and increasingly formidable national and international competition. It is noteworthy that
two other chapters in this volume share the perspective that results criteria are important indices of
selection-related value. Chapter 11, this volume, makes this point in discussing the business value of
selection as it relates to system and organizational level outcomes, whereas Chapter 9, this volume,
discusses this in relation to multilevel issues.
Even public sector and not-for-profit organizations that have not traditionally driven toward
results have adopted this focus to demonstrate their value. In the late 1990s, the Internal Revenue
Service (IRS), the Federal Aviation Administration (FAA), and the Government Accountability
Office (GAO) all initiated pay-for-performance systems, which focused on measuring and reward-
ing results. More recently, the U.S. Departments of Defense (DoD) and Homeland Security (DHS)
have begun to develop similar programs. This results-focus has become so pervasive that the U.S.
Office of Personnel Management (OPM) codified procedures that require Federal government agen-
cies to develop performance management systems for executives that link their performance to
results-oriented goals and to explicitly evaluate results. Similar systems are also being implemented
at lower organizational levels.
With results-oriented performance measurement increasingly emerging as a significant trend,
the remainder of this chapter is devoted to methods for defining and evaluating results in a man-
ner that will yield the highest quality measures possible. One important caveat to point out is that
results have been fairly narrowly defined in the past to include only those outcomes that could be
evaluated using bottom-line, highly objective criteria, such as dollar volume of sales and number
of products produced. More recent operationalizations of results continue to emphasize objective
measurement, but there has also been recognition that it may not be possible to translate every
important aspect of a result into a bottom-line, objective metric. This has opened the door for the
use of some subjective (i.e., judgmental) measures along with objective measures in assessing the
quality of results.
FIGURE 24.1  Example of cascading goals: the organization’s strategic goals cascade to administrative division goals, then to accounting and finance department goals, and finally to a finance professional’s individual goals.
It is critical for organizations to set goals and objectives in a thoughtful and realistic way to ensure that mistakes made in setting the highest-level goals do not cascade down throughout the entire orga-
nization. For this reason, goals set by the top leadership of the organization need the most critical
scrutiny and opportunities for correction or revision, things that do not always occur to the degree
they should.
Assuming the highest-level goals are well thought through and realistic, one of the challenges in
cascading goals is that it is sometimes difficult to see direct relationships between high-level goals
and what an individual does on the job. This is why organizational goals need to be translated or
cascaded into increasingly refined goals at the division, department, and individual levels. The
process of developing cascading goals usually requires several meetings in which organizational
leaders first develop division goals that align with the organizational goals. Then, mid-level manag-
ers develop unit goals that align with division goals. Then, managers develop group goals that align
with unit goals, and so on until the organizational goals are cascaded down to individual employees.
The process of cascading goals thoughtfully and meaningfully is quite time-consuming and chal-
lenging, especially if managers are not accustomed to thinking about the linkages between goals at
different organizational levels.
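The cascading structure described above can be thought of as a tree in which every lower-level goal links to the higher-level goal it supports. The sketch below is purely illustrative; the `Goal` class, its field names, and the example goals are our own inventions, not drawn from this chapter.

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    """A goal at any organizational level, linked to the goal it cascades from."""
    description: str
    level: str                    # e.g., "organization", "division", "department", "individual"
    parent: "Goal | None" = None  # the higher-level goal this one supports
    children: "list[Goal]" = field(default_factory=list)

    def cascade(self, description: str, level: str) -> "Goal":
        """Create a lower-level goal aligned with this one."""
        child = Goal(description, level, parent=self)
        self.children.append(child)
        return child

    def line_of_sight(self) -> "list[str]":
        """Trace this goal back up to the organization's strategic goal."""
        chain, node = [], self
        while node is not None:
            chain.append(f"{node.level}: {node.description}")
            node = node.parent
        return list(reversed(chain))

# Hypothetical example mirroring the cascade described in the text:
strategic = Goal("Improve operating margin", "organization")
division = strategic.cascade("Reduce administrative overhead", "division")
department = division.cascade("Streamline claims processing", "department")
individual = department.cascade(
    "Cut average claim-processing cost by 5% by year end", "individual")

for step in individual.line_of_sight():
    print(step)
```

Tracing the line of sight from an individual objective back to the strategic goal makes explicit the linkage that, as noted above, managers and employees often find difficult to see.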
On average, organizations spend less than 10 hours per year on performance management activi-
ties for each employee (Bretz, Milkovich, & Read, 1992). However, the process of cascading goals
requires considerably more time. In fact, it is not uncommon for organizations that are initiating
cascading goals to take until the end of the second quarter of the operating year to complete the
process. This poses difficulties, because half of the rating period may have passed before individual
goals and expectations are set for employees, leaving little time for goal attainment. However, as
organizations gain more experience with cascading goals, efficiencies are realized. Training man-
agers and employees in how to identify cascading goals that are meaningful, useful, and show clear
relationships between levels also facilitates the process. The bottom line is that the implementation
of cascading goals requires time, effort, and considerable handholding, at least initially, to ensure
that the process is done well.
Once goals have been successfully cascaded down to the level just above the individual, there are
two ways individual goals can be linked to these higher-level goals:
1. Start with performance objectives and work upward to link them to higher-level goals.
2. Start with higher-level goals that are relevant to an employee’s job and work downward to
develop individual performance objectives.
The decision to link upward or downward is a matter of personal preference. Some find it easier to start
with something concrete from their job and work upward to a less-tangible concept. Others find it
easier to start with a higher-level goal and develop something they can do on their job that relates
to that goal. Figure 24.2 shows an example of how department goals could be cascaded down to
individual objectives for a human resources (HR) professional. Note that the individual objectives
are related to only one of the department goals. Objectives can be related to more than one goal at
the next higher level, but as mentioned previously, it is unlikely that goals at one level will relate to
all of the goals at the next level.
There are several guidelines that should be followed when developing individual performance
objectives. Many of these are a direct outgrowth of the well-established principles found in the goal-
setting literature (Locke & Latham, 1990). Following these guidelines will help to ensure that the
objectives are clear, employees know what is expected, and they are motivated to achieve success.
• Objectives must be specific: Objectives must be clearly defined, identifying the end results
employees are expected to achieve. Ambiguity is reduced by specifying the outcomes,
products, or services in terms of quality, quantity, and timeliness expectations. Research has consistently found that well-defined objectives are associated with higher performance (Locke & Latham, 1990).
FIGURE 24.2  Example individual objectives for an HR professional:
• Develop a process to tie employee performance to rewards by the end of the year.
• Train all managers on the new performance management system by December 31.
• Reduce the average cost of processing retirement claims by 5% by the end of the year.
• Conduct a competency gap analysis for critical occupations by the end of the third quarter.
• Design and obtain leadership approval for a new employee incentive system by March.
• Implement an employee incentive system for all full-time employees, with payouts to employees beginning in January.
• Employees must be committed to the objectives: A key aspect of commitment that we have
already discussed is that employees must feel that they can reach their objectives, or they
will be demotivated to try. The best way to facilitate employees accepting their objectives
is to make them an active part of the objective-setting process and work with them to arrive
at objectives that are challenging yet achievable. Once managers and employees have come
to agreement on the employee’s objectives in principle, asking employees to prepare the
written description of their objectives helps enhance their ownership of them.
• Managers must show their commitment to the objectives: It is important for managers to
show their support by providing guidance and resources as well as removing obstacles to
goal achievement. The literature clearly shows that management commitment is critical
to successful achievement of objectives (Rodgers, Hunter, & Rogers, 1993).
If it were possible to develop such taxonomies, efficiencies would be gained from alleviating the need
for every organization in a similar line of business to devote time and energy to re-inventing the
wheel over and over again.
In more idiosyncratic jobs and situations, it may be impossible to predefine objectives that apply across
positions, jobs, or organizations. Although a group of employees may occupy a given job, the spe-
cific results each individual is expected to achieve may vary depending on the nature of his or her
assignments. For example, some organizational consultants may have production or sales results,
others in essentially the same job may be responsible for developing and implementing systems, oth-
ers may have specific levels of customer satisfaction outcomes they are expected to meet, and still
others may have employee development or team-leadership goals. To the extent that people holding
similar jobs have different goals and objectives, evaluating and comparing their performance in a
fair and standardized manner becomes increasingly challenging.
Under these circumstances, we recommend developing individual objectives that further define
critical tasks from a comprehensive job analysis. This helps to ensure that a common base of job-
relevant information is used to develop objectives. Objectives derived in this manner will contain
more specific information than the tasks or work behavior statements, such as what specific proj-
ect, customer, product, etc. the employee is responsible for and what specific quality, quantity, and
timeliness criteria will be measured. Two examples of how objectives can be developed by further
specifying validated work behaviors appear in Table 24.1. The first task is to evaluate and monitor
the quality of product information supplied to potential customers. A reasonable objective for this
task would be to monitor the specific channels that are used to provide information to customers and
evaluate the accuracy and timeliness of the information supplied according to measurable criteria.
The second task is to design, administer, analyze, and interpret surveys. Specifying what type of
survey a given employee is responsible for and the timeline required for its completion allows this
work task to be transformed into an individual objective.
CHALLENGE 3: HELPING MANAGERS DEVELOP COMPARABLE AND FAIR OBJECTIVES FOR EMPLOYEES
A problem that occurs when different managers set objectives for employees who occupy the same
job is that natural inconsistencies among them can result in objectives that are too easy, unattainable,
or unsystematic across employees (Jamieson, 1973; Strauss, 1972). This often results in employees
in the same job being evaluated against objectives that vary significantly in their difficulty and com-
plexity. For example, assume one employee’s objective is to perform a simple information-catalogu-
ing project, whereas another in the same job and level is given the objective of managing the design
and implementation of a complex information management system. If the value of these different
objectives is not established and there is no mechanism in place to review objectives for fairness
and consistency across employees, both of these employees could be considered performing equally
well if they both achieved their stated objectives. Yet, the employee who managed the design and implementation of the information management system would have undertaken a much more difficult and complex assignment and contributed substantially more. Thus, evaluating employee results cannot merely take into account whether each individual simply met or did not meet the established objectives. This would not only undermine the accuracy of performance measurement but could also rightly be viewed as unfair, with a consequent negative impact on employee acceptance of the measurement process (e.g., Dipboye & de Pontbriand, 1981; Greenberg, 1986).

TABLE 24.1
Transforming Work Tasks Into Performance Objectives

Work task: Evaluate and monitor the quality of information provided to potential customers
Performance objectives:
• Monitor calls to the company call center and provide feedback to staff as necessary to ensure 95% accuracy of the product information provided.
• Monitor responses to e-mail inquiries to ensure that responses are made within 24 hours and that the accuracy of the information provided is at least 95%.

Work task: Design, administer, analyze, and evaluate surveys
Performance objective:
• Develop items, select a vendor, and administer the survey by January; analyze the data, conduct focus groups to further understand the survey results, and write a report with clear, actionable, and feasible recommendations that requires no grammatical editing and minimal substantive editing by July.
We recommend several strategies to mitigate this problem. First, the training provided to man-
agers and employees needs to focus on teaching them how to develop objectives that are of similar
difficulty and complexity for individuals in the same or similar jobs. This process is similar to
frame-of-reference training described by Bernardin and Buckley (1981), in which review and dis-
cussion of example objectives helps to calibrate trainees to apply similar standards. As a supplement
to training, especially in the early stages of implementation, having managers meet to review the
objectives for staff in the same job helps ensure that similarly difficult and complex objectives are
set for similarly situated employees. Such meetings also serve to reinforce development of a common frame of reference among managers for setting objectives.
A third recommendation to facilitate the quality and consistency of individual objectives is to
retain them in a searchable database organized by job and level. These can be used again verbatim
or refined and edited over time to develop future objectives.
Finally, even if an objective appears appropriate for the job and level and is comparable to those
for similarly situated others, a project, program, or goal will sometimes turn out to be much more or
less difficult than anticipated. For this reason, we feel it is important to evaluate employees not only
on the extent to which they achieved or exceeded their stated results but also on the difficulty and
complexity of what they delivered relative to what would be expected for their job. Although this
involves a subjective judgment, it provides a fairer and more accurate assessment of the employee’s
performance overall and a systematic basis for making meaningful comparisons between employ-
ees who may have achieved different types of results. We further discuss and provide examples of
subjective rating criteria to evaluate results later in this chapter.
When the work is team-focused or requires significant interdependencies, objectives should be set
at the level where the key work products are produced. When jobs are highly intertwined or dependent on a team, it may not be practical or even appropriate to set individual objectives. In such circumstances, individual objectives should be abandoned and replaced with objectives set at the higher group or team
level (Lawler, 1994). Ployhart and Weekley (Chapter 9, this volume) similarly make the point that task
and result interdependencies may make individual-level performance and results impossible to measure
well, if at all, and only aggregated performance/results may be measurable in any reasonable way.
Timeliness refers to the timeframe in which the work was performed. Examples of timeliness
metrics include responding to customer complaints within 24 hours and providing statistical reports
on a quarterly basis that summarize progress toward affirmative action goals.
Quality refers to the effectiveness of the result. Examples of quality metrics include improving
the layout for navigating a website to make it more user-friendly as indicated by a 10% improve-
ment in user survey satisfaction results, independently creating a report containing relevant and
concise information on program operations that required no revisions, and developing an online
training program in which trainees successfully learned 85% of the materials. Although it is use-
ful to develop quantifiable metrics of quality where it is possible to do so, quality assessments will
sometimes require subjective judgments (e.g., how relevant and concise the information contained
in a report actually was). Providing predefined rating criteria to guide subjective judgments helps
ensure employees are fairly evaluated against uniform standards.
Quantity refers to how much work is performed. Examples of quantity metrics include responding to 95% of requests, providing computer training to 90% of employees, and conducting two onsite reviews each month to assess compliance with regulations.
Finally, financial metrics relate to the efficient use of funds, revenues, profits, or savings. Examples
of financial metrics include budgeting operations to achieve a 10% cost savings compared to last year and convincing customers to increase expenditures for service by 15% over last year.
Although there are four primary ways to measure results, the different types of measures can be
used together, which usually improves the clarity of expectations. For example:
• Processed 99% of candidate job applications within 1 week of receiving them (quantity
and timeliness)
• Developed an online training course that taught 90% of employees how to use automated
transactional systems and reduced training costs by $500/employee (quantity, quality, and
financial metrics)
Table 24.2 presents examples of well-defined objectives that specify timeliness, quality, quantity,
and/or financial metrics and examples of poorly defined objectives that fail to specify measurable
criteria. As can be seen by reviewing the first set of objectives (i.e., well-defined) in the table, articulating expected results in terms of the four types of measures is likely to increase understanding and agreement about whether or not the objectives were achieved. In contrast, the second set of objectives (poorly defined) is vague and nonspecific, which could easily lead to differing opinions about the extent to which they were met.
TABLE 24.2
Example Performance Objectives
Well-defined objectives
• By June 30, develop a plan that allows for 90% of general inquiries to the company website to be responded to within 72 hours.
• By the end of the operating year, implement a self-service benefits system that reduces processing costs by 10%.
• By June 30, draft and submit to the HR vice president a plan and timeline, accepted without revision, for expanding telework options to at least 70% of full-time employees.
• Reduce average cost of processing travel reimbursements by 5% by end of year.
Poorly defined performance objectives
• Provide effective customer service.
• Coordinate with the legal department to revise the company’s HR policy.
• Promote volunteering in the local community.
• Reduce operating costs of company fleet program.
524 Handbook of Employee Selection
• Drive achievement of important results from all employees that contribute to the organiza-
tion’s success.
• Reward employees on the basis of their performance, which requires accurate performance
measurement. Architects of pay-for-performance systems felt this could be best achieved
by defining results in terms of concrete, objective measures, thus mitigating the chronic
inflation that characterizes subjective ratings.
With this as background, we now discuss three challenges inherent in measuring results and recom-
mendations for addressing these.
• Did the employee who produced the most pieces also produce the highest quality pieces?
• Did the website redesign that was completed on time and within budget actually improve
usability?
• Was the driver who made the most deliveries speeding and endangering others?
The reality is that even when measuring performance on objectives seems straightforward, it is
important to consider the consequences of the measures selected because employees will drive to
those measures. For example, quantity measures are usually easier to define than quality measures.
However, if only quantity metrics are used, employees will focus on production, possibly to the det-
riment of quality. It is also important not to fall prey to measuring peripheral aspects of an objective
that may be easy to measure but are unimportant. For example, meeting a deadline is easy to mea-
sure but improving customer service may be what is important. Researchers and practitioners have
long argued against using convenience criteria because they are often unrelated to the most critical
aspects of job performance (e.g., Smith, 1976).
Despite the limitations associated with use of subjective criteria, inclusion of some subjective
judgment in the measurement of results increases the likelihood that the most important aspects of
performance will be measured. However, we also recommend that uniform standards be provided
to guide raters in making these judgments fairly and systematically across employees. Also, incor-
porating standardized criteria on which ratings are made provides a mechanism for making direct
comparisons between employees who may have delivered different types of results. Shown in Table
24.3 are example criteria with a five-point rating scale that could be used to evaluate the quality of
different results.
TABLE 24.3
Performance Standards for Evaluating Quality of Results
Rating scale: 1 (Low) to 5 (Exceptional)
Low: The product, service, or other deliverable had significant problems, did not meet minimum quality standards, and fell well short of expectations. There were many or very significant errors or mistakes, and substantial revision or reworking was needed.
High: The product, service, or other deliverable possessed high quality and fully met expectations. There were only minor errors or mistakes that were easily corrected and inconsequential.
Exceptional: The product, service, or other deliverable possessed flawless and impeccable quality that met the highest possible standards and surpassed expectations. There were no errors or mistakes, and no revision or reworking was needed.
TABLE 24.4
Extent to Which Objective Was Met
Rating scale: 1 (Not Met) to 5 (Exceeded)
Not Met: Several of the quality, quantity, timeliness, or financial measures established for this objective were not met.
Met: All of the quality, quantity, timeliness, and financial measures established for this objective were met.
Exceeded: The quality, quantity, timeliness, or financial measures established for this objective were significantly exceeded.
TABLE 24.5
Level of Result Achieved
Rating scale: 1 (Did Not Meet) to 5 (Exceeded)
Did Not Meet: The result achieved fell far below the difficulty and complexity of work expected for this job level.
Met: At this level, work is moderately complex and difficult such that critical analysis, integration of multiple sources of information, and analysis of the pros and cons of multiple solutions are required. Work is performed with minimal supervision and guidance. The result achieved was consistent with the difficulty and complexity of work expected for this job level.
Exceeded: The result achieved far exceeded the difficulty and complexity of work expected for this job level.
TABLE 24.6
Contribution of Result
Rating scale: 1 (Low) to 5 (High)
Low: The efficiency or effectiveness of operations remained the same or was only minimally improved. The quality of products or services remained the same or was only minimally improved.
Moderate: The efficiency or effectiveness of operations was improved, consistent with what was expected. Product or service quality showed expected improvements.
High: The efficiency and effectiveness of operations were improved tremendously, far surpassing expectations. The quality of products or services was improved tremendously.
to achieve results. The two major components of conscientiousness are achievement motivation and
dependability. Achievement motivation, which refers to one's desire to achieve results and master tasks beyond others' expectations, may be especially relevant to predicting results.
Although the Big Five are rarely broken down into their component parts, Hough (1992) and Hough
and Dilchert (Chapter 14, this volume) have argued for and shown potential advantages of examin-
ing lower-level personality constructs in the prediction of job performance. Given the direct conceptual similarity between achievement motivation and achieving results, this may be a circumstance in which examining the validity of the component personality constructs proves fruitful.
CONCLUSIONS
The development of individual performance objectives, linked to key organizational goals and pri-
orities, has been hypothesized to drive important results. Given the pervasive use of results measures
in today’s organizations, future research should investigate the relationships between these perfor-
mance measures and more commonly used predictors and criteria. Many practitioners and organi-
zational leaders certainly believe that unique variance is accounted for in measuring results versus
other types of performance measures. Because this belief has led to implementation of complex
and time-consuming results-based systems, it is important to know if the added effort associated
with these systems is, in fact, producing different or better information than other, less demanding
performance measurement approaches.
Research should also be conducted to evaluate the psychometric properties of results measures to
assess whether or not they possess sufficient reliability, validity, and fairness to be used in validation
research and for making selection decisions. Research is also needed to investigate the underlying
dimensionality of results measures as well as predictors of them. Throughout this chapter, we drew
from the literature to propose methods for identifying objectives and evaluating results that should
maximize the likelihood of obtaining measures with adequate measurement properties, validity,
and utility. However, data need to be collected using these methods to evaluate their efficacy.
Competent development of fair, job-relevant, and useful objectives is difficult, resource-in-
tensive, and time-consuming, requiring considerable training and effort on the part of managers,
employees, and HR staff. If organizational members are not committed to developing effective
objectives, doing this consistently for all employees and devoting the time that is needed to yield
high-quality measures, we recommend that results measures not be collected or included in perfor-
mance measurement processes. This is because poorly developed objectives will neither motivate
employees nor will they provide useful criterion measures for validation research or operational
selection decisions. However, if organizational members are willing to devote the time, energy,
and resources necessary to overcome the inherent challenges involved in developing objectives
and monitoring their effectiveness and completion, results-based measures may hold considerable
promise. Research and practice have certainly suggested that defining and measuring results can
have a profoundly positive effect on individual and organizational performance (Locke & Latham,
1990; Rodgers & Hunter, 1991).
REFERENCES
Banks, C. G., & May, K. E. (1999). Performance management: The real glue in organizations. In A. I. Kraut &
A. K. Korman (Eds.), Evolving practices in human resource management (pp. 118–145). San Francisco,
CA: Jossey-Bass.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6,
205–213.
Bernardin, H. J., Hagan, C. M., Kane, J. S., & Villanova, P. (1998). Effective performance management: A
focus on precision, customers, and situational constraints. In J. W. Smither (Ed.), Performance appraisal:
State of the art in practice (pp. 3–48). San Francisco, CA: Jossey-Bass.
Borman, W. C. (1987). Behavior-based rating scales. In R. A. Berk (Ed.), Performance assessment: Methods and applications. Baltimore, MD: Johns Hopkins University Press.
Borman, W. C. (1991). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.),
Handbook of industrial and organizational psychology (Vol. 2, pp. 271–326). Palo Alto, CA: Consulting
Psychologists Press.
Borman, W. C., White, L. A., & Dorsey, D. W. (1995). Effects of ratee task performance and interpersonal fac-
tors on supervisor and peer performance ratings. Journal of Applied Psychology, 80, 168–177.
Bretz, R. D., Milkovich, G. T., & Read, W. (1992). The current state of performance appraisal research and
practice: Concerns, directions, and implications. Journal of Management, 18, 321–352.
Cardy, R. L. (1998). Performance appraisal in a quality context: A new look at old problems. In J. W. Smither
(Ed.), Performance appraisal: State of the art in practice (pp. 132–162). San Francisco, CA: Jossey-Bass.
Cardy, R. L., & Dobbins, G. H. (1994). Performance appraisal: Alternative perspectives. Cincinnati, OH:
South-Western.
Cascio, W. F. (1998). Applied psychology in human resource management. Upper Saddle River, NJ: Prentice
Hall.
Dipboye, R. L., & de Pontbriand, R. (1981). Correlates of employee reactions to performance appraisals and
appraisal systems. Journal of Applied Psychology, 66, 248–251.
Dunnette, M. D. (1966). Personnel selection and placement. Belmont, CA: Wadsworth.
Feldman, D. (1992). The case for non-analytic performance appraisal. Human Resource Management Review, 2, 9–35.
Graves, J. P. (1986). Let’s put appraisal back in performance appraisal: Part 1. Personnel Journal, 61,
844–849.
Greenberg, J. (1986). Determinants of perceived fairness of performance evaluations. Journal of Applied Psychology, 71, 340–342.
Guion, R. M. (1965). Personnel testing. New York, NY: McGraw-Hill.
Hillgren, J. S., & Cheatham, D. W. (2000). Understanding performance measures: An approach to linking
rewards to the achievement of organizational objectives. Scottsdale, AZ: WorldatWork.
Hough, L. M. (1992). The “Big Five” personality variables—Construct confusion: Description versus predic-
tion. Human Performance, 5, 139–155.
Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity
generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Department of Labor,
Employment Service.
Jamieson, B. D. (1973). Behavioral problems with management by objectives. Academy of Management Journal, 16, 496–505.
Kane, J. S. (1986). Performance distribution assessment. In R. A. Berk (Ed.), Performance assessment: Methods
and applications (pp. 237–274). Baltimore, MD: Johns Hopkins University Press.
Landy, F. J., & Trumbo, D. A. (1980). The psychology of work behavior. Homewood, IL: Dorsey Press.
Latham, G. P. (1986). Job performance and appraisal. In C. L. Cooper & I. Robertson (Eds.), International
review of industrial and organizational psychology (pp. 117–155). New York, NY: Wiley.
Lawler, E. E. (1994). Performance management: The next generation. Compensation and Benefits Review, 26,
16–20.
Levinson, H. (2005). Management by whose objectives? In Harvard Business Review on Appraising Employee
Performance. Boston, MA: Harvard Business School Publishing.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ:
Prentice-Hall.
Locke, E. A., Shaw, K. N., Saari, L. M., & Latham, G. P. (1981). Goal setting and task performance, 1969–
1980. Psychological Bulletin, 90, 125–152.
Motowidlo, S. J., & Van Scotter, J. R. (1994). Evidence that task performance should be distinguished from
contextual performance. Journal of Applied Psychology, 79, 475–480.
Muczyk, J. P. (1979). Dynamics and hazards of MBO application. Personnel Administrator, 24, 51–61.
Murphy, K. R., & Cleveland, J. N. (1991). Performance appraisal: An organizational perspective. Needham
Heights, MA: Allyn & Bacon.
Olian, J. D., & Rynes, S. L. (1991). Making total quality work: Aligning organizational processes, performance measures, and stakeholders. Human Resource Management, 30, 303–330.
Podsakoff, P. M., MacKenzie, S. B., Paine, J. B., & Bachrach, D.G. (2000). Organizational citizenship behav-
iors: A critical review of the theoretical and empirical literature and suggestions for future research.
Journal of Management, 26, 513–563.
Pulakos, E. D. (2004). Performance management: A roadmap for developing, implementing, and evaluating performance management systems. Alexandria, VA: Society for Human Resource Management.
Pulakos, E. D. (2008). Performance management: How you can achieve important business results. Oxford,
England: Blackwell.
Pulakos, E. D., Hanson, R. A., & O’Leary, R. S. (2007). Performance management in the United States.
In A. Varma, P. Budhwar, & A. DeNisi (Eds.), Global performance management. London, England:
Routledge.
Rodgers, R., & Hunter, J. E. (1991). Impact of management by objectives on organizational productivity.
Journal of Applied Psychology, 76, 322–336.
Rodgers, R., Hunter, J. E., & Rogers, D. L. (1993). Influence of top management commitment on management
process success. Journal of Applied Psychology, 78, 151–155.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology:
Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schneier, C. E., Shaw, D. G., & Beatty, R. W. (1991). Performance measurement and management: A tool for
strategy execution. Human Resource Management, 30, 279–301.
Smith, P. C. (1976). Behaviors, results, and organizational effectiveness: The problem of criteria. In M. D.
Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 239–287). Chicago, IL:
Rand McNally.
Strauss, G. (1972). Management by objectives: A critical review. Training and Development Journal, 26,
10–15.
25 Employee Work-Related Health, Stress, and Safety
Organizations are increasingly concerned with the health and safety of their employees. There are
several factors contributing to this concern. First, the legal environment in many countries stipu-
lates that employers are responsible for the safety and health of their employees, at least while at
work. For example, the U.S. Occupational Safety and Health Act of 1970 mandates that employers
provide a safe and healthy work environment for employees, and the European Agency for Safety
and Health at Work has issued several Council Directives that protect the health and safety of work-
ers throughout the European Union. It has also been argued that the general public has growing concerns about, and expectations of, protection from communicable diseases and noncommunicable environmental hazards, including those in the work environment (Nicoll & Murray, 2002).
Second, the cost of healthcare continues to climb. In the United States, health insurance coverage of employees is a direct expense to employers and continues to increase. Although many companies have shifted more of the healthcare costs to employees (Kaiser Network, 2006), there are still potential savings to organizations from having healthy employees because health insurance premiums in
the United States are often based on claims experience from illnesses and injuries. In the European
Union and other countries around the world, the cost of healthcare is more of a social, public health responsibility than an organizational one, based in part on differences in the funding of healthcare systems.
Third, employee health is related to productivity and organizational effectiveness. Recent
research on the association of health risks and on-the-job productivity estimated the annual cost
of lost productivity in one organization was between $1,392 and $2,592 per employee, based on
self-reported health risk factors (Burton et al., 2005). In another study of chronic health conditions,
Collins et al. (2005) found that the cost associated with lost productivity from chronic health condi-
tions exceeded the combined costs of absenteeism and medical treatment. Therefore, the concern over employees' health and safety encompasses not only healthcare costs but also lost productivity, and this concern is increasingly a global issue (World Health Organization, 2008).
The purpose of this chapter is to examine potential mechanisms that organizations may use to
maintain and promote healthy employees. These mechanisms might be the selection of “healthy”
workers, modifications to the work environment to reduce work stressors and increase safety, and
employee training. Each mechanism has its pros and cons with respect to maintaining healthy
employees as well as a safe and healthy work environment. We examine the importance of having
healthy workers and the role of stress and safety in the workplace on organizational effectiveness.
Figure 25.1 is an illustration of the chapter overview.
HEALTHY WORKERS
In this section, we first examine the rising costs of healthcare. Next, we discuss how organizations
can develop and possibly obtain healthy workers through wellness programs and selection.
FIGURE 25.1 Chapter overview: selection, the work environment, and training contribute to organizational effectiveness (performance, lower healthcare costs, fewer accidents, and less turnover and absenteeism).
HEALTHCARE COSTS
Healthcare spending is rising faster than incomes in most developed countries, with the United States spending more per capita on healthcare than other countries. On the basis of Organization for Economic Cooperation and Development (OECD) data, a report by the Kaiser Family Foundation (January 2007) indicated that healthcare spending in the United States as a proportion of gross domestic product (GDP) rose from 8.8% in 1980 to 15.2% in 2003. By comparison, the share of GDP spent on healthcare over the same period grew from 7.0% to 7.4% in Finland and from 7.4% to 11.5% in Switzerland.
International comparisons of individual and family health spending are difficult given differ-
ences in healthcare systems. A recent Kaiser Daily Health Report (2008) indicated that medical
care for the typical insured family of four in the United States was $13,382 in 2006—an increase
of 9.6% from 2005, with employers paying 62% or $8,362 per family in 2006. According to
another report by the Kaiser Family Foundation (March 2008), health insurance premiums had
a cumulative growth of 78% between 2001 and 2007, with much of this increase in health insur-
ance cost being borne by employers. On the basis of data from the National Compensation Survey
conducted by the U.S. Bureau of Labor Statistics, the mean health insurance costs per worker hour
for employees with access to coverage rose from $1.81 in 2004 to $2.37 in 2009 (Bureau of Labor
Statistics, 2009). Although the numbers differ based on industry, size of organization, and exact
definitions, it appears that healthcare costs and health insurance premiums for individuals and
organizations will continue to increase.
Analyses of healthcare expenditures find that healthcare costs are often related to modifiable
lifestyle behaviors. Anderson et al. (2000) found that modifiable health risks were associated with
25% of the total healthcare expenditures among a sample of 46,026 employees. Uncontrolled stress,
smoking, and obesity were the three most costly risk factors based on healthcare expenditures in
this study. In a larger study of over 300,000 employees from six companies, Goetzel, Hawkins,
Ozminkowski, and Wang (2003) found that physical health problems cost a total of $3,524 per eli-
gible employee for medical care, absenteeism, and short-term disability program use, and mental
health problems cost only $179 on average. On the basis of an earlier study, Goetzel et al. (1998)
reported that employees with depression had 70% higher healthcare expenditures than those indi-
viduals who were not depressed. In addition, they found that individuals with uncontrolled stress
had 46% greater medical costs than those who were not stressed and the third most costly risk factor
was high blood glucose. Employees with high blood glucose had 35% greater medical expenses than
those with normal blood glucose. Other costly risk factors were obesity, tobacco use, high blood
pressure, and poor exercise habits. Somewhat surprisingly, excessive alcohol consumption was not
associated with increased medical costs, although Goetzel et al. suggested this might be reflective
of individuals with drinking problems tending to avoid the healthcare system.
These studies, as well as others, highlight the effects of modifiable health risk factors on overall
healthcare costs. It is not surprising then that organizations have implemented wellness programs
that focus on exercise, nutrition, smoking cessation, and stress management as attempts to enhance
the health of their workforce.
toward individuals under HIPAA; in other words, that the incentive does not depend on the health
status of the individual and rewards must be available to all similarly situated individuals or at least
provide a reasonable alternative standard for attaining the reward. Simon et al. (2006) and Kendall
and Ventura (2005) suggested that although a particular wellness program may not be discrimina-
tory on the basis of HIPAA, it may still be counter to the ADA and/or state insurance plans. For
example:
If an employer’s wellness program offers lower health plan premiums to employees who complete a
healthy lifestyles course, the ADA may require the employer to ensure that a blind or deaf employee can
access the classes and course materials. In addition, if a disability prevents an employee from earning a
wellness incentive, the ADA may require the employer to work with the employee to develop alternative
incentives that are within the employee’s ability. (Simon et al., 2006, p. 57)
and specificity (i.e., the measure is accurate in identifying people who do not have the condition
being assessed). This is a concern for traditional selection factors such as cognitive ability tests but
is equally a concern for health and health risk factors.
Given the legal implications of using health risk factors for selection and the potentially chang-
ing levels of many health risk factors before and after hiring, the advisability of using selection for
creating and maintaining a healthy workforce seems weak. The empirical evidence suggests that use
of organizational wellness programs may be more efficient. Future research may determine which
health factors in interaction with which elements of the work environment and organizational well-
ness programs are most effective and which may be appropriate for use in selection systems.
WORK STRESS
The above section highlighted the cost of stress in healthcare expenditures. Considering the results
of Anderson et al. (2000) and Goetzel et al. (2004) that psychosocial risk factors, especially depres-
sion and stress, are prevalent in organizations and account for significant proportions of disabilities,
absences, and healthcare expenditures, this section will focus on stress in the workplace.
Stress in the workplace is associated with numerous health problems, including headaches,
weight control problems, sleeplessness, gastrointestinal problems, heart disease, compromised
immune systems, difficulties with memory, and psychological disorders. It has been estimated that
stress annually costs organizations billions of dollars in disability claims, absenteeism, and lost pro-
ductivity (e.g., Xie & Schaubroeck, 2001; Ryan & Watson, 2004). Various reviews of the extensive
stress literature have generally concluded that prolonged exposure to certain job demands can have
debilitating consequences for employees (Tetrick, 2002).
In this first section, we examine organizational-, job-, interpersonal-, and personal-level predic-
tors of experienced work stress. At the organizational level, we focus on work hours and various work schedule arrangements, including a discussion of the pros and cons of using realistic job previews as a recruiting tool. At the job level, we examine role ambiguity and conflict, job
demands, personal control at work, and adaptive performance. At the interpersonal level, we exam-
ine a lack of social support, abusive supervision, organizational politics, and political skill. Finally,
we examine several personality types and individual-level demographic predictors that include age
and gender.
the work may be performed at night or early mornings, such as the “graveyard shifts.” This can
benefit the organization by allowing services or production to be on a continual basis, but this can
also help the employees by allowing flexibility in their work schedules so that they best meet their
own needs. For example, dual career couples with small children may like the idea of working dif-
ferent shifts because this might aid in their ability to care for their children. Flexible time schedules,
job sharing (e.g., two employees working part-time but they share one job), temporary employment,
home-based work, and teleworking (e.g., working from home, hotels, or other remote work sites)
have all become more popular in recent years (Barling et al., 2002a). However, the issue is not whether
these types of arrangements provide flexibility for the organization and the employee, but whether
they reduce employees’ experienced stress and are consistent with a healthy
workforce. The following section examines the consequences of alternative work arrangements on
the well being of employees.
Research on the psychological well being of part-time workers versus full-time workers has not
demonstrated significant differences in terms of employee job attitudes or well being (Barling &
Gallagher, 1996). What appears to be the most important factor differentiating full- from part-time
workers regarding their well being is whether working part-time is voluntary. Voluntary part-time
employment actually has been shown to be beneficial in terms of job satisfaction and general well
being if the part-time employee has a sense of control over work scheduling (Krausz, Sagie, &
Biderman, 2000). Finally, many part-time workers work reduced hours only
because of a preexisting health concern (Mykletun & Mykletun, 1999).
Health concerns may become even more pronounced when coupled with rotating shifts (Jamal &
Baba, 1997). Shiftwork, especially nightwork, has been found to be a risk factor for cardiovascular
disease (Boggild & Knutsson, 1999). Parkes (2003) found that, in general, dayworkers reported
more favorable perceptions of their work environment than shiftworkers. However, she also found
that the work environment (i.e., onshore vs. offshore) moderated these relationships. She argued
that an organizational setting in which work routines are similar for dayworkers and shiftworkers,
and in which all resources are available to both groups, might reduce the negative perceptions
associated with shiftwork. Several factors
may explain the relationship between shiftwork and health concerns, including the employee’s abil-
ity to adjust to differing schedules and the supportive nature of the employee’s family. Additional
research that can separate out these effects is needed before we can make a clear statement about
the relationship between full-time versus part-time workers and working shifts on employee stress
and health. One factor that does appear to be important in promoting health in employees is whether
the work arrangements are voluntary.
In a review of the research on work arrangements, Barling and colleagues (2002a) examined
temporary work, job sharing, shiftwork, full-time versus part-time work, and seasonal and
migrant employment. They concluded
that psychological well being depends less on the nature of the work arrangement and more on
whether the arrangement was voluntary or not. Being able to choose or have some control over work
arrangements is a very important factor in the ability to handle job stressors and the health and well
being of employees.
Given that organizational work hours and schedules have the potential to be stressful to many
workers, perhaps recruiting individuals who are comfortable with less traditional schedules might
help to ensure a long and effective employment relationship. One way to recruit workers who have
an understanding of the employment environment is through realistic job previews (RJPs). The
basic argument is that job applicants will be better able to make informed decisions about whether
or not to pursue a job opportunity if they have a clear idea about the job and job environment. RJPs
give applicants a realistic picture of the positive and negative aspects of the job and job environment,
which (a) might reduce the number of applicants who remain interested in the job but (b) increase the
retention of those workers who are hired (Wanous, 1980). However, some empirical research (i.e.,
Bretz & Judge, 1998) suggests that RJPs may have too many opportunity costs for the organization
Employee Work-Related Health, Stress, and Safety 537
because the highest quality applicants may be less willing to pursue jobs for which negative infor-
mation has been presented. Clearly, organizations need to be honest about the actual job; how-
ever, emphasizing the negative aspects of the job may hurt recruiting, especially with high-quality
applicants.
JOB-LEVEL STRESSORS
Job and role stressors such as role conflict, role ambiguity, and role overload (Kahn, Wolfe, Quinn,
Snoek, & Rosenthal, 1964) have long been known to contribute to the stress experience. Role conflict
occurs when employees’ expectations are incongruent with those expressed by their role senders.
Communication by those in authority of work expectations that are largely incompatible with those
understood and internalized by employees may increase the likelihood of burnout. Role ambiguity
is related to the amount of predictability in the work environment; thus, experienced stress may be
more prevalent in situations in which employees are uncertain about their work goals and the means
available for accomplishing them. Role overload, qualitative and quantitative, can contribute to the
experience of burnout. Workers may experience qualitative overload if they feel deficient in the basic
skills necessary for effective task completion. Quantitative overload is characterized by the belief
that one’s work cannot be completed in the time allotted. Employees may experience stress if they
perceive that they cannot successfully complete their work because of lack of skill, lack of time,
or both. Vandenberg and colleagues noted that occupational stress research has consistently
demonstrated the deleterious effects of role stressors on occupational strain outcomes such as
burnout, dissatisfaction, decreased performance, and psychophysiological responses such as increased
heart rate and blood pressure (Vandenberg, Park, DeJoy, Wilson, & Griffin-Blake, 2002).
Perhaps one of the most well-known conceptualizations of job stress is that of Karasek’s (1979)
Demands-Control model. Karasek suggested that heavy job demands, coupled with a lack of con-
trol, are associated with strain and job dissatisfaction. This is because control provides individuals
with the confidence and efficacy to perceive and interpret their task environment in nonthreatening
ways, thereby neutralizing the potentially dysfunctional effects of job demands (Theorell, 2004).
Accordingly, we must distinguish between the job demands placed on the individual employee and
the discretion the worker is permitted in deciding how to cope with these heightened expectations.
Implicitly, demands can increase with little or no increase in the individual’s psychological
strain as long as adequate levels of job control are maintained (Mauno, Kinnunen, &
Ruokolainen, 2007).
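The Demands-Control logic is often summarized as a two-by-two classification of jobs. As an illustrative sketch only (the function name, scale anchors, and cut-points below are our assumptions, not values specified by Karasek's model), such a classification might look like:

```python
# Illustrative classification of jobs into the four quadrants implied by
# Karasek's (1979) Demands-Control model. The scale range and cut-point
# are assumptions for this sketch, not values specified by the model.

def karasek_quadrant(demands, control, cutoff=3.0):
    """Classify a job given demand and control scores (e.g., 1-5 survey means)."""
    high_demands = demands > cutoff
    high_control = control > cutoff
    if high_demands and not high_control:
        return "high-strain"  # heavy demands without discretion: greatest strain risk
    if high_demands and high_control:
        return "active"       # challenging but controllable work
    if not high_demands and high_control:
        return "low-strain"
    return "passive"

# A demanding job with little decision latitude lands in the high-strain cell.
print(karasek_quadrant(demands=4.5, control=2.0))  # high-strain
```

The key design point mirrors the text: demands alone do not place a job in the high-strain cell; only the combination of high demands and low control does.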
Finally, the need for adaptive workers has become increasingly important as today’s organiza-
tions are characterized by changing, dynamic, and sometimes, turbulent environments (Ilgen &
Pulakos, 1999). Employees need to be adaptable, flexible, and tolerant of uncertainty to perform
effectively (Pulakos, Arad, Donovan, & Plamondon, 2000). Employee adaptability encompasses a
wide variety of behaviors including handling emergencies or crisis situations, handling work stress,
dealing with uncertainty, and learning new technologies and procedures (Pulakos et al., 2000). The
question is how can managers select adaptable workers or train workers to be adaptable?
Given the various types of adaptable behaviors, it might not be possible to select or train workers
to be adaptable on all aspects of their performance; however, we may be able to offer some general
guidelines for selection and training. First, research has shown some evidence that certain person-
alities might be more (or less) adaptive. For example, LePine, Colquitt, and Erez (2000) examined
the effects of conscientiousness on decision-making before and after unforeseen changes in a task
context and found that individuals higher in conscientiousness did not adapt quickly to change. Much
more research is needed on personality profiles (i.e., examining several personality dimensions in
conjunction with one another) before selecting employees based on personality is warranted. This
will be discussed in more detail in a later section.
Second, managers may want to consider prior experience in adaptability as a selection crite-
rion. Research has long demonstrated that one of the best predictors of future performance is
538 Handbook of Employee Selection
past performance (e.g., Wernimont & Campbell, 1968). Biodata instruments that emphasize prior
experiences with crises and emergencies may prove to be an effective means of selection (cf.
Pulakos et al., 2000). Finally, training employees to be more adaptive by exposing them to various
unpredictable situations in a training setting that they might be expected to encounter in the work
setting may prepare workers to be more adaptive and creative.
INTERPERSONAL RELATIONSHIPS
Employees require specific job resources (e.g., social support) to successfully navigate stressful
work environments while maintaining psychological health. To date, social support has
attracted the most extensive amount of investigation in the interpersonal domain, and findings
consistently support the idea that a lack of support from coworkers and supervisors is highly
correlated with increases in occupational stress and burnout (Maslach, Schaufeli, & Leiter, 2001).
Work environments that fail to support emotional exchange and instrumental assistance may
exacerbate strain by isolating employees from each other and discouraging socially supportive
interactions. Workplaces characterized by conflict, frustration, and hostility may have the same
effect. Besides a general lack of social support, we focus on two additional types of interpersonal
stressors commonly found in the workplace—abusive supervision and perceptions of politics. We
also examine how having political skill can help to alleviate some of the negative effects from
interpersonal stressors.
Abusive supervision is one of the most detrimental interpersonal stressors found in the work-
place. Abusive supervision reflects subordinates’ perceptions of negative and hostile verbal and
nonverbal leader behaviors (Tepper, 2007). Behaviors include public criticism, yelling, rudeness,
bullying, coercion, and blaming subordinates for mistakes they did not make (Burton & Hoobler,
2006; Tepper, Duffy, & Shaw, 2001).
Research indicates that abused subordinates are less satisfied with their jobs, less committed
to their organizations, and more likely to display turnover intentions than nonabused subordinates
(Schat, Desmarais, & Kelloway, 2006). Employees consider abusive supervision to be a source of
stress and injustice in the workplace that has the potential to influence their attitudes, psychologi-
cal distress, and physical well being (Aryee, Chen, Sun, & Debrah, 2007; Grandey, Kern, & Frone,
2007; Tepper, 2007).
If employees believe their behaviors have no bearing on the accrual of desired outcomes, their
sense of volition is weakened (Greenberger & Strasser, 1986), and many researchers believe that
perceived control is a more powerful predictor of functioning than actual control (Burger, 1989). This
distinction is critical because individuals’ perceived control influences their behaviors and emo-
tions, regardless of the actual control conditions contributing to these perceptions. Work environ-
ment factors such as regulated administration, available help, and feedback influence perceived
control. Not surprisingly, recent work suggests that supervisors may engage in abuse to protect their
own sense of power and control over work situations (Tepper, Duffy, Henle, & Lambert, 2006), thus
limiting that of their employees. Abusive supervisor behavior has been found to increase
employees’ experienced stress (Tepper, 2007) and to reduce their psychological and
physical well being (Grandey et al., 2007).
Another well-researched interpersonal level stressor is organizational politics. Organizations
have long been considered political arenas, and the study of organizational politics has been a
popular topic for many years. Mintzberg (1983) defined politics as an individual or group behavior
that is typically disruptive, illegitimate, and not approved of by formal authority, accepted ideol-
ogy, or certified expertise. Organizations, indeed, can be viewed as political arenas, where informal
negotiation and bargaining, deal-making, favor-doing, quid-pro-quo interactions, and coalition and
alliance building characterize the way things really get done. Environmental circumstances, such
as perceptions of organizational politics, can be thought of as work demands, which are potential
sources of stress because they threaten or cause a depletion of the resources individuals possess.
Furthermore, research indicates that workplace politics are a significant concern and source of
stress for many workers.
What we know less about are the characteristics that enable one to exercise influence in ways that
lead to success in political work environments. Some have referred to such qualities as interpersonal
style, “savvy,” “street smarts,” and “political skill” (e.g., Reardon, 2000). Research has demon-
strated how different forms of personal control (e.g., interpersonal social skill or political skill) can
mitigate the negative effects of job stressors. Political skill is the ability to effectively understand
others at work and to use such knowledge to influence others to act in ways that enhance one’s per-
sonal and/or organizational goals (Ferris et al., 2007).
Politically skilled individuals are socially astute and keenly aware of the need to deal differently
with different situations and people. Therefore, they have the capacity to adjust their behavior
to different and changing situational demands (i.e., self-monitoring) in a sincere and trustworthy
manner. It has been suggested that political skill generates an increased sense of self-confidence
and personal security because politically skilled individuals experience a greater degree of inter-
personal control, or control over activities that take place in social interactions at work (Paulhus &
Christie, 1981; Perrewé et al., 2005). Furthermore, greater self-confidence and control lead indi-
viduals to interpret workplace stressors in different ways, resulting in such individuals experiencing
significantly less strain/anxiety at work (Kanter, 2004). Consistent with this argument, Perrewé
et al. (2005) found that political skill neutralized the negative effects of role conflict on psycho-
logical anxiety, somatic complaints, and physiological strain (i.e., heart rate, systolic and diastolic
blood pressure) and that it moderated the role overload-strain relationship in a similar
manner. The important message is that personal control, such as believing
one has the political skill to successfully navigate his or her work environment, appears to play a
fairly significant role in buffering the negative effects of work stressors. On the other hand, a lack
of personal control may exacerbate the stressor-strain relationship or it may even be perceived as a
stressor itself.
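Buffering effects of this kind are typically tested as a stressor × moderator interaction term in a regression model. The following sketch illustrates that analytic logic on simulated data; the variable names, coefficients, and data are hypothetical and are not drawn from Perrewé et al. (2005):

```python
import numpy as np

# Simulated illustration of a buffering (moderation) test; the data and
# coefficients are hypothetical, not taken from Perrewé et al. (2005).
rng = np.random.default_rng(0)
n = 500
stressor = rng.normal(size=n)   # e.g., standardized role overload
skill = rng.normal(size=n)      # e.g., standardized political skill

# Build in a buffering effect: the stressor raises strain, but less so
# for politically skilled employees (negative interaction coefficient).
strain = (0.5 * stressor - 0.2 * skill
          - 0.3 * stressor * skill
          + rng.normal(scale=0.5, size=n))

# Moderated regression: strain ~ stressor + skill + stressor x skill
X = np.column_stack([np.ones(n), stressor, skill, stressor * skill])
b, *_ = np.linalg.lstsq(X, strain, rcond=None)

# The stressor-strain slope at a given skill level is b[1] + b[3] * skill;
# with b[3] < 0, the slope flattens as political skill rises (buffering).
slope_low_skill = b[1] + b[3] * (-1.0)   # 1 SD below mean skill
slope_high_skill = b[1] + b[3] * (+1.0)  # 1 SD above mean skill
print(round(slope_low_skill, 2), round(slope_high_skill, 2))
```

A steeper stressor-strain slope at low skill than at high skill is the statistical signature of the buffering effect described above.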
Given the importance of political skill, we recommend organizations consider this an important
attribute in employees and encourage the development of political skill. Today’s training, more
than ever before, needs to be compelling, realistic, practical, relevant, and lasting. In addition,
training should encourage risk-taking and facilitate improved awareness and behavioral flexibility.
Assessment centers that include simulations and drama-based training may be viable options for
political skill development (see Ferris, Davidson, & Perrewé, 2005, for a more in-depth discussion
of developing political skill). Drama-based training is a training model that includes lifelike simu-
lations for participants to practice managing complex human interactions in a safe and controlled
learning environment (St. George, Schwager, & Canavan, 2000), and, as such, it provides a useful
vehicle to shape and develop various social skills.
Further, assigning individuals to work with skilled mentors is another important way to develop
influence skills. Individuals can observe professionals in real work situations as they exercise influ-
ence in meetings with subordinates and peers. Language, facial expressions, body posture, and
gestures will convey messages to observers as to how influence is best exercised. The key is to
assign individuals to talented and understanding mentors who engage in many social influence
interactions, and to give those individuals ample opportunity to discuss the interactions
they observe.
PERSONAL CHARACTERISTICS
Although various aspects of the external environment play a critical role in the experience of stress
and burnout, specific personal characteristics may lead some individuals to be more likely to expe-
rience strain than others in the same environment. The evidence on individual differences, such
as personality differences, suggests that certain individuals are more prone to strain than others.
The Five-Factor Model or “Big Five” model of personality has been extensively examined in the
organizational context over the past decade. Although some disagreement exists over the appro-
priate names for the five factors, most would agree that the Five-Factor Model consists of five
broad dimensions of personality: extraversion, neuroticism, conscientiousness, agreeableness, and
openness to experience. Proponents suggest that these five factors are the basic building blocks of
personality (McCrae & Costa, 1999). Research using this typology indicates that individuals high
in neuroticism are more likely to experience stress and burnout (Zellars & Perrewé, 2001). Further,
individuals high in extraversion, agreeableness, and openness to experience are less likely to
experience stress (Zellars, Perrewé, & Hochwarter, 2000). In a review of the role of personality in organizations, Perrewé and
Spector (2002) discussed how Type A behavior pattern and negative affectivity have been shown
to have positive associations with experienced stress and negative associations with health and well
being. In addition, individuals with a high internal locus of control experience less strain than
individuals with a high external locus of control.
Individuals high in conscientiousness are described as efficient, diligent, thorough, hardwork-
ing, persevering, and ambitious (e.g., McCrae & John, 1992). Conscientiousness has been posi-
tively related to organizational citizenship behaviors (Borman & Penner, 2001), job performance
(Barrick & Mount, 1991), workplace safety performance (Wallace & Vodanovich, 2003), intrinsic
and extrinsic career success (Judge, Higgins, Thoresen, & Barrick, 1999), and negatively related to
absence (Judge, Martocchio, & Thoresen, 1997). Taken in its entirety, it is clear that conscientious
employees possess several positive work attributes.
However, more research is needed on personality profiles before selecting employees based on
personality is warranted. For example, conscientiousness had a negative relationship with decision
quality after an unanticipated change, which suggests that conscientious people do not adapt quickly
to change (LePine et al., 2000). Thus, we do not recommend selecting (or not selecting) employees
on the basis of one personality dimension (e.g., conscientiousness) alone. For example, research has
found that conscientiousness, when coupled with positive affectivity (i.e., the dispositional tendency
to experience positive emotions across situations and time), resulted in the lowest levels of reported
job tension (Zellars, Perrewé, Hochwarter, & Anderson, 2006). Further, the individual difference
variables of perceived control, optimistic orientation, and self-esteem are highly correlated vari-
ables and, together, form a hardy or “resilient personality” (Major, Richards, Cooper, Cozzarelli,
& Zubek, 1998) that can help workers to adapt to change and cope with work stressors. Although a
comprehensive examination of personality is beyond the scope of this chapter, personality clearly
has the potential to be a powerful selection tool. However, additional research is critical before
confident predictions about workers’ ability to handle stressors and perform adaptively can be
made.
In addition to personality characteristics, simple demographic differences have been shown to
have an association with experienced stress. We focus on two demographic characteristics that have
been found to have some relation with occupational stress—specifically, age and gender. Research
has demonstrated that younger employees consistently report a higher level of burnout (Maslach
et al., 2001). Some researchers suggest that older employees experience lower levels of burnout
because they have shifted their own expectations to fit reality on the basis of their personal experi-
ences (Cordes & Dougherty, 1993). These findings suggest that older, more experienced employees
are better able to handle the demands of stressful work environments. Alternatively, the findings
regarding older workers may reflect that they have passed some critical threshold of burnout that
would trigger turnover; that is, they may handle stressful environments by altering their percep-
tions and reducing their expectations of what is possible in terms of career accomplishment or
satisfaction. High expectations and unmet expectations can encourage increased levels of burnout
(Cordes & Dougherty, 1993). Younger employees tend to be more idealistic and thus may react more
intensely when their overly optimistic career expectations are shattered.
On the other hand, younger workers have been shown to respond more positively to some work-
place stressors—specifically, perceptions of organizational politics. Results across three studies
demonstrated that increases in politics perceptions were associated with decreased job performance
for older employees and that younger employees achieved higher performance scores when percep-
tions of politics were high (Treadway et al., 2005).
In regard to gender, stress can result from feelings of discrimination in a male-dominated work
environment (Sullivan & Mainiero, 2006). First, there is much literature that suggests that work-
ing women are more likely to experience stress on the basis of being female (see Powell & Graves,
2003). Being employed in a male-dominated work environment is a cause for stress, because the
norms, values, and expectations of the male-dominated culture are uniquely different (Maier, 1999).
Further, women in male-dominated environments are more likely to face certain stressors such as
sexual harassment and discrimination (Nelson & Burke, 2000).
Gender does not appear to be a strong predictor of preferred work hours (Jacobs & Gerson,
2004). Family circumstances are more important than gender in predicting preferred work hours
(Barnett, 2006); specifically, women and men with young children want more time away from work
than do other groups. Although women with young children cut back on their time at paid work
more so than men, they do so to a smaller extent than in previous generations. Jacobs and Gerson
(2004) found little support for the popular belief that married women with young children are the
primary group wishing to work less, stating that “About half of married men and women
across a range of family situations express such a desire” (p. 73).
Selecting employees on the basis of personality, gender, or age is not recommended. What is
encouraged is setting realistic expectations for career advancement, allowing flexibility and control
in work schedules, and providing training opportunities for all employees.
OCCUPATIONAL SAFETY
In this section, we review the role selection plays in supporting safer workplaces. Accidents at work
are costly for individuals and organizations and avoiding severe accidents is an essential priority
for all organizations. Therefore, it is not surprising that considerable attention is paid to factors that
might influence whether an individual is involved in an accident. Sanders and McCormick (1987)
identified selection as one of the key strategies used to reduce human error in addition to training
and job design.
Despite the popularity of selection systems to manage safety, there is limited evidence for the
effectiveness of selection for improving organizational safety. Guastello (1993) conducted a meta-
analytic review of accident prevention programs and found that personnel selection programs had
a relatively weak relationship with accident rates. He found that although individual selection
practices were the most common type of accident reduction program in use, they were the least
effective of the ten types of intervention compared. Sanders and McCormick (1987) considered
work design to be a more effective approach to improving safety than selection
because it requires less ongoing maintenance and support. Moreover, they argued it is easier to limit
the occurrence of human errors by making them impossible, difficult, or inconsequential through
work design rather than relying on changing and managing individuals.
There is substantial agreement among researchers from various disciplines that safety needs to be
considered from a systemic perspective that includes factors at the individual, micro-organizational,
and macro-organizational level (Hofmann, Jacobs, & Landy, 1995). Vredenburgh (2002) found that
selection practices were effective as part of a broader proactive strategy of recruiting and training
safety conscious individuals. However, focusing solely on individual causes is neither sufficient nor
useful for understanding safety at work. Some researchers suggest that selection practices designed
to improve safety will be less important than training and design interventions, which have a more
systemic and wide ranging impact on safety (Lawton & Parker, 1998).
Figure 25.2 depicts how selection processes can be situated within a larger systemic framework
that includes other systems (e.g., training) while focusing on individual differences and behavior.
The goal of selection is to identify individual characteristics that might influence individual behav-
ior and cognition, which, in turn, might influence consequences such as accidents in an organiza-
tion. The figure also shows that systemic factors might shape any aspect of the implied causal chain
from individual differences to consequences. In a safety context, these systemic factors have been
conceptualized in terms such as “latent failures.” They represent the impact of managerial and
organizational processes on safety outcomes.
With the above considerations in mind, our review will focus on the role of selection within a
broader context. We begin by looking more closely at the meaning of safety as a criterion construct.
FIGURE 25.2 Selection processes situated within a broader system (e.g., training, risk).
We consider the evidence for selection methods and measures that reduce accident outcomes. Many reviews of safety also
include road accidents and injuries as part of their measurement. However, we exclude studies of
road safety unless they specifically incorporate the work context.
There is a long history of research seeking to identify factors that predict whether individuals
will experience a work accident. The most often-studied attribute—and perhaps least successful—
has been the search for an “accident-prone” individual. Despite the many studies and the popular-
ity of this idea, there is little consistency in the findings from this kind of research. Overall, there
is little evidence that an accident-prone personality can be identified or that systematic individual
differences can distinguish employees who have accidents from those who do not (Lawton &
Parker, 1998).
Glendon and McKenna (1995) argued that it is impossible to define an overall stable profile that
identifies an accident-prone individual. Although the concept of accident proneness does not match
the evidence, researchers continue to identify personality characteristics that might be linked to
accidents (Lawton & Parker, 1998).
A wide range of personality dimensions have been investigated as potential antecedents of acci-
dents and injury. Dimensions of the Big Five have received considerable attention, and Clarke and
Robertson (2005) were able to conduct a meta-analysis of studies involving workplace and motor
vehicle accidents. Results from the workplace studies showed that low conscientiousness (nine stud-
ies) and low agreeableness (seven studies) were associated with more individual accidents.
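The core arithmetic behind such a meta-analysis is a sample-size-weighted average of study correlations. As a hedged illustration only (the study correlations and sample sizes below are invented, not the values Clarke and Robertson reported):

```python
# Bare-bones meta-analytic average: a sample-size-weighted mean of study
# correlations (Hunter-Schmidt style). The r's and N's below are invented
# for illustration; they are NOT the values Clarke and Robertson reported.
rs = [-0.25, -0.10, -0.30, -0.18]  # conscientiousness-accident correlations
ns = [120, 340, 95, 210]           # per-study sample sizes

# Weight each study's r by its sample size so larger studies count more.
r_bar = sum(n * r for n, r in zip(ns, rs)) / sum(ns)
print(round(r_bar, 3))
```

A negative weighted mean would indicate that lower conscientiousness is associated with higher accident involvement across the pooled studies, which is the direction of the workplace findings summarized above.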
Trait affectivity has also been considered as a broad personality dimension that might predict
accidents and injury. Iverson and Erwin (1997) found trait positive affectivity and trait negative
affectivity were related to accidents 1 year later after controlling for a range of job conditions.
Although they did not control for stability in these characteristics, the design was stronger than
many in this area. They suggested that extraversion factors such as overconfidence and intolerance
were associated with risk-taking, and neuroticism factors such as anxiety and indecision were asso-
ciated with task distractibility. Frone (1998) found negative affectivity but not rebelliousness and
impulsivity to be related to accidents. However, this relationship disappeared after taking account
of physical hazards and workload.
Outside of the Big Five and trait affectivity, Guastello (1993) found that predictors associated
with maladjustment did show a relationship with accident involvement. Two types of maladjust-
ment were considered. Personal maladjustment was based on measures such as distractibility and
tension. Social maladjustment was based on measures such as safety locus of control. Studies of
impulsivity, alcohol use, and drug use showed no significant relationships with accidents. Liao,
Arvey, Butler, and Nutting (2001) found psychopathic deviant and conversion hysteria scales of the
Minnesota Multiphasic Personality Inventory (MMPI) were associated with frequency of injuries in
a prospective study of firefighters. They also found social introversion to be associated with injury
rates (see Liao et al., 2001, p. 231, for a review of MMPI types). Conversion hysteria was based on
patients who exhibited some sensory or motor disorder. Psychopathic deviants were more likely to
act on impulse or ignore rules. Finally, locus of control has been identified in some studies as being
related to accidents; however, research in this area is inconsistent and inconclusive (see Lawton &
Parker, 1998).
A range of demographic factors and job types have been linked to accidents. Adolescents repre-
sent the age group with the highest risk for nonfatal accident and injury (Frone, 1998). A concurrent
study of adolescents found work injuries were associated with gender, negative affectivity, job ten-
ure, exposure to physical hazards, excessive workloads, job boredom, poor physical health, and
on-the-job substance abuse (Frone, 1998). Female firefighters experienced more injuries (Liao et al.,
2001). Studies of general mental ability have been contradictory (see Hansen, 1989, for a review).
Physical abilities such as strength and flexibility can be valid predictors of performance in hazard-
ous work environments and so might be used to predict safety outcomes (Hogan & Lesser, 1996).
SAFETY COMPLIANCE
Behaviors associated with safety compliance include vigilance, perseverance, and accurately
following procedures. Tests for vigilance can provide information about the extent to which
individuals are able to maintain attention. For example, in critical medical contexts, the ability
to maintain vigilant scanning might be an important element of safety behavior (Subramanian,
Kumar, & Yauger, 1994).
Persevering with safety compliance requires consistent production and maintenance of effort
over time. Conscientiousness is a predictor of effort that has been validated for general job per-
formance and linked to accident outcomes (Barrick & Mount, 1991; Clarke & Robertson, 2005).
Conscientiousness should play an important role in sustaining safety compliance. From a different
perspective, distractibility or neuroticism can reduce an individual’s ability to maintain consistent
effort over time (Hansen, 1989).
Avoiding errors and mistakes is important for safety compliance. Errors of execution and
action (e.g., slips, lapses, trips, and fumbles) and procedural mistakes are more likely to arise
from attention failures. Several cognitive processes have been linked to attention failures and the
situational awareness required for scanning and responding to the work environment (Carretta,
Perry, & Ree, 1996). Cognitive failure (Simpson, Wadsworth, Moss, & Smith, 2005) and safety
consciousness (Westaby & Lee, 2003) describe the way individuals pay attention to the safety
requirements of the work environment, and selection activities can assess the degree to which
individuals are able to demonstrate these capacities. On the other hand, knowledge-based mistakes
occur when an individual lacks the appropriate information to perform correctly. For these types
of mistakes, safety knowledge is likely to be a more important predictor (Hofmann et al., 1995).
Finally, it is important to consider deliberate noncompliance with safety requirements. Integrity
tests have shown validity for predicting organizationally counterproductive behaviors such as rule-
breaking and dishonesty (Sackett & Wanek, 1996).
SAFETY PARTICIPATION
The behaviors that comprise safety participation have received less attention than safety compli-
ance behaviors from the perspective of personnel selection. By definition, these behaviors often
go beyond individual core task requirements and may be discretionary in some jobs. Standard job
analysis practices that focus on individual task performance are therefore less likely to articulate the
behaviors that are important for safety participation.
Selection for these behaviors requires consideration of the broader context and its constraints.
For example, the ability to communicate safety concerns with others and encourage safety com-
pliance of team members might be critical where teams work in high-risk environments. Validity
evidence from personality testing suggests that extraversion can predict performance in jobs
requiring social interaction (Barrick & Mount, 1991). To date, research has focused on contextual
factors that motivate these behaviors such as leadership (Barling, Loughlin, & Kelloway, 2002b)
and job conditions (Probst & Brubaker, 2001). Organizations that can articulate the nature of
safety participation in their specific context will be better able to identify potential individual
predictors of these activities.
SUMMARY OF SAFETY
In summary, our review suggests that selection can play a part in creating a safer work environment, but its role is complex. Many of the attributes required are trainable or are strongly influenced by the organizational environment. Methodological limitations, such as the use of concurrent designs, reduce the
ability of many safety studies to inform selection systems. However, an equally important concern
is the degree to which theory is used to explain the behaviors that constitute the criterion domain
of work safety. Further theoretical development about the way individual behaviors contribute to a
safety system will enhance the role that can be played by selection procedures.
CONCLUSIONS
In the three sections above, we have examined correlates of employee health, work stress, and
safety. On the basis of the literature, there appear to be consistent findings that workplace factors
can enhance the health and safety of employees. Also, there are some relatively stable individual
characteristics that have been found to be related to stress, resilience, safety compliance, and safety
participation. Unfortunately, the empirical literature has generally not considered workplace factors and individual characteristics together, and so cannot examine potential interactions between person characteristics and environmental factors or the relative contribution of each in predicting health, stress, and safety. Many theoretical perspectives on occupational safety, health, and stress have not taken an explicitly interactional perspective and tend to focus on either situational factors or personal characteristics.
Although there is support for the effects of some relatively stable individual characteristics
that might be useful for selection purposes in creating and maintaining a healthy workforce and
a healthy work environment, the current empirical evidence is not strong and there are potential
legal ramifications in using some of these characteristics in selection systems. It is possible that
given certain contexts, selection based on individual characteristics may have utility. However, the literature as a whole currently favors workplace interventions as more effective than selection.
REFERENCES
Anderson, D. R., Whitmer, R. W., Goetzel, R. Z., Ozminkowski, R. J., Dunn, R. L., Wasserman, J., & Serxner,
S. (2000). The relationship between modifiable health risks and group-level health care expenditures.
American Journal of Health Promotion, 15, 45–52.
Aryee, S., Chen, Z., Sun, L., & Debrah, Y. A. (2007). Antecedents and outcomes of abusive supervision: Test
of a trickle-down model. Journal of Applied Psychology, 92, 191–201.
Barling, J., Inness, M., & Gallagher, D. G. (2002a). Alternative work arrangements and employee well being.
In P. L. Perrewé & D. C. Ganster (Eds.), Historical and current perspectives on stress and health:
Research in occupational stress and well being (Vol. 2, pp. 183–216). Oxford, England: JAI Press/
Elsevier Science.
Barling, J., Loughlin, C., & Kelloway, E. K. (2002b). Development and test of a model linking safety-specific
transformational leadership and occupational safety. Journal of Applied Psychology, 87, 488–496.
Barnett, R. C. (2006). Relationship of the number and distribution of work hours to health and quality-of-life
outcomes. In P. L. Perrewé & D. C. Ganster (Eds.), Employee health, coping and methodologies: Research
in occupational stress and well being (Vol. 5, pp. 99–138). Oxford, England: JAI Press/Elsevier Science.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Beehr, T. A., & Glazer, S. (2001). A cultural perspective of social support in relation to occupational stress. In
P. L. Perrewé & D. C. Ganster (Eds.), Exploring theoretical mechanisms and perspectives: Research in
occupational stress and well being (Vol. 1, pp. 97–142). Oxford, England: JAI Press/Elsevier Science.
Boggild, H., & Knutsson, A. (1999). Shift work, risk factors, and cardiovascular disease. Scandinavian Journal
of Work, Environment & Health, 25, 85–99.
Borman, W. C., & Penner, L. A. (2001). Citizenship performance: Its nature, antecedents, and motives. In B.
W. Roberts & R. Hogan (Eds.), Personality in the workplace (pp. 45–61). Washington, D.C.: American
Psychological Association.
Burton, J., & Hoobler, J. (2006). Subordinate self-esteem and abusive supervision. Journal of Managerial
Issues, 18, 340–355.
Burton, W. N., Chen, C., Conti, D., Schultz, A., Pransky, G., & Edington, D. (2005). The association of health
risks with on-the-job productivity. Journal of Occupational and Environmental Medicine, 47, 769–777.
Bureau of Labor Statistics. (2009). Health insurance costs for civilians. Retrieved from http://data.bls.gov/
cgi-bin/surveymost
Carretta, T. R., Perry, D. C., & Ree, M. J. (1996). Prediction of situational awareness in F-15 pilots. The
International Journal of Aviation Psychology, 6, 21–41.
Clarke, S., & Robertson, I. T. (2005). A meta-analytic review of the Big Five personality factors and accident
involvement in occupational and non-occupational settings. Journal of Occupational and Organizational
Psychology, 78, 355–376.
Collins, J. J., Baase, C. M., Sharda, C. E., Ozminkowski, R. J., Nicholson, S., Billotti, G. M., Turpin, R. S.,
Olson, M., & Berger, J. L. (2005). The assessment of chronic health conditions on work performance,
absence, and total economic impact for employers. Journal of Occupational and Environmental Medicine,
47, 547–557.
Cordes, C. L., & Dougherty, T. W. (1993). A review and integration of research on job burnout. Academy of
Management Review, 18, 621–656.
Fenwick, R., & Tausig, M. (2001). Scheduling stress: Family and health outcomes of shift work and schedule
control. The American Behavioral Scientist, 44, 1179–1198.
Ferris, G. R., Davidson, S. L., & Perrewé, P. L. (2005). Political skill at work. Mountain View, CA: Davies-Black/CPP.
Ferris, G. R., Treadway, D. C., Perrewé, P. L., Brouer, R. L., Douglas, C., & Lux, S. (2007). Political skill in
organizations. Journal of Management, 33, 290–320.
Frone, M. R. (1998). Predictors of work injuries among employed adolescents. Journal of Applied Psychology,
83, 565–576.
Goetzel, R. Z., Anderson, D. R., Whitmer, R. W., Ozminkowski, R. J., Dunn, R. L., & Wasserman, J. (1998).
The relationship between modifiable health risks and health care expenditures. An analysis of the multi-
employer HERO health risk and cost database. Journal of Occupational and Environmental Medicine,
40, 843–854.
Goetzel, R. Z., Hawkins, K., Ozminkowski, R. J., & Wang, S. (2003). The health and productivity cost burden
of the “top 10” physical and mental health conditions affecting six large U.S. employers in 1999. Journal
of Occupational and Environmental Medicine, 45, 5–14.
Goetzel, R. Z., Long, S. R., Ozminkowski, R. J., Hawkins, K., Wang, S., & Lynch, W. (2004). Health, absence,
disability, and presenteeism cost estimates of certain physical and mental health conditions affecting U.S.
employers. Journal of Occupational and Environmental Medicine, 46, 398–412.
Grandey, A. A., Kern, J., & Frone, M. (2007). Verbal abuse from outsiders versus insiders: Comparing fre-
quency, impact on emotional exhaustion, and the role of emotional labor. Journal of Occupational Health
Psychology, 12, 63–79.
Greenberger, D. B., & Strasser, S. (1986). Development and application of a model of personal control in orga-
nizations. The Academy of Management Review, 11, 164–177.
Griffin, M. A., & Neal, A. (2000). Perceptions of safety at work: A framework for linking safety climate to safety
performance, knowledge, and motivation. Journal of Occupational Health Psychology, 5(3), 347–358.
Guastello, S. J. (1993). Do we really know how well our occupational accident prevention programs work?
Safety Science, 16, 445–463.
Hale, A. R., & Glendon, A. I. (1987). Individual behaviour in the control of danger. New York, NY: Elsevier.
Hansen, C. P. (1989). A causal model of the relationship among accidents, biodata, personality, and cognitive
factors. Journal of Applied Psychology, 74, 81–90.
Hofmann, D. A., Jacobs, R., & Landy, F. (1995). High reliability process industries: Individual, micro, and
macro organizational influences on safety performance. Journal of Safety Research, 26(3), 131–149.
Hogan, J., & Lesser, M. (1996). Selection of personnel for hazardous performance. Stress and Human
Performance, 195–222.
Ilgen, D.R., & Pulakos, E.D. (1999). Employee performance in today’s organizations. In D. R. Ilgen & E. D.
Pulakos (Eds.), The changing nature of work performance: Implications for staffing, motivation, and
development (pp. 1–20). San Francisco, CA: Jossey-Bass.
Iverson, R., & Erwin, P. (1997). Predicting occupational injury: The role of affectivity. Journal of Occupational
and Organizational Psychology, 70, 113–128.
Jacobs, J. A., & Gerson, K. (2004). The time divide: Work, family and gender inequality. Cambridge, MA:
Harvard University Press.
Jamal, M., & Baba, V. V. (1997). Shift work, burnout and well-being: A study of Canadian nurses. International
Journal of Stress Management, 4, 197–204.
Judge, T. A., Higgins, C. A., Thoresen, C. J., & Barrick, M. R. (1999). The Big Five personality traits, general
mental ability, and career success across the life span. Personnel Psychology, 52, 621–652.
Judge, T. A., Martocchio, J. J., & Thoresen, C. J. (1997). Five-Factor model of personality and employee
absence. Journal of Applied Psychology, 82, 745–755.
Kahn, R. L., Wolfe, D. M., Quinn, R. P., Snoek, J. D., & Rosenthal, R. A. (1964). Organizational stress: Studies
in role conflict and ambiguity. New York, NY: Wiley.
Kaiser Daily Health Report. (2008, January). U.S. health care spending reaches $2.1T in 2006, increasing
6.7%. Retrieved from http://www.kaiserhealthnews.org/Daily-Reports/2008/January/08/dr00049709.
aspx?referrer=search
Kaiser Family Foundation. (2007). Health care spending in the United States and OECD countries. Retrieved
from http://www.kff.org/insurance/snapshot/chcm010307oth.cfm
Kaiser Family Foundation. (2008). Employer health insurance costs and worker compensation. Retrieved from
http://www.kff.org/insurance/snapshot/chcm030808oth.cfm
Kaiser Network. (2006). Companies shift more medical costs to workers. Retrieved from http://www.kaisernetwork.org/daily_reports/rep_index.cfm
Kanter, R. M. (2004). Confidence. New York, NY: Crown Business.
Karasek, R. A. (1979). Job demands, job decision latitude, and mental strain: Implications for job redesign.
Administrative Science Quarterly, 24, 285–308.
Kendall, J., & Ventura, P. L. (2005). A stumbling block on the road to wellness: The ADA disability-related
inquiry and medical examination rules and employer wellness incentive programs. Benefits Law Journal,
18, 57–76.
Krausz, M., Sagie, A., & Biderman, Y. (2000). Actual and preferred work schedules and scheduling control as
determinants of job-related attitudes. Journal of Vocational Behavior, 56, 1–11.
Lawton, R., & Parker, D. (1998). Individual differences in accident liability: A review and integrative approach.
Human Factors: The Journal of the Human Factors and Ergonomics Society, 40, 655–671.
LePine, J. A., Colquitt, J. A., & Erez, A. (2000). Adaptability to changing task contexts: Effects of general cog-
nitive ability, conscientiousness, and openness to experience. Personnel Psychology, 53, 563–593.
Liao, H., Arvey, R. D., Butler, R. J., & Nutting, S. M. (2001). Correlates of work injury frequency and duration
among firefighters. Journal of Occupational Health Psychology, 6(3), 229–242.
Maier, M. (1999). On the gendered substructure of organization: Dimensions and dilemmas of corporate
masculinity. In G. N. Powell, (Ed.), Handbook of gender and work (pp. 69–93). Thousand Oaks, CA:
Sage.
Major, B., Richards, C., Cooper, M. L., Cozzarelli, C., & Zubek, J. (1998). Personal resilience, cognitive
appraisals, and coping: An integrative model of adjustment to abortion. Journal of Personality and Social
Psychology, 74, 735–752.
Maslach, C., Schaufeli, W. B., & Leiter, M. P. (2001). Job burnout. Annual Review of Psychology, 52,
397–422.
McCrae, R. R., & Costa, P. T. (1999). A five-factor theory of personality. In L. Pervin & O. P. John (Eds.),
Handbook of personality (2nd ed., pp. 139–153). New York, NY: Guilford Press.
McCrae, R. R., & John, O. P. (1992). An introduction to the Five-Factor model and its applications. Journal of
Personality, 60, 175–215.
Mills, P. R., Kessler, R. C., Cooper, J., & Sullivan, S. (2007). Impact of a health promotion program on employee
health risks and work productivity. American Journal of Health Promotion, 22, 45–53.
Mintzberg, H. (1983). Power in and around organizations. Englewood Cliffs, NJ: Prentice-Hall.
Motowidlo, S. J., Borman, W. C., & Schmit, M. J. (1997). A theory of individual differences in task and
contextual performance. Human Performance, 10(2), 71–83.
Mykletun, R. J., & Mykletun, A. (1999). Comprehensive schoolteachers at risk of early exit from work.
Experimental Aging Research, 25, 359–365.
Nelson, D. L., & Burke, R. J. (2000). Women executives: Health, stress and success. Academy of Management
Executive, 14, 107–121.
Nicoll, A., & Murray, V. (2002). Health protection—A strategy and a national agency. Public Health, 116,
129–137.
Parkes, K. R. (2003). Shiftwork and environment as interactive predictors of work perceptions. Journal of
Occupational Health Psychology, 8, 266–281.
Parks, K. M., & Steelman, L. A. (2008). Organizational wellness programs: A meta-analysis. Journal of
Occupational Health Psychology, 13, 58–68.
Paulhus, D., & Christie, R. (1981). Spheres of control: An interactionist approach to assessment of perceived
control. In H. M. Lefcourt (Ed.), Research with the locus of control construct, 1: Assessment methods (pp.
161–188). New York, NY: Academic Press.
Perrewé, P. L., & Spector, P. E. (2002). Personality research in the organizational sciences. In G. R. Ferris &
J. J. Martocchio (Eds.), Research in personnel and human resources management (Vol. 21, pp. 1–85),
Oxford, England: JAI Press/Elsevier Science.
Perrewé, P. L., Zellars, K. L., Ferris, G. R., Rossi, A. M., Kacmar, C. J., & Ralston, D. A. (2004). Neutralizing
job stressors: Political skill as an antidote to the dysfunctional consequences of role conflict stressors.
Academy of Management Journal, 47, 141–152.
Perrewé, P. L., Zellars, K. L., Rossi, A. M., Ferris, G. R., Kacmar, C. J., Liu, Y., Zinko, R., & Hochwarter, W. A.
(2005). Political skill: An antidote in the role overload—Strain relationship. Journal of Occupational
Health Psychology, 10, 239–250.
Powell, G. N., & Graves, L. M. (2003). Women and men in management. Thousand Oaks, CA: Sage.
Probst, T. M., & Brubaker, T. L. (2001). The effects of job insecurity on employee safety outcomes: Cross-
sectional and longitudinal explorations. Journal of Occupational Health Psychology, 6(2), 139–159.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K.E. (2000). Adaptability in the workplace:
Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Reardon, K. K. (2000). The secret handshake: Mastering the politics of the business inner circle. New York,
NY: Doubleday.
Rothstein, M. A. (1983). Occupational safety and health law. St. Paul, MN: West Group.
Ryan, D., & Watson, R. (2004). A healthier future. Occupational Health, 56(7), 20–21.
Sackett, P. R., & Wanek, J. E. (1996). New developments in the use of measures of honesty, integrity, conscien-
tiousness, trustworthiness, and reliability for personnel selection. Personnel Psychology, 49, 787–829.
Sanders, M. S., & McCormick, E. J. (1987). Human factors in engineering and design (6th ed.). New York, NY:
McGraw-Hill.
Schat, A. C. H., Desmarais, S., & Kelloway, E. K. (2006). Exposure to workplace aggression from multiple
sources: Validation of a measure and test of a model. Unpublished manuscript, McMaster University,
Hamilton, Canada.
Shurtz, R. D. (2005). Reining in health care costs with wellness programs: Frequently overlooked legal issues.
Benefits Law Journal, 18, 31–60.
Simon, T. M., Bruno, F., Grossman, N., & Stamm, C. (2006). Designing compliant wellness programs: HIPAA,
ADA, and state insurance laws. Benefits Law Journal, 19, 46–59.
Simon, T. M., Traw, K., McGeoch, B., & Bruno, F. (2007). How the final HIPAA nondiscrimination regulations
affect wellness programs. Benefits Law Journal, 20, 40–44.
Simpson, S. A., Wadsworth, E. J. K., Moss, S. C., & Smith, A. P. (2005). Minor injuries, cognitive failures and
accidents at work: Incidence and associated features. Occupational Medicine, 55, 99–108.
Smerd, J. (2007). Smoker? Can’t work here, more firms say. Workforce Management, October 22. Retrieved June 12, 2008, from http://www.lexisnexis.com/us/lnacademic/frame.do?tokenKey=rsh-20.312685.587091969
St. George, J., Schwager, S., & Canavan, F. (2000). A guide to drama-based training. National Productivity
Review, Autumn, 15–19.
Subramanian, R., Kumar, K., & Yauger, C. (1994). The scanning of task environments in hospitals: An empirical
study. Journal of Applied Business Research, 10, 104–115.
Sullivan, S. E., & Mainiero, L. (2007). Women’s kaleidoscope careers: A new framework for examining women’s
stress across the lifespan. In P. L. Perrewé & D. C. Ganster (Eds.), Exploring the work and non-work inter-
face: Research in occupational stress and well being (Vol. 6, pp. 205–238). Oxford, England: JAI Press/
Elsevier Science.
Tepper, B. (2007). Abusive supervision in formal organizations: Review, synthesis and research agenda. Journal
of Management, 33, 261–289.
Tepper, B. J., Duffy, M. K., & Shaw, J. D. (2001). Personality moderators of the relationship between abusive
supervision and subordinates’ resistance. Journal of Applied Psychology, 86, 974–983.
Tetrick, L. E. (2002). Understanding individual health, organizational health, and the linkage between the
two from both a positive health and an ill health perspective. In P. L. Perrewé & D. C. Ganster (Eds.),
Historical and current perspectives on stress and health: Research in occupational stress and well being
(Vol. 2, pp. 117–141). Oxford, England: JAI Press/Elsevier Science.
Theorell, T. (2004). Democracy at work and its relationship to health. In P. L. Perrewé & D. C. Ganster (Eds.),
Research in occupational stress and well being (Vol. 3, pp. 323–357). Oxford, England: JAI Press/
Elsevier Science.
Treadway, D. C., Ferris, G. R., Hochwarter, W. A., Perrewé, P. L., Witt, L. A., & Goodman, J. M. (2005). The
role of age in the perceptions of politics—job performance relationship: A three-study constructive repli-
cation. Journal of Applied Psychology, 90, 872–881.
Vandenberg, R. J., Park, K., DeJoy, D. M., Wilson, M. G., & Griffin-Blake, C. S. (2002). The healthy work
organization model: Expanding the view of individual health and well being in the workplace. In P. L.
Perrewé & D. C. Ganster (Eds.), Historical and current perspectives on stress and health: Research in
occupational stress and well being (Vol. 2, pp. 57–115). Oxford, England: JAI Press/Elsevier Science.
Vredenburgh, A. G. (2002). Organizational safety: Which management practices are most effective in reducing
employee injury rates. Journal of Safety Research, 33(2), 259–276.
Wallace, J. C., & Vodanovich, S. J. (2003). Workplace safety performance: Conscientiousness, cognitive fail-
ure, and their interaction. Journal of Occupational Health Psychology, 8, 316–327.
World Health Organization. (2008). Workplace health promotion. Retrieved from http://www.who.int/
occupational_health/topics/workplace/en
Xie, J. L., & Schaubroeck, J. (2001). Bridging approaches and findings across diverse disciplines to improve
job stress research. In P. L. Perrewé & D. C. Ganster (Eds.), Exploring theoretical mechanisms and per-
spectives: Research in occupational stress and well being (Vol. 1, pp. 1–53). Oxford, England: JAI Press/
Elsevier Science.
Zellars, K. L., & Perrewé, P. L. (2001). Affective personality and the content of emotional social support:
Coping in organizations. Journal of Applied Psychology, 86, 459–467.
Zellars, K. L., Perrewé, P. L., & Hochwarter, W. A. (2000). Burnout in healthcare: The role of the five factors
of personality. Journal of Applied Social Psychology, 30, 1570–1598.
Zellars, K. L., Perrewé, P. L., Hochwarter, W. A., & Anderson, K. S. (2006). The interactive effects of positive
affect and conscientiousness on strain. Journal of Occupational Health Psychology, 11, 281–289.
26 Criterion Validity and Criterion Deficiency: What We Measure Well and What We Ignore
Jeanette N. Cleveland and Adrienne Colella
Work psychologists have had a long-standing interest in the criterion problem and have been par-
ticularly concerned with determining how to measure job performance and success at work. Many
notable industrial and organizational (I-O) psychologists have urged researchers to develop theo-
ries of employee performance (e.g., Campbell, 1990). Within the last 2 decades, we have made
progress in the articulation and measurement of required tasks and behaviors at work (Campbell,
1990; Chapter 21, this volume) and the identification of contingent, discretionary behaviors that are
important for person-team success (e.g., Borman & Motowidlo, 1993; Chapter 22, this volume).
However, the approach to the criterion problem followed by most researchers continues to be gener-
ally narrow and reinforces the status quo in terms of what is defined as success or successful work
behaviors in organizations.
The definitions of what represents success in organizations at the individual level (e.g., job perfor-
mance) and the organizational level (e.g., organizational effectiveness) have not changed fundamen-
tally over the years. There have been advances in understanding particular aspects of performance
and success (e.g., contextual performance), but there have not been substantial changes in the way
we think about the criteria that are used to evaluate personnel selection, training, or other interven-
tions in organizations. Our thinking has not changed, but the context in which work occurs certainly
has (Cascio & Aguinis, 2008).
The boundaries between the spheres of work and nonwork, and between local, national, and global domains, have steadily eroded, and these domains increasingly overlap. The world of work is becoming increasingly complex and intrusive (e.g., it is common for
employees to take their work with them when they go home or on vacations), and the definition
of success in the workplace is constantly evolving. This implies the need for a broader view of
the criterion domain. Several previous chapters in this volume (e.g., Chapters 21 and 22) pro-
vide excellent reviews and discussions about specific aspects or facets of workplace behavior and
performance domains. Each of these performance aspects is important to the effectiveness and
health of employees and organizations within the 21st-century workplace. In the current chapter,
we argue that both traditional and emerging facets of the performance domain must be incorpo-
rated as part of the foundation for an integrated and long-term focused human resources (HR) sys-
tem, and, importantly, that issues concerning the larger context, especially the work-nonwork interface, must be incorporated into our criterion models to more fully capture the
increasing complexities of the workplace and the diversity of the workforce. Finally, as Ployhart
and Weekley so eloquently articulate in Chapter 9, this volume, using a multilevel lens, we may be
better able to link individual-level HR systems to more macro indices of organizational productiv-
ity and sustainability.
relative success or failure” (p. 1). Success was recognized as nearly always multidimensional in
nature, suggesting that its sources of variability are complex. The choice of dimensions to represent
or define performance depends on how broadly or narrowly one interprets the meaning of success
(i.e., conceptual criterion; Nagle, 1953; Toops, 1944).
Traditionally, discussions of the criterion problem have started with the assumption that the
conceptual or ultimate criterion of success is reasonably well defined and that the major problem
involves the shift from conceptualizing or defining success to its actual measurement. When this
shift is made, a gap is likely to develop between the “ideal” conceptualization of success and its
practical or actual measurement. The relationship between conceptual and practical measurement
of success is depicted using two general notions: conceptual criterion and actual criteria. The term
“conceptual,” “theoretical,” or “ultimate criterion” (Thorndike, 1949) describes the full domain of
everything that ultimately defines success (Cascio, 2000). Because the ultimate criterion is strictly
conceptual, it cannot be measured or directly observed. It embodies the notion of “true,” “total,”
“long-term,” or “ultimate worth” to the employing organization (Cascio, 2000).
Implicit in this model is the questionable assumption that we all know and agree about the
conceptual definition of success (i.e., the idea that the ultimate criterion is obvious and noncontro-
versial). Yet, key performance stakeholders (e.g., employees, organizations, families, society, and
the environment) do not necessarily know or agree on the conceptual definition and content of the
ultimate criterion. In short, an ultimate criterion is important because the relevance or linkage of
any operational or measurable criterion is better understood if the conceptual stage is clearly and
thoroughly documented (Astin, 1964). I-O psychology can and does measure some facets of success
very well, but these may reflect a small, narrow proportion of the ultimate criterion.
Task Performance
Campbell et al.’s (1993) model of performance focuses on required worker behaviors in a given job and attempts to delineate the dimensions of job performance. This model guides researchers and
managers in assessing and preventing criterion contamination and deficiency by articulating the
eight most important aspects of job performance. The eight factors (i.e., job-task proficiency, non-
job-specific task proficiency, written and oral communication proficiency, demonstrating effort,
maintaining personal discipline, facilitating peer and team performance, supervision/leadership,
and management/administration) are assumed to be the highest-order factors that are sufficient
to describe the latent hierarchy among all jobs. That is, the construct of performance cannot be
[FIGURE 26.1 Criterion components: relationships among criterion relevance, contamination, and deficiency.]
meaningfully understood by combining these factors into a smaller subset or one general factor.
Although the content of the factors may vary slightly across jobs, each focuses on observable behaviors that people perform and that are under their control.
There is a good deal of value in articulating what task performance actually means. However,
the specific performance components articulated by Campbell et al. (1993) and others address work
success from what is arguably a very narrow perspective. In particular, this model defines perfor-
mance in the workplace as a set of behaviors that is independent of behavior in our nonwork lives, or at least assumes that nonwork factors are not relevant for defining success at work. From
this perspective, the flow back and forth between the work and nonwork spheres of life is at best a
form of criterion contamination.
Contextual Performance and Organizational Citizenship Behavior
Contextual performance, often labeled organizational citizenship behavior (OCB), includes behaviors such as volunteering for task activities that are not formally part of the job, demonstrating effort, helping and cooperating with others, following organizational rules, and
supporting organizational objectives (Borman & Motowidlo, 1993). A number of these behaviors
would fall under a subset of components identified by Campbell et al. (1993). Borman et al. (2001)
found that the structure of citizenship behaviors could be described using three categories: personal
support (behaviors benefiting individuals in the organization including helping, motivating,
cooperating with, and showing consideration), organizational support (behaviors benefiting the
organization including representing the organization favorably, showing loyalty, and complying
with organizational rules and procedures), and conscientious initiative (behaviors benefiting the job
or task, including persisting with extra effort to complete tasks, taking initiative, and engaging in self-development activities; Johnson, 2003).
Although we do not have consensus on the dimensionality of OCBs, the articulation of contextual
performance challenges traditional definitions of individual work performance (Ilgen & Pulakos,
1999). Further, as discussed in greater detail in Chapter 22, this volume, contextual performance/
OCBs reflect an initial shift toward broadening work performance criteria to include performing in
interdependent and uncertain work contexts (Neal & Hesketh, 2002).
Adaptive Performance
A third component of job performance, adaptive performance, is distinct from task and contex-
tual performance (Hesketh & Neal, 1999). Adaptive performance is the proficiency with which a
person alters his or her behavior to meet the demands of the environment, an event, or a new situation
(Pulakos, Arad, Donovan, & Plamondon, 2000), or an effective change in response to an altered
situation (White et al., 2005). Although some dimensions of adaptive performance overlap with
task or contextual performance, the dimension of addressing uncertain and unpredictable work
situations may be distinct from task and citizenship performance (Johnson, 2003). Related to the
construct of adaptive performance, successful aging has recently been conceptualized as successfully adjusting to developmental change (Baltes & Baltes, 1990) or as competently adapting or adjusting (Abraham & Hansson, 1995; Featherman, 1992; Hansson, DeKoekkoek, Neece, & Patterson, 1997). Adaptive performance is reviewed in more detail in
Chapter 22, this volume.
Organizations need to be aware of how current measures of success affect employees’ future ability to remain with the organization and to continue to perform in a manner that benefits the organization, themselves, and society. Considering the sustainability of HR requires taking a longer-term perspective than is usually the case. Furthermore, given current trends in criterion measurement and the typical failure to consider multiple stakeholders, our criteria of “success” continue to be deficient in at least
two ways. First, we need to expand the notion of criteria to include aspects of individual functioning
outside of the work context. Employee health and well-being, stress, marital quality, and parental
performance are all potential aspects of an emerging performance domain within the larger context
of our lives and are inextricably linked with work organizations (Cleveland, 2005). Behavior at work
affects behavior away from work and vice versa, and a truly comprehensive definition of effective-
ness and success in an organization is likely to include facets (e.g., health and well-being) that have
traditionally not been thought of as part of the performance domain.
Second, the content of our criteria should continue to be broadened to explicitly recognize the
multilevel implications of the construct (Cleveland, 2005; DeNisi, 2000; Murphy & Cleveland,
1995). We need to more explicitly link conceptions of individual performance and success to
definitions of effectiveness and sustainability at the group, organizational, and societal level. The
same behaviors and outcomes that contribute to success as traditionally defined at the individual
level (e.g., high level of competitiveness, high level of involvement in work) might sow the seeds
of failure at other levels (e.g., by building destructive conflict within organizations, by contribut-
ing to the failure of community institutions that compete with the workplace for employees’ time
and effort). These are themes that are echoed in Cascio and Aguinis (2008) and Chapter 9, this
volume.
Recognition of the multilevel nature of performance is important for several reasons (DeNisi,
2000). Notably, it provides one way that we can examine how our definitions and measures of
success at one level are linked with potential costs associated with or incurred at another level.
Broadening the definition of the criterion space to include extra-work functioning and multilevel
effects leads us to consider domains that have been given little emphasis by traditional definitions
of individual performance and organizational success. Two such domains are individual well-being
and organizational health.
Individual Health
In Warr’s framework (1987, 1994a), variations in mental health reflect different relative emphases on ill health and good health. Mental or psychological health can be described using six
dimensions: subjective or affective well-being, positive self-regard, competence, aspiration, auton-
omy, and integrated functioning (Warr, 2005). Well-being is the most commonly investigated facet
of mental health and, according to Warr (1987), it includes two orthogonal dimensions: pleasure
(feeling bad to feeling good) and level of arousal (low to high). He identified three assessable aspects
of well-being that can be viewed in terms of their location on these dimensions: specifically, the hor-
izontal axis of pleasure or displeasure, which is measured in terms of satisfaction or happiness; an
axis from anxiety (high arousal, low pleasure) to comfort (low arousal, high pleasure); and an axis
from depression (low arousal, low pleasure) to enthusiasm (high arousal, high pleasure). Indicators
of well-being that emphasize the detection of ill-health rather than good health assess anxiety,
depression, burnout, psychological distress, and physiological or psychosomatic symptoms. On the
other hand, indicators of well-being that emphasize positive mental health assess high arousal-high
pleasure states, such as enthusiasm. Job satisfaction is considered an indicator of either ill
Criterion Validity and Criterion Deficiency 557
health (i.e., dissatisfaction) or positive health (i.e., satisfaction). Either way, it is thought to be a
relatively passive form of mental health because, although it assesses the degree of pleasure/displea-
sure about a job, it does not assess arousal (Parker, Turner, & Griffin, 2003; Warr, 1997).
In addition to affective well-being, Warr (1987) identified five other components of mental health: competence (e.g., effective coping), aspiration (e.g., goal directedness), autonomy/independence (e.g., proactivity), positive self-regard (e.g., high self-esteem), and integrated functioning (i.e., states involving balance, harmony, and inner relatedness). These are important components
of mental health in their own right because (a) they are potentially more enduring than affec-
tive well-being and (b) competence, aspiration, and autonomy/independence represent more active
states and behaviors than most measures of well-being that reflect passive contentment (e.g., job
satisfaction).
How does individual well-being fit as part of a definition of effectiveness or success? First,
there is considerable evidence that workplace stresses are an important source of physical and
mental health problems. Warr (1987, 1999) has developed a framework that identifies key fea-
tures of an environment that have been shown to be related to mental health. The ten features are
described in Table 26.1 in positive terms, but low values are viewed as stressful (Warr, 2005). This
table suggests that the design of jobs (e.g., variety, opportunities for skill use), workplaces (e.g.,
physical security), reward systems (e.g., availability of money), leadership training and develop-
ment systems (e.g., supportive supervision), and design of personnel recruitment selection sys-
tems (e.g., valued social position) could all have mental health implications for the workforce. For
example, given the shrinking, aging, and increasingly diverse global workforce, organizations
need to rethink the primary objectives of recruitment and selection systems. Organizations may
increasingly face the situation of having more job vacancies than qualified individuals to fill them.
Selection systems may need to be retooled to reflect more “recruitment selection.” Selection tests
or measures may need not only to assess how well applicants can perform across various work contexts over a period of time, but also to convey to applicants what range of situations they are likely to encounter and what resources the organization can provide to sustain their performance and worklife health. It can certainly be argued that individuals who perform well in an environment that has adverse effects on their physical or mental health should not necessarily be described as successful.
Second, health effects today are likely to have performance effects tomorrow. That is, an
employee whose health is impaired by the workplace will probably make a smaller contribution to
the organization, the family, and the community over the long run than one whose employment is a source of well-being. Thus, employee well-being and health are important components of sustaining
TABLE 26.1
Job Characteristics Related to Mental Health
Opportunity for personal control
Opportunity for skill use
Externally generated goals
Variety
Environmental clarity
Availability of money
Physical security
Supportive supervision
Opportunity for interpersonal contact
Valued social position
Adapted from Warr, P. (2005). Work, well-being and mental health. In J. Barling, E. K. Kelloway, & M. R. Frone (Eds.), Handbook of work stress (pp. 547–574). Thousand Oaks, CA: Sage.
an organization’s human capital. Indeed, successful organizations already are aware of the link
between health and well-being, performance, and sustainability, as documented in Chapter 25,
this volume. For example, IBM’s corporate policy on employee well-being has led the organization to develop myriad programs to ensure employee health, well-being, and family balance. Furthermore, IBM ties these programs to criteria such as work-related injury and lost workdays (IBM
Corporate Responsibility Report, 2008).
Organizational Health
A healthy organization is one that is competitive within the marketplace and also has low rates of
injury, illness, and disability (Hofmann & Tetrick, 2003). Individual health outcomes are distin-
guished from organizational-level outcomes, but both are likely to be related to individual behavior
at work. Together, individual and organizational effectiveness constitute the health of an organiza-
tion (Parker, Turner, & Griffin, 2003). That is, a healthy organization is one that accomplishes the
business-related goals that define traditional financial success and the human goals of advancing the
health and welfare of the organization’s members.
It is possible to move this discussion one step further and define a healthy organization along three dimensions: (a) being competitive within the marketplace; (b) having low rates of injury, illness, and disability (i.e., a lack of negative outcomes); and (c) promoting the long-term sustainability and well-being of its constituents (e.g., work that increases constituents’ success in terms of competence, aspiration, autonomy, and balance).
Several observations follow from viewing performance and success as multilevel constructs:

1. We assess and develop individual employee performance with the intent of ultimately affecting the performance of the team or the whole organization.
2. Individuals and teams perform in ways to allow the organization to achieve outcomes
referred to as “organizational performance.”
3. Performance at higher levels of analysis is more than just the simple sum of performance at
lower levels; that is, it is not always sufficient to change individual performance to change
team or organization performance.
4. Variables at higher levels of analysis (e.g., organizational structure or climate) can serve as
constraints on (or facilitators of) the performance of individuals and teams. Therefore, we
must understand the organizational context in order to fully understand the performance
of individuals or teams.
In particular, thinking about performance and success from a multilevel perspective might help us understand how and why the ultimate criterion should be expanded. For example, we traditionally construct and validate personnel selection systems as if the only objective of those systems
was to predict future performance at the individual level (e.g., virtually all validation studies use
measures of individual job performance as the criterion of choice). Yet it is clear that the goals of
a personnel selection system are not solely to predict future performance; the goals are to help the
organization make better strategic decisions, be profitable, and sustain productivity. Consistent
with the message conveyed in Chapter 9, this volume, it is critical that the criteria are linked
with unit or organizational strategy. Therefore, our criteria may include unit-, organizational-,
and societal-level assessments, as well as individual-level performance assessments, to be most
consistent with a firm’s strategy. One plausible reason that a validated selection system does not
translate into better unit performance may be the narrowly defined criteria used. There are usually real and legitimate differences in different stakeholders’ definitions of “better decisions.” For
example, an organization may use a class of tests in personnel selection that results in predicted
increases in individual performance but also results in adverse impact, in conflict between supervisors and subordinates, and in negative images of the organization. This might not be thought of
as a success, even if the validity coefficients are all large and positive (Murphy, 2010). Therefore,
the logic that Ployhart and Weekley develop in Chapter 9, this volume, to link individual-level
selection tests to organizational business strategy should also be applied to the re-examination
and development of the criterion domain. That is, relevant macro work context and nonwork fac-
tors should be included within the articulation and domain of success. Cascio and Aguinis (2008)
make similar recommendations using the emerging construct they label “in situ performance,”
which refers to the situational, contextual, strategic, and environmental effects that may influ-
ence individual, team, or organizational performance. By integrating or specifying these effects,
we develop a “richer, fuller, context-embedded description of the criterion space that we wish to
predict” (Cascio & Aguinis, 2008, p. 146). With the changing nature of work and the workforce,
such criterion evolution can more fully capture how work is done in the 21st century (Cascio &
Aguinis, 2008).
I-O psychologists devote a great deal of time and effort to helping organizations make high-stakes decisions about people (e.g., whom to hire, where to place them, and what behaviors to
reward and sanction). A multi-level perspective suggests that these decisions can and probably
should be evaluated in terms of their effects on individuals, work groups, organizations, and
families and communities, and that short- and long-term perspectives should be considered.
To be sure, there are many difficult issues that have to be addressed to put such a program of criterion development in place. Whose perspectives should be considered, and how much weight should be given to each stakeholder in defining individual or organizational success? How should conflicts between stakeholders be addressed (e.g., it might benefit organizations but harm the communities that support them if many employees put in 80-hour weeks)? There are no simple answers to these questions, but I-O psychologists do have experience dealing with multilevel issues in several other domains, and we may be able to draw from this research and experience to gain insights into developing more inclusive definitions of what it means to be a success in the workplace. In particular, there is much to be learned from research on work-family conflict.
Workers are spending more time in the workplace in response to job insecurity, workplace
demands, perceived career needs, and financial pressure. That is, the most compelling finding in
the domain of work-family conflict is not that family interferes with work but that work interferes
with family. If we, as I-O psychologists, focus only on outcomes that directly affect the employer’s
interests (particularly employers’ short-term interests), we are likely to dismiss the most important
aspect of work-family conflict (i.e., the way work can adversely affect families) as outside of the
boundaries of the criterion domain. If we consider the interests of employees, their families and
their communities as a legitimate part of the definition of the ultimate criterion space, we are
less likely to dismiss this important set of findings as being largely irrelevant, or at least as being
someone else’s problem.
Women and men in the United States increased their annual working hours by an average of 233
and 100 hours, respectively, between 1976 and 1993 (Bureau of Labor Statistics, 1997). In 1999, the
average weekly full-time hours over all industries were 43.5–45.2 for certain professional groups
and executives (Bureau of Labor Statistics, 1999). Many employees work longer hours, and dual-
earner couples may work unusual hours or shifts. In the United States and the United Kingdom,
workers feel they need to put in substantial “face time” to demonstrate their commitment (Bailyn,
1993) and many in low-wage occupations work in more than one job. Despite the increasing time
and effort devoted to work, employees are feeling increasing levels of job insecurity (Burchell,
Felstead, & Green, 1997; Reynolds, 1997). From the perspective of multilevel systems, this increas-
ing focus on face time, long hours, and increased insecurity is arguably evidence that organizations
are increasingly unhealthy, and, therefore, increasingly unsuccessful.
Similarly, we can think of research on workers’ experiences with family-friendly work policies
(e.g., parental leave, flextime) differently if we broaden our definitions of performance, effective-
ness, and success. For example, family-friendly policies are of limited value without a secure job,
and there is evidence that many qualified employees decline opportunities to participate in these
programs (Lewis et al., 1998). One way of evaluating the success of an organization would be to pay
attention to the uptake rates for policies such as these. If employees report stress and dissatisfac-
tion as a result of work-family conflict but are unwilling or unable to take advantage of workplace
policies designed to reduce these stresses, this can be considered evidence that the organization is
failing its stakeholders, regardless of what the balance sheet says.
Few of the studies examining the effects of family-friendly policies focus on the couple or the
family as the unit of analysis. In addition, such factors as marital well-being and healthy family
relations are rarely assessed. Finally, few studies in I-O psychology or management tap spousal or
children’s perceptions of work-family conflict and employee or parental behaviors. As a result, we
know little about how family-friendly policies actually affect families. A multilevel perspective in
defining success suggests that we should try to learn more about all of these issues.
Although family-domain outcomes are studied far less frequently than work-related outcomes, psychological research has not completely neglected them (Major & Cleveland, 2007). Numerous empir-
ical studies demonstrate a negative relationship between work-family conflict and life satisfaction
(e.g., Adams, King, & King, 1996; Netemeyer et al., 1996); the results of two meta-analyses (Allen,
Herst, Bruck, & Sutton, 2000; Kossek & Ozeki, 1998) reinforce this conclusion. The results are
similar for work-family conflict and marital functioning and/or satisfaction (e.g., Duxbury, Higgins,
& Thomas, 1996; Netemeyer et al., 1996) and family satisfaction (e.g., Parasuraman, Purohit,
Godshalk, & Beutell, 1996). Yet again, often this research taps only the perceptions of the employed
worker and does not collect information from spouses or children.
Children are virtually absent from I-O research on the work-family interface (Major & Cleveland,
2007), and when they are included, it is typically as demographic control variables (i.e., number of children, age of youngest child) in studies of an employed parent’s family demands (see Rothausen, 1999, for a review). With few exceptions (e.g., Barling, Dupre, & Hepburn, 1998), children’s outcomes are seldom considered in I-O work-family research. Moreover, I-O research lacks a rich treatment of how children and other family variables influence employee behavior (cf. Eby,
Casper, Lockwood, Bordeaux, & Brinley, 2005) or, importantly, how workplace characteristics
and the employment/parental behaviors of both working parents influence the well-being and work attitudes of their children. Further, current measures of success are deficient in that they lack consideration of children’s well-being. If we think about the family as one of the important sets of stakeholders
in defining what we mean by success, we will be more likely to consider the reciprocal effects of
work and family in deciding whether our HR systems (e.g., personnel selection) are indeed leading
to better decisions.
In the traditional model of success, in which the ultimate criterion is entirely focused on what is
good (often in the short term) for the organization, including measures of work-family conflict in
evaluations of careers, organizations, etc., would probably be dismissed as criterion contamination.
If we recognize that the worlds of work and nonwork are inextricably intertwined, we are likely to
reach a very different conclusion; that is, that the failure to include variables such as work-family
conflict in our definitions of success has led to conceptions of the ultimate criterion that are them-
selves deficient.
(see Chapter 25, this volume, for a discussion of work-related health, stress, and safety). Rather, just
as many organizations collect and monitor various safety criteria at the organizational level (e.g.,
accident rates), an organization can monitor at an aggregate level the work and nonwork health
of the organization. To ensure privacy for employees, information on nonwork issues such as marital health and family relationships can be collected at the group or organizational level of analysis, not from individual employees. However, information on work performance using task and
citizenship behaviors can be collected at individual and aggregated levels. Further, it is important
that organizations tap not only perceptions of individual employees, coworkers, supervisors, and so
forth, but also the perceptions of employees’ partners/spouses and children. Just as 360° performance
feedback programs have gained some popularity in management circles (Bracken, Timmreck, &
Church, 2001), organizations should also receive feedback from nonwork sources (Shellenbarger,
2002). Using a type of “family 360°” may provide useful feedback to the employee.
Adopting a broader, more heterogeneous conceptualization of worker success would have impor-
tant implications for the way we evaluate the validity and adequacy of our criteria and for the
conclusions we reach about the validity and value of many of the systems psychologists develop
for organizations. A broader concept of success may have considerable appeal for employees and
their families and could even be thought of as a competitive advantage for organizations (i.e., orga-
nizations that think more broadly about defining success may be better positioned to recruit and
retain particular employees) and enhance the sustainability of the organization. Perhaps one basis
for worker dissatisfaction with performance appraisal is that what employees value as success is
not reflected in the organization’s evaluation process. Taking a broader perspective may also provide the organization with a strategic advantage in the public eye. In addition, organizations
would gain essential insight into potential HR challenges facing working families, insight that can provide the basis for innovative and effective interventions. Not only would I-O psychologists and managers have more measures with which to tap success, but they would also have more sources of performance
information. Finally, using a multilevel orientation to tap multisource information, we plausibly can
begin to (a) link our HR systems with business strategy (as discussed in Chapter 9, this volume) and
(b) develop selection tools that predict in situ performance and more fully reflect individual success
and well-being as well as organizational sustainability.
CONCLUSIONS
The way we go about predicting and understanding success in organizations (and designing per-
sonnel selection systems that will maximize success) depends largely on how we define success.
Researchers and practitioners increasingly question the adequacy of traditional indicators of success, such as job performance, promotions, salary, job title, and organizational level.
These are all important and relevant, but success almost certainly should be defined more broadly
and comprehensively. As the permeability of the boundaries between work and nonwork domains
increases in the 21st century, our definition of what it means to the organization, the individual, and
the broader society to be a success or a failure in the workplace is likely to change.
We have argued in this chapter that criteria such as marital and family well-being are of legiti-
mate concern to responsible organizations and are part of the ultimate criterion. The wealth of
evidence shows that employees place family as their number one priority (Lewis & Cooper, 1999)
and that employees’ work demands regularly interfere with their ability to meet family demands; to a lesser degree, there is also some evidence that employees’ family demands interfere with their ability to carry out work demands (cf. Greenhaus & Parasuraman, 1999). Business strategies
that emphasize promoting long-term sustainability and concern with the construct of in situ perfor-
mance (Cascio & Aguinis, 2008) will necessarily be concerned with determining how work and
nonwork domains affect one another and with how nonwork criteria such as family well being are
likely to influence the viability of organizations. The literature includes constant calls for aligning
HR practices with business strategy (Chapter 9, this volume) to promote the long-term benefit of
organizations, and it is likely that understanding the effects of work and organizational demands on
the quality of nonwork life will be an important factor in building and sustaining healthy organiza-
tions. Our current criteria for success (and theories of performance) are arguably deficient because
we ignore the facets and structures of work that affect nonwork areas of our lives.
For example, suppose that a new performance management system led employees to book more
hours at work but also led to increased stress at home. It might be reasonable to ask whether the
organization should consider its new system a success or a failure. It may not be easy to determine
the best balance between the positive and negative effects of this system, but it seems reasonable to
at least ask the question of how interventions that have what seem like beneficial effects at one level
of analysis might have negative effects at other levels. Our current narrow focus on what is good for
the organization may lead us to miss the effects of what happens in the workplace on any number
of domains other than work.
What happens at work does not always stay at work; the workplace affects our nonwork lives,
and our nonwork lives affect the workplace. It is important to more fully appreciate the reciprocal
relationships between work and nonwork and to recognize the larger developmental and cultural
context in which work behaviors unfold. Including nonwork factors in our evaluations of careers,
jobs, and organizations is not a source of criterion contamination. Rather, failure to consider these
factors in defining success should be thought of as a source of criterion deficiency. There are many
challenges in determining what to measure, how to measure it, and how to use that information, but
the case seems clear—we need to take a broader (and richer) approach to defining performance and
success for individuals and organizations.
REFERENCES
Abraham, J. D., & Hansson, R. O. (1995). Successful aging at work: An applied study of selection, optimiza-
tion, and compensation through impression management. Journal of Gerontology, 50, 94–103.
Adams, G. A., King, L. A., & King, D. W. (1996). Relationships of job and family involvement, family social
support, and work-family conflict with job and life satisfaction. Journal of Applied Psychology, 81,
411–420.
Allen, T. D., Herst, D. E., Bruck, C. S., & Sutton, M. (2000). Consequences associated with work-to-family
conflict: A review and agenda for future research. Journal of Occupational Health Psychology, 5,
278–308.
Aryee, S. (1992). Antecedents and outcomes of work-family conflict among married professional women:
Evidence from Singapore. Human Relations, 45, 813–837.
Astin, A. W. (1964). Criterion-centered research. Educational and Psychological Measurement, 24, 807–822.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77,
836–874.
Bailyn, L. (1993). Breaking the mold: Women, men and time in the new corporate world. New York, NY: Free
Press.
Baltes, P. B., & Baltes, M. M. (1990). Psychological perspectives on successful aging: The model of selective
optimization with compensation. In P. B. Baltes & M. M. Baltes (Eds.), Successful aging: Perspectives
from the behavioral sciences (pp. 1–33). Cambridge, England: Cambridge University Press.
Barling, J., Dupre, K. E., & Hepburn, C. G. (1998). Effects of parents’ job insecurity on children’s work beliefs
and attitudes. Journal of Applied Psychology, 83, 112–118.
Bingham, W. V. (1926). Measures of occupational success. Harvard Business Review, 5, 1–10.
Borman, W. C. (1993). Job behavior, performance, and effectiveness. In M. D. Dunnette & L. M. Hough (Eds.),
Handbook of industrial and organizational psychology (Vol. 2, pp. 271–326). Palo Alto, CA: Consulting
Psychologists Press.
Borman, W., & Motowidlo, S. (1993). Expanding the criterion domain to include elements of contextual
performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98).
San Francisco, CA: Jossey-Bass.
Borman, W. C., Penner, L. A., Allen, T. D., & Motowidlo, S. J. (2001). Personality predictors of citizenship
performance. International Journal of Selection and Assessment, 9, 52–69.
Bracken, D., Timmreck, C., & Church, A. (2001). The handbook of multisource feedback: The comprehensive
resource for designing and implementing MSF processes. San Francisco, CA: Jossey-Bass.
Burchell, B., Felstead, A., & Green, F. (1997, September). The age of the worried workers: Extent, pattern
and determinants of insecurity in Britain over the last decade. Paper presented at the 12th Annual
Employment Research Unit Conference, Cardiff, Wales.
Bureau of Labor Statistics. (1997). Workers are on the job more hours over the course of a year (issues in Labor
Statistics, Summary 97-3). Washington, DC: U.S. Department of Labor.
Bureau of Labor Statistics. (1999, April). Household data. Washington, DC: U.S. Department of Labor.
Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology.
In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology, (2nd ed.,
Vol. 1, pp. 687–732). Palo Alto, CA: Consulting Psychologists Press.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., & Weick, K. (1970). Managerial behavior, performance and
effectiveness. New York, NY: McGraw-Hill.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt &
W. C. Borman (Eds.), Personnel selection in organizations (pp. 35–70). San Francisco, CA: Jossey-
Bass.
Cascio, W. F. (2000). Managing human resources: Productivity, quality of work life and profits. New York, NY:
McGraw-Hill.
Cascio, W., & Aguinis, H. (2008). Staffing twenty-first-century organizations. In J. P. Walsh & A. P. Brief
(Eds.). Academy of management annals (pp. 133–165). Mahwah, NJ: Lawrence Erlbaum.
Cleveland, J. N. (2005). What is success? Who defines it? Perspectives on the criterion problems as it relates
to work and family. In E. E. Kossek & S. J. Lambert (Eds.), Work and life integration: Organizational,
cultural and individual perspectives (pp 319–346). Mahwah, NJ: Lawrence Erlbaum.
DeNisi, A.S. (2000). Performance appraisal and performance management: A multilevel analysis. In K. J. Klein
& S. J. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations,
extensions, and new directions (pp 121–156). San Francisco, CA: Jossey-Bass.
Duxbury, L. E., Higgins, C. A., & Thomas, D. R. (1996). Work and family environments and the adoption of
computer-supported supplemental work-at-home. Journal of Vocational Behavior, 49, 1–23.
Eagle, B. W., Miles, E. W., & Icenogle, M. L. (1997). Interrole conflicts and the permeability of work and fam-
ily domains: Are there gender differences? Journal of Vocational Behavior, 50, 168–184.
Eby, L. T., Casper, W. J., Lockwood, A., Bordeaux, C., & Brinley, A. (2005). A twenty-year retrospective on
work and family research in IO/OB: A content analysis and review of the literature. [Monograph] Journal
of Vocational Behavior, 66, 124–197.
Featherman, D. L. (1992). Development of reserves for adaptation to old age: Personal and societal agendas. In
E. Cutler, D. W. Gregg, & M. P. Lawton (Eds.), Aging, money, and life satisfaction: Aspects of financial
gerontology (pp. 135–168). New York, NY: Springer.
Flanagan, J. C. (1954). The critical incidents technique. Psychological Bulletin, 51, 327–358.
Friedman, J. H., & Greenhaus, S. D. (2000). Work and family—Allies or enemies? What happens when business
professionals confront life choices. New York, NY: Oxford.
Frone, M. R. (2000). Work-family conflict and employee psychiatric disorders: The national comorbidity survey.
Journal of Applied Psychology, 85, 888–895.
Frone, M. R., Russell, M., & Cooper, M. L. (1997). Relation of work-family conflict to health outcomes: A four-
year longitudinal study of employed parents. Journal of Occupational & Organizational Psychology, 70,
325–335.
Greenhaus, J. H., Collins, K. M., Singh, R., & Parasuraman, S. (1997). Work and family influences on departure
from public accounting. Journal of Vocational Behavior, 50, 249–270.
Greenhaus, J. H., & Parasuraman, S. (1999). Research on work, family and gender: Current status and future direc-
tions. In G. N. Powell (Ed.), Handbook of gender and work (pp. 391–412). Thousand Oaks, CA: Sage.
Griffin, B., & Hesketh, B. (2003). Adaptable behaviours for successful work and career adjustment. Australian
Journal of Psychology, 55, 65–73.
Gutek, B.A., Searle, S., & Klepa, L. (1991). Rational versus gender role explanations for work-family conflict.
Journal of Applied Psychology, 76, 560–568.
Hammer, L. B., Allen, E., & Grigsby, T. D. (1997). Work-family conflict in dual-earner couples: Within-individual
and crossover effects of work and family. Journal of Vocational Behavior, 50, 185–203.
Hansson, R. O., DeKoekkoek, P. D., Neece, W. M., & Patterson, D. W. (1997). Successful aging at work: Annual
Review, 1992–1996: The older worker and transitions to retirement. Journal of Vocational Behavior, 51,
202–233.
Hesketh, B., & Neal, A (1999). Technology and performance. In D. R. Ilgen & E. D. Pulakos (Eds.), The
changing nature of performance: Implication for staffing, motivation, and development (pp. 21–55). San
Francisco, CA: Jossey-Bass.
566 Handbook of Employee Selection
Hofmann, D. A. & Tetrick, L. E. (2003). The etiology of the concept of health: Implications for “organizing”
individual and organizational health. In D. A. Hofmann & L. E. Tetrick (Eds.), Health and safety in orga-
nizations: A multilevel perspective (pp. 1–26). San Francisco, CA: Jossey-Bass.
Johnson, J. W. (2003). Toward a better understanding of the relationship between personality and individual
job performance. In M. R. Barrick & A. M. Ryan (Eds.), Personality and work: Reconsidering the role of
personality in organizations (pp. 83–120). San Francisco, CA: Jossey-Bass.
Karasek, R. A. & Theorell. T. (1990). Healthy work: Stress, productivity, and the reconstruction of working life.
New York, NY: Basic Books.
Kossek, E. E., & Ozeki, C. (1998). Work-family conflict, policies, and the job-life satisfaction relationship:
A review and directions for organizational behavior-human resources research. Journal of Applied
Psychology, 83, 139–149.
Lewis, S., & Cooper, C. L. (1999). The work-family research agenda in changing contexts. Journal of
Occupational Health Psychology, 4, 382–393.
Lewis, S., Smithson, J., Brannen, J., Das Dores Guerreiro, M., Kugelberg, C., Nilsen, A., & O’Connor, P.
(1998). Futures on hold: Young Europeans talk about combining work and family. London, England:
Work-Life Research Centre.
Lyness, K. S., & Thompson, D. E. (1997). Above the glass ceiling? A comparison of matched samples of
female and male executives. Journal of Applied Psychology, 82, 359–375.
Major, D. A. & Cleveland, J. N. (2005). Psychological perspectives on the work-family interface. In S. Bianchi, L. Casper,
& R. King (Eds.), Work, family, health and well-being (pp. 169–186). Mahwah, NJ: Lawrence Erlbaum.
Major, D. A. & Cleveland, J. N. (2007). Reducing work-family conflict through the application of industrial/
organizational psychology. International Review of Industrial and Organizational Psychology, 22,
111–140.
Murphy, K. R. (1998). In search of success: Everyone’s criterion problem. SIOP Presidential Address at the
Annual Conference for Industrial and Organizational Psychology, Dallas, TX.
Murphy, K. R. (2010). How a broader definition of the criterion domain changes our thinking about adverse
impact. In J. Outtz (Ed.), Adverse impact (pp. 137–160). San Francisco: Jossey-Bass.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational and
goal-oriented perspectives. Newbury Park, CA: Sage.
Nagle, B. F. (1953). Criterion development. Personnel Psychology, 6, 271–289.
Neal, A., & Hesketh, B. (2002). Productivity in organizations. In N. Anderson, D. S. Ones, H. K. Sinangil, &
C. Viswesvaran, (Eds.), Handbook of industrial, work and organizational psychology, Vol. 2:
Organizational psychology (pp. 7–24). Thousand Oaks, CA: Sage.
Netemeyer, R. G., Boles, J. S., & McMurrian, R. (1996). Development and validation of work-family conflict
and family-work conflict scales. Journal of Applied Psychology, 81, 400–410.
Organ, D. W. (1988). Organizational citizenship behavior. Lexington, MA: D.C. Health.
Parasuraman, S., Purohit, Y. S., Godshalk, V. M., & Beutell, N. J. (1996). Work and family variables, entrepre-
neurial career success and psychological well-being. Journal of Vocational Behavior, 48, 275–300.
Parker. S. K., Turner, N., & Griffin, M. A. (2003) Designing healthy work. In D. A. Hofmann & L. E. Tetrick
(Eds.), Health and safety in organizations: A multilevel perspective (pp. 91–130). San Francisco, CA:
Jossey-Bass.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000) Adaptability in the workplace:
Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Reynolds, J. R. (1997). The effects of industrial employment conditions on job related distress. Journal of
Health and Social Behaviour, 38, 105–116.
Robinson, S. L., & Bennett, R. J. (1995). A typology of deviant workplace behaviors: A multidimensional scaling
study. Academy of Management Journal, 38, 555–572.
Rothausen, T. J. (1999). “Family” in organizational research: A review and comparison of definitions and
measures. Journal of Organizational Behavior, 20, 817–836.
Sackett, P. R., & Wanek, J. E, (1996). New development in the use of measures of honesty, integrity, conscien-
tiousness, dependability, trustworthiness and reliability for personnel selection. Personnel Psychology,
49, 787–830.
Senge, P. M., Smith, B., Kruschwitz, N., Laur, J., & Schley, S. (2008). The necessary revolution: How individuals
and organizations are working to create a sustainable world. New York, NY: Random House.
Shellenbarger, S. (2002). Executive dad asks family for a 360 review. Wall Street Journal. Retrieved June 12,
2002, from http://www.careerjournal.com
Simpson, R. (1998). Organisational restructuring and presenteeism: The impact of long hours on the working
lives of managers in the UK. Management Research News, 21, 19.
Criterion Validity and Criterion Deficiency 567
Smith, C. A., Organ, D. W., & Near, J. P. (1983). Organizational citizenship behavior: Its nature and antecedents.
Journal of Applied Psychology, 68, 655–663.
Smolensky, E., & Gootman, J. A. (2003). Working families and growing kids: Caring for children and adoles-
cents. Washington, DC: The National Academies Press.
Thorndike, R. L. (1949). Personnel selection: Test and measurement techniques. New York, NY: Wiley.
Toops, H. A. (1944). The criterion. Educational and Psychological Measurement, 4, 271–297.
Viswesvaran, C., & Ones, D. S. (2000). Perspectives on models of job performance. International Journal of
Selection and Assessment, 8, 216–226.
Warr, P. B. (1987). Work, unemployment and mental health. Oxford, England: Clarendon Press.
Warr, P. B. (1994a). A conceptual framework for the study of work and mental health. Work and Stress, 8,
84–97.
Warr, P. (1994b). Age and employment. In H. Triandis, M. Dunnette, & L. Hough (Eds.), Handbook of industrial
and organizational psychology (2nd ed., Vol. 4, pp. 485–550). Palo Alto, CA: Consulting Psychologists
Press.
Warr, P. (2005). Work, well-being and mental health. In J. Barling, E. K. Kelloway, & M. R. Frone (Eds.).
Handbook of work stress. (pp. 547–574). Thousand Oaks, CA: Sage.
Warr, P. B. (1997). Age, work and mental health. In K. W. Schaie & C. Schoder (Eds.), The impact of work on
older adults (pp. 252–296). New York, NY: Springer.
Warr, P. B. (1999). Well-being in the workplace. In D. Kahneman, E. Diener & N. Schwarz (Eds.), Well-being:
The foundation of hedonic psychology (pp. 392–412). New York, NY: Russell Sage Foundation.
Part 6
Legal and Ethical Issues in
Employee Selection
P. Richard Jeanneret and Paul R. Sackett,
Section Editors
27 Ethics of Employee Selection
Joel Lefkowitz and Rodney L. Lowman
Each of the chapters in this handbook focuses on determinants of how the organizational human
resource (HR) practice of employee selection can be done well. That is, the contents are aimed at
providing the guidance needed to develop selection and promotion procedures that are accurate,
valid, and useful for organizations. In this chapter we suggest another standard. In addition to doing
selection well, we add a concern for doing it right. Hence, added to the technical and procedural
knowledge and empirical criteria that guide employee selection, this chapter emphasizes the nor-
mative or moral standards associated with notions of the good, right, fair, or just. We suggest that
doing selection well (i.e., technical competence) is inextricably bound up with doing it right. This
approach also opens to reflection the implicit values and moral justification underlying the practice
itself, in addition to considering the manner in which its constituent activities are implemented. In
other words, the ethics of employee selection are as relevant as the ethics in employee selection.
SOME META-ISSUES
THE INEXTRICABLE MIX OF COMPETENCE, ETHICS, JUDGMENT, AND VALUES
In the selection enterprise, industrial-organizational (I-O) psychologists work at the intersection of no
fewer than four domains that ostensibly are conceptually distinct but which have ambiguous, uncer-
tain, and probably overlapping boundaries. We make decisions that reflect simultaneously varying
aspects and degrees of (a) technical competence, (b) ethical considerations, and (c) differences in
professional judgment. Moreover, interpretations and conclusions regarding the substantive matters
at hand also reflect (d) the individual I-O psychologist’s views regarding such questions as “Whose
interests matter?”, “Who is to benefit?”, or “What is the right thing to do?”, as well as other personal
beliefs, attitudes, assumptions, and social values. For example, the choices and decisions made to
estimate population parameters from single-sample or mean validity coefficients involve the gener-
ally unrecognized melding of technical, normative, and values issues in which it may be difficult to
disentangle one’s professional judgment from one’s personal preferences. One sometimes encounters
parameter estimates that are based on national rather than local labor pool measures of predictor
variability in the applicant population or that use low archival estimates of the reliability of supervi-
sor criterion ratings when actual reliability data may be accessible. And perhaps most dramatically,
the economic utility of selection tests based on the prediction of individual-level subjective criteria
(supervisor ratings) may be extrapolated to estimates of organizational-level firm financial perfor-
mance in the absence of data justifying such causal inferences, particularly at that level of analysis.1
1 A recent meta-analysis of the between-organizations effects of “high-performance work practices” (HPWPs) on organizational-
level performance uncovered just 15 studies that investigated the impact of selection (Combs, Liu, Hall, & Ketchen, 2006).
They yielded a mean validity coefficient of only .11 (.14 corrected for measurement error). Moreover, most such studies
have used “postdictive” designs in which a claim that the HPWP has had a causal influence on the organizational outcomes
is not warranted (Wright, Gardner, Moynihan, & Allen, 2005). Perhaps more importantly, we are not aware of any within-
organization studies documenting the effect of selection systems on overall financial performance of the firm.
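The sensitivity of such estimates to the analyst's choices is easy to demonstrate with the standard correction of an observed validity coefficient for unreliability in the criterion (dividing the observed correlation by the square root of the criterion's reliability). A brief sketch follows; the figures and the function name are hypothetical, chosen only for illustration:

```python
import math

def corrected_validity(r_xy, criterion_reliability):
    # Standard correction for attenuation due to criterion unreliability:
    # divide the observed predictor-criterion correlation by the square
    # root of the criterion reliability.
    return r_xy / math.sqrt(criterion_reliability)

observed_r = 0.25  # hypothetical observed validity coefficient

# Choosing a low archival reliability estimate for supervisor ratings
# (about .52 is an often-cited inter-rater value) rather than a higher,
# locally measured value (say, .70) inflates the corrected estimate:
print(round(corrected_validity(observed_r, 0.52), 3))  # 0.347
print(round(corrected_validity(observed_r, 0.70), 3))  # 0.299
```

The lower the reliability one assumes for the criterion, the larger the corrected coefficient, which is why the choice between archival and locally obtained reliability data is a values-laden decision and not a purely technical one.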
The point of the illustration is not to denigrate the attempt to better understand the validity
and utility of selection systems but to point out the underlying nature of the estimation procedure
and our critique. They both inextricably entail decisions reflective of not only technical knowl-
edge and experience, but also ethical considerations of appropriateness and professional judgment.
Moreover, all of those actions are shaped in the context of motives that reflect personal, societal,
and/or professional interests and values. Is it coincidental that the effect of each of the choices
and common practices mentioned above is to maximally increase the numeric value of estimated
validity and utility? As noted below, one of the customary “gut checks” for those who consciously
wrestle with ethical dilemmas is to look with suspicion on one’s tendency to opt for a solution that
just happens to be self-serving.
Those who have given some thought to the matter have identified the values structure of I-O
psychology as representing primarily managerial or corporate interests; historically, even at times
to the extent of having an anti-labor bias (Baritz, 1960; Katzell & Austin, 1992; Lefkowitz, 1990,
2003, 2004, 2005, 2008; Zickar, 2001). The viewpoint that informs this chapter differs in that we
are greatly influenced by three values positions that are at variance with such a unitary perspective.
We will have more to say about them later, but we highlight them briefly here so that the reader
may be clear about our values position and how it may agree with or differ from the reader’s own.
First is the universalist principle in moral philosophy that suggests that no one’s interests warrant
a priori preference over anyone else’s—although there may be factual reasons in specific instances
that justify doing so (Rachels, 1993; Singer, 1995). Second, and commensurate with universalism,
is the normative version of the prominent business ethics model of multiple stakeholder manage-
ment (Freeman, 1984; Freeman & Phillips, 2002), which asserts that it is right and just that powerful
organizations that have enormous impact on society should recognize the legitimacy of the interests
of those affected by it. (The instrumental version of the model holds that an organization’s actual
success is dependent on how well it manages its relationships with all its key stakeholder groups.)
And third, complementing the first two, is the so-called “professional ideal” (Kimball, 1992) or
“professional model” (Hall, 1975), which asserts that the power and authority granted by society to
a profession, such as I-O psychology, entail reciprocal responsibilities of the profession extending
beyond its direct clients to the society at-large.
With respect to the practice of employee selection, there are at least ten discernible groups of
people who have a stake, directly or indirectly, in the process and/or its outcomes. They include
qualified job candidates who are recognized as such by the selection system and thus hired; quali-
fied candidates who are misidentified by the system and rejected; unqualified candidates correctly
identified and so not hired; unqualified candidates who are misidentified and hired; coworkers of
the successful candidates, and other employees, whose own work is in some way impacted by them;
their direct supervisors, whose own success may be dependent on the work performance of the new-
hires; higher-level supervisors and managers of superordinate work units whose success also may
be contingent on the performance of the newcomers; the owners or shareholders of the company,
whose investments depend on the overall performance of the organization; the company’s clients or
customers, who purchase the goods or services produced by it; and the local community from which
the job applicants are drawn, which may be affected in various ways by the company’s actions and
success. The nature of their interests or “stake” in the selection system differs for many of them, as
does the extent of its impact on them. But they all potentially have some legitimate claim to have
their interests considered.
Most organizational research is not conducted for the benefit of those who participate in it as subjects (Lefkowitz, 2007a,b). This is not to deny that research participants may
ultimately benefit from the application of research findings (contingent on positive study outcomes)
through the development of new drug treatments, more effective teaching strategies, more reward-
ing and satisfying jobs, or by not being placed in an ill-fitting job. But most of the applied research
conducted by I-O psychologists is driven by the intentions of senior organizational policy-makers in
the service of organizational objectives or by the theoretical interests, curiosity, or ambitions of the
researcher. For example, the development and validation of employee selection methods is generally
not aimed explicitly at benefiting members of the validation sample(s). Whether the participants are
current employees or job applicants, the results are applied to subsequent groups of job candidates
to improve organizational effectiveness. Admittedly, however, those selected may benefit indirectly
by having more competent coworkers.
Because research participants are often, in this sense, used by us for testing and validation
research in which they may have no personal interest in the outcome, we are ethically duty-bound
to seriously consider issues such as the voluntary nature of their participation in the study, the
extent of potential subjects’ obligation to participate, obtaining their participation through use of
various inducements or implicit coercion, providing informed consent, examinees’ rights to access
their own test data, and providing feedback. And when a testing program becomes operational,
additional matters arise, including whether to provide an opportunity for retesting rejected appli-
cants and the confidentiality and use of assessment data from incumbents. In the United States,
researchers’ obligations in some of these areas are incorporated in regulations promulgated by
the Federal Office for Human Research Protection (OHRP) of the U.S. Department of Health and
Human Services (OHRP, 1991).2
However, the circumstances under which I-O psychologists conduct employee selection are generally
recognized to provide us with somewhat greater ethical latitude. For example,
(a) informed consent for testing is ordinarily not required of educational or employment applicants
because they are deemed to have implicitly given consent by virtue of having applied (American
Educational Research Association, American Psychological Association, National Council on
Measurement in Education Standards for Educational and Psychological Testing, 1999 [hereafter,
APA Standards, 1999], Standard 8.4; American Psychological Association, Ethical Principles of
Psychologists and Code of Conduct, 2002 [hereafter, APA Code], Ethical Standard 9.03[a]); (b)
job applicants may acceptably be asked to waive (or be assumed to have waived) access to their
test results and so not receive any feedback (APA Standards, 1999, Standards 8.9 and 11.6; APA Code
2002, Ethical Standard 9.10); and (c) providing an opportunity for job candidates to be retested is
not obligatory (APA Standards, 1999, Standard 12.10). On the other hand, some ethical require-
ments are viewed as virtually universal, even in the operational employment setting, such as safe-
guarding to the extent feasible the confidentiality of test data and protecting against their misuse
and informing employees or applicants beforehand of any limitations on confidentiality (APA Code,
Ethical Standards 1.01, 3.11, 4.01, 4.02, 4.05, 9.04).
Moreover, we argue there are some very good reasons why we ought not always avail ourselves
of all the legitimate ethical exceptions that have been ceded to the practice of I-O psychology and
should instead behave as if the more stringent ethical research environment pertained. As one of us
has noted previously:
[There is a potential conflict between the principle that] all research participation should be explicitly
voluntary and the existence of a relatively open-ended implicit obligation of workers to participate in
legitimate organizational research. Notwithstanding the implied obligation, adherence to the moral
principle of respect for persons requires that we treat research participation as genuinely voluntary and
volitional to avoid even the semblance of coercion. (Lefkowitz, 2003, p. 336)
2 The regulations pertain to all research with human participants, whether supported by government funds or not. Research
is defined as "a systematic investigation, including testing and evaluation, designed to develop or contribute to
generalizable knowledge" (§46.102[d]). Special note should be taken that "contribut[ing] to generalizable knowledge" is
often operationalized as seeking to publish or otherwise make public the findings of the study (such as at a professional
conference). The regulations are available at http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm#46.102
To the moral arguments regarding respect for the worth and dignity of all people, we add some
pragmatic considerations. To function effectively in an organization, I-O psychologists depend on
the good will and cooperation of organization members, which in turn depend on the psychologist’s
reputation and the general reputation of the profession. Treating people cavalierly by taking their
cooperation for granted is likely to produce adverse effects on future interactions—including those
initiated by other I-O researchers and practitioners in that organization. In other words, it is in our
own self-interest and in the interests of the profession to always treat those with whom we work
honestly and with deference. A good principle to follow stems from the suggestion of the social
psychologist Robert Rosenthal (1994), which is that we ought to think of our potential participants
as a “granting agency” to which we must apply for necessary resources to implement our proposed
investigations. Obviously, most of the selection work we do could not be accomplished without the
input and cooperation of applicants, employees, and/or other subject matter experts (SMEs).
The universalist principle holds that the interests of all those affected by an action are to be considered equal with respect to moral judgments regard-
ing the action, unless good and reasonable arguments to the contrary can be made.
The best-known reflection of moral universalism in the field of business ethics and the social
responsibility of business institutions is the normative version of the multiple stakeholder perspec-
tive (Freeman, 1984; Freeman & Phillips, 2002). Instrumental stakeholder theory is merely descrip-
tive. Normative stakeholder models are prescriptive and stem from recognition of the enormous
power, size, and widespread societal impact of corporations. From those observations it is con-
cluded that they have an obligation to take into account the interests of the many constituencies that
are impacted by their actions and with whom they may be thought of as having implicit social con-
tracts. I-O psychologists are probably ahead of our colleagues in other subfields of psychology who
apparently are only now anticipating the likelihood of finding themselves “increasingly drawn into
situations where a multitude of social and political interests apply across hierarchies of individuals
to whom we owe various degrees of professional duties” (Koocher, 2007, p. 381). But Koocher’s
advice in such situations is well taken. Among other things, he recommended:
By focusing on the welfare and best interests of the most vulnerable party in the chain of individuals
to whom one owes a legitimate professional duty, psychologists optimize the likelihood of a good
outcome. At times, the most vulnerable party may be the public at large. (p. 381)
All of this suggests that in evaluating a selection system we ought to consider not only its effec-
tiveness but, from an ethical perspective, its impact on all of those affected. We always have an
instrumental concern for the extent of productivity improvement our client company or employer
can expect from cohorts of job applicants hired on the basis of validated predictor measures. That
is, in the language of selection classification, we anticipate significantly increasing the proportion
of those selected who are successful on the job (“true positives”) in relation to those hired who are
unsuccessful (“false positives”). But we should also be concerned about the proportion of incorrectly
rejected applicants who have been denied employment (“false negatives”) because of the imperfect
validity of those predictors. In other words, enhancing the interests of the organization adversely
and arguably unfairly impacts a substantial proportion of the applicant population. In addition, cus-
tomary attempts to further increase productivity improvement by means of more restrictive hiring
(decreasing the selection ratio) can generally be expected to exacerbate the harm by increasing the
proportion of false negatives among those rejected—to a greater extent than the decrease in the pro-
portion of false positives. The structure of this situation is that of a classic ethical dilemma—actions
that benefit some directly hurt innocent others—yet to our knowledge it has never been addressed
seriously in the literature of I-O psychology. We surmise that the reason is that I-O psychology, at
least in the context of employee selection, tends to view the organization itself as the primary (or
only) relevant stakeholder. The interests of applicants, especially rejected applicants who are not
and will not be members of the organization, are generally not considered.
But that is an inappropriately narrow view for professionals to hold. Professions are character-
ized by attributes that distinguish them from other occupations (Haber, 1991), and among the more
salient of those attributes is a sense of responsibility and obligation that extends beyond the paying
client to segments of the broader society. This is generally thought to constitute a quid pro quo for
the considerable amount of power, influence, and respect afforded by society to professions and their
members. This broader perspective has been referred to as a “true professional ideal” (Kimball,
1992, p. 303) or “the professional model” (Hall, 1975, p. 72). In sum, the universalist tradition in
moral philosophy; the multiple stakeholder approach from the study of management, business, and
society; and the professional model from the sociological study of occupations all coalesce around
the notion that ethical evaluations of our selection programs require that their impact on all of those
affected by them be considered. This could mean assuring that rejected candidates are afforded an
opportunity for retesting and perhaps even, when circumstances and budgets allow, utilizing rela-
tively low cut-scores and relying on probationary employment as a selection device.
ETHICAL PRINCIPLES
Although space constraints preclude delving into the origins of the ethical principles presented here,
it should be noted that they emerge from a long history of moral philosophy and more recent work
in moral psychology (see Lefkowitz, 2003, for a review). They are reflected in various normative
ethical theories that are generally either deontological or consequentialist in nature. Deontological
theories are concerned with right and wrong per se; they hold that the rightness or wrongness of
an action is intrinsic to the nature of the act on the basis of whether it violates a moral principle.
Deontologists are concerned with principled expressions of rights, duties, responsibilities, virtue,
fairness, and justice. Some deontological principles are expressed positively in terms of affirmative
duties (e.g., treat job applicants with respect; protect the confidentiality of test data), but many are
expressed negatively in terms of actions that are disallowed versus what it is permissible to do (e.g.,
do not exaggerate or make “hyperclaims” to organization decision-makers about the likely benefits
of your proposed selection system). If one of us fails to protect the confidentiality of employee test
data, the deontologist will view that person as having wronged those individuals even if there is no
evidence of their actually having been harmed. Note, however, that harm may have been caused to
the reputation of the profession.
Consequentialists, on the other hand, define right and wrong in terms of the good and bad (or
benefits and harms) that will result from an action. The relative morality of alternative actions is
directly reflected in the net amounts of goodness that can be expected to result from each. The
option that leads to the greatest amount of net good or the least amount of net harm (considering
all those to be impacted) is the morally imperative choice, not merely a permissible one, as with
deontological approaches. Neither system of thought is free from legitimate criticism by proponents
of the other perspective, and some situations seem more amenable to analysis by one strategy than
by the other, so the prudent professional should be familiar with both approaches to ethical
analysis and decision-making.
Respect
People have the right to be treated with dignity and respect and allowed to exercise their rights to
privacy, confidentiality, freedom, autonomy, and self-expression. These rights are universalizable
(i.e., applicable as much to anyone else as to oneself) and bounded by reciprocal obligations. For
example, our right to pursue our research objectives should not supersede an employee’s right to
not participate in the study. With regard to the principle of respect, psychologists are obligated in
particular to be aware of and to eliminate the effects of prejudice and bias related to “age, gender,
gender identity, race, ethnicity, culture, national origin, religion, sexual orientation, disability, lan-
guage, and socioeconomic status” (APA Code, 2002, Principle E).
and burdens associated with membership in a social system (e.g., organization or nation). However,
alternative criteria of fairness represent social and moral values positions and so are influenced
greatly by macro-level political and economic systems (e.g., the marked preference in the American
free-enterprise system for the distributive justice criterion of equity, or “merit” as opposed to equal-
ity or need).
On the face of it, equality appears to be a straightforward and uncomplicated standard of fairness.
If a metaphorical apple pie represents the sum benefits and rewards provided by a social system such
as a work organization or an entire society, and there are ten members of the system, it would seem a
simple matter to cut ten equal pieces. But even this simple criterion may be difficult to implement if
there are procedural and/or technological difficulties in dividing the pie into exactly equal portions.
For example, (a) some resources, such as money, may be easily divided, whereas others are less so (e.g., there may
be fewer promotions available than the number of eligible and deserving employees); (b) it may be
difficult to equate portions if the benefits consist of multiple nonfungible rewards such as an apple
pie and a steak; moreover (c), because equality does not consider likely differences in people’s pref-
erences for different benefits, numerical equality may not correspond with perceived equality.
Now if we add the further need to specify members’ inputs or contributions to the social
system—as we must to achieve the criterion of equity—things get really complicated. Equity has
to do with the equivalence of people's ratios of inputs to outcomes or, as it is frequently expressed,
“benefits and burdens” or “earned rewards,” and this is the normative hallmark of fairness in the
United States. All of the potential administrative and measurement problems of achieving equality
are multiplied by having also to consider people’s contributions to the system and their relation-
ships to rewards, in an attempt to reach equity! Suppose that some of our ten members each spent
a couple of hours in the kitchen baking that apple pie and grilling that steak, whereas some spent
the past year tending to the orchard that produced the apples and raising the cow that yielded the
steak, whereas a couple of people are merely members of the family that owns the farm. How are
their contributions to be assessed? And we have not yet considered the criterion of need. What if all
of the above “facts” are true, but in addition we know that some of the ten members have been and
continue to be well-fed, some have not eaten since yesterday, and some are near-starvation? Why
should we not recognize need as the relevant criterion of fairness? And if we were to do so, would it
matter why some of them are so hungry? Lastly, the importance of the criterion chosen is suggested
by the observation that there is no reason to suppose any substantial correlation among the three—
meaning that the rank ordering of people in terms of benefits received will likely be different under
equality, equity, and need.
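The observation that the three criteria will rank people differently can be made concrete with a small numerical sketch (the members, contributions, and needs below are invented for illustration):

```python
# Hypothetical illustration: a fixed pool of benefits divided among three
# members under the three distributive justice criteria discussed above.
members = ["A", "B", "C"]
contributions = {"A": 8, "B": 1, "C": 3}  # inputs to the system (arbitrary units)
needs = {"A": 1, "B": 6, "C": 3}          # degree of need (arbitrary units)
total_benefit = 12.0

def allocate(weights):
    """Divide total_benefit in proportion to the given weights."""
    s = sum(weights.values())
    return {m: total_benefit * weights[m] / s for m in members}

equality = {m: total_benefit / len(members) for m in members}  # equal shares
equity = allocate(contributions)  # shares proportional to inputs
need = allocate(needs)            # shares proportional to need

def ranking(alloc):
    """Members ordered from largest share to smallest."""
    return sorted(members, key=lambda m: -alloc[m])

print(ranking(equity))  # ['A', 'C', 'B']: the largest contributor comes first
print(ranking(need))    # ['B', 'C', 'A']: the neediest member comes first
```

Under equality every share is identical; under equity the order is A, C, B; under need it reverses to B, C, A: the same people and the same pie, but three different distributions.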
Caring: Beneficence
The origins of this principle are in consequentialist theory and the “ethics of care” in moral psychol-
ogy. It is reflected in the traditional service ideal of the professions: “Psychologists are committed
to increasing scientific and professional knowledge of behavior … and to the use of such knowledge
to improve the condition of individuals, organizations, and society” (APA Code, 2002, p. 1062), and
“providers of I-O psychological services are guided primarily by the principle of promoting human
welfare” (APA, 1981, p. 668). Although this principle is generally interpreted within the context of
the universalist meta-principle that the interests of all those concerned are to be considered equal, it
is also generally recognized that most of us justifiably care more for some people than for others,
and/or there may be some to whom we owe a special obligation or duty (e.g., family, friends, neigh-
bors, colleagues, clients, employer). Therefore, it is usually not viewed as unethical per se to act on
those special concerns and obligations. However, there may be occasions when such actions slide
impermissibly far down the slippery slope of favoritism, prejudice, bias, or nepotism.
Caring: Nonmaleficence
This principle—refraining from unjustifiably harming others—is the moral principle about
which there seems to be the greatest consensus among ethicists. It is not in all respects simply the
opposite of beneficence. In contrast to the equivocation regarding the extensiveness of the domain
of beneficence as a function of our social identity and role relationships, the obligation to not cause
unjustifiable harm is generally thought to apply equally to all others, even strangers. It is espe-
cially pertinent with regard to those who are in potentially vulnerable positions (e.g., employees,
students, job candidates, research participants). The primacy of nonmaleficence is indicated in the
APA Code (2002):
When conflicts occur among psychologists’ obligations or concerns, they attempt to resolve these
conflicts in a responsible fashion that avoids or minimizes harm … and [they] guard against personal,
financial, social, organizational, or political factors that might lead to misuse of their influence.
(p. 1062)
Moral Character
Many ethical treatises and professional codes of conduct include discussions of personal attributes
having to do with the character of the person potentially faced with a moral dilemma rather than
with the process of his or her ethical decision-making. For example, the APA's (2002) ethical code
is explicit about the importance of fidelity and responsibility (to those with whom we work, to
our communities, and to society) and of integrity (accuracy, honesty, truthfulness, and promise-
keeping) (Principles B and C, respectively, p. 1062).
In recent years technological advances (e.g., web-based job applications and employment test-
ing), changes in the nature and conditions of work (e.g., home-based work, increased use of teams),
and other dramatic changes such as the globalization of organizations have impacted the way in
which ethical problems in employee selection (and other professional domains) are manifested.
Notwithstanding those changes in the manifest circumstances of contemporary work life, the
importance of the five sets of fundamental moral principles noted above is reflected in the following
observation:
The paradigmatic forms taken by those problems, the character traits and motives needed to recognize
them as such, the ethical reasoning used to address them, as well as the substance of the ethical
principles on which such reasoning is based are all essentially unaffected and still pertain. (Lefkowitz,
2006, p. 245)
Suppose you are an internal consultant in the management development section of the HR depart-
ment of a large corporation and you are beginning to train a group of high-level managers to serve as
assessors for a planned assessment center. While you are leading an evaluation discussion following a mock
exercise engaged in by actors playing hypothetical promotion candidates, a senior executive—the
apparent informal leader of the group—makes a demeaning sexist remark about the one female
candidate being discussed, and all of the other managers laugh appreciatively. Responding appro-
priately and ethically may require an abundance of courage and social skill.
Paradigm IV. Values Conflict: Facing Conflicting and Relatively Equally Important
Personal Values So That Expressing One Entails Denying the Other(s)
At a somewhat macro-level, this is the battlefield on which conflicts play out between the objectives
of shareholder value and corporate financial performance (CFP) on one side versus the putative
societal obligations of business reflected in the corporation’s “social performance” (CSP; Lefkowitz,
2007c). At the level of specific HR systems such as selection, it gets reflected in the attempt to bal-
ance ostensibly competing objectives such as increasing economic utility and decreasing adverse
impact on minorities (De Corte, Lievens, & Sackett, 2007). It is on a note of optimism that we point
out that the most recent accumulations of evidence suggest that CFP and CSP may be entirely com-
patible or even complementary (Guenster, Derwall, Bauer, & Koedijk, 2005; Orlitzky, Schmidt, &
Rynes, 2003).
None of the four paradigms explicitly mention matters of technical competence. Is competence
an ethical issue? In fact, the APA Code (2002) contains six enforceable standards in the section
headed “Competence”; for example, “Psychologists’ work is based upon established scientific and
professional knowledge of the discipline” (Standard 2.04), and “Psychologists undertake ongoing
efforts to develop and maintain their competence” (Standard 2.03). Suppose an I-O psychologist
conducts a well-done validation study for a client and reports a statistically significant validity coef-
ficient for a set of predictors but fails to report several other nonsignificant coefficients with respect
to other relevant criteria investigated. If the psychologist is ignorant of the stricture against exag-
gerating validation findings by capitalizing on chance relationships, he/she is not competent. Now,
what if the psychologist was not ignorant of these psychometric matters, but had struggled with
temptation and ultimately, and with some misgivings, omitted the negative findings out of concern
for disappointing the client? Or worse still, in a third variant, suppose he/she freely and deceitfully
chose to distort the nature of the findings to justify having guaranteed favorable results in advance?
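The psychometric point at issue, that reporting only the best of several observed coefficients capitalizes on chance, can be demonstrated with a small simulation (invented for illustration, not drawn from the chapter; the sample size, number of predictors, and replication count are arbitrary):

```python
# A hedged simulation of "capitalizing on chance": when k predictors are
# truly unrelated to the criterion, the largest of the k observed validity
# coefficients is still biased well above zero, which is why reporting only
# the best coefficient exaggerates validity.
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n, k, trials = 60, 10, 200  # sample size, predictors, replications (arbitrary)
max_rs = []
for _ in range(trials):
    criterion = [random.gauss(0, 1) for _ in range(n)]
    rs = [pearson_r([random.gauss(0, 1) for _ in range(n)], criterion)
          for _ in range(k)]
    max_rs.append(max(rs))

mean_max_r = sum(max_rs) / trials
print(round(mean_max_r, 2))  # well above zero, though every true validity is zero
```

The same arithmetic is why cross-validation or corrections for multiple comparisons are standard practice whenever several predictor-criterion relationships are examined.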
Ethical analyses invariably include a focus on the “bottom line” of an action taken or actions
contemplated (i.e., the consequences of the act(s) on all those affected). Each of these three sce-
narios represents an ethical transgression because of the potential harm to be caused to the client and
job applicants by using an ineffective selection system and ultimate harm to the reputation of the
profession when the ineffectiveness becomes apparent. However, the motives of the psychologist
are rather different in each scenario, and in ethical analyses it is also true that motives matter. So,
in what way do they matter? In each of the scenarios the psychologist is portrayed as increasingly
venal: from merely inexcusably ignorant and failing to live up to one’s professional obligations,
to defensively self-protective and disrespectful of a client’s rights, to premeditatedly self-serving
and deceitful. Two observations are warranted. First, these different motives—implying different
moral characters—make little or no difference in terms of the consequences of the incomplete data
reporting. That is why “mere” incompetence is an ethical matter. Second, it is likely that the reader
feels somewhat differently about each of our three hypothetical transgressors—perhaps feels that
their degree of venality is related directly to their degree of culpability. That may, depending on
circumstances, appropriately lead to differences in what we view as the suitable degree of penalty
or opprobrium for each of our three transgressors for the “same” offense.
I swear … that I will fulfill according to my ability and judgment this oath and this covenant:
I will apply dietetic measures for the benefit of the sick according to my ability and judgment; I will
keep them from harm and injustice.
Ethics of Employee Selection 581
I will neither give a deadly drug to anybody who asked for it, nor will I make a suggestion to this
effect.
I will not use the knife … but will withdraw in favor of such men as are engaged in this work.
Whatever houses I may visit, I will come for the benefit of the sick, remaining free of all intentional
injustice, of all mischief . . . .
What I may see or hear in the course of the treatment or even outside of the treatment in regard to
the life of men, which on no account one must spread abroad, I will keep to myself, holding such things
shameful to be spoken about . . . . (p. 6)
Note that the Hippocratic Oath is not particularly abstractly philosophical. It does not emphasize
moral principles underlying its ethical admonitions, despite having been created in a golden era of
philosophical and ethical thought. It imposed on those taking the oath very specific obliga-
tions to behave in a certain way and not to behave in other ways. Some of its tenets are readily
interpretable in terms of modern professional ethical standards and others would be irrelevant or
considered inappropriate in today’s world. Another aspect of the oath that is relevant to contempo-
rary I-O psychologists is, despite its pragmatic orientation, its explicit recognition of the broader
professional and moral context within which the specific obligations are taken on. The doctor is
not only to be technically competent (providing good dietary recommendations, not performing
surgery), but is to be pure, prevent harm and injustice, and protect confidentiality.
Some professions favor a narrow and explicit approach to ethical practice—i.e., if it is not explic-
itly prohibited (e.g., by an ethics code) then it is not unethical. (An extension of this approach is
the view that any action that is not clearly illegal is morally permissible.) Others see the need for
a broader more proactive approach in which moral principles and the values to which they give
rise deserve expression, even when no specific ethical “violation” is identified. In its enforceable
Standards, the APA Code (2002) is an example of the former pragmatic approach; in its General
Principles, it exemplifies the latter. It is worth remembering that for more than the first half-century of
its existence, the APA had no ethics code at all. This case of apparent arrested development reflects
the historical growth of a field that for the early part of its existence was not as concerned with the
practice of psychology as with the establishment of the field as a science. Only with the burgeoning
growth of psychological practice around the time of World War II did the need for a formal code of
ethics for psychology become more intensely apparent.
The initial code of ethics for psychologists emerged from an empirical rather than an a priori
theoretical or philosophical base (cf. Pope & Vetter, 1992). Members of the association were polled
about incidents they had encountered that raised ethical issues, and from those data, an initial code
of ethics was written. The field of I-O psychology is relatively new as an applied area of training and
practice (cf. Lowman, Kantor, & Perloff, 2006). As a result, until the 2002 revision, the code
included little to suggest that it was written with applied practice in I-O psychology in mind.
A partial exception was the area of testing, a domain that is included in many types of applied
practice and so has long been a focus of the various editions of the code. However, more attention
is paid in the code to issues associated with test construction and with applications in individual
assessment contexts than explicitly to the mass testing often associated with employee selection.
However, the 2002 code did take modest steps to address how the ethics principles and standards
applied to I-O and organizational consulting psychology. The code added several references to
consulting and psychological work in organizations and included “organizational clients” in most
categories of service. For example, ethics Standard 3.11 explicitly concerns psychological services
delivered to or through organizations. It clarifies the issues involved in working with individuals
versus organizations and the responsibilities of psychologists to the individual organization mem-
bers with whom they work when they are not themselves the defined client.
of the services, (2) the intended recipients, (3) which of the individuals are clients, (4) the relationship
the psychologist will have with each person and the organization, (5) the probable uses of services
provided and information obtained, (6) who will have access to the information, and (7) limits of
confidentiality. As soon as feasible, they provide information about the results and conclusions of such
services to appropriate persons.
(b) If psychologists will be precluded by law or by organizational roles from providing such information
to particular individuals or groups, they so inform those individuals or groups at the outset of the
service. (APA Code, 2002)
The entire ninth standard, which covers assessment, has direct applicability to most employee
selection practice. It encompasses both individual assessments and those done in the context of groups,
such as applicant selection, and it indicates that consent may ethically be implied.
This section of the code also identifies the requirements for test construction, the handling of out-
dated assessments, and feedback on test results. It also addresses the release of information about
tests, psychologists' obligations concerning test security, and situations involving obsolete tests.
However, it can be argued that the code says very little per se about the common situation in
which psychologists who administer large testing programs in industry or government work for
non-psychologists, and the decisions about testing programs are made by persons with little psycho-
logical training. A section of the code does cover such situations generically.
1.02 Conflicts Between Ethics and Law, Regulations, or Other Governing Legal Authority
If psychologists’ ethical responsibilities conflict with law, regulations, or other governing legal authority,
psychologists make known their commitment to the Ethics Code and take steps to resolve the conflict.
If the conflict is unresolvable via such means, psychologists may adhere to the requirements of the law,
regulations, or other governing legal authority. (APA Code, 2002)
There is also the following ethics standard which imposes an ethical obligation to take appropri-
ate action in response to misuse of one’s work:
3 The authors are grateful for input on this section from Robert Hogan, Joel Moses, George C. Thornton III, John G. Veres
III, and Michael J. Zickar.
Professional Competence
As noted earlier, competence in the conduct of one’s profession is an important ethical issue for
many reasons (cf. Ethical Standards 2.01–2.06, APA Code, 2002). The issue of competence of
course overlaps that of validity, but it also requires that psychologists base their practice on mastery
of the relevant technical knowledge base associated with their area of psychology; the consequences
of inappropriate practice by psychologists in this area can be considerable. As indicated by the
following code:
Test Security
Psychologists are mandated by Ethical Standard 9.11 to maintain test security, viz.:
In the case of a psychologist administering intelligence tests in the context of a private practice
or school system, the issues of maintaining test security may be straightforward. However, in
today’s employee selection context, tests may be administered to hundreds of thousands of appli-
cants, tests may be administered electronically with no oversight of the test-taking circumstances,
and a team of psychologists and nonpsychologists may help to create and validate a test with little
direct control by a psychologist of the security of the process. Although the psychologist’s ethi-
cal mandate to protect test security is conceptually clear, the practical realities of corporate and
government testing contexts are often far more complicated and difficult than the code may have
contemplated.
Assessment Centers
Assessment centers (ACs) seem to have attributes that attract ethical challenges. Their notable
early and well-publicized success has led to a faddish proliferation beyond the resources of those
actually trained and skilled in their development, implementation, and administration. (Refer to
Chapter 37, this volume, for a description of the seminal AT&T Management Progress Study.)
For example, Caldwell, Thornton III, and Gruys (2003) have itemized ten “classic errors” in this
area of practice, most of which have ethical implications (poor planning, shoddy exercise develop-
ment, no pretesting of exercises, using unqualified assessors, etc.). Anecdotally, experienced
AC practitioners also have observed problems such as promotion by consultants of the utility of
their AC entirely on the basis of the general research literature, which has little to do with the
consultants’ specific proposed procedures; use of unprofessional assessors who receive virtually
no training in behavioral observation, rating, and evaluation and who may be unfamiliar with the
particular exercises used; subsequent pressure on AC staff or consultants to use the data from an
earlier developmental AC for personnel actions (e.g., retention decisions during a reduction-in-
force, sometimes exacerbated by how old the data are); widely disseminating individual assessment
data in the organization for various unintended purposes; and using generic exercises that have
little or no demonstrable relationship to the target jobs. In addition, various administrative gaffes
have been noted, such as failing to maintain the security of measures for days 2 and 3 of a multi-day
AC; allowing nonassessors (e.g., senior managers) to influence assessment evaluations; and fail-
ing to provide appropriate feedback to candidates or providing inappropriate feedback such as
implying organizational actions to be taken (“you clearly have the talent to be promoted soon”),
etc. All of these matters (and others, such as “rights of the participant”) are discussed carefully
in Guidelines and Ethical Considerations for Assessment Center Operations (International Task
Force on Assessment Center Guidelines, 2000).
1. The potential difficulties one might encounter in this regard in implementing a selection
study are likely to be as much or more influenced by the history of union/management rela-
tions in the organization as by the attributes of the proposed program, so that one should
become familiar with that history.
2. The parameters of the project may be set or limited by terms of the collective bargaining
agreement, so one also needs to be knowledgeable about that.
3. Even if it were legal (by virtue of recognized management prerogative in the union con-
tract) to implement a selection program for union members without obtaining prior union
agreement, the prudent I-O psychologist (i.e., one who would like the project to succeed)
would be well-advised to proceed as if such approval were necessary—and to not proceed
until some acceptable arrangement was achieved, preferably even actively involving the
union in the project from its outset.
4 See Chapter 8, this volume, concerning technology and employee selection for a more comprehensive presentation.
4. Because the topic of unions tends to be a volatile one on which people hold strong opin-
ions, and because most I-O psychologists tend to view themselves as representatives of
management, it is advisable to consider the extent to which one’s personal values in that
regard might conflict with a more universalistic set of professional values, such as respect
for all relevant stakeholders, including union members.
5 See Chapter 7, this volume, on the use of test scores for a more thorough treatment of this issue.
settings are not psychologists and may not understand the ethical constraints on psychologists or the
complexities of standards associated with employee selection research. For example, the adequacy
of sample sizes for establishing reliable validity coefficients may seem like an academic concern to
an impatient manager eager to get on with implementation. Psychologists do have an ethical obliga-
tion in such circumstances, but, in recognition of the salience of organizational realities, it is not
absolute.
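The manager's impatience notwithstanding, sample size adequacy is not merely academic. A rough sketch (the observed r = .30 and the sample sizes are invented for illustration) of the sampling uncertainty around a validity coefficient, using the standard Fisher z approximation, makes the point:

```python
# Approximate 95% confidence interval around an observed validity
# coefficient of r = .30, via the Fisher z transformation.
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation using Fisher's z."""
    z = math.atanh(r)              # Fisher transform of r
    se = 1 / math.sqrt(n - 3)      # standard error of transformed r
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (30, 100, 500):
    lo, hi = r_confidence_interval(0.30, n)
    print(f"n = {n:3d}: 95% CI for r = .30 is ({lo:.2f}, {hi:.2f})")
```

At n = 30 the interval runs roughly from -.07 to .60 and so cannot even rule out zero validity; only with several hundred cases does the estimate become reasonably stable.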
CONCLUSIONS
It was not our intention to present an exhaustive catalogue of potential ethical problems in
selection—nor, in all likelihood, would it have been possible if we had intended to do so. But it is
our hope that the reader comes away with an enhanced understanding of the following:
• The nature of moral and ethical issues and the general forms in which they may be
manifested
• How ethical issues differ and can be differentiated from other sorts of problems
• The complexity of moral dilemmas (i.e., their interconnectedness with matters of technical
competence, professional judgment, and personal values)
• How ethical problems may arise in the realm of employee selection
• Some ways to judge the relative appropriateness of alternative solutions to ethical dilemmas
• The guidance offered by and limitations of both the more specific, but narrow, formal-
istic approach to ethical compliance represented by specifically prescribed and (mostly)
proscribed actions that are contained in codes of conduct, and the more general approach
that emphasizes the expression of well-established moral principles and the values they
engender
• Some of the typical ethical issues potentially associated with various facets of employee
selection in organizations, including contextual and organizational influences as well as
those particularly associated with specific measures and/or procedures
Following all of that descriptive and analytic input, it seems most appropriate, if not necessary, to
conclude by focusing on application and solution, that is, what to do. On one hand, general ethical
principles and written sources such as APA’s Code of Conduct and the Society for Industrial and
Organizational Psychology (SIOP)’s casebook (Lowman, 2006) are readily available but may not
explicitly include one’s particular problem(s). On the other hand, specific potential ethical issues—
even within a limited domain such as employee selection—are innumerable, not entirely predict-
able, and so cannot all be itemized a priori. The best we can hope to do, aside from noting some
particularly common examples as we have done, is to present a general scheme emphasizing preven-
tion (cf. Pryzwansky & Wendt, 1987), that is, highlighting the importance of trying to anticipate and
prevent problems before they arise. Largely on the basis of the work of Canter, Bennett, Jones, and
Nagy (1994), as well as Pryor (1989), we offer the following six-step plan.
(1990), the Society for Human Resource Management (1990), the International Task Force on
Assessment Center Guidelines (2000), and other relevant organizations. Gaining familiarity with
them can help one to avoid blundering into ethical indiscretions because of sheer ignorance, which
is important because “lack of awareness or misunderstanding of an ethical standard is not itself a
defense to a charge of unethical conduct” (APA Code, 2002, p. 1061). Indispensable sources of pro-
fessional information include the APA Standards (1999), SIOP Principles (2003), and knowledge
concerning how tests are often misused (Moreland, Eyde, Robertson, Primoff, & Most, 1995).
2. BE FAMILIAR WITH RELEVANT FEDERAL, STATE, AND LOCAL LAWS AND REGULATIONS
These pertain to rules regarding conducting research with human participants (OHRP, 1991), one’s
particular (U.S.) state laws regulating the licensing of psychologists, and federal and state laws
governing employment practices such as the Civil Rights Acts of 1964 and 1991, the Americans
With Disabilities Act of 1990, the Age Discrimination in Employment Act of 1967, the Equal Pay
Act of 1963, and the Uniform Guidelines on Employee Selection Procedures (Equal Employment
Opportunity Commission, Civil Service Commission, U.S. Department of Labor, and U.S.
Department of Justice, 1978). (Refer to Chapters 28, 29 and 30, this volume, for in-depth treatments
of these professional and legal standards.)
This admonition is obviously not entirely necessary for you, the reader. Attending courses, work-
shops, and professional conference presentations and seminars; subscribing to journals; and reading
books that focus on ethical, professional, and technical matters are some of the means of keeping
abreast of new technical developments and honing one’s ethical sensitivities and decision-making
skills. Conferring with colleagues is often indispensable—most especially when one is in the throes
of an uncomfortable ethical dilemma or sees the potential for one developing. In addition to our
national association, SIOP, the regularly published newsletters of several local organizations of
applied psychologists have proven to be consistently reliable sources of information, including the
Metropolitan New York Association for Applied Psychology (Metro); the Personnel Testing Council
of Metropolitan Washington, DC (PTC/MW); and the Personnel Testing Council of Southern
California (PTC/SC).6
6 Information may be obtained from the following websites: http://www.siop.org, http://www.metroapppsych.com, http://
www.ptcmw.org, and http://www.ipmaac.org/ptcsc
REFERENCES
Academy of Management. (2002). Academy of Management code of ethical conduct. Academy of Management
Journal, 45, 291–294.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
American Psychological Association. (1981). Specialty guidelines for the delivery of services by I/O psycholo-
gists. American Psychologist, 36, 664–669.
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. American
Psychologist, 57, 1060–1073.
Baritz, L. (1960). The servants of power: A history of social science in American industry. Westport, CT:
Greenwood.
Caldwell, C., Thornton, G. C., III, & Gruys, M. L. (2003). Ten classic assessment center errors: Challenges to
selection validity. Public Personnel Management, 32, 73–88.
Canadian Psychological Association. (2000). Canadian code of ethics for psychologists (3rd ed.). Ottawa,
Ontario: Author.
Canter, M. B., Bennett, B. E., Jones, S. E., & Nagy, T. F. (1994). Ethics for psychologists: A commentary on the
APA ethics code. Washington, DC: American Psychological Association.
Combs, J., Liu, Y., Hall, A., & Ketchen, D. (2006). How much do high performance work practices matter?
A meta-analysis of their effects on organizational performance. Personnel Psychology, 59(3), 501–528.
Cooper, T. L. (Ed.). (2001). Handbook of administrative ethics. New York, NY: Marcel Dekker.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
590 Handbook of Employee Selection
Edelstein, L. (1967). The Hippocratic Oath: Text, translation and interpretation. In O. Temkin & C. L. Temkin
(Eds.), Ludwig Edelstein. Ancient medicine: Selected papers of Ludwig Edelstein (pp. 3–64). Baltimore,
MD: Johns Hopkins University Press.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department
of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43(166),
38290–38315.
Fletcher, C. (1992). Ethical issues in the selection interview. Journal of Business Ethics, 11, 361–367.
Freeman, R. E. (1984). Strategic management: A stakeholder approach. Boston, MA: Pitman.
Freeman, R. E., & Phillips, R. A. (2002). Stakeholder theory: A libertarian defense. Business Ethics Quarterly,
12, 331–349.
Gellermann, W., Frankel, M. S., & Ladenson, R. F. (1990). Values and ethics in organization and human sys-
tems development: Responding to dilemmas in professional life. San Francisco, CA: Jossey-Bass.
Green, B. F. (1996). Setting performance standards: Content, goals, and individual differences. The second
annual William H. Angoff Memorial Lecture. Princeton, NJ: Educational Testing Service.
Guenster, N., Derwall, J., Bauer, R., & Koedijk, K. (2005, July). The economic value of corporate eco-effi-
ciency. Paper presented at the Academy of Management Conference. Honolulu, HI.
Haber, S. (1991). The quest for authority and honor in the American professions, 1750–1900. Chicago, IL:
University of Chicago Press.
Hall, R. T. (1975). Occupations and the social structure (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hausknecht, J. R., Halpert, J. A., DiPaolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection: A
meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology,
92, 373–385.
International Personnel Management Association. (1990). IPMA code of ethics. Alexandria, VA: Author.
International Task Force on Assessment Center Guidelines. (2000). Guidelines and ethical considerations for
assessment center operations. Pittsburgh, PA: Development Dimensions International.
Jeanneret, P. R. (1998). Ethical, legal, and professional issues for individual assessment. In R. Jeanneret &
R. Silzer (Eds.). Individual psychological assessment: Predicting behavior in organizational settings (pp.
88–131). San Francisco, CA: Jossey-Bass.
Katzell, R. A., & Austin, J. T. (1992). From then to now: The development of industrial-organizational psychol-
ogy in the United States. Journal of Applied Psychology, 77, 803–835.
Kimball, B. A. (1992). The “true professional ideal” in America. Cambridge, MA: Blackwell.
Koocher, G. P. (2007). APA Presidential Address. Twenty-first century ethical challenges for psychology.
American Psychologist, 62, 375–384.
Koocher, G., & Keith-Spiegel, P. (1998). Ethics in psychology: Professional standards and cases (2nd ed.).
New York, NY: Oxford University Press.
Lefkowitz, J. (1990). The scientist-practitioner model is not enough. The Industrial-Organizational Psychologist,
28(1), 47–52.
Lefkowitz, J. (2003). Ethics and values in industrial-organizational psychology. Mahwah, NJ: Lawrence
Erlbaum.
Lefkowitz, J. (2004). Contemporary cases of corporate corruption: Any relevance for I-O psychology? The
Industrial-Organizational Psychologist, 42(2), 21–29.
Lefkowitz, J. (2005). The values of industrial-organizational psychology: Who are we? The Industrial-
Organizational Psychologist, 43(2), 13–20.
Lefkowitz, J. (2006). The constancy of ethics amidst the changing world of work. Human Resource Management
Review, 16, 245–268.
Lefkowitz, J. (2007a). Ethics in industrial-organizational psychology research. In S. Rogelberg (Ed.), The ency-
clopedia of industrial and organizational psychology (Vol. 1, pp. 218–222). Thousand Oaks, CA: Sage.
Lefkowitz, J. (2007b). Ethics in industrial-organizational psychology practice. In S. Rogelberg (Ed.), The
encyclopedia of industrial and organizational psychology (Vol. 1, pp. 215–218). Thousand Oaks, CA:
Sage.
Lefkowitz, J. (2007c). Corporate social responsibility. In S. Rogelberg (Ed.), The encyclopedia of industrial
and organizational psychology (Vol. 1, pp. 114–118). Thousand Oaks, CA: Sage.
Lefkowitz, J. (2008). In order to prosper the field of organizational psychology should … expand its values to
match the quality of its ethics (Special issue). Journal of Organizational Behavior, 29, 439–453.
Lowman, R. L. (1991). The clinical practice of career assessment: Interests, abilities, and personality.
Washington, DC: American Psychological Association.
Lowman, R. L. (Ed.). (2006). The ethical practice of psychology in organizations (2nd ed.). Washington, DC:
American Psychological Association/Society for Industrial-Organizational Psychology.
Lowman, R. L., Kantor, J., & Perloff, R. (2006). History of I-O psychology educational programs in the United
States. In L. L. Koppes (Ed.), Historical perspectives in industrial and organizational psychology (pp.
111–137). Mahwah, NJ: Lawrence Erlbaum.
Mills, C. N., & Melican, G. J. (1987). A preliminary investigation of three compromise methods for establishing
cut-off scores (Report No. RR-87-14). Princeton, NJ: Educational Testing Service.
Moreland, K. L., Eyde, L. D., Robertson, G. J., Primoff, E. S., & Most, R. B. (1995). Assessment of test user
qualifications: A research-based measurement procedure. American Psychologist, 50, 14–23.
Office for Human Research Protections, Department of Health and Human Services (1991). Protection of
human subjects. Code of Federal Regulations, Title 45, Public Welfare. June 18.
Orlitzky, M., Schmidt, F. L., & Rynes, S. L. (2003). Corporate social and financial performance: A meta-
analysis. Organization Studies, 24, 403–441.
Pope, K. S., & Vetter, V. A. (1992). Ethical dilemmas encountered by members of the American Psychological
Association: A national survey. American Psychologist, 47, 397–411.
Potosky, D., & Bobko, P. (2004). Selection testing via the Internet: Practical considerations and exploratory
empirical findings. Personnel Psychology, 57, 1003–1034.
Prien, E. P., Schippmann, J. S., & Prien, K. O. (2003). Individual assessment as practiced in industry and con-
sulting. Mahwah, NJ: Lawrence Erlbaum.
Pryor, R. G. L. (1989). Conflicting responsibilities: A case study of an ethical dilemma for psychologists work-
ing in organisations. Australian Psychologist, 24, 293–305.
Pryzwansky, W. B., & Wendt, R. N. (1999). Professional and ethical issues in psychology: Foundations of
practice. New York, NY: W.W. Norton & Co.
Rachels, J. (1993). The elements of moral philosophy (2nd ed.). New York, NY: McGraw-Hill.
Rest, J. R. (1994). Background: Theory and research. In J. R. Rest & D. Narvaez (Eds.). Moral development in
the professions (pp. 1–26). Hillsdale, NJ: Lawrence Erlbaum.
Rosenthal, R. (1994). Science and ethics in conducting, analyzing and reporting psychological research.
Psychological Science, 5, 127–134.
Samuelson, P. A. (1993). Altruism as a problem involving group versus individual selection in economics and
biology. American Economic Review, 83, 143–148.
Society for Human Resource Management. (1990). Code of ethics. Alexandria, VA: Author.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006).
Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.
Wittmer, D. P. (2001). Ethical decision-making. In T. L. Cooper (Ed.), Handbook of administrative ethics (2nd
ed., pp. 481–507). New York, NY: Marcel Dekker.
Wright, P. M., Gardner, T. M., Moynihan, L. M., & Allen, M. R. (2005). The relationship between HR practices
and firm performance: Examining causal order. Personnel Psychology, 58, 409–446.
Zickar, M. J. (2001). Using personality inventories to identify thugs and agitators: Applied psychology’s con-
tribution to the war against labor. Journal of Vocational Behavior, 59, 149–164.
28 Professional Guidelines/Standards
INTRODUCTION 1
There are three sources of authoritative information and guidance that can be relied upon in the
development, validation, and implementation of an employment selection procedure. The sources
are the Standards for Educational and Psychological Testing (1999; Standards), the Principles
for the Validation and Use of Personnel Selection Procedures (2003; Principles), and the Uniform
Guidelines on Employee Selection Procedures (1978; Uniform Guidelines). The term selection pro-
cedure in this instance should be interpreted broadly to be inclusive of any process or information
used in personnel decision-making. Selection procedures would include (but not be limited to) all
forms and types of tests (e.g., cognitive, personality, work samples, and assessment centers), job
performance appraisals, and measures of potential. These procedures may be administered, scored,
and interpreted as paper-and-pencil or computer-based instruments and/or by individuals internal
or external to the organization. This broad view is consistent with the interpretations expressed by
the authoritative sources. Oftentimes the term “test” is used in one of the sources. For the purposes
of this chapter, a test is synonymous with a selection procedure.
1 This chapter reproduces with some modifications the major part of “Professional and Technical Authorities and Guide-
lines” by P. Richard Jeanneret, which is found in Landy, F. J. (2005). Employment discrimination litigation: Behavioral,
quantitative and legal perspectives. San Francisco, CA: John Wiley & Sons.
counseling, program evaluation, and publishing that involve psychological instruments and
measurements. This chapter does not address these applications. The focus is strictly on orga-
nizational settings and employment related selection decisions.
Guidelines: (a) validity generalization, (b) utility analysis, (c) differential prediction, and
(d) validity requirements and their documentation. Similarly, SIOP believed the Uniform
Guidelines was incorrect with respect to requiring fairness studies, the definition of
construct validity, and how validity generalization and utility analyses were considered
(Camara, 1996). Nevertheless, the EEOC and the Office of Federal Contract Compliance
Programs (OFCCP) currently rely on the Uniform Guidelines to determine whether or
not a selection procedure is discriminatory.
5. For those who are involved in the judicial process (particularly judges and lawyers) the
authoritative sources are alternative reference sources to case law and other judicial
writings. The three sources have been relied upon by experts in the fields of personnel,
industrial, organizational, and measurement psychology when formulating opinions about
selection procedures. In such instances, the authoritative sources have become benchmarks
that help define sound professional practice in the employment setting. Unfortunately the
apparent use of the three sources is rather limited as indicated by the judicial interviews in
Chapter 15 of Landy (2005).
APPLICATION
The 1999 Standards is applicable to the entire domain of educational and psychological measure-
ment. In fact, there is one chapter devoted completely to educational testing and assessment, as well
as one on testing in employment and credentialing. Because the Standards provides a comprehen-
sive wealth of information on psychological measurement, it is not possible to adequately discuss all
of the standards in their entirety. Consequently, this review will focus on just those components of
the Standards that are most applicable to psychometric issues in employment selection.
VALIDITY DEFINED
A key term that will be appearing throughout this chapter and other chapters is “validity” or one
of its derivatives (e.g., validation process). In several respects the Standards has established the
most current thinking regarding validity and provides the definition that should be accepted by all
professionals concerned with the psychometrics of a selection procedure.
According to the Standards:
Validity refers to the degree to which evidence and theory support the interpretations of test scores
entailed by proposed uses of tests. Validity is, therefore, the most fundamental consideration in
developing and evaluating tests. The process of validation involves accumulating evidence to provide a
sound scientific basis for the proposed score interpretation. It is the interpretation of test scores required
by the proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in
more than one way, each intended interpretation must be validated. (p. 9)
Validity is a unitary concept and can be considered an argument based on scientific evidence
that supports the intended interpretation of a selection procedure score (Binning & Barrett, 1989;
Cronbach & Meehl, 1955; McDonald, 1999; Messick, 1980, 1989; Wainer & Braun, 1988). There are
24 specific standards regarding validity incorporated in the 1999 document.
Generally speaking, if a test does not have validity for a particular purpose, it also will not
have utility. Utility is an estimate of the gain in productivity or other practical value that might
be achieved by use of a selection procedure. There are several measures used to estimate utility,
including increases in job proficiency, reduced accidents, savings in employee costs, reduction in
turnover, training success, and others (Cascio, 2000; Cronbach & Gleser, 1965; Hunter & Hunter,
1984; Naylor & Shine, 1965; Schmidt, Hunter, McKenzie, & Muldrow, 1979). Consequently, it
will be an unusual situation whereby an organization would want to use a test that lacked validity
and utility. Furthermore, if a test lacks validity it is possible that some unintended consequence
might result from its use. Thus, reliance on test scores that are not valid will not yield results
intended by the selection process and may in fact yield outcomes that are detrimental to the
organization.
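The utility logic described above is often formalized with the Brogden-Cronbach-Gleser model cited in this paragraph (Cronbach & Gleser, 1965; Schmidt et al., 1979). The following is a minimal sketch, not a definitive implementation; the function name and all figures are hypothetical:

```python
def utility_gain(n_hired, validity, sd_y, mean_z_hired,
                 cost_per_applicant, n_applicants):
    """Brogden-Cronbach-Gleser estimate of the dollar gain from selection.

    n_hired            -- number of people selected
    validity           -- criterion-related validity of the procedure (r)
    sd_y               -- standard deviation of job performance in dollars
    mean_z_hired       -- mean standardized predictor score of those hired
    cost_per_applicant -- testing cost per applicant
    n_applicants       -- number of applicants tested
    """
    return (n_hired * validity * sd_y * mean_z_hired
            - cost_per_applicant * n_applicants)

# Illustrative: 50 hires from 500 applicants, r = .40, SDy = $10,000
gain = utility_gain(50, 0.40, 10_000, 1.0, 50, 500)
```

Note that the estimate shrinks toward a net loss as validity approaches zero, which is the chapter's point that a test lacking validity will generally also lack utility.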
categorize some standards as “primary” and others as “secondary” as occurred in earlier versions.
An entire chapter that incorporates 12 standards is focused on fairness; another chapter devotes
13 standards to the rights and responsibilities of test-takers; and still another chapter is focused on
testing individuals with diverse linguistic backgrounds (11 standards). This level of description is
oftentimes less precise in the Principles and Uniform Guidelines.
1. Evaluation of a selection procedure is not just a matter of checking off (or not) one standard after the other to determine compliance. Rather, the evaluation process must consider
(a) professional judgment, (b) satisfaction of the intent of a relevant standard, (c) alternate
selection procedures that are readily available, and (d) feasibility of complying with the
standard given past experience and research knowledge.
2. The Standards offers guidance to the expert in a legal proceeding, but professional judg-
ment determines the relevance of a standard to the situation.
3. Blanket statements about conformance with the Standards should not be made without supporting evidence; in general, care should be exercised in any assertions about compliance with the Standards.
4. Research is ongoing and knowledge in the field will continue to change. Accordingly, the
Standards will be revised over time.
5. The Standards is not intended to mandate use of specific methodologies. The use of a
“generally accepted equivalent” is always understood with regard to any method provided
in the Standards.
1. Content: This evidence is based on an analysis of the relationship between the content of a
selection procedure and the construct it is intended to measure. This construct is typically
an important job requirement [knowledge, skill, ability, and/or other personal character-
istics (KSAOs)], job behavior, or job performance (e.g., performing a task or duty). Often,
reference is made in employment settings to the job content domain. A content study is typ-
ically completed by relying on logical judgments made by subject matter experts (SMEs)
regarding the inferences that can be made linking the test content and scores to the content
domain of the job.
2. Response processes: Studying the processes followed by individuals in responding to a
selection procedure can provide validity evidence regarding the fit between the construct
being measured and the behavior actually engaged in by respondents. For example, if a
work sample test is given to select individuals for an electronics technician job, the exam-
iner may not only want to know whether the candidate ultimately identified the correct
solution, but also what analytical processes were used to find the solution. The examiner may also want to know whether differences in scores among applicants are due to differences in knowledge or in reading comprehension.
3. Internal structure: Empirical analysis (e.g., factor analysis, differential item functioning)
of the internal structure of a selection procedure can provide evidence regarding the extent
to which the relationships between test items conform to the construct that is the founda-
tion for test score interpretations (McDonald, 1999). For example, a test may be expected to
measure a single construct (e.g., numerical ability) or several constructs (e.g., five different
dimensions of personality). The extent to which the test item relationships conform to the
hypothesized dimensionality is evidence of validity. Another aspect of internal structure
is the extent to which test items function differently for different subgroups of applicants
(differential item functioning). This does not necessarily represent a problem, but rather
may be due to multidimensionality in the test.
4. Relations to other variables: This source of evidence has typically been labeled “criterion-
related validity.” However, it also includes what was once called “construct validity.” It is
based on the empirical relationship(s) between scores on a selection procedure and some
external variable(s). If the relationship is between a test and a measure of job performance
or some other job related outcome (e.g. turnover), this has been called criterion-related
validity evidence. If the relationship is between a selection procedure and some other
measure(s) of the same or different construct(s) this was labeled construct validity in the
past. The essential point is that a hypothesis is set forth as to the nature of the relation-
ship that is tested using appropriate statistical methods, and empirical relationships are
established between the selection procedure of interest and one or more external vari-
ables. Criterion-related validity evidence can be collected in two ways: concurrently and
predictively. In a concurrent study, the selection procedure scores and measurement(s)
of the criterion/criteria are obtained at approximately the same time (concurrently) for
an incumbent group. The data are analyzed to determine the accuracy with which the
selection procedure scores predict criterion performance. In a predictive criterion-related
validity study, selection procedures scores are obtained but not used in decision-making
for an applicant group. A portion of the applicants is hired, and at some later point in time
(after a period during which the new hire can reliably demonstrate his or her performance) the criterion measure is obtained and the accuracy of the selection procedure is
evaluated. Generally, concurrent and predictive studies reach the same conclusions about
the validity of inferences from selection procedure scores, at least for cognitive abil-
ity measures (Barrett, Phillips, & Alexander, 1981; Bemis, 1968; Pearlman, Schmidt, &
Hunter, 1980).
The evidence based on relations to other variables is further explained in terms of two strategies
that can be used to understand those relationships: convergent/discriminant validity and validity
generalization. These two strategies are briefly discussed next.
Validity Generalization
The issue is whether or not validity evidence obtained in one situation can be generalized to a new
situation without further study of the validity of that procedure in the new setting. When criterion-
related validity evidence has been accumulated for a selection procedure, meta-analysis has been
a useful statistical method for studying the validity generalization question. There are numerous
methodological and statistical issues associated with meta-analytic studies, and these matters are
too lengthy to be addressed here. The interested reader is referred to Hunter and Schmidt (1990),
Rosenthal (1991), or Chapter 42, this volume.
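A "bare bones" Hunter-Schmidt analysis, which underlies much validity generalization work, reduces to a sample-size-weighted mean validity and the observed variance remaining after expected sampling-error variance is removed. The sketch below is a simplification under stated assumptions (the function and data are illustrative; real applications add corrections for criterion unreliability and range restriction):

```python
def bare_bones_meta(validities, sample_sizes):
    """Hunter-Schmidt 'bare bones' meta-analysis of validity coefficients.

    Returns the sample-size-weighted mean validity and the observed
    variance left after subtracting expected sampling-error variance.
    """
    total_n = sum(sample_sizes)
    mean_r = sum(n * r for r, n in zip(validities, sample_sizes)) / total_n
    obs_var = sum(n * (r - mean_r) ** 2
                  for r, n in zip(validities, sample_sizes)) / total_n
    # Expected variance attributable to sampling error alone
    sampling_var = (1 - mean_r ** 2) ** 2 * len(validities) / total_n
    return mean_r, max(obs_var - sampling_var, 0.0)

# Illustrative: three local studies with r = .20, .30, .40, each N = 100
mean_r, residual_var = bare_bones_meta([0.2, 0.3, 0.4], [100, 100, 100])
```

When the residual variance is at or near zero, the observed variability across studies is explained by sampling error, supporting the inference that validity generalizes to new settings.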
5. Consequences of testing: This is probably the most recent source of validity evidence to
be addressed in the literature regarding selection procedures. The question being asked
is whether or not the specific benefits expected from the use of a selection procedure are
being realized, and are there other consequences that are not expected or desired. In the
employment context, an unintended consequence might be differential pass rates for vari-
ous subgroups of job candidates. This alone does not detract from the validity of test score
interpretations. Only if the subgroup differences can be linked to a specific source of bias
or contamination in the selection procedure will the negative consequences detract from
the validity of score interpretations. Said differently, if the selection procedure is assessing
an important skill underlying job performance and any subgroup differences are attributed
to unequal distributions of that skill in the applicant pool, then the observation of those
differences does not undermine the validity of the inferences that are drawn from the selection procedure. The Standards is very clear on this point and explicitly states that there is a
difference between issues of psychometric validity versus social policy, and the Standards
is only addressing validity.
VALIDITY STANDARDS
There are 24 specific standards presented in the validity chapter of the Standards. Although all
24 standards are important, there are certain themes that are particularly relevant in the context of
employment selection. A brief summary of these themes follows.
• The rationale and intended interpretation of selection procedure scores should be set forth
at the outset of a validity study. When new interpretations or intended uses are contem-
plated, they should be supported by new validity evidence.
• Descriptions of individuals participating in validation studies should be as detailed as is
practical. If SMEs are used, their qualifications and the procedures they followed in devel-
oping validation evidence should be documented.
• When criterion-related validity studies are completed, information about the quality and
relevance of the criterion should be reported.
• When several variables are predicting a criterion, multiple regressions should be used to
evaluate increments in the predictive accuracy achieved by each variable. Results from the
analyses of multiple variables should be verified by cross-validation whenever feasible.
• If statistical adjustments (e.g., the correction of correlations for restriction in range) are
made, the unadjusted and adjusted correlations and the procedures followed in making the
adjustments should be documented.
• If meta-analyses are relied upon as criterion-related validity evidence, the comparability
between the meta-analytic variables (predictors and criteria) and the specific situation of
interest should be determined to support the applicability of the meta-analytic findings
to the local setting. All assumptions and clearly described procedures for conducting the
meta-analytic study should be reported.
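The statistical adjustment for restriction in range mentioned above is commonly made with Thorndike's Case II formula, which corrects a validity coefficient observed in a restricted (e.g., incumbent) sample using the ratio of the applicant-pool predictor standard deviation to the restricted-sample standard deviation. A sketch with illustrative values:

```python
import math

def correct_range_restriction(r, u):
    """Thorndike Case II correction for direct range restriction.

    r -- validity coefficient observed in the restricted sample
    u -- SD of the predictor in the unrestricted applicant pool divided
         by its SD in the restricted sample (u > 1 under restriction)
    """
    return r * u / math.sqrt(1 + r ** 2 * (u ** 2 - 1))

# Illustrative: observed r = .30, applicant SD 1.5x the incumbent SD
corrected = correct_range_restriction(0.30, 1.5)
```

Consistent with the bullet above, both the unadjusted and adjusted correlations, and the adjustment procedure, should be documented.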
There are 17 standards set forth in Chapter 14. They address the collection and interpretation
of validity evidence, the use of selection procedure scores, and the importance of reliability
information regarding selection procedure scores. The chapter’s introduction emphasizes that
the contents of many other chapters in the Standards also are relevant to employment testing.
One point of emphasis in Chapter 14 is the influence of context on the use of a selection proce-
dure. There are nine contextual features identified, which by their labels are self-explanatory (see
Standards, pp. 151–153).
The Standards emphasizes that the validation process in employment settings is usually grounded
in only two of the five sources of validity evidence: relations to other variables and content. One or
both types of evidence can be used to evaluate how well a selection procedure (predictor) predicts
a relevant outcome (criterion). However, there is no methodological preference or more correct
method of establishing validity; rather the selection situation and professional judgment should be
the determiners of what source(s) of evidence are appropriate.
• The objective of the employment selection procedure should be set forth, and an indication
of how well that objective has been met should be determined.
• Decisions regarding the conduct of validation studies should take into consideration prior
relevant research, technical feasibility, and the conditions that could influence prior and
contemplated validation efforts.
• When used, the representativeness of the criterion (which could be important work behaviors,
work output, or job-relevant training) should be documented.
• Inference about the content validity of a selection procedure for use in a new situation is
only appropriate when “critical job content factors are substantially the same (as deter-
mined by a job analysis), the reading level of the test material does not exceed that appro-
priate for the new job, and there are no discernable features of the new situation that would
substantially change the original meaning of the test material.” (Standards, p. 160)
• When multiple sources of information are available to decision-makers regarding an employment
process, the use of each informational component should be supported by validity evidence.
Furthermore, the role played by each component as it is integrated into a final decision should
be clearly explained. In credentialing situations, the rules and procedures followed when com-
bining scores from multiple information sources should be made available to candidates.
• Cut scores for credentialing tests should be determined on the basis of the skill or knowledge
level necessary for acceptable job performance and not on the basis of the number or propor-
tion of candidates passing.
FAIRNESS
For the first time an entire chapter in the Standards (pp. 73–84) has been devoted to fairness. In the
1985 edition there were no standards related to fairness; in 1999 there is considerable background
information plus 12 standards devoted to the topic. Fairness is defined in the Standards as “the
principle that every test-taker should be assessed in an equitable way” (p. 175). There are several
viewpoints as to exactly what fairness means, and several of these perspectives are addressed in the
chapter as described below.
• Candidates of equal standing on the construct being measured should, on average, obtain
the same test score, irrespective of group membership. The requirement that the overall
passing rates be comparable or equal across groups is not a definition that is accepted in
the professional literature.
• There is agreement that equality of treatment for all candidates is mandatory and is one
acceptable definition of fairness.
• The absence of bias in the test instrument itself or how it is used is also an acceptable
definition of fairness.
• The fourth definition discussed in the Standards gives consideration to the opportunity to
learn the material assessed by a test and is specific to educational achievement testing.
In employment testing, the issue of fairness is typically addressed by statistically examining test
results for evidence of bias. It is not simply a matter of whether or not test score averages differ by
majority and minority groups, but whether or not there are differences in test score predictions by
subgroup. If the predictions are equivalent (i.e., no difference in the slopes or intercepts), then there is
no bias. Another statistical perspective is that of differential item functioning (DIF). In this instance
if there is bias, candidates of equal ability differ in their responses to a specific item according to their
group membership. Unfortunately, the underlying reason for DIF, when it has been observed, has not
been apparent; oftentimes items work to the favor of one group relative to another for no explainable
reason such as item content. Use of sensitivity review panels that are comprised of individuals repre-
sentative of the subgroups of interest has been one mechanism that is intended to prevent item content
being relevant for one group but not another. Thus, members of the review panel would be expected to
flag items that were potentially unfair to a subgroup. However, there has not been research evidence
indicating that sensitivity review panels find a great deal to alter in test item content.
• A cut score is used to partition candidates into two groups: one passing or successful and
the other not passing or not successful. There is no single or best method for setting a cut
score. Further, because selection procedures are not perfect, there will be errors—some
candidates will pass who do not truly have adequate skills (false positives), and some will
fail when in fact they do have adequate skills (false negatives). Changing a cut score to cor-
rect for one concern will increase the occurrence of the other. Thus, professional judgment
always must be relied upon when setting a cut score.
• Normative data should be clearly documented in terms of demographics, sampling proce-
dures, descriptive statistics, and the precision of the norms.
• The psychometric characteristics of different forms of the same test should be clearly doc-
umented, and the rationale for any claim of equivalency in using test scores from different
test forms must be reported.
• Standardization in the administration procedures is extremely important, and all instruc-
tions should be clear.
• Selection procedures and results (including individual scores) should be treated as confi-
dential and kept in a secure manner.
• Documentation for a selection procedure typically includes information about intended
purpose; prior research evidence; the development process; technical information regard-
ing validity, reliability, fairness, score interpretation, scaling, or norming, if appropriate;
administration; and use of the results (e.g., pass/fail).
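The false positive/false negative trade-off noted in the cut-score bullet above can be illustrated with a toy candidate pool (all data hypothetical):

```python
# Each pair: (truly has adequate skills, observed test score)
candidates = [
    (True, 55), (True, 48), (True, 51), (True, 60),
    (False, 52), (False, 40), (False, 47),
]

def classification_errors(cut_score):
    """Count false positives (pass without adequate skills) and
    false negatives (fail despite adequate skills) at a cut score."""
    false_pos = sum(1 for ok, s in candidates if s >= cut_score and not ok)
    false_neg = sum(1 for ok, s in candidates if s < cut_score and ok)
    return false_pos, false_neg

low = classification_errors(50)   # lower cut: a false positive slips through
high = classification_errors(53)  # higher cut: more false negatives instead
```

Raising the cut score eliminates the false positive but doubles the false negatives, which is why professional judgment, not a formula alone, must govern where the cut is set.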
SUMMARY
The 1999 Standards reflects the state of the science and much of the most current professional
knowledge available regarding psychological testing. As in the past, the Standards will be revised in
604 Handbook of Employee Selection
the future and that process is now underway. Nevertheless, the Standards is extremely informative
about the requirements associated with the development and application of a selection procedure
in an employment setting. The document has been published to promote the professionally sound
and ethical use of selection procedures, and to provide a set of standards that can be the basis for
developing and implementing a new selection procedure, or for evaluating the quality of an existing
selection procedure and practice.
Several differences between the Standards and the Principles are worth noting. First, unlike the
Standards, which sets forth 264 specific standards, the Principles does not enumerate a list of
specific principles. Consequently, the Principles is more aspirational and facilitative in content,
whereas the Standards is more authoritative in nature.
Second, the Standards is much broader than the Principles with respect to psychological mea-
surement. For example, although many of the concepts expressed in the Principles could be relevant
to the field of educational testing, the Standards directly addresses the topic. The same is true for
such topics as testing in program evaluation and public policy.
Third, the Standards gives somewhat more attention to the rights and responsibilities of test-
takers, whereas the Principles focuses more on the responsibilities of selection procedure develop-
ers and users. This is no doubt at least partially due to the fact that the Principles places most of
the responsibility for proper selection processes on the employer rather than the candidate, whereas
the Standards considers a much wider group of test-takers to include students, patients, counselees,
and applicants.
Fourth, and finally, the Principles provides more guidance on how to plan a validation effort
and collect validity evidence within the context of an employment setting. Consequently, there is
more discussion of such topics as feasibility of a validation study; strategies for collecting informa-
tion about the work and work requirements, as well as about job applicants or incumbents and their
capabilities; analyzing data including such topics as multiple hurdles versus compensatory models,
cut-off scores, rank orders, and banding; and information to be included in an administrative guide
for selection procedure users.
APPLICATION TO LITIGATION
The Principles offers the most up-to-date and relevant information and guidance regarding personnel
selection procedures that might be the subject of litigation. Although the document is not written
in absolute terms, it provides a wealth of information that defines best practices in the validation
and implementation processes required to properly use a selection procedure. When examining the
qualities of a validation study or the implementation of a selection procedure, a decision-maker in
litigation proceedings might find that one or more expectations set forth in the Principles were not
met and ask why. Sound and logical explanations might be forthcoming; if not, then the unexplained
issues could be strong indicators that the processes being scrutinized were not established in accord
with accepted professional expectations.
ANALYSIS OF WORK
Given that the Principles is focused on selection procedures in the employment setting, there is
a particular emphasis on the analysis of work. Such an analysis establishes the foundation for
collecting validity evidence in two ways. First, information from the analysis of work defines
relevant worker requirements and determines the KSAOs needed by a worker to perform successfully
in a work setting. Second, the work analysis defines the criterion measures that, when
appropriate for the validation strategy being used, indicate when employees have successfully
accomplished relevant work objectives and organizational goals.
Historically, the analysis of work was labeled “job analysis,” and that term is still frequently
used. The Principles expanded the term to the analysis of work to give clear recognition to the
realization that the concept of a traditional job is changing. Furthermore, the “analysis” should
incorporate the collection of data about the workers, the organization, and the work environment, as
well as the specific job or some future job if that is relevant to the study. As implied by the various
permutations that might be considered, there is no one preferred method or universal approach that
is appropriate for completing an analysis of work.
The Principles encourages the development of a strategy and a sampling plan to guide an analy-
sis of work. Further, the analysis of work should be conducted at a level of detail consistent with
the intended use and availability of the work information. Any method used and outcomes obtained
should be well documented in a written report.
VALIDATION
The Principles adopts the same definition of validity as given in the Standards. Validity is a unitary
concept, and there are different sources of evidence that can contribute to the degree to which there
is scientific support for the interpretation of selection procedure scores for their proposed purpose.
If a selection procedure is found to yield valid interpretations, it can be said to be job-related.
The Principles recognizes the five sources of evidence discussed in the Standards. However, the
Principles places more emphasis on the two sources of evidence most frequently relied upon when
studying validity in the employment context—criterion-related and content validity. Several considerations bear on planning and evaluating such validity evidence:
• Feasibility: Are the job determinant conditions (e.g., is the work stable or constantly chang-
ing), worker-related variables (e.g., are past experiences no longer relevant for the current
work), or contextual matters (e.g., are the work conditions extremely different from the
testing environment) that might influence the outcome of the validity study under sufficient
control so as not to contaminate the study?
• Design: Has an adequate sample of important work behaviors and/or worker KSAOs been
obtained and analyzed?
• Content domain: Has the work content domain been accurately and thoroughly defined
and linked to the selection procedure?
Professional Guidelines/Standards 607
• Selection procedure: Does the selection procedure adequately represent the work content
domain? The fidelity of this relationship is the basis for the validity inference.
• Sampling: Is there a sound rationale for the sampling of the work content domain?
• Specificity: Has the level of specificity necessary in the work analysis and selection proce-
dure been described in advance?
• Administrative procedures: Are there adequate guidelines established for administering and
scoring the selection procedure that will maintain the integrity of the validity evidence?
The Principles also recognizes internal structure validity evidence. The Principles points out
that evidence based on the structure of a selection procedure is not sufficient alone to establish the
validity of the procedure for predicting future work performance or other work-related behaviors
(e.g., attendance, turnover). However, consideration of the internal structure can be very helpful dur-
ing the design of a selection procedure.
Transportability
This strategy refers to relying on existing validity evidence to support the use of a selection pro-
cedure in a very similar but new situation. The important consideration underlying the transport
argument is work/job comparability in terms of content and requirements. Also, similarity in work
context and candidate groups may be relevant to documenting the transport argument (Gibson &
Caplinger, 2007).
Meta-Analysis
The information on meta-analysis in the Standards and Principles is very similar. In the Principles,
meta-analysis is acknowledged as the technique that serves as the foundation for validity gener-
alization. Both documents point out that meta-analytic findings may be useful but not sufficient
to reach a conclusion about the use of a selection procedure in a specific situation. Rather, a local
validation study may be more appropriate. Both sources also emphasize that professional judg-
ment is necessary to evaluate the quality of the meta-analytic findings and their relevance to the
specific situation of interest. The general conclusion in the Principles is that meta-analytic find-
ings for cognitive tests indicate that much of the difference in validity coefficients found from
one study to the next can be attributed to statistical artifacts and sampling error (Callender &
Osburn, 1981; Hartigan & Wigdor, 1989; Hunter & Hunter, 1984). Similar but less conclusive
evidence is accumulating for noncognitive measures (Barrick & Mount, 1991; Barrick, Mount, &
Judge, 2001; Hurtz & Donovan, 2000). Furthermore, the strength of the validity may be less for a
noncognitive test (Morgeson et al., 2007; Hogan, Davies, & Hogan, 2007).
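The logic behind these validity-generalization findings can be sketched as a "bare-bones" meta-analysis: compute the sample-size-weighted mean validity, the observed variance of the coefficients, and the proportion of that variance expected from sampling error alone. The study correlations and sample sizes below are hypothetical, and corrections for other artifacts (range restriction, criterion unreliability) are deliberately omitted from this simplified Hunter-Schmidt-style computation.

```python
# Bare-bones meta-analysis: how much of the observed variance in validity
# coefficients is attributable to sampling error alone? Inputs are hypothetical.

def bare_bones_meta(rs, ns):
    """Return (weighted mean r, observed variance, proportion due to sampling error)."""
    total_n = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling-error variance of r around r_bar for the average study size:
    n_bar = total_n / len(ns)
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    pct_artifact = min(var_err / var_obs, 1.0) if var_obs > 0 else 1.0
    return r_bar, var_obs, pct_artifact

rs = [0.18, 0.25, 0.31, 0.22, 0.35, 0.15]   # hypothetical study validities
ns = [80, 120, 60, 150, 90, 70]             # hypothetical study sample sizes
r_bar, var_obs, pct = bare_bones_meta(rs, ns)
print(f"mean r = {r_bar:.3f}; variance attributable to sampling error = {pct:.0%}")
```

With these invented inputs, sampling error alone accounts for all of the between-study variance, which mirrors the conclusion the Principles draws for cognitive tests.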
The Principles adds to the literature by discussing the appropriateness of the technique and its
interpretation in specific situations. In general, reliance on meta-analytic results is most appropriate
when the studies contributing to the meta-analysis focus on constructs. In such instances, the find-
ings reflect the degree to which the measures of the constructs are measuring the same construct.
In contrast, when the studies in the meta-analysis focus on methods (e.g., the interview) instead of
constructs, several interpretational difficulties arise. Because the interview measures different con-
structs, it is difficult to generalize about the general method of the interview unless the features of
the interview method “are clearly understood, if the content of the procedures and meaning of the
scores are relevant for the intended purpose, and if generalization is limited to other applications
of the method that include those features” (Principles, p. 30). Generalizing from a meta-analysis
of “the” interview method to a new interview method measuring different constructs or to a new
interview that addresses a new situation is problematic when constructs do not serve as the founda-
tion of the analysis.
Predictive Bias
Predictive bias, also termed differential prediction, has occurred if consistent, nonzero errors of
prediction are made for individuals in a particular subgroup. Moderated multiple regression is the
typical method used to assess predictive bias, which is indicated if there are slope and/or intercept
differences observed in the model. Research on cognitive ability measures has typically supported
the conclusion that there is no predictive bias for African-American or Hispanic groups relative to
Whites, and when predictive differences are observed they indicate overprediction of the perfor-
mance of the minority group. It is also important to note that there can be mean score differences
on a selection procedure for minority versus majority subgroups but no predictive bias.
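The regression test for slope and intercept differences can be sketched with ordinary least squares: fit the predictor together with a subgroup indicator and the predictor-by-group interaction. A nonzero coefficient on the indicator signals an intercept difference; a nonzero interaction signals a slope difference. The data below are simulated with a built-in intercept difference, and a real analysis would also test these coefficients for statistical significance.

```python
import numpy as np

def moderated_regression(x, y, group):
    """Fit y = b0 + b1*x + b2*group + b3*(x*group) by ordinary least squares.

    b2 estimates the intercept difference between subgroups and b3 the
    slope difference; both near zero is consistent with no predictive bias.
    """
    X = np.column_stack([np.ones_like(x), x, group, x * group])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs  # b0, b1, b2, b3

# Hypothetical data with a built-in intercept difference of 3 criterion points.
rng = np.random.default_rng(0)
n = 400
group = np.repeat([0.0, 1.0], n // 2)     # 0 = one subgroup, 1 = the other
x = rng.normal(50, 10, n)                 # predictor (e.g., test scores)
y = 2.0 + 0.4 * x + 3.0 * group + rng.normal(0, 1, n)

b0, b1, b2, b3 = moderated_regression(x, y, group)
print(f"estimated intercept difference: {b2:.2f}, slope difference: {b3:.2f}")
```

Note that simulating a mean difference on x alone, with no group term in the model for y, would recover b2 and b3 near zero: mean score differences and predictive bias are separate questions, as the text states.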
Measurement Bias
This form of bias is associated with one or more irrelevant sources of variance contaminating a
predictor or criterion measure. Unlike predictive bias, measurement bias has no well-established
assessment approaches. Analyses of differential item functioning (DIF) and item sensitivity are
suggested as options in the Principles, but considerable caution about the value of such analyses
is also expressed.
As noted by Sackett, Schmitt, Ellingson, and Kabin (2001), the research results indicate that item
effects are often very small and that there is no consistent pattern of items favoring one group of
individuals relative to another group.
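One common DIF procedure is the Mantel-Haenszel method: test-takers are matched on total score, and within each score stratum the odds of answering the item correctly are compared across groups. A minimal sketch follows; the response counts are hypothetical, and operational DIF work would add a significance test and an effect-size classification.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio for one item across score strata.

    Each stratum is (ref_correct, ref_wrong, focal_correct, focal_wrong)
    for test-takers matched on total score. A value near 1.0 suggests the
    item functions similarly for the two groups; values far from 1.0
    suggest differential item functioning.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts in three total-score strata (low, middle, high):
strata = [(30, 20, 28, 22), (45, 15, 44, 16), (55, 5, 54, 6)]
print(round(mantel_haenszel_or(strata), 2))  # 1.15 — close to 1.0, little evidence of DIF
```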
OPERATIONAL CONSIDERATIONS
Almost one half of the Principles is devoted to operational considerations. The issues discussed
are related to initiating and designing a validation effort; analysis of work; selecting predictors, a
validation strategy, and criterion measures; data collection and analyses; implementation; recom-
mendations and reports (technical and administrative); and other circumstances that may influence
the validation effort (e.g., organizational changes; candidates with disabilities; and responsibilities
of selection procedure developers, researchers, and users).
There are a few topics discussed in the operational considerations section of the Principles that
could be critical issues in the development and implementation of an employment selection proce-
dure that are discussed in the following subsections.
Utility
Gains in productivity, reductions in undesirable outcomes (e.g., accidents, manufacturing rejects,
or absenteeism), and comparisons among alternative selection procedures can be estimated by utility
computations. Typically, several assumptions must be made under considerable uncertainty to satisfy the
computational requirements of the utility models. Thus, caution should be observed in relying upon
such utility estimates.
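One widely used utility model is the Brogden-Cronbach-Gleser equation; a sketch with invented figures follows. Every input below is an assumption of exactly the kind the caution above concerns, and SDy (the dollar value of a one-standard-deviation difference in performance) is typically the hardest to estimate.

```python
def bcg_utility(n_hired, tenure_years, validity, sd_y, mean_z_hired,
                n_tested, cost_per_applicant):
    """Brogden-Cronbach-Gleser estimate of the dollar gain from selection.

    sd_y is the dollar value of a one-SD difference in job performance and
    is usually estimated with considerable uncertainty; mean_z_hired is the
    average standardized predictor score of those hired.
    """
    gain = n_hired * tenure_years * validity * sd_y * mean_z_hired
    return gain - n_tested * cost_per_applicant

# All inputs hypothetical: 20 hires from 200 applicants, average tenure 2 years,
# validity .30, SDy $10,000, mean standardized score of hires 1.0, $50 per test.
print(f"estimated utility: ${bcg_utility(20, 2, 0.30, 10_000, 1.0, 200, 50):,.0f}")
```

Because the result scales directly with SDy and the validity coefficient, modest errors in either assumption shift the estimate substantially, which is why the text advises caution in relying on such figures.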
Bands
When a range of selection procedure scores is established that considers all candidates within the
range to be equivalent, a band results. Banding may lower expected criterion outcomes and selection
utility when compared to top-down, rank-order selection, but these consequences may
be balanced by increased administrative ease and the possibility of increased workforce diversity.
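One common way to set the band width treats scores as equivalent when they fall within the standard error of the difference (SED) of the top score, so that candidates inside the band cannot be reliably distinguished. The sketch below uses this SED approach with hypothetical values for the test's standard deviation and reliability; it is one way to operationalize a band, not the only one.

```python
import math

def band_width(sd, reliability, c=1.96):
    """Band width from the standard error of the difference between two scores.

    Scores closer together than this are not reliably different at the
    confidence level implied by c (1.96 corresponds to roughly 95%).
    """
    return c * sd * math.sqrt(2) * math.sqrt(1 - reliability)

# Hypothetical values: test SD of 10 and reliability of .90 give a band of
# roughly 8.8 points below the top score.
top_score = 92
width = band_width(sd=10, reliability=0.90)
print(f"scores from {top_score - width:.1f} to {top_score} treated as equivalent")
```

Note the trade-off the formula makes explicit: a less reliable test produces a wider band, treating more candidates as equivalent.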
Reports
Technical and administrative reports should give clear information regarding the research sample
and the statistical analyses conducted, as well as recommendations on implementation and the
interpretation of the selection procedure scores.
SUMMARY
The Principles offers a very current and comprehensive resource for use by decision-makers when
developing and implementing employment selection procedures. Because the Principles is focused
specifically on the employment setting, there is frequently more guidance offered on matters that
arise in the development and use of selection procedures than will be found in the Standards.
Nevertheless, the two documents are very compatible and not at all contradictory. The Principles
has undergone substantial professional peer review and represents the official policy of the SIOP
and APA.
PURPOSE
The Uniform Guidelines is intended to do the following:
[I]ncorporate a single set of principles which are designed to assist employers, labor organizations,
employment agencies, and licensing and certification boards to comply with requirements of Federal
Law prohibiting employment practices which discriminate on grounds of race, color, religion, sex, and
national origin. They are designed to provide a framework for determining the proper use of tests and
other selection procedures. These guidelines do not require a user to conduct validity studies of selection
procedures where no adverse impact results. However, all users are encouraged to use selection procedures
which are valid, especially users operating under merit principles. (Section 1.B 29C.F.R.1607)
The Q & As was prepared “to interpret and clarify, but not to modify, the provisions of the Uniform
Guidelines.” (Introduction, Federal Register 43, 166, 11996–12009, March, 1979)
All subsequent references in this chapter to the Uniform Guidelines should be considered to be
inclusive of the Q & As.
DISCRIMINATION/ADVERSE IMPACT
The Uniform Guidelines explicitly defines discrimination and introduces the term “adverse impact.”
In essence, discrimination occurs when a selection procedure results in unjustifiable adverse impact.
Adverse impact occurs when the selection rate for a protected group is less than four-fifths (80%)
of the rate for the group with the highest rate (typically the nonprotected group). To illustrate, if the
passing rate for the majority group is 60% whereas the passing rate for a protected group is 40%,
the ratio 40/60 yields 67%, which is less than 80%, and consequently the conclusion may be adverse
impact. If, on the other hand, the passing rate of the protected group were 50%, then 50/60 yields
83%, and thus the conclusion would be no adverse impact. This “rule of thumb” is not intended as a legal
definition, and for good reason, because it is problematic from two perspectives. First, it is
highly influenced by sample size. For example, if there are 50 male and 50 female applicants and
20 open positions, the only way a selection process will not violate the 80% rule is to hire at least
9 females (9/50 = 18%); because 11 males would then be hired, the male selection rate is 22%
(11/50), and the ratio of 18% to 22% is 82%, which satisfies the rule. Note that if the samples of
males and females were each 500, then the same selection rates of 22% and 18% would yield 110 males
and 90 females hired; this difference of 20 hires would still not be considered adverse impact under
the rule.
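The four-fifths arithmetic in these examples is simple enough to sketch directly. The helper function below is hypothetical (the EEOC provides no code), but the figures are those used in the text.

```python
def selection_rate_ratio(selected_a, applicants_a, selected_b, applicants_b):
    """Ratio of group A's selection rate to group B's.

    The four-fifths rule of thumb flags values below 0.80; group B should
    be the group with the highest selection rate.
    """
    return (selected_a / applicants_a) / (selected_b / applicants_b)

# Figures from the text:
print(round(selection_rate_ratio(40, 100, 60, 100), 2))  # 0.67 -> possible adverse impact
print(round(selection_rate_ratio(50, 100, 60, 100), 2))  # 0.83 -> no adverse impact
print(round(selection_rate_ratio(9, 50, 11, 50), 2))     # 0.82 -> just above the threshold
```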
Second, and perhaps more important, the 80% rule of thumb is not a statistical test. The null
hypothesis is not stated, and there is no estimate of the likelihood of any difference observed being
because of chance. Such hypothesis testing is accomplished using binomial or hypergeometric
probability models. Typically, the .05 level under a two-tailed test (i.e., 1.96 standard deviation
units) is considered the threshold of statistical significance. Although the 80% value has no
standing in the scientific literature, the .05 level of significance is well accepted in social sciences
research as indicating statistical significance. But this test also has its practical limitation because
statistical significance is a function of sample size. A difference of 5 percentage points between
two groups would be statistically significant if the total sample were in the thousands, but would
not be statistically significant if the total sample were small (e.g., 30). Although the Uniform
Guidelines acknowledges in Section 3D that statistical significance is affected by “small numbers,”
it does not provide guidance as to which strategy is favored—the 80% rule or the statistical test of
the difference. This issue then becomes a point of argument in litigation.
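The statistical alternative to the 80% rule can be sketched as a two-proportion z-test. This is a standard pooled test, not a formula prescribed by the Uniform Guidelines; note how the same 18% versus 22% selection rates move toward significance only as sample size grows.

```python
import math

def two_proportion_z(selected_a, n_a, selected_b, n_b):
    """Pooled z statistic for the difference between two selection rates."""
    p_a, p_b = selected_a / n_a, selected_b / n_b
    p = (selected_a + selected_b) / (n_a + n_b)      # pooled selection rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# The same 18% vs. 22% selection rates at three sample sizes:
for hired_f, n_f, hired_m, n_m in [(9, 50, 11, 50),
                                   (90, 500, 110, 500),
                                   (900, 5000, 1100, 5000)]:
    z = two_proportion_z(hired_f, n_f, hired_m, n_m)
    print(f"n per group = {n_f}: z = {z:.2f}, significant at .05 = {abs(z) > 1.96}")
```

This is the mirror image of the 80% rule's weakness: the ratio test ignores sample size entirely, whereas the significance test is driven by it.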
FAIRNESS
This concept is introduced in the discussion of criterion-related validity [see Sec. 7.B (3) and Sec.
14.B (8)]. The Uniform Guidelines requires that a fairness investigation of a selection procedure
be conducted if technically feasible before applying validity evidence from one situation to a new
situation. Further, if adverse impact is observed and data from a criterion-related validation study
are available, the user is expected to conduct a fairness analysis. Unfairness occurs when lower
minority scores on a selection procedure are not reflected in lower scores on the criterion or index
of job performance. The Standards and Principles consider this a matter of predictive bias, and it
is found when consistent nonzero errors of prediction occur for a protected subgroup. Moderated
multiple regression is the accepted statistical method for examining predictive bias, which occurs if
there are slope and/or intercept differences between subgroups. As previously mentioned, there is
no consistent research evidence supporting predictive bias on cognitive tests for African Americans
or Hispanics relative to Whites. Research studies of noncognitive test results by subgroups have not
yet been conducted in sufficient numbers to draw any definitive conclusions.
CUT-OFF SCORES
Cut-off scores are discussed first in the Uniform Guidelines as part of the general standards for
validity studies (Sec. 5. H.) and then in the Technical standards section [Sec. 14. B. (6)]. According
to the Uniform Guidelines, “Where cutoff scores are used, they should normally be set so as to
be reasonable and consistent with normal expectations of acceptable proficiency within the work
force” (Sec. 5. H.).
This definition seems to imply the need for professional judgment in setting a cut-off score, and
such a stance is consistent with the Standards and the Principles.
BOTTOM LINE
Another concept introduced by the Uniform Guidelines when trying to assess adverse impact or
discrimination is the bottom-line approach. If there are multiple components to a selection proce-
dure the final decision point is evaluated for adverse impact. According to the Uniform Guidelines,
only if the adverse impact occurs at the bottom line must the individual components of a selection
procedure be evaluated. However, this concept was struck down by the U.S. Supreme Court in
Connecticut v. Teal (1982). Hence, today it is typical that all components of a selection procedure
are evaluated in terms of adverse impact if they can be examined individually.
ALTERNATIVE SELECTION PROCEDURES
Many selection procedures are available “which serve the user’s legitimate interest in efficient and
trustworthy workmanship” but still may not be feasible alternatives (see 3.B.). Examples of concerns
include faking or response distortions of personality and biodata inventories, costs of development
and implementation, and the ability to assess very large numbers of applicants at the same time.
Also of note is the general application of the “alternative selection procedure” section of the
Uniform Guidelines, Section 3B. Whereas most of the attention in the literature and litigation has
focused on alternative procedures, the Uniform Guidelines also considers “an investigation of …
suitable alternative methods of using the selection procedure which have as little adverse impact as
possible.” Thus, in a given situation a particular procedure might be used on a pass/fail basis
instead of for top-down selection.
JOB-RELATEDNESS/BUSINESS NECESSITY
An employment selection procedure that has adverse impact may be justified in two ways: (a) show-
ing that the procedure is job-related or (b) showing that the procedure is justified by business neces-
sity. Job-relatedness is demonstrated by the validation process. Business necessity is demonstrated
when a selection procedure is necessary for the safe and efficient operation of the business entity.
Oftentimes there are relevant statutes and regulations that define the business necessity argument
(e.g., legislation regarding public safety job requirements); other times information from the analysis
of work will demonstrate the business necessity of a selection procedure.
VALIDITY
The Uniform Guidelines sets forth what are considered acceptable types of validity studies by
the enforcement agencies and identifies three types: criterion-related, content, and construct. The
document notes that new validation strategies “will be evaluated as they become accepted by the
psychological profession” (see 5.A.). The Uniform Guidelines also states that the validation provi-
sions “are intended to be consistent with generally accepted professional standards … such as those
described in the Standards for Educational and Psychological Tests … and standard textbooks and
journals in the field of personnel selection” (see 5.C). Of course the Standards being referred to
were published in 1974, and there have been two major revisions published in 1985 and 1999. The
Uniform Guidelines makes no specific reference to the Principles although the first edition was pub-
lished in 1975. Consequently, it is easy to understand how the treatment of validity by the Uniform
Guidelines is not particularly consistent with the state of the scientific knowledge as set forth in the
current editions of the Standards and the Principles.
When introducing validity, the Uniform Guidelines offers several warnings or conditions.
• Do not select on the basis of knowledge, skills, and abilities (KSAs) that can be learned on
the job during orientation.
• The degree of adverse impact should influence how a selection procedure is implemented,
and evidence sufficient to justify a pass/fail strategy may be insufficient for rank order.
• A selection procedure can be designed for higher-level jobs if most employees can be
expected to progress to those jobs in about 5 years.
• An employer can use a selection procedure if there is substantial validity evidence from
other applications and if the employer has in progress, if technically feasible, a validity
study that will be completed in a reasonable period of time.
• Validity studies should be reviewed for currency, particularly if there may be available
alternative procedures with equal validity but less adverse impact.
• There are no substitutes for validity evidence and no assumptions of validity based on
general representation, promotional material, testimony, and the like.
• Employment agencies are subject to the guidelines in the same manner as employers.
Criterion-Related Validity
The Uniform Guidelines’ position on criterion-related validity is very consistent with the informa-
tion set forth in the Standards and Principles. Job analysis is important for decisions regarding
grouping jobs together and selecting and developing criterion measures. An overall measure of job
performance may be used as a criterion if justified by the job analysis (the Principles and Standards
emphasize the need for construct equivalence for predictor and criterion measures and typically
there is a greater degree of construct specificity than “overall performance” developed from the job
analysis); success in training also can be used as a criterion. Concurrent and predictive designs are
recognized, and emphasis is placed on the representativeness of the sample of individuals partici-
pating in the validity study, regardless of its design.
Criterion-related validity evidence should be examined using acceptable statistical procedures,
and the Uniform Guidelines establishes the .05 level of statistical significance as the threshold for
concluding that there is a relationship between a predictor and a criterion. Usually the relationship
is expressed as a correlation coefficient, which must be assessed in the particular situation: “There
are no minimum correlation coefficients applicable to all employment situations” [see 14.B. (6)].
Additionally, care must be taken to not overstate validity findings.
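Whether a validity coefficient clears the .05 threshold depends jointly on its size and the sample size, which is one reason no minimum coefficient applies to all situations. A sketch of the standard test of a correlation against zero (the coefficient and sample sizes below are hypothetical):

```python
import math

def validity_t(r, n):
    """t statistic for testing whether a validity coefficient differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The same validity coefficient can be significant or not depending on n.
# The two-tailed .05 critical value is roughly 2.0 for these degrees of freedom
# (the exact value depends on df).
for n in (30, 100):
    t = validity_t(0.25, n)
    print(f"r = .25, n = {n}: t = {t:.2f}")
```

With r = .25, a study of 30 incumbents falls short of significance while a study of 100 clears it, which underscores the feasibility questions about sample size raised earlier in the chapter.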
Content Validity
The technical standards for content validity studies begin by focusing on the appropriateness of
such a study. A selection procedure must be a representative sample of the job content or purport
to measure KSAs that are required for successful job performance. Selection procedures based on
inferences about mental abilities or that purport to measure traits such as intelligence, common
sense, or leadership cannot be supported only on the basis of content validity. Solid job analysis
information that is representative of the jobs, and when necessary is operationally defined, is critical
to a content validity argument.
The Uniform Guidelines provides for the ranking of candidates assessed by a content-valid selec-
tion procedure, given that the procedure is measuring one or more capabilities that differentiate
between levels of job performance. This is generally compatible with the guidance offered by the
Principles, although the Q & As to the Uniform Guidelines gives more examples as to when it is, or
is not, appropriate to use rank ordering.
Construct Validity
This form of validity is defined in Section 14.D (1) of the Uniform Guidelines as “a series of research
studies, which include criterion related and which may include content validity studies.” In Section
14.D (1) and (3), it is stated that a “construct” is the intermediary between the selection procedure
on the one hand and job performance on the other. A job analysis is required, and one or more con-
structs that are expected to influence successful performance of important work behaviors should
be identified and defined. To accomplish a construct validity study, it should be empirically dem-
onstrated “that the selection procedure is validly related to the construct and that the construct is
validly related to the performance of critical or important work behaviors” [14.D (3)]. (This is the
definition that drew the objections of APA and SIOP.) In turn, a selection procedure is developed
that will measure the constructs of interest.
Documentation Required
The Uniform Guidelines sets forth many documentation requirements for a validity study, and many
of these requirements are labeled “essential.” Generally speaking, the information expected as part
of the documentation effort is very consistent with the material presented in each of the various sec-
tions of the Uniform Guidelines.
Utility
One term that the Uniform Guidelines does not define, and that could have many interpretations,
is “utility.” It is found in the sections dealing with the uses and applications
of a selection procedure that has been evaluated by a criterion-related validity study. Specifically,
when documenting the methods considered for using a procedure it “should include the rationale for
choosing the method of operational use, and the evidence of validity and utility of the procedure as
it is to be used (essential)” [see 15.B. (10)]. Identical sentences appear in the uses and applications
sections for content and construct validity. Furthermore, in Section 5.G. the Uniform Guidelines
states:
If a user decides to use a selection procedure on a ranking basis, and that method of use has a greater
adverse impact than use of an appropriate pass/fail basis …, the user should have sufficient evidence of
validity and utility to support the use on a ranking basis.
• The Uniform Guidelines is some 20 years older than the Standards and 25 years older than
the Principles. The latter two documents have undergone one and two revisions, respectively,
since the Uniform Guidelines was published, but the Uniform Guidelines has never
been revised or brought up to date.
• The Uniform Guidelines was written to guide the enforcement of civil rights legislation.
The Standards and Principles were written to guide research and professional practice and
to inform decision-making in applicable areas of employment selection. Hence, the latter
two documents have more of a scientific focus and rely heavily on the current research
literature; the Uniform Guidelines was intended to be consistent with generally accepted
professional standards set forth in the 1974 version of the Standards, but was not necessar-
ily research-based at the time of its preparation.
TABLE 28.1
Validation and Psychometric Terminology Comparison

Validity (unitary concept)
  Standards: The degree to which accumulated evidence and theory support specific interpretations of test scores.
  Principles: The degree to which accumulated evidence and theory support specific interpretations of scores from a selection procedure entailed by the proposed uses of that selection procedure.
  Uniform Guidelines: Not defined.

Sources of validity evidence

(a) Relations to other variables / criterion-related
  Standards: The relationship of test scores to variables external to the test such as measures of some criteria that the test is expected to predict.
  Principles: The statistical relationship between scores on a predictor and scores on a criterion measure.
  Uniform Guidelines: Empirical data showing that the selection procedure is predictive of or significantly correlated with important elements of work behavior.

(b) Content
  Standards: The linkage between a predictor and one or more aspects of a criterion construct domain.
  Principles: The extent to which content of a selection procedure is a representative sample of work-related personal characteristics, work performance, or other activities or outcomes.
  Uniform Guidelines: Data showing that the content of a selection procedure is representative of important aspects of work performance on the job.

(c) Internal structure
  Standards: The extent to which the relationships between test items conform to the construct that is the foundation for test score interpretation.
  Principles: The degree to which psychometric and statistical relationships among items, scales, or other components within a selection procedure are consistent with the intended meanings of scores on the selection procedure.
  Uniform Guidelines: Not defined.

(d) Response process
  Standards: The study of the cognitive account of some behavior, such as making a selection procedure item response.
  Principles: The study of the cognitive account of some behavior, such as making a selection procedure item response.
  Uniform Guidelines: Not defined.

(e) Consequences of testing
  Standards: Whether or not the specific benefits expected from the use of a selection procedure are being realized.
  Principles: Evidence that consequences of selection procedure use are consistent with the intended meaning or interpretation of the selection procedure.
  Uniform Guidelines: Not defined.

Construct validity
  Standards: An indication that test scores are to be interpreted as indicating the test-taker’s standing on the psychological construct measured by the test. The term construct is redundant with validity. The validity argument establishes the construct validity of a test.
  Principles: Evidence that scores on two or more selection procedures are highly related and consistent with the underlying construct; can provide convergent evidence in support of the proposed interpretation of test scores as representing a candidate’s standing on the construct of interest.
  Uniform Guidelines: Data showing that the selection procedure measures the degree to which candidates have identifiable characteristics that have been determined to be important for successful job performance.

Convergent validity
  Standards: Evidence based on the relationship between test scores and other measures of the same constructs.
  Principles: Evidence of a relationship between measures intended to represent the same construct.
  Uniform Guidelines: Not defined.
Professional Guidelines/Standards 617
618 Handbook of Employee Selection
Note: The definitions or explanations above are taken verbatim from the glossaries or definitions sections of the authoritative sources whenever possible; otherwise, the definitions were extracted from document text on the subject.
Construct Validity
The Uniform Guidelines treats construct validity as a separate type of validity. In the Standards and
Principles, all selection procedure scores or outcomes are viewed as measures of some construct.
Consequently, any evaluation of validity is a “construct validity” study.
Validity Generalization
The concept of validity generalization was known at the time the Uniform Guidelines was prepared but was not specifically invoked in the document. Many have interpreted Section 7.B of the Uniform Guidelines as providing
for validity generalization arguments. The provisions of that section are described under transport
of validity evidence in Table 28.1.
Transport of Validity
The three authoritative sources agree that a work or job analysis is necessary to support the
transport of validity. However, the Uniform Guidelines goes further and requires an existing
criterion-related validity study and a fairness study of the selection procedure for relevant
protected subgroups. There is, however, no guidance as to the acceptability of transporting the
validity of a selection procedure that has some demonstrated unfairness. Furthermore, in many,
if not most, situations, sample sizes preclude adequate fairness analyses.
Meta-Analysis
Again, given the date of the document, the authors of the Uniform Guidelines could not have known
of the research findings that have since emerged from meta-analytic research. This, unfortunately, is
another void: a significant body of research available today might not be considered within the
scope of validation strategies acceptable under the Uniform Guidelines.
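In its simplest form, the meta-analytic approach that postdates the Uniform Guidelines pools validity coefficients across studies, weighting each by its sample size, before applying artifact corrections. The following is a minimal sketch only; the (n, r) pairs are hypothetical and not drawn from any source cited in this chapter:

```python
def weighted_mean_r(studies):
    """Sample-size-weighted mean validity coefficient across studies,
    the starting point of a Hunter-Schmidt style meta-analysis
    (before corrections for artifacts such as range restriction
    and criterion unreliability)."""
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

# Hypothetical (sample size, observed r) pairs from independent
# validation studies of the same predictor-criterion pair.
studies = [(100, 0.20), (300, 0.30), (100, 0.40)]
print(f"mean r = {weighted_mean_r(studies):.2f}")  # prints "mean r = 0.30"
```

A full meta-analysis would additionally estimate how much of the between-study variance in r is attributable to sampling error, which is the basis of the validity generalization argument discussed above.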
Reliability
The term is not defined in the Uniform Guidelines as it is in the other two authoritative sources
but is considered to be important for selection procedures that have been supported with a content
validity strategy. The Standards and Principles emphasize that the reliability of any measurement
be considered whenever it is technically feasible to do so.
SUMMARY
There are areas of consistency or agreement across the three authoritative sources, but there are
also consequential areas of disagreement. It is very likely that the advances in selection procedure
research and scholarly thinking regarding validity that have occurred over the last 35 years account
for these differences. Although the Uniform Guidelines is the document that seems most deficient
in terms of current knowledge of the field, it is also the earliest of the three documents to have been
adopted. Accordingly, many of its deficiencies can be attributed to its age. But, as noted earlier in
this chapter, the Uniform Guidelines allowed for other procedures and issues to take precedence.
Sections 5.A and 5.C acknowledge, respectively, that “New strategies for showing the validity of
selection procedures will be evaluated as they become accepted by the psychological profession”
and that “The provisions of these guidelines … are intended to be consistent with generally accepted
professional standards … and standard textbooks and journals in the field of personnel selection.”
These clauses can be interpreted to suggest that deference should be given to the Principles and
Standards where they disagree with the Uniform Guidelines. Nevertheless, sometime in the near
future it will be important for the Uniform Guidelines to be revised to reflect the current state of
the science. Until that time, decision-makers involved in employment selection should look to the
Standards and Principles for guidance on the many issues that the Uniform Guidelines either treats
incorrectly or does not address.
FINAL THOUGHTS
SCIENCE VERSUS LITIGATION VERSUS TECHNICAL AUTHORITIES/GUIDELINES
It is recognized that there are some significant inconsistencies at this time between the technical
information provided by the Standards and Principles, on the one hand, and the Uniform Guidelines,
on the other hand, and that these differences can be extremely important in the event of litigation
regarding a selection procedure. However, these differences can be resolved. Until a revision of the
Uniform Guidelines is forthcoming, when more than one authority is introduced in litigation in
support of only one side of an argument, resolving the differences that appear in print will
necessarily be part of the judicial decision-making process. In this regard, it is incumbent upon
those who rely on any of the authoritative sources during the course of litigation to be clear about
the relevance and currency of the source(s) providing guidance to their opinions.
CONCLUSIONS
In closing, there are several broad as well as some specific issues that we want to note. We will
start with the broader issues. First, what deference should be given to the Uniform Guidelines,
Principles, and Standards in guiding psychologists as they make decisions in employment settings?
We ask this question given that the three documents are in many ways static, whereas the field is
dynamic. That is, research is constantly being conducted that provides new knowledge and/or
influences how we interpret behavioral phenomena. For example, it is a commonly accepted fact that the
validity of cognitive ability tests generalizes across situations and jobs (Hunter & Hunter, 1984).
Yet this was not always the accepted fact: in the 1960s, validity was described as “situation
specific” (Ghiselli, 1966). If there had been three sets of sources promulgated by various agencies in
the 1960s, they most likely would have advocated for “situation specificity,” and the practice would
have been to validate tests in every situation for every job. The point of this example is
that perhaps the current sources—Uniform Guidelines, Principles, and Standards—should not be
viewed as authoritative regarding knowledge, but rather as primers for “how to conduct research”
and what factors to consider when determining the validation of a test.
Reliance on the sources for factual information may hamper the field. The documents are not
“living” and thus cannot account for changes due to new research. However, the practitioner or
researcher can rely on the sources with regard to how to establish the validity of a test and what
information is needed as part of the research.
Acceptance of the above premise brings us to the second broad issue. Given that the sources
are relied upon in litigation, whether introduced directly in testimony in court cases or as authority
references when trying to explain to lawyers what research we conduct and how we conduct it, the
question becomes “How sound are the sources as authoritative documents in court proceedings?” This brings
us to the utility of the sources in our practice.
We believe that psychologists know science when it is forthcoming; it is a method of inquiry
and not necessarily a category of knowledge. But, we cannot go into court and pronounce conclu-
sions from our “science” without being required to pass what are known as “Daubert thresholds.”
Daubert thresholds or criteria address what constitutes admissible scientific evidence. The
thresholds came about from the Supreme Court’s ruling in Daubert v. Merrell Dow Pharmaceuticals
Inc. (1993) pertaining to what constitutes expert testimony and scientific evidence, and they have
since been codified in the Federal Rules of Evidence.
Rule 702: If scientific, technical, or other specialized knowledge will assist the trier of fact to understand
the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill,
experience, training, or education, may testify thereto in the form of an opinion or otherwise, if (1) the
testimony is based upon sufficient facts or data, (2) the testimony is the product of reliable principles
and methods, and (3) the witness has applied the principles and methods reliably to the facts of the
case.
Essentially, the courts have determined that our expert opinions must be reliable and valid. But note
two issues: (a) we are not certain the courts are using “reliable” and “valid” as we use them in psychology
and (b) there is lots of room for interpretation (e.g., “sufficient facts,” “reliable principles and methods,”
and “applied … reliably”).
But the Daubert decision provided four factors for the courts to use when assessing the scientific
validity underlying expert testimony: (a) whether the theory or technique can be (and has been)
tested, (b) whether it has been subjected to peer review and publication, (c) the known or potential
rate of error, and (d) the degree of acceptance within the relevant scientific community.
The above is not a checklist, nor is it exhaustive. Different courts have used it differently. The focus
has been on the expert’s principles and methodology and not necessarily on the conclusions that
the principles and methodology generate. This position reinforces our view that the three sources
should be relied upon for the procedural and methodological content and not necessarily for the
knowledge being espoused when the source was written.
Can we work with the Daubert factors? Before answering the question, let us consider some
conclusions from a survey of about 300 judges reported by Landy (2005) that focused on reported
problems with expert testimony, that is, areas where judges experienced discomfort with experts.
Landy reported the problems in descending order of magnitude.
Given that nearly all of the cases underlying the above conclusions involved experts who were
familiar with one or more of the sources covered in this chapter, why would there be such
conclusions? First, the sources may simply have been ignored by the experts. Second, the court may
have discounted the value or authority of the source. Third, the courts may not have gained an
appreciation for the technical details. Finally, the courts may prefer to rely more on common sense
than I-O knowledge. Regardless of the reason, we need to do a better job of conveying our “prin-
ciples” and “standards.”
Given that the reality is that we will continue to have revisions of authoritative sources, what
should those revisions address? The general headings and categories in the sources are appropriate.
What needs to be addressed are specific issues and considerations, among them the fact that the
global economy is changing the way in which we work and with whom we work; accordingly,
future sources should address cultural issues and the changing nature of work. Several specific
needs stand out:
1. The need to consider assessment of individuals with diverse linguistic backgrounds as well
as the need to accommodate test-takers whose first language is not English.
2. The need to consider electronic, Internet, and web-based technology and the fact that the
next generation of workers will likely not have been exposed to the same methods of training,
operating, and performing at work as the current generation. Advanced technology
should provide for greater opportunity to capture actual samples or simulations of job
behaviors than are garnered in paper-and-pencil multiple-choice formats.
3. The need to identify criteria that are relatively focused on more short-term gains than those
that have been used in the past (e.g., tenure in the position for at least 1 year). Businesses
want to cut their losses (such as incorrect “hires”) much more quickly than was common in
the past.
4. The need to recognize that current tests explain, at most, approximately 25% of the variance
in criteria. Although it is appropriate to concern ourselves with searching for additional
predictors, we need to consider ways in which to broaden the criterion space and how to
combine the criteria in such a fashion as to provide a “comprehensive” picture of the worker.
That is, although we can predict to a reasonable degree (15–25% of the variance) how well
entering college students may perform as represented by the criterion of final grade point
average, we need to examine other factors that measure success in college and how these
additional factors can be combined to represent success in the “college experience.”
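The variance figures in the point above follow directly from squaring the validity coefficient: a predictor-criterion correlation of .50 corresponds to 25% of criterion variance explained. A minimal illustration (the coefficients below are illustrative, not taken from the chapter’s sources):

```python
# The proportion of criterion variance a predictor accounts for is the
# square of its validity coefficient (r).
def variance_explained(r: float) -> float:
    """Proportion of criterion variance explained by a predictor with
    validity coefficient r."""
    return r ** 2

# Illustrative (hypothetical) validity coefficients.
for r in (0.30, 0.40, 0.50):
    print(f"r = {r:.2f} -> {variance_explained(r):.0%} of criterion variance")
```

Seen this way, even a strong single predictor leaves most criterion variance unexplained, which is the argument above for broadening both the predictor and the criterion space.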
Authoritative sources that incorporate principles, guidelines, and standards have a valuable role
to play in the science of employment selection. However, the limitations inherent in such sources
must be openly recognized, and to the degree that there are disagreements or conflicts among the
sources, they should be revealed before the sources attain a stature that creates a disservice to
employees, employers, and I-O psychology.
REFERENCES
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association.
Barrett, G. V., Phillips, J. S., & Alexander, R. A. (1981). Concurrent and predictive validity designs: A critical
reanalysis. Journal of Applied Psychology, 66, 1–6.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis.
Personnel Psychology, 44, 1–26.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new
millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9, 9–30.
Bemis, S. E. (1968). Occupational validity of the General Aptitude Test Battery. Journal of Applied Psychology,
52, 240–249.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Callender, J. C., & Osburn, H. G. (1981). Testing the constancy of validity with computer-generated sampling
distributions of the multiplicative model variance method estimate: Results for petroleum industry
validation research. Journal of Applied Psychology, 66, 274–281.
Camara, W. J. (1996). Fairness and public policy in employment testing: Influences from a professional asso-
ciation. In R. S. Barrett (Ed.), Fair employment strategies in human resource management. Westport, CT:
Quorum Books.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 56, 81–105.
Cascio, W. F. (2000). Costing human resources: The financial impact of behavior in organizations (4th ed.).
Cincinnati, OH: Southwestern.
Connecticut v. Teal, 457 U.S. 440 (1982).
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana, IL:
University of Illinois.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281–302.
Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993).
Equal Employment Opportunity Commission, Civil Service Commission, Department of Justice, & Department
of Labor. (1978). Uniform guidelines on employee selection procedures. 29 CFR, 1607.
Equal Employment Opportunity Commission, Office of Personnel Management, Department of Justice,
Department of Labor, & Department of Treasury. (1979). Questions and answers to clarify and provide
a common interpretation of the Uniform Guidelines on Employee Selection Procedures. 44FR, No. 43,
March 2, 1979.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: John Wiley.
Gibson, W. M., & Caplinger, J. A. (2007). Transportation of validation results. In S. M. McPhail (Ed.),
Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 29–81).
San Francisco, CA: John Wiley and Sons.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing. Washington, DC: National
Academy Press.
Hoffman, C. C., Rashkovsky, B., & D’Egidio, E. (2007). Job component validity: Background, current research,
and applications. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leverag-
ing existing validity evidence (pp. 82–121). San Francisco, CA: John Wiley and Sons.
Hogan, J., Davies, S., & Hogan, R. (2007). Generalizing personality-based validity evidence. In S. M. McPhail
(Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp.
181–229). San Francisco, CA: John Wiley and Sons.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–88.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of
Applied Psychology, 85, 869–879.
Johnson, J. W. (2007). Synthetic validity: A technique of use (finally). In S. M. McPhail (Ed.), Alternative
validation strategies: Developing new and leveraging existing validity evidence (pp. 122–158). San
Francisco, CA: John Wiley and Sons.
Landy F. J. (2005). A judge’s view: Interviews with federal judges about expert witness testimony. In F. Landy
(Ed)., Employment discrimination litigation: Behavioral, quantitative and legal perspectives (pp. 503–
572). San Francisco, CA: Jossey-Bass.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed), Educational measurement (3rd ed.). New York, NY: Macmillan.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007).
Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60,
683–729.
Naylor, J. C., & Shine, L.C. (1965). A table for determining the increase in mean criterion scores obtained by
using a selection device. Journal of Industrial Psychology, 3, 33–42.
Nunnally, J.C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw Hill.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job
proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373–406.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach.
Hillsdale, NJ: Lawrence Erlbaum.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Sackett, P. R., & Roth, L. (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on
performance and minority hiring. Personnel Psychology, 49, 549–572.
Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). High-stakes testing in employment, creden-
tialing, and higher education. American Psychologist, 56, 302–318.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of valid selection procedures
on work-force productivity. Journal of Applied Psychology, 64, 609–626.
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of person-
nel selection procedures (4th ed.). Bowling Green, OH: Author.
Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.
Wainer, H., & Braun, H. I. (Eds.). (1988). Test validity. Hillsdale, NJ: Lawrence Erlbaum.
29 A Sampler of Legal Principles in Employment Selection
INTRODUCTION
We are very pleased that we have been asked to contribute a chapter on legal principles for this
volume. For almost 40 years, principles of employee selection—practice and legal—have been illu-
minated through employment litigation opinions. Granted, judges write opinions, but these judges
depend heavily on industrial-organizational (I-O) psychologists in framing these opinions and are
generous in their citations to our work and our testimony. So we see a clear role for employment
litigation knowledge bases in this volume. In the context of this role, we have been asked to identify
some selection practices that are commonly at issue in employment discrimination cases. Further,
we have been asked to formulate some “principles” that the practitioner might keep in mind when
attempting to minimize legal risk when using such practices. Finally, we have been asked to identify
exemplar court cases related to each principle we have formulated. Here it is important to make a
distinction between our current use of the term “principle” as simply a basic and generally accepted
combination of science and practice and “the” Principles promulgated by the Society for Industrial
and Organizational Psychology (SIOP; 2003), which are currently in their fourth edition. The SIOP
Principles are not intended to illuminate exclusively legal issues, whereas our principles are. The
SIOP Principles do not ignore legal foundation but have much broader applications in mind and are
considerably greater in scope than the current “legal” principles. Neither are we intending to address
the “best” practice from a theoretical or empirical perspective. Instead, we are simply articulating
what might be seen as a “defensible” practice from the legal perspective. Nevertheless, the principles
we present are general rather than specific and are well documented through years of litigation.
Like other chapter authors in this handbook, we are constrained by allotted pages. Therefore,
we had to be selective in formulating principles and providing underlying case law. As a result, we
do not consider these principles to be exhaustive, or possibly even the most important, from a psy-
chometric or theoretical sense. Although we would consider these principles more rather than less
important than others that might have been chosen, we acknowledge that there may be other equally
important principles that we might have chosen to articulate: so many practices, so few pages.
The principles we have chosen to present either deal with a general practice (e.g., work analysis,
adverse impact), a statute (e.g., ADA, CRA, ADEA), a protected group (e.g., gender, race, age, dis-
ability), or some combination of those factors. We have not included principles that directly address
some statutes (e.g. the Equal Pay Act, Fair Labor Standards regulations); again, not because they are
trivial, unimportant, or not within the domain of I-O psychology, but because of page limitations
and distance from the general focus of this book—selection decisions. As a matter of convention
1 We are saddened by the loss of our good friend and colleague Frank J. Landy. He was a great mentor to us and a major
contributor to our field who will be missed more than words can possibly say. We dedicate this chapter to his memory. He
will live forever in our hearts.
for this chapter, we have chosen to use the term “work analysis” rather than the more historically
common term “job analysis.” This is in keeping with the evolution of the study of work behavior and
compatible with the structure and content of Chapter 4, this volume.
The employment selection practices we have chosen to address in the 14 principles below repre-
sent “terms and conditions” of employment as defined in key federal laws. Terms and conditions are
broadly defined to include recruitment, hiring, training, promotion, benefits, and termination. The
term “law” is broadly defined as including statutes, executive orders, and constitutional amendments.
Broader coverage of these Equal Employment Opportunity (or EEO) laws is provided by Gutman
(2000), Landy (2005), and Outtz (2009). As a group, these EEO laws proscribe discrimination in the
workplace on the basis of race or color, religion, sex, national origin, age, and/or disability.
Title VII of the Civil Rights Act of 1964 is the most critical of the statutes. It includes proscrip-
tions relating to the largest number of protected classes (race or color, religion, sex, and national
origin) and the full gamut of terms and conditions of work. Title VII also provided a critical model
for later statutes, including the Age Discrimination in Employment Act of 1967 (or ADEA) and the Americans
with Disabilities Act of 1990 (or ADA), which address the full gamut of terms and conditions for
age and disability, respectively. The ADA updates the Vocational Rehabilitation Act of 1973 (or
Rehab-73). Finally, the Equal Pay Act of 1963 (or EPA) is a more surgical statute that proscribes
only wage discrimination on the basis of sex, which is also proscribed in Title VII.
Among other relevant laws, Executive Order 11246 (or EO 11246) contains rules for contrac-
tors doing business with the federal government. These contractors must agree to provide affirma-
tive action for minorities and women if there are discrepancies between their representation in the
employer’s workforce and qualified workers in the labor pool. It is critical to note that the primary
requirement for affirmative action in EO 11246 is recruitment and outreach, and there is no require-
ment (in fact, it is illegal under Title VII and other laws) to make actual selection decisions only on
the basis of race or gender. The key constitutional amendments governing employment decisions are
the 5th and 14th Amendments, which apply to federal and state employers, respectively, and which
have been used extensively in so-called “reverse discrimination” lawsuits on the basis of race or sex
and provide overlapping coverage with Title VII. Additional overlapping coverage with Title VII
is represented in the 13th Amendment, which applies to the full gamut of terms and conditions of
employment as in Title VII, with the exception of adverse impact challenges. Critically, although
each of these amendments preceded Title VII, no one ever thought to use them in workplace
discrimination cases until after Title VII, not only for overlapping coverage, but also because of gaps
in Title VII that do not apply to constitutional claims, such as the minimum number of employees
required in Title VII (N = 15); no such minimum workforce size exists for constitutional claims.
Some final but important points to note are that each of the aforementioned statutes, except for
Rehab-73, requires that claims of discrimination be processed by the Equal Employment Opportunity
Commission (EEOC). In comparison, constitutional claims can go directly to federal district court.
Additionally, the Office of Federal Contract Compliance Programs (OFCCP) of the Department of Labor
regulates EO 11246 in the nonfederal sector, the EEOC regulates EO 11246 in the federal sector,
and the EEOC and Merit Systems Protection Board (MSPB) share responsibilities for federal claims
in Title VII and the ADEA.
Because this entire volume is dedicated to the concept of “selection,” some of our principles may
appear to stray from that mark. We construe selection to encompass many personnel actions. These
include but are not limited to applicant screening, hiring, promotion, selection to training experi-
ences that are gateways to promotion, and layoffs (or deselection). We believe that selection should
be broadly and not narrowly construed because in each case a personnel decision is being made that
implies a direct or indirect movement (or lack thereof) of an applicant or a current employee. Finally,
in the context of a selection or promotion decision, we will occasionally comment on a co-relevant
issue such as compensation or training. It is not that we intend to turn the focus to compensation, but
that often compensation decisions are made at the same time as selection or promotion decisions and
deserve comment.
A Sampler of Legal Principles in Employment Selection 629
The precise definition of the term “applicant” depends upon the user’s recruitment and selection
procedures. The concept of an applicant is that of a person who has indicated an interest in being
considered for hiring, promotion, or other employment opportunities.
This definition was rendered obsolete by recent developments in Internet recruitment, because it
is easy to indicate interest in many jobs at the same time. Therefore, the EEOC appended Q15 on
March 4, 2004 (FR Doc 04-4090) to require more than an expression of interest. Additionally, the
OFCCP issued separate guidance on March 29, 2004 (41 CFR Part 60-1).
The OFCCP focuses on record keeping for EO 11246 (emphasizing recruitment), whereas the
UGESP focuses on selection (emphasizing adverse impact in Title VII). Because our focus is also
on selection, the following discussion relates to the EEOC modification to Q15. Readers should note
that SIOP’s Professional Practice Committee has provided a detailed evaluation of the EEOC and
OFCCP documents (Reynolds, 2004).
The EEOC answered five new questions. Two of the answers clarify that Internet methodolo-
gies are covered by laws such as Title VII and by the UGESP, and two others clarify that Internet
search criteria are subject to adverse impact rules and that tests administered online are subject to
the UGESP. The last and most critical of the five answers defines the term “Applicant” using the
following three prongs: (1) the employer has acted to fill a particular position, (2) the individual has
followed the employer’s standard procedures for submitting applications, and (3) the individual has
indicated an interest in the particular position.
For prong 1, the EEOC cites a company seeking two hires from among 200 recruits in a database. If
100 recruits respond to employer inquiries and 25 are interviewed, all 100 responders are applicants
and all 100 nonresponders are not applicants.
For prong 2, the EEOC cites two examples: (a) if employers require completion of an “online
profile,” only those completing the profile are applicants; and (b) if employers e-mail job seekers
requesting applications, only those who respond are applicants.
The prong 3 answer clarifies that individuals are not applicants if they (a) only post a resume, (b)
express interest in several potential jobs, or (c) follow prong 2 for a job other than those the employer
acts on (prong 1).
In Parker v. University of Pennsylvania (2004), for example, summary judgment was granted to the
university because Parker never expressed interest in a specific posted job. Parker also claimed
adverse impact, but that claim was dismissed because he could provide no basis for
assuming he was harmed. It should be noted that as a standard university policy, recruiters routinely
searched the database and forwarded resumes to hiring officers of individuals who expressed inter-
est in a posted job and who met the minimum qualifications (MQs) for that job.
Interestingly, the district court judge ruled that Parker satisfied the four-prong prima facie test
for disparate treatment from McDonnell Douglas v. Green (1973) in that he (a) was a protected class
member, (b) applied for a position for which he was qualified, (c) was not hired, and (d) positions
remained open for which Parker was qualified. However, the judge also ruled that the university
articulated a legitimate nondiscriminatory reason for not hiring him and that Parker could not prove
that the articulation was a pretext for discrimination. Accordingly:
Defendant claims to have failed to consider Parker’s application because of its resume reviewing
procedures: The reason Penn did not hire Parker is because he had not applied for any specific position,
and it did not conduct any search of the resumes of non-applicants for positions being filled. (Parker v.
University of Pennsylvania, 2004)
There are several things to note about this ruling. Although Parker would have been considered
an applicant under the prior rules, under the new rules Parker should not be considered an
“Applicant” and therefore should lose at the prima facie level. Nevertheless, even nonapplicants may
file valid adverse impact claims if they are “chilled” by MQs such as education requirements that
disproportionately exclude minorities (Griggs v. Duke Power, 1971; Albemarle v. Moody, 1975) or
physical requirements such as height and weight that disproportionately exclude females (Dothard
v. Rawlinson, 1977).
2. Adverse impact may appear at any or all stages of a multistage selection or promotion
decision process. It should be examined at every point at which it may appear. If there is no
adverse impact at a particular step, there is no requirement that a selection procedure repre-
sented in that step be job-related. MQs represent a “stage” in a selection process and should
be considered as vulnerable to challenge as a traditional predictor would be.
The employment process is often more complex and staged than it might appear at first glance.
Consider the following steps that might characterize a hire: (1) review of MQs and self-selection
into the applicant pool, (2) submission of an application blank, (3) preliminary screening tests,
(4) a more extensive assessment test, (5) interviews, and (6) job offers.
Each of these process steps implies a decision on the part of the employer. The MQ review assumes
that the applicant will self-select in or out of the subsequent process. The application blanks are
reviewed and a subset of applicants is invited to complete some preliminary screening tests. A
subset of those applicants is invited to complete a more extensive assessment test. A subset of those
applicants is invited to participate in interviews. A subset of those applicants receives job offers.
What has been referred to as a “bottom-line” analysis of adverse impact would compare those
individuals offered a position (step 6) with those individuals who applied (step 2). Nevertheless,
adverse impact could occur as early in the process as the consideration of MQs (i.e., the self-
selection step). If the MQs are not job related, then a case could be made that adverse impact should
be evaluated at step 1 (although it would be difficult to identify those potential applicants who
self-selected out of the process). Therefore, in a multistage hiring process (also known as a multiple
A Sampler of Legal Principles in Employment Selection 631
hurdle process), adverse impact should be considered at each stage of the multistage process and not
simply at the last step.
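The per-stage analysis described above can be sketched in a few lines. The counts below are hypothetical, and the four-fifths rule of thumb from the UGESP is used only as a screening flag, not as a legal standard in itself:

```python
# Per-stage adverse impact check for a multiple-hurdle process.
# Each stage maps group -> (number considered, number passing); all counts
# are hypothetical.
stages = [
    ("MQ review",      {"majority": (500, 400), "minority": (450, 350)}),
    ("Screening test", {"majority": (400, 200), "minority": (350, 120)}),
    ("Assessment",     {"majority": (200, 100), "minority": (120, 60)}),
]

impact_ratios = {}
for name, groups in stages:
    rates = {g: passed / considered for g, (considered, passed) in groups.items()}
    ratio = rates["minority"] / rates["majority"]       # adverse impact ratio
    impact_ratios[name] = ratio
    flag = "below 4/5" if ratio < 0.8 else "ok"         # UGESP rule of thumb
    print(f"{name}: impact ratio = {ratio:.2f} ({flag})")
```

In this fabricated funnel, only the screening-test stage falls below the four-fifths flag — exactly the information a bottom-line comparison of final hires against initial applicants would blur together.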
CRA-91 addresses the separability of selection components directly:
The complaining party shall demonstrate that each particular challenged employment practice causes
disparate impact, except that if the complaining party can demonstrate to the court that the elements
of a respondent’s decision making processes are not capable of separation for analysis, the decision
making process may be analyzed as one employment practice.
In summary, when combined with subsequent case law and CRA-91, Connecticut v. Teal (1982)
implies that employers should not only examine individual components of a multiple hurdle for
adverse impact but
also offer transparency on the components of the selection process. If they do not adopt this
transparency, they may be faced with the prospect of defending an entire selection process if there
is bottom-line adverse impact.
3. The search for alternatives to procedures that might result in or have resulted in lesser
adverse impact is split between defendants and plaintiffs. At the outset of a selection project,
defendants should consider alternatives that will meet business needs, be job-related, and minimize
adverse impact. After a demonstration of job-relatedness, the plaintiffs have the burden of
demonstrating that there is an alternative procedure that retains the same or similar levels of
job-relatedness but shows lesser adverse impact.
In years past, for both of these stages (initial and post-job-relatedness), the search for alternatives
may have been considered an overly burdensome task. However, the I-O literature has
expanded significantly over the past decade with regard to decision-making tools and strategies for
assessing the tradeoffs between validity and adverse impact. As an example, De Corte, Lievens,
and Sackett (2007) investigated the possibility of combining predictors to achieve optimal tradeoffs
between selection quality and adverse impact. This is a follow-up to earlier research in this area.
De Corte (1999) and Sackett and Ellingson (1997) investigated the effects of forming predictor
632 Handbook of Employee Selection
composites on adverse impact. Other researchers have used meta-analytic techniques to forecast
the likely outcomes of combining high and low adverse impact predictors in a selection proce-
dure (Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997; Bobko, Roth, & Potosky, 1999). Thus,
employers and their consultants (in-house or external) should become familiar with this literature to
meet their burden of demonstrating a search for alternatives at the outset of a selection project.
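The flavor of these tradeoff analyses can be shown with the standard formulas for a weighted composite of two standardized predictors. All numbers below (validities, standardized subgroup differences d, and the predictor intercorrelation) are hypothetical:

```python
import math

def composite_stats(w1, w2, v1, v2, d1, d2, r12):
    """Validity and standardized subgroup difference (d) of the composite
    w1*x1 + w2*x2 of two standardized predictors with intercorrelation r12."""
    sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * r12)
    return (w1 * v1 + w2 * v2) / sd, (w1 * d1 + w2 * d2) / sd

# Predictor 1: e.g., a cognitive test (higher validity, larger subgroup d)
# Predictor 2: e.g., a personality scale (lower validity, smaller d)
for w1 in (1.0, 0.5, 0.0):
    v, d = composite_stats(w1, 1 - w1, v1=0.50, v2=0.30, d1=1.0, d2=0.2, r12=0.20)
    print(f"weight on predictor 1 = {w1:.1f}: validity = {v:.2f}, d = {d:.2f}")
```

With these particular numbers, the equal-weight composite is slightly more valid than either predictor alone while substantially reducing the subgroup difference; finding weights that make such tradeoffs optimal is the problem De Corte, Lievens, and Sackett treat formally.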
After the fact, once adverse impact has been shown and defendants have adequately made the
job-related argument, it is the plaintiffs’ burden to demonstrate a feasible alternative to that job-
related process that would have lesser adverse impact while retaining job relatedness. One of the
interesting aspects of this requirement is that plaintiffs can meet their burden by showing that there
is an alternative method of using the selection procedure (e.g. an alternative method of weighting
predictor components) that would have less adverse impact without affecting validity. Section 5(G)
of the UGESP states the alternative method of use requirement:
The evidence of the validity and utility of a selection procedure should support the method the
user chooses for operational use of the procedure, if that method of use has a greater adverse impact
than another method of use.
This principle was established in Albemarle v. Moody (1975). The Supreme Court ruled that if
job-relatedness is proven, the plaintiff may prove pretext by showing that “other tests or selection
devices, without a similarly undesirable racial effect, would also serve the employer’s legitimate
interest in ‘efficient and trustworthy workmanship’.” This principle was subsequently written into
the UGESP and later codified in CRA-91.
single ability alone. Johnson is the most intriguing ruling because it is the only one in which alterna-
tives decided the case. As a caveat, all three are district court rulings, and readers should stay abreast
of these and related rulings as they are appealed.
4. Work analysis should precede any selection process. The shelf life of a work analysis
will normally run 5–8 years, in the absence of any information that the job has changed
substantially.
This is one point about which there is little disagreement. The SIOP Principles, the APA
Standards, and the UGESP all clearly indicate that a demonstration of the validity of a selection
device begins with a thorough analysis of the job if the criterion of interest is job performance.
Nevertheless, if the criterion of interest is turnover or absenteeism, a full work analysis is not
necessary.
Work analysis refers to a detailed examination of a job, which may include the determination of
what is done, how it is done, the context in which it is done (including the strategic importance of
that job and its essential functions to organizational success), as well as the knowledge, skills, abili-
ties and other characteristics (KSAOs) required to perform the job successfully. The most common
“O” characteristic is some measure of personality. Work analysis is also important in demonstrating
criterion-related validity because it establishes the importance or relevance of the criterion measure.
Finally, work analysis is important for demonstrating content validity because it establishes fidelity
of the test content to the job content.
Proper work analysis is a critical requirement to meet professional standards, particularly for
content-oriented validation. Work analysis should establish the linkage between important func-
tions/characteristics of the job and KSAOs, and it should form the basis for linking a selection
device to those KSAOs, thus linking the selection device to the job. The importance of work analy-
sis to validation, particularly content validation, is clearly established in the UGESP. Section 14(A)
provides the following guidance regarding the importance of work analysis:
Validity studies should be based on review of information about the job. Any validity study should be
based upon a review of information about the job for which the selection procedure is to be used. The
review should include a work analysis.

Section 14(C)(2) is also instructive:
There should be a work analysis which includes an analysis of the important work behavior(s) required
for successful performance and their relative importance and, if the behavior results in work product(s),
an analysis of the work product(s). Any work analysis should focus on the work behavior(s) and the
tasks associated with them. If work behavior(s) are not observable, the work analysis should identify
and analyze those aspects of the behavior(s) that can be observed and the observed work products. The
work behavior(s) selected for measurement should be critical work behavior(s) and/or important work
behavior(s) constituting most of the job.
The UGESP language echoes Griggs, in which the Supreme Court endorsed the EEOC’s definition
of a professionally developed ability test as one that
[F]airly measures the knowledge or skills required by the particular job or class of jobs which the
applicant seeks, or which fairly affords the employer a chance to measure the applicant’s ability to
perform a particular job or class of jobs. (Griggs v. Duke Power, 1971, footnote 9)
The Albemarle Paper Company attempted to defend cognitive tests with a deficient criterion
validity study conducted one month prior to trial. The most critical deficiency was the absence of a
work analysis. In Moody v. Albemarle (1973), the 4th Circuit ruled:
In developing criteria of job performance by which to ascertain the validity of its tests, Albemarle failed
to engage in any work analysis. Instead, test results were compared with possibly subjective ratings of
supervisors who were given a vague standard by which to judge job performance. Other courts have
expressed skepticism about the value of such ill-defined supervisor appraisals.
The Supreme Court affirmed the 4th Circuit ruling, and the Griggs and Albemarle rulings formed
the basis of the 1978 UGESP.
The role of work analysis in content validity was established in Guardians v. Civil Service (1980)
and has since been followed by every circuit that has ruled on content validity. Relevant cases for
content validity are discussed under Principle 5. On the basis of Guardians and other cases, it is
axiomatic that content validity cannot be established absent work analysis.
Case law also reveals that work analysis is critical in establishing MQs. For example, in
Brunet v. City of Columbus (1993), criterion validity data showed that cognitive skills were
more predictive of firefighter performance than physical skills. Only physical skills produced
adverse impact (for females). Nevertheless, physical skills were justified based on work analysis
data showing that they distinguished between average and superior firefighters, whereas cognitive
skills did not. Additionally, as we will discuss under Principle 5, work analysis is critical for
establishing transportability, and as we will discuss under Principle 8, work analysis is also critical
for establishing cutoff scores.
Finally, the concept of a “shelf life” of a work analysis was endorsed in Bradley v. City of Lynn
(2006), in which a 1992 work analysis was offered to support a 2004 criterion validity study. The
judge rejected the defendant’s argument that the work analysis was sufficiently current to inform
test development. Experts for the plaintiffs and defendants agreed that a work analysis should be
at least revisited every 5–8 years. One reason for revisiting a work analysis every 5–8 years is that
although the basic structure of a job (e.g., the major duties and responsibilities) may not change over
that period, the relative emphasis placed on specific responsibilities and related work behaviors
may change.
5. Validation evidence can come in many forms and is not limited to one particular approach.
In general, the greater number of converging sources of validation evidence, the better.
In the period immediately following passage of the Civil Rights Act of 1964, many considered
criterion-related validity the most appropriate design for determining job-relatedness of assessment
devices or processes. This was mainly due to the 1970 EEOC Guidelines requiring that employ-
ment tests be “predictive of or significantly correlated with important elements of work behavior.”
Subsequently, the 1978 UGESP added content and construct-related validation to the arsenal of
tools for demonstrating job-relatedness, thus finally cementing the “Trinitarian” view of validation,
which had been introduced in the psychometric literature in the 1950s. However, Sec. 1607.C(1) contained
the following warning on “appropriateness of content validity studies”:
A selection procedure based on inferences about mental processes cannot be supported solely or
primarily on the basis of content validity. Thus, a content strategy is not appropriate for demonstrating
the validity of selection procedures that purport to measure traits or constructs such as intelligence,
aptitude, personality, common sense, judgment, leadership, and spatial ability.
For a short time, this passage perpetuated the myth that criterion-related validity is superior
to content-related validity. By the mid-1980s there was a growing consensus that discussions of
“acceptable” models for validation were inappropriate because virtually any validation evidence,
regardless of how it is gathered, is possible evidence of job relatedness on its merits rather than by
its name. This “Unitarian” view is now widely accepted by I-O psychologists and the courts.
Furthermore, in recent years, there have been several new and powerful techniques within and
beyond the “Trinity,” including:
• Validity transport: Transporting criterion-related validity to a novel setting where the jobs
in question can be seen to be highly similar
• Synthetic validity: Aggregation of data from different settings that address subsets of
predictor and/or criterion variables
• Construct validation: The use of powerful statistical techniques such as structural equation
modeling or path analysis to test theories of predictor-criterion associations
• Meta-analysis: Collecting validity coefficients from many settings and correcting for
statistical artifacts and possible moderator variables to estimate the mean and variance
of the validity coefficients after correction for relevant artifacts
• Validity generalization: Demonstrating that the confidence interval for the true validity
coefficient does not include r = .00 by assembling data from a wide variety of job titles,
variations on a predictor type, and criteria
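As a bare-bones illustration of the meta-analytic entry above, the sketch below computes a sample-size-weighted mean validity and asks how much of the observed variance sampling error alone would produce (a Hunter–Schmidt-style calculation; the coefficients and sample sizes are invented):

```python
# (observed validity coefficient, sample size) from five hypothetical studies
studies = [(0.25, 120), (0.31, 80), (0.18, 200), (0.35, 60), (0.22, 150)]

total_n = sum(n for _, n in studies)
mean_r = sum(r * n for r, n in studies) / total_n
obs_var = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
# Large-sample sampling-error variance of a correlation, averaged across studies
err_var = sum(n * (1 - mean_r ** 2) ** 2 / (n - 1) for _, n in studies) / total_n
residual_var = max(obs_var - err_var, 0.0)
print(f"mean r = {mean_r:.3f}, residual SD = {residual_var ** 0.5:.3f}")
```

If the residual variance is near zero, variation across studies is attributable to sampling error, which is the core of the validity generalization argument; corrections for unreliability and range restriction would be layered on top of this calculation.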
In Guardians, although the defendants could not support a cut score or rank ordering because of
weaknesses in steps 2 and 3 (see Principle 8 below), they were able to use a content validation
strategy for demonstrating job relatedness. In Gillespie v. Wisconsin (1985), the 7th Circuit ruled
“neither the Uniform Guidelines nor the psychological literature express a blanket preference for
criterion-related validity” and in Police Officers v. Columbus, Ohio (1990), the 6th Circuit, citing
the 1987 SIOP Principles, ruled that it is critical that “selection instruments measure a substantial
and important part of the job reliably, and provide adequate discrimination in the score ranges
involved.” Content validity has been supported in many subsequent cases, including the 2nd Circuit
in Gulino v. NY State (2006).
Attempts to support job-relatedness on the basis of meta-analysis alone have often failed (EEOC
v. Atlas, 1989; Lewis v. Chicago, 2006). Nevertheless, meta-analysis has been credited as supple-
mental to local validity studies (Adams v. Chicago, 1996; Williams v. Ford Motor, 1999).
Also, transportability and upward score adjustments were supported in Bernard v. Gulf Oil (1989),
in which criterion validity was found for two of five jobs, and the 5th Circuit found “sufficient simi-
larity in the skills required” for all five jobs. The Court also ruled “the adjusted figures … are better
estimates of validity” and uncorrected coefficients “underestimate” validity of the tests.
In summary, although any of the “Trinity” approaches alone is likely sufficient for job-
relatedness, the body of case law as a whole suggests the “Unitarian” approach may have greater
acceptance in court and that new methods of supporting validity are at least acceptable supplements
to traditional methods.
In summary, irrespective of whether the issue is hiring, promotion, termination, or the criterion
in a criterion validity study, the Albemarle, Watson, and Meacham rulings carry inherent warnings
that performance appraisals should be based on work analysis and that the methodology used to
appraise worker performance should meet acceptable psychometric principles.
These guidelines demand that employers using tests have available data demonstrating that the test is
predictive of or significantly correlated with important elements of work behavior which comprise or
are relevant to the job or jobs for which candidates are being evaluated.
However, the Supreme Court did not specify an alpha level for significance, leaving this issue to the
lower courts.
The 5th Circuit was the first appeals court to pin down the alpha = .05 level for validity coeffi-
cients in U.S. v. Georgia Power (1973). Citing the “95% Guideline” from the 1970 EEOC Guidelines,
the 5th Circuit ruled that “this guideline requires it be shown that there was no more than 1 chance
out of 20 that job relatedness appeared by coincidence.” The 5th Circuit struck down the chal-
lenged tests because the expert’s validation study was deemed a product of “chance not science.”
Subsequently, in Boston NAACP v. Beecher (1974), the 1st Circuit struck down a nonsignificant
validity coefficient of r = +.078, ruling that “were this statistically significant, it would indicate
that the test grade ‘explained only approximately 0.6% of all observed variance in fire fighters’
on-the-job performance.’” Because the value was nonsignificant, it was not necessary for the Court
to address any corrections for unreliability or range restriction. All courts have since generally
accepted alpha = .05 as a minimum requirement for the validity coefficient.
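The arithmetic behind these rulings is easy to reproduce. The sketch below tests a validity coefficient against alpha = .05 using the Fisher z approximation (the Beecher opinion's sample size is not given here, so n = 100 is an assumption for illustration):

```python
import math

def correlation_significance(r, n):
    """Two-sided p-value for H0: rho = 0 (Fisher z approximation) and the
    proportion of criterion variance explained (r squared)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p, r * r

# The Beecher coefficient: r = +.078 explains only ~0.6% of the variance.
p, r2 = correlation_significance(0.078, 100)
print(f"p = {p:.2f}, variance explained = {r2:.1%}")
```

Note that significance depends on n as well as r: the same r = .078 reaches alpha = .05 only at samples of roughly n = 630 or more, which is part of why the courts came to distinguish statistical from practical significance.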
The Beecher Court was also the first appeals court to distinguish between statistical significance
(“likely to occur by chance in fewer than five of one hundred similar cases”) and practical signifi-
cance (“correlation of ± 0.3 or higher, thus explaining 9% or more of the observed variation”). Similar
distinctions were made by the 5th Circuit in Ensley NAACP v. Seibels (1980), by the 9th Circuit in
Clady v. Los Angeles (1985), and by the 11th Circuit in Hamer v. Atlanta (1989). The Clady Court also
noted that “the greater the test’s adverse impact, the higher the correlation which will be required.”
More recently, in Lanning v. SEPTA (1998), the Eastern District Court of Pennsylvania accepted
expert testimony that “practical significance is found through regression analysis and expectancy
tables” as opposed to corrected correlations that exceed the r = .30 standard. Using a regression
analysis with multiple predictors and statistical controls, the expert demonstrated that scores on an
aerobic test above the minimum cutoff were associated with a 10% increase in arrests for serious
crimes and a 4% increase in arrests for all crimes for transit authority police officers. As noted
under Principle 8 (immediately below), the 3rd Circuit ultimately affirmed the district court in
Lanning, but in doing so, it established a new standard for cut-off scores.
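An expectancy table of the kind credited in Lanning can be sketched in a few lines: compare criterion success rates above and below a candidate cut score (the scores and outcomes below are fabricated for illustration):

```python
# (predictor score, performed successfully?) for ten hypothetical candidates
data = [
    (35, False), (42, False), (48, True), (51, False), (55, True),
    (58, True), (62, True), (66, False), (71, True), (78, True),
]
cut = 50

above = [ok for score, ok in data if score >= cut]
below = [ok for score, ok in data if score < cut]
print(f"success rate at/above cut: {sum(above) / len(above):.0%}")
print(f"success rate below cut:    {sum(below) / len(below):.0%}")
```

A large gap between the two rates is the kind of practical-significance evidence the expectancy-table argument rests on, independent of whether a corrected correlation clears the r = .30 threshold.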
8. Cut scores should be based on a rational foundation that may or may not include empirical
analyses.
There are several types of cut scores. The first is a nominal cut score. This is a value often
established as an arbitrary pass-fail score in a multiple hurdle system. It designates the threshold
for continuing to the next stage in the process. The second is known as the effective cut score. This
is the score below which no one is hired or appointed. It is not predetermined but simply identi-
fied after the fact. This is most often seen in strict rank-order appointment where candidates are
appointed from the top scorer down until all positions are filled. The score of the individual filling
the last opening is the effective cut score. The third type of cut score is the critical cut score. This is
a value that has been chosen to represent the score below which an applicant is thought to fall below
a minimal standard for effective job performance. Often, a critical cut score is tied to the safety or
well-being of the candidate or the public/customer base.
Nominal cut scores are often set by combining practical and theoretical issues. A practical con-
sideration might be the cost of testing. Assume there are 1,000 applicants for 20 openings and there
will be a multistage assessment process. The employer might want to limit the cost of testing by
eliminating many individuals at the first stage of the process. In this case, the cut score can be set
by looking at the selection ratio and deriving some estimate of acceptable performance expectations
for candidates.
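That practical calculation can be sketched directly. The score distribution below is simulated, and passing 200 of 1,000 applicants to the next stage is an assumed testing budget rather than a derived figure:

```python
import random

random.seed(1)
scores = [random.gauss(50, 10) for _ in range(1000)]  # hypothetical first-stage scores

budget = 200  # how many candidates later, costlier stages can absorb
nominal_cut = sorted(scores, reverse=True)[budget - 1]  # score of the 200th-best

passed = sum(s >= nominal_cut for s in scores)
print(f"nominal cut = {nominal_cut:.1f}; {passed} of {len(scores)} advance")
```

The resulting cut is nominal in exactly the sense described above: it marks who continues to the next hurdle, and any performance expectations attached to it still require their own justification.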
There is no need to “set” an effective cut score. It is determined solely by
the score of the last person hired or appointed from a rank-ordered list of candidates. With respect
to a critical cut score, there are several options available, but all require some consideration of a
criterion level that distinguishes between competent and incompetent or minimally qualified and
less-than-minimally qualified. Subject matter experts can estimate requisite predictor performance
and associated criterion performance. This, of course, would benefit greatly from a comprehensive
and accurate work analysis. Incumbent populations can be used to identify predictor scores associ-
ated with minimally acceptable performance. Regardless of which techniques are used to set the
critical cut score, there should be a rational foundation for its choice, or an empirical one based on
a work analysis. Although it is axiomatic that a single score (i.e., a cut score) cannot be “validated,”
it is possible to produce evidence that provides some confidence that a critical cut score is related to
an anticipated real-world outcome.
For example, in Guardians v. Civil Service (1980) and Gillespie v. Wisconsin (1985), two early
cases discussed under Principle 5, the 2nd and 7th Circuits ruled that
An employer may establish a justifiable reason for a cut-off score by, for example, using a professional
estimate of the requisite ability levels, or, at the very least by analyzing the test results to locate a logical
break-point in the distribution of scores.
Both studies employed content validity strategies; however, the defendants lost on the cut-off score
issue in Guardians but won in Gillespie.
In Guardians, the defendants made selections in strict rank order and defined the cutoff as the
score of the last applicant selected (an effective cut score). The 2nd Circuit ruled:
If it had been shown that the exam measures ability with sufficient differentiating power to justify
rank-ordering, it would have been valid to set the cutoff score at the point where rank-ordering filled
the City’s needs . . . . But the City can make no such claim, since it never established a valid basis for
rank-ordering.
However, in Gillespie, the 7th Circuit ruled in favor of the defendant on the basis of two factors:
(a) establishment of interrater reliability and (b) the cutoff was selected to permit interviewing “as
many minority candidates as possible while at the same time assuring that the candidates possessed
the minimum skills necessary to perform” the job. On rank ordering, the 7th Circuit cited verbatim
from Q62 of the “Questions and Answers” of the UGESP.
Use of a selection procedure on a ranking basis may be supported by content validity if there is evidence
from work analysis or other empirical data that what is measured by the selection procedure is associated
with differences in levels of job performance.
Therefore, up until the Lanning ruling, there was little dispute among the circuit courts on either
cutoff scores or rank ordering.
In Lanning, the 3rd Circuit interpreted the terms “job-related” and “consistent with business
necessity” from the Civil Rights Act of 1991 as implying separate standards and that the “business
necessity” part implied proof that the cut-off score “measures the minimum qualifications neces-
sary for successful performance of the job in question.” This interpretation was explicitly rejected
in Bew v. City of Chicago (2001), where the 7th Circuit ruled, “Griggs does not distinguish business
necessity and job relatedness as two separate standards.” Although most of the other circuits have
continued to follow precedents from Guardians and Gillespie, in one recent case (Isabel v. City of
Memphis, 2005), the 6th Circuit adopted the Lanning standard.
In summary, the basis for establishing cut-off scores is evolving, and interested readers need to
keep abreast of rulings in this area.
performance are the focus, cognitive ability tests (instruments that typically have high adverse impact)
result in the highest validity for a single predictor. However, if overall organizational effectiveness is
the objective, factors such as legal defensibility, strategic positioning within the marketplace, employee
performance, workforce diversity, and corporate social responsibility must all be considered.
The bottom line is that the organization’s mission and values dictate what constitutes acceptable
performance and, ultimately, the best methods of achieving that performance. This argument was
made quite forcefully in the University of Michigan Law School admission case. The University of
Michigan took the position that its objective was to admit a first-year class that collectively advanced
the Law School’s overall mission as opposed to simply admitting each student with the highest prob-
ability of achieving a given law school grade point average (GPA). Whether one agrees with the
school’s stated mission or not, once formulated, that mission basically defines job-relatedness. It
also drives the types of strategies that are most likely to achieve an optimum balance between job
relatedness and reduction of adverse impact.
The 7th Circuit applied Grutter in Petit v. City of Chicago (2003):
It seems to us that there is an even more compelling need for diversity in a large metropolitan police
force charged with protecting a racially and ethnically divided major American city like Chicago.
Under the Grutter standards, we hold, the city of Chicago has set out a compelling operational need for
a diverse police department.
The 7th Circuit then upheld out-of-rank promotions of minority applicants on the grounds that the
practice was narrowly tailored.
It should be noted that the key ingredient in a successful diversity argument by defendants is prov-
ing to the satisfaction of the courts that diversity is an important job-related factor that furthers the
mission of the organization. Diversity for diversity’s sake, therefore, will not work. For example,
in Lomack v. City of Newark (2006), a newly elected mayor transferred firefighters to and from
various posts so that all 108 fire stations were racially diverse. The mayor felt there were “educa-
tional and sociological” benefits for such a “rainbow,” but the 3rd Circuit saw it as “outright racial
balancing.” Similar rulings were rendered in Biondo v. City of Chicago (2004) and Rudin v. Lincoln
Land Community College (2005).
10. Many of the same practices that define responsible selection define responsible downsizing
efforts.
Downsizing (also known as RIFs, or reductions-in-force) represents a special instance of selec-
tion. Individuals are selected to “stay” in an organization (or conversely to “leave” an organization).
It might be thought of as “deselection.” It represents the mirror image of the selection scenario.
However, there are some unique aspects to downsizing. In the context of litigation, the employer
must be prepared to show that the reduction in force was not a pretext for simply eliminating mem-
bers of a protected group (e.g., female, minority, and/or older employees). Thus, the reduction in
force should be tied to a larger business plan that documents the need for the reduction. This busi-
ness plan should also support why certain jobs, departments, or divisions have been targeted for the
force reduction.
As is the case in many selection/promotion scenarios, the employer should identify the par-
ticular knowledge, skills, abilities, or other personal characteristics that are central to the decision
about who to lay off. These variables might include abilities and skills needed for future vitality of
the organization, critical experience bases, customer contacts, past performance, and future output
responsibilities. Many of these variables will be illuminated by a current or future-oriented work
analysis. In addition, rating scales used to evaluate current employees (e.g. performance ratings,
skill ratings, knowledge ratings) should conform to accepted professional and scientific standards
for the use of such rating devices.
Unlike selection scenarios, it is difficult to “validate” a downsizing process because there is no
obvious criterion for success beyond simple organizational survival. Nevertheless, just as in the case
of selection, it is important to develop a theory of the downsizing process and its desired results.
This theory would include some statement of requisite KSAOs for going forward as well as the
organizational need for downsizing.
The defendant argued that the three employees were not similarly situated because they performed
different job functions, but the 6th Circuit ruled for the plaintiff because the three eliminated jobs
involved common knowledge, skills, and abilities.
Plaintiffs have also successfully used direct evidence of stray remarks by supervisors. For exam-
ple, in Starceski v. Westinghouse (1995), a supervisor admitted he stated that “it was actually a fact
that older engineers … were going to be let go” (see also Madel v. FCI Marketing, 1997).
Finally, employers must strictly follow eight explicit requirements in the Older Workers Benefit
Protection Act of 1990 if they offer enhanced benefits to older employees who accept early retire-
ment in exchange for a waiver of the right to sue. These requirements are that (a) the waiver document is clearly written and easily understood, (b) the waiver document cites the ADEA, (c) it only
affects rights prior to the effective date of the waiver, (d) it offers enhanced benefits, (e) it advises
employees of their right to seek counsel, (f) it provides 21 days for individuals and 45 days for
groups to make a decision, (g) it is revocable within 7 days of signing, and (h) it provides extensive
information about who is affected if a group is involved. In Oubre v. Entergy (1998), the Supreme Court made it clear that it will strictly construe these eight requirements; employers should therefore know them.
11. In the employment context, disparate treatment may appear in both obvious and nuanced
ways. In addition, disparate treatment based on race and gender may appear very differently
than disparate treatment based on age or disability.
A now classic mechanism by which disparate treatment based on race or gender may occur is in
steering racial minorities or women into dead-end, lower-paying positions outside of the primary
functions within the organization.
Subtle instances of disparate treatment based on gender can also occur in the initial salary offer
made to female applicants when compared to similarly situated males. In such a scenario, a female
applicant with the same qualifications as a male counterpart may be started out in a lower pay
grade. This difference can then carry over throughout the employee’s tenure, although salary raises
and job promotion may be similar from that point on. Unfortunately, the damage may already have been done, despite the fact that the two employees appear to have been treated fairly after hire.
Job titles, advancement rates, and salary growth are often used by organizations as surrogates for
“promotion potential” of employees or applicants, thus representing an indirect effect on selection
or promotion.
In some downsizing situations (i.e., deselection), an older worker may be the victim of disparate treatment with regard to a performance appraisal rating. The older employee is rated lower supposedly because he or she does not have the ability to adapt or adjust to future conditions in the marketplace. Older plaintiffs have alleged that performance appraisals may be arbitrary and excessively subjective assessments that represent a pretext for age discrimination. Without appropriate rating scale development and construct definition, this ability may be ill-defined and may lend itself to biased evaluations.
Similarly, an employee with a disability may be assumed to be unable to carry out some physical
task associated with the job simply because the employer failed to recognize alternative ways of per-
forming the task or subtle ways in which the task could be restructured. In other words, the employer
ignored the responsibility of providing reasonable accommodations to the disabled employee.
The 9th Circuit, which ruled in Wards Cove, had previously found the challenged practices in Wards Cove (word-of-mouth recruitment, walk-in hiring, nepotism, and "vague subjective hiring criteria") to be more indicative of disparate treatment than adverse impact in Domingo v. New England Fish (1984). Similarly, post-Wards Cove cases with parallel charges have also implied disparate treatment (e.g., Thomas v. Washington County, 1990, and EEOC v. O&G, 1991).
Initial gender-based pay differences are illustrated in EEOC v. Aetna (1980) and EEOC v. Liggett & Myers (1982). In Aetna, Barratt, a female, was hired at the same time as Garrett, a male. Barratt learned that Garrett was paid more only after Garrett was fired, and Aetna settled with Barratt. In Liggett & Myers, male supervisors were paid more than their female counterparts as a result of prepromotion salaries. Although it used a sex-blind procedure for all promotions to supervisor (absolute dollar amount increases), the company could not explain why prepromotion salaries were higher for the males.
"Adaptability" in downsizing is illustrated in Meacham v. Knolls (2006), in which 30 of 31 laid-off employees were over age 40. The plaintiffs proved adverse impact, but the defendants, using the
statutory RFOA (reasonable factors other than age) defense, articulated that the basis for determin-
ing layoffs was whether employees were “flexible” and “retrainable” for alternative assignments.
A divided panel of the 2nd Circuit ruled for Knolls because the plaintiffs could not prove this
employer strategy was unreasonable. However, the dissenting judge argued that the defendant’s
burden is to factually prove (rather than simply articulate) that its criteria are reasonable, and it is
not the plaintiff’s burden to prove otherwise. In 2008, the Supreme Court ruled that the employer
actually had the burden of proof, and the case has been remanded for further trial proceedings in
which the employer will be required to present that proof.
Finally, job restructuring is not required under the ADA if it entails elimination of essential job
functions. However, employers are free to restructure if they so desire. For example, in Barth v.
Gelb (1993), a diabetic applied unsuccessfully for an overseas job at Voice of America (VOA). VOA required rotation among its sites, and medical facilities were not available at each site. VOA
routinely permitted such selective assignments for injured employees. However, distinguishing
between "friends" and "strangers," the DC Circuit ruled it was legal to so restructure jobs for existing employees without being obligated to do likewise for applicants.
12. Employers should take proactive measures to prevent, detect, and correct EEO violations
involving procedural unfairness to workers or violations inferred from analysis of workforce
data.
The model for preventing, detecting, and correcting EEO violations involving unfair treatment
of employees is provided in exemplar form by EEOC’s approach to sexual harassment (SH). In
EEOC Policy Guidance 915.050 (June 1999; http://www.eeoc.gov/policy/docs/currentissues.html),
the EEOC distinguishes between minimum requirements and best practices.
Although Policy Guidance 915.050 was expressly targeted at SH, it applies equally well to
broader actions of any sort. For example, employees may feel that their performance is being
improperly appraised or that they are not receiving training or certification opportunities necessary
for advancement.
Workforce data may include confidential survey information obtained by a trained EEO Officer
or statistical data relating to performance of different groups of employees on selection tests, com-
position of the workforce in relation to the appropriate labor pool, or the funneling of different
groups into different jobs (e.g., females offered jobs as cashiers vs. males offered jobs as assistant
managers). Properly analyzed data could lead to the detection of a potential adverse impact or pat-
tern and practice violation, thus enabling employers to take action before expensive and disruptive
litigation while simultaneously increasing employee perceptions of fairness.
Finally, employers need a good antiretaliation policy. EEOC statistics reveal that retaliation com-
plaints are increasing although EEO claims in general have stabilized and even decreased (Zink &
Gutman, 2005). Adding to this caution, the Supreme Court recently lightened plaintiffs' burden of proving retaliation, making it important for employers to educate their HR professionals and managers on what to do when individuals complain about a workplace policy or file formal charges. It is not
unusual for plaintiff employees to complain about routine selection decisions such as lost training
opportunities that might presage promotions, or failure to promote per se. It is critically important
for employers to prevent reprisals against those who do complain.
complaint. These actions were deemed insufficient for proof of racial or SH, but were deemed suf-
ficient to prove retaliation against the complainants, although they were not original plaintiffs.
Finally, the risks employers face when altering selection processes to avoid adverse impact are illustrated in Ricci v. DeStefano (2008) and Hayden v. County of Nassau (1999). In Ricci, where
exams were discarded, there was a weak basis for the New Haven CSB to believe they would lose an
adverse impact challenge to minority applicants. However, in Hayden, where Nassau County (New
York) eliminated portions of the test to reduce the adverse impact, there was a much stronger basis
for that fear since the county was under a consent decree to create a valid test with the least amount
of adverse impact.
13. Using statistics to demonstrate age discrimination is more challenging than using similar
statistics to argue for racial or gender discrimination because many variables that would not
be expected to be correlated with race or gender (e.g., career progress, promotions, compensation) are commonly correlated with age.
Race and gender are immutable personal characteristics. Age is not, because we all age (although some less willingly than others). When considering race and gender in the calculation of adverse impact, it is common to compare differences in the relative rates of "success" of men and women, or Whites and Blacks. When holding relevant variables constant (e.g., education, experience, performance), in a "fair" work environment one would not expect to find differences in those success rates when considering race or gender. In contrast, age changes over time for an individual and is
often correlated with variables of interest such as experience and stage of career development. It is
widely recognized that early in one's career, promotions occur more rapidly, as do percentage increases in compensation. Thus, it is common to find that older individuals tend to receive fewer promotions and lower percentage increases in compensation. This is tempered by the fact that these same older
individuals are typically more highly paid than their younger counterparts in absolute dollars and
that these older individuals also tend to be at higher levels in the organization.
In age discrimination cases, the plaintiff will often present statistics demonstrating a decrease
in the rate of promotions for older employees as a way of addressing the issue of adverse impact.
Similarly, plaintiffs may produce evidence of statistically significant differences in performance
ratings between older and younger employees. To the extent that performance ratings play a role in
decisions regarding promotions, and/or layoffs, plaintiffs argue that performance ratings are biased
against the older employee and represent the foundation for palpable harm to the older employee.
For an informative discussion of the possibility of adverse impact in age cases, one should introduce controls related to organizational and job tenure, work-related experience, and knowledge/
skill/ability variables. Longitudinal analyses are superior to cross-sectional analyses when dealing
with age. With respect to differences in performance ratings, two phenomena should be recognized.
The first might be called the “frog pond” or “the all-state-quarterback effect.” To the extent that
performance ratings represent comparisons between an individual and peers, as an individual pro-
gresses up the organizational ladder (and, at the same time, ages), he or she is being compared with
an increasingly competent cohort. Thus, the individual arrives at “ponds” with increasingly bigger
frogs. The second phenomenon can be termed the “speed of success” factor. This means that an
individual who moves rapidly up in an organization might be considered more capable and effective
than a colleague who spends longer periods of time in a given job title.
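The need for such controls can be shown with a small stratified comparison. All counts below are hypothetical, invented for illustration: an overall promotion-rate gap between younger and older employees can vanish once employees are compared within the same tenure band, because older employees are concentrated in bands where promotions are rarer for everyone.

```python
# Sketch: why age-discrimination statistics need controls such as job
# tenure. Counts are hypothetical: (promoted, total) for each age group
# within each tenure band.
data = {
    "short tenure": {"younger": (30, 100), "older": (6, 20)},
    "long tenure":  {"younger": (2, 20),   "older": (10, 100)},
}

def rate(promoted: int, total: int) -> float:
    return promoted / total

# Pooled (uncontrolled) comparison: older employees look disadvantaged.
pooled = {}
for group in ("younger", "older"):
    promoted = sum(data[band][group][0] for band in data)
    total = sum(data[band][group][1] for band in data)
    pooled[group] = rate(promoted, total)
print(f"Pooled: younger {pooled['younger']:.2f} vs older {pooled['older']:.2f}")

# Controlled comparison: within each tenure band the gap disappears.
for band, groups in data.items():
    print(f"{band}: younger {rate(*groups['younger']):.2f} "
          f"vs older {rate(*groups['older']):.2f}")
```

In this invented example the pooled rates differ (0.27 vs. 0.13), yet within each tenure band the rates are identical, which is precisely why uncontrolled age statistics can mislead.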
Then in Hazen v. Biggins (1993), a 62-year-old was terminated shortly before eligibility for pension vesting, a clear-cut ERISA violation. However, the lower courts also favored disparate
treatment because age and years of service are correlated. The Supreme Court reversed on disparate
treatment, ruling unanimously that employer decisions may be motivated by “factors other than age
… even if the motivating factor is correlated with age.” Additionally, three justices opined that it is
“improper to carry over disparate impact analysis from Title VII to the ADEA.” After Hazen, three
circuit courts continued to entertain age-based adverse impact claims, but seven circuit courts found
adverse impact inapplicable in the ADEA as a matter of law. Disputes among circuits often lead to
Supreme Court decisions intended to add consistency across circuits.
Then, in Smith v. City of Jackson (2005), police officers and dispatchers with less than 5 years
of experience received higher percentage compensation increases. The lower courts ruled adverse
impact was unavailable in the ADEA, but the Supreme Court ruled that Hazen does not preclude
such claims. However, the Supreme Court affirmed Hazen that factors correlated with age (e.g.,
years of service) do not qualify and, where adverse impact is proven, defendants may use the statu-
tory RFOA in lieu of proving job-relatedness. The plaintiffs ultimately lost the prima facie case
(failure to prove adverse impact) and the City had a valid RFOA (the need to compete with neigh-
boring municipalities for filling lower-level positions by increasing the compensation for those posi-
tions, although the entry-level positions were often filled by younger applicants).
Two subsequent lower court rulings in which adverse impact was proven are worth noting. As noted in Principle 11 (above), in Meacham v. Knolls (2006), 30 of 31 laid-off employees were over
age 40. Additionally, in EEOC v. Allstate (2006), over 90% of employees subject to a “reorganiza-
tion plan” (making them ineligible for rehire) were over age 40. As noted above, in the Meacham
case, the Supreme Court ruled that the employer has the burden of proof when invoking an RFOA
defense.
14. All disability-related decisions should be made on a case-by-case basis, including deter-
mining if an applicant or employee is (a) disabled, (b) needs accommodations for performing
essential job functions, and/or (c) assessments of KSAOs deemed necessary to perform essen-
tial job functions.
Under the ADA of 1990 (and the prior Rehabilitation Act of 1973), there is no such thing
as “disability as a matter of law.” The general requirements for being disabled under the law
include: (a) a physical or mental impairment that (b) interferes with a major life function. In addi-
tion to prongs (a) and (b), the individual must (c) be able to perform all essential job functions
with or without accommodations. This is true regardless of whether the impairment is current
or past or if the employer mistakenly believes an individual is disabled. As a result, disabilities
fall within an interval between prongs (b) and (c) in which individuals with minor or temporary
impairments cannot demonstrate interference with major life functions, and individuals with
extremely severe impairments may not be able to perform all essential job functions, even with
accommodations.
The first step in determining if a disabled person requires accommodations is determining
whether there is a nexus between a physical or mental impairment and essential job functions.
A person with one leg is clearly disabled under the ADA, but it is unlikely that accommoda-
tions beyond access to the workplace are required if the job is computer programmer. If there
is a nexus, the employer should next meet with the applicant or employee and together explore
potential accommodations to overcome the barrier implied by the disability. If there are no such
accommodations possible, the individual, unfortunately, faces an insurmountable yet legal barrier
to employment.
If assessment is involved in the selection process, it is important to accommodate applicants
with special needs. For example, if a paper-and-pencil or computer-presented format is used and the
construct of interest is “good judgment,” it may be important to allow applicants with limited vision
to have the test questions read to them. More generally, it is good practice to incorporate KSAOs
A Sampler of Legal Principles in Employment Selection 647
needed to perform essential job functions in the testing process itself. For example, if a job or criti-
cal tasks can be performed without undue concern for the passage of time, it may be inappropriate
to use demanding time limits for a test to be taken by an individual who claims a learning disability
related to speed of reading or processing.
This is not to suggest that monocular individuals have an onerous burden in trying to show that they
are disabled . . . . We simply hold that the Act requires monocular individuals … to prove a disability by
offering evidence that the extent of the limitation in terms of their own experience, as in loss of depth
perception and visual field, is substantial.
In other words, Kirkingburg could have proven he was disabled within the meaning of the law, but did not; instead, he assumed amblyopia was a disability as a matter of law.
In the other two cases, the Supreme Court ruled that impairments must be evaluated with mitigation
(eyeglasses for visual impairments in Sutton and high blood pressure medication for hypertension in
Murphy). In an extension of the ruling in Kirkingburg, the Murphy Court ruled:
Murphy could have claimed he was substantially limited in spite of the medication. Instead, like
Kirkingburg, [he] falsely assumed that his impairment was a disability as a matter of law.
Taking advantage of this “advice,” plaintiffs subsequently proved disability despite medication, as
for example, in EEOC v. Routh (2001) (seizures only partially controlled with epilepsy medication)
and Lawson v. CSX (2001) (debilitating side effects of insulin medication for diabetes). It should be noted that in the ADA Amendments Act of 2008, Congress changed the rules for mitigating measures (except for ordinary eyeglasses), thus reversing these mitigation rulings. This does not, however, alter the principle of assessing impairment on a case-by-case basis.
Examples of insurmountable barriers include Southeastern v. Davis (1979) (a deaf woman excluded from nursing school), Treadwell v. Alexander (1983) (a heart patient who could not perform all-day foot patrols excluded from a park ranger job), and Miller v. Illinois (1996) (a blind person who could perform some but not all essential functions of a corrections officer). However, it is critical that the job functions in question are essential, as, for example, in Stone v. Mt. Vernon (1997) (a paraplegic former firefighter refused a desk job because he could not fight fires in emergencies).
When accommodations are possible, plaintiffs have a duty to inform and both parties have
a duty to “flexibly interact” to seek accommodations. Examples of failure to notify include
Hedberg v. Indiana Bell (1995) (notification of fatigue syndrome after termination) and Taylor v. Principal Financial (1997) (notification of bipolar disorder after a poor performance evaluation).
Examples of employee failures to flexibly interact include Beck v. University of Wisconsin (1996)
(the employee refused a request for medical records to identify accommodations) and Grenier
v. Cyanamid (1995) (the employee refused a request for psychiatric information). Examples of
employer failures to flexibly interact include Bultmeyer v. Fort Wayne (1996) (the employer
ignored a request by psychiatrist for reassignment), Feliberty v. Kemper (1996) (the employer
falsely assumed that a medical doctor can design his own accommodations), Whiteback v. Vital Signs (1997) (the employer ignored a request for a motorized cart because "it wouldn't look right"), and Dalton v. Subaru-Isuzu (1998) (the employer ignored a request for step stools and guard rails without discussion).
Mistakes in assessment include Stutts v. Freeman (1983), in which a dyslexic applicant failed the
General Aptitude Test Battery (GATB) exam for a job (heavy truck operation) that did not require
648 Handbook of Employee Selection
reading skills. On the other hand, in Fink v. New York City (1995), the city was not liable when
accommodations for blind applicants (readers and interpreters) did not result in passing scores on a
civil service exam.
CONCLUSIONS
The term “law” in the selection context has two separate but related meanings. There is statutory
law embodied in the Civil Rights Acts, ADA, ADEA, and similar federal statutes that we have dis-
cussed above. There is also “case law,” which is embodied in the opinions of various levels of the
federal judiciary (trial, appeals, and Supreme Courts). The latter interprets the former from the legal
perspective. Selection practitioners must be aware of both aspects of “the law.” They must be aware
of the statutory requirements as well as how judges have interpreted these requirements. Statutory
statements of the law seldom recognize the specific contributions of I-O psychology to selection
practices. Judges, on the other hand, often cite the testimony of I-O psychologists and the standards
by which selection practice is evaluated (e.g., UGESP, SIOP Principles). In this chapter, we have
attempted to bring together practice, statutory law, and case law as a way of educating practitioners.
Other chapters provide more detailed descriptions of practices. We provide a legal context for many
of those practices.
REFERENCES
Adams v. City of Chicago, 469 F.3d 609 (CA7 2006).
Albemarle Paper Co. v. Moody, 422 U.S. 405 (1975).
Albertsons v. Kirkingburg, 527 U.S. 555 (1999).
Barth v. Gelb, 2 F.3d 1180 (DC Cir. 1993).
Baty v. Willamette Industries, 172 F.3d 1232 (CA10 1999).
Beck v. University of Wisconsin Bd. of Regents, 75 F.3d 1130 (CA7 1996).
Bernard v. Gulf Oil, 890 F.2d 735 (CA5 1989).
Bew v. City of Chicago, 252 F.3d 891 (CA7 2001).
Biondo v. City of Chicago, 383 F.3d 680 (CA7 2004).
BNSF v. White, 126 S.Ct. 2405 (2006).
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporat-
ing cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–590.
Boston NAACP v. Beecher, 504 F.2d 1017 (CA1 1974).
Bradley v. City of Lynn, 443 F. Supp. 2d. 145 (D.Mass 2006).
Bridgeport Guardians, Inc. v. City of Bridgeport, 933 F.2d 1140 (CA2 1991).
Brunet v. City of Columbus, 58 F.2d 251 (CA6 1995).
Bultmeyer v. Fort Wayne School, 100 F.3d 1281 (CA7 1996).
Burlington Industries, Inc. v. Ellerth, 524 U.S. 742 (1998).
Clady v. Los Angeles, 770 F.2d 1421 (CA9 1985).
Coates v. Sundor Brands, 164 F.3d 1361 (CA11 1998).
Connecticut v. Teal, 457 U.S. 440 (1982).
Dalton v. Subaru-Isuzu Automotive, Inc., 141 F.3d 667 (CA7 1998).
De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected work-
force and control the level of adverse impact. Journal of Applied Psychology, 84, 695–702.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Dees v. Johnson Controls World Services, Inc., 168 F.3d 417 (CA11 1999).
Detroit Police Officers Association v. Young, 999 F.2d. 225 (CA6 1979).
Domingo v. New England Fish, 727 F.2d 1429 (CA9 1984).
Dothard v. Rawlinson, 433 U.S. 321 (1977).
EEOC v. Aetna Insurance Co., 616 F.2d 719 (CA4 1980).
EEOC v. Allstate, 458 F. Supp. 2d 980 (ED Missouri 2006).
EEOC v. Atlas Paper, 868 F.2d. 487 (CA6 1989).
EEOC v. Liggett & Myers, Inc., 690 F.2d 1072 (CA4 1982).
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences
and adverse impact. Personnel Psychology, 50, 707–722.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency
of various predictor combinations. Journal of Applied Psychology, 82, 719–730.
Shaw v. AutoZone, Inc., 180 F.3d 806 (CA7 1999).
Smith v. City of Jackson, 544 U.S. 228 (2005).
Society for Industrial and Organizational Psychology. (2003). Principles for the validation and use of personnel selection procedures. Bowling Green, OH: Author.
Southeastern Community College v. Davis, 442 U.S. 397 (1979).
Starceski v. Westinghouse, 54 F.3d 1089 (CA3 1995).
Stone v. City of Mount Vernon, 118 F.3d 92 (CA2 1997).
Stutts v. Freeman, 694 F.2d 666 (CA11 1983).
Sutton v. United Air Lines, 527 U.S. 471 (1999).
Talbert v. City of Richmond, 648 F.2d 925 (CA4 1981).
Taylor v. Principal Financial Group, 93 F.3d 155 (CA5 1997).
Thomas v. Washington County, 915 F.2d 922 (CA4 1990).
Treadwell v. Alexander, 707 F.2d 473 (CA11 1983).
US v. Georgia Power, 474 F.2d 906 (CA5 1973).
Watson v. Fort Worth Bank & Trust, 487 U.S. 977 (1988).
Wards Cove Packing Company v. Atonio, 490 U.S. 642 (1989).
Whiteback v. Vital Signs, 116 F.3d 588 (CA DC 1997).
Wingate, P. H., Thornton, G. C., III, McIntyre, K. S., & Frame, J. H. (2003). Organizational downsizing and
age discrimination litigation: The influence of personnel practices and statistical evidence on litigation
outcomes. Law and Human Behavior, 27, 87–108.
Zaccaginini v. Levy, 338 F.3d 672 (CA7 2003).
Zink, D. L., & Gutman, A. (2005). Statistical trends in private sector employment discrimination suits. In
F. J. Landy (Ed.), Employment discrimination litigation: Behavioral, quantitative, and legal perspectives
(pp. 101–131). San Francisco, CA: Jossey Bass.
30 Perspectives From Twenty-Two Countries on the Legal Environment for Selection
In the United States, the legal context plays a major role in how psychologists approach selection
system development. Psychologists know well the set of protected groups, the approaches to making
an a priori case of discrimination (e.g., differential treatment vs. adverse impact), the key court cases
influencing selection, and the prohibitions against preferential treatment (e.g., the 1991 ban on score
adjustment or within-group norming). Selection texts (e.g., Guion, 1998) and human resource man-
agement texts (e.g., Cascio & Aguinis, 2008) give prominent treatment to the legal context. In recent
years, there has been a growing internationalization of industrial-organizational (I-O) psychology
such that psychologists from all over the world work with clients in other countries and contribute
to our journals and to our conferences. Test publishers and consulting firms establish offices all
over the world. As this internationalization continues to increase, it becomes increasingly useful to
take a broader look at the legal environment for selection, examining similarities and differences
in various countries. For example, consider a U.S. firm with operations in several other countries. Although U.S. fair employment law applies only to those overseas employees who are U.S. citizens,
the employment by U.S. firms of host country nationals or third country nationals is subject to the legal environment of the host country.
1 All authors contributed equally to this chapter. Paul R. Sackett and Winny Shen integrated the text materials provided by each author. Portions of this chapter were drawn from an article by the same set of authors: Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). International perspectives on the legal environment for selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 200–256. Used by permission of the Society for Industrial and Organizational Psychology and Wiley Blackwell.
Question 1: Are there racial/ethnic/religious subgroups such that some are viewed as “advan-
taged” and others as “disadvantaged”?
Table 30.1 identifies the major groups viewed as “disadvantaged” in each country (note that gen-
der is treated separately in the next section, and specific legal protections for disadvantaged groups
are treated under Question 4). As Table 30.1 indicates, the disadvantaged groups differ on several
dimensions. First, the basis for disadvantaged status varies: (a) native/aboriginal people in a setting
where colonizers became the majority group (e.g., Native Americans in the United States, Maori in
New Zealand, First Nations Peoples in Canada), (b) recent immigrants (e.g., many European coun-
tries), (c) racial groups either native to or with long histories in the country (e.g., African Americans
in the United States; Blacks, colored individuals, and Indians in South Africa), (d) religious groups
(e.g., India), and (e) language groups (e.g., Francophones in Canada, Rhaeto-Romanic speakers in
Switzerland). Second, the size of the minority population varies, from a very small percentage of
the population in some countries to the South African extreme of a previously disadvantaged Black
majority. These findings illustrate that there is considerable variability from country to country in
what constitutes a disadvantaged group.
Question 2: What is the general picture regarding women in the workplace (e.g., historical
trends regarding employment for women; current data on percentage of women in the work-
force; and current status regarding occupational segregation, such as gender representation
in various job classes and at various organizational levels)?
Among the countries surveyed, women make up a substantial portion of the workforce. In gen-
eral, women make up from over one quarter to slightly less than one half of the working population
(see Table 30.2). Great strides have been made: women's rate of participation in the workforce has increased across all countries surveyed, with the exception of Turkey, which reports a slight decline in recent years (34% in the early 1990s down to 25.4% in 2004; State Institute of Statistics, 2006). There is substantial variability among countries in terms of the percentage of women who
TABLE 30.1
Disadvantaged Groups Within Each Country
Country Group Percentage of Population
Australia Indigenous Australians 2.5
Belgium Non-Western immigrants
Moroccan 0.8
Turkish 0.4
Canada Immigrants 18.4
Visible minorities 13.4
First Nations peoples 2.1
Francophones 15.7
Chile Recent immigrants
Argentinean
Peruvian 1.2
Bolivian
Ecuadorian
France Immigrant groups 7.4
European 3.33
North African 2.22
Other African 0.67
Asian 0.96
Germany Migrant workers/immigrants
Turkish 3.7
Southern European countries
Reimmigrants (Volga-Germans) 2.8
Greece Immigrants 7.0
Albanian
Bulgarian
Georgian
Romanians
India Within Hindu Castesa
Scheduled castes 15.06
Scheduled tribes 7.51
Other backward classes 43.70
Muslims 13.0
Israel Palestinian Arabs 22.0
Druze 2.0
Sephardic Jews 31.0
Iraq
Iran
Morocco
Ethiopia
Italy Albanian 1.0
Rumanian 0.9
Moroccan 0.9
Ukrainian 0.4
Chinese
Japan North and South Korean 0.5
Chinese 0.4
Brazilians 0.2
Philippines 0.1
continued
a The Hindu caste system differentiates between “forward” (advantaged) and “backward” (disadvantaged) groups. A national
“schedule” or classification of castes differentiates between scheduled castes (previously “untouchable” castes), scheduled
tribal groups, and other backward castes.
participate in the workforce, ranging from approximately one quarter of women in Turkey (State
Institute of Statistics, 2006) to between 60 and 70% in France (Attal-Toubert & Lavergne, 2006),
Kenya (primarily due to the high involvement of women in small-scale farming and pastoralism),
New Zealand, and the United Kingdom. These differences are undoubtedly at least partially due
to the multitude of differences among countries including those in history, culture and values, eco-
nomic conditions, and political conditions. It is interesting to note that in no instance is the female
participation rate higher than the male participation rate; this may partially reflect the traditional
division of labor between men and women, such that women are more likely to have household and
childcare duties.
Although women are less likely to participate in the workforce than their male counterparts, it
appears that there tend to be no or small differences in the unemployment rate for men and women
(usually within 1 or 2 percentage points). In fact, in recent years in Taiwan, the unemployment
rate for women has been lower than that for men. Exceptions to this trend include Greece (where
the unemployment rate of women is often two to three times that of men), Kenya, and Switzerland, where
women are still substantially more likely to be unemployed than male workers. However, it must
be noted that even small changes in the unemployment rate may have strong repercussions for the
economic, political, and social situation of a country.
Among all nations surveyed, there is still gender disparity in pay, and this disparity continues to
be substantial in magnitude. Among all countries where gender disparity information was available,
women earned between 11 and 34% less than men. This range may, of course, be wider or narrower
in the countries for which no information was available. Although it is unclear as
to whether these estimates take into account factors such as differences in occupations, differences
in full- versus part-time work, differences in educational attainment, etc., other research has shown
that even taking into account some of these extraneous factors, women still earn less than their male
counterparts (although the magnitude does decrease slightly). The U.S. General Accounting Office
(2003) reported that in 2000, after taking into account occupation, industry, race, marital status,
and job tenure, women still earned only 80% of what men earned (compared with 75% when these
differences were not taken into account). Currently, the most positive outlook for women's earnings is in Belgium,
France, Israel, New Zealand, Switzerland, and the United Kingdom, where women earn 80 cents or
more for every dollar earned by men (Equal Opportunities Commission, 2004).
There continues to be occupational segregation to some extent in all 22 countries. Across the
board, women are still more likely to be found in clerical or secretarial, retail or sales, healthcare
TABLE 30.2
Women’s Status in the Workplace Within Each Country
Country | Percentage of Workforce Population | Percentage of Men's Participation in Workforce | Percentage of Women's Participation in Workforce | Male Unemployment Rate | Female Unemployment Rate | Wage Differential (a)
Australia 72.0 57.0 66.0
Belgium 73.6 58.3 7.6 9.6 85.0
Canada 64.0
Chile 38.2
France 46.4 74.5 63.8 81.0
Germany 47.0
Greece 64.1–65.0 38.9–42.7 5.1–8.2 13.0–18.6
India 30.0
Israel 81.6
Italy 69.7 45.3
Japan 30.0b 73.2–79.4 48.0–50.0 67.1
Kenya 29.0d 74.7 72.6 25.0c 38.0 c
Korea 41.7 74.4 50.0 3.8 3.1 66.2
The Netherlands 70.0–77.0 54.0 4.5 6.8
New Zealand 61.2 81.0–87.0
Spain 42.2 3.5 4.8
South Africa 45.7
Switzerland 72.8 56.9 79.0–89.0
Taiwan 67.6 48.1 4.3 3.9 76.9
Turkey 36.3 72.3 25.4
United Kingdom 78.0 69.0 83.0
United States 46.4 74.0 59.0 4.7 4.5 77.0
The authors representing the various countries have undertaken to report the most recent data available from their country; there may be slight discrepancies between the years reported for
each country.
a Percent of women’s salary compared to men’s salary (men’s salary = 100%).
c In urban areas.
(e.g., nursing, childcare services), education (e.g., elementary school teachers), public services, or
small-scale agricultural farming occupations (e.g., Kenya and Turkey) than their male counterparts.
Furthermore, the occupations that women are most heavily concentrated in tend to be in the lower
income segment. Women remain underrepresented in business and management positions, particu-
larly higher levels of management. In most countries, women continue to lag behind in representa-
tion for technical and scientific positions, professional jobs, higher-level governmental positions
(e.g., judges, cabinet members, etc.), and most higher-level jobs across sectors.
Authors for several countries note that women are more likely to join the workforce as part-time
workers (e.g., Belgium, France, Germany, Japan, Switzerland, and the United Kingdom) to bet-
ter balance work and family demands or leave the workforce because of childcare demands (e.g.,
Japan, Korea, and the United States). The latter trend is particularly pronounced in Japan, where
the participation ratio by age groups shows an M-shaped curve, because labor force participation
rate declines in women’s early 30s because of childcare responsibilities. During the period of 1970–
2004, the valley section of this M-curve has shifted northeastward due in part to the trend of late
marriage and late childbirth. In addition, both peaks of this M-curve have become higher, indicating
that women’s workforce participation has substantially increased in their 20s and late-30s or older
(Japan Institute of Labor Policy and Training, 2007). However, some countries also indicated that
the wage gap between men and women may be even more pronounced among part-time workers.
For example, in the United Kingdom, women are paid 17% less than men in full-time work and 38%
less in part-time work (Equal Opportunities Commission, 2004).
Question 3: Is there research documenting mean differences between groups identified above
on individual difference measures relevant to job performance?
Mean differences on ability and personality measures are commonly examined in the United
States, with enough data for large-scale meta-analytic summaries. Mean differences on tests of
developed abilities of roughly 1.00 standard deviation (SD) between Whites and African Americans
and roughly 0.67 SD between Whites and Hispanics have been consistently reported. The largest-
scale summary of this literature is a meta-analysis by Roth, Bevier, Bobko, Switzer, and Tyler
(2001). This abundance of data stands in marked contrast to the pattern of findings in the
countries examined here. In fact, for most countries, the authors reported finding either no research
or research with samples so small that they refrained from drawing conclusions (i.e., Chile, France,
Greece, Italy, Japan, Korea, Spain, Switzerland, Turkey, and the United Kingdom). Although lim-
ited, there are some data on group differences in some countries.
Two countries (Australia and Taiwan) report research on cognitive ability differences between
aborigines and the advantaged group. The lower cognitive ability scores for Australian aborigi-
nes may reflect differences in language and culture. Aborigines in Taiwan, who typically have
lower educational attainment (Council of Indigenous Peoples, 2002), also score lower than non-
aborigines on several cognitive ability tests. Data from the United Arrangement Commission for
college entrance examinations in Taiwan in 2006 showed d values between 0.44 and 0.68 in favor
of nonaborigines, depending on the particular test subject (A. Chuang, personal communication,
May 1, 2007).
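The d values reported here and throughout this section are standardized mean differences (Cohen's d): the gap between two group means divided by the pooled standard deviation. A minimal sketch, using entirely fabricated scores purely for illustration:

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Fabricated test scores for a higher- and a lower-scoring group:
majority_scores = [52, 55, 58, 60, 61, 63, 65, 67]
minority_scores = [45, 48, 50, 53, 55, 56, 58, 60]
d = cohens_d(majority_scores, minority_scores)  # positive d favors the first group
```

A d of 1.00 means the group means differ by a full pooled standard deviation, which is the magnitude repeatedly cited in this section for Black-White differences on cognitive measures.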
Cognitive ability mean score differences have been reported of d = 1.39 between Turkish/
Moroccan immigrants and Dutch test-takers and d = 1.08 between Surinamese/Antillean and Dutch
test-takers, in both cases favoring the majority group (te Nijenhuis, de Jong, Evers, & van der Flier,
2004). Language differences appear to contribute to these findings because higher scores are found
for second-generation than first-generation immigrants. Studies in Belgium also report mean differ-
ences of about 1.00 SD on cognitive tests between Belgians and Turkish and Moroccan immigrants
in samples of children (Fontaine, Schittekatte, Groenvynck, & De Clercq, 2006).
In South Africa, mean score differences on cognitive tests between Black and White groups are
normally larger than those found in U.S. studies, with d values of approximately 1.00–1.50 and with Whites
obtaining the higher mean scores. In a study performed in a South African financial services organization,
d values of 0.99 for a verbal ability, 1.03 for a numerical ability, and 1.14 for a diagrammatic ability
test were found (SHL, 2006). In South Africa, these differences are largely ascribed to the differ-
ences in the educational level of the racial groups. In the 2001 census, it was determined that 22.3%
of Africans, 8.3% of Colored (mixed race), 5.3% of Indians, and 1.4% of Whites had no schooling
(Statistics South Africa, 2001).
Limited data report lower scores for Arabs than Jews in Israel (Zeidner, 1986), for Canadian
First Nations people than for Whites, for New Zealand Maori than for Whites (Chernyshenko,
2005; Guenole, Englert, & Taylor, 2003), and differences between individuals in various provinces
in Kenya (Kinyungu, 2006). Data on personality measures are even more limited than for cogni-
tive ability, with authors reporting personality data from only two countries: a large-scale study of
Black-White differences in South Africa (Kriek, 2006) showing small differences and several stud-
ies of Dutch-immigrant differences in the Netherlands showing much larger differences (van Leest,
1997; te Nijenhuis, van der Flier, & van Leeuwen, 1997, 2003).
Overall, several findings of interest emerge. First, it is clear that gathering data and reporting
mean differences by group is generally far more common in the United States than in virtually all
of the countries contributing to this report. This is likely the result of the legal scrutiny to which
tests are held in the United States. The Uniform Guidelines on Employee Selection Procedures
(U.S. Equal Employment Opportunity Commission, 1978) use adverse impact computations as the
basis for a prima facie case of discrimination, and thus, adverse impact resulting from test use is
routinely examined, with mean differences between groups and the method of test use (e.g., a high
or a low cutoff) functioning as key determinants of adverse impact. Second, although data tend to
be more sparse than in the United States, group differences are studied and observed in various
settings involving different types of disadvantaged groups (e.g., immigrant groups in Belgium and
The Netherlands; native peoples in Australia, New Zealand, and Canada; tribal and provincial dif-
ferences in Kenya; the native Black population in South Africa; and Arab groups in Israel). Third,
as in the United States, there is interest not only in whether there are group differences, but also
in understanding the basis for these differences. Language, culture, and differences in educational
access and attainment are seen as key concerns in understanding differences in test scores across
groups.
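The adverse impact computation referenced above is conventionally operationalized with the Uniform Guidelines' "four-fifths rule" of thumb: a protected group's selection rate below 80% of the most favored group's rate supports a prima facie case. A sketch with hypothetical applicant-flow figures:

```python
def adverse_impact_ratio(hired_focal, applicants_focal, hired_ref, applicants_ref):
    """Ratio of the focal group's selection rate to the reference group's rate."""
    focal_rate = hired_focal / applicants_focal
    ref_rate = hired_ref / applicants_ref
    return focal_rate / ref_rate

# Hypothetical flow: 30 of 100 focal-group applicants hired vs.
# 50 of 100 reference-group applicants hired.
ratio = adverse_impact_ratio(30, 100, 50, 100)
flagged = ratio < 0.8  # four-fifths (80%) rule of thumb
```

Here the ratio is 0.60, below the four-fifths threshold, so the practice would draw scrutiny. As the text notes, both the mean difference between groups and where the cutoff is set drive this ratio.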
In the United States, disparate impact is the basis for a prima facie case of discrimination. The
implicit assumption is that various groups are expected to obtain similar mean scores absent bias in
the measure. Our data suggest that many European countries target certain groups as immigrants
to meet specific labor shortages. Thus, immigrants might have higher or lower abilities, depending
on whether a country tried to attract highly skilled people (e.g., recent immigrants into Switzerland
from northern and western Europe) or tried to attract people with low skills (e.g., Turkish immi-
grants to Germany). In other words, even if one has a general expectation of no group differences
at the population level, a finding of differences between locals and immigrants would be expected
given this targeted immigration.
Question 4: Are there laws prohibiting discrimination against specific groups and/or man-
dating fair treatment of such groups? Which groups are protected? Which employers are
covered? Which employment practices are covered (e.g., selection, promotion, dismissal)?
Table 30.3 presents summary information addressing the above questions for each country.
Several findings emerge. First, there is some basis for legal protections for members of specified
groups in all countries. The bases for these protections vary widely. In many cases the national
constitution provides general, or at times specific, protections. This may be seen as analogous to the
5th and 14th Amendments to the U.S. Constitution, which respectively state that “no person shall …
be deprived of life, liberty, or property without due process of law,” and that “no state shall …
deny to any person within its jurisdiction the equal protection of the laws.” However, in virtually all
cases there are also specific laws defining specified protected classes, specifying which employment
practices are covered and which employers are required to comply. The intent here is to identify the
TABLE 30.3
International Laws and Practices
Australia
Law: The Crimes Act 1914; Racial Discrimination Act 1975; Sex Discrimination Act 1984; Human Rights and Equal Opportunity Commission Act 1986; Disability Discrimination Act 1992; Workplace Relations Act 1996; Equal Opportunity for Women in the Workplace Act 1999; Age Discrimination Act 2004.
Employers covered: All employers; the EOWW Act of 1999 refers to organizations of 100+.
Employment practices covered: All stages of the employment relationship, including but not limited to recruitment, selection, termination, training, and promotion.
Belgium
Law: Belgian Constitution of 1994, Articles 10, 11, and 191; Law Equality of Men-Women of 1978; Antidiscrimination law of 2003.
Employers covered: All employers.
Employment practices covered: Most employment practices, including selection and appointment, promotions, employment opportunities, labor conditions, dismissal, and wages.
Canada
Law: Canadian Human Rights Code of 1985; Section 15 of the Charter of Rights and Freedoms (1982); Federal Employment Equity Act (2004); Federal Contractors Program; pay equity legislation (federal and some provinces).
Employers covered: Federal government departments, crown corporations, and other federally regulated agencies and organizations.
Employment practices covered: Most employment practices, including selection, performance appraisal, termination, and compensation.
Chile
Law: Constitution, Chapter 3 (Rights and Duties), Article 19 Nº 16 (Freedom of Work and its protection) and Work Code, Article 2º (2002).
Employers covered: All employers.
Employment practices covered: The Constitution establishes the general nondiscrimination principle on the basis of race, color, sex, age, marital status, union membership status, religion, political opinions, nationality, and national or social origin. In March 2008, a new law took effect (law # 20,087). This new law defines discrimination as any action
TABLE 30.3 (continued)
Age Discrimination Act 1967 (covers private employers with 20+ employees, state and local governments): Prohibits discrimination against individuals age 40 or older.
Americans with Disabilities Act 1990 and Rehabilitation Act 1973 (ADA covers private employers, state and local governments; Rehabilitation Act covers federal government): Prohibit discrimination against individuals with disabilities in the full range of employment decisions.
Equal Pay Act 1963 (covers virtually all employers): Prohibits discrimination against women in pay decisions.
major contemporary federal laws and government decrees, and as such it is not a complete record
of all historical employment regulations. For example, in the United States a specialist can rightly
note that the Civil Rights Acts of 1866 and 1871 are still relied upon on occasion, although these are
not listed in the table. Also, several states and cities have additional statutes, offering protection to
groups beyond those covered by federal law.
Second, the protections offered are generally quite sweeping in terms of the types of employers
covered. In most cases all employers are covered. Some laws are restricted to government employ-
ees, and in some cases, coverage is restricted to larger employers, with the coverage threshold
varying quite widely for some statutes (e.g., more than 6 employees in Israel, 15 in the U.S., 100 in
Taiwan, and 300 in Korea).
Third, it is typical for a broad range of employment practices to be included. Employee selection
is specifically included in all countries except Chile, which has the least developed set of employment
rights regulations of the countries examined here. Chile's Constitution prohibits discrimination based
on race, color, sex, age, marital status, union membership status, religion, political opinions,
nationality, and national or social origin, but it does not specify which employment practices are
covered.
Fourth, there is considerable commonality and variation in the classes that receive protection in
each country. Table 30.4 identifies the most common protected classes and indicates whether those
classes are covered in each of the contributing countries. The classes covered in U.S. Civil Rights
law emerge as commonly covered across countries: race, color, religion, gender, national origin,
age, and disability status. Three categories not protected by federal statute in the United States
are protected in most countries: political opinion, sexual orientation, and marital/family status.
Several protected classes are covered in only a few countries or are unique to particular countries; Table
30.5 identifies these less common protected classes. Examples include language, physical appear-
ance, union membership, socioeconomic status, and HIV status.
TABLE 30.4
Most Common Protected Classes
Country | Race | Sex | National/Ethnic Origin | Color | Age | Religion | Disability | Political Opinion | Sexual Orientation | Marital/Family Status
Australia X X X X X X X
Belgium X X X X X X X X X X
Canada X X X X X X X X X
Chile X X X X X X X X
France X X X X X X X X X
Germany X X X X X X
Greece X X X X X
India X X
Israel X X X X X X X X
Italy X X X X X X X
Japan X X X X X X
Kenya X X X X X X X
Korea X X X X X X X X X X
The Netherlands X X X X X X X X X
New Zealand X X X X X X X X X X
South Africa X X X X X X X X X X
Spain X X X X X X X X X
Switzerland X X X X X X X
Taiwan X X X X X X X
Turkey X X X X X
United Kingdom X X X X X X X
United States X X X X X X X
TABLE 30.5
Other Protected Classes by Country
Country Other Protected Classes
Australia Breastfeeding, family or career responsibilities, irrelevant criminal record, physical features,
potential pregnancy, trade union or employer association activity, sexual harassment, pregnancy and
transgender status
Belgium Union membership, membership of other organizations, health, and any other personal characteristic
Chile Union membership status
France Moral principles, genetic characteristics, union activities or activities in a “mutuelle,” physical
appearance, family name, and health
Germany Philosophy of life, sexual harassment
India Scheduled castes, scheduled tribes, and other backward classes
Israel Personal status and military service
Italy Personal and social conditions and language
Japan Social status
Kenya Tribe, local connection, and HIV/AIDS status
Korea Social status, region of birth, appearance, criminal record after punishment has been served,
academic background, medical history, pregnancy, and physical conditions (e.g. appearance, height,
weight)
The Netherlands Philosophy of life, chronic disease, full-time/part-time work, and type of contract
New Zealand Ethical belief, employment status, and sexual and racial harassment
South Africa HIV status, conscience, belief, culture, birth, pregnancy, and language
Spain Social condition and membership to a labor union
Switzerland Socioeconomic status, way of life, and language
Taiwan Thought, provincial origin, appearance, facial features, union membership, status, and language
Turkey Philosophical belief, sect, and language
United Kingdom Persons who have undergone gender reassignment or intend to
United States Pregnancy
job-related is only partially reduced or remains with the plaintiff. There is a general notion that
the employer should present evidence to support the job relatedness of the employment practice in
question, but rarely is the required form of such evidence specified. The identification of validity
evidence as a mechanism for establishing job relatedness is rare.
TABLE 30.6
Evidence Needed to Refute a Discrimination Claim, Consequences of Violation, and Permissibility of Preferential Treatment by Country
Australia
Evidence needed to refute a claim: Inherent requirements of the job, existence of special measures to eliminate discrimination, occupational requirements, actions required by law, employment within small organizations, consistent beliefs (e.g., religious organizations or educational institutes). The statutes make no reference to the psychological concept of validity, nor has it arisen in case law.
Consequences of violation: Injunction to stop the act, award of damages, order to the organization to redress the situation, variation or cancellation of a contract or agreement that violates the law.
Permissibility of preferential treatment: Within-group norming is not banned and is used by some psychological testers as a means of complying with legislation (Myors, 2003). Targets may be used in some EEO plans, but explicit quotas are avoided.
Belgium
Evidence needed to refute a claim: Statistical data or practical tests can be used as evidence.
Consequences of violation: Mediation or binding judgment from civil court. Imprisonment and/or fines.
Permissibility of preferential treatment: Preferential treatment is permitted to remedy a historical discrimination against a group. Quotas are permitted, but seldom utilized. Some organizations also utilize target numbers.
Canada
Evidence needed to refute a claim: The employer must demonstrate that the employment policy, practice, or procedure that is challenged is a bona fide occupational requirement. Tribunals and courts are quite liberal in the evidence that they will accept from employers in defense of their employment practices. Empirical and statistical evidence generated by I-O psychologists (e.g., local validation studies) may be useful in defending employment practices, but courts and tribunals often lack the sophistication to make full use of such detailed and complex technical information.
Consequences of violation: Fines, payment for lost wages, reinstatement, and ordering of special programs.
Permissibility of preferential treatment: Preferential treatment permitted (mainly in the public sector).
Chile
Evidence needed to refute a claim: Unclear, unless for sexual harassment or unionization suits. Empirical evidence not required.
Consequences of violation: Unknown. Currently, sexual harassment suits may result in monetary compensation and up to 3 years imprisonment.
Permissibility of preferential treatment: Government has enacted an informal quota for women in minister positions; however, this has not crossed over into the private sector.
France
Evidence needed to refute a claim: Vague. Employer should present any information showing the decision is legitimate, nondiscriminatory, and based on objective information.
Consequences of violation: Three years imprisonment and/or a fine for conviction in a criminal court. Discriminatory act is annulled in a civil court, with possible financial compensation.
Permissibility of preferential treatment: Considerable discussion about this; politically, preferential treatment is seen as undesirable. However, there are settings where it is used. When parties present lists of candidates for regional and senatorial elections, they are required to have an equal number of men and women. Also, there are quotas in one setting: at least 6% of the workforce needs to be handicapped for organizations with more than 20 employees.
Germany
Evidence needed to refute a claim: Needs to be based on job requirements.
Consequences of violation: Employee has the right to refuse to work while on payroll and sue employers for damages.
Permissibility of preferential treatment: No formalization, but public authorities are to give preference to women and handicapped persons.
Greece
Evidence needed to refute a claim: Employer must show that there has been no breach of the principle of equal treatment.
Consequences of violation: The employer who infringes the laws about equal treatment on the grounds of racial or ethnic origin, religion or belief, disability, age, or sex is punished by imprisonment of 6 months up to 3 years, together with a penalty of 1,000 up to 5,000 euros.
Permissibility of preferential treatment: Preferential treatment to prevent or compensate for disadvantages linked to any of the protected classes.
India
Evidence needed to refute a claim: At the discretion of the judge.
Permissibility of preferential treatment: Preferential treatment in the form of a relaxation of qualifying scores for protected groups in external recruitment is permitted; however, a common standard is required for promotion. Not all members of protected groups are equally eligible, also dependent on social/economic status. Government positions also use quotas.
Israel
Evidence needed to refute a claim: Evidence of test reliability and validity, which can be based on validity generalization. In addition, the National Labor Court recently ruled that employers seeking to prove their innocence will be subject to less severe tests of selection validity to the extent that they are accused of discriminating against internal as opposed to external candidates, the logic being that employers typically have far greater information upon which to base a selection decision when choosing
Consequences of violation: Small fines. Hiring, reinstatement, or career advancement of plaintiff; payment of back wages.
Permissibility of preferential treatment: Preferential treatment is required by public organizations and state-owned enterprises for women and minorities. Preferential treatment is permitted in the private sector.
ability tests, provided that such tests are not designed, intended, or used to discriminate. Additional
instances reported of restrictions on specific selection methods in participating countries include a
prohibition against comprehensive personality assessment in Switzerland and a restriction on the
use of certain Minnesota Multiphasic Personality Inventory (MMPI) and California Psychological
Inventory (CPI) items in Spain.
The most strikingly different approach to regulating selection practices is found in South
Africa. Rather than the common approach of a presumptive right of an employer to use a particu-
lar method absent a successful challenge by a plaintiff, South African law puts the burden imme-
diately on the employer. According to the Employment Equity Act of 1998 (Government Gazette,
1999), psychological testing and other similar assessments are prohibited unless the test is proven
to be scientifically valid and reliable, can be applied fairly to all employees, and is not biased
against any employee or group. The Society for Industrial and Organizational Psychology (SIOP)
in South Africa published “Guidelines for the Validation and Use of Assessment Procedures for
the Workplace” during 2005 to provide guidelines for practitioners in the field of I-O psychology to
ensure that their assessment instruments and practices comply with the scientific requirements
and international best practices. These guidelines were largely based on the American SIOP
guidelines.
Question 8: What is the legal status of preferential treatment of members of minority groups
(e.g., quotas or softer forms of preference)?
To set the stage, note that the term “affirmative action” is used in various contexts, only some of
which involve preferential treatment for protected groups. Some forms of affirmative action involve
outreach efforts to publicize openings and to encourage applications from members of protected
groups. However, there is no preferential treatment given once an individual is in the applicant pool.
Approaches involving preferential treatment fall into two main classes: (a) those that set differing
standards for protected and nonprotected groups without setting aside a specified number or propor-
tion of openings for members of protected groups (e.g., using different cut-off scores, using within-
group norming) and (b) quota approaches that set aside a fixed number or proportion of openings
for members of protected groups.
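The two non-quota mechanisms named above can be contrasted in a small sketch (with fabricated scores): a common raw cutoff passes the higher-scoring group at a higher rate, whereas within-group norming ranks each applicant against his or her own group, equalizing pass rates at any percentile cutoff:

```python
def within_group_percentile(score, group_scores):
    """Percentile rank of `score` within its own group's distribution."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100.0 * at_or_below / len(group_scores)

# Fabricated raw scores for two groups whose means differ:
group_a = [50, 55, 60, 65, 70, 75, 80, 85]
group_b = [40, 45, 50, 55, 60, 65, 70, 75]

# A common raw cutoff of 70 passes the two groups at different rates:
pass_a = sum(s >= 70 for s in group_a)  # 4 of 8
pass_b = sum(s >= 70 for s in group_b)  # 2 of 8

# Within-group norming: a raw 70 in group A and a raw 60 in group B
# occupy the same percentile within their respective groups.
pct_a = within_group_percentile(70, group_a)
pct_b = within_group_percentile(60, group_b)
```

Selecting everyone above a given percentile of his or her own group yields equal selection rates by construction, which is precisely the practice the U.S. Civil Rights Act of 1991 prohibits.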
Table 30.6 summarizes the status of preferential treatment in the participating countries. Preferential
treatment is a domain in which the United States emerges as a clear outlier. Preferential treatment in
terms of differing score cutoffs or separate norming of tests within group is prohibited by the U.S.
Civil Rights Act of 1991 (U.S. Code, 1991), and the use of quotas is restricted to very limited set-
tings, such as a court-ordered remedy following a finding of discrimination. In contrast, in only two
countries do commentators report a prohibition against minority preference (Turkey and the United
Kingdom). The types of preference permitted and the settings in which it is used do vary widely. The
status of quotas varies, from prohibited (Australia), to permitted but rarely used (Belgium), to permit-
ted and widely used (South Africa), to used in government sectors (backward classes in India and
women in Chile), to required for certain groups (e.g., aborigines in Taiwan, individuals with disabili-
ties in France, Japan, Kenya, and Korea). Several commentators note that applying lower standards to
protected groups (e.g., different cutoffs or within-group norming) is used (Australia, India, and South
Africa). In India, lower qualifying scores for protected groups are permitted for external selection,
but not for promotion.
Question 9: How have laws and the legal environment affected the practice of science-based
employee selection in this country?
In only a few countries (Canada, South Africa, and the United States) is the legal environment
seen as having a large effect on science-based employee selection. In general, this can partially be
attributed to the much more amorphous legal standards and consequences with regard to employment
discrimination in most countries surveyed. The reciprocal relationship between science-based
selection and the legal environment will need to be continually monitored because many countries
are still in the process of developing legal statutes and requirements or establishing guidelines for
the prosecution and rulings on employment discrimination.
Overall, most employers in the countries surveyed have great latitude in choosing what selec-
tion procedures to utilize. However, most employers are aware of the social and political nature of
selection procedures and seem to err on the side of mainstream, popular, and usually well-validated
selection methods. The most common types of selection procedures do vary by country. It is common
to see reports of increased use of the tools and techniques of science-based selection, but the
driving forces are more commonly the presence of multinational firms and consulting firms that
import these techniques into the country.
DISCUSSION
In this section we offer 35 broad summary statements about the patterns emerging from the narra-
tives from the various countries.
DISADVANTAGED GROUPS
1. Disadvantaged groups could be divided into four main groups: immigrants or foreign residents,
religious minorities, racial/ethnic minorities, and language minorities (those whose
primary language differs from the majority's).
2. Many European (especially European Union) nations have disadvantaged groups who are
immigrants or foreign workers. The groups that are disadvantaged are usually Eastern
European or African.
3. Many Asian countries also have disadvantaged groups who are immigrants or foreign
workers.
4. Many of the racial/ethnic minorities are indigenous people (e.g., Australia, Canada, New
Zealand, Taiwan, and the United States).
5. Most disadvantaged groups are a relatively small proportion of the population, most below
the 20% “breaking point” specified in research on tokenism (Kanter, 1977).
6. Disadvantaged groups can constitute the majority of the population (e.g., South Africa).
DISCRIMINATION LAWS
15. Every country has a law or directive that prohibits discrimination on the basis of sex or
race/ethnic origin and many other personal characteristics and beliefs.
16. Most discrimination cases seem to be settled by special commissions and/or courts rather
than by juries (which do not exist in several countries).
17. In many countries, few cases are actually filed and/or brought to trial, not because
discrimination does not occur, but because workers do not understand their rights, are not
accustomed to asserting those rights (e.g., because of a collectivistic orientation), or do not
see much benefit in going to court.
18. Punishment is generally rather light (e.g., a minimal to moderate fine, reinstatement, or
payment of back wages).
19. Concerns about privacy are very prominent in Europe. In many European countries, the
concern is so great that data on race or gender are not collected at all.
MINORITY PREFERENCE
27. Minority preference is permitted (and even recommended) in most countries. This is more
likely to be true for women or those with disabilities than for racial groups.
28. It is more common for government entities than for private-sector firms to engage in
practices involving preferential treatment.
29. Forms of affirmative action vary, ranging from active recruitment and training of women
or racial groups that have been traditionally disadvantaged to lower standards for these
groups.
30. Quotas are relatively rare but are present in several countries; for example, India (lower
castes), Taiwan (aborigines), Korea and France (disabled), and South Africa (race and
gender).
31. Explicitly forbidding preferential treatment is rare (e.g., Turkey).
We fully anticipate that some readers may question the value of knowing the legal environment
of countries other than their own, because they are inevitably bound by the legal constraints of the
country they operate in. We have several responses. First, in today’s global world, more and more
firms engage in business that extends across national boundaries. Second, there is value in extend-
ing one’s framework beyond the national setting with which one is most familiar. Discovering that
the same issue is treated differently elsewhere breaks the mold of viewing a certain set of circum-
stances as inevitable. Third, documenting these differences sets the stage for comparative research
asking why certain variations are found. For example, why is preferential treatment generally not
permitted, and viewed so negatively by the public, in the United States but not in many other
countries? Why are some groups protected in some countries but not in others?
In conclusion, we hope this compilation of information about perspectives from a wide range of
countries is useful to students, researchers, and practitioners around the globe. We encourage inter-
national collaborations on other workplace issues, and hope this project provides a useful model.
AUTHORS’ NOTE
This research was conducted while Antonio Mladinic was on leave from the Pontificia Universidad
Católica de Chile and holding a visiting appointment at the University of Texas at El Paso and while
Herman Aguinis was on sabbatical leave from the University of Colorado Denver and holding a
visiting appointment at the University of Salamanca (Spain). Oleksandr Chernyshenko is now at the
Nanyang Business School, Nanyang Technological University (Singapore).
REFERENCES
Attal-Toubert, K., & Lavergne, H. (2006). Premiers résultats de l’enquête sur l’emploi 2005. [Initial results
from the 2005 employment survey]. INSEE Première, number 1070. Paris: INSEE. Retrieved April 15,
2007, from http://www.insee.fr/fr/ffc/ipweb/ip1070/ip1070.pdf
Cascio, W. F., & Aguinis, H. (2008). Applied psychology in human resource management (6th ed.). Upper
Saddle River, NJ: Pearson Education.
Chernyshenko, O. S. (2005). Report on psychometric evaluation of the general reasoning test (GRT2) for
the New Zealand Police: Measurement equivalence across ethnic and gender groups. Auckland, New
Zealand: OPRA Consulting Group.
Council of Indigenous Peoples. (2002). Yearbook of Taiwanese aborigines statistics. Taipei, Taiwan: Executive
Yuan.
Equal Opportunities Commission. (2004). Sex and power: Who runs Britain? Manchester, England: Author.
Fontaine, J. R. J., Schittekatte, M., Groenvynck, H., & De Clercq, S. (2006). Acculturation and intelligence
among Turkish and Moroccan adolescents in Belgium. Unpublished manuscript, Ghent University,
Ghent, Belgium.
Government Gazette. (1999). Employment Equity Act, 1998 (Act No. 55 of 1998), R 1360.
Guenole, N., Englert, P., & Taylor, P. (2003). Ethnic group differences in cognitive ability test scores within a
New Zealand applicant sample. New Zealand Journal of Psychology, 23, 39–54.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence
Erlbaum.
Japan Institute of Labor Policy and Training. (2007). Labor situation in Japan and analysis 2006/2007. Retrieved
June 5, 2007, from http://www.jil.go.jp/english/laborinfo/library/documents/Labor2006_2007.pdf
Kanter, R. M. (1977). Men and women of the corporation. New York, NY: Basic Books.
Kinyungu, C. (2006). KCPE: Public schools feel the heat. Retrieved January 31, 2007, from http://www.
eastandard.net/archives/cl/hm_news/news.php?articleid=1143963072
Kriek, H. J. (2006). Personality assessment: Group differences, language proficiency and fairness. Paper
presented at the Society for Industrial and Organizational Psychology Conference, May 2006, Dallas, TX.
Myors, B. (2003). Within-group norming: Just because it’s illegal in America, doesn’t mean we can’t do it here.
Paper presented at the 5th Australian Conference on Industrial/Organisational Psychology, Melbourne,
Australia.
Myors, B., Lievens, F., Schollaert, E., Van Hoye, G., Cronshaw, S. F., Mladinic, A., et al. (2008). International
perspectives on the legal environment for selection. Industrial and Organizational Psychology:
Perspectives on Science and Practice, 1, 206–256.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive
ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
SHL. (2006). Validity study V036. Retrieved June 7, 2007, from http://www.shl.com/globallocations/pages/
southafrica.aspx
Society for Industrial and Organisational Psychology in South Africa. (2005). Guidelines for the validation
and use of assessment procedures for the workplace. Retrieved June 7, 2007, from http://www.siopsa.
org.za
State Institute of Statistics. (2006). Census of population: Social and economic characteristics. Ankara, Turkey:
Author.
Statistics South Africa. (2001). Census 2001: Primary tables South Africa, Census 1996 and 2001 compared.
Johannesburg, South Africa: Author.
te Nijenhuis, J., de Jong, M., Evers, A., & van der Flier, H. (2004). Are cognitive differences between immi-
grant and majority groups diminishing? European Journal of Personality, 18, 405–434.
te Nijenhuis, J., van der Flier, H., & van Leeuwen, L. (1997). Comparability of personality test scores for
immigrants and majority group members: Some Dutch findings. Personality and Individual Differences,
23, 849–859.
te Nijenhuis, J., van der Flier, H., & van Leeuwen, L. (2003). The use of a test for neuroticism, extraver-
sion, and rigidity for Dutch immigrant job-applicants. Applied Psychology: An International Review, 52,
630–647.
U.S. Code. (1964). Pub. L. 88-352.
U.S. Code. (1991). Pub. L. 102-166.
U.S. Equal Employment Opportunity Commission. (1978). Uniform guidelines on employee selection proce-
dures. 29 CFR 1607.1. Washington, DC: Author.
U.S. General Accounting Office. (2003). Women’s earnings: Work patterns partially explain difference between
men’s and women’s earnings (GAO-04-35). Retrieved June 20, 2008, from http://www.gao.gov/new.items/
d0435.pdf
van Leest, P.F. (1997). Persoonlijkheidsmeting bij allochtonen [Assessment of personality for ethnic minori-
ties]. Lisse, The Netherlands: Swets & Zeitlinger.
Zeidner, M. (1986). Are scholastic aptitude tests in Israel biased toward Arab student candidates? Higher
Education, 15, 507–522.
Part 7
Employee Selection in Specific Organizational Contexts
Rick Jacobs and Ann Howard, Section Editors
31 Selection and Classification in the U.S. Military
The quality of a military, or any workforce for that matter, depends upon the quality of its people.
Successful attainment of military missions requires a force composed of dedicated, knowledge-
able, and competent members. When an organization can hire persons with prior experience, an
evaluation of past performance can serve as the primary criterion for selection and assignment
into jobs. Other characteristics such as aptitudes and education assume less importance. However,
when organizations hire young people without job experience, it becomes important to evaluate
aptitudes, education, interests, and other characteristics known to predict success in jobs sought by
the applicants.
The U.S. Department of Defense (DoD) is the world’s largest employer of young people.
Depending on personnel requirements, the DoD screens hundreds of thousands of youth for enlist-
ment annually. During the late 1970s, the DoD screened approximately 1 million applicants each
year; that number declined to “only” about 500,000 during the first years of the 21st century
(Sellman, 2001). As noted above, the military’s task in screening potential recruits is complicated
by the fact that the available personnel pool is composed predominantly of young men and women
who have never held a permanent full-time job. Consequently, the services must depend mainly on
indicators of potential performance such as aptitude and levels of education.
military’s action officers and are largely involved in the tactical level of the military organization.
Field-grade officers typically fill many operational-level positions and most command and staff
assignments. Lastly, the general or flag officers are the service executives and are primarily engaged
in strategic, policy-making decisions that affect the organization over the long term.
There are four principal paths that can be taken to become a commissioned officer. Two of
the primary officer commissioning programs, the service academies and the Reserve Officers
Training Corps (ROTC), are administered in conjunction with an individual’s undergraduate
academic studies. The two remaining principal commissioning programs, Officer Candidate/
Training School (OCS/OTS) and Direct Commissioning, are designed almost exclusively for
individuals who already possess at least a baccalaureate degree (U.S. Department of Defense,
2006).
Creating a new officer through either the service academies or ROTC is a 4-year process. The
services use OCS/OTS as a source for a specific portion of their new officers annually. In addi-
tion, in times of growth, OCS/OTS provides a quick-reaction surge capability that the longer-term
programs cannot match. Direct commissioning is normally reserved for people with professional
credentials (e.g., physicians, attorneys).
Unlike the Air Force, the Army, Navy, and Marine Corps have warrant officers who fill highly
specialized leadership positions. In contrast to their commissioned officer counterparts, whose
experiences are broad and service-encompassing, warrant officers are employed in
positions that require highly specialized technical or tactical skills (e.g., helicopter pilots). Selection
as a warrant officer is highly competitive and only available to those who meet rank and length-of-
service requirements in the enlisted force.
Unlike the civilian sector, the military is a completely closed personnel system; this
means that the services fill personnel vacancies with members already employed within their ranks.
American military leaders are “grown” from the junior ranks; the services do not hire military indi-
viduals to enter mid- or senior-level ranks. Because it takes years to successfully replace a member
who leaves the military, attracting officer and enlisted candidates is a high priority for military
policy-makers. Selecting the correct number of high-quality individuals each year is essential to
sustain a flow of seasoned leaders for the future.
Table 31.1 shows the names of the officer ranks as well as the number of officers in each service
as of December 31, 2007. The rank structure is analogous to a pyramid, with junior individuals serv-
ing as the base of the pyramid and outnumbering those individuals in the increasingly senior ranks.
The table also shows the same information for the enlisted ranks (highest is E-9 and the lowest is
E-1) but does not identify them by name because each service uses its own nomenclature.
Within the services, there are literally hundreds of military occupations. Although many are
similar to civilian jobs, there also are large numbers of occupations that are unique to the military.
The services categorize the plethora of job specialties into several broad occupational areas as
shown in Tables 31.2 and 31.3. Because of the large number of military enlistees (about 350,000
annually in the active and reserve components) who must be assigned into a large number of
military occupations, the services, unlike most civilian employers, must be proficient at job clas-
sification as well as personnel selection. However, classification to military occupations depends
on eligibility, individual preference, and availability of openings (U.S. Department of Defense,
2006b; Campbell & Knapp, 2001). With an enormous diversity of occupations, a vast number of
openings at specific positions, and a variety of individual skills, the challenge of classification is
appreciable.
TABLE 31.1
U.S. Department of Defense Current Officer, Warrant Officer,
Enlisted, and Cadet/Midshipmen Numbers by Service (a)
Rank/Grade—All Services
Army Navy Marine Corps Air Force Total
General-Admiral 11 11 3 15 40
Lieutenant General-Vice Admiral 52 34 15 31 132
Major General-Rear Admiral (U) 92 69 27 95 283
Brigadier General-Rear Admiral (L) 151 110 40 146 447
Colonel-Captain 4,084 3,128 699 3,385 11,296
Lieutenant Colonel-Commander 9,127 6,713 1,842 9,928 27,610
Major-Lieutenant Commander 15,436 10,324 3,633 14,723 44,116
Captain-Lieutenant 25,006 17,061 5,572 22,418 70,057
1st Lieutenant-Lieutenant (JG) 6,862 5,975 2,812 7,565 23,214
2nd Lieutenant-Ensign 9,944 6,239 3,019 7,104 26,306
Chief Warrant Officer W-5 456 61 84 n/a 601
Chief Warrant Officer W-4 2,382 257 268 n/a 2,907
Chief Warrant Officer W-3 3,369 780 531 n/a 4,680
Chief Warrant Officer W-2 4,493 503 758 n/a 5,754
Warrant Officer W-1 3,233 n/a 232 n/a 3,465
Total officers 84,698 51,265 19,535 65,410 220,908
E-9 3,580 2,844 1,555 2,688 10,667
E-8 11,498 7,122 3,591 5,148 27,359
E-7 39,119 23,632 8,121 26,112 96,984
E-6 61,332 49,654 13,725 43,209 167,920
E-5 81,674 68,861 28,351 69,288 248,174
E-4 118,117 51,928 37,147 52,330 259,522
E-3 63,316 43,855 38,170 47,348 192,989
E-2 33,037 18,193 19,867 6,397 77,494
E-1 21,327 14,476 16,147 10,340 62,290
Total enlisted 433,300 280,565 166,674 262,860 1,143,399
Cadets-midshipmen 4,390 4,384 0 4,393 13,167
Grand total 522,388 336,214 186,209 332,663 1,377,474
(a) The most recent figures for this table can be obtained from the website noted in the source line below.
Source: Adapted from http://siadapp.dmdc.osd.mil/personnel/MILITARY/rg0712.pdf
part recent high school graduates. For enlisted selection and classification, the Armed Services
Vocational Aptitude Battery (ASVAB) is the single test used to determine enlistment eligibility
of applicants for the Army, Navy, Air Force, and Marine Corps as well as their respective reserve
components. In addition, ASVAB is used to assign successful applicants to military occupations.
Although a part of the U.S. Department of Homeland Security, the Coast Guard also uses ASVAB
for personnel selection and job classification.
ASVAB is administered in computer adaptive and paper-and-pencil versions (Sands, Waters, &
McBride, 1997). The computer adaptive version is administered to about 70% of applicants at 65
Military Entrance Processing Stations across the country. The remaining 30% of applicants receive
the paper-and-pencil form at 650 remote, satellite testing sites (Sellman, 2004).
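The adaptive format's efficiency comes from tailoring item difficulty to the examinee as the test proceeds. As a rough illustration only, the sketch below shows adaptive item selection under a one-parameter (Rasch) model; the item pool, fixed step-size ability update, and five-item test length are invented for the example and do not describe the actual CAT-ASVAB algorithm.

```python
import math
import random

def prob_correct(theta, difficulty):
    # Rasch (one-parameter logistic) model: P(correct | ability theta,
    # item difficulty b).
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def next_item(theta, pool, answered):
    # Adaptive rule: administer the unanswered item whose difficulty is
    # closest to the current ability estimate (the most informative
    # item under the Rasch model).
    remaining = [b for b in pool if b not in answered]
    return min(remaining, key=lambda b: abs(b - theta))

def run_cat(respond, pool, n_items=5, step=0.5):
    # respond(b) returns True if the examinee answers an item of
    # difficulty b correctly. Theta starts at 0 (the population mean)
    # and takes a fixed step up or down after each response -- a crude
    # stand-in for a maximum-likelihood ability update.
    theta, answered = 0.0, []
    for _ in range(n_items):
        b = next_item(theta, pool, answered)
        answered.append(b)
        theta += step if respond(b) else -step
    return theta

# A simulated examinee of true ability 1.0 answering probabilistically:
examinee = lambda b: random.random() < prob_correct(1.0, b)
estimate = run_cat(examinee, pool=[-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
```

Because each item is chosen near the examinee's provisional ability, an adaptive test of this kind reaches a usable estimate with far fewer items than a fixed paper-and-pencil form, which is the practical motivation for administering the computer adaptive version where equipment permits.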
ASVAB is a battery comprising ten tests that measure verbal, mathematics, and science/techni-
cal skills and knowledge. The Armed Forces Qualification Test (AFQT), a composite of ASVAB
tests, measures verbal (word knowledge and paragraph comprehension) and mathematics (arithmetic
TABLE 31.2
FY2006 Active Component Enlisted Corps by Occupational Area and Service
Occupational Area
Service | Infantry, Gun Crews, and Seamanship | Electronics | Communications | Medical | Other Technical | Administrators | Electrical | Craftsman | Supply | Nonoccupational (a) | Total
Army 110,778 23,649 49,199 34,208 15,626 59,520 58,472 9,300 56,642 2,771 420,165
Navy 24,605 35,817 26,147 25,185 2,752 35,698 86,549 17,362 21,000 17,988 293,103
Marine 39,982 10,187 11,626 0 4,129 25,016 25,561 3,870 18,786 22,120 161,277
Corps
Air Force 27,945 23,970 23,434 20,688 11,083 56,331 64,963 11,765 13,939 19,872 273,990
DoD total 203,310 93,623 110,406 80,081 33,590 176,565 235,545 42,297 110,367 62,751 1,148,535
(a) Nonoccupational includes patients, students, those with unassigned duties, and unknowns.
TABLE 31.3
FY2006 Active Component Officer Corps by Occupational Area and Service
Occupational Area
Service | General Officers | Tactical Operations | Intelligence | Engineering and Maintenance | Scientists and Professionals | Health Care | Administration | Supply, Procurement, and Allied | Nonoccupational (a) | Total
Army 318 23,515 4,582 10,592 5,082 14,043 4,776 5,894 808 69,610
Navy 225 20,458 2,183 5,049 2,102 10,033 2,267 2,518 5,574 50,409
Marine Corps 83 8,343 848 1,345 454 0 982 2,115 2,272 16,442
Air Force 287 23,856 3,859 10,952 4,628 11,153 4,666 6,026 4,746 70,173
DoD Total 913 76,172 11,472 27,938 12,266 35,229 12,691 16,553 13,400 206,634
(a) Nonoccupational includes patients, students, those with unassigned duties, and unknowns.
Source: U.S. Department of Defense, Office of the Under Secretary of Defense for Personnel and Readiness, Population representation in the military services FY2006, Washington, DC,
2008.
reasoning and mathematics knowledge) abilities. AFQT¹ is the primary enlistment screen for all
services and is the DoD’s first index of recruit quality. The tests of science/technical knowledge
include general science, electronics information, mechanical comprehension, auto information,
shop information, and assembling objects.
On the basis of statistical validity analyses, the services combine the various ASVAB tests into
“aptitude area” composites, which are used to assign new recruits to military occupations. Each ser-
vice computes and uses its own set of composites for job classification, giving each service a degree
of flexibility to stipulate the necessary skills required to fill each job position (Diaz, Ingerick, &
Lightfoot, 2004; Lightfoot, Diaz, Heggestad, Darby, & Alley, 1999; Rumsey, Walker, & Harris,
1994; Waters, Laurence, & Camara, 1987). Although there has been some research to explore the
relationship between AFQT and service composite scores and performance in the second term of
enlistment (generally from 4 to 8 years of service), ASVAB validity is usually established using
entry-level training or first-term job performance as criteria (Oppler, McCloy, & Campbell, 2001).
Such service differences in composites make sense, even for what appear to be virtually identical
occupations (e.g., electronic repair specialists, motor mechanics, cooks, supply technicians, clerks).
The services have distinctly different functions that affect their need to fulfill their respective mis-
sions. For example, the Army and Marine Corps have extensive ground combat responsibilities
that are quite different from most Navy and Air Force activities. Certainly, a ship’s environment is
very different from that of an aircraft or tank. Consequently, for what is ostensibly the “same” job,
the particular equipment used by personnel in the different services may dictate a different mix of
abilities (Waters et al., 1987).
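The mechanics of combining subtests into aptitude-area composites can be sketched as follows. The subtest names follow the chapter's list, but the groupings and cut scores are hypothetical placeholders; each service's actual composite definitions and minimum scores differ and are not reproduced here.

```python
# Hypothetical aptitude-area composites built from ASVAB subtest
# standard scores. The groupings and cut scores are invented for
# illustration and are NOT any service's actual classification
# formulas.
COMPOSITES = {
    "mechanical_maintenance": ("mechanical_comprehension",
                               "auto_information", "shop_information"),
    "electronics": ("general_science", "electronics_information",
                    "mathematics_knowledge"),
}

def composite_scores(subtests):
    # Sum the relevant subtest scores for each aptitude area.
    return {area: sum(subtests[name] for name in names)
            for area, names in COMPOSITES.items()}

def qualifies(subtests, area, cut_score):
    # A recruit is eligible for occupations in an aptitude area when
    # the composite meets the service's minimum (cut) score.
    return composite_scores(subtests)[area] >= cut_score

scores = {"mechanical_comprehension": 52, "auto_information": 48,
          "shop_information": 55, "general_science": 60,
          "electronics_information": 45, "mathematics_knowledge": 50}
```

Because each service defines its own composite table, swapping in a different `COMPOSITES` mapping changes which occupations an applicant qualifies for without retesting, which is the flexibility the chapter describes.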
ASVAB is normed against a nationally representative sample of young people ages 18–23 who were
tested in 1997 as part of the Bureau of Labor Statistics’ National Longitudinal Survey of Youth
(Segall, 2004). Such norms allow the comparison of applicant and recruit aptitude levels with those
of the contemporary civilian youth population from which they come. AFQT scores are expressed
on a percentile scale and grouped into five categories for reporting purposes. Table 31.4 shows the
percentile score ranges and percent of civilian youth that correspond with each AFQT category.
Persons who score in Categories I and II tend to be above average in cognitive ability; those in
Category III, average; those in Category IV, below average; and those in Category V, markedly
below average. (Category III is divided at the 50th percentile into subcategories A and B. This
facilitates reporting the proportion of scores above and below the mean of the AFQT distribution.)
By law, Category V applicants and those in Category IV who have not graduated from high school
are not eligible for enlistment.
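The category boundaries in Table 31.4 and the statutory eligibility rule just described translate directly into a short sketch (percentile bands are taken from the table; everything else is a minimal illustration):

```python
def afqt_category(percentile):
    # Category boundaries from Table 31.4 (AFQT percentile scores).
    bands = [(93, "I"), (65, "II"), (50, "IIIA"),
             (31, "IIIB"), (10, "IV"), (1, "V")]
    for floor, category in bands:
        if percentile >= floor:
            return category
    raise ValueError("AFQT percentiles start at 1")

def eligible_to_enlist(percentile, high_school_graduate):
    # By law, Category V applicants are ineligible, as are Category IV
    # applicants who have not graduated from high school.
    category = afqt_category(percentile)
    if category == "V":
        return False
    if category == "IV" and not high_school_graduate:
        return False
    return True
```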
The best single predictor of successful adjustment to military life is possession of a high school
diploma. Consequently, the services also value recruits with high school diplomas because they are
more likely to complete an initial tour of duty than are enlistees with alternative credentials or non-
graduates. About 80% of high school diploma graduates complete their first 3 years of service, com-
pared to only 50% of high school dropouts (U.S. Department of Defense, 1996). Completion rates
for enlistees holding an alternative credential such as a General Education Development (GED)
certificate fall in between the high school diploma graduate and nongraduate rates (Elster & Flyer,
1981; Flyer, 1959; Laurence, 1984, 1997). Thus, educational achievement is the DoD’s second index
of recruit quality.
Over the past 25 years, there has been a proliferation of education credentials in the United
States. In addition to earning a regular high school diploma, young people can receive credentials
through adult education programs and home schooling, through experiential learning, and by taking
high school equivalency tests. The DoD uses a three-tier system to classify education credentials.
The system was developed after research indicated a strong relationship between level of education
1 The AFQT, as either a stand-alone test or a composite of ASVAB tests, has been in use for personnel selection since 1950.
Although its content has changed somewhat over the years, the AFQT has always been a measure of “g” with verbal and
math components.
Selection and Classification in the U.S. Military 685
TABLE 31.4
AFQT Categories by Corresponding Percentile Score Ranges
and Percentage of Civilian Youth Population
AFQT Categories Percentile Score Range Percentage of Civilian Youth
I 93–100 8
II 65–92 28
IIIA 50–64 15
IIIB 31–49 19
IV 10–30 21
V 1–9 9
and successful completion of the first term of military service (Laurence, 1997; U.S. Department
of Defense, 1996). Tier 1 includes regular high school diploma graduates, adult diploma holders,
and nongraduates with at least 15 hours of college credit. Tier 2 comprises alternative credential
holders such as those with GED diplomas or certificates of completion or attendance, and Tier 3 is
composed of non-high-school graduates.
The services prefer to enlist people in Tier 1 (high school diploma graduates) because they have
a higher likelihood of completing a first term of service than do individuals in Tiers 2 and 3 (e.g.,
GED holders or high school dropouts). Consequently, education standards refer to the application of
progressively higher aptitude test score minimum requirements for high school diploma graduates,
equivalency credential holders, and nongraduates, respectively (Laurence, 1984). The rationale for
this policy is based on the differential attrition rates of these three education groups. That is, mem-
bers of Tiers 2 and 3 are about twice as likely to leave service prematurely as those in Tier 1. Higher
aptitude requirements for Tiers 2 and 3 are used to accept only the “best” from the statistically less
successful and thus less preferred group of applicants (U.S. Department of Defense, 1996).
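The three-tier classification and the policy of progressively higher aptitude minimums can be sketched as follows. The credential labels mirror the text, but the numeric AFQT minimums are invented placeholders, since actual cutoffs vary by service and over time.

```python
# DoD three-tier education credential system as described in the text.
TIER_1 = {"high_school_diploma", "adult_diploma", "college_15_hours"}
TIER_2 = {"ged", "certificate_of_completion", "certificate_of_attendance"}

def education_tier(credential):
    if credential in TIER_1:
        return 1
    if credential in TIER_2:
        return 2
    return 3  # non-high-school graduates

# Illustrative tier-specific AFQT percentile minimums. These numbers
# express the policy of progressively higher requirements for Tiers 2
# and 3; they are placeholders, not real service cutoffs.
MIN_AFQT = {1: 31, 2: 50, 3: 65}

def meets_education_standard(credential, afqt_percentile):
    # Tier 2 and Tier 3 applicants must clear higher aptitude minimums
    # to offset their roughly doubled first-term attrition risk.
    return afqt_percentile >= MIN_AFQT[education_tier(credential)]
```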
2 “Access” is a term used by the U.S. military to indicate “entrance” into service. Thus, an “accession” is an entering service
member.
686 Handbook of Employee Selection
The most analyzed indicator of maladjustment to the military is first-term attrition, the failure to
complete an obligated period of service. According to the U.S. Government Accountability Office
(GAO), it cost $40,000 in 1997 to replace (recruit, train, and equip) each individual who failed to
successfully complete a first tour of duty (U.S. Government Accountability Office, 1997). Given
the substantial increase in recruiting resources associated with recruiting challenges brought on
by the war in Iraq, today that number is considerably higher (Stewart, 2005). There also are non-
pecuniary or indirect costs, which include force instability, lowered morale, and lack of readiness.
Individuals also may pay a personal price: Failure in military service may significantly affect their
future employment opportunities and earning potential. Consequently, it is in the interest of recruits
and the services to reduce first-term attrition (Strickland, 2005).
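Combining the 1997 GAO replacement-cost figure with the completion rates reported earlier gives a back-of-the-envelope sense of the stakes. The arithmetic below is purely illustrative and treats every non-completion as costing one replacement at the 1997 rate.

```python
# Expected replacement cost per accession, using the 1997 GAO figure
# of $40,000 per first-term failure and the completion rates quoted
# in the chapter (about 80% for high school diploma graduates vs.
# about 50% for dropouts). Illustrative arithmetic only.
REPLACEMENT_COST = 40_000  # dollars per failed first term (GAO, 1997)

def expected_replacement_cost(completion_pct):
    # completion_pct: percentage completing the first term (0-100).
    return (100 - completion_pct) * REPLACEMENT_COST / 100

diploma_graduates = expected_replacement_cost(80)  # $8,000 per accession
dropouts = expected_replacement_cost(50)           # $20,000 per accession
expected_savings = dropouts - diploma_graduates    # $12,000 per accession
```

On these figures, every dropout enlisted in place of a diploma graduate carries roughly $12,000 in extra expected replacement cost, before counting the nonpecuniary costs the text describes.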
Attrition of newly commissioned officers during their initial service commitment is generally not
a problem; however, officer retention beyond that initial service commitment is a constant concern.
Because a service academy education represents a substantial investment (with cost estimates rang-
ing as high as $403,000 per graduate), the services need many of those officers to remain in service
far past their initial obligation.
The all-volunteer military evolved from criticism of the inequities of conscription during the Vietnam
War—who shall serve when not all serve? In the late 1960s, President Richard Nixon established
a commission to develop a comprehensive plan for eliminating conscription and moving toward an
all-volunteer force. The commission built a case for a volunteer military by pointing out the unfair-
ness of conscription, establishing the feasibility of a volunteer force on economic grounds, and
refuting all major arguments against ending conscription and relying totally on volunteers (Lee &
Parker, 1977; Gates, 1970).
The commission believed that sufficient numbers of qualified youth could be persuaded to
volunteer by increasing military pay to levels more competitive with civilian wages. They dis-
puted claims that total reliance on volunteers would lead to a mercenary force consisting mainly
of minorities, the poor, and the uneducated, and to a loss of civilian control. After much debate within
the Administration and Congress and across the country, it was decided that an all-volunteer force
was feasible, affordable, and would not jeopardize the nation’s security (Rostker, 2006; Defense
Manpower Commission, 1976). Thus, the authority for conscription was allowed to lapse on July 1,
1973, and the last conscript entered the Army in December 1972.
With adequate resources and support to attract and retain the brightest personnel, conscription is
not needed to meet future military personnel requirements (Bicksler & Nolan, 2006). An all-volunteer
force is more expensive than a conscription force in terms of military compensation and funds for
advertising and enlistment incentives. However, a voluntary military is less expensive in overall costs
(Fredland, Gilroy, Little, & Sellman, 1996; Lee & McKenzie, 1992; Warner & Asch, 1996). It is
more stable and career-oriented, thereby leading to extra performance and experience with reduced
training and other turnover costs (Oi, 1967). During conscription, 10% of new inductees reenlisted;
today’s new recruits reenlist at a 50% rate (Rostker, 2006). In short, military service is an economi-
cally rational choice for high-quality men and women looking for an edge on life. The military also
is a good choice for people who want to serve a greater cause (Bicksler, Gilroy, & Warner, 2004).
During the first years of the all-volunteer force, the AFQT was used to identify individuals who
had a reasonable probability of success in service, and other service-specific tests were required for
job classification. The Army Classification Battery, the Navy Basic Test Battery, and the Airman
Qualifying Examination, just to name a few, were used from the late 1950s to the mid-1970s
(Waters, 1997). During this period, the AFQT was administered to military applicants (including
draft inductees) at Armed Forces Examining and Entrance Stations (AFEES) across the country for
selection purposes. Because women were not subject to the draft, a different aptitude test was used
for female applicants for enlistment. The Armed Forces Women’s Selection Test was administered
to female applicants in lieu of the AFQT from 1956 to 1974. If individuals successfully “passed” the
AFQT and were accepted for service, they were sent to basic training, although the specific occu-
pation to which they would be assigned had not yet been determined. During basic training, new
enlistees were administered their service’s classification tests and were assigned to their appropriate
military occupations.
During the mid-1970s, DoD determined that a single test should be used for both aptitude screening
and job placement, resulting in the development and implementation of the ASVAB, which is still
in use today (Sellman & Valentine, 1981). The ASVAB enabled DoD to screen applicants, match
them with job positions, reserve job-skill training for those who qualified, and provide a uniform
standard measure on which all applicants could be ranked. This was a departure from previous
procedures, under which selection testing was conducted at AFEES during the entrance process
(for either enlistment volunteers or draft inductees) and classification testing was accomplished at
service basic training centers before new enlistees were assigned to military occupations and sent
for job-skill training.
Combining selection and classification testing at the AFEES made the testing process more
expedient for the newly implemented all-volunteer military. Young people volunteering
for enlistment would take one test and come away from the AFEES knowing not only if they quali-
fied for enlistment, but, if qualified, also the military occupation to which they would be assigned.
Thus, the new testing process enabled the services to improve the matching of applicants with
available occupations before they actually reported for duty and allowed job guarantees for individuals
qualified for enlistment.
With the end of conscription and the advent of the all-volunteer force, there has been a
significant change in the composition of new recruit cohorts (Sellman, Carr, & Lindsley, 1996). The
percentage of female accessions has more than tripled, rising from 5% in 1973 (Goldman, 1973)
to approximately 17% in 2006 among nonprior service members (Manning & Griffith, 1998; U.S.
Department of Defense, 2008). Although the services have increased their proportions of women, youth propensity polls indicate that young women remain approximately 50% less likely than young men to express an interest in joining the military (Ramsberger, 1993; Sackett & Mavor, 2004; U.S. Department of Defense, 2008).
The percentage of Black enlisted accessions also rose, with some fluctuation, following the end
of the draft (MacGregor, 1981). Increases in the proportion of Black accessions coincided with the
ASVAB misnorming, which led to erroneous enlistment of many low-scoring applicants. Thus,
representation of Blacks—whose test scores are generally lower than those of Whites—increased
during the misnorming period. In the early 1980s, revised standards corrected the ASVAB scoring
error. As young Black men and women increasingly viewed the military as an opportunity for upward
mobility, a gradual increase in Black accessions ensued through the early 1990s. Participation of Black enlisted members in the active component has remained relatively stable at around 20% into the 21st century (U.S. Department of Defense, 2008).
Hispanics make up a much smaller but growing proportion of the military services than Blacks.
Enlisted Hispanics comprised just over 1% in the early 1970s, but by the late 1980s, that percent-
age had increased to nearly 5%. There has been a steady increase in enlisting men and women
of Hispanic descent ever since. However, with 11% of active duty enlisted members counted as
Hispanic in 2006, this group remained underrepresented relative to the growing comparable civil-
ian population (17%) (U.S. Department of Defense, 2008).
1980; Means, Nigam, & Heisey, 1985; Shields & Grafton, 1983). As a consequence of the misnorm-
ing, members of Congress and policy-makers in DoD became interested in the methods used to set
enlistment standards and to establish recruit quality requirements.
In the congressional view, the fact that the ASVAB traditionally had been validated against suc-
cess in training rather than on-the-job performance was potentially problematic. Supporting studies
regarding the relationship between recruit quality and military performance lacked persuasive power
because proxy measures (e.g., attrition, promotion rates, or reenlistment eligibility) were used rather
than actual measures of job performance. Congressional scrutiny of the ASVAB misnorming and
surrounding issues of recruit quality and entry standards led to the Joint-Service Job Performance
Measurement/Enlistment Standards Project—hereafter referred to as the JPM Project.
The JPM Project comprised three phases: (a) determine the feasibility of measuring hands-on
job performance; (b) if feasible, validate ASVAB against on-the-job performance; and (c) develop
an enlistment standards cost/performance trade-off model that linked recruit quality, recruiting
resources, and job performance. The overall project strategy called for each service to develop and
demonstrate various job performance measurement approaches that could be used to link enlistment
standards to job performance (U.S. Department of Defense, 1991; Wigdor & Green, 1986, 1991).
Each service developed and demonstrated hands-on job performance measures in several mili-
tary occupations. These job performance measures were used to evaluate certain surrogate measures
of performance (less expensive, easier to administer tests or existing performance information) as
substitutes for the more expensive, labor-intensive, hands-on job performance tests (Armor & Roll,
1984; Green, Wing, & Wigdor, 1988). The performance tests consisted of tasks sampled from the task domains of the chosen military occupations, on which examinees (job incumbents) were evaluated. These measures were designed to replicate actual job performance yet provide objective evaluation of the performance demonstrated.
Integration of the different service research efforts into a joint service product was accomplished
through development of a common data analysis plan. These analyses (a) described the distributions
of hands-on performance test scores, aptitude scores, job experience, and educational attainment;
(b) assessed the reliability of the hands-on performance test scores; and (c) measured the degree of
relationship (i.e., correlation) between the performance test scores and other variables of interest.
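The three analysis steps above can be sketched with simulated data (all numbers here are invented for illustration, not JPM data), using split-half reliability stepped up by the Spearman-Brown formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incumbents: an aptitude percentile plus two parallel halves
# of a hands-on performance test (signal proportional to aptitude).
n = 500
aptitude = rng.uniform(10, 99, n)
half_a = 0.4 * aptitude + rng.normal(0, 9, n)
half_b = 0.4 * aptitude + rng.normal(0, 9, n)
hands_on = half_a + half_b

# (a) describe the score distributions
print(f"hands-on: mean={hands_on.mean():.1f}, sd={hands_on.std(ddof=1):.1f}")

# (b) split-half reliability, stepped up with the Spearman-Brown formula
r_halves = np.corrcoef(half_a, half_b)[0, 1]
reliability = 2 * r_halves / (1 + r_halves)

# (c) correlation between performance scores and a variable of interest
validity = np.corrcoef(hands_on, aptitude)[0, 1]
print(f"reliability={reliability:.2f}, validity={validity:.2f}")
```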
These tests were administered to 8,000 incumbent, first-term soldiers, sailors, airmen, and
marines assigned to 24 different occupations (U.S. Department of Defense, 1991). The occupations
were selected to be representative of all military occupations, with large numbers of recruits enter-
ing job skill training (McCloy, 1994). The examinees averaged 25.1 months in service, and the aver-
age AFQT score was 55.1 on a 100-point percentile scale (U.S. Department of Defense, 1991).
The average reliability coefficient for the performance tests across all 24 occupations in the JPM
Project was .72 (U.S. Department of Defense, 1991). The measures of reliability showed an acceptable degree of consistency in the performance test scores, and the services concluded that a reliable benchmark measure had been developed against which to compare the various surrogate measures of job performance (U.S. Department of Defense, 1991). Those surrogate measures could be used in subsequent selection and classification research.
The correlation between AFQT and hands-on performance tests, corrected for restriction in
range, yielded an average validity coefficient of .40 (U.S. Department of Defense, 1991). This level
of validity is of interest because the AFQT is a test of general aptitude, whereas the performance test
scores reflected observable performance in different types of occupations. Thus, the JPM project established the link between measured aptitude for a job and the demonstrated ability to perform it. Considering the nature of the performance test criterion, a validity coefficient of .40 compared
well with other military validity studies (Armor & Sackett, 2004).
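The restriction-of-range correction mentioned above can be computed with the standard Thorndike Case II formula; the inputs below are illustrative, not the actual JPM values:

```python
from math import sqrt

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction: estimate the validity coefficient in the
    full applicant population from the correlation observed in the
    range-restricted (already-selected) group."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / sqrt(1 - r_restricted**2 + (r_restricted * u)**2)

# Illustrative numbers: an observed validity of .28 among incumbents, with the
# applicant SD about 1.5x the incumbent SD, corrects to roughly .40.
print(round(correct_for_range_restriction(0.28, 30.0, 20.0), 2))  # → 0.4
```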
The job performance measurement research performed by the services provided performance
measures that closely replicated actual job performance. Rather than assessing, via a paper-and-
pencil test, what enlisted personnel might know about calibrating a piece of precision avionics
equipment or operating a weapon’s targeting system, the services were able to assess how well
enlisted job incumbents did such tasks. Although the two are related, knowledge about a job is
not the same thing as being able to do the job. Typically, the (corrected) validities of military apti-
tude tests for predicting training success or supervisor ratings have ranged between .30 and .60
(Hartigan & Wigdor, 1989).
Research shows a strong relation between ASVAB (including AFQT) scores and success in
military job skill training and hands-on job performance across a range of occupations (Campbell,
1990; Claudy & Steel, 1990; Dunbar & Novick, 1988; Earles & Ree, 1992; Holmgren & Dalldorf,
1993; Hunter, Crosson, & Friedman, 1985; Mayberry & Carey, 1997; Welsh, Kucinkas, & Curran,
1990; Wigdor & Green, 1991). The services value recruits with above-average aptitude because they
are more trainable and their job performance is superior to that of their lower-scoring peers. Even
with on-the-job experience, enlistees with lower aptitude continue to lag behind those with higher
aptitude. As is shown in Figure 31.1, below-average (AFQT Category IV) recruits require more than
3 years of experience to attain the level of performance at which the higher aptitude recruits (AFQT
Categories I-II) begin (Armor & Roll, 1994; Armor & Sackett, 2004; U.S. Department of Defense,
1991). Higher-aptitude personnel also experience fewer disciplinary problems.
The information shown in Figure 31.1 came from the JPM project (U.S. Department of Defense,
1991). Although collected more than a decade ago, these job performance data continue to be the
best source of information about the job performance of enlisted personnel. For one thing, research
has consistently demonstrated that cognitive ability, such as is measured by AFQT, is a strong
predictor of job performance across a variety of occupations (Campbell, 1990; Hunter & Hunter,
1984; Schmitt, Gooding, Noe, & Kirsch, 1984; Welsh, Watson, & Ree, 1990). In addition, recent interviews with military training specialists responsible for the occupations used in the research indicated that the occupations had changed little since the original job performance data were collected. Thus, it is safe to generalize from these data and to conclude that the relation between aptitude, experience, and job performance is still pertinent.
One of the major objectives of the JPM project was development of a mathematical model to
link recruit quality, recruiting resources, and job performance. Working with the National Research
Council, in 1991 the DoD used that model to establish the DoD recruit quality benchmarks (Sellman,
1997). In general, enlistment standards are based on judgments by service policy-makers as to the
level of job performance required. However, standards should be guided by empirical evidence
FIGURE 31.1 Hands-on job performance scores as a function of aptitude and experience. (Performance curves for AFQT Categories I-II, IIIA, IIIB, and IV, plotted against job experience in months: 0-12, 13-24, 25-36, 37+.) Note: AFQT percentile: I (93-99); II (65-92); IIIA (50-64); IIIB (31-49); IV (10-30).
of the relationship between recruit quality and the required level of performance. Although it is extremely difficult to specify an absolute level of performance sufficient to guarantee successful military mission accomplishment, the research performed within the JPM project developed reliable and valid measures of individual job performance that became the basis for the linkage model.
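The trade-off logic of such a linkage model can be illustrated with a deliberately simplified sketch (all costs, performance levels, and the performance target are assumed numbers, not DoD's actual model):

```python
# Toy cost/performance trade-off: find the cheapest mix of high- and
# lower-aptitude recruits that still meets a mean-performance target.
COST_HIGH, COST_LOW = 18000.0, 9000.0  # recruiting cost per accession (assumed)
PERF_HIGH, PERF_LOW = 54.0, 42.0       # expected hands-on score (assumed)
TARGET_MEAN = 50.0                     # required mean performance (assumed)

def cheapest_mix(step=0.01):
    """Grid-search the high-aptitude share p; return the feasible mix
    with the lowest cost per recruit."""
    best = None
    p = 0.0
    while p <= 1.0:
        mean_perf = p * PERF_HIGH + (1 - p) * PERF_LOW
        cost = p * COST_HIGH + (1 - p) * COST_LOW
        if mean_perf >= TARGET_MEAN and (best is None or cost < best[1]):
            best = (p, cost)
        p = round(p + step, 2)
    return best

share_high, cost_per_recruit = cheapest_mix()
print(f"high-aptitude share={share_high:.2f}, cost per recruit=${cost_per_recruit:,.0f}")
```

With these assumed numbers, meeting the target requires about two-thirds high-aptitude accessions; raising the target or recruiting costs moves the optimum, which is the kind of sensitivity the linkage model lets policy-makers explore.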
For years, industrial psychologists contended that job performance was the ultimate criterion for
validating selection tests. In fact, S. Rains Wallace (1965), an eminent psychologist, once called it
the holy grail of industrial psychology. Measuring job performance is a very expensive proposition.
With the support of Congress and the DoD's effort to recover from the embarrassing misnorming episode, $40 million was made available for the JPM project. Another aspect that made this research effort unique was its longevity: widely recognized as a project of great merit, it lasted for over 15 years, spanning five presidential administrations, both Democratic and Republican.
Wise et al. research, which looked at applicants across all services. In responding, the chair of that
committee noted:
The conclusions from the analyses—that the ASVAB technical composites are fair and sensitive—are clear and compelling, and the use of the same enlistment standards and qualification scores for military occupations for all young people is justified. (Drasgow, 1992, p. 2)
ENLISTMENT PROCESS
Young men and women interested in joining the military enter the enlistment process by contacting service recruiters. In addition to providing information about service life, opportunities, and
benefits, recruiters also begin the initial screening of applicants. Most prospects take an enlistment-screening test at a recruiting office, which is used to predict the likelihood of “passing” the AFQT. Estimates are that 10–20% of prospects do not continue beyond this point
(U.S. Department of Defense, 2004).
There are multiple requirements that must be met before applicants are selected for service. After
recruiters have completed the preliminary screening and prospects have decided to enlist, they can
go either to a Military Entrance Processing Station (MEPS3) or a military entrance test (MET) site
to take the ASVAB. The military and civilian staffs at MEPS evaluate applicants' medical4 qualifications, aptitude, and moral character against standards predetermined by the services. Some services also require a test of physical ability at the MEPS.
If an applicant achieves qualifying ASVAB scores and wants to continue the application pro-
cess, a physical examination and background review is conducted at the MEPS. The physical exam
assesses medical fitness for military service and includes the measurement of blood pressure, pulse,
visual acuity and hearing; blood testing and urinalysis; drug and HIV testing; and medical his-
tory. If a correctable or temporary medical problem is detected, applicants may be required to get
treatment before proceeding. Other applicants may require a service waiver for some disqualifying
medical conditions before being allowed to enlist (Sackett & Mavor, 2006).
Furthermore, applicants must meet rigorous moral character standards. Applicants undergo
detailed interviews covering any involvement with civil law enforcement (e.g., arrests, convictions)
and some undergo a financial check or computerized search for criminal records. Some types of
criminal activity are immediately disqualifying; other cases may offer the possibility of a waiver
of the rule, wherein the services examine applicants’ circumstances and make an individual deter-
mination of qualification (Putka, Noble, Becker, & Ramsberger, 2004). Moreover, applicants with
existing financial problems are not likely to overcome those difficulties on junior enlisted pay.
Consequently, credit histories may be considered as part of the enlistment decision.
If the applicant’s ASVAB score, education credentials, medical fitness, and moral character
qualify for entry, the applicant meets with a service classification counselor at the MEPS to
discuss options for enlistment (Sackett & Mavor, 2003). The counselor considers the appli-
cant’s qualifications along with service training or skill openings, schedules, and enlistment
incentives. In this classification process, high-scoring recruits are discouraged from choosing
jobs that require only low aptitude, and recruits who want to enter jobs for which they barely
meet the standard but who have high aptitudes in other areas, are encouraged to choose jobs for
which they are better qualified. Each service has incorporated its algorithms into computerized job reservation systems that counselors at MEPS use to match individuals' desires with the needs of the services; one component of those needs is how well recruits' ASVAB scores suit them for the various jobs.
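A greatly simplified sketch of such a reservation-and-steering algorithm (invented jobs, cutoffs, and seat counts; not any service's actual system) might look like this:

```python
# Each job has a minimum composite score and a count of remaining
# training seats (all values invented for illustration).
jobs = {
    "avionics": [60, 1],
    "vehicle maintenance": [45, 2],
    "supply": [31, 3],
}

def reserve_job(score):
    """Offer the open job with the highest qualifying standard the recruit
    meets (steering high scorers away from low-aptitude jobs), and
    decrement that job's available seats."""
    eligible = [(req, name) for name, (req, seats) in jobs.items()
                if seats > 0 and score >= req]
    if not eligible:
        return None
    req, name = max(eligible)
    jobs[name][1] -= 1
    return name

print(reserve_job(72))  # high scorer -> avionics
print(reserve_job(70))  # avionics seat now gone -> vehicle maintenance
print(reserve_job(35))  # qualifies only for supply
```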
3 MEPS is the current name for the 65 enlistment processing centers located across the country; MEPS replaces the earlier term Armed Forces Examining and Entrance Stations.
4 Although the Americans with Disabilities Act (ADA) generally precludes employers from administering any type of
medical exam before making a job offer, the military services are explicitly exempt from that requirement.
Generally, those who score higher on tests will have more occupational options. Although the process differs by service, specific skills and occupational groupings are arranged similarly to an airline reservation system, with the training “seat” and time of travel (to recruit training) based on
the school or the field unit position openings. Using enlistment incentives (cash bonuses or extra
money that can be used to cover college costs), recruiters may encourage the applicant to choose
hard-to-fill occupational specialties. Ultimately, it is the applicant's decision to accept or reject the offer. Some discuss their options with family and friends, and others decide not to enlist (Sackett & Mavor, 2006).
(GPA) and class rank, and extracurricular activities may all be weighed together to derive an applicant's numerical score. The service academies draw on highly selective national pools of high school graduates, and their classes look very much like students entering highly competitive civilian universities,
with comparable GPA and SAT/ACT scores. Most Army and Air Force cadets and Navy midshipmen
are nominated by their local U.S. senator or congressional representative (Segal & Segal, 2004).
Selection for nonscholarship ROTC varies by service wherein candidates may be selected based
on service-specific test scores in conjunction with fulfillment of other academic, physical fitness,
and experience-based criteria. Factors considered for OCS/OTS include the Physical Aptitude
Examination, college GPA, letters of recommendation, college major, an interview by a selec-
tion board, and scores on service-specific officer selection measures (e.g., Army Officer Aptitude
Rating, Air Force Officer Qualifying Test) (Eitelberg, Laurence, & Brown, 1992). Individuals enter-
ing a direct commissioning program have completed graduate programs and are not subject to
selection testing. Instead, they are evaluated by an entrance board to ensure their adherence to the
DoD quality standards for members of their professions (attorneys, physicians, and other healthcare
professionals).
Most individuals interested in attending one of the service academies must first receive a congres-
sional appointment.5 It is not necessary for individuals to personally know the member of Congress,
because most Congressional offices use a strictly competitive process based on college admission test
scores, academic grades, and leadership performance in high school and other organizations to make
their appointments. Additionally, to qualify to attend one of the service academies, applicants may not be married or have any dependents. Individuals who are married or have a dependent and wish to become an officer may instead complete ROTC or OCS/OTS to obtain their commission.
Once accepted to a service academy, individuals become cadets/midshipmen and receive a
Bachelor of Science degree at the completion of their 4-year program. While attending the academy,
the cadets/midshipmen are provided free room and board, tuition, medical and dental care, and a
monthly allowance. Further, a cadet/midshipman’s academy life typically revolves around educa-
tion, military training, and physical fitness.
In conjunction with their academic studies, the cadets/midshipmen participate in military train-
ing in leader and follower roles. As they advance through the academy, they are exposed to a mul-
titude of leadership activities designed to hone their leadership skills in preparation for active duty.
Additionally, there is a significant emphasis on physical education, whereby they continually train
in intercollegiate sports, intramural sports, clubs, or physical fitness classes. Upon completion of
their degree, cadets/midshipmen receive a commission in the U.S. military with an active duty
service commitment of 5 years, plus 3 years in the active or inactive reserve. Selection for further
training after commissioning may incur an additional active duty or reserve commitment.
The largest source of commissioned officers is ROTC. Historically, ROTC was designed for
officers who would enter into reserve status. However, following World War II, officers receiving
their commissions via ROTC had the option to enter active duty upon graduation, and in the decades
that followed, larger percentages did so. Today, only the Army continues to offer the opportunity to
commission directly into the reserve components from ROTC; the majority of Army cadets and all
other service cadets/midshipmen commission directly onto active duty. To participate in ROTC,
individuals must register for and complete military/naval science classes in addition to normal aca-
demic coursework through the university. ROTC classes range from 2 to 3 hours each week in addi-
tion to a weekly leadership laboratory.
The ROTC classroom activities include various instructional methods designed to educate
cadets/midshipmen on the military culture. Leadership laboratories provide the environment for
cadets/midshipmen to hone their leadership skills as the upperclassmen assume military leadership
positions within the ROTC unit and train the underclassmen. In addition, cadets/midshipmen must
5 Some applicants (e.g., children of military personnel who have been awarded the Medal of Honor, military enlisted
personnel on active duty) compete for appointments separately from the Congressional category.
attend various summer training programs throughout their college years: summer cruises for mid-
shipmen and various levels of field training for Army and Air Force cadets. Successful completion
of summer training programs is mandatory for all ROTC students.
While attending ROTC, only scholarship cadets/midshipmen receive some portion of room and board and/or tuition. All contracted cadets/midshipmen (which can include nonscholarship students who are at least sophomores) are eligible for medical and dental benefits and a monthly allowance. Upon graduation, the cadets/midshipmen enter active duty with anywhere from a 2- to 4-year active duty service obligation, depending on whether, and what type of, scholarship they received. As with academy cadets/midshipmen, ROTC graduates receiving
further training after commissioning (e.g., pilots or nuclear officers) may incur additional active
duty obligations.
The most flexible commissioning program is OCS/OTS. This program accepts applicants who already possess at least a bachelor's degree and is designed to select those with particular skills or aptitudes to fill service-manning requirements not met through the academies or ROTC. Training occurs during one or two short (usually 2- to 4-month) training cycles. This training is unique to each service on the basis of its culture and requirements and is generally similar to basic training for enlisted personnel, with the added aspect of leadership training. Active duty service commitments upon commissioning range from 2 to 4 years, depending on the service and other factors. Because of the shorter time span required for this commissioning option, OCS/OTS serves as the short-term officer accession valve: the services can increase or decrease enrollment on the basis of officer requirements and the projected number of academy and ROTC graduates, ensuring that the congressionally authorized supply of commissioned officers is maintained.
The smallest and most specialized commissioning method is through direct appointment. This
program is designed for those individuals who currently possess an advanced degree and wish
to enter the military in the fields of medicine, dentistry, law, or the chaplaincy. Upon selection,
individuals are immediately commissioned and subsequently attend a short training course to pre-
pare them for the military. These officers receive wages comparable to civilian professionals by
commissioning into the military at a higher rank commensurate with their expertise and experience. Once on active duty, healthcare professionals are offered additional bonus pays of varying levels on the basis of their particular specialty. Further, certain health professionals may apply to the Uniformed Services University of the Health Sciences, which offers a salary and free
tuition in a program leading to a doctor-of-medicine (MD) degree. In return, graduates owe a 7-year
commitment to the military or the U.S. Public Health Service.
to implement a higher promotion rate. For 2007, the promotion rate to Captain was 98% and the
promotion rate to Major was 97%, which are both substantially higher than the Defense Officer
Personnel Management Act’s goals of 90% and 80%, respectively (U.S. Government Accountability
Office, 2007). Lastly, the Army began offering eligible Captains a retention menu of incentives for
additional service to include: graduate school or military school opportunities; branch or post of
choice; or a Critical Skills Retention Bonus (CSRB).
CONCLUSIONS
Since the advent of the Army Alpha and Beta in 1917, the U.S. military has been on the cutting
edge regarding personnel selection, and, later, job classification. During World War II, many
eminent psychologists participated in the Army Air Corps psychology program, focusing on
aircrew selection and supporting measurement techniques (Flanagan, 1948). In the 1980s, the
DoD sponsored research to refine item response theory and to develop groundbreaking techniques
to calibrate paper-and-pencil tests to computer adaptive versions. This led to implementation of
the Computer Adaptive ASVAB—the first and certainly the largest adaptive “employment” test
program in use today. The job performance measurement project, conducted over a 15-year period
in the 1980s and early 1990s, demonstrated that ASVAB was a valid predictor of hands-on job
performance and provided the foundation for the DoD model linking recruit quality and recruiting
resources to job performance. This model is used to set recruit quality benchmarks and to establish
and defend DoD’s recruiting budget. That the model has been widely accepted by Congressional
staffers and analysts at the Office of Management and Budget is testimony to the quality of the
science underpinning the effort.
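The core idea of adaptive testing can be sketched in a few lines: after each response, administer the unanswered item that is most informative at the current ability estimate. The item parameters and the simple update rule below are illustrative only, not the operational ASVAB procedure:

```python
import math

# Hypothetical 2PL items: (discrimination a, difficulty b)
items = [(1.2, -1.0), (0.9, 0.0), (1.5, 0.5), (1.1, 1.5)]

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1 - p)

def next_item(theta, used):
    """Pick the unused item with maximum information at theta."""
    return max((i for i in range(len(items)) if i not in used),
               key=lambda i: information(theta, *items[i]))

theta, used = 0.0, set()
for response in [1, 1, 0]:          # hypothetical right/right/wrong answers
    i = next_item(theta, used)
    used.add(i)
    a, b = items[i]
    # crude gradient step on the response log-likelihood
    theta += 0.5 * a * (response - p_correct(theta, a, b))
print(f"ability estimate after 3 items: {theta:.2f}")
```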
Implementation of the new selection and classification procedures should affect the military in four
significant ways: (a) increase personnel productivity, (b) increase job satisfaction and commitment,
(c) reduce first-term attrition, and (d) reduce adverse impact against women and minorities. To empha-
size the significance of the potential benefits of enhancing military selection and classification, con-
sider that approximately 30% of those who entered the military during FY 2008 will fail to complete
their first 3-year term of service. The GAO estimated in 1997 that it cost DoD about $40,000 to recruit,
train, and equip each replacement for an individual who prematurely leaves the service (U.S. General
Accounting Office, 1997). That figure is probably approaching $45,000 today. Thus, any increase in
job performance or decrease in attrition will improve military readiness and save valuable resources.
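The arithmetic behind these figures is straightforward; the annual accession count below is an assumed round number for illustration:

```python
# Rough cost-of-attrition arithmetic using the figures cited above.
accessions = 180_000           # hypothetical annual enlisted accessions
attrition_rate = 0.30          # share failing to complete the first term
cost_per_replacement = 45_000  # approx. cost to recruit, train, and equip

replacements = int(accessions * attrition_rate)
total_cost = replacements * cost_per_replacement
print(f"{replacements:,} replacements ~ ${total_cost / 1e9:.1f} billion")
```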
In summary, as the United States military transforms itself in the 21st century, military selection
and classification methods will undoubtedly change as well to meet the needs of the all-volunteer
force. Today, as it was several decades ago, the ASVAB continues to serve as the principal screening
tool for selection and job classification for enlisted personnel. With the end of conscription and the
inception of the all-volunteer force, each new recruit represents an investment and with millions of
dollars and national security at stake, those whom we select today will represent us in the future as
leaders of the military services.
REFERENCES
Armor, D. J., & Roll, C. R. (1994). Military manpower quality: Past, present, and future. In B. F. Green &
A. S. Mavor (Eds.), Modeling cost and performance for military enlistment (pp. 13–34). Washington,
DC: National Academy Press.
Armor, D. J., & Sackett, P. R. (2004). Manpower quality in the all-volunteer force. In B. A. Bicksler, C. L.
Gilroy, & J. T. Warner (Eds.), The all-volunteer force: Thirty years of service (pp. 90–108). Washington,
DC: Brassey’s.
Bicksler, B. A., Gilroy, C. L., & Warner, J. T. (2004). The all-volunteer forces: Thirty years of service.
Washington, DC: Brassey’s.
Bicksler, B. A., & Nolan, L. G. (2006). Recruiting the all-volunteer force: The need for sustained investment in
recruiting resources. Policy Perspectives, 1, 1–27.
Boldt, R. F. (1980). Check scaling of the AFQT 7C portion of the Armed Services Vocational Aptitude Battery
Form 7, and General Classification Test Form 1C to the Armed Forces Qualification test scale. Princeton,
NJ: Educational Testing Service.
Brown, D. G., Stawarski, C. A., Sellman, W. S., & Warthen, M. S. (2008). Using a Delphi procedure to establish
ASVAB research priorities. Alexandria, VA: Human Resources Research Organization.
Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology.
In M. D. Dunnette & L. J. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed.,
Vol.1, pp. 687–732). Palo Alto, CA: Consulting Psychologists Press.
Campbell, J. P., & Knapp, D. J. (2001). Exploring the limits in personnel selection and classification. Mahwah,
NJ: Lawrence Erlbaum.
Claudy, J. G., & Steel, L. (1990). Armed Services Vocational Aptitude Battery (ASVAB): Validation for civilian
occupations using National Longitudinal Survey of Youth data (AFHRL-TR-90-29). Brooks AFB, TX:
Manpower and Personnel Division, Air Force Human Resources Laboratory.
Defense Manpower Commission. (1976). Defense manpower: The keystone of national security. Report to the
President and the Congress. Washington, DC: Author.
Diaz, T. E., Ingerick, M. J., & Lightfoot, M. A. (2004). New Army aptitude areas: Evaluation of new compos-
ites and job families for Army classification (FR-04-29). Alexandria, VA: Human Resources Research
Organization.
Drasgow, F. (1992, September). Review of sensitivity and fairness of the Armed Services Vocational Aptitude
Battery technical composites. Letter from the Chairman, Defense Advisory Committee on Military
Personnel Testing to the Director for Accession Policy, Office of the Assistant Secretary of Defense
(Force Management and Personnel). Champaign-Urbana: University of Illinois.
Drasgow, F., Embretson, S. E., Kyllonen, P. C., & Schmitt, N. (2006). Technical review of the Armed Services
Vocational Aptitude Battery (ASVAB) (FR 06-25). Alexandria, VA: Human Resources Research Organization.
Dunbar, S. B., & Novick, M. R. (1988). On predicting success in training for men and women: Examples from
Marine Corps clerical specialties. Journal of Applied Psychology, 73, 545–550.
Earles, J. A., & Ree, M. J. (1992). The predictive validity of the ASVAB for training grades. Educational and
Psychological Measurement, 52, 721–725.
Eitelberg, M. J. (1988). Manpower for military occupations. Washington, DC: Office of the Assistant Secretary
of Defense (Force Management and Personnel). Human Resources Research Organization.
Eitelberg, M. J., Laurence, J. H., & Brown, D. C. (1992). Becoming brass: Issues in the testing, recruiting, and
selection of American Military Officers. In B. R. Gifford & L. C. Wing (Eds.), Test policy in defense:
Lessons from the military for education, training and employment (pp. 79–219). Boston, MA: National
Commission on Testing and Public Policy.
Eitelberg, M. J., Laurence, J. H., & Waters, B. K. (with Perelman, L. S.). (1984). Screening for service: Aptitude
and education criteria for military entry. Washington, DC: Office of the Assistant Secretary of Defense
(Manpower, Installations, and Logistics). Human Resources Research Organization.
Elster, R. E., & Flyer, E. S. (1981). A study of the relationship between educational credentials and military
performance criteria. Monterey, CA: Naval Postgraduate School.
Fairbank, B. A., Welsh, J. R., & Sawin, L. L. (1990). Armed Services Vocational Aptitude Battery (ASVAB):
Validity of ASVAB form 14 for the prediction of high school course grades (AFHRL-TR-90-48). Brooks
AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.
Flanagan, J. C. (Ed.). (1948). The aviation psychology program in the Army Air Forces (Report No. 1).
Washington, DC: U.S. Printing Office.
Flyer, E. S. (1959). Factors relating to discharge for unsuitability among 1956 airmen accessions to the Air
Force (WADC-TN-59-201). Lackland AFB, TX: Air Force Personnel Research Laboratory.
Fredland, J. E., Gilroy, C. L., Little, R. D., & Sellman, W. S. (1996). Professionals on the front line: Two
decades of the all-volunteer force. Washington, DC: Brassey’s.
Gates, T. S. (1970). Report of the President’s Commission on an All-Volunteer Armed Force. Washington, DC:
U.S. Government Printing Office.
Goldman, N. (1973). The changing role of women in the armed forces. The American Journal of Sociology,
78, 892–911.
Green, B. F., & Mavor, A. S. (Eds.). (1994). Modeling cost and performance for military enlistment. Washing-
ton, DC: National Academy Press.
Green, B. F., & Wigdor, A. K. (1988). Measuring job competency. In B. F. Green, H. Wing, & A. K. Wigdor
(Eds.), Linking military enlistment standards to job performance (pp. 23–44). Washington, DC: National
Academy Press.
Green, B. F., Wing, H., & Wigdor, A. K. (Eds.). (1988). Linking military enlistment standards to job performance.
Washington, DC: National Academy Press.
Greenberg, I. M. (1980). Mental standards for enlistment performance of Army personnel related to AFQT/
ASVAB scores (MGA-0180). Monterey, CA: McFann-Gray.
Harris, D. A., McCloy, R. A., Dempsey, J. R., Roth, C., Sackett, P. R., Hedges, et al. (1991). Determining the
relationship between recruit characteristics and job performance: A methodology and a model (FR-PRD-
90-17). Alexandria, VA: Human Resources Research Organization.
Hartigan, J., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority
issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Held, J. D., Fedak, G. E., Crookenden, M. P., & Blanco, T. A. (2002). Test evaluation for augmenting the Armed
Services Vocational Aptitude Battery. Paper presented at the 44th Annual Conference of the International
Military Testing Association, Ottawa, Canada.
Hogan, P. F., Curtis, J. S., & Warner, J. T. (2004). Sustaining the force in an era of transformation. In B. A. Bicksler, C.
L. Gilroy, & J. T. Warner, (Eds.), The all-volunteer force: Thirty years of service. Washington DC: Brassey’s.
Holmgren, R. L., & Dalldorf, M. R. (1993). A validation of the ASVAB against supervisors’ ratings in the
General Aptitude Test Battery (GATB). Washington, DC: U.S. Employment Service.
Hunter, J. E., Crosson, J. S., & Friedman, D. H. (1985). The validity of the Armed Services Vocational Aptitude
Battery (ASVAB) for civilian and military job performance. Washington, DC: Office of the Assistant
Secretary of Defense (Force Management and Personnel).
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological
Bulletin, 98, 72–98.
Jaeger, R. M., Linn, R. L., & Novick, M. R. (1980). A review and analysis of score calibration for the Armed
Services Vocational Aptitude Battery. Washington, DC: Committee Commissioned by the Office of the
Secretary of Defense.
Karpinos, B. D. (1966). The mental qualification of American youth for military service and its relationship to
educational attainment. Proceedings of the American Statistical Association. Alexandria, VA: American
Statistical Association.
Laurence, J. H. (1984). Education standards for military enlistment and the search for successful recruits
(FR-PRD-84-4). Alexandria, VA: Human Resources Research Organization.
Laurence, J. H. (1993). Education standards and military selection: From the beginning. In T. Trent & J. H.
Laurence (Eds.), Adaptability screening for the Armed Forces (pp. 1–40). Washington, DC: Office of the
Assistant Secretary of Defense (Force Management and Personnel).
Laurence, J. H. (1997). Does the education credential still predict attrition? In J.M. Arabian (Chair), Everything
old is new again: Current research issues in accession policy. Symposium conducted at the 105th Annual
Convention of the American Psychological Association, Chicago, IL.
Selection and Classification in the U.S. Military 701
Laurence, J. H., & Waters, B. K. (1993). Biodata: What’s it all about? In T. Trent & J. H. Laurence (Eds.),
Adaptability screening for the Armed Forces (pp. 41–70). Washington, DC: Office of the Assistant
Secretary of Defense (Force Management and Personnel).
Lee, D., & McKenzie, R. (1992). Reexamination of the relative efficiency of the draft and the all-volunteer
army. Southern Economic Journal, 59, 640–654.
Lee, G. C., & Parker, G. Y. (1977). Ending the draft: The story of the all-volunteer force (FR-77-1). Alexandria,
VA: Human Resources Research Organization.
Levy, D. G., Thie, H. J., Robbert, A. A., Naftel, S., Cannon, C., Enrenberg, R.G., & Gershwin, M. (2001).
Characterizing the future defense workforce (MR-304-OSD) Santa Monica, CA: The RAND
Corporation.
Lightfoot, M. A., Diaz, T. E., Heggestad, E. D., Darby, M. M., & Alley, W. E. (1999). New Air Force classi-
fication composites for USAF/DPX (FR-WATSD-99-17). Alexandria, VA: Human Resources Research
Organization.
MacGregor, M. (1981). Integration of the Armed Forces: 1940–1965. Washington, DC: Center of Military
History.
Maier, M. H. (1993). Military aptitude testing: The past 50 years (DMDC-TR-93-007). Monterey, CA: Defense
Manpower Data Center.
Maier, M. H., & Grafton, F. C. (1980). Renorming ASVAB 6 and 7 at Armed Forces examining and entrance
stations. Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Manning, L., & Griffith, J. E. (1998). Women in the military: Where they stand (2nd ed.). Washington, DC:
Women’s Research and Education Institute.
Mayberry, P. W. (1997). Competing criteria in the formation of aptitude composites (CAB-97-03). Alexandria,
VA: Center for Naval Analyses.
Mayberry, P. W., & Carey, N. B. (1997). The effect of aptitude and experience on mechanical job performance.
Educational and Psychological Measurement, 57, 131–149.
McBride, J. R., Wetzel, C. D., & Hetter, R. D. (1997). Preliminary psychometric research for CAT-ASVAB:
Selecting an adaptive testing strategy. In W.A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized
adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
McCloy, R. A. (1994). Predicting job performance scores for jobs without performance data. In B. F. Green &
A. S. Mavor (Eds.), Modeling cost and performance for military enlistment (pp. 61–99). Washington,
DC: National Academy Press.
Means, B. M., Nigan, A., & Heisey, J. G. (1985, October). When low-aptitude recruits succeed. Exceptional
recruits: A look at high and low-aptitude personnel. Symposium conducted at the 27th Annual
Conference of the Military Testing Association, San Diego, CA.
Moore, D. W., & Trout, B. (1978). Military advancement: The visibility theory of promotion. The American
Political Science Review, 72, 452–468.
Oi, W. Y. (1967). The economic cost of the draft. American Economic Review, 57, 39–62.
Oppler, S. H., McCloy, R. A., & Campbell, J. P. (2001). The prediction of supervisory and leadership
performance. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in personnel selection
and classification (pp. 389–410). Mahwah, NJ: Lawrence Erlbaum.
Putka, D. J., Noble, C. L., Becker, D. E., & Ramsberger, P. F. (2004). Evaluating moral character waiver
policy against servicemember attrition and in-service deviance through the first 18 months of service
(FR-03-96). Alexandria, VA: Human Resources Research Organization.
Ramsberger, P. F. (1993). Influences on the military enlistment decision-making process: Findings from the
1991 youth attitude tracking study (FR-PRD-93-06). Alexandria, VA: Human Resources Research
Organization.
Rostker, B. (2006). I want you: The evolution of the all-volunteer force. Santa Monica, CA: The RAND
Corporation.
Rumsey, M. G., Walker, C. B., & Harris, J. H. (Eds.). (1994). Personnel selection and classification. Hillsdale,
NJ: Lawrence Erlbaum.
Sackett, P. R., & Mavor, A. S. (Eds.). (2003). Attitudes, aptitudes, and aspirations of American youth:
Implications for military recruitment (pp. 70–96). Washington, DC: National Academy Press.
Sackett, P. R., & Mavor, A. S. (Eds.). (2004). Evaluating military advertising and recruiting: Theory and
methodology (pp. 40–67). Washington, DC: National Academy Press.
Sackett, P. R., & Mavor, A. S. (Eds.). (2006). Assessing fitness for military enlistment: Physical, medical, and
mental health standards (pp. 21–40). Washington, DC: National Academy Press.
Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.). (1997). Computerized adaptive testing: From inquiry to
operation. Washington, DC: American Psychological Association.
702 Handbook of Employee Selection
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta analysis of validity studies published
between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407–422.
Segall, D. O. (2004). Development and evaluation of the 1997 ASVAB score scale. Monterey, CA: Defense
Manpower Data Center.
Sellman, W. S. (1986). Military adaptability screening: A manpower management perspective. In A. R. Lancaster
(Chair), Recent developments in military suitability research. Symposium conducted at the 28th Annual
Conference of the Military Testing Association, Munich, Germany.
Sellman, W. S. (1994). Job performance measurement: The nexus between science and policy. In B. F. Green &
A. S. Mavor (Eds.), Modeling cost and performance for military enlistment (pp. 1–6). Washington, DC:
National Academy Press.
Sellman, W. S. (1997). Public policy implications for military entrance standards. Keynote address presented
at the 39th Annual Conference of the International Military Testing Association, Sydney, Australia.
Sellman, W. S. (1999). Military recruiting: The ethics of science in a practical world. Invited address to the
Division of Military Psychology, 107th Annual Convention of the American Psychological Association,
Boston, MA.
Sellman, W. S. (2001). Reinventing DoD corporate marketing. Paper presented at the International Workshop
on Military Recruitment and Retention in the 21st Century. Sponsored by the Belgian Defense Staff,
Royal Netherlands Army, and U.S. Office of Naval Research, The Hague, Netherlands.
Sellman, W. S. (2004). Predicting readiness for military service: How enlistment standards are established.
Commissioned paper prepared for the National Assessment Governing Board. Washington, DC: U.S.
Department of Education.
Sellman, W. S., Carr, W. K., & Lindsley, D. H. (1996). Shaping tomorrow’s military: The National agenda and
youth attitudes. Proceedings of the 15th Biennial Behavioral Sciences Symposium. Fort Collins, CO: U.S.
Air Force Academy.
Sellman, W. S., Shaw, M. N., Waters, S. D., & Geimer, J. L. (2007). Research and implementation plan:
Addressing recommendations for enhancing ASVAB and the DoD enlisted personnel selection and job
classification system (FR 07-46). Alexandria, VA: Human Resources Research Organization.
Sellman, W. S., & Valentine, L. D. (1981). Aptitude testing, enlistment standards, and recruit quality. Paper
presented at the 89th Annual Convention of the American Psychological Association, Los Angeles, CA.
Shields, J. L., & Grafton, F. C. (1983). A natural experiment: Analysis of an almost unselected Army population.
Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Sims, W. H., & Truss, A. (1978). An analyses of the normalization and verification of the Armed Services
Vocational Aptitude Battery (ASVAB) forms 6 and 7 (CNA 1115). Alexandria, VA: Center for Naval
Analyses.
Sims, W. H., & Truss, A. (1979). A reexamination of the normalization of the Armed Services Vocational
Aptitude Battery (ASVAB) (CNA 79-3059). Alexandria, VA: Center for Naval Analyses.
Sims, W. H., & Truss, A. (1980). A reexamination of the normalization of the Armed Services Vocational
Aptitude Battery (ASVAB) forms 6, 7, 6E, and 7E (CNA 1152, MCOAG). Alexandria, VA: Center for
Naval Analyses.
Smith, D. A., & Hogan, P. F. (1994). The accession quality cost/performance trade-off model. In B. F. Green &
A. S. Mavor (Eds.), Modeling cost and performance for military enlistment (pp. 105–128). Washington,
DC: National Academy Press.
Strickland, W. J. (2003). Trends in youth qualification and enlistment standards. In P. R. Sackett & A. S. Mavor
(Eds.), Attitudes, aptitudes, and aspirations of American youth: Implications for military recruitment
(pp. 70–96). Washington, DC: National Academy Press.
Strickland, W. J. (Ed.). (2004). A longitudinal study of first-term attrition and reenlistment among FY 1999
enlisted accessions (FR-04-14). Alexandria, VA: Human Resources Research Organization.
Strickland, W. J. (Ed.). (2005). Longitudinal examination of first term attrition and reenlistment among
FY1999 enlisted accessions (Technical Report 1172). Arlington, VA: U.S. Army Research Institute for
the Behavioral and Social Sciences.
Thirtle, M. R. (2001). Educational benefits and officer commissioning opportunities available to U.S. Military
Servicemembers (MR-981-OSD). Santa Monica, CA: The RAND Corporation.
Uhlaner, J. E., & Bolanovich, D. J. (1952). Development of the Armed Forces Qualification Test and predeces-
sor Army screening tests, 1946–1950 (PRS Report 976). Washington, DC: Personnel Research Section,
Department of the Army.
U.S. Code. (2009, January). Title X - Armed Forces, Subtitle - General Military, Part II - Personnel, Chapter
31 - Enlistments, Section 520 - Limitation on enlistment and induction of persons whose score on the
Armed Forces Qualification Test is below a prescribed level. Washington, DC: Author.
Selection and Classification in the U.S. Military 703
U.S. Department of the Air Force, Office of the Assistant Secretary of the Air Force (Manpower and Reserve
Affairs). (2005). Developing leaders of character at the U.S. Air Force Academy: From first contact to
commissioning. Washington, DC: Author.
U.S. Department of Defense. (2000). Report of the Defense Science Board Task Force on Human Resources
Strategy. Washington DC: Office of the Under Secretary of Defense (Acquisition, Technology, and
Logistics).
U.S. Department of Defense, Defense Human Resources Activity. (2007, June). Youth propensity and influ-
encer attitudes: Youth poll and advertising tracking findings. Washington, DC: Author.
U.S. Department of Defense, Office of the Assistant Secretary of Defense (Manpower, Installations, and
Logistics). (1985). Defense manpower quality (Vol. 1–3). Report to the Senate Committee on Armed
Services. Washington, DC: Author.
U.S. Department of Defense, Office of the Assistant Secretary of Defense (Force Management and Personnel).
(1991). Joint-service efforts to link military enlistment standards to job performance. Report to the House
Committee on Appropriations. Washington, DC: Author.
U.S. Department of Defense, Office of the Assistant Secretary of Defense (Force Management and Personnel).
(1993). Population representation in the military services: Fiscal year 1992. Washington, DC: Author.
U.S. Department of Defense, Office of the Assistant Secretary of Defense (Force Management Policy).
(1996). Educational enlistment standards: Recruiting equity for GED certificates. Report to Congress.
Washington, DC: Author.
U.S. Department of Defense, Office of the Under Secretary of Defense (Personnel and Readiness). (2004).
Population representation in the military services: Fiscal year 2002. Washington, DC: Author.
U.S. Department of Defense, Office of the Under Secretary of Defense (Personnel and Readiness). (2006).
Population representation in the military services: Fiscal year 2004. Washington, DC: Author.
U.S. Department of Defense, Office of the Under Secretary of Defense (Personnel and Readiness). (2007).
Population representation in the military services: Fiscal year 2005. Washington, DC: Author.
U.S. Department of Defense, Office of the Under Secretary of Defense (Personnel and Readiness). (2008).
Population representation in the military services: Fiscal year 2006. Washington, DC: Author.
U.S. Department of Defense, Office of the Under Secretary of Defense (Personnel and Readiness). (2009).
Population representation in the military services: Fiscal year 2007. Washington, DC: Author.
U.S. Government Accountability Office. (1997, January). Military attrition: DoD could save millions by better
screening enlisted personnel (NSIAD-97-39). Washington, DC: Author.
U.S. Government Accountability Office. (2007, January). Military personnel: Strategic plan needed to address
Army emerging officer accession and retention challenges. Report to the Committee on Armed Services,
House of Representatives. Washington, DC: Author.
Wallace, S. R. (1965). Criteria for what? American psychologist, 20, 411–417.
Wamsley, G. L. (1972). Contrasting institutions of Air Force socialization: Happenstance or bellwether? The
American Journal of Sociology, 78, 399–417.
Waters, B. K. (1997). Army Alpha to CAT-ASVAB: Four score years of military selection and classification
testing. In R.F. Dillon (Ed.), Handbook on testing. Westport, CT: Greenwood Press.
Waters, B. K., Laurence, J. H., & Camara, W. J. (1987). Personnel enlistment and classification procedures in
the U.S. military. Washington, DC: National Academy Press.
Waters, B. K., & Lindsley, D. H. (1996). Ability of military recruits: 1950 to 1994 and beyond. Proceedings of
the 15th Biennial Behavioral Sciences Symposium, Fort Collins, CO: U.S. Air Force Academy.
Warner, J. T., & Asch, B. J. (1996). The economic theory of a military draft reconsidered. Defense and Peace
Economics, 7, 297–312.
Welsh, J. R., Kucinkas, S. K., & Curran, L. T. (1990). Armed Services Vocational Aptitude Battery (ASVAB):
Integrative review of validity studies (AFHRL-TR-90-22). Brooks AFB, TX: Manpower and Personnel
Division, Air Force Human Resources Laboratory.
Welsh, J. R., Watson, T. W., & Ree, M. J. (1990). Armed Services Vocational Aptitude Battery (ASVAB):
Predicting military criteria from general and specific abilities (AFHRL-TR-90- 63). Brooks AFB, TX:
Manpower and Personnel Division, Air Force Human Resources Laboratory.
Wigdor, A. K., & Green, B. F. (Eds.). (1986). Assessing the performance of enlisted personnel. Washington,
DC: National Academy Press.
Wigdor, A. K., & Green, B. F. (1991). Performance assessment in the workplace (Vol. 1 & 2). Washington, DC:
National Academy Press.
Wise, L. L. (1984). Setting performance goals for the DoD linkage model. In B. F. Green & A. S. Mavor (Eds.),
Modeling cost and performance for military enlistment (pp. 37–60). Washington, DC: National Academy
Press.
704 Handbook of Employee Selection
Wise, L. L., Welsh, J. R., Grafton, F., Foley, P., Earles, J. A., Sawin, L. L., & Divgi, D. R. (1992). Sensitivity
and fairness of the Armed Services Vocational Aptitude Battery (ASVAB) technical composites (DMDC
Technical Report 92-02). Monterey, CA: Defense Manpower Data Center.
Yerkes, R. M. (1921). Psychological examining in the United States Army. Memoirs of the National Academy
of Sciences, Vol. XV. Washington, DC: U.S. Government Printing Office.
32 Public Sector Employment
Rick Jacobs and Donna L. Denning
Historians often cite the origin of civil service or public sector testing as far back as 2200 BC, when
a Chinese emperor used a process of systematic assessment to determine if his officials were fit for
office (DuBois, 1970; Frank, 1963). In these early times, individuals were assessed with what might
now be labeled job-relevant work samples; they included tests of specific skills such as horseman-
ship and archery. The Han Dynasty (202 BC to 200 AD) is credited with moving testing from the
actual actions required on the job to a surrogate, written format that included five areas of knowl-
edge: civil law, military affairs, agriculture, revenue, and geography (Gregory, 1996). Candidates
who were successful in meeting rigorous cut-off scores on local examinations were deemed appro-
priate to continue with the process of testing at regional and higher levels in the overall process,
much like what is seen today in multiple-hurdle testing. Thus, in many respects, these ancient tests
were prototypes of what has become known generically as civil service examinations or, more gen-
erally, public sector testing.
This brief historical description depicts the genesis of public sector testing and illustrates that it
shares some similarities with current practices, but important differences exist. Most noteworthy,
use of these early tests lacked systematic evaluation of outcomes or, more specifically, demonstration of their predictive validity. Further, the tests were conducted under extreme conditions, requiring candidates to spend long hours in confined spaces in a manner that would never be tolerated today, and they routinely produced failure rates well in excess of 90%, far higher than would generally prove viable today.
If we move forward a few thousand years, from China (200 AD) to France (1791), England (1833),
and finally the United States (1883), we see the more immediate historical roots of modern-day pub-
lic sector testing (Graham & Lily, 1984). In these systems, tests were used to select individuals for
government positions in a way that was intended to be free of patronage and fair to all candidates.
These tests were each designed to identify the individuals most likely to succeed in a given position
on the basis of specific subject matter that made up the content of the tests, a precursor to what is
now routinely labeled as validity based on test content. Although much has been done over the years
to improve the characteristics of these assessments, such as more carefully matching test materials
to job requirements, further standardizing testing processes, and evaluating predictive efficiencies
by validation studies, the basic ideas underlying civil service examining have a long and rich his-
tory, in fact, one that predates emergence of the discipline of industrial psychology.
This chapter will provide details of the important characteristics of testing in the public sector.
It starts with the process of identifying the positions that are part of the competitive examination
process and then moves on to discuss the development and administration of examinations. This is followed by a section on the key principle of validity, or linking tests to jobs. The next section addresses
recruitment of candidates; optimal selection decisions require maximizing the number of individuals
competing for the job. Part five of this chapter moves the discussion from entry-level testing to testing
for promotional opportunities. Next is a discussion of legal considerations surrounding testing in
the public sector. The chapter concludes with a summary of how public sector testing has evolved
through the past century and a view on where it might be moving in the 21st century.
a group of positions … sufficiently similar in respect to the duties, responsibilities, and authority thereof
that the same descriptive title may be used with clarity to designate each position allocated to the class,
that the same requirements as to education, experience, capacity, knowledge, proficiency, ability, and
other qualifications should be required of the incumbents, that the same tests of fitness may be used to
choose qualified employees, and that the same schedule of compensation may be used. (Committee on
Position-Classification and Pay Plans in the Public Service, 1941, p. 45, italics added)
Historically, this has been a judgmental exercise, but more recently it may include use of formal
job analytic methods. Subsequent to the creation of a class, additional positions are allocated to the
class when an additional need for work is determined and documented and the additional work is
deemed sufficiently similar to the work performed by incumbents in the class to warrant inclusion
of the position to the existing class. Similarly, an existing class is abolished when the need for the
work performed by those in the class no longer exists.
A description of the work performed by incumbents in a class and the qualifications neces-
sary for performing this work are then documented in a “class specification.” The “position
classification plan” of the organization, then, “consists of (1) the system of classes and class
specifications and (2) a code of formal fundamental rules for installation and maintenance of the
classification plan” (Committee on Position-Classification and Pay Plans in the Public Service,
1941, p. 47).
The classification of positions is a formal process that provides the underlying rationale for
assessment. With the specification of position qualifications and requirements, organizations can
seek to identify existing tests or construct new tests that match this information. What we have in
this approach are the roots of a content-based validation strategy, which may then stand on its own
or may be supplemented by additional validation information such as criterion-related evidence of
validity.
Adoption of the eligible list/register of eligibles, usually by a civil service commission or the
head of the department responsible for examining, marks the end point of the examination; but the
selection process has not concluded, because no one has yet been hired or promoted. This final step
is accomplished by the department with the vacancy requesting a “certification” of the list. Then,
in accordance with specific, strict rules, a designated number of candidates’ names are provided to
the department for their final hiring consideration (technically, they are provided to the “appointing
authority," who is typically the department head and is the only person who can legally fill a position in a civil service class). This certification rule has many variants, ranging, for example, from a "rule of one" (the highest scorer only, who is then hired unless there is reason not to do so, in which case the second-highest scorer is considered, and so forth), through a "rule of three" (the three highest scorers) and a "rule of 2N + 1" (the highest scorers, up to two times the number of vacancies plus one), to the very liberal "rule of the list" (all persons on the list may receive final consideration for selection).
Evaluation of candidates for this final selection decision is to be based on additional job-related
criteria not found in the testing process. These can be difficult to identify when a thorough exami-
nation has been given.
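As a concrete sketch, the certification variants described above can be expressed as a small function. The names and scores are invented, and the convention that all candidates tied at the cut-off score are certified together is an illustrative assumption rather than any particular jurisdiction's rule.

```python
def certify(scores, rule, vacancies=1):
    """Return the candidate names certified for final hiring consideration.

    scores: candidate name -> final examination score (higher is better).
    rule: "one", "three", "2N+1", or "list".
    All candidates tied at the cut-off score are certified together,
    since candidates tied at a given score must be treated the same.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if rule == "list":                       # rule of the list: everyone
        return ranked
    n = {"one": 1, "three": 3, "2N+1": 2 * vacancies + 1}[rule]
    if len(ranked) <= n:
        return ranked
    cutoff = scores[ranked[n - 1]]           # score of the last name within the rule
    return [c for c in ranked if scores[c] >= cutoff]

applicants = {"Avery": 94.2, "Blake": 91.7, "Casey": 91.7, "Drew": 88.0, "Emery": 85.5}
print(certify(applicants, "three"))          # the three highest scorers, ties included
print(certify(applicants, "2N+1", vacancies=2))
```

Note that under the liberal rules the certified pool grows quickly: with two vacancies, 2N + 1 already certifies five names from this small applicant pool.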
In 1983, voters in Los Angeles approved a City Charter amendment for use of a rule of “Three
Whole Scores” for certification selection. This rule accomplished two things: First, the rounding of
scores to whole numbers eliminated the minuscule, decimal-point differences in scores that had previously separated the ranks of candidates and, second, the hiring department was able to consider
an expanded pool of candidates for final selection. In all instances described, all candidates tied at
a given score are treated the same (either certified for final hiring consideration or not), so rounding
scores to whole numbers has a greater impact than might be expected in grouping candidates at a
given score (rank) and thus expanding the pool from which a selection can be made.
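A minimal sketch of the whole-score idea, assuming ordinary rounding to the nearest integer (the actual rounding convention used is an assumption here): rounding groups candidates whose decimal scores differ trivially, so certifying the top three whole scores sweeps in a larger pool than a decimal-rank rule of three would.

```python
def three_whole_scores(scores):
    """Certify every candidate whose rounded score falls within the top
    three distinct whole-number scores (a "Three Whole Scores" rule)."""
    rounded = {name: round(s) for name, s in scores.items()}
    top3 = set(sorted(set(rounded.values()), reverse=True)[:3])
    certified = [n for n in scores if rounded[n] in top3]
    return sorted(certified, key=lambda n: rounded[n], reverse=True)

# Decimal differences that once separated ranks now group candidates:
pool = {"A": 95.4, "B": 95.1, "C": 94.2, "D": 93.6, "E": 92.8, "F": 91.3}
print(three_whole_scores(pool))   # F (rounded to 91) falls outside the top three whole scores
```

Here five of six candidates reach the hiring department, whereas a strict rule of three on the decimal scores would have certified only A, B, and C.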
Civil service examinations are seen as embodying “merit principles” in selection in that they are
based on job-related criteria and provide a ranking of candidates in terms of their relative degree of
qualification. Position classification invariably results in a class specification document that at the
very least provides a starting point for construction of a job relevant examination. Often the descrip-
tion of the job in the class specification document is supplemented by a more detailed job analysis.
Job analysis procedures may vary but have the common objective of specifying the work performed
(tasks) on the job. Additionally, and especially relevant for the purpose of developing selection test-
ing, the job analysis also often provides identification of the knowledge, skills, abilities, and pos-
sibly other personal characteristics needed to perform these tasks. This information then allows for
designation of the most appropriate types of tests for use and their content.
Once again, these provisions require that examinations are based on job requirements or the
ability to “discharge the duties of the class,” which logically results in a content-based test construc-
tion strategy. This, coupled with classification practices that are based on extreme similarity of the
work required of all positions in a class, results in nearly universal reliance on content-based testing.
And not incidentally, the narrowness in scope of the work performed by incumbents in a class thus
defined results in a proliferation of classes, relative to the number of employees, which precludes
local empirically based test validation strategies.
February 2008). Similarly, in a medium-sized city in Pennsylvania, the Civil Service Commission must
monitor staffing for over 400 classes. Although some testing for larger classes is predictable and regular,
many jobs may have a single vacancy only occasionally (and unpredictably), and the organization must
be ready to examine candidates for each and every one of them at any point in time. In the City of Los
Angeles, the Personnel Department is responsible for testing nearly 1,000 classes. The sheer number of jobs, together with the requisites of the civil service system, creates a situation in which tailoring selection programs very specifically to each job can make timely completion of all examinations extremely challenging.
Another factor contributing to the volume of civil service examinations that must be developed
is the reluctance within many agencies to reuse tests. Test security reigns supreme, given the high
stakes of these examinations and the need to preserve the integrity of the process, to the extent that
considerable caution is exercised even with respect to repeat exposure of test material. And this
caution is well founded; incidents of candidates colluding to reproduce a test (by each memorizing
a specific set of items) have been repeatedly encountered.
as they are often called) must engage in developing an examination plan. Examination plans link
the information about the job to the types of assessments that are included in the selection process.
Examination plans provide not only a logical underpinning to the use of a given type of test and its
content but also to the weight that each test receives in the final examination score. As an example,
in police officer selection, there has been a movement to establish a more broad-based assessment
consisting of not only cognitive ability but also personality characteristics that lead to effective
policing. An examination plan for the class of police officer would likely include multiple types
of tests with specific assessment dimensions for each and instructions as to how these tests are to
be considered (pass/fail or weighted) in the final composite score on which candidates are ranked.
Following this logic, it is not hard to see that very different jobs (e.g., library clerical worker, meter
reader, lifeguard, and purchasing agent) would have examination plans that differ from police offi-
cer and from one another, given the nature of each job and the knowledge, skills, abilities, and other
characteristics (KSAOs) necessary to perform in the position.
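The scoring logic of such an examination plan, with pass/fail components acting as hurdles and weighted components combining into the ranked composite, might be sketched as follows. The component names and weights are hypothetical, not an actual police officer plan.

```python
def composite_score(results, plan):
    """Score one candidate under an examination plan.

    plan: component name -> weight, or None for a pass/fail hurdle.
    results: component name -> numeric score, or bool for hurdles.
    Returns the weighted composite, or None if any hurdle was failed.
    """
    total, total_weight = 0.0, 0.0
    for component, weight in plan.items():
        if weight is None:                    # pass/fail hurdle
            if not results[component]:
                return None                   # failed: candidate is not ranked
        else:
            total += weight * results[component]
            total_weight += weight
    return total / total_weight

# Hypothetical plan: cognitive test weighted 60%, personality inventory 40%,
# physical-abilities test scored pass/fail.
plan = {"cognitive": 0.6, "personality": 0.4, "physical": None}
print(composite_score({"cognitive": 88.0, "personality": 92.0, "physical": True}, plan))
```

A library clerical worker or meter reader plan would simply substitute different components and weights into the same structure.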
Public sector employment covers a very wide range of jobs and thus requires use of a correspond-
ingly wide range of assessment tools. Although the final examination may differ for various posi-
tions, a similar process is used for examination development for the vast array of public sector jobs.
First an analysis of the job is undertaken. Then, based on the results of a job analysis, an examina-
tion plan is developed that identifies the optimal (and feasible) type(s) of test(s) necessary to assess
the knowledge, skills, and/or abilities and aptitudes critical to performance of the job. For library
clerical workers and meter readers, tests might focus on attention to detail, whereas for lifeguards
the certification of successful completion of a first aid course might be supplemented with a physical
abilities test that includes water rescue.
MINIMUM QUALIFICATIONS
Threshold requirements, usually in the form of training, experience, or certification attained, are
often established as minimum qualifications for potential applicants, and public sector organizations
rely heavily on them as an initial step in the employment process. Minimum qualifications
(alternatively referred to simply as "requirements") must be met before a potential applicant may
participate in the competitive examination. In reality, they are the first test
in the examination, because they consist of carefully established job-related criteria that each appli-
cant must meet precisely to be allowed to proceed further in the examination process. These criteria
are clearly communicated to applicants (so those lacking them can self-select out at the earliest
possible time), consistently and rigidly applied, and often subject to verification. Their use is completely
consistent with the content-based approach to testing so prevalent in the public sector, in that the cri-
teria individuals must meet to participate in the examination for a given class are those that indicate
a reasonable likelihood that they will have acquired the knowledge, skills, and abilities that will be
subjected to more refined assessment through the remainder of the examination.
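In essence, a minimum qualification screen is a conjunctive filter: every threshold must be met for the applicant to proceed. The Python sketch below assumes hypothetical requirements (years of experience and a certificate); it is not any actual agency's rule set.

```python
def meets_minimum_qualifications(applicant, requirements):
    """Conjunctive screen: every threshold must be met; missing fields count as failing."""
    return all(applicant.get(key, 0) >= threshold for key, threshold in requirements.items())

# Hypothetical minimum qualifications for a clerical class (illustrative values only):
# at least two years of experience and a typing certificate (1 = held, 0 = not held).
mqs = {"years_experience": 2, "typing_certificate": 1}

qualified = meets_minimum_qualifications({"years_experience": 3, "typing_certificate": 1}, mqs)
screened_out = meets_minimum_qualifications({"years_experience": 3}, mqs)  # no certificate
```

Because the screen is conjunctive, exceeding one threshold cannot compensate for missing another, which mirrors the rigid application of minimum qualifications described above.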
require the creation of new tests; in fact, this is often a primary responsibility of the public sector
testing professional. Clearly the development of new assessment tools requires a great deal of time,
effort, and skill, and when that is multiplied by the number of jobs in the organization, the workload
can become overwhelming. Many public sector organizations pursue another option in some cases
by outsourcing to individuals or consulting firms specializing in instrument development. This is
especially true for high-stakes positions in which many jobs are being filled and the likelihood of
follow-up objections and legal action on the part of candidates is high.
ALTERNATIVE MEASURES
All employers should identify appropriate predictors for use in employee selection and, in addition,
are required to search for alternative predictors for any original predictor that demonstrates a marked
difference in pass rates on the basis of designated candidate demographic/cultural group membership.
For example, cognitive ability tests are effective predictors of subsequent job performance. However,
the use of cognitive ability tests alone will also usually result in large racial/ethnic group
differences in pass rates, with majority group members passing at a higher rate than members of most
minority groups. In this case, employers are required to seek out other predictors that can be used in
conjunction with or in place of the predictor(s) that result in large group difference(s) in pass rates.
The search for alternatives may take the form of identifying additional dimensions upon which to
assess candidates. This approach is reflected in the previously mentioned police officer selection
example, in which, for many decades, police candidates were given a cognitive ability test for the
initial identification of qualified candidates and then further vetted via interviews, physical ability
tests, background investigations, and medical/psychological evaluations. More recently, systems for
initial screening commonly include a cognitive test with additional areas measured, such as per-
sonality or biographical data, and expand alternative testing formats to include video-based tests
or job simulations. Both approaches to identifying alternatives, expanding test content and varying
test format, should be considered when meeting the mandate to search for alternative tests.
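The "marked difference in pass rates" that triggers this search is conventionally assessed with the four-fifths rule of thumb from the Uniform Guidelines (Equal Employment Opportunity Commission et al., 1978). A minimal sketch, using hypothetical pass rates:

```python
def adverse_impact_ratio(focal_pass_rate, reference_pass_rate):
    """Ratio of the focal group's pass rate to the reference (highest-passing) group's rate."""
    return focal_pass_rate / reference_pass_rate

# Hypothetical counts: 40 of 100 focal-group candidates pass the test,
# versus 60 of 100 reference-group candidates.
ratio = adverse_impact_ratio(40 / 100, 60 / 100)

# Under the Uniform Guidelines' four-fifths rule of thumb, a ratio below 0.80
# is generally regarded as evidence of adverse impact.
flagged = ratio < 0.80
```

In this hypothetical case the ratio is about 0.67, below the four-fifths threshold, so the employer would be obligated to search for alternative predictors as described above.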
Public Sector Employment 711
analysis identifies the KSAs of the people an agency should attempt to attract, making it possible
to target the recruiting effort. As an example, although there are always many men who want to be
firefighters, finding
women for the job can often be a problem for many agencies. Because a firefighter’s job is very
physically demanding, finding women who are physically capable of not only lifting and carry-
ing heavy objects, including other people, but doing so with about 46 lb of equipment presents an
additional challenge. Clearly, just going out and recruiting any woman is not a good strategy in light
of our additional knowledge about the job. The focus in this example and many others must be on
recruiting individuals who are interested in and likely to be qualified for the job.
Historical data can also be of great value by pointing agency personnel to past successful
recruits and helping to identify schools, vocational programs, and other training grounds where
those likely to be successful in the job may be found. Establishing partnerships with educa-
tional institutions may also be considered. Further to the point regarding targeted selection of
firefighters, establishing physical pretraining programs targeted at women is not uncommon. In
one particularly good demonstration of effective recruiting, a city agency responsible for hiring
firefighters concluded that the major hurdle in hiring women was not an overall lack of
applicants, but a lack of female applicants who could pass their physical abilities test. This
conclusion resulted in an adjustment in recruitment strategy to focus on recruitment of women from
places where previously successful female applicants had been found. These female firefighters
belonged to local gyms, fitness clubs, and local athletic teams. The logic proved to be accurate
in that the targeted recruitment identified more female applicants who successfully passed the
physical abilities test. The strategy was successful, and the city increased its percentage of
female firefighters.
Not all recruiting efforts return such positive results. Recruiting without information on the
KSAs required to be successful is likely to be a waste of time and effort. In some cases “random
recruiting” may actually detract from the goal of more effective testing. When recruits who are
poorly prepared or not truly interested in the job are solicited and become candidates, the potential
for changes in the passing rates of the various demographic applicant groups increases, and the
result can be an increase in adverse impact, no doubt the opposite of the desired outcome.
Indeed, effective recruitment is vital to creating an effective selection program, but the pro-
cess must be guided by the knowledge of what is required by the job, what have historically been
successful avenues for finding the needed talent, and what new approaches (e.g., pretraining pro-
grams and educational partnerships) may prove viable in the future.
used for either entry or promotional testing. As stated earlier, this method of validation establishes
logical links between test requirements and job requirements, often supported with judgment
data from subject matter experts (SMEs). In many programs of validation for entry-level testing, this
strategy is supplemented with a criterion validity study. However, it is rare that a criterion-related
study is part of the validation process in promotional exams. This is so for various reasons, including
relatively small sample sizes for many promotable positions, issues of test security that arise as a
function of administering a test to a group of incumbents and then using it again for candidates,
difficulties in getting a sample of incumbents to do the necessary level of preparation to take the
test, and lack of incumbents’ motivation to participate in an activity that is seen as not relevant to
their job.
The number and variety of promotional examinations are also daunting, and the task of creating
and implementing job-related examinations is equally difficult. In the vast majority of instances,
promotional examinations include minimum qualifications that specify the lower-level class or
classes from which promotion to a given class must be made, as well as the required number of
years of service in the lower-level class or classes. Most public sector organizations have these rules
and regulations to not only specify job-relevant experience, but also as part of a mandate to prevent
patronage or favoritism from replacing the identification of the most talented individuals for the job.
The process of developing a promotion examination is similar to that for entry-level examinations,
with one very important difference: Most agencies emphasize promotion from within (again, often
legally mandated). As such, the amount of information known about the candidates greatly exceeds
what is known about entry-level applicants. This could be a tremendous advantage in the identifica-
tion of talent if properly tapped.
in employee development programs, have been devised for the measurement of past performance.
Behavioral accomplishment records (Hough, 1984), ratings of promotability, career review boards,
and behavior-based interviews have all been seen as additions to the overall promotional processes
used in public sector testing. It remains the task of testing professionals to further enhance promo-
tional processes by continuing to improve these techniques that capture prior job-relevant perfor-
mance. It is further the responsibility of the professional testing community to convince candidates
and their agencies to more frequently incorporate these improved testing practices into existing
assessment programs.
cause large levels of adverse impact when selection ratios drop below .20. This further complicates
the situation for the agency, because one goal is to make the jobs widely available, but doing so can
have a negative consequence on diversity.
DEFENSIBILITY OF PROCESS
Ultimately, a public sector agency must make its employee selection systems (entry and
promotional) defensible. To do this, there are steps that must be taken, and these steps must not
only conform to the laws and guidelines governing selection but also be meticulously documented
(Guion, 1998).
In entry and promotional programs there are winners and losers. The winners get what they
desired, the job, and those less fortunate walk away either without a job (entry) or in the same job
they were in prior to studying and preparing for the promotional process. At times, those in the latter
category seem to decide that challenging the test on any number of grounds is a good strategy for
obtaining a second chance. In some situations, the tests may actually be poorly prepared, lacking in
job relevance, undocumented with respect to how they were created and/or linked back to the job, or
simply administered without regard to accepted testing practices. However, in other cases, the alle-
gations about the test may be a disingenuous attempt to vacate the results and provide all test-takers
with another opportunity for success. When a test is challenged, it does not automatically mean the
test was deficient or that the process violated the laws and guidelines that prevail. Challenging the
test is a right of any candidate who can establish an underlying legal basis, most often in the form
of adverse impact. Once adverse impact is established, it becomes the responsibility of those using
the test to establish the validity of the process.
Given the above, it is important to consider what must be present to make a hiring or promotional
system defensible. Below we provide further details on the following critical factors for defending
a testing process:
• Job analysis
• Links between test elements and aspects of the job
• Logic behind the combination of multiple scores into a final composite
• Test administration details
• Scoring processes
• Documentation
There is unanimous agreement that a fair test is only possible with confirmation that those responsible
for the test understand the job. In the context of public sector testing, this means that the job in question
has been defined and documented, which occurs at the inception of a job class through the creation
of a class specification, and then is often supplemented with a more detailed job analysis and/or dur-
ing examination development with the assistance of job experts. The results are widely accepted by
incumbents, supervisors, HR specialists, and potential applicants as reflecting the important features
of the job. To be useful as underlying test development documents, the job analysis and job description
must reflect not only the tasks and responsibilities of the job, but also the knowledge base required by
the job and those skills, abilities, and other personal characteristics that facilitate job performance.
A second important element of providing the necessary information for defense of a test is evi-
dence that links the test items, work samples, and other components of the examination to the actual
requirements of the job. This linking process often takes the form of surveys that identify the mate-
rials underlying the test questions and asks SMEs to identify the degree to which these materials aid
in completion of various work tasks and responsibilities. This process is commonly accepted as a
means of establishing validity on the basis of test content and takes many forms. At the root of any
successful demonstration of validity is a clear listing of the test requirements, the job requirements,
and how those two sets of requirements are associated. Critical to this approach to the demonstra-
tion of validity based on test content is an appropriate sampling of incumbents, supervisors, and/or
other job experts, along with clear instructions to those who are providing the responses. Although
surveys are often used, this process can also be accomplished with review meetings involving job
experts in which the material is analyzed and discussed and consensus judgments of the experts
are documented.
In most contemporary entry and promotional processes, a single test does not represent the full
range of job requirements; so multiple tests are used for selection or promotion. When this is the
case, a third requirement comes into play. Because all selection and promotional processes require
a single, final score in the examination, the manner in which that score is calculated becomes an
important consideration. As an example, in police officer selection and in firefighter selection there
often is a written and a physical test. Combining these two scores becomes an issue because the
weight assigned to each score will determine, in part, its impact on final score. Years of job analysis
for both of these jobs have yielded consistent results. Although both jobs require physical capabil-
ity, the firefighter job is more physically demanding. Results from our own job analysis work across
various police and fire departments have shown that the job of a firefighter is between 40% and 60%
physical with the remainder requiring cognitive abilities, whereas the job of a police officer is often
reported by incumbents to be between 20% and 30% physical and between 70% and 80% cognitive. The two jobs
clearly need different weights for the physical test when it comes to creating a selection composite.
Like the evidence for linking the test elements to the job requirements, a rationale for the weight-
ing used to form the final test score is necessary. This is often based on input from job experts and
professionals in the testing area. The important point is that there is a rationale for the weights that
are being used and that the rationale is tied to requirements of the job. It should also be noted that
there is no one best way to establish these weights; also, in many testing situations the components
of the examination are correlated with one another. When correlation exists among components, the
weights become somewhat less of an issue because small variations in weights do not substantially
change the overall results. As the correlations increase, the impact of differentially weighting the
components becomes far less of an issue, and at some point, simply equally weighting the test com-
ponents works in a similar manner to elaborately defining a very precise set of differential weights.
On the other hand, some would argue for use of equal weights simply because of their demonstrated
robustness in prediction (Schmidt, 1971).
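For illustration, a weighted composite is simply a linear combination of component scores whose weights sum to one. The sketch below uses hypothetical weights that loosely mirror the job-analysis percentages discussed above; in practice, the weights would come from job experts and testing professionals.

```python
def composite_score(scores, weights):
    """Weighted linear combination of component test scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[component] * scores[component] for component in weights)

# One hypothetical candidate's standardized component scores (0-100 scale).
scores = {"written": 80.0, "physical": 60.0}

# Illustrative weights loosely mirroring the job-analysis percentages discussed above.
firefighter_weights = {"written": 0.50, "physical": 0.50}
police_weights = {"written": 0.75, "physical": 0.25}

firefighter_composite = composite_score(scores, firefighter_weights)  # 70.0
police_composite = composite_score(scores, police_weights)            # 75.0
```

The same candidate ranks differently under the two weighting schemes, which is exactly why the rationale for the weights, and its tie to job requirements, must be documented.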
A fourth area for defensibility is in the actual administration of the tests. The best-developed
tests, the ones with the highest degree of validity evidence and the strongest rationale for weight-
ing of components, can become useless if test administration processes are deficient. Threats to
administration can come in various forms, ranging from failure to protect testing materials before
the actual test date to administering the test in a room with poor lighting, loud outside noises, or
missing pages in test booklets. Although this seems to be the least difficult part of defending a test
and the easiest to achieve, it is often the Achilles heel of a testing process. Care must be given to all
phases of test administration; for example, materials such as instructions to candidates, information
about the testing locations and facilities, and any irregularities in the actual administration of the
test all must be well documented. Nothing is worse than doing all of the right things when it comes
to test development and then compromising it all during test administration.
All test materials must be scored, and the scoring process represents a fifth area in which threats
to the defense of a test can occur. Many modern tests are administered via paper and pencil and
scored by scanning machines or taken online and scored automatically. Scoring integrity must be
demonstrated in the form of getting the correct outcome for each candidate. In the case of scanned
answer sheets, this means that all answer sheets must be reviewed for irregularities. It is a good
idea to scan each test sheet twice and to compare scores for any differences in the two scans; any
differences indicate scanner problems or simple “hiccups” in the scanning process. Another way to
ensure accuracy is to compare candidates’ hand-scored tests with their scanned scores (usually for a
sample of candidates). For online testing, periodically sending through a “phantom candidate” with
a known score to make sure that the algorithm is generating the correct score is a useful step. With
respect to other types of potential test scoring problems, demonstrations of interrater agreement and
other forms of reliability help to substantiate the appropriateness of scoring protocols. Although a
discussion of reliability is beyond the scope of this chapter, any and all steps that can be
taken to show the consistency of test results will be of great assistance in addressing any challenges
to the scoring process.
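The double-scan check described above reduces, in code, to comparing two score files and flagging any candidate whose scores disagree. A minimal sketch with hypothetical candidate IDs and scores:

```python
def scan_discrepancies(first_pass, second_pass):
    """Candidate IDs whose scores differ between two scanning passes (flag for hand review)."""
    all_ids = set(first_pass) | set(second_pass)
    return sorted(cid for cid in all_ids if first_pass.get(cid) != second_pass.get(cid))

# Hypothetical candidate IDs and raw scores from two passes through the scanner.
scan_1 = {"A001": 87, "A002": 92, "A003": 75}
scan_2 = {"A001": 87, "A002": 90, "A003": 75}  # A002 disagrees between passes

to_review = scan_discrepancies(scan_1, scan_2)  # ["A002"]
```

Any ID returned by such a comparison would then be hand-scored and investigated before final results are certified; the same comparison logic applies to checking a hand-scored sample against machine scores.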
A final step in the defensibility of a testing program is the documentation of all steps in the pro-
cess. This includes specification of how each test was developed, how each test was administered
and scored, and how the final examination score was calculated for all candidates. Creating the
paper trail of your work not only allows everyone to see the steps taken but also memorializes the
process. In many challenges to public sector testing, legal proceedings take place years after the
examination was developed and administered. Relying on memory and randomly filed memos of
what happened will never provide the information necessary to successfully support the conten-
tion of adequacy of the process. Public sector testing is best completed by the compilation of a
final report or file that details the project from start to finish. This documentation should be clear
and it should contain all of the necessary surveys, instructions, and tests used in the examination.
There is no better way to defend a test than to have it well documented. In situations in which
a challenge is presented to an agency regarding the testing process, the agency can provide the
potential plaintiff with a copy of the report. On more than one occasion, this has ended the chal-
lenge to the test.
Doing everything correctly and doing it by the steps outlined above is not a guarantee of
an instant determination of validity or being able to use the results of the examination. Public
sector testing has a long and interesting history of legal challenges and decisions. Not all deci-
sions follow what would logically flow from the various perspectives presented in a case. Public
sector testing is just that, it is public, and much of what is decided in a specific case is best
understood from the historical perspective of the selection problems for that city, state, or
agency. Like any other understanding of a complex issue that has developed over time, context
is critical.
CONCLUSIONS
Public sector testing has evolved over the past two centuries in terms of test content and test format.
We have seen the movement from tests based solely on memory and other cognitive abilities to
the inclusion of social judgment, personality, and biographical information. We have seen simple
paper-and-pencil testing transition to testing formats that are computer-based and inclusive of video
stimulus materials. With respect to the identification of critical underlying job requirements, we
have seen public sector testing programs expand their use of systematic job analytic techniques that
approach not only single jobs under study, but also classes of jobs, so that the inherent interrelation-
ships among jobs can be identified to better take advantage of opportunities to use a common testing
system across jobs. With respect to the legal arena, it is public sector testing that is often singled out
as the test case for looking at the defensibility of specific test formats and test content as well as the
way in which test scores are used in the decisions made about people and jobs. Clearly, as the chal-
lenges to the fairness of various types of testing programs move forward, public sector applications
will be part of the landscape.
Unlike their private sector counterparts, public sector employees are less susceptible, although not
immune, to layoffs and downsizing, even though hiring freezes are common. This fiscal reality means
that most cities and public agencies, even in their toughest financial times, continue to require
substantial levels of staffing. Therefore, the enormous demand for testing programs for the hiring and
promotion of public sector employees will continue, and the need for accomplished and creative test
development professionals will offer tremendous opportunities to further develop the way in which
we measure candidates against job requirements.
REFERENCES
Aguinis, H., & Smith, M. A. (2007). Understanding the impact of test validity and bias on selection errors and
adverse impact in human resource selection. Personnel Psychology, 60, 165–199.
Baruch, I. (Chair). (1941). Committee on Position-Classification and Pay Plans in the Public Service. Position-
classification in the public service. Chicago, IL: Public Personnel Association.
Cascio, W. F., Jacobs, R. R., & Silva, J. (2010). Validity, utility and adverse impact: Practical implications from
30 years of data. In J. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high
stakes selection (pp. 271–288). New York, NY: Psychology Press.
City of Los Angeles. (2009). Official city of Los Angeles charter. American Legal Publishing Corporation.
Retrieved November 16, 2009, from http://www.amlegal.com/nxt/gateway.dll?f=templates&fn=default.
htm&vid=amlegal:laac_ca
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
DuBois, P. H. (1970). A history of psychological testing. Boston, MA: Allyn & Bacon.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department
of Justice. (1978). Uniform guidelines on employee selection procedures. Federal Register, 43,
38290–38315.
Franke, W. (1963). The reform and abolition of the traditional Chinese examination system. Cambridge, MA:
Harvard University Press.
Graham, J. R., & Lilly, R. S. (1984). Psychological testing. Englewood Cliffs, NJ: Prentice Hall.
Gregory, R. J. (1996). Psychological testing: History, principles and applications (2nd ed.). Boston, MA: Allyn
& Bacon.
Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ:
Lawrence Erlbaum.
Hough, L. M. (1984). Development and evaluation of the accomplishment record method of selecting and
promoting professionals. Journal of Applied Psychology, 69, 135–146.
Jacobs, R. R., Conte, J. M., Day, D. V., Silva, J. M., & Harris, R. (1996). Selecting bus drivers: Multiple perspectives
on validity and multiple estimates of validity. Human Performance, 9, 199–218.
Murphy, L. L., Plake, B. S., & Spies, R. A. (Eds.). (2006). Tests in print VII. Lincoln, NE: Buros Institute of
Mental Measurements.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and
sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Pyburn, K. M., Jr., Ployhart, R. E., & Kravitz, D. A. (2008). The diversity-validity dilemma: Overview and legal
context. Personnel Psychology, 61, 143–151.
Sackett, P., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–450.
Schmidt, F. L. (1971). The relative efficiency of regression and simple unit predictor weights in applied
differential psychology. Educational and Psychological Measurement, 31, 699–714.
Spies, R. A., Plake, B. S., & Geisinger, K. F. (Eds.). (2007). The seventeenth mental measurements yearbook.
Lincoln, NE: Buros Institute of Mental Measurements.
Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effectiveness of
tests in selection. Journal of Applied Psychology, 23, 565–578.
33 Selection Methods and Desired Outcomes: Integrating Assessment Content and Technology to Improve Entry- and Mid-Level Leadership Performance
Scott C. Erker, Charles J. Cosentino, and Kevin B. Tamanini
Wise, 1990), the selection systems at higher levels within organizations are just as, if not more,
critical. The variation in job performance is greater for those in leadership roles than other positions
(Schmidt & Hunter, 1998). From a utility perspective, selecting a superior manager will result in
48% more output than an average manager (as compared to 32% for skilled workers and 19% for
lower-level jobs; Hunter, Schmidt, & Judiesch, 1990). Getting the right leaders into the top posi-
tions will stimulate organizations to grow and prosper (Howard, 2006). Indeed, the financial health
of an organization is predicated on the optimal selection and placement of employees (Hunter,
Schmidt, & Judiesch, 1990). The bottom line is that it pays to have an effective selection system,
especially when dealing with leader-level positions.
This chapter focuses on the selection of entry- and mid-level leaders within private sector orga-
nizations. From this point forward, the authors refer to this as the selection of leaders or leader-
ship selection. A review of contemporary organizational challenges unique to entry- and mid-level
leadership sets the stage for a discussion of themes and strategies for enhancing leadership effec-
tiveness and bench strength through improved leadership selection and assessment practice. Next,
a comprehensive set of leadership assessment tools and procedures is described. Real-world case
studies are used to illustrate the application and results achieved from enhanced selection programs.
Finally, we overview common business scenarios in which assessment is used to guide leadership
selection in the private sector.
Recently, Development Dimensions International (DDI), a consulting firm that aids organiza-
tions in hiring and promoting better employees and developing leadership talent, conducted a lead-
ership forecast study among 4,559 leaders and 944 human resources (HR) representatives across
the management hierarchy (17% first-level, 38% mid-level, 34% higher-level, and 10% senior level;
Bernthal & Wellins, 2005). They found that regardless of management level, respondents had
more confidence in senior leaders than either mid-level or low-level leaders (55% rated their confi-
dence high as compared to 39% and 37%, respectively). In another DDI study, HR leaders in 348
organizations around the world anticipated greater difficulty filling leadership positions that can
meet the challenge of a global economy in the future (Bernthal & Erker, 2005). Specifically, 52%
of respondents expected problems filling mid-level leadership positions with qualified candidates
and 28% anticipated problems filling first-level leader positions with qualified candidates. These
statistics are alarming in that they suggest considerable variance/gaps in the selection of entry- and
mid-level leaders—a job that is increasingly important as global growth, service expectations, and
technology complexity have made the quality of leaders at this level more critical to sustained orga-
nizational success. This situation is exacerbated by the shortage of qualified candidates, especially
in developing economies. Urgency in filling a position with few candidates is one of the most com-
mon sources for selection errors. Although poor leader selection at the higher levels (e.g., senior
executives and CEOs) has more serious consequences monetarily, as well as in terms of organiza-
tional effectiveness, poor selection at lower leader levels (i.e., entry- and mid-level) is considered
more pervasive (Howard, 2006).
with development opportunities and exposure to more senior leaders while occupying lower-level
positions (e.g., assistant managers), were effective means for identifying and developing future lead-
ers. Since then, organizations have reduced leadership levels and eliminated assistant manager posi-
tions. This has reduced the organizations’ ability to identify and develop those individuals with the
greatest leadership potential. Reduced organizational levels have made each transition into a new
level of leadership responsibility more difficult for newly promoted leaders. This has increased the
importance of other strategies to identify, select, and accelerate leaders.
When organizations have long-tenured employees, they have the benefit of an experienced work-
force, but they are at risk of mass retirements. Many organizations that we work with, from chemi-
cal processing to transportation companies to financial service organizations, expect up to 70%
of their senior team to retire over the next 5 years. Many organizations are totally unprepared to
manage this challenge. The challenge has been made more difficult if there is no systematic pro-
cess in place to select and develop the next generation of leaders. Succession planning, especially
at the mid-level is largely absent as a strategy for managing the tremendous change in workforce
demographic. Couple this with the new demands associated with leading in a global economy and
the “perfect storm” is created. Underprepared leaders can be faced with an impossible situation.
Organizations do not have the time they had in the past to grow leaders through trial and error.
Changing business realities have increased the use of more visionary job analysis methods that
study the new leadership challenges facing the organization and select competencies most closely
associated with addressing those changes. The assessments are tailored to select and develop the
leaders against a new competency profile that is better aligned with the new business and cultural
strategies. Organizations have less confidence in the judgment of incumbent leaders as the sole
method of making decisions about who will become the new generation of leaders, especially dur-
ing times of change. Incumbent leader judgment is seen as skewed by strategies and skills that were
important in the past.
because of family concerns (Society for Human Resource Management, 2009). In addition, the
attrition rate among expatriates once they have returned to their home country is significant (27%
leave within the first year compared with 13% attrition for leaders who do not take an expatriate
assignment; Society for Human Resource Management, 2009) because they are challenged with
reassimilating into their companies as organizations struggle to fully utilize and recognize these
leaders’ new-found skills.
a new competitor or market reality. Consider a mid-level sales leader who has accountability for
$10 million of business. When this leader retires or is promoted, who will take her place? Will it be
a highly effective leader or a mediocre one? If we assume a 10% difference in the revenue results
achieved between the highly effective and mediocre leaders, even a small difference in revenue
performance can add up to the difference between success and failure. There is growing recogni-
tion that speed to productivity, enhanced through effective new leader selection, on-boarding, and
development programs, is an important lead measure of success in meeting and exceeding sales
goals. When direct sales costs and lost business opportunities are calculated across a large sales
organization, the dollar figure can run into the tens of millions of dollars. In this context, unwanted
turnover, failure to coach weak-performing sales associates, ineffective sales tactics, and lost business
opportunities can all be traced to the difference between average and highly effective leaders.
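To make the arithmetic concrete, the following minimal Python sketch works through the revenue gap described above. The $10 million book of business and the 10% performance difference come from the example in the text; the 50-district organization is an added assumption for illustration.

```python
# Hypothetical illustration of the revenue stakes described in the text.
# The $10M book of business and the 10% gap between a highly effective
# and a mediocre leader come from the chapter; the district count is assumed.

def revenue_gap(book_of_business, performance_gap, n_districts=1):
    """Annual revenue difference between staffing districts with
    highly effective versus mediocre leaders."""
    return book_of_business * performance_gap * n_districts

# One district: a 10% gap on $10M is roughly $1 million per year.
per_district = revenue_gap(10_000_000, 0.10)

# Across a (hypothetical) 50-district sales organization, the gap runs
# into the tens of millions, as the text suggests.
org_wide = revenue_gap(10_000_000, 0.10, n_districts=50)
```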
In summary, the five trends outlined above have contributed to an increase in the pressure and
difficulty organizations face when defining and managing the applicant pool for entry- and mid-
level leader positions. Although this list is not exhaustive, it does highlight some of the more severe
contextual challenges that must be taken into account when designing a sustainable leadership
selection program. Our contention is that those organizations that can best anticipate the impact
of these trends and then take action to implement programs that select the right leaders will be
the organizations that are prepared with the right people in place to meet their future business
challenges.
TABLE 33.1
Rank Order of Most Common Leadership Competencies for Major Global
Companies
     1980–2000                      2000–2008
 1.  Communications             1.  Decision-making
 2.  Decision-making            2.  Communication
 3.  Planning and organizing    3.  Adaptability
 4.  Stress tolerance           4.  Coaching
 5.  Gaining commitment         5.  Gaining commitment
 6.  Adaptability               6.  Planning and organizing
 7.  Job fit motivation         7.  Building trust
 8.  Technical knowledge        8.  Customer focus
 9.  Coaching                   9.  Delegation
10.  Formal presentation       10.  Building effective work relationships
11.  Initiative                11.  Innovation
726 Handbook of Employee Selection
The leaders before 2000 were required to be technical experts and good managers (making plan-
ning and organizing, decision-making, and formal presentation more likely to be identified) as well
as able to handle stress and adapt to change. After 2000, the priority of behaviors required for suc-
cess has changed. Although decision-making, communication, and adaptability remained high on
the list, coaching, building trust, delegation, and building effective work relationships were given
more weight. Today, effective managers are seen as balancing technical knowledge and experience
with interpersonal skills. Contemporary selection programs measure technical and social ability.
The customer (customer focus) and innovation are also valued as important in leadership roles.
Today, when it comes to human capital management, many managers and executives are becom-
ing increasingly sophisticated. They realize the importance of objective HR assessment, of establishing
the return on investment (ROI) for their HR programs, and that the information collected, compiled,
consumed, and applied from assessment practices can have a tremendous impact on organizational
success. High performance in leadership selection can best be achieved when the specific require-
ments for people in leader roles are understood and when systematic methods of assessment are used
as inputs to making selection decisions. In this section, we will outline common pitfalls that under-
mine selection and assessment; demonstrate the essential role that assessment plays in developing
world-class leaders; provide an overview of assessment tools; and show how to best use assessment
tools, systems, and processes to build a competitive advantage through people.
Quality leaders are essential drivers of business strategy and critical differentiators for com-
panies. Every day, organizations make critical business decisions, many of which are centered on
leaders, such as the following:
development focuses on trainable elements. Not all leader requirements are equally developable.
Training and development can enhance knowledge and build proficiency in skill-based competen-
cies. However, less developable personal attributes such as mastery of complexity, self-development
orientation, and leadership motivation lend themselves to identification through assessment. A well-
designed assessment program will examine both nontrainable and trainable dimensions of success.
This design recommendation is especially important for entry-level leader selection, when there is
little data about the individual contributor on less trainable, positive leadership characteristics.
of behaviors related to job success. Thus, methods may also be categorized as those that provide
inferences about behavior (e.g., personality tests, cognitive tests), assess descriptions of work behav-
ior (e.g., biodata, interviews), or demonstrate behavior (e.g., job simulations) (Howard, 2006). This
is an important difference for organizations because the use of different methods requires different
validation strategies. Effective entry- and mid-level leadership assessment programs use various
assessment tools.
Whether selection methods measure constructs or focus on job content—that is, depict signs
(inferences) of behavior or samples (descriptions or demonstrations of behavior)—some have been
shown to be better predictors of leader performance than others. Although there is an abundance
of literature on the validity of selection predictors across jobs (mainly relying on entry-level jobs;
e.g., Hunter & Hunter, 1984), there is much less that has focused primarily on entry- and mid-level
leader selection. Considerable research is needed to confirm these inferences. The unique nature of
leadership positions will impact the results of studies targeted at this level.
SCREENING METHODS
Biographical Data
Biographical data, or biodata, measures are empirically developed instruments that quantify descriptions
of past activities and accomplishments, such as life experiences, hobbies, and other pursuits. According to
Mumford, Stokes, and Owens (1990), studying patterns of life history sheds light on the ecology
of human individuality. Although that may be true, Schmidt, Ones, and Hunter (1992) noted that
by the late 1980s and early 1990s in the United States, only 6.8% of firms had ever used biodata
in employment decisions, and, similarly, Shackleton and Newell (1997) estimated that 6% of firms
used biodata in all employment decisions and 11% in some cases. We believe that the use of biodata
for leadership selection may be on the rise. Biodata typically produces less adverse
impact than many other methods and has the benefit of being verifiable through interviews or reference checks. Reilly and
Chao (1982) found that personal history correlated .38 with success in management positions, and
Stricker and Rock (1998) found that personality-based biodata measures effectively predicted lead-
ership potential. Rothstein and colleagues (1990) demonstrated a validity coefficient of .32 across
organizations with a sample of 11,000 first-line supervisors, which generalized across race, gender,
and age groups along with levels of education, work experience, and tenure with the company.
Similarly, Carlson, Scullen, Schmidt, Rothstein, and Erwin (1999) found that the validity for a bio-
data inventory predicting managerial success generalized across organizations (ρ = .53), and, again,
was predictive for men and women as well as for managers of all age groups, tenure with company,
and education levels. Further research on the predictive validity of life experiences for early success
as a leader is needed to substantiate these findings.
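The validity-generalization evidence cited above rests on meta-analytic aggregation of study correlations. The sketch below illustrates the first step of a Hunter–Schmidt "bare-bones" analysis: the sample-size-weighted mean validity, and the residual variance after removing expected sampling-error variance. The study correlations and sample sizes are invented for illustration; they are not the values reported by Rothstein et al. (1990) or Carlson et al. (1999).

```python
# A minimal sketch of a Hunter-Schmidt "bare-bones" validity-generalization
# step. All study values below are hypothetical.

def bare_bones(rs, ns):
    """Return the N-weighted mean validity and the residual variance of
    correlations after subtracting expected sampling-error variance."""
    n_total = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / n_total
    # Observed (N-weighted) variance of correlations across studies.
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / n_total
    # Expected sampling-error variance: (1 - r_bar^2)^2 * k / N_total.
    var_err = ((1 - r_bar ** 2) ** 2) * len(rs) / n_total
    return r_bar, max(var_obs - var_err, 0.0)

rs = [0.28, 0.35, 0.31, 0.40]   # hypothetical per-organization validities
ns = [400, 250, 600, 150]       # hypothetical sample sizes
r_bar, var_res = bare_bones(rs, ns)
# Here the observed spread across studies is smaller than sampling error
# alone would produce, so the residual variance is floored at zero --
# the pattern that supports a claim of validity generalization.
```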
(past experiences), and items that ask respondents about others’ opinions of them are more valid
than direct self-report items (Lefkowitz, Gebbia, Balsam, & Dunn, 1999). Although the behavioral
consistency method is time-consuming and costly to construct, Schmidt and Hunter (1998) noted
that the method is well worth the cost and effort for higher-level jobs, such as entry- and mid-level
leaders. There have been no empirical studies that have directly examined the predictive validity of
achievement profiles for entry- to mid-level leaders.
Personality Measures
Personality measurement has been extensively researched, and practitioners continue to explore the
practical value of personality for predicting leadership success. Within personnel selection, personality
predictors can be roughly divided into two categories: (a) general measures of adult personality (e.g.,
NEO-PI, 16PF, HPI) that are used to describe individual differences in behavior and (b) measures of
personality (e.g., integrity tests, violence scales, drug and alcohol scales, etc.) that are used for the pre-
diction of individual differences (Salgado et al., 2001). Despite the extensive research on the Big Five
for predicting job performance (e.g., Barrick & Mount, 1991) and relatively high validity coefficients for
both conscientiousness (.31; Schmidt & Hunter, 1998) and integrity tests (.41; Ones, Viswesvaran, &
Schmidt, 1993), these measures may have low validity for management jobs depending upon how the
construct (e.g., conscientiousness) is defined (Hough & Oswald, 2000). For example, Hogan and Ones
(1997) defined conscientiousness as conformity and socially prescribed impulse control. On the basis
Selection Methods and Desired Outcomes 731
of this definition, Hough and Oswald believed that conscientiousness would not predict performance
in jobs in which creativity and innovation are highly important (characteristics of many leadership
positions). Although Ones and Viswesvaran (1996) argued that broad personality domains are
better for predicting performance across job levels than narrow domains, others have shown that con-
scientiousness was not a valid predictor of managerial performance (Robertson, Barron, Gibbons,
MacIver, & Nyfield, 2000). This study suggests that practitioners should conduct concurrent validity
studies before recommending personality tests as core elements of a selection process.
In contrast to Robertson et al. (2000), Bartram (2004) indicated that scores on scales of the
Occupational Personality Questionnaire (OPQ) and ratings of work behavior on the Inventory of
Management Competencies showed an average uncorrected validity of .48, with a range of .29 to .69
(zero-order correlations). Additionally, personality measures have also been shown to predict lead-
ership style (Hogan & Kaiser, 2005). More recently, Hogan, Davies, and Hogan (2007) proposed
a conceptual model that links certain personality variables to workplace behaviors. They outlined
various strategies for utilizing the validity evidence from prior research to apply to other positions,
and they used research from managerial jobs as examples.
Although there is some debate as to the level of analysis that should be used (e.g., Robertson &
Smith, 2001) and there have been some conflicting findings regarding the validity for leader selec-
tion, personality measures (whether conscientiousness or integrity) add a degree of validity (i.e.,
incremental validity) over and beyond cognitive ability. An advantage of personality measures
over cognitive ability measures is that personality measures do not demonstrate adverse impact
to the same extent as other measures (Hogan & Hogan, 1995). As with cognitive ability tests,
there are various group differences that tend to be associated with personality measures; however,
these differences tend to involve sex rather than race. Indeed, as noted previously, personality tests
tend not to show significant group differences (i.e., adverse impact) with regard to racial groups.
For example, Ones and Viswesvaran (1998) compared the scores of African Americans, Hispanics,
Native Americans, Asians, and Whites and found trivial differences. They went on to note that group
differences of these “trivial magnitudes” are not likely to cause any discernible adverse impact.
With regard to sex differences on the Big Five factor of conscientiousness, women tend to score
higher than men (Feingold, 1994). Hough, Oswald, and Ployhart (2001)
note that women tend to score higher on “dependability” scales whereas men tend to score higher
on “achievement” scales. Similarly, Ones and Viswesvaran (1998) found that women tended to
score higher than men on overt integrity tests. Overall, the use of personality measures for making
employment decisions is accepted, and the validity evidence for certain scales is growing (Hogan,
Hogan, & Roberts, 1996). Indeed, in a survey of organizations from 20 countries and 959 organiza-
tions, Ryan, McFarland, Baron, and Page (1999) found two general trends: (a) personality test usage
around the globe is increasing and (b) personality measures are being used more frequently for
managerial and professional selection.
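In practice, claims about adverse impact such as those above are commonly checked with the "four-fifths" (80%) rule from the Uniform Guidelines on Employee Selection Procedures: a focal group's selection rate below 80% of the highest group's rate is flagged for scrutiny. The sketch below shows the computation; the pass rates are hypothetical.

```python
# A minimal sketch of the "four-fifths" (80%) rule used to flag potential
# adverse impact. The applicant and selection counts are hypothetical.

def impact_ratio(focal_selected, focal_applicants,
                 reference_selected, reference_applicants):
    """Ratio of the focal group's selection rate to the reference
    (highest-rate) group's selection rate."""
    focal_rate = focal_selected / focal_applicants
    reference_rate = reference_selected / reference_applicants
    return focal_rate / reference_rate

# Near-parity pass rates, consistent with the trivial race differences
# on personality measures reported above: 46% vs. 50%.
ratio = impact_ratio(46, 100, 50, 100)   # 0.92, above the 0.80 threshold
flagged = ratio < 0.80                   # False: no adverse-impact flag
```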
Assessment Centers
In this review, assessment centers (ACs) are the only method developed specifically to measure
leadership and managerial performance. AC refers to an evaluation method, or process, that usually
includes multiple simulations designed to assess dimensions (i.e., competencies, or categories
of behaviors associated with success or failure in the target position) and to simulate critical
managerial job activities (Bray, 1982). An AC simulates critical and representative job challenges. It
may include written simulations (e.g., in-basket, fact finding, analysis, and planning exercises) and
interactive simulations (e.g., role-plays, presentation, group discussion, and business game; Howard,
2006). The groundbreaking work with ACs was the Management Progress Study at AT&T, which
led to the use of ACs as an aid in selecting first-line supervisors (Bray & Howard, 1983).
ACs are mainly used for managerial selection and have an impressive record of predictive
validity (.37; Schmidt & Hunter, 1998). Thornton and Rupp (2006) indicated that the estimates of
the relationship between AC ratings and management success range from .31 to .43, and Gaugler,
Rosenthal, Thornton, and Bentson (1987) found in their meta-analysis that there was an upper bound
of .63 under “optimal” conditions. AC researchers and practitioners disagree about the appropriate
way to approach AC research. Most practitioners agree that competencies are categories
of behavior related to job success and not psychological constructs. Most research treats competen-
cies as constructs in which factor analytic studies indicate that the key factors that emerge from an
analysis of AC data are related to exercises rather than the competencies that are the assessment
target (Robertson & Smith, 2001). Lance (2008) recommended that dimensions should not be the
focus of the AC process, but that the exercises themselves should be rated. However, the influence
of exercises may have been exaggerated. Indeed, Arthur, Woehr, and Maldegen (2000) showed that
60% of the variance can be attributed to individuals and dimensions and only 11% to assessors and
exercises. This is in line with Howard’s (2008) assertion that an “integrative” perspective is more
appropriate. Specifically, Howard noted why dimensions (i.e., competencies/behaviors) are essential
to ACs and that focusing solely on exercises (i.e., the situations) is a mistake. Although the type of
exercise, or situation, does influence how individuals are judged, the assessors are primarily dif-
ferentiating people on the competencies of interest (Howard, 2006). Also, exercises tend to focus on
different elements of the key behaviors that constitute a competency, reducing the extent of
redundancy in competency correlations across exercises.
Particular attention has been given to group differences associated with ACs. Goldstein, Riley,
and Yusko (1999) examined group mean differences on an exercise level and found that the dif-
ferences varied by exercise. Specifically, the mean differences appeared to be a function of the
cognitive component of the exercise. They found that those exercise components that required inter-
personal skills resulted in less adverse impact for African Americans than exercise components
that focused on cognitive abilities in which Whites tended to score higher. Indeed, Bobrow and
Leonards (1997) developed an AC for first-line supervisors in a customer service division (i.e.,
requiring substantial interpersonal skills) and found no differences between Whites and minorities.
Overall, Hoffman and Thornton (1997) have reported that, although Whites tend to score higher on
overall AC ratings, the differences are typically lower than those found with cognitive ability. As
noted by Hough, Oswald, and Ployhart (2001), these racial differences appear to be associated with
measuring cognitive components. In addition to racial differences, Clapham and Fulford (1997)
examined age differences in overall AC ratings. They found that older adults tended to score lower
than younger workers and noted that this was a surprising result, particularly if experience, which
presumably comes with age, is important for developing managerial skills.
INTERVIEWS
Interviews are the most frequently used procedures in personnel selection across all countries, jobs,
and levels (McDaniel, Whetzel, Schmidt, & Maurer, 1994; Salgado et al., 2001) and likely the
most frequently utilized method for leadership selection. Indeed, it is estimated that practically
100% of selection processes use one or more interviews, although not all types of interviews are
as valid, or even as useful, as others. The employment interview has been the target of
significant research (e.g. McDaniel et al., 1994). Krajewski and colleagues (2006) compared the
validity of situational versus past experience interviews for predicting managerial performance.
Using a sample of 157 applicants to managerial positions, they found that the experience-based
interview significantly predicted overall performance (.32), whereas the situational interview did
not (.09). Additionally, in an examination of the construct differences between the two interview
types, Krajewski and colleagues also showed that the experience-based interviews were highly
related to manager-relevant work sample measures (i.e., AC exercises), cognitive ability facets, and
personality traits.
Ultimately, the most effective selection system will use various methods, and this is especially
true for entry- and mid-level leadership jobs in which the job requires balance among experience,
knowledge, interpersonal competencies, leadership motivation, and personality. According to
Schmidt and Hunter (1998), “incremental validity” can be translated into increases in utility (i.e.,
practical value). The issue of the incremental validity provided by different methods is useful for
assessing the extent to which combinations of methods are useful or, by contrast, overly redundant.
For example, personality tests and assessment simulations measure different job requirements.
Using both methods together should produce incremental validity, thereby leading to a stronger
relationship with leader performance.
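The translation of incremental validity into dollars can be illustrated with the Brogden–Cronbach–Gleser utility framework that underlies Schmidt and Hunter's (1998) argument. Every number in the sketch below is an illustrative assumption, not a figure from the chapter.

```python
# A hedged sketch of how incremental validity translates into utility,
# using the classic Brogden-Cronbach-Gleser framework. All inputs below
# are illustrative assumptions.

def utility_gain(n_hires, delta_r, sd_y, z_bar, extra_cost_per_hire):
    """Estimated annual dollar gain from adding a method that raises
    validity by delta_r, net of the added assessment cost.

    n_hires: leaders selected per year
    delta_r: incremental validity of the added method
    sd_y:    dollar standard deviation of leader performance
    z_bar:   mean standardized predictor score of those selected
    """
    return n_hires * (delta_r * sd_y * z_bar - extra_cost_per_hire)

# Example: 20 leader hires per year, +0.10 incremental validity from
# adding a simulation to a personality test, SDy of $40,000, selectees
# averaging one SD above the applicant mean, and $2,000 of extra
# assessment cost per hire -- a net gain of roughly $40,000 per year.
gain = utility_gain(20, 0.10, 40_000, 1.0, 2_000)
```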
American company. With this in mind, the plant start-up team set out to specifically define working
requirements. In the new plant, lean operating procedures would be the core concept that defined
requirements for people and teams. Identification and elimination of waste is a core concept of lean
manufacturing. This requires everyone in the plant to work together to follow established operat-
ing procedures and to put improvements in place on a fast, continuous basis. Older manufacturing
practices, defined by hierarchical decision-making, a command-and-control management style, and
adversarial manager-worker relationships, would not work.
The core management team set out to define the requirements for leaders and team members.
The definition of success emphasized higher levels of teamwork, empowerment, coaching,
and initiative than had been attempted in the past. The management team struggled to break away
from past practices to define this new working culture. A job analysis was conducted with the new
management team. The job analysis could not rely on incumbent interviews or job observation as
the new jobs did not exist within older plants. A series of “visionary job analysis” discussions were
conducted with the plant manager and his direct reports, the functional managers within operations,
engineering, maintenance, and HR. Targets were set for behavior, motivation, and technical skills
that defined the type of leader that would be successful in the new facility. The selection system
design for leaders was especially critical. It was recognized that front- and mid-level leaders would
be critical for carrying out the day-to-day production tasks and for following the operating model
that the senior leaders had established. A rigorous program was required to identify those leaders
who would accept and excel in this progressive manufacturing environment. The recruitment effort
was complicated by the fact that leader candidates would be selected from existing manufacturing
facilities (where older manufacturing practices were the norm).
The first step in the hiring process was a comprehensive application blank that covered work
experience and technical skill. Special care was taken to fill the selection funnel with a broad pool
of leader applicants so that it contained a large number of people with the potential to display the
right skill set and an appropriate diversity mix. The next step involved a comprehensive test battery that targeted behavioral,
personality, and motivational targets that were consistent with a lean manufacturing environment.
Candidates were prioritized for the next step according to “fit” (as measured by the test battery) with
the defined roles leaders would play in the plant. A third step employed the use of a day-in-the-life
AC. This simulation-based set of exercises involved prework about the fictitious company’s opera-
tion (market, competitors, structure, culture) that was used for the simulation, an in-basket exercise
that challenged the candidate on issues ranging from planning the schedule of production to dealing
with HR issues, a coaching exercise requiring the candidate to role-play with an assessor who played
the role of a team member whose performance had gone down over the last few weeks, and a peer
exercise requiring the candidate to negotiate operating procedures and schedules with an assessor.
The AC was designed to reflect the operating environment of the new plant while giving the can-
didate the opportunity to display behaviors required in the leadership roles. Successful candidates
used behaviors that were consistent with the new start-up’s desired culture. The benefit of the AC
was realized in two ways. First, candidates had the chance to experience a realistic preview of the
leadership job for which they were applying. Second, hiring managers had the chance to see how
candidates performed in exercises that were very similar to the requirements of the job. The final
step in the selection process was a behavior-based interview during which candidates described how
they had performed in past jobs. Hiring managers asked questions that provided candidates with the
opportunity to describe their previous work and the results they had achieved in these situations.
Answers were evaluated against the target job requirements with relevancy, recency, and similarity
to the target job used as guiding evaluation criteria. At the end of the process, all of the data were
integrated, and successful candidates were given a contingent job offer (candidates needed a successful
reference check and drug screen to get the job).
Many organizations would not have taken such extensive steps to select leaders for the new facil-
ity. Each step in a hiring process takes time and investment of resources. In this case, the plant man-
agement team recognized the importance of selecting the right leaders to the eventual success of the
facility. The selection of the first leaders to come on board in a new facility is especially critical as
they play multiple roles early in the start-up such as interviewing the team member job candidates,
playing a significant coaching role in the development of the team members, and setting the tone
for the culture right from the beginning. For this plant start-up, new leaders were supported with
additional training on the concepts of lean manufacturing, coaching and team skills, as well as how
to conduct a successful interview. The results at this plant have been impressive. To date, the new
engine manufacturing facility is on target to reach its production goals and has started up on time
and on budget due in part to the successful hiring of its first wave of leaders. The workforce is mea-
sured on safety (number of safety violations and on-the-job injuries), quality (number of defects and
achievement of quality goals), and engagement (workforce survey conducted yearly). Engagement
levels at the new plant are the highest among the network of plants operated by the
organization. The plant management team attributes the high engagement level of the workforce to
the quality of the front-line and mid-level leaders. This benchmark facility is held up as an example
for how a new start-up should be implemented and as an example of the culture of the future.
force who had the motivation and potential to be effective district managers, a first-level leader
position. They were committed to a “grow your own leadership strategy” because they recognized
that internally promoted leaders performed more effectively than those hired from the outside.
The organization’s primary business involved the implementation of a complex set of product
lines. Leaders with minimal technical and sales experience specific to the company lacked cred-
ibility with the sales associates and were not effective coaches. Unfortunately, individuals promoted
from within (under the current process) were only marginally more effective as leaders. This was
disconcerting to HR leaders because they spent considerable resources in training internal leader
candidates prior to their promotion. Candidates for this leadership training program were picked
based on top management’s judgment. Internal and external selection strategies were not working
as the company faced new market opportunities, new competitors, and an increasingly complex set
of market offerings.
The first step taken to improve the program was to develop a success profile for leaders. The
selection criteria for new leaders had not changed for many years whereas the business challenges
for sales leaders had changed substantially. Working with senior sales leaders, the authors identified
the business drivers for sales leadership positions. Business drivers represent the critical activi-
ties on which incumbents must focus to successfully implement the company’s sales strategy. In
this case, they included identifying the best client opportunities in their districts and establishing
broader customer networks in these strategic accounts, coaching others to sell value rather than
focus on product features and benefits, and engaging client line executives in discussions about
their business needs and then relating those needs to technology solutions. The insights from the visionary
job analysis formed the base for the success profile for the position. It included specific competen-
cies, experiences, knowledge, and personal attributes needed to address these leadership challenges.
These insights were confirmed by successful mid-level managers and incumbent sales leaders. Key
job activities related to each performance area were documented and confirmed by a representative
sample of subject matter experts.
On the basis of this success profile, tools were developed and a process was designed to support
a new promotion process. The program included the following:
1. Realistic job previews and motivational profile: Current sales associates who had performed
above average for 3 years had access to an online leadership career site. On the
site, they obtained information that provided a balanced view of a career in sales leader-
ship. This site included insights from current successful sales leaders about the challenges
they faced in transitioning into a leader role. Without identifying themselves, users had
access to a motivational fit inventory in which they documented the degree to which they
liked various work activities. Their responses were computer-scored, and they were given
immediate and confidential access to a profile of motivational matches and mismatches
with a leadership career. The associates were encouraged, but not required, to share the
results with their manager so that they (the associates) had the data and insights they needed
to make a sound decision about pursuing a leadership career.
2. Leadership insight inventory: If they decided to continue in the process, the associates
documented their relevant experience and knowledge online and completed a leadership
insights inventory that was predictive of the many leadership competencies and personal
attributes needed to be successful in that leadership position. The inventory consisted of
a battery of item types including situational judgment items, biodata items, and personal-
ity items. The items were grouped into subscales that mapped back to the target success
profile dimensions. Only managers of candidates had access to the results. These managers
received training in interpreting the results and provided feedback to candidates. Managers
were required to have a feedback discussion with candidates. Managers were to try to influ-
ence candidates’ career choice (some candidates were identified as ready for a leadership
job while others were not), but the final decision to proceed was left to the job candidate.
3. Online training: Candidates had access to online leadership courses that they completed
at their own pace and on their own time. After candidates completed the coursework, they
were encouraged to discuss their training results with their manager and decide jointly if
they were ready for the next step in the process.
4. Competency assessment: Candidates who decided to proceed in the process had access to
an online assessment of the new sales leader competencies. The online assessment asked
candidates to respond to a series of leadership challenges by rating the effectiveness of
various actions to address each challenge. The leadership challenges provided to candi-
dates were tailored to challenges identified by the job analysis/success profiling process.
Responses were computer-scored, and the results were provided to a promotional panel that
conducted a behavioral interview to further evaluate motivational fit and skill/competency
readiness. The panel integrated the interview and assessment results in deciding whether to
place candidates in the promotion pool. Candidates were provided with feedback on the
decisions; if the decision was not to proceed, there was a discussion about other career paths.
5. Leadership training: Candidates placed in a promotion pool had access to more in-depth
leadership training.
In a concurrent validity study, 153 randomly selected incumbent managers completed the com-
petency assessment. Ratings of the leadership competencies of the participating managers were
made by their direct supervisors. The correlation between the predictor (the competency assess-
ment) and criterion (the ratings of supervisors) was .56. The validity of the competency assessment
appears to be one of the reasons why the new program is showing immediate positive results. Under
the new process, satisfaction with the slate of candidates and success rates in the final promotional
interview were much higher. There was no increase in turnover among sales associates who did
not succeed in the final steps of the promotional process, suggesting that those not selected see the
process as fair. Performance in prepromotional training was substantially better than before the
process redesign. Plans are in place to measure longer-term productivity and success of the newly
promoted leaders and their teams.
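The validity estimate reported above is simply the Pearson correlation between assessment scores and supervisor ratings. A minimal sketch of that computation, using hypothetical data (the names and values below are illustrative, not the study's):

```python
from statistics import mean, stdev

def concurrent_validity(predictor, criterion):
    """Pearson correlation between predictor scores and criterion ratings."""
    n = len(predictor)
    mx, my = mean(predictor), mean(criterion)
    # Sample covariance divided by the product of sample standard deviations
    cov = sum((x - mx) * (y - my) for x, y in zip(predictor, criterion)) / (n - 1)
    return cov / (stdev(predictor) * stdev(criterion))

# Hypothetical competency-assessment scores and supervisor ratings
scores = [62, 75, 81, 55, 90, 70]
ratings = [3.1, 3.8, 4.2, 2.9, 4.6, 3.5]
r = concurrent_validity(scores, ratings)
```

In practice such a coefficient would be computed over the full incumbent sample (153 managers in the study above) and evaluated for statistical significance.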
CONCLUSIONS
Current trends in business suggest that the demand for high-quality leaders is high, the complexity of
their jobs has increased, and the process for readying future leaders is more difficult for organizations
to implement. Selection processes that have multiple phases and methods have the greatest likelihood
of success. These processes include well-developed and validated screening tools (e.g., cognitive tests,
biodata instruments, personality and/or situational tests) accompanied by more in-depth evaluations,
such as simulations, ACs, and structured interviews. Sound leadership selection processes that are tied
to development have the greatest impact on performance, especially when there is a sound implemen-
tation strategy (e.g., the way the need for assessment is communicated to participants; transparency
of results with those impacted by the assessment; clear accountability for the participants, managers
and HR; alignment with other HR systems; and success metrics that can be used to demonstrate ROI).
As the case studies demonstrate, differing organizational needs and contexts, such as start-up and
leadership pipeline, have differing demand characteristics that impact the tools and processes used
and the implementation strategy. Other organizational contexts, such as mergers and acquisitions and
the desire to improve employee and customer engagement, also have differential impact on the assess-
ment targets, as well as implications for how they are measured and implemented.
It is clear that private organizations are not all in the same place when it comes to improving the
performance of leaders. Although there are bright spots that can be pointed to as examples that oth-
ers should follow, the lack of a systematic approach to identifying, selecting, and developing leaders
provides opportunity for the future. Leadership is a topic that has been written about extensively
738 Handbook of Employee Selection
in the academic and popular business press, and there is no lack of theory or advice on defining
leadership or conceptualizing what steps should be taken to improve leadership performance. There
is, however, a lack of agreement on the best way to assess leadership potential and performance and
how to get individuals ready for entry- and mid-level leadership roles. To be useful, future practice
and research should seek to evaluate the specific tools, processes, and implementation strategies that
will create the best ROI within specific organizational contexts. Modern, successful selection sys-
tems balance the science of leadership selection with practical realities of the business environment.
Sustainable, long-term impact is achieved by taking a holistic and practical approach to interpret-
ing organizational context, weighing the potential impact of various selection tools, and rigorously
executing the implementation plan.
34 Blue-Collar Selection in Private Sector Organizations
This chapter focuses on the issues that distinguish blue-collar/technical selection from other
selection procedures. The chapter begins with a definition of blue-collar/technical jobs and pro-
vides a comprehensive survey of the environment in which blue-collar jobs occur. Thereafter,
issues and challenges associated with planning and developing, implementing, and maintaining
selection procedures are addressed. Finally, recruitment and employee development are discussed
as positive measures that may increase the number of qualified, diverse candidates and provide
political support for selection procedures.
ENVIRONMENT
Understanding the environment for technical jobs requires recognition of the distinctions from
management selection, the effect of recent economic changes, the perspectives of management
and labor organizations, and the history of distrust.
industries, workforces have been cut to the bone, and line managers are resistant to any activity that
threatens their ability to achieve production goals; this includes the development of a selection pro-
cedure. When employees are completing job analysis questionnaires or taking experimental tests,
they are not performing their jobs. It is not always easy for line management to secure replacement
workers, and doing so may involve paying overtime wages to the workers who are available.
TRADITION OF DISTRUST
The history of the labor movement is long, and the relationship between management and labor has
often been turbulent. There has been violence on both sides, and it takes considerable effort to over-
come historical difficulties, animosity, and distrust. It is convenient to brush away these problems as
ancient history, but the truth is that the problems persist. Management is not above using reductions
in force to enhance short-term profitability, and labor is not above calling strikes to obtain conces-
sions that organizations cannot long afford.
SUMMARY
In most organizations, employee selection for technical jobs tends to be housed in a separate area
than management selection. Candidates for technical jobs tend to be graduates of high school or
technical schools as opposed to graduates of 4-year colleges. Organizations provide much of the job
training, frequently through formalized apprenticeship programs.
Corporate management sometimes underestimates the knowledge, skills, and abilities required
for technical jobs, and this belief system permeates all aspects of the planning and development of
the selection procedure. Perceptions of the importance of technical jobs may change as the
nation faces the joint pressures of an aging workforce and a relative scarcity of skilled labor.
Understanding the perspectives of management and labor organizations and the tradition of
distrust is essential when working on selection procedures for technical jobs. As David Bownas
(2000) noted, management and labor have a great deal in common. Both have attained positions of
power and influence, and neither is likely to question its own judgment in matters of personnel
selection. Support from labor and management is essential to the development of a successful selec-
tion procedure.
ECONOMIC CONSIDERATIONS
Before embarking on what could be a very expensive validity study, it is important to determine that
the need faced by the organization can be addressed by a new selection procedure. For example, a
turnover problem may be due to perceptions of inadequate compensation levels or poor management
practices.
It is necessary to consider the number of individuals who will flow through the new selection
process as well as the company’s investment in the employees selected. If the number of employees
selected is small and the investment in development is nominal, a cost benefit analysis might lead
to the conclusion that the expense of a new selection process is not worth the perceived benefit. In
general, the larger the number of individuals who flow through a selection process or the greater the
expenses in time and development, the greater the potential benefit of a selection procedure to the
corporation. As an example, the electric utility industry typically invests $250,000 or more in
the training of each nuclear plant operator. For jobs with that level of investment, a
comprehensive selection process is a prudent business decision.
The resources that organizations are willing to invest in a selection procedure are often governed
by their perception of the value of the job. As indicated earlier, the sophisticated, time-consuming,
and expensive processes that may be acceptable for high-level or mission-critical jobs differ
greatly from what is acceptable for lower-level jobs. In one large corporation, the service
representative position, a relatively low-level position, was viewed as a mission-critical job because it
was the primary interface with customers. As a consequence, the company was willing to expend
considerable resources on this nonmanagement job.
One of the advantages of valid selection procedures is the economic benefit associated with the
selective identification of individuals placed into positions. A valid selection procedure provides
the opportunity to identify candidates who can increase the overall level of job performance
for the organization. The increase in average job performance can be translated into economic
values or other significant organizational outcomes such as the reduction of training costs due to
turnover. As the number of people selected, the magnitude of the validity, and the variability of
Blue-Collar Selection in Private Sector Organizations 745
job performance increases, so does the economic benefit due to selection (Cascio, 1982; Cascio &
Ramos, 1986).
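The relationship just described, in which the economic benefit rises with the number selected, the magnitude of the validity, and the variability of job performance, is the Brogden-Cronbach-Gleser utility model discussed in the utility literature cited above. A hedged sketch, with all figures hypothetical:

```python
def selection_utility(n_selected, validity, sd_y, mean_z, cost_per_applicant, n_applicants):
    """Brogden-Cronbach-Gleser utility estimate:
    dollar gain = N_selected * r * SD_y * z_bar - total testing cost,
    where SD_y is the dollar value of one standard deviation of job
    performance and z_bar is the mean standardized predictor score
    of those selected."""
    return n_selected * validity * sd_y * mean_z - cost_per_applicant * n_applicants

# Hypothetical figures: 50 hires from 400 applicants, validity .40,
# SD_y of $10,000, mean selectee z-score of 1.0, $50 testing cost each
gain = selection_utility(50, 0.40, 10_000, 1.0, 50, 400)  # roughly $180,000 per year
```

The formula makes the trade-off in the text concrete: doubling the number of hires, the validity, or the dollar variability of performance doubles the gross gain, while testing costs scale only with the applicant pool.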
TABLE 34.1
Meta-Analysis Summary Correcting for Sampling Error, Criteria and
Predictor Attenuation, and Range Restriction

Predictor Type                         Studies  Validity      Total Sample  Mean      95% Confidence  90% Credibility
                                                Coefficients  Represented   Validity  Limits          Value
Biographical information                  2         11           1,990        .23       .13–.33          .16
General mental ability                    9         12           1,492        .66       .66–.66          .66
Mechanical ability                       27         48          35,453        .76       .76–.76          .76
Memory                                    3         10           2,998        .55       .55–.55          .55
Motor ability                            20        150          29,587        .40       .40–.40          .40
Perceptual speed                         28         81          33,308        .65       .65–.65          .65
Personality and interest inventories      7         37           8,375        .22       .22–.22          .22
Quantitative ability                     39         87          57,309        .77       .77–.77          .77
Reasoning ability                         5         18           1,598        .64       .64–.64          .64
Spatial ability                          39        122          41,838        .70       .70–.70          .70
Verbal ability                           31         70          14,419        .68       .68–.68          .68

Source: From Jones, D. P., & Gottschalk, R. J., Validation of selection procedures for electric utility
construction and skilled trades occupations: Literature review and meta-analysis of related validation
studies, Edison Electric Institute, Washington, DC, 1988. With permission.
TABLE 34.2
Meta-Analysis Summary Corrected for Sampling Error and
Attenuation in the Job Proficiency Criterion

Predictor Type                         Studies  Validity      Total Sample  Mean      95% Confidence  90% Credibility
                                                Coefficients  Represented   Validity  Limits          Value
Biographical information                  2         11           1,870       −.02      −.02 – −.02      −.02
General mental ability                   16         37           2,274        .33       .33–.33          .33
Mechanical ability                       45         90          10,363        .53       .53–.53          .53
Memory                                    3          9             700        .30       .30–.30          .30
Motor ability                            31        200          29,701        .39       .39–.39          .39
Perceptual speed                         33         99          12,260        .41       .41–.41          .41
Personality and interest inventories      7         52           5,398        .04       .04–.04          .04
Quantitative ability                     53        119          17,032        .45       .45–.45          .45
Reasoning ability                         6         15           1,813        .34       .34–.34          .34
Spatial ability                          52        198          25,762        .49       .49–.49          .49
Verbal ability                           44         92           8,996        .37       .37–.37          .37

Source: From Jones, D. P., & Gottschalk, R. J., Validation of selection procedures for electric utility
construction and skilled trades occupations: Literature review and meta-analysis of related validation
studies, Edison Electric Institute, Washington, DC, 1988. With permission.
Tables 34.1 and 34.2 present the meta-analytic results corrected for sampling error
and attenuation in the criterion for training criteria and job proficiency criteria,
respectively. Consistent with previous research, correlations for training criteria were generally higher
than those for job proficiency criteria. The four highest mean validity coefficients for training
criteria were for mechanical ability (.76), quantitative ability (.77), spatial ability (.70), and verbal
ability (.68). The three lowest mean validity coefficients for training criteria were for biographical
information (.23), personality and interest inventories (.22), and motor ability (.40). The four high-
est mean validity coefficients for job proficiency criteria were for mechanical ability (.53), spatial
ability (.49), quantitative ability (.45), and perceptual speed (.41). The two lowest correlation coeffi-
cients for job proficiency criteria were biographical information (−.02) and personality and interest
inventories (.04).
The Jones and Gottschalk (1988) meta-analyses presented somewhat lower “corrected validities”
than past meta-analytic reviews. They cited the following reasons for the reduced corrected validi-
ties (p. 60):
• Relatively conservative coding rules (e.g., coding reported nonsignificant correlations as zero)
• Relatively conservative assumptions about the mean level of criterion reliability (e.g., .70)
and the amount of variance in criterion reliability among studies
• A focus on the validity of single tests rather than overall test batteries
• A focus on coding the validity of a predictor against the “overall criterion” measure
available in a given validity study, as opposed to focusing on correlations with individual
criterion ratings
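The corrections named in the table titles follow standard psychometric formulas: disattenuation divides the observed correlation by the square root of the product of the reliabilities, and Thorndike's Case II formula adjusts for direct range restriction. A sketch under the conservative .70 criterion-reliability assumption cited above (the example values are illustrative):

```python
import math

def correct_for_attenuation(r_xy, r_xx=1.0, r_yy=0.70):
    """Disattenuate an observed validity for unreliability in the
    predictor (r_xx) and/or the criterion (r_yy); the default r_yy
    mirrors the conservative .70 assumption noted above."""
    return r_xy / math.sqrt(r_xx * r_yy)

def correct_for_range_restriction(r_restricted, u):
    """Thorndike Case II correction for direct range restriction,
    where u = SD(restricted group) / SD(applicant population)."""
    r = r_restricted
    return r / (u * math.sqrt(1 + r * r * (1 / u**2 - 1)))

r_obs = 0.30                                   # observed validity
r_corrected = correct_for_attenuation(r_obs)   # ~.36 after disattenuation
r_full = correct_for_range_restriction(r_corrected, u=0.7)
```

With u = 1 (no restriction) the second function returns the input unchanged; smaller u values, indicating more severe restriction in the validation sample, yield larger corrected validities.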
Schmidt, Hunter, Pearlman, and Shane (1979) presented validity generalization data (mean
correlations) on three technical jobs: mechanical repairman, r = .78 with mechanical principles
tests; bench workers, r = .39 with finger dexterity tests; and machine tenders, r = .05 with spatial
ability tests.
Schmidt et al. (1981) reported average validities (mean correlations) for 35 mostly technical jobs
obtained with the Army Classification Battery. Validities for two samples were as follows:
A common thread running through the results presented in this section is that cognitive ability
tests of verbal, math, spatial, and mechanical comprehension are valid predictors of job performance
for the broad spectrum of technical jobs. Biographical, personality, and interest inventories—all
self-report measures—provided near-zero validities for job performance.
Faking is a concern for any self-report measure used in employment testing whose responses are not verified. One of the
experimental designs often used to detect faking on noncognitive measures is to compare the scores
of job incumbents from the original validity study with job applicants. Presumably, job incumbents
participating in a validity study will respond honestly on noncognitive measures because there are
no negative consequences associated with their test performance. Applicants, on the other hand,
are highly motivated to present themselves as positively as they can in order to obtain employment.
One would expect job incumbents to generally score higher on valid employment measures because
successful job performance is associated with having or developing increasing levels of skills and
abilities required by the job.
Research examining differences in responses to personality measures by applicant versus
nonapplicant groups typically shows significantly and substantially higher scores in the applicant
group. In a meta-analysis of job applicant faking on personality measures, Birkeland, Manson,
Kisamore, Brannick, and Smith (2006) found, across all jobs, that applicants scored significantly
higher than nonapplicants on extraversion, emotional stability, conscientiousness, and openness. In
addition, this study showed that job applicants distorted their responses on personality dimensions
that were viewed as particularly job relevant. These results show that intentional faking is the norm
for a highly motivated group in a high stakes testing environment such as employment testing.
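The incumbent-versus-applicant comparisons described above are typically summarized as standardized mean differences. A sketch with hypothetical scale scores (a positive d indicates applicant score inflation):

```python
from statistics import mean, stdev

def cohens_d(applicants, incumbents):
    """Standardized mean difference (applicants minus incumbents),
    using a pooled standard deviation."""
    n1, n2 = len(applicants), len(incumbents)
    s1, s2 = stdev(applicants), stdev(incumbents)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(applicants) - mean(incumbents)) / pooled

# Hypothetical conscientiousness scale scores (means on a 1 to 5 scale)
applicant_scores = [4.6, 4.8, 4.2, 4.9, 4.5, 4.7]
incumbent_scores = [3.9, 4.1, 4.4, 3.8, 4.2, 4.0]
d = cohens_d(applicant_scores, incumbent_scores)
```

Meta-analyses such as Birkeland et al. (2006) aggregate effect sizes of this form across studies, weighting by sample size.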
Finally, the research evidence shows that faking behavior on the part of applicants can have a
negative impact on the predictive and construct validity of noncognitive measures. When compared
to the original validation sample, Hough (1998) found that validity is often lower in applicant
groups. Douglas, McDaniel, and Snell (1996) presented evidence that criterion related validity may
decrease or disappear for applicant groups. Douglas et al. also reported that the factor structure of
personality tests changes for groups instructed to fake relative to those responding honestly.
SUMMARY
Selection procedures should only be developed if doing so provides a solution to the organization’s
problem. Once this is determined, the selection procedure’s validity and subgroup differences need
to be examined. A review of the sparse published literature on technical jobs revealed that cogni-
tive ability tests of verbal, math, spatial, and mechanical comprehension are valid predictors of job
performance for a broad range of technical jobs. Biographical, personality, and interest invento-
ries provided near-zero validities for job performance. Furthermore, in a technical environment,
women, Hispanics, and Asian Americans scored substantially below White males on noncognitive
measures.
Efficiency is generally preferred over fidelity because of the wide variety of jobs that can be
subsumed under one selection procedure. Top-down selection has greater utility than the use of
cut-off scores, but this utility comes at the price of higher validity requirements, and the process is
cumbersome in a continuous testing environment. The use of cut-off scores results in less adverse
impact than top-down selection and serves as a compromise in represented environments, where the
employee with the most seniority who passes the test is typically selected.
In many represented environments, the collective bargaining agreement specifies not only the
selection procedure(s) to be used but also the cut-off scores and the nature of the feedback that
must be provided to union members.
Line managers frequently reach their positions by one of two paths: a degree in engineering or
advancement through a range of technical jobs such as those they now supervise. Line managers
work closely with the technical workers and generally have an understanding of the skills and abili-
ties required for successful performance. Thus, line management is a valuable resource as well as
a powerful ally.
A sometimes overlooked patron of selection procedures is the technical training staff. As the
resident experts on the tasks and the ways in which they are to be performed, the trainers have
tremendous influence with line management. Trainers are well-versed on the knowledge, skills,
and abilities required for successful performance on the job, and many of them are keenly aware
that training cannot cure all deficiencies. Technical trainers often understand more readily than
line management what information the psychologists require and can contribute enormously to the
development of task lists and job requirements in the job analysis stage as well as the identification
of criterion measures.
A final note on the value of technical trainers is that they are the first people in the organi-
zation to see the benefits of a newly implemented selection procedure and the first to see the
decline in abilities when valid selection procedures are discontinued. Thus, forming an alliance
with trainers is well worth the effort and pays dividends long after the selection procedure has
been implemented.
Employee safety, like job security, is part of the very fabric of the labor movement. Thus, for
those occupations where the safety of employees is dependent upon their competence, employee
safety can be a powerful argument for the introduction of a selection procedure. The authors are
familiar with organizations where the represented employees, because of concerns regarding their
personal safety, have refused to work with employees who were hired during a period when the
selection procedure was temporarily suspended. In another case, a labor organization refused to
countenance the removal of a selection procedure for promotion because of its fears regarding the
impact of the removal on member safety.
The promotion of current employees is generally an issue when employees in entry-level posi-
tions apply for promotion to apprenticeship positions. Many organizations will give preferential
consideration to current employees over external applicants. This is particularly true in a repre-
sented environment. Thus, entry-level jobs are typically the only positions in an organization where
recruitment is almost exclusively external.
Resistance to testing current employees has led many organizations to extend the job
qualifications for higher-level jobs to entry-level positions. This approach is limited to situations
in which a majority of the employees hired into the entry-level position move up in the job
progression within a 5-year period (Uniform Guidelines, 1978). It is important to note that the 50%
majority applies only to those employees who have remained in the organization.
Imposing higher-level job requirements on entry-level employees has several benefits. External
applicants are less likely to challenge selection procedures than current employees and are less
likely to have access to the information required to wage an effective attack on the selection pro-
cedure. External candidates also lack the standing to grieve and arbitrate adverse selection deci-
sions. Those candidates who are hired are able to progress upward in the job progression without
additional testing. This approach reduces morale problems among lower-level employees and the
attendant grievances and arbitrations. It also guarantees a pipeline of workers for the more demand-
ing jobs. To the extent that the selection procedure has adverse impact, imposing the selection
procedure as an entry level requirement may avoid the appearance of a segregated workforce, with
minority candidates in the lowest level position.
GRANDPARENTING
As indicated previously, grandparenting refers to the practice of exempting certain groups of
employees from the selection procedure. It is standard procedure in most organizations to grandpar-
ent employees in their current job. The question arises concerning the breadth of the grandparenting
provisions. Does grandparenting cover only those jobs in a given line of progression or can employ-
ees move from one line of progression to another that requires the same basic skills and abilities?
As an example, cognitive ability tests such as reading, math, mechanical ability, and spatial ability
are relevant for a wide range of technical jobs. Thus, the question could be raised whether a current
employee working as a mechanic is able to move to an electrician apprenticeship program that uses
the same test battery without passing the test. Disputes regarding the breadth of the grandparenting
clauses are fertile ground for arbitrations. Furthermore, arbitrations addressing the extent of cover-
age of grandparenting provisions invariably also question the validity of the selection procedure.
Thus, great care should be taken to clearly establish and communicate the parameters under which
grandparenting will apply.
It is also important to address the impact of demotions and separations with respect to testing
requirements. For instance, if an employee takes a demotion as a result of a reduction in force, which
job progressions are available without requiring the selection procedure? Are supervisors who are
demoted back into the ranks required to qualify on a test that is used to select their former subordi-
nates? What happens to employees who leave the organization either voluntarily or involuntarily?
This situation may be addressed by including in the grandparenting policy a requirement of a cer-
tain period of time in the job or the job progression and evidence of satisfactory job performance.
In reality, the latter requirement is often more a matter of the existence of evidence that documents
unsatisfactory job performance. In the absence of such documentation, satisfactory performance is
assumed.
SUMMARY
The presence of a labor organization may reduce management’s control of the selection pro-
cedure, depending on the nature and extent of the concessions made by management in their
contract negotiations. Management’s resistance to selection procedures has at least three
components: even greater loss of control of the selection system, the need to take employees off
the job during test development, and the potential need to pay replacement workers overtime.
Labor organizations may resist selection procedures because they diminish the role of seniority
in selection decisions.
Selection procedures benefit management and labor organizations on issues such as employee
safety, reduction in adverse employment decisions, and test fairness. Finally, several operational
selection policies such as the jobs covered by the selection procedure and the grandparenting
provisions must normally be agreed to by both parties in the planning stage in order for develop-
ment to take place.
SELECTION POLICIES
Selection policies include the specification of the jobs covered by the testing procedure,
grandparenting, the duration of test qualifications over time, retest policies, test security
provisions, the nature of the testing information provided to the selecting manager, and the
nature of the feedback provided to candidates.
When a selection procedure is developed in a union environment, policies such as the jobs
covered by the selection procedure and the grandparenting provisions are generally established
during the planning stage to ensure worker participation. The remaining policies (duration of
test qualifications, retest policies, test security, information provided to the selecting manager,
and candidate feedback) are generally decided at the time the selection procedure is implemented.
Decisions regarding the duration of test qualifications over time are influenced by the nature of
the test and by data storage considerations. Scores on cognitive ability tests have a relatively long
shelf life compared with performance on skills tests, such as keyboarding, which may degrade
over time. It is not unusual to treat passing scores on cognitive ability tests as having an unlimited
life, particularly because large amounts of data can be stored relatively easily on computer
systems. Some organizations view scores on skills tests as less stable, whereas others
assume that individuals will be able to refresh their skills adequately upon assuming a position that
requires them.
Retest policies involve considerations of test security and the costs of test administration. The
interval between test administrations should be long enough to avoid practice effects. The labor
costs associated with retesting and the cost of the test materials themselves also influence retest
policies, particularly when candidate scores vary little across administrations.
Organizations that implement selection procedures must make provisions to ensure the security
of their investment. The physical storage of paper tests is relatively easy so long as access is limited
to those directly involved in the operation of the selection procedure. Storage of computer-based
tests is made more challenging in that the testing staff is often dependent on the technical skills of
information technology (IT) personnel. The most difficult of the security policies deals with access
by management. Management is naturally curious about the contents of the selection procedures,
and their knowledge of the content may jeopardize the integrity of the selection program. Managers
have been known to provide assistance to candidates whom they favor. As an example, a manager
who had access to tests in one organization provided forged answer sheets to the central office for
scoring. The “passed” prototype was used for favored candidates, and the “failed” prototype was
used for those whom the manager did not want to employ.
Blue-Collar Selection in Private Sector Organizations 755
A related issue is the amount of information provided to the selecting manager regarding test
performance. Providing test scores may result in the de facto use of a rank-ordering of candidates.
Another serious concern with providing test scores to managers is that the scores may influence
subsequent employment decisions that go well beyond the uses supported by the validation study.
For instance, supervisors may be influenced by test scores when making work assignments,
providing developmental opportunities, or deciding which employees will be affected by a
reduction in force.
Organizations must also make decisions regarding the nature of the feedback provided to job
candidates. Candidate feedback can be as simple as a form letter or as complex as a feedback inter-
view. Form letters may be limited to whether the candidate was successful on a test or may provide
more detailed information such as performance on each component of the selection procedure.
Establishing policies and influencing constituents are interactive processes in that policies must
sometimes be adjusted to gain the support of affected parties.
INFLUENCING CONSTITUENTS
The introduction of a selection procedure by definition reduces the control of the selecting
manager by limiting the candidates who are advanced for consideration. Likewise, the introduction
of the selection procedure constrains the influence of seniority for labor organizations. Line
management’s and labor’s acceptance of a selection procedure in the development stage of the
process is no guarantee of their support at the implementation stage. During the development
stage, the selection procedure is an abstraction. At the implementation stage, the selection proce-
dure has taken form and its potential impact is fully recognized. This recognition may result in
the need to revisit selection policies. As an example, one organization initially placed a limit on
the number of retests permitted on a well-developed selection procedure, believing that retests
would provide little value in increasing the pool of qualified candidates. The backlash from the
labor organization, and by extension labor relations personnel, resulted in the removal of the
upper limit on retests as a condition of implementation.
One source of resistance to selection procedures is that labor leaders and labor organizations
themselves are subject to the whims of their membership. Labor leaders may be voted out of office,
and the need for the union itself may be called into question. Thus, the labor leaders and the labor
organization are keenly aware of the need to keep their membership satisfied. Furthermore, the
union can be sued for failure to represent the interests of its members. This concern often lies at the
heart of apparently frivolous grievances and arbitrations.
SUMMARY
Selection policies at the implementation stage tend to cover the duration of test qualifications over
time, retest policies, test security provisions, test information provided to the selecting manager,
and the nature of feedback to the candidates. Efforts to persuade management and labor of the
benefits of the selection procedure must continue at this stage because the abstract concept has now
become a reality. Testing professionals must be trained on the selection policies and the standard-
ized procedures for administering and scoring the selection procedure.
VALIDITY
Most challenges, whether the source is a government agency, a disgruntled applicant or employee,
line management, or a labor organization, will be waged against the validity of the selection
procedure. Carefully developed selection procedures by definition are valid at the time they are
developed. Over time, however, some selection procedures can become less effective. Challengers
frequently allege that the jobs have changed and that the selection procedures no longer measure
the knowledge, skills, and abilities that are required for successful job performance. There are certainly
cases where this allegation is true, and it is incumbent upon the psychologist to examine changes in
the job. However, the experience in the electric utility industry is that technical jobs have changed
little over time with regard to the basic job requirements. Most of the job changes have been minor
and involved ancillary responsibilities. For example, an employee might obtain job assignments
from a computer in the company truck rather than receiving them personally from a supervisor.
Abilities in areas such as basic mathematics, reading, and understanding mechanical concepts con-
tinue to be relevant for very long periods of time. Line management and labor organizations do not
always share this opinion when they have problems of their own.
Most of the challenges coming from line management are related in some fashion to meeting
immediate staffing requirements.
Line managers often come up with their own solutions to staffing problems. These generally take
the form of using contractors, temporarily assigning current employees, lowering test standards,
reducing retest periods, and providing more detailed feedback and developmental opportunities
for unsuccessful candidates.
CONTRACTORS
Contractors have long been used to meet short-term staffing needs caused by one-time planned
activities, unexpected events, or emergencies. The shortage of skilled workers has resulted in many
situations where the contractors have been retained for years and sometimes are performing the
same work as the employees.
This poses several legal difficulties, not the least of which is a threat to the selection procedure.
Most often the selection requirements of the contracting firms differ from those of the hiring orga-
nization. This creates a problem when the company’s selection procedures are required for current
employees but are not required for contractors who perform the same work. If the selection procedure
employed by the hiring organization has adverse impact, then the use of contractors who perform
the same work as current employees threatens the business necessity of the selection procedure. This
problem is made all the more difficult when the contractor who has been performing the same work as
the employees applies for a job with the company and is not successful on the selection procedure.
Two solutions that have been implemented to deal with these problems are limiting the range
of work performed by the contractors and requiring that contractors retained beyond a specified
period of time meet the employee selection requirements. Restricting the work that
can be done by a contractor has the advantage of ensuring that only qualified employees perform the
more complex work. This approach has withstood legal challenges in the electric utility industry.
The second approach, requiring that long-term contractors be successful on the selection pro-
cedure, can be difficult to manage if the organization chooses to track employment time. Many
employers who adopt this approach require that all contractors qualify on the selection procedure
prior to setting foot on company property. The benefit is that the company has some assurance that
the contractors hired have the ability to do the work and will be eligible for hire when they apply.
The disadvantage is that sometimes the need for workers exceeds the number of contractors who
have been successful on the selection procedure.
An additional concern frequently raised is the perception that testing contractors creates
co-employment issues. Co-employment refers to the sharing of employment responsibilities
(including compliance with employment laws) by the staffing agency and the company. For
example, the company and the staffing agency might jointly be held responsible for violations of
Title VII of the Civil Rights Act.
TEMPORARY ASSIGNMENTS
This situation is very similar to the contractor problem described above, except that the person
performing the work is a current employee. When an employee who has performed a job for some
period of time is denied promotion into it, the company may face a host of legal challenges,
including grievances, arbitrations, and lawsuits.
The solution applied most often is to impose the higher-level job standards on individuals who
enter a job progression. However, this is only possible when more than 50% of the employees
hired into the lower-level position move to the higher-level job within 5 years.
LOWERING TEST STANDARDS
Lowering the test standards is another way to enlarge the pool of qualified candidates,
but it should be a last resort because it will be accompanied by a corresponding reduction in job
performance. Other times, line managers who are dissatisfied with the level of job performance of
their current workforce may request that the standards be raised. Careful review of pass rate infor-
mation in comparison to job openings and job performance information must precede changes in
the standards regardless of the direction.
Reducing the period between retests is also a popular idea but doing so increases the costs asso-
ciated with administering and scoring the selection procedure. Furthermore, the reliability valued
by psychologists is not always appreciated by line management and labor organizations when can-
didates are repeatedly unsuccessful. By contrast, organizations may choose to lengthen the retest
period when labor markets are more favorable.
Despite the best efforts of all concerned, occasions do occur when labor relations personnel
unwittingly negotiate problematic agreements. For instance, the authors are familiar with a cir-
cumstance where the wrong test was negotiated in a collective bargaining agreement. The situ-
ation was resolved by negotiating an addendum to the collective bargaining agreement for the
use of the correct test battery; the alternative could well have been a concession to eliminate the
testing requirement. Thus, the moral of this story is that labor relations personnel must be edu-
cated on testing issues to avoid inconvenient and counterproductive concessions during contract
negotiations.
More detailed feedback, diagnostic testing, and practice tests are described later in the chapter.
REDUCTIONS IN FORCE
Reducing the number of employees is probably one of the most unpleasant activities of management
in that the affected employees often have done nothing to warrant this action. If there is a collective
bargaining agreement, then the displacement is governed by seniority. The employees with lower
seniority may be able to bump into other jobs in order to maintain their employment. Regardless of
the presence or absence of a labor organization, the reduction in force often impacts testing when
the job reassignment requires that the displaced employee pass an employment test before moving
to the position of choice.
The development of a single test battery for multiple jobs with similar requirements can alleviate
this problem. Oftentimes, the job a displaced employee wants to move to is covered by the same test
battery or a test battery measuring similar abilities.
When the reduction in force is substantial, senior employees who had previously been grandpar-
ented into their positions may have to test to move into another job progression. This depends on the
parameters of the grandparenting policy and, in the case of represented employees, the collective
bargaining agreement.
PLACEMENT OF EMPLOYEES UNABLE TO PERFORM THEIR JOBS
Positions suitable for employees who can no longer perform their current jobs, particularly jobs
that are less physically demanding, are limited in number. Administrative jobs are attractive alternatives, but often
vary considerably in the skills and abilities required for successful performance. Refusing to
lower the standards for entry into these less demanding administrative jobs is a political
nightmare; however, defending an exception is often even more perilous, threatening the
defensibility of the selection procedure in situations where other candidates of equally low
ability are refused employment.
SUMMARY
Maintaining a selection procedure is primarily an issue of protecting it against a variety of chal-
lenges. Most of the challenges will be waged against the validity of the selection procedure and/
or the standards that differentiate successful candidates from less successful ones. Although many
technical jobs change very little over long periods of time, job changes must be monitored to assess
the impact on the validity of the selection procedure. The trigger point for management challenges
to the selection procedure generally involves difficulties in maintaining adequate staffing levels.
Interim solutions provided by line management include the use of contractors and temporary assign-
ment of current employees to different jobs. Both of these solutions have the potential of threatening
the business necessity of the selection procedure.
Some policies may need to be modified over time to preserve the selection procedure. The
standards for distinguishing successful from unsuccessful candidates may need to be adjusted,
either upward or downward based on the labor market and job performance. Retest periods are
also subject to changes depending on the labor market.
It is necessary that labor relations personnel be educated on testing issues to avoid inconvenient
and counterproductive concessions during contract negotiations. Reductions in force and the place-
ment of employees unable to continue performing their jobs can create havoc on existing selection
procedures when labor relations personnel are uninformed of testing requirements.
DIAGNOSTIC TESTING
Most selection procedures are designed to deliver accurate recommendations on whether a candi-
date should proceed with the remainder of the selection process at a relatively low cost in the least
amount of time possible. These cost and time criteria do not lend themselves to providing in-depth
diagnostic information regarding the specific needs for improvement. Therefore, the quality of the
feedback available to candidates is somewhat limited.
Companies that have combined the diagnostic feedback with concerted remedial opportunities
at local community colleges are reporting positive results on pass rates. These efforts certainly war-
rant closer investigation to determine their efficacy.
At the very least, the development of diagnostic instruments to address the need for feedback can
be a powerful political force for improving employee reactions to the selection procedure and the
organization. One goal of the diagnostic test is to enhance retention of lower-level employees in the
organization. Labor leaders have embraced the process and negotiated its implementation in a num-
ber of their collective bargaining agreements. Providing diagnostic feedback to candidates, many
of whom are currently employed, shifts the burden to the employee to do something to address
developmental needs. Even if the employees do nothing to improve, their enhanced self-efficacy
does much to alleviate employee relations problems.
Whereas the diagnostic assessments and feedback have been concentrated on current
employees, efforts have also been directed to candidates outside of the organizations in the form
of practice tests.
PRACTICE TESTS
Many organizations that use pre-employment tests provide candidates with brochures describing
the testing process. Although testing brochures can help alleviate candidate anxiety, supplementing
the brochures with practice tests that provide information on how to solve the sample problems is
likely to increase pass rates.
The basis for believing that practice tests may be effective lies in the research of Clause, Delbridge,
Schmitt, Chan, and Jennings (2001). These researchers examined the relationship between test
preparation activities and employment test performance in situations where applicants were tested
on material to be learned during a given period of time. They found that self-efficacy and
motivation were associated with greater efforts to plan and monitor test preparation activities, to engage in
learning strategies that required deeper information processing, and to devote more effort to the test
preparation process.
There is certainly room for debate regarding whether increased test scores following practice are
the result of candidates becoming familiar with test content or reflect actual enhancements in candidate
ability levels. In a meta-analysis of retest scores on cognitive ability tests, Hausknecht, Halpert, DiPaolo,
and Moriarty Gerrard (2007) found practice effects of about one quarter of a standard deviation between
the first and second test administration. In addition, these researchers found that coaching was associ-
ated with larger practice effects but that the effects of coaching deteriorated with time. Retests using the
identical test form were associated with greater practice effects than the use of an alternate form, and
shorter intervals between retests of the identical form were associated with larger practice effects.
The most important issue for organizations is the effect of practice tests on the ability of the test
to predict criteria of interest. Hausknecht, Trevor, and Farr (2002) investigated the effect of retests
on posthire training performance and turnover among candidates hired into law enforcement posi-
tions. After controlling for ability, a consistent, though small, effect size was found favoring the
training performance of candidates who were hired after retesting. Those candidates who were
hired after retesting also were more likely to remain with the organization than those candidates
who were hired after their first test. Candidates who retested to obtain positions were viewed as
being highly motivated and once hired, invested in their jobs. In an academic setting, Lievens,
Buyse, and Sackett (2005) found that the validity of a cognitive ability test in predicting grade point
average was unaffected by whether the first or second test score was used.
Our contention is that the model proposed and supported by Clause et al. (2001) has application
to cognitive abilities testing when practice tests are made available. The effects of the test prepara-
tion are likely to be moderated by the varying levels of ability at the time the practice tests are made
available. High ability candidates will probably pass regardless of the availability of the resource,
and some low ability candidates may never be able to improve their abilities to the extent necessary
to be successful either on the test or the job. However, neither of these groups is really the focus of
the practice tests. The real goal is to provide an opportunity for individuals with modest abilities to
develop to the level necessary to pass the tests and to competently perform the jobs. These individu-
als may simply need to refresh abilities not recently used or may never have fully developed their
potential. The only thing that really matters is that those individuals who are motivated and willing
to put forth the required effort may be able to improve their abilities sufficiently to be successful on
the selection procedure and the job. The opportunity to prepare for test performance also becomes
an indirect measure of the candidates’ motivation. Although it would be desirable to disentangle
the effects of initial ability from those of motivation, this is not always viable in applied settings. Anonymity
has been promised to candidates in the electric utility industry to encourage the use of the practice
tests and make it impossible for organizations to use the results for selection. A variety of research
designs are under discussion to attempt to quantify the positive reports from the field.
Even if providing practice tests has no impact on actual test performance, the practice continues
to serve a public relations function. For candidates, practice tests may lead to enhanced self-efficacy
and perceptions of the test process as being fair. In the case of employees, labor leaders, and line
management, providing practice tests seems like the right thing to do. There are few things that
practitioners in technical environments do that result in this level of satisfaction by this broad a
swath of their constituents.
Anecdotal reports from staffing departments generally indicate improved pass rates on the selec-
tion procedure after the implementation of diagnostic feedback and the practice tests, and there have
been no corresponding complaints of decrements in the job performance of candidates hired after
these initiatives were implemented. Although the absence of complaints from line managers does
not constitute definitive evidence that the diagnostic tests and practice tests are functioning
effectively as developmental interventions, these managers have never been reluctant to voice
dissatisfaction with the effectiveness of selection procedures. One of the benefits of these approaches is
that individuals and the organization share the responsibility for development.
SUMMARY
The ultimate value of a selection procedure is dependent on the ability of line management to staff
vacant positions with qualified candidates. Locating qualified staff is becoming increasingly difficult
regardless of candidate race, gender, or ethnicity. The field has traditionally focused on identifying
ways to reduce the adverse impact of selection procedures. An alternative is to deal with the problem on
the supply side by upgrading the skills of candidates who have not fully developed their abilities.
Three methods that may be used to enhance talent acquisition are the development of working
relationships with technical schools, providing diagnostic testing, and making practice tests available
to candidates. Developing relationships with rigorous technical schools has increased pass rates
substantially. In addition, diagnostic tests and practice tests can be provided to help candidates
identify and address developmental needs. Practice tests are not without controversy; however, the
focus of these practice tests has been on skill acquisition rather than on test-taking strategies. The
candidates who are motivated to use the practice tests and are subsequently hired may, as
suggested by the research of Hausknecht et al. (2002), go on to display other behaviors that are valued
by organizations. Anecdotal evidence suggests that the diagnostic feedback and the practice tests
uniformly have been associated with positive responses from recruiters, labor relations and staffing
staff, diversity committees, and labor organizations; these initiatives warrant formalized investigation.
REFERENCES
Birkeland, S. A., Manson, T. M., Kisamore, J. L., Brannick, M. T., & Smith, M. A. (2006). A meta-analytic
investigation of job applicant faking on personality measures. International Journal of Selection and
Assessment, 14, 317–335.
Bownas, D. (2000). Selection programs in a union environment: A commentary. In J. Kehoe (Ed.), Managing
selection in changing organizations (pp. 197–209). San Francisco, CA: Jossey-Bass.
Cascio, W. F. (1982). Costing human resources: The financial impact of behavior in organizations. Boston,
MA: Kent.
Cascio, W. F., & Ramos, R. A. (1986). Development and application of a new method for assessing job performance
in behavioral/economic terms. Journal of Applied Psychology, 71, 20–28.
Clause, C. S., Delbridge, K., Schmitt, N., Chan, D., & Jennings, D. (2001). Test preparation activities and
employment test performance. Human Performance, 14, 149–168.
Douglas, E. F., McDaniel, M. A., & Snell, A. (1996). The validity of noncognitive measures decays when applicants
fake. Paper presented at the annual conference of the Academy of Management, Cincinnati, OH.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: John Wiley & Sons.
Hausknecht, J. P., Halpert, J. A., DiPaolo, N. T., & Moriarty Gerrard, M. O. (2007). Retesting in selection:
A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied
Psychology, 92, 373–385.
Hausknecht, J. P., Trevor, C. O., & Farr, J. L. (2002). Retaking ability tests in a selection setting: Implications
for practice effects, training performance, and turnover. Journal of Applied Psychology, 87, 243–254.
Hedge, J. W., Borman, W. C., & Lammlein, S. E. (2006). The aging workforce: Realities, myths, and implica-
tions for organizations. Washington, DC: American Psychological Association.
Hough, L. M. (1998). Effects of intentional distortion in personality measurement and evaluation of suggested
palliatives. Human Performance, 11, 209–244.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction for meta-
analysis methods and findings. Journal of Applied Psychology, 91, 594–612.
Jones, D. P., & Gottschalk, R. J. (1988). Validation of selection procedures for electric utility construction and
skilled trades occupations: Literature review and meta-analysis of related validation studies. Washington,
DC: Edison Electric Institute.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). Retest effects in operational selection settings: Development and
test of a framework. Personnel Psychology, 58, 981–1007.
Prasow, P., & Peters, E. (1970). Arbitration and collective bargaining: Conflict resolution in labor relations.
New York, NY: McGraw-Hill.
Blue-Collar Selection in Private Sector Organizations 763
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their
effects on criterion related validity. Human Performance, 9, 241–258.
Ryan, A., Ployhart, R. E., & Friedel, L. A. (1998). Using personality testing to reduce adverse impact: A
cautionary note. Journal of Applied Psychology, 83, 298–307.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1981). Task differences as moderators of aptitude test validity in
selection: A red herring. Journal of Applied Psychology, 66, 166–185.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter Bayesian
Validity Generalization Model. Personnel Psychology, 32, 257–281.
Uniform Guidelines on Employee Selection Procedures. 29 CFR § 1607 et seq. (1978).
This page intentionally left blank
35 Selection for Service and Sales Jobs
According to data from the Bureau of Labor Statistics (BLS), U.S. organizations currently employ
over 30 million workers in service and sales occupations (U.S. Department of Labor, 2007). Although
annual turnover rates can exceed 100% for some jobs in services and sales, even a conservative esti-
mate of 20% turnover reveals that U.S. organizations select over 6 million service and sales work-
ers each year. As such, many organizations have adopted formal assessment methods to improve
hiring decisions and ultimately increase organizational effectiveness. Research shows that the use
of validated selection tools as part of a broader, strategic approach to human resource (HR) manage-
ment is associated with higher productivity, lower employee turnover, and better corporate financial
performance (Huselid, 1995; Terpstra & Rozell, 1993). However, it is clear that not all selection
methods are equally effective, nor do research findings apply uniformly to all occupations.
This chapter provides a review of selection research for service and sales occupations and is
organized into three major sections. First, we describe the nature of service and sales work and
define the competencies that underlie success in these jobs. Second, we summarize past research
concerning the methods that have been used to select service and sales employees with attention to
issues of validity, applicant reactions, and adverse impact. Finally, we discuss the implications of
this body of work for practice and future research, highlighting several important but often over-
looked issues concerning selection system design for this critical segment of today’s workforce.
766 Handbook of Employee Selection
TABLE 35.1
Job Titles for Common Occupations in Services and Sales

Services: Flight attendants; customer service representatives; ticket agents and travel clerks; tellers; hotel, motel, and resort desk clerks; waiters and waitresses; gaming service workers; concierges

Sales: Retail salespersons; real estate sales agents; sales representatives; telemarketers; insurance sales agents; travel agents; advertising sales agents; cashiers

The largest service occupations in the United States include customer service representative (approximately 2.2 million employees) and waiter/waitress (2.4 million). Table 35.1 provides a sampling of these and other job titles commonly found within the service sector.
Occupational information from the O*NET™ database (http://www.onetcenter.org) reveals that
the core activities of service workers often involve (a) interacting directly with the public (i.e.,
customers), (b) processing customer requests (e.g., billing inquiries, food orders, bank deposits), (c)
soliciting sales of new products and services, and (d) routinely dealing with unpleasant and angry
people, such as when resolving complaints.
In contrast, the general nature of most sales work involves selling products and services to
customers, clients, or businesses. Approximately 14 million people held sales and related occu-
pations in 2007 (see Table 35.1 for a sample of common sales-related job titles). This group con-
sists largely of retail salespersons (4.4 million), cashiers (3.5 million), and sales representatives
(2.5 million). On the basis of O*NET information, the core activities of sales workers include
(a) locating new clients or customers, (b) determining customers’ needs, (c) providing informa-
tion about products or services (e.g., features, benefits, pricing), (d) convincing customers to
purchase products or services, (e) negotiating sale prices and terms, and (f) providing follow-up
services.
TABLE 35.2
Important Worker Requirements and Characteristics for Service and Sales Occupations

Worker Requirements (Knowledge)
Customer and personal service (a): Knowledge of customer-service principles and processes (e.g., customer needs assessment, quality service standards, evaluating customer satisfaction)
Sales and marketing (b): Knowledge of principles and methods for promoting and selling products and services (e.g., marketing strategies, product demonstrations, sales techniques)

Worker Characteristics (Abilities)
Oral comprehension: The ability to listen to and understand information and ideas presented through spoken words and sentences
Oral expression: The ability to communicate information and ideas in speaking so that others will understand

Note: O*NET defines worker requirements as "descriptors referring to work-related attributes acquired and/or developed through experience and education" and worker characteristics as "enduring characteristics that may influence both work performance and the capacity to acquire knowledge and skills required for effective work performance."
(a) Rated as more important for service-related occupations.
(b) Rated as more important for sales-related occupations.
Several moderators may account for the variability in effect-size estimates reported in these studies, but these moderators have not been empirically evaluated to date.
In some studies, interactive effects among personality dimensions, moderating contextual
influences, and other design considerations have been found to account for an additional 2–9% of
the variance in performance ratings. Brown, Mowen, Donovan, and Licata (2002) studied frontline
restaurant service workers and found that customer orientation partially mediated the relationship
between certain personality traits (emotional stability, agreeableness, need for activity) and self- and
supervisor-provided performance ratings. The results indicated that customer orientation accounted
for an additional 2% of the variance in supervisor-reported performance and an additional 9% of the
variance in self-reported performance. In a selection context, such results show the potential value
of assessing certain traits (i.e., customer service orientation) in conjunction with more traditional
personality characteristics.
Research has also found that certain cognitive-motivational work orientations, specifically
accomplishment striving and status-striving, may mediate the relationship between certain person-
ality traits (i.e., conscientiousness and extraversion) and supervisor-rated job performance (Barrick,
Stewart, & Piotrowski, 2002). Barrick et al. sampled telemarketing sales representatives and found
that an individual’s orientation toward status-striving mediated the relationship between extraver-
sion and job performance such that individuals scoring higher on extraversion were more likely
to strive for status, which in turn resulted in higher supervisor ratings of effectiveness. Similarly,
individuals high in conscientiousness were more likely to strive for accomplishment, which led to
higher effectiveness ratings indirectly through status-striving.
Goal-setting behavior is another motivational variable that has been found to mediate the rela-
tionship between personality and job performance. Looking specifically at the personality trait
of conscientiousness, Barrick, Mount, and Strauss (1993) studied sales representatives of a large
appliance manufacturer and found that the relationship between conscientiousness and supervisor-
rated job performance was mediated by goal commitment and autonomous goal-setting, such that
individuals scoring high in conscientiousness were more likely to set and commit to goals, which
then led to increased job performance. The above studies illustrate the potential value of assessing
motivational variables in the process of selection, because they demonstrate how such variables may
impact the relationship between personality and job performance.
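The mediation logic running through these studies (a trait predicts a motivational variable, which in turn predicts rated performance) is typically quantified as a product of regression coefficients. The sketch below uses simulated data; the variable names and effect sizes are illustrative assumptions, not values taken from Barrick et al. (2002).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Simulated (hypothetical) data: extraversion -> status striving -> performance
extraversion = rng.normal(size=n)
status_striving = 0.5 * extraversion + rng.normal(scale=0.9, size=n)
performance = 0.4 * status_striving + rng.normal(size=n)

# Path a: predictor -> mediator (simple regression slope)
a = np.cov(extraversion, status_striving)[0, 1] / np.var(extraversion, ddof=1)

# Path b: mediator -> outcome, controlling for the predictor
X = np.column_stack([np.ones(n), extraversion, status_striving])
beta, *_ = np.linalg.lstsq(X, performance, rcond=None)
b = beta[2]

indirect = a * b  # product-of-coefficients estimate of the indirect effect
print(f"a = {a:.2f}, b = {b:.2f}, indirect effect = {indirect:.2f}")
```

In a full analysis the indirect effect would be accompanied by a confidence interval (e.g., via bootstrapping); the point here is only the structure of the computation.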
In terms of design, researchers have found that using supervisor, coworker, and customer ratings
of employee personality (rather than self-ratings alone) increases the total explained variance in
performance ratings by an additional 11–20% (Mount, Barrick, & Strauss, 1994). In addition, when
job performance is measured using more specific versus general job criteria, personality characteris-
tics appear to more accurately predict job performance ratings (Hogan & Holland, 2003). Regarding
personality measurement, Hunthausen et al. (2003) studied entry-level customer service managers
at a major airline and found that using an “at-work” frame of reference (i.e., asking respondents
to think about how they behave at work when responding to survey questions) resulted in stronger
relationships between two dimensions of the Big Five (extraversion and openness to experience) and
supervisory ratings of performance when controlling for cognitive ability.
Although not as plentiful as the research involving broad traits, there is evidence supporting a
narrow-traits approach to studying job performance. A meta-analysis conducted by Dudley, Orvis,
Lebiecki, and Cortina (2006) found (in their overall analysis, which included all types of jobs)
that four narrow traits of conscientiousness (dependability, cautiousness, achievement, and order)
have incremental validity over the global conscientiousness construct in predicting performance.
Specifically, the narrow traits explained an additional 3.7% of variance in overall performance.
Breaking performance into more specific criteria, narrow traits explained an additional 5–26% of
the variance in specific aspects of job performance, such as task performance (4.6%), job dedication
(25.9%), interpersonal facilitation (5.8%), and counterproductive work behaviors (13.6%).
In addition, Dudley et al. (2006) examined the incremental validity of narrow traits of con-
scientiousness on the basis of occupational type. Jobs were divided into four occupation types:
sales, customer service, managerial, and skilled/semi-skilled. Across all occupational categories,
narrow conscientiousness traits were found to have incremental validity of 1–24% over the global
dimension. Although the incremental validity of narrow traits over the global trait was relatively
small for the customer service occupational group (1.2%), it rose to 5.4% for the sales group. The
managerial occupational group showed a 9.3% increase in variance explained, whereas the skilled/
semi-skilled group posted the largest increase at 24%. On the basis of these results, the authors
note that the degree of prediction offered by narrow traits depends in large part on the type of job
and the aspect of performance under study (Dudley et al., 2006). In the context of sales and service
selection, such results suggest that although the assessment of narrow conscientiousness traits may
be useful for selection of salespeople, such assessment may have less utility for those positions
with a customer service focus. Although further research is necessary to examine the utility of a
narrow traits approach to personality assessment (particularly for other personality dimensions),
initial results suggest the assessment of narrow traits may be useful in predicting performance for
certain jobs.
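Incremental validity figures of the kind Dudley et al. report are computed as the change in R² when narrow facet scores are added to a regression that already contains the global trait. A minimal sketch with simulated data (all variable names and effect sizes below are hypothetical assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simulated scores: a global conscientiousness score plus two
# narrow facets that share variance with it but carry unique signal
global_c = rng.normal(size=n)
dependability = 0.6 * global_c + rng.normal(scale=0.8, size=n)
achievement = 0.6 * global_c + rng.normal(scale=0.8, size=n)
performance = (0.3 * global_c + 0.2 * dependability + 0.2 * achievement
               + rng.normal(size=n))

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_global = r_squared(global_c[:, None], performance)
r2_full = r_squared(np.column_stack([global_c, dependability, achievement]),
                    performance)
print(f"delta R^2 from adding narrow facets: {r2_full - r2_global:.3f}")
```

The reported percentages (e.g., an additional 5.4% for sales jobs) correspond to this delta R² expressed as a percentage of variance explained.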
Service/Customer/Sales Orientation
Given the distinctive features of service and sales work, researchers have developed composite scales
to assess candidates’ dispositions toward customers, service, and/or sales. Sometimes referred to as
“criterion-focused occupational personality scales” (Ones & Viswesvaran, 2001), these self-report,
noncognitive composite measures typically assess a pattern of personality characteristics thought to
underlie successful performance in service and sales domains. Service orientation is one such con-
struct, and it is defined as a set of basic predispositions to provide helpful customer service, includ-
ing dimensions such as friendliness, reliability, responsiveness, courteousness, and cooperativeness
(Cran, 1994; Frei & McDaniel, 1998; Hennig-Thurau, 2004; Hogan, Hogan, & Busch, 1984).
Meta-analysis findings provide evidence of validity for service orientation measures. In a review
of 41 studies, and with supervisory performance ratings serving as the criterion, Frei and McDaniel
(1998) reported an unadjusted validity coefficient of .24. They also showed that service orientation
was moderately correlated (approximately .30 to .40) with several Big Five personality constructs
(agreeableness, emotional stability, and conscientiousness), sales drive, and social vocational inter-
ests. Service orientation was generally unrelated to extraversion, openness to experience, cognitive
ability, or other vocational interests. One caveat noted by Frei and McDaniel is that most of the coef-
ficients summarized in the meta-analysis were drawn from unpublished studies that were produced
by the test vendor. More recently, McDaniel, Rothstein, and Whetzel (2006) conducted a case study
of test vendor technical reports and found evidence of “moderate-to-severe publication bias” such
that two of the four test vendors studied showed a greater likelihood of reporting only statistically
significant validity coefficients for particular scales. A second concern is that researchers have
found that service orientation measures fare no better than general personality dimensions in pre-
dicting performance and do not predict service-focused criteria any better than they predict broader
criteria such as overall performance or counterproductive work behaviors (Ones & Viswesvaran,
2001; Rosse, Miller, & Barnes, 1991).
Several measures have been developed to evaluate sales potential, customer-oriented sell-
ing orientation, or sales ability. These scales variously reflect composite measures of per-
sonality facets that are important for success in sales occupations (e.g., Hakstian, Scratchley,
MacLeod, Tweed, & Siddarth, 1997; Hogan, Hogan, & Gregory, 1992; Li & Wang, 2007),
self-assessments of behaviors taken when selling (Saxe & Weitz, 1982), or knowledge of basic
selling principles (Bruce, 1953, 1971, as cited in Vinchur et al., 1998). These studies gener-
ally find that sales potential is predictive of supervisory ratings and objective sales (Farrell &
Hakstian, 2001; Hogan et al., 1992; Li & Wang, 2007). Regarding selling/customer orienta-
tion, meta-analytic evidence from 19 studies reveals unadjusted validity coefficients of .17
for subjective performance measures and .06 for objective performance indicators, although
confidence intervals for the two criteria overlap (Jaramillo, Ladik, Marshall, & Mulki, 2007).
Vinchur et al. (1998) summarized the predictive validity of sales ability measures using meta-
analysis and reported unadjusted average correlations of .26 (supervisory performance ratings)
and .21 (objective sales).
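The "unadjusted" coefficients quoted throughout this section are observed correlations, uncorrected for statistical artifacts. A common adjustment divides the observed validity by the square root of the criterion's reliability. A minimal sketch, where the reliability value of .52 (sometimes cited for single-rater supervisory ratings) is an assumption for illustration:

```python
import math

def correct_for_criterion_unreliability(r_obs, r_yy):
    """Classical attenuation correction: r_corrected = r_obs / sqrt(r_yy)."""
    return r_obs / math.sqrt(r_yy)

# An observed validity of .26 (sales ability vs. supervisory ratings) with an
# assumed criterion reliability of .52
print(round(correct_for_criterion_unreliability(0.26, 0.52), 2))  # 0.36
```

This is why adjusted estimates in the meta-analyses cited here run noticeably higher than the unadjusted values.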
In addition to personality testing, the other dominant approach to the selection of service and sales
workers involves systematic assessment of candidates’ personal histories using biodata inventories.
The most common approach has been to develop paper-and-pencil questionnaires that ask can-
didates about various domains such as work history, experience, interests, values, attitudes, and
leadership activities (e.g., Allworth & Hesketh, 2000; Jacobs, Conte, Day, Silva & Harris, 1996;
McManus & Kelly, 1999; Ployhart, Weekley, Holtz, & Kemp, 2003; Schoenfeldt, 1999; Stokes, Toth,
Searcy, Stroupe, & Carter, 1999). Regarding sales occupations, meta-analysis evidence reveals an
average unadjusted correlation of .31 between biodata and job performance ratings and .17 between
biodata and objective sales (Vinchur et al., 1998). Dalessio and Silverhart (1994) found that bio-
data predicted 12-month survival and first-year commissions among life insurance sales agents,
although effects tended to be smaller than those typically found for performance rating criteria.
Research also supports biodata as a predictor in customer service contexts. Allworth and Hesketh
(2000) found that a biodata inventory that measured experience with tasks and behaviors required
in service jobs provided incremental validity beyond cognitive ability and personality measures in
explaining supervisory performance ratings.
Although biodata inventories encompass multiple aspects of an applicant’s background, work
experience is one element of such inventories that deserves more detailed examination. Organizations
routinely advertise that “previous experience is required” for many service and sales jobs, but expe-
rience is rarely addressed in most validation studies. Drawing from two broader meta-analyses that
included (but were not limited to) sales and service jobs reveals some support for work experience
as a predictor of performance. Schmidt and Hunter (1998) reported an adjusted correlation of .18
between previous work experience (in years) and job performance. When work experience mea-
sures were categorized according to their level of specificity (task, job, and organization) and mea-
surement mode (amount, time, and type), researchers found adjusted correlations with performance
ranging from .16 to .43 (Quiñones, Ford, & Teachout, 1995) and suggested that validity can be
maximized by measuring the amount of work experience and tailoring measures to the task level.
Although neither study was conducted with an exclusive focus on sales or service settings, other
research demonstrates the potential of assessing an applicant’s previous work experience in these
contexts. Allworth and Hesketh (2000) approached the construct of work experience by collecting
job requirements biodata from incumbents at an international hotel. This type of biodata asked
participants to gauge how much their previous or current jobs required them to enlist certain cus-
tomer service behaviors. Overall, the authors found that job requirements biodata accounted for
7.6% of unique variance in job performance. Further validation studies by Weekley and Jones
(1997, 1999) in multiple service contexts found correlations between previous work experience and
future performance that ranged from .14 to .19. Work experience was assessed using a multidimen-
sional measure that asked participants to report their total full-time work experience, maximum
tenure with any single organization, retail-specific work experience, number of different employers,
and tenure in last job.
APPLICANT REACTIONS
In addition to validity concerns, it is important to consider how applicants will respond to differ-
ent selection procedures. Broadly speaking, research on applicant reactions involves understand-
ing candidates’ perceptions of the fairness and job relatedness of different selection procedures.
The general arguments put forth in this area suggest that candidates who hold negative perceptions
of the selection process will be less attracted to the company, less likely to recommend the company
to others, and perhaps even less likely to perform well or remain on the job (Gilliland, 1993). Recent
reviews and meta-analytic evidence confirm many of these propositions with the exception of the
hypothesized performance and retention outcomes, which have yet to be systematically addressed
(Hausknecht, Day, & Thomas, 2004; Ryan & Ployhart, 2000).
When compared with a list of other possible selection methods, participants have among the least
favorable reactions to personality inventories and biodata, whereas reactions to cognitive ability testing
tend to be somewhat more positive but not as favorable as they are to interviews or work samples
(Hausknecht et al., 2004). We are not aware of any published work on applicants’ reactions to occupation-
specific inventories. Given their strong association with personality inventories, one might expect
reactions to be somewhat negative. However, because these tests have been designed for particular
applications in service and sales contexts, fairness and job relatedness perceptions may improve because
of the close connections to relevant aspects of the job. Smither and colleagues found that applicants’
perceptions were more positive for item types that were less abstract, suggesting that occupation-specific
predictors may fare somewhat better on this dimension (Smither, Reilly, Millsap, Pearlman, & Stoffey,
1993). Applicant reactions to SJTs have been studied infrequently, but evidence from Chan and Schmitt
(1997) indicated that reactions to a video-based SJT were favorable and comparable in magnitude to
those found for work sample tests in the Hausknecht et al. (2004) meta-analysis. Bauer and Truxillo
(2006) noted that reactions to SJTs may be dependent on the stimulus and response formats used (i.e.,
written vs. video, multiple-choice vs. open-ended) but suggested that reactions to SJTs overall should be
more favorable than reactions to selection procedures with more abstract content.
ADVERSE IMPACT
Given the legal context of selection and employment testing, concerns about adverse impact must
be given due consideration in selection system design and administration. Although a detailed
treatment of adverse impact research is beyond the scope of this chapter (see Hough, Oswald, &
Ployhart, 2001), several findings are summarized here concerning subgroup differences in test
scores for the predictor classes reviewed above. We note upfront that even small subgroup differ-
ences can produce adverse impact (as defined by the four-fifths rule), particularly as organizations
become more selective (Sackett & Ellingson, 1997). Further, adverse impact calculations involving a
small number of hires and/or low selection ratios tend to produce higher numbers of false positives,
meaning that adverse impact can be found even though subgroup differences are not statistically
significant (Roth, Bobko, & Switzer, 2006). Finally, it is important to point out that many of the esti-
mates reported below are based on reviews that include, but are not limited to, samples drawn from
service and sales domains. At this point in the literature, there are simply too few published studies
available to make definitive conclusions concerning adverse impact in sales and service settings.
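The interaction between subgroup differences and selectivity noted above (Sackett & Ellingson, 1997) can be illustrated by computing selection rates for two normally distributed groups separated by d standard deviations and comparing their ratio against the four-fifths (.80) threshold. The value of d and the cutoffs below are illustrative assumptions, not estimates from any particular study:

```python
import math

def pass_rate(cutoff_z, mean=0.0):
    """Share of a normal(mean, 1) group scoring at or above a z-score cutoff."""
    return 0.5 * (1.0 - math.erf((cutoff_z - mean) / math.sqrt(2.0)))

d = 0.3  # assumed subgroup difference in standard deviation units
for cutoff in (0.0, 1.0):  # top ~50% vs. top ~16% of the higher-scoring group
    ratio = pass_rate(cutoff, mean=-d) / pass_rate(cutoff)
    flag = "fails" if ratio < 0.80 else "passes"
    print(f"cutoff z = {cutoff:+.1f}: AI ratio = {ratio:.2f} ({flag} four-fifths)")
```

Even this modest d falls below .80, and the ratio shrinks further as the cutoff rises, consistent with the point that small subgroup differences can produce adverse impact when organizations become more selective.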
Generally speaking, subgroup differences based on ethnic/cultural background, gender, and age
for Big Five personality measures tend to be minimal and, when found, are typically less than one-
tenth of a standard deviation. The largest effects have been found for measures of agreeableness
(women tend to score about four-tenths of a standard deviation higher than men) and emotional
stability (men tend to score about one-quarter of a standard deviation higher than women; Hough
et al., 2001). Subgroup differences have not been comprehensively assessed for measures of service/
sales/customer orientation, although the large overlap with personality constructs suggests that dif-
ferences would be relatively small. Hogan et al. (1992) examined archival data for a personality-
based sales potential inventory and found no differences when comparing scores across ethnic/
cultural and gender-based subgroups. Published data concerning subgroup differences for biodata
inventories in sales and service contexts are limited, although broader reviews find that the aver-
age performance for Whites is about one-third of a standard deviation higher than that for Blacks
(Bobko, Roth, & Potosky, 1999).
For measures of cognitive ability, the cumulative evidence (across all types of occupations) indi-
cates that Whites score approximately one standard deviation higher than Blacks, over one-half of a
standard deviation higher than Hispanics, and approximately two-tenths of a standard deviation lower
than Asians (Hough et al., 2001; Roth, Bevier, Bobko, Switzer, & Tyler, 2001). These estimates are
moderated by job complexity such that subgroup differences tend to be larger for less complex jobs.
Thus, given that many service and sales occupations are relatively low in complexity (see O*NET),
subgroup differences may be somewhat larger in these domains. Regarding age and gender differ-
ences, research shows that age and cognitive ability test scores tend to be negatively related, whereas
cognitive ability test performance does not generally differ between males and females (Hough et al.,
2001). Finally, research on subgroup differences for video-based and written SJTs shows that Whites
tend to score about four-tenths of a standard deviation higher than members of other ethnic/cultural
groups, whereas women tend to score slightly higher (approximately one-tenth of a standard devia-
tion) than men (Nguyen, McDaniel, & Whetzel, 2005, cited in Ployhart & Holtz, 2008). Potential
age-based differences for SJTs have not been reported in the published literature.
In summary, validity and adverse impact considerations often represent tradeoffs. Selection meth-
ods with strong evidence of predictive validity often share variance with cognitive ability, and cog-
nitively loaded measures tend to produce the highest levels of adverse impact. Pyburn, Ployhart, and
Kravitz (2008) termed this situation the “diversity-validity dilemma.” From a practical standpoint,
there are many strategies available to selection specialists who must balance diversity and validity
concerns, and the interested reader is directed to several recent papers that provide valuable critiques
of these various approaches (Aguinis & Smith, 2007; De Corte, Lievens, & Sackett, 2007; Kravitz,
2008; Ployhart & Holtz, 2008). One common conclusion from this line of research is that, to date,
there are no universal solutions that successfully maximize validity and eliminate adverse impact.
CRITERION ISSUES
Much of the research reviewed here has included supervisory performance ratings as the sole criterion. Although these ratings serve many important functions, a pessimistic view is that organizations are not as interested in boosting the performance appraisal ratings of their members as they are in increasing sales volume and service quality perceptions among customers. Objective sales
figures have obvious implications for an organization’s bottom line, and customer perceptions of
service quality are an important leading indicator of future sales and repeat business (Bowman
& Narayandas, 2004). Further, in the sales domain in particular, research has shown that validity
coefficients vary considerably across different criteria (Vinchur et al., 1998). Thus, organizations
that use cognitive ability tests, for example, may see no benefit in terms of enhanced sales volume
among new hires (but would identify candidates who will be rated highly by supervisors). Despite
the appeal of using objective sales criteria, such validation work requires adequate consideration
of situational opportunities that may influence performance (Stewart & Nandkeolyar, 2006). For
example, researchers have advocated for controlling geographic or territorial constraints such as
market potential, workload, company presence in a particular area, local economic conditions, and
other region-specific factors (Cravens & Woodruff, 1973; McManus & Brown, 1995).
In addition to considering alternative measures of job performance, researchers might broaden
the types of criteria examined in future research. Only a handful of studies reviewed here examined
withdrawal behaviors or counterproductive work behaviors (e.g., Dalessio & Silverhart, 1994; Jacobs
et al., 1996; Ones & Viswesvaran, 2001). Given the significant costs associated with these outcomes,
it would be useful to broaden the scope of selection research by incorporating these criteria into
validity studies whenever possible.
PREDICTOR ISSUES
Published research in this area almost exclusively features self-report, paper-and-pencil personality tests or biodata inventories. This work is valuable, but research must also respond to new and
different forms of assessment. For example, Winkler (2006) estimated that about 5% of organiza-
tions (e.g., Toyota, SunTrust Bank) are using technology to assess important competencies via online
job simulations. These interactive assessments place candidates in a virtual environment that mirrors
the work that they would be doing on the job and allows companies to assess important competencies
while providing a realistic preview of the work itself. Other forms of capturing live behavior (e.g.,
assessment centers) may also be appropriate for assessing service and sales candidates, although
little work has been published in this area (see Burroughs & White, 1996, for an exception).
The format of predictors is another important consideration, particularly as organizations consider
how to leverage technology when building selection systems. Technology-based selection measures
differ from their paper-and-pencil counterparts in several ways (Weekley & Jones, 1997, 1999; see
also Chapter 8, this volume), and suggest a different profile of considerations for organizations in terms
of costs, applicant reactions, administrative ease, and so forth. Until additional research examines
these alternative approaches in the context of what we already know, it is unclear what (if any) effect
these alternative forms of assessment have on selection outcomes in service and sales contexts.
TEMPORAL ISSUES
Another issue raised by this analysis is that we currently know very little about the role of time in
the selection process. Much of the research reviewed here uses concurrent (i.e., cross-sectional)
designs or time-lagged predictive designs with a fairly short temporal window (e.g., 6-month per-
formance review). Yet, recent explorations into predictors of performance trends suggest that past
findings may not readily generalize across time (Ployhart & Hakel, 1998; Stewart & Nandkeolyar,
2006; Thoresen, Bradley, Bliese, & Thoresen, 2004). For example, in a study of insurance sales personnel, Hofmann, Jacobs, and Baratta (1993) found that sales agents' performance followed a quadratic trend: mean performance initially increased at a roughly linear rate and then leveled off asymptotically over time. The authors suggested that different skills and abilities may be predictive
of performance at early and later stages of the sales agents’ careers. Goal orientation was advanced
as a potential determinant of intraindividual performance trends, such that highly goal-oriented
individuals may be better equipped to learn from unsuccessful sales calls over time and more likely
to engage in the self-development activities that ultimately lead to improved performance.
Other researchers have shown that conclusions about personality-performance relationships dif-
fer when comparing cross-sectional and longitudinal designs such that certain personality charac-
teristics are more predictive of performance trends than they are of initial performance (Thoresen
et al., 2004), whereas others moderate the effect of situational opportunities on performance over
time (Stewart & Nandkeolyar, 2006). Conclusions about the predictive validity of cognitive ability
measures are also likely to be time-dependent in service and sales contexts. Keil and Cortina (2001)
found that validity coefficients decline with time, and although their review was not limited to sales
and service contexts, Cascio and Aguinis (2005) argued that task performance should be dynamic
in service contexts (thus making it more difficult to predict over time) because service workers
often have to adapt to new work processes as new products or services are introduced. These stud-
ies demonstrate that by focusing more closely on temporal dynamics, organizations can not only
select candidates who are likely to perform well soon after hire, but also identify those who have
the capacity to increase their performance over time or reach proficiency in a shorter period, both
of which are critically important to long-term organizational success.
LEVELS ISSUES
A final consideration is that nearly all of the studies reviewed here focus on selection at the indi-
vidual level of analysis. This reflects a long tradition in psychology of examining individual differ-
ence characteristics that might relate to individual job performance. However, selection researchers
have argued that the field needs to examine relationships at higher levels of analysis such as the
group, work unit, or organization (Ployhart, 2004, 2006). In one recent empirical example, Ployhart,
Weekley, and Baughman (2006) found unique personality-performance associations at individual,
job, and organizational levels and concluded that higher-level relationships may occur because cer-
tain personality factors are related to the teamwork and coordination behaviors critical for success
in service work.
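The individual- versus higher-level distinction can be made concrete with a small simulation. The sketch below uses entirely synthetic data and invented parameter values; it shows only how a pooled individual-level correlation and a correlation of organization-level means can diverge, not the actual results of Ployhart, Weekley, and Baughman (2006).

```python
import numpy as np

# Synthetic multilevel data: 40 organizations, 25 employees each.
rng = np.random.default_rng(1)
n_orgs, n_per_org = 40, 25
org_ids = np.repeat(np.arange(n_orgs), n_per_org)

# Personality and performance each contain an organization-level component
# plus individual-level noise (all variances invented for illustration).
org_effect = rng.normal(0, 1, n_orgs)
personality = org_effect[org_ids] + rng.normal(0, 1, org_ids.size)
performance = (0.8 * org_effect[org_ids] + 0.2 * personality
               + rng.normal(0, 1, org_ids.size))

# Individual-level correlation, pooled across organizations.
r_individual = np.corrcoef(personality, performance)[0, 1]

# Organization-level correlation of unit means.
p_means = np.array([personality[org_ids == g].mean() for g in range(n_orgs)])
y_means = np.array([performance[org_ids == g].mean() for g in range(n_orgs)])
r_org = np.corrcoef(p_means, y_means)[0, 1]

print(f"individual-level r = {r_individual:.2f}, organization-level r = {r_org:.2f}")
```

In this setup the aggregated (unit-level) correlation is stronger than the pooled individual-level one, which is the kind of level-dependent pattern multilevel selection research is designed to detect.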
Another multilevel study examining the impact of managerial personality traits and service
quality orientation on service climate found that a manager’s personality may play a role in shaping service climate (Salvaggio et al., 2007). A core self-evaluations measure was administered to managers, who rated themselves on personality traits such as self-esteem, self-efficacy, and neuroticism. Results indicated that managers with more positive self-evaluations
had higher service quality orientations, which in turn led to more positive service climates. As the
authors note, these results demonstrate the potential impact that individual managers’ personality
traits may have on the overall workplace service climate. Considering that service climate is posi-
tively related to sales volume via customer-focused citizenship behaviors and customer satisfaction
(Schneider, Ehrhart, Mayer, Saltz, & Niles-Jolly, 2005), such findings show that careful attention to
employee selection may be useful not only in predicting individual performance, but also more dis-
tal indicators of success for the work unit or organization. Taken together, these studies demonstrate
that multilevel approaches are valuable for addressing the long-standing question of how to improve
organizational effectiveness through selection (see also Chapter 9, this volume).
CONCLUSIONS
Service and sales workers represent a significant portion of the global workforce, and the eco-
nomic success of many organizations hinges upon their performance. Although much remains to be
learned, the research reviewed here shows that careful attention to selection system design provides
organizations with an opportunity to improve the overall quality of hiring decisions for service and
sales employees. Results clearly indicate that investments in formal selection methods improve the
odds of finding service and sales workers who will perform well on the job. The validity coefficients
discussed here are not large; however, they can translate into substantial benefits in terms of reduced
hiring and training costs, increased sales productivity, and better service quality. Combining the
results of these individual-level studies with what we are beginning to learn about similar relation-
ships at higher levels and over time shows that effective selection is a viable means by which orga-
nizations can generate an advantage over competitor firms.
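One common way to translate modest validity coefficients into monetary terms is the Brogden-Cronbach-Gleser utility model, a standard tool in personnel psychology. The figures below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Brogden-Cronbach-Gleser utility sketch. All inputs are invented
# for illustration; none come from the studies reviewed above.
n_hired = 100            # selected applicants per year
validity = 0.25          # predictor-performance correlation (modest)
sd_y = 20_000.0          # SD of job performance in dollars per employee-year
z_mean = 0.8             # mean standardized predictor score of those hired
cost_per_applicant = 50.0
n_applicants = 500

# Expected annual gain = N * r * SDy * z-bar, minus total testing costs.
gain = n_hired * validity * sd_y * z_mean
cost = n_applicants * cost_per_applicant
utility = gain - cost
print(f"estimated annual utility: ${utility:,.0f}")  # -> estimated annual utility: $375,000
```

Even with a validity of only .25, the hypothetical selection program yields a six-figure annual gain, which is the sense in which small coefficients can translate into substantial organizational benefits.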
REFERENCES
Aguinis, H., & Smith, M. A. (2007). Understanding the impact of test validity and bias on selection errors and
adverse impact in human resource selection. Personnel Psychology, 60, 165–199.
Ahearne, M., Bhattacharya, C. B., & Gruen, T. (2005). Antecedents and consequences of customer-
company identification: Expanding the role of relationship marketing. Journal of Applied Psychology,
90, 574–585.
Allworth, E., & Hesketh, B. (2000). Job requirements biodata as a predictor of performance in customer service
roles. International Journal of Selection and Assessment, 8, 137–147.
Selection for Service and Sales Jobs 777
Avis, J. M., Kudisch, J. D., & Fortunato, V. J. (2002). Examining the incremental validity and adverse impact
of cognitive ability and conscientiousness on job performance. Journal of Business and Psychology, 17,
87–105.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Barrick, M. R., Mount, M. K., & Strauss, J. P. (1993). Conscientiousness and performance of sales representa-
tives—Test of the mediating effects of goal-setting. Journal of Applied Psychology, 78, 715–722.
Barrick, M. R., Stewart, G. L., & Piotrowski, M. (2002). Personality and job performance: Test of the mediating
effects of the motivation among sales representatives. Journal of Applied Psychology, 87, 43–51.
Bauer, T. N., & Truxillo, D. M. (2006). Applicant reactions to situational judgment tests: Research and related
practical issues. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests: Theory, measure-
ment, and applications (pp. 233–249). Mahwah, NJ: Erlbaum.
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporat-
ing cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–589.
Bowman, D., & Narayandas, D. (2004). Linking customer management effort to customer profitability in
business markets. Journal of Marketing Research, 41, 433–447.
Brown, T. J., Mowen, J. C., Donovan, D. T., & Licata, J. W. (2002). The customer orientation of service work-
ers: Personality trait effects on self- and supervisor performance ratings. Journal of Marketing Research,
39, 110–119.
Bruhn, M., & Georgi, D. (2006). Services marketing: Managing the service value chain. Essex, England:
Pearson.
Burroughs, W. A., & White, L. L. (1996). Predicting sales performance. Journal of Business and Psychology,
11, 73–84.
Cascio, W. F., & Aguinis, H. (2005). Test development and use: New twists on old questions. Human Resource
Management, 44, 219–235.
Cellar, D. F., DeGrendel, D. J. D., Klawsky, J. D., & Miller, M. L. (1996). The validity of personality, service
orientation, and reading comprehension measures as predictors of flight attendant training performance.
Journal of Business and Psychology, 11, 43–54.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judg-
ment tests: Subgroup differences in test performance and face validity. Journal of Applied Psychology,
82, 143–159.
Cran, D. J. (1994). Towards validation of the service orientation construct. Service Industries Journal, 14,
34–44.
Cravens, D. W., & Woodruff, R. B. (1973). An approach for determining criteria of sales performance. Journal
of Applied Psychology, 57, 242–247.
Dalessio, A. T., & Silverhart, T. A. (1994). Combining biodata test and interview information: Predicting deci-
sions and performance criteria. Personnel Psychology, 47, 303–315.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between
selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393.
Dudley, N. M., Orvis, K. A., Lebiecki, J. E., & Cortina, J. M. (2006). A meta-analytic investigation of consci-
entiousness in the prediction of job performance: Examining the intercorrelations and the incremental
validity of narrow traits. Journal of Applied Psychology, 91, 40–57.
Farrell, S., & Hakstian, A. R. (2001). Improving salesforce performance: A meta-analytic investigation of
the effectiveness and utility of personnel selection procedures and training interventions. Psychology &
Marketing, 18, 281–316.
Frei, R. L., & McDaniel, M. A. (1998). Validity of customer service measures in personnel selection: A review
of criterion and construct evidence. Human Performance, 11, 1–27.
Gilliland, S. W. (1993). The perceived fairness of selection systems: An organizational justice perspective.
Academy of Management Review, 18, 694–734.
Hakstian, A. R., Scratchley, L. S., MacLeod, A. A., Tweed, R. G., & Siddarth, S. (1997). Selection of telemar-
keting employees by standardized assessment procedures. Psychology & Marketing, 14, 703–726.
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated
model and meta-analysis. Personnel Psychology, 57, 639–683.
Hennig-Thurau, T. (2004). Customer orientation of service employees—Its impact on customer satis-
faction, commitment, and retention. International Journal of Service Industry Management, 15,
460–478.
Hofmann, D. A., Jacobs, R., & Baratta, J. E. (1993). Dynamic criteria and the measurement of change. Journal
of Applied Psychology, 78, 194–204.
778 Handbook of Employee Selection
Hogan, J., Hogan, R., & Busch, C. M. (1984). How to measure service orientation. Journal of Applied
Psychology, 69, 167–173.
Hogan, J., Hogan, R., & Gregory, S. (1992). Validation of a sales representative selection inventory. Journal of
Business and Psychology, 7, 161–171.
Hogan, J., & Holland, B. (2003). Using theory to evaluate personality and job-performance relations: A socio-
analytic perspective. Journal of Applied Psychology, 88, 100–112.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse
impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of
Selection and Assessment, 9, 152–194.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Hunthausen, J. M., Truxillo, D. M., Bauer, T. N., & Hammer, L. B. (2003). A field study of frame-of-reference
effects on personality test validity. Journal of Applied Psychology, 88, 545–551.
Hurley, R. F. (1998). Customer service behavior in retail settings: A study of the effect of service provider
personality. Journal of the Academy of Marketing Science, 26, 115–127.
Huselid, M. A. (1995). The impact of human resource management practices on turnover, productivity, and
corporate financial performance. Academy of Management Journal, 38, 635–672.
Jacobs, R. R., Conte, J. M., Day, D. V., Silva, J. M., & Harris, R. (1996). Selecting bus drivers: Multiple
predictors, multiple perspectives on validity, and multiple estimates of utility. Human Performance, 9,
199–217.
Jaramillo, F., Ladik, D. M., Marshall, G. W., & Mulki, F. P. (2007). A meta-analysis of the relationship between
sales orientation-customer orientation (SOCO) and salesperson job performance. Journal of Business
and Industrial Marketing, 22, 302–310.
Keil, C. T., & Cortina, J. M. (2001). Degradation of validity over time: A test and extension of Ackerman’s
model. Psychological Bulletin, 127, 673–697.
Kravitz, D. A. (2008). The diversity-validity dilemma: Beyond selection—the role of affirmative action.
Personnel Psychology, 61, 173–193.
Li, L., & Wang, L. (2007). Development and validation of the salespeople forced choice behavioral style test in
the information technology industry. Personality and Individual Differences, 42, 99–110.
Liao, H., & Chuang, A. (2004). A multilevel investigation of factors influencing employee service performance
and customer outcomes. Academy of Management Journal, 47, 41–58.
McDaniel, M. A., Hartman, N. S., Whetzel, D. L., & Grubb, W. L. (2007). Situational judgment tests, response
instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91.
McDaniel, M. A., Rothstein, H. R., & Whetzel, D. L. (2006). Publication bias: A case study of four test vendors.
Personnel Psychology, 59, 927–953.
McManus, M. A., & Brown, S. H. (1995). Adjusting sales results measures for use as criteria. Personnel
Psychology, 48, 391–400.
McManus, M. A., & Kelly, M. L. (1999). Personality measures and biodata: Evidence regarding their incre-
mental predictive value in the life insurance industry. Personnel Psychology, 52, 137–148.
Miner, J. B. (1962). Personality and ability factors in sales performance. Journal of Applied Psychology, 46,
6–13.
Mount, M. K., Barrick, M. R., & Stewart, G. L. (1998). Five-factor model of personality and performance in
jobs involving interpersonal interactions. Human Performance, 11, 145–165.
Mount, M. K., Barrick, M. R., & Strauss, J. P. (1994). Validity of observer ratings of the Big 5 personality-
factors. Journal of Applied Psychology, 79, 272–280.
Ones, D. S., & Viswesvaran, C. (1996). Bandwidth-fidelity dilemma in personality measurement for personnel
selection. Journal of Organizational Behavior, 17, 609–626.
Ones, D. S., & Viswesvaran, C. (2001). Integrity tests and other criterion-focused occupational personal-
ity scales (COPS) used in personnel selection. International Journal of Selection and Assessment, 9,
31–39.
Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. Research in
Personnel and Human Resources Management, 23, 121–176.
Ployhart, R. E. (2006). Staffing in the 21st century: New challenges and strategic opportunities. Journal of
Management, 32, 868–897.
Ployhart, R. E., & Hakel, M. D. (1998). The substantive nature of performance variability: Predicting interindi-
vidual differences in intraindividual performance. Personnel Psychology, 51, 859–901.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and
sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Ployhart, R. E., Weekley, J. A., & Baughman, K. (2006). The structure and function of human capital emer-
gence: A multilevel examination of the attraction-selection-attrition model. Academy of Management
Journal, 49, 661–677.
Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of
applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable?
Personnel Psychology, 56, 733–752.
Pyburn, K. M., Ployhart, R. E., & Kravitz, D. A. (2008). The diversity-validity dilemma: Overview and legal
context. Personnel Psychology, 61, 143–151.
Quinones, M. A., Ford, J. K., & Teachout, M. S. (1995). The relationship between work experience and job
performance: A conceptual and meta-analytic review. Personnel Psychology, 48, 887–910.
Robie, C., Brown, D. J., & Shepherd, W. J. (2005). Interdependence as a moderator of the relationship between com-
petitiveness and objective sales performance. International Journal of Selection and Assessment, 13, 274–281.
Rosse, J. G., Miller, H. E., & Barnes, L. K. (1991). Combining personality and cognitive ability predictors for
hiring service-oriented employees. Journal of Business and Psychology, 5, 431–445.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., III, & Tyler, P. (2001). Ethnic group differences in cognitive
ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330.
Roth, P. L., Bobko, P., & Switzer, F. S., III. (2006). Modeling the behavior of the 4/5ths rule for determining
adverse impact: Reasons for caution. Journal of Applied Psychology, 91, 507–522.
Ryan, A. M., & Ployhart, R. E. (2000). Applicants’ perceptions of selection procedures and decisions: A critical
review and agenda for the future. Journal of Management, 26, 565–606.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differ-
ences and adverse impact. Personnel Psychology, 50, 707–721.
Salvaggio, A. N., Schneider, B., Nishii, L. H., Mayer, D. M., Ramesh, A., & Lyon, J. S. (2007). Manager per-
sonality, manager service quality orientation, and service climate: Test of a model. Journal of Applied
Psychology, 92, 1741–1750.
Saxe, R., & Weitz, B. A. (1982). The SOCO scale: A measure of the customer orientation of salespeople.
Journal of Marketing Research, 19, 343–351.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology:
Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Schneider, B., Ehrhart, M. G., Mayer, D. M., Saltz, J. L., & Niles-Jolly, K. (2005). Understanding organization-
customer links in service settings. Academy of Management Journal, 48, 1017–1032.
Schneider, B., & White, S. S. (2004). Service quality: Research perspectives. Thousand Oaks, CA: Sage.
Schneider, R. J., Hough, L. M., & Dunnette, M. D. (1996). Broadsided by broad traits: How to sink science in
five dimensions or less. Journal of Organizational Behavior, 17, 639–655.
Schoenfeldt, L. F. (1999). From dust bowl empiricism to rational constructs in biographical data. Human
Resource Management Review, 9, 147–167.
Smither, J. W., Reilly, R. R., Millsap, R. E., Pearlman, K., & Stoffey, R. W. (1993). Applicant reactions to selec-
tion procedures. Personnel Psychology, 46, 49–76.
Stewart, G. L., & Nandkeolyar, A. K. (2006). Adaptation and intraindividual variation in sales outcomes:
Exploring the interactive effects of personality and environmental opportunity. Personnel Psychology,
59, 307–332.
Stokes, G. S., Hogan, J. B., & Snell, A. F. (1993). Comparability of incumbent and applicant samples for the
development of biodata keys—The influence of social desirability. Personnel Psychology, 46, 739–762.
Stokes, G. S., Toth, C. S., Searcy, C. A., Stroupe, J. P., & Carter, G. W. (1999). Construct/rational biodata
dimensions to predict salesperson performance: Report on the U.S. Department of Labor sales study.
Human Resource Management Review, 9, 185–218.
Terpstra, D. E., & Rozell, E. J. (1993). The relationship of staffing practices and organizational level measures
of performance. Personnel Psychology, 46, 27–48.
Thoresen, C. J., Bradley, J. C., Bliese, P. D., & Thoresen, J. D. (2004). The big five personality traits and indi-
vidual job performance growth trajectories in maintenance and transitional job stages. Journal of Applied
Psychology, 89, 835–853.
U.S. Department of Labor. (2007). Occupational employment and wages, 2007. Washington, DC: Author.
Vinchur, A. J., Schippmann, J. S., Switzer, F. S., & Roth, P. L. (1998). A meta-analytic review of predictors of
job performance for salespeople. Journal of Applied Psychology, 83, 586–597.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25–49.
Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology, 52, 679–700.
Winkler, C. (2006). Job tryouts go virtual: Online job simulations provide sophisticated candidate assessments.
HR Magazine, 51, 131–134.
36 Selection in Multinational Organizations
TABLE 36.1
Core Sales Competencies for 3M Sales Representatives

Competencies                     Skills
Core sales skills                Selling process management
                                 Account planning/account plan management
                                 Opportunity identification
                                 Solution creation
                                 Negotiation persuasiveness
                                 Industry and market knowledge
Business acumen skills           Customer knowledge
                                 Product and technical knowledge
                                 Channel acumen
                                 Coaching
Individual effectiveness skills  Resource management
                                 Problem solving/decision-making
                                 Communication
integrated into 3M’s selection system for potential new hires and also the training and development
programs for incumbent sales representatives.
In developing this competency model as the basis for the common test globally, an international
job analysis was conducted to assess whether the content domain of the 3M sales representative
position was similar around the world. A job analysis questionnaire (JAQ) was administered to sales
representatives and sales subject matter experts in ten countries. The JAQ assessed work behaviors
from the content domain for 3M sales representatives shown in Table 36.1. In 2006, 3M sales rep-
resentatives from Brazil, Russia, India, China, and Poland (labeled BRICP) completed the JAQ. In
2007, 3M sales representatives in Australia, Singapore, Taiwan, Japan, and Korea (labeled APAC)
completed the JAQ. Of the 13 work behavior dimensions shown in Table 36.1, the top 7 in terms of
importance are presented in Table 36.2. For each of the seven most important work behaviors, the
average importance rating is shown for both country sets. As Table 36.2 illustrates, the results of
this global job analysis indicated that the job content was essentially the same around the world.
TABLE 36.2
Mean Importance Ratings for Work Behavior Dimensions: 3M Sales
Representatives’ Job Analysis Results Across Two Sets of Countries

Work Behavior Dimensions                            APAC   BRICP
Conduct sales and follow-up                          3.7    3.7
Work with others                                     3.4    3.7
Provide information to customers and distributors    3.7    3.6
Plan, organize, and prioritize                       3.5    3.6
Maintain knowledge and skills                        3.4    3.5
Negotiate and persuade                               3.4    3.5
Document work: keep records                          3.3    3.5

Note: The APAC country set includes Australia, Japan, Korea, Singapore, and Taiwan. The BRICP
country set includes Brazil, Russia, India, China, and Poland. Job analysis importance ratings were
obtained on a five-point scale (5 = highly important).
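As a rough numerical check, the ratings in Table 36.2 can be compared directly. This sketch computes the largest and average absolute differences between the two country sets’ mean importance ratings:

```python
import numpy as np

# Importance ratings from Table 36.2, in row order (APAC vs. BRICP).
apac  = np.array([3.7, 3.4, 3.7, 3.5, 3.4, 3.4, 3.3])
bricp = np.array([3.7, 3.7, 3.6, 3.6, 3.5, 3.5, 3.5])

diffs = np.abs(apac - bricp)
print(f"largest difference = {diffs.max():.1f}, mean difference = {diffs.mean():.2f}")
# -> largest difference = 0.3, mean difference = 0.13
```

Differences of at most 0.3 on a five-point scale are consistent with the authors’ conclusion that the job content is highly similar across the two country sets.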
The greater challenge for global organizations is maintaining consistency on more subjective
dimensions of the type generally found in critical leadership roles (e.g., a global firm’s top
management team). It is critical for organizations to select leaders who have integrity, can work well
in teams, are committed, and are results-oriented. However, the interpretation of these dimensions
can vary greatly depending on the culture of the subsidiary location (Caligiuri & Day, 2000). There
are also practical challenges with headquarters-controlled international selection systems. Using the
same test across countries may be difficult for reasons ranging from possible culture-based interpre-
tations lowering the validity of the test to the basic logistics of testing. The challenge of maintaining
consistency in employee selection is discussed later in this chapter.
Organizations pursuing a multidomestic strategy face strong pressures for local responsiveness
and weak pressures for worldwide integration. Their structure resembles a decentralized group
of local firms and their HR policies differ across countries, conforming to local market demands.
Multidomestic organizations transfer almost all HR practices, including employee selection, to the
host national subsidiaries. Decisions are made at the subsidiary level regarding the way in which
employees should be selected, the method to be used in selection, the criteria for evaluation, and
the like.
The benefit of this strategy is that global firms are able to compete locally—and with local
knowledge, which may be especially important when a country has a unique infrastructure, market,
client base, governmental roles, etc. It follows that the localization of selection systems is best for
positions where a localization strategy is being deployed. The weakness of this strategy at the company level is that companies lose economies of scale, the ability to maintain consistency and standards around the world, and the possibility of global talent management. In selection, for example, multiple selection tests would need to be validated, and cross-national comparability of candidates would be impossible.
Organizations pursuing a transnational strategy face strong pressures for worldwide integration
and for local responsiveness. These organizations attempt to develop an interdependent global net-
work of subsidiaries that combines the benefits of worldwide integration and local responsiveness.
Subsidiaries are neither satellites of nor independent from headquarters, but rather are integral
parts of a whole system with both global and local objectives. Each subsidiary is believed to make
a unique contribution to this synergistic system. Decisions are a result of a collaborative process
between headquarters and subsidiaries. As such, employee selection systems in transnational orga-
nizations seek consistency around the world—based on strategic necessity—but are also culturally
acceptable across the participating countries. Many multinational organizations aspire to (or believe
themselves to follow) a transnational business strategy. As such, there is increased pressure to develop HR systems (and employee selection systems in particular) that are acceptable and synergistic across cultures.
In the case of 3M’s prehire sales test, the solution was a hybrid: standardize the test (on the basis of the common competency model outlined in Table 36.1) but allow each country’s HR department the freedom to vary when and how the test was given. For example, in some countries it did not make sense to offer the test at the beginning of the selection process, but rather somewhat later, such as when the job market was particularly competitive or the location was more remote. By working pragmatically, 3M developed a variety of approaches to implementing the online test, ensuring that the process genuinely advanced the interests of both the country and the company rather than being prescribed and imposed from corporate. In the end, 3M’s solution to global testing was implemented, and the prehire test for sales representatives was rolled out globally.
There are three challenges when developing candidate selection systems from a transnational
or synergistic perspective. The first challenge is determining selection dimensions that would be
applicable for candidates for the same positions across subsidiaries, regardless of country (Davis,
1998; Ployhart, Wiechmann, Schmitt, Sacco, & Rogg, 2003). This means that the content domain
is comparable across cultures within positions and that the selection systems based on the com-
mon content domain would have validity coefficients generalizable across countries (Lievens,
2007; Salgado & Anderson, 2002). Once the common content domain is determined, the second
challenge is creating conceptual equivalence in the assessment tools. This may include
everything from language comparability in selection tests to developing the behavioral indices of
various selection dimensions so that raters (e.g., interviewers, assessors in assessment centers) can
make cross-culturally comparable ratings or possibly even changing cut scores and norms within
countries to appropriate levels. The third challenge is determining the selection and assessment
methods that can be successfully implemented across cultures. This challenge includes gaining an
understanding of the candidates’ perspectives and how their reactions to the selection system may
vary depending upon their cultural perspectives.
These three challenges make the transnational strategy very difficult to maintain with respect to selection and assessment. In practice, organizations that purport to have HR systems following from a transnational business strategy tend to operate a hybrid model: they either accept substantial influence from headquarters in setting and developing a selection system (acting much like global organizations) or allow a significant amount of cross-national adaptation to a given countrywide system (acting much like a multidomestic system). Given that many multinational organizations desire to be in a synergistic state with respect to their HR systems (i.e., following a transnational strategy), it is useful to discuss each of the three challenges in greater detail.
Many transnational firms have driving corporate cultural values that appear in managerial selec-
tion systems around the world. These corporate values may include dimensions such as managing
with integrity, taking appropriate risks, being customer-focused, being results-oriented, and the
like. After these broad performance dimensions are named, the challenge turns to creating concep-
tual equivalence for each dimension across cultures. Once this equivalence is established, selection
systems to assess candidates against these dimensions are created.
Once the dimensions to be included in the selection system have been established, the next cross-
cultural concern would be the appropriateness of the assessment method and the logistics of those
methods in a given cross-cultural context. With respect to logistics, testing assumptions need to be
questioned cross-culturally. For example, when 3M was rolling out their prehire sales test globally,
one of the basic assumptions made was that testing would be done in a room with multiple comput-
ers and a fairly controlled environment so that multiple applicants could simultaneously take the
online test. As it turned out, this was easier said than done. First, many of the 3M subsidiaries around the world did not have an available testing room (i.e., an empty room with multiple computers, each with an Internet connection). Second, some of the subsidiaries had sales territories
that covered vast regions. If 3M was looking for sales representatives for a given region, they needed
to be able to connect with candidates in their remote locations. In Russia, for example, 3M needed
to be able to connect with candidates in more remote places such as Siberia. Practically, decisions needed to be made regarding how far a candidate could reasonably be expected to travel just to take the prehire test. Third, as 3M learned, the idea of having multiple applicants take the test simultaneously was flawed in some countries. In some cultures, and in highly competitive job markets, it was undesirable and discouraging for applicants to see how many people were competing. Furthermore,
in some cultures this kind of testing is culturally unacceptable. Even the idea of a controlled testing room with a closed door raised cross-national challenges in some small subsidiaries and in countries where open floor plans predominate, such as Japan.
With respect to testing methods, some methods are perceived more favorably than others by applicants around the globe. Applicants across Western cultures perceive job-related interviews, résumés, and work samples more favorably than testing (e.g., personality tests, cognitive ability tests) and reference checking (Steiner & Gilliland, 2001). Only one study has been conducted in an Asian context, comparing perceptions of selection methods in Singapore and the United States: Phillips and Gully (2002) found results similar to those obtained in Western contexts (reviewed by Steiner & Gilliland, 2001), although Singaporeans rated personality tests more favorably than Americans did.
Although applicant reactions to selection methods may be generally similar across countries,
their usage is not. Multicountry survey-based studies found that countries did vary significantly in
terms of employee selection procedures used (Ryan, McFarland, Baron, & Page, 1999; Shackleton &
Newell, 1997). Ryan et al. (1999) found that national-level cultural values, such as uncertainty
avoidance, predicted which selection procedures were more likely to be used across countries.
Countries higher in uncertainty avoidance were more likely to rely heavily on interviews and
testing, presumably as a way of reducing hiring risks. Further research on cross-cultural differences
in the use and acceptance of selection methods is important for furthering our understanding of
global employee selection methods and, ideally, for reducing resistance to them (for a review, see
Lievens, 2007).
In the context of cross-cultural acceptability, it is also important to consider how the selection
decision is made and who is conducting the evaluation. Selection is most often conducted through
supervisors' interviews of candidates. Interviews are especially challenging when supervisors from one
culture are rating candidates from another culture. In these cross-cultural rater-ratee dyads, ratings
could be biased by the degree of rater-ratee similarity measured in terms of demographic similarity
(e.g., Judge & Ferris, 1993; Tsui & O’Reilly, 1989). The degree of similarity between members of
a dyad—also referred to as relational demography—has been shown to be positively related to per-
ceived similarity and supervisors’ reported liking of a ratee (Wayne & Liden, 1995). Both of these
effects have the potential to favorably bias ratings.
In the context of performance evaluation (with comparable implications for candidate assess-
ment) in cross-national work settings, national similarity is a type of relational demography that
could affect ratings. Similarity of managers and their supervisors in terms of national background
has been shown to influence performance ratings (Caligiuri & Day, 2000). Caligiuri and Day (2000)
found that nationality bias may affect performance ratings differently depending on the type of
dimension. They tested task and contextual performance dimensions and found that
rater nationality influenced the ratings of the more subjective contextual performance dimensions,
but not the objective task-based performance dimensions. This finding is consistent with research
indicating that less ambiguous performance standards increase rater-ratee agreement regarding
performance ratings (Schrader & Steiner, 1996). Given that the assessor's cultural lens is a potential
source of bias, supervisors should be trained on behavioral indicators for assessing employees'
performance. A training intervention, such as frame-of-reference rater training (Bernardin & Buckley,
1981), could be highly effective at reducing the idiosyncratic standards of raters (Day & Sulsky,
1995). Frame-of-reference rater training would clarify standards across dimensions and across
cultures—especially for the more ambiguous subjective dimensions in the content domain.
one of the best trained workforces in the world, with an extensive apprenticeship program in which
employers help train students on their intended trades and professions.
Within-country selection systems rely on an ample supply of qualified talent relative to the
organization's demand for talent for a given job. Given that countries differ in labor economics, the
availability of talent will influence selection ratios, making selection systems more or less effective
across the entire workforce strategy within the country. The general labor economics of a given
country or city affect the size and quality of the applicant pools. The supply of talent also affects the
concern companies will have for candidate reactions to their (even validated) selection methods.
For example, in both India and Poland, skilled labor is in high demand. Often a hiring manager
just wants someone to fill a position, without the extra hurdle of giving applicants a test, which
increases the time needed to make a hiring decision and could result in losing some viable candidates.
One of the ways that 3M accommodated this high demand for skilled labor in Poland and India was
to move testing to a later hurdle in the selection process. The goal was to keep more qualified
candidates in the pipeline for the interpersonally interactive aspects of the selection system, such as
the interview, and not to turn them off with the testing process. Testing was conducted after the
relationship with 3M was built, which also ensured top talent was selected.
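The leverage that labor supply gives a selection system can be made concrete with the standard selection ratio (hires ÷ applicants) and the Brogden-Cronbach-Gleser utility logic familiar from the selection literature. The sketch below is an illustrative calculation only, not part of the 3M case; the validity coefficient, SDy, and applicant counts are assumed values chosen for demonstration.

```python
from statistics import NormalDist

def selection_utility(n_applicants, n_hires, validity, sd_y):
    """Illustrative per-hire utility gain from top-down selection on a valid predictor.

    Assumes predictor scores are standard normal and selection is strict
    top-down, so the mean standardized score of those selected equals
    phi(z_cut) / SR, where z_cut is the cutoff implied by the selection
    ratio SR (Brogden-Cronbach-Gleser logic; all inputs are hypothetical).
    """
    sr = n_hires / n_applicants                 # selection ratio
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - sr)                  # cutoff that passes the top SR of applicants
    mean_z_selected = nd.pdf(z_cut) / sr        # mean predictor score of those hired
    gain_per_hire = validity * sd_y * mean_z_selected
    return sr, mean_z_selected, gain_per_hire

# Tight labor market: few applicants per opening (high selection ratio)
tight = selection_utility(n_applicants=25, n_hires=20, validity=0.3, sd_y=40_000)
# Ample labor market: many applicants per opening (low selection ratio)
ample = selection_utility(n_applicants=200, n_hires=20, validity=0.3, sd_y=40_000)
print(tight)
print(ample)
```

With the same test validity, the ample-supply market yields a much higher mean predictor score among those hired, which is one way to see why country-level labor economics make the same selection system more or less effective from one country to the next.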
continuum, some assignments will have an extensive need for intercultural effectiveness through
successful communication with host nationals for the assignments to be deemed successful; for
example, product managers (where knowledge of the local market through collaboration with host
nationals is critical), and regional managers (where most direct reports are host nationals). Others
may require substantially less intercultural effectiveness and rely on more technically based skills,
as in positions involving information technology, engineering, or physical plant.
With respect to the second continuum, multinational organizations send employees on
global assignments for the primary purpose of developing global competence or a global
orientation, such as building their knowledge of the worldwide market, developing cross-cultural
understanding, or learning to speak a second language at work. The extent to which these
developmental dimensions are formally recognized as an expected outcome (or even the primary
purpose) of a global assignment will vary greatly depending on the type of assignment.
These two continua form a basic classification system resulting in the following three general
categories of global assignments: (a) technical or functional assignments, (b) developmental or high-
potential assignments, and (c) strategic or executive assignments.
given that adjustment (psychological comfort living and working in another country) is important
for almost all expatriates.
In a meta-analysis of the antecedents and consequences of expatriate adjustment, Bhaskar-Shrinivas,
Harrison, Shaffer, and Luk (2005) found that language ability, previous overseas experience, withdrawal
cognitions, job satisfaction, and spousal adjustment were predictors of cross-cultural adjustment. In
another meta-analysis, Hechanova, Beehr, and Christiansen (2003) found that self-efficacy, frequency
of interaction with host nationals, and family support were predictors of cross-cultural adjustment.
These meta-analyses also suggest that greater cross-cultural adjustment among international assignees
generally predicted greater job satisfaction, less strain, and higher levels of organizational commitment.
Another meta-analysis, which examined personality characteristics as predictors of expatriate
performance (Mol, Born, Willemsen, & Van Der Molen, 2005), found that extraversion, emotional
stability, agreeableness, and conscientiousness were predictive of expatriate performance; it also
found cultural sensitivity and local language ability to be predictive.
Across these meta-analyses, three categories of individual-level antecedents seem to emerge as
predictors of cross-cultural adjustment that would lend themselves to international assignee selec-
tion systems. They are personality characteristics, language skills, and prior experience living in
a different country (see Caligiuri & Tarique, 2006, for a review). On the basis of the domestic
literature, cognitive ability would seem to be another potential category; however, as a predictor it
"has barely been investigated in the expatriate context" (Mol et al., 2005, p. 614).
Personality Characteristics
Extensive research has found that well-adjusted and high-performing international assignees tend
to share certain personality traits (e.g., Black, 1990; Caligiuri, 2000a, 2000b; Church, 1982; Dalton &
Wilson, 2000; Mendenhall & Oddou, 1985; Mol et al., 2005; Shaffer, Harrison, Gregersen,
Black, & Ferzandi, 2006). Personality characteristics enable international assignees to be open
and receptive to learning the norms of new cultures, to initiate contact with host nationals, to
gather cultural information, and to handle the higher amounts of stress associated with the ambi-
guity of their new environments (Black, 1990; Church, 1982; Mendenhall & Oddou, 1985; Shaffer
et al., 2006)—all important for international assignee success. Research has found that five factors
provide a useful typology or taxonomy for classifying personality characteristics (Digman, 1990;
Goldberg, 1992, 1993; McCrae & Costa, 1987, 1989; McCrae & John, 1992), labeled “the Big Five.”
The Big Five personality factors are (a) extroversion, (b) agreeableness, (c) conscientiousness, (d)
emotional stability, and (e) openness or intellect.
Each of the Big Five personality characteristics relates to international assignee success in
a unique way (Caligiuri, 2000a, 2000b; Ones & Viswesvaran, 1997, 1999; Shaffer et al., 2006)
and should be included in a selection system for international assignees for different reasons (Van
Vianen, De Pater, & Caligiuri, 2005). For example, employees higher in conscientiousness are more
likely to become leaders, gain status, get promoted, and earn higher salaries. Studies have found a
positive relationship between conscientiousness and work performance among professionals in the
domestic work context (e.g., Barrick & Mount, 1991; Day & Silverman, 1989),
which was later generalized to the international assignment context (Caligiuri, 2000a; Ones &
Viswesvaran, 1997). On the basis of the meta-analysis conducted by Mol et al. (2005), the estimated
true population effect size for the relationship between conscientiousness and international assignee
success is positive (ρ = .17).
Personality characteristics related to relational skills (extroversion and agreeableness) enable
international assignees to form stronger interpersonal bonds with both host nationals and other
international assignees. On the basis of a meta-analysis, Bhaskar-Shrinivas et al. (2005) found that
relational skills are positively related to cross-cultural adjustment (ρ = .32). Extroverted individuals
are able to more effectively learn the social culture of the host country through their relationships
with local nationals and, in turn, report higher cross-cultural adjustment (Abe & Wiseman, 1983;
Black, 1990; Caligiuri, 2000a, 2000b; Mendenhall & Oddou, 1985, 1988; Searle & Ward, 1990).
More agreeable international assignees tend to deal with conflict collaboratively, strive for mutual
understanding, and are less competitive. They report greater cross-cultural adjustment and greater
success on the assignment (Bhaskar-Shrinivas et al., 2005; Caligiuri, 2000a, 2000b; Mol et al., 2005;
Ones & Viswesvaran, 1997; Black, 1990; Tung, 1981). Meta-analytic results (Mol et al., 2005) found
the estimated true population effect size for the relationship of international assignee success and
extroversion and agreeableness to be positive (ρ = .17 and .11, respectively).
Emotional stability is also important for international assignee success. Emotional stability is
the universal adaptive mechanism enabling humans to cope with stress in their environment (Buss,
1991). Given that stress is often associated with living and working in an ambiguous and unfamiliar
environment (Richards, 1996; Stahl & Caligiuri, 2005), emotional stability is an important person-
ality characteristic for international assignees’ adjustment to the host country (Abe & Wiseman,
1983; Black, 1988; Gudykunst, 1988; Gudykunst & Hammer, 1984; Mendenhall & Oddou, 1985)
and completion of an international assignment (Ones & Viswesvaran, 1997). From the
meta-analysis conducted by Mol et al. (2005), the estimated true population effect size for the rela-
tionship between emotional stability and international assignee success is positive (ρ = .10).
Openness seems to be the most intuitively necessary personality characteristic for international
assignee success. For an international assignee, the ability to correctly assess the
social environment is more complicated given that the host country may provide ambiguous or
uninterpretable social cues (Caligiuri & Day, 2000). Successful international assignees must possess
cognitive complexity, openness, and intuitive perceptual acuity to accurately perceive and interpret
the host culture (Caligiuri, Jacobs, & Farr, 2000; Finney & Von Glinow, 1987; Ones & Viswesvaran,
1997). Openness should be related to international assignee success because individuals higher in
this personality characteristic will have less rigid views of appropriate and inappropriate contextual
behavior and are more likely to be accepting of the new culture (e.g., Abe & Wiseman, 1983; Black,
1990; Cui & van den Berg, 1991; Hammer, Gudykunst, & Wiseman, 1978). On the basis of the meta-
analysis conducted by Mol et al. (2005), the estimated true population effect size for the relationship
between openness and international assignee success is positive (ρ = .06); however, this relationship
was not significant, as the confidence interval included zero. The authors noted that “moderated
support was found for the relationship of openness” (p. 608), which is consistent with other
research. For example, Caligiuri (2000b) found moderated support for openness as a personality
characteristic relating to expatriate adjustment, such that greater contact with host nationals was
positively related to cross-cultural adjustment when an individual possesses the personality trait
of openness. In addition, the same meta-analysis (Mol et al., 2005) did find a positive relationship
between international assignee performance and cultural sensitivity (r = .24). Perhaps, depending
on how it is measured, cultural sensitivity assesses a facet of openness that does have a linear,
positive relationship with assignee performance.
Collectively, these personality characteristics should be included in any selection program for
international assignees (Van Vianen et al., 2005). It is important to note that this type of employee
assessment would predict those who will do well adjusting to a cross-cultural job context. However,
this assessment does not predict success in the actual job tasks. Likewise, the absolute level of
each personality characteristic may be contingent upon the type of international assignment under
consideration. For example, relational skills might be important for all international assignees, but
the necessary level may be higher for senior executives, who must network with, persuade, and
influence host nationals to be successful, than for technical assignees, who may interact with host
nationals mostly around tasks involving computer systems or equipment.
Language Skills
Many have noted a positive relationship between language skills and international assignee success
(Abe & Wiseman, 1983; Church, 1982; Cui & van den Berg, 1991; Shaffer, Harrison, & Gilley,
1999). In their meta-analytic studies, Mol et al. (2005) and Bhaskar-Shrinivas et al. (2005) found
that local language ability is a positive predictor of international assignee success (as generally
defined by adjustment; ρ = .19 and .22). Mol et al. (2005) further noted that “more research may be
needed on the moderators of this relationship” (p. 609) between language skills and international
assignee performance. For example, some researchers suggest that language skills, which are neces-
sary for communication, are critical for cross-cultural adjustment. Others (e.g., Cui & van den Berg,
1991) suggest that there may be an interactive influence of language fluency (Shaffer et al., 1999):
Individual differences such as openness may interact with language fluency to positively influence
international assignee success (Cui & van den Berg, 1991). In other words, one could speak the
host language fluently and know the “correct” behaviors to display, and yet only superficially be
immersed in the host culture (Cui & van den Berg, 1991). Because it would be difficult for the
opposite to be true (i.e., one could hardly be immersed in a culture without language skills), basic
language skills should, at the very least, be considered helpful. At a minimum, an attempt should be
made to find a qualified international assignee with language skills; for some positions, language
skills may be more critical than for others.
previews prior to international assignments did not change candidates’ interest in possible
assignments, but did increase candidates’ self-efficacy for an international assignment. This
self-efficacy, in turn, could influence the outcome of the international assignment.
Research and practice suggest that in the selection phase (or prior to it) it is useful for firms to
provide some information to assist candidates in making realistic decisions on whether an assign-
ment is right for them and to help them form realistic expectations about a possible international
assignment (Black, Gregersen, & Mendenhall, 1992; Caligiuri & Phillips, 2003; Tung, 1988). Many
firms have preselection programs that pair repatriates with international assignee candidates to
give international assignees the opportunity to form realistic expectations (Black, Gregersen, &
Mendenhall, 1992; Tung, 1988). Caligiuri and Phillips (2003) have found that self-directed realistic
previews are also highly effective in helping international assignee candidates form accurate per-
ceptions of the possible assignment.
Self-Selection
Given that the demographic profiles and personal situations of the international assignee candidates
will vary, self-assessment (or self-selection) has been found to be an effective method for sharing
realistic assessments in a tailored way (Caligiuri & Phillips, 2003). For example, an unmarried
person who is a candidate for an international assignment might have a different set of concerns,
compared with a married candidate with a family (Caligiuri, Hyland, Joshi, & Bross, 1998). Self-
assessment has been found to be useful because global assignment candidates actively self-assess
their fit with the personality and lifestyle requirements of the assignment (Caligiuri & Phillips,
2003). Effective self-selection tools enable international assignee candidates to critically evaluate
themselves on three critical dimensions: (a) personality and individual characteristics, (b) career
issues, and (c) family issues (including issues of spouses and children). Self-selection instruments,
acting as a realistic preview of the assignment, help employees make a thoroughly informed and
realistic decision about a global assignment (Caligiuri & Phillips, 2003). Many firms have found that
this self-assessment step fosters the creation of a candidate pool of potential international assignees.
This candidate pool can be organized to include the following pieces of information: the availability
of the employee (when and to what countries), languages the employee speaks, countries preferred,
technical knowledge, skills, and abilities, etc.
Candidate Assessment
Once the requirements of a given international assignment have been determined, many possibilities
exist to assess candidates on job-related dimensions. Given that international assignments are job
contexts rather than job descriptions, different assignments will require different levels of the
relevant attributes (e.g., language fluency, openness).
For example, greater emphasis would be placed on personality characteristics (such as sociability
and openness) when assessing a candidate for a developmental or strategic assignment requiring
much more host national contact, compared to a more technical international assignment (Caligiuri
2000a, 2000b; Caligiuri & Tarique, 2006). In the best case, a thorough assessment can be conducted
through a variety of valid formal selection methods: paper-and-pencil tests, assessment centers,
interviews, behavioral observations, and the like. However, the reality is that most international
assignee selection generally happens using the most informal methods—recommendations of peers
or supervisors (Brewster & Harris, 1999).
Looking forward to best practice, two aspects of the international assignee selection process have
shown promise but warrant further investigation. The first is to better understand ways to engage
employees early—even before an international assignment is available. The best candidates can
build their efficacy for the assignment when their decision-making processes are engaged before a
position becomes available (Caligiuri & Phillips, 2003). The second is to better understand ways to
effectively involve the family as early as possible in the selection process. Research has concluded
that each family member will positively or negatively influence the assignment (Caligiuri et al.,
1998), so their influence should not be disregarded in the assessment phase. It is accepted that the
best selection decision will be mutual among employees, their organizations, and their families.
Although the best case for international assignee selection is understood, the dynamic interplay
among employees, families, and organizations—in terms of international assignment selection
decisions—is not yet thoroughly understood and warrants further research.
Most multinational companies acknowledge that the wrong person in an expatriate assignment
can result in poor job performance, early repatriation, anxiety or other emotional problems, and
personal and professional upheaval for accompanying family members. With the risks so high,
expatriate selection (designed to identify who will have the greater likelihood of success) is critical.
The efficacy of expatriate selection programs is challenged when transnational firms report (as they
often do) that there are not enough people to fill current expatriate assignments. The natural reac-
tion, in this case, is to believe that expatriate selection would not apply. However, ignoring proper
selection is extremely shortsighted given the risks to the firm and the individual if the global assign-
ment is unsuccessful. This reaction is especially limited given that when selection is thorough,
firms cast wider nets for possible candidates and generally find multiple candidates with a higher
probability of success. These comprehensive selection systems generally have four distinct phases:
self-assessment, the creation of a candidate pool, technical and managerial selection,
and placement. The placement in a host country will be most successful when agreement is mutual
among the candidate, the candidate’s family, the sending unit, and the host national unit.
CONCLUSIONS
This chapter covered the many challenges of developing international selection systems, namely
the transnational strategies affecting selection systems, the challenges of construct development
with respect to cross-cultural comparability, and selection issues for international assignees. As the
need for strategically oriented and conceptually equivalent international selection systems contin-
ues to grow, so do the demands on HR professionals and applied psychologists to respond to this
complex need.
There are many dynamic changes happening today that will increase the need for and the ease of
adopting internationally integrated selection systems. For example, increasingly strong worldwide
corporate cultures, where employees globally share values and norms, may diminish the influ-
ence of national cultures. Strong global corporate cultures create a common frame-of-reference for
more subjective constructs and ease the integration of international selection systems. Subjective
constructs, such as "integrity," "teamwork," and "trust," will have a company-driven understanding,
leveling any nationally driven cultural differences. This move to stronger corporate cultures will
increasingly ease the integration of international selection systems.
Although the technical issues of employee selection are important, the implementation of selec-
tion systems globally requires more than merely validating employee selection tests in different
countries. Employee selection tests are created and adopted by HR professionals located around the
world. These HR professionals, from different cultures and with different levels of knowledge of
the science and practice of employee selection, ultimately affect whether a given selection system
can be integrated globally. As described in this chapter, the concept of testing—and the very idea
of individual differences—varies from country to country. Likewise, the science of testing and
the level of acceptance of U.S.-oriented industrial-organizational psychology standards for practice
also vary from country to country. In some cultures testing is rooted in education (not industrial-
organizational psychology), where teachers create and give tests, assigning grades accordingly. Test
validation, in these cultures, would seem like a burdensome and unnecessary process. Creating
standards for practice for a company’s HR professionals globally is an important step to developing
selection systems that can be validated and accepted globally.
The future success of international employee selection may also be reliant on headquarters-based
HR professionals’ and industrial/organizational psychologists’ abilities to manage relationships
cross-nationally. Developing relationships with in-country HR leaders and line managers is critical
for the successful integration of selection systems. The in-country HR professionals will likely be
first to identify any country-specific problems and ways to eventually solve those problems. Because
this willingness to help relies on the goodwill of in-country HR professionals (some of whom may
initially need to be convinced that testing is appropriate), the ability of headquarters-based testing
professionals to develop respectful, collegial, and lasting relationships is critical.
Lastly, the future success of international employee selection may be determined by whether
the employee selection systems are integrated as part of a whole strategic HR system (or high-
performance work system). HR professionals would be addressing only part of the picture if they
developed employee selection systems in isolation. Ideally, selection and assessment should
be integrated with training and development, performance management, and reward systems.
Collectively, when these systems globally reinforce the predictors of performance in a
comprehensive manner, the needle moves much more quickly toward a high-performing, globally
competitive
organization.
REFERENCES
Abe, H., & Wiseman, R. (1983). A cross-culture confirmation of the dimensions of intercultural effectiveness.
International Journal of Intercultural Relations, 7, 53–67.
Adler, N. J. (2001). International dimensions of organizational behavior (4th ed.). Cincinnati, OH:
Southwestern.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Bartlett, C. A., & Ghoshal, S. (1989). Managing across borders: The transnational solution. Boston, MA:
Harvard Business School Press.
Begin, J. P. (1992). Comparative human resource management (HRM): A systems perspective. International
Journal of Human Resource Management, 3(3), 379–408.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6,
205–212.
Bhaskar-Shrinivas, P., Harrison, D. A., Shaffer, M., & Luk, D. M. (2005). Input-based and time-based models
of international adjustment: Meta-analytic evidence and theoretical extensions. Academy of Management
Journal, 48(2), 257–281.
Black, J. (1990). The relationship of personal characteristics with adjustment of Japanese expatriate managers.
Management International Review, 30, 119–134.
Black, J. S., Gregersen, H. B., & Mendenhall, M. E. (1992). Global assignments: Successfully expatriating and
repatriating international managers. San Francisco, CA: Jossey-Bass.
Bochner, S., Hutnik, N., & Furnham, A. (1986). The friendship patterns of overseas students and host students
in an Oxford student resident hall. Journal of Social Psychology, 125, 689–694.
Bochner, S., McLeod, B. M., & Lin, A. (1977). Friendship patterns of overseas students: A functional model.
International Journal of Psychology, 12, 277–294.
Brewster, C., & Harris, H. (1999). International HRM: Contemporary issues in Europe. London, England:
Routledge.
Caligiuri, P. (2000a). The Big Five personality characteristics as predictors of expatriate success. Personnel
Psychology, 53, 67–88.
Caligiuri, P. (2000b). Selecting expatriates for personality characteristics: A moderating effect of personality on
the relationship between host national contact and cross-cultural adjustment. Management International
Review, 40, 61–80.
Caligiuri, P. (2006). Performance measurement in a cross-national context: Evaluating the success of global
assignments. In W. Bennett, D. Woehr, & C. Lance (Eds.), Performance measurement: Current perspec-
tives and future challenges (pp. 227–245). Mahwah, NJ: Lawrence Erlbaum.
Caligiuri, P., & Colakoglu, S. (2008). A strategic contingency approach to expatriate assignment management.
Human Resource Management Journal, 17, 393–410.
Caligiuri, P., & Day, D. (2000). Effects of self-monitoring on technical, contextual, and assignment-specific
performance: A study of cross-national work performance ratings. Group and Organization Management,
25, 154–175.
Caligiuri, P., Hyland, M., Joshi, A., & Bross, A. (1998). A theoretical framework for examining the relationship
between family adjustment and expatriate adjustment to working in the host country. Journal of Applied
Psychology, 83, 598–614.
Caligiuri, P., Jacobs, R., & Farr, J. (2000). The attitudinal and behavioral openness scale: Scale development
and construct validation. International Journal of Intercultural Relations, 24, 27–46.
Caligiuri, P., & Lazarova, M. (2001). Strategic repatriation policies to enhance global leadership development. In
M. Mendenhall, T. Kuehlmann, & G. Stahl (Eds.), Developing global business leaders: Policies, processes,
and innovations, (pp. 243–256). Westport, CT: Quorum Books.
Caligiuri, P., & Phillips, J. (2003). An application of self-assessment realistic job previews to expatriate assign-
ments. International Journal of Human Resource Management, 14, 1102–1116.
Caligiuri, P., & Stroh, L. K. (1995). Multinational corporation management strategies and international human
resource practices: Bringing IHRM to the bottom line. International Journal of Human Resource
Management, 6, 494–507.
Caligiuri, P., & Tarique, I. (2006). International assignee selection and cross-cultural training and development.
In I. Björkman & G. Stahl (Eds.), Handbook of research in international human resource management
(pp. 302–322). London, England: Edward Elgar.
Church, A. (1982). Sojourner adjustment. Psychological Bulletin, 91, 540–572.
Cui, G., & Van den Berg, S. A. (1991). Testing the construct validity of intercultural effectiveness. International
Journal of Intercultural Relations, 15, 227–241.
Dalton, M., & Wilson, M. (2000). The relationship of the five-factor model of personality to job performance
for a group of Middle Eastern international assignee managers. Journal of Cross-Cultural Psychology,
18, 250–258.
Davis, D. D. (1998). International performance measurement and management. In J. W. Smither (Ed.),
Performance appraisal: State of the art in practice (pp. 95–131). San Francisco, CA: Jossey-Bass.
Day, D. V., & Silverman, S. (1989). Personality and job performance: Evidence of incremental validity.
Personnel Psychology, 42, 25–36.
Day, D. V., & Sulsky, L. M. (1995). Effects of frame-of-reference training and information configuration on
memory organization and rating accuracy. Journal of Applied Psychology, 80, 158–167.
Digman, J. (1990). Personality structure: The emergence of the Five-Factor model. Annual Review of
Psychology, 41, 417–440.
England, G., & Harpaz, I. (1983). Some methodological and analytic considerations in cross-national compara-
tive research. Journal of International Business Studies, Fall, 49–59.
Erez, M. (1994). Toward a model of cross-cultural industrial and organizational psychology. In H. C. Triandis,
M. D. Dunnette, & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 4,
pp. 559–608). Palo Alto, CA: Consulting Psychologists Press.
Ghoshal, S., & Nohria, N. (1993). Horses for courses: Organizational forms for multinational corporations.
Sloan Management Review, 2, 23–35.
Goldberg, L. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment,
4, 26–42.
Goldberg, L. (1993). The structure of phenotypic personality traits. American Psychologist, 48, 26–34.
Gudykunst, W. (1988). Uncertainty and anxiety. In Y. Kim & W. B. Gudykunst (Eds.), Theories in intercultural
communication (pp. 123–156). Newbury Park, CA: Sage.
Gudykunst, W., & Hammer, M. (1984). Dimensions of intercultural effectiveness: Culture specific or cultural
general? International Journal of Intercultural Relations, 8, 1–10.
Hammer, M. R., Gudykunst, W. B., & Wiseman, R. L. (1978). Dimensions of intercultural effectiveness: An
exploratory study. International Journal of Intercultural Relations, 2, 382–393.
Harpaz, I. (1996). International management survey research. In B. J. Punnett & O. Shenkar (Eds.), Handbook
for international management research (pp. 37–62). Cambridge, MA: Blackwell.
Hechanova, R., Beehr, T. A., & Christiansen, N. D. (2003). Antecedents and consequences of employees’
adjustment to overseas assignment: A meta-analytic review. Applied Psychology: An International
Review, 52(2), 213–236.
Hofstede, G. (1980). Cultures consequences: International differences in work-related values. Thousand Oaks,
CA: Sage.
Judge, T. A., & Ferris, G. R. (1993). Social context of performance evaluation decisions. Academy of
Management Journal, 36, 80–105.
Lievens, F. (2007). Research on selection in an international context: Current status and future directions. In
M. M. Harris (Ed.), Handbook of research in international human resource management (pp. 107–123).
Mahwah, NJ: Lawrence Erlbaum.
798 Handbook of Employee Selection
McCrae, R., & Costa, P. (1987). Validation of the five-factor model of personality across instruments and
observers. Journal of Personality and Social Psychology, 52, 81–90.
McCrae, R., & Costa, P. (1989). More reasons to adopt the five-factor model. American Psychologist, 44,
451–452.
McCrae, R., & John, O. (1992). An introduction to the Five-Factor model and its applications. Journal of
Personality, 60, 175–216.
Mendenhall, M., & Oddou, G. (1985). The dimensions of expatriate acculturation. Academy of Management
Review, 10, 39–47.
Mendenhall, M., & Oddou, G. (1988). The overseas assignment: A practical look. Business Horizons, 31,
78–84.
Mol, S. T., Born, M. P., Willemsen, M. E., & Van Der Molen, H. T. (2005). Predicting expatriate job performance
for selection purposes: A quantitative review. Journal of Cross-Cultural Psychology, 36, 590–620.
Ones, D. S., & Viswesvaran, C. (1997). Personality determinants in the prediction of aspects of expatriate job
success. In Z. Aycan (Ed.), New approaches to employee management, (Vol. 4, pp. 63–92). Greenwich,
CT: JAI Press.
Ones, D. S., & Viswesvaran, C. (1999). Relative importance of personality dimensions for international assignee
selection: A policy capturing study. Human Performance, 12, 275–294.
Phillips, J. M., & Gully, S. M. (2002). Fairness reactions to personnel selection techniques in Singapore and the
United States. International Journal of Human Resource Management, 13, 1186–1205.
Ployhart, R. E., Wiechmann, D., Schmitt, N., Sacco, J. M., & Rogg, K. L. (2003). The cross-cultural equiva-
lence of job performance ratings. Human Performance, 16, 49–79.
Prahalad, C. K., & Doz, Y. L. (1987). The multinational mission: Balancing local demands and global vision.
New York, NY: Free Press.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of
different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20,
643–671.
Ryan, A. M., Chan, D., Ployhart, R. E., & Slade, L. A. (1999). Employee attitude surveys in a multinational
organization: Considering language and culture in assessing measurement equivalence. Personnel
Psychology, 52, 37–58.
Ryan, A. M., Horvath, M., Ployhart, R. E., Schmitt, N., & Slade, L. A. (2000). Hypothesizing differential item
functioning in global employee opinion surveys. Personnel Psychology, 53, 531–562.
Ryan, A. M., McFarland, L., Baron, H., & Page, R. (1999). An international look at selection practices: Nation
and Culture as Explanations for variability in practice. Personnel Psychology, 52, 359–391.
Salgado, J. F., & Anderson, N. R. (2002). Cognitive and GMA testing in the European Community: Issues and
evidence. Human Performance, 15, 75–96.
Schrader, B. W., & Steiner, D. D. (1996). Common comparison standards: An approach to improving agreement
between self and supervisory performance ratings. Journal of Applied Psychology, 81, 813–820.
Searle, W., & Ward, C. (1990). The prediction of psychological and sociocultural adjustment during cross-cultural
transitions. International Journal of Intercultural Relations, 14, 449–464.
Shackleton, V., & Newell, S. (1997). International assessment and selection. In N. Anderson & P. Herriot
(Eds.), International handbook of selection and assessment (pp. 82–95). New York, NY: Wiley.
Shaffer, M. A., Harrison, D. A., & Gilley, K. M. (1999). Dimensions, determinants and differences in the expatriate
adjustment process. Journal of International Business Studies, 30, 557–581.
Shaffer, M. A., Harrison, D. A., Gregersen, H., Black, J. S., & Ferzandi, L. A. (2006). You can take it with you:
Individual differences and expatriate effectiveness. Journal of Applied Psychology, 91, 109–115.
Stahl, G., & Caligiuri, P. M. (2005). The relationship between expatriate coping strategies and expatriate adjust-
ment, Journal of Applied Psychology, 90, 603–616.
Steiner, D. D., & Gilliland, S. W. (2001). Procedural justice in personnel selection: International and cross-cultural
perspectives. International Journal of Selection and Assessment, 9, 124–137.
Tsui, A. S., & O’Reilly, C. A., III. (1989). Beyond simple demographic effects: The importance of relational
demography in superior-subordinate dyads. Academy of Management Journal, 32, 402–423.
Tung, R. (1981). Selection and training of personnel for overseas assignments. Columbia Journal of World
Business, 16, 21–25.
Tung, R. L. (1988). Career issues in international assignments. Academy of Management Executive, 2,
241–244.
Van Vianen, A. E. M., De Pater, I. E., & Caligiuri, P. M. (2005). Expatriate selection: A process. In A. Evers,
O. Smit-Voskuyl, & N. Anderson (Eds.), The handbook of personnel selection (pp. 458–475). Oxford,
England: Blackwell.
Selection in Multinational Organizations 799
Selection for Team Membership: A Contingency and Multilevel Perspective
Susan Mohammed, Jan Cannon-Bowers, and Su Chuen Foo
Working in a team environment calls for a different set of skills and motivations than those required
in traditional workplaces. A premium is placed on the ability to learn, teamwork and collaboration, a
continuous improvement ethic, high initiative, a focus on the customer, and problem-solving skills….
Making sure you have the best people on board will dramatically increase your chances for a successful
team implementation.
- Wellins, Byham, & Dixon, Inside Teams (1994, p. 322)
Why do some teams succeed and others fail? This fundamentally important question has been
investigated for decades (e.g., Heslin, 1964; Hackman, 1987; Kozlowski & Ilgen, 2006), and the
evolving answer is multifaceted and complex, involving a wide range of constructs and their inter-
actions. Nevertheless, team scholars generally agree that selecting the right team members is a key
variable in the team effectiveness equation. Indeed, most team performance models highlight the
critical role of member characteristics (e.g., Gladstein, 1984; Hackman, 1987; Tannenbaum, Salas, &
Cannon-Bowers, 1996), and the popular Input-Process-Outcome framework of team studies begins
with the expertise, abilities, and personality traits that members bring to the group (McGrath, 1984).
Selection interventions are expected to enhance team effectiveness because they identify employees
with superior taskwork and teamwork skills and detect those with a better fit to work in team envi-
ronments (Tannenbaum et al., 1996). However, despite the importance of team selection, significant
knowledge gaps remain regarding how to distinguish “team players” from “team inhibitors” and
how to create teams whose members have the right mix of competencies. Ironically, despite a wealth
of accumulated knowledge about how to select individuals to fit jobs and a burgeoning team litera-
ture, relatively little of this research has systematically focused on team selection issues (e.g., Jones,
Stevens, & Fischer, 2000; McClough & Rogelberg, 2003). In addition, teamwork measurement for
selection has been described as “primitive” (O’Neil, Wang, Lee, Mulkey, & Baker, 2003, p. 283).
The purpose of this chapter is to review and integrate what is currently known about team selec-
tion with the goals of identifying gaps in current knowledge and underscoring promising avenues for
future research. In doing so, we adopt a contingency as well as a multilevel perspective. With respect
to contingency, one of the overarching themes of the present work is that selection approaches will
differ for diverse types of teams and tasks. Because the nature of the team and its purpose play
such a prominent role in determining which member characteristics are needed, selection systems
must clearly define the purposes and conditions for which group members are chosen. Therefore, we
feature recent developments in team task analysis, which extend traditional job analysis procedures
by addressing the coordinative behaviors needed for effective team performance.
[Figure: team type (executive, command and control, negotiation, design) and task demands (interdependence, coordination demands, behavioral discretion, role specialization, formal and informal structure, level of autonomy) feed into the individual- and team-level competencies (knowledge, skills, abilities, and attitudes) required for selection.]
FIGURE 37.1 Conceptual framework for understanding selection for team membership.
In addition to the taskwork skills needed to perform their jobs individually, team members must also possess teamwork skills that enable interdependent work
(e.g., interpersonal, conflict management, communication, collaborative problem-solving; Stevens
& Campion, 1994). Because taskwork skills are not unique to the team context, we focus on two
types of teamwork competencies: core (general teamwork behaviors common to all team tasks)
and contingent (dependent on the task and the team’s configuration). Similar to other researchers
(e.g., Cannon-Bowers, Tannenbaum, Salas, & Volpe, 1995; Kichuk & Wiesner, 1998), we posit
that core teamwork competencies are generic or transportable—that is, they are important regard-
less of the particular task or team at hand. Examples of such competencies include interpersonal
skills, knowledge of teamwork, communication skills, preference for teamwork, and agreeable-
ness. Further, we propose that these attributes are held by individuals and can therefore be mea-
sured at the individual level.
Team Type
The key determinants of team effectiveness differ depending on team type. Integrating the various
taxonomies of team types that have been proposed in the literature, Devine (2002) derived two
broad classifications of teams: knowledge and physical. Knowledge work teams (e.g., executive,
command, negotiation, design, advisory) involve mental processing as the core task and informa-
tion as the main outcome. In contrast, physical work teams (e.g., service, production, medical,
military, sports) produce a tangible product as the main outcome.
Because knowledge and physical work teams vary with regard to their fundamental work cycle,
temporal duration, technological constraints, and health risks (Devine, 2002), selection require-
ments will clearly be different for diverse team types. In addition, consideration should be given to
whether selection is for new teams (all new members), transition teams (familiar members being
reassigned to new team tasks), or intact teams (new members being placed in existing teams; Jones
et al., 2000).
Task Demands
Influenced in large part by team type, the nature of the task that is being performed is a central
driver of team performance. Past researchers have framed this discussion from multiple perspectives, including descriptions of different types of interdependence (Saavedra, Earley, & Van Dyne,
1993) and the behavioral requirements of members during performance (McGrath, 1984). What
these conceptualizations have in common is the notion that the task affects the way the team func-
tions as well as the types of members needed. Other drivers of team functioning include coordina-
tion demands (the manner in which team members must organize and allocate their resources in
order to be successful; Bowers, Morgan, Salas, & Prince, 1993), behavioral discretion (the degree
of control team members have in performing the task as dictated by the level of proceduralization;
Cannon-Bowers, Salas, & Blickensderfer, 1998), role specialization (how roles are defined in the
team; Kilduff, Angelmar, & Mehra, 2000), structure (the nature of the formal organization and
communication channels; Price, Harrison, & Gavin, 2006), level of autonomy (the degree to which
the team manages itself; Langfred, 2007), and environmental stability (the stability of demands
placed on the team from external forces; Keck & Tushman, 1993). Given the importance of task
demands, team researchers have developed team task analysis methods.
INDIVIDUAL-LEVEL CONSIDERATIONS
INDIVIDUAL ATTRIBUTES THAT CONTRIBUTE TO EFFECTIVE TEAMWORK
In this section, we focus on describing individual-level attributes that contribute to effective team
performance as a basis to develop selection systems for teams. From a purely practical perspective,
organizations typically hire employees individually even if they are going to work as part of a team.
For this reason, it behooves team selection researchers to attempt to identify teamwork competen-
cies that can predict as much variance in team performance as possible. Because “interdependence
creates special job requirements” (Klimoski & Jones, 1995, p. 309), team tasks place additional
demands on employees that are not found in typical individual tasks. Further, these “teamwork”
attributes are separate and distinct from “taskwork” attributes, which are defined as the more tech-
nical competencies necessary to perform the task (Morgan, Salas, & Glickman, 1993).
Table 37.1 provides a summary of the knowledge, skills, attitudes, and personality traits important for team selection, although we do not claim that the list is exhaustive. In a few cases, the variables
displayed here have been studied, and even validated, in a selection context. However, in other cases
we have made the link to selection by extrapolating from the broader team performance literature.
In other words, we have reasoned that variables which have been demonstrated to be important to
team functioning should be useful in selection, particularly if the attribute is difficult to train.
Paper-and-Pencil Measures
Because of ease of administration and relatively low cost, paper-and-pencil measures are a popular
means of assessing KSAOs and personality traits for team member selection. Reflecting knowledge
of how to act in team situations, one promising measurement tool is the Teamwork KSA test, which
consists of 35 situational judgment items answered in a multiple choice format (Stevens & Campion,
1999). On the basis of the conceptual model of teamwork requirements developed by Stevens and
Campion (1994), the test captures both interpersonal (conflict resolution, collaborative problem-
solving, communication) and self-management (goal-setting, performance management, planning,
and coordination) KSAs. In addition to being more face-valid for team selection than cognitive ability tests, the test is less fakable than personality measures, and its items are written generally enough to apply across a variety of industry contexts. Validation studies have shown that the Teamwork KSA test
correlates with supervisory ratings of teamwork and taskwork performance (Leach, Wall, Rogelberg,
& Jackson, 2005; Stevens & Campion, 1999), peer nominations of teamwork (Stevens & Campion,
1999), team task proficiency (Hirschfeld, Jordan, Field, Giles, & Armenakis, 2006), observed ratings
of effective teamwork (Hirschfeld et al., 2006), and contextual performance (Morgeson, Reider, &
Campion, 2005) in organizational and military samples. Moreover, higher scores on the Teamwork
KSA test yielded higher observable teamwork behavior scores and peer ratings of individual effec-
tiveness in a student sample (McClough & Rogelberg, 2003). Although one sample revealed incre-
mental criterion-related validity beyond employment aptitude tests (Stevens & Campion, 1999), a
cautionary note is that strong correlations have raised the issue of redundancy with cognitive ability.
Another situational judgment test (SJT) that has recently been developed and validated for team
member selection is the Team Role Test (Mumford, van Iddekinge, Morgeson, & Campion, 2008),
which assesses declarative and procedural knowledge of team role types and the situational con-
tingencies needed for role adaptability. The Team Role Test consists of nine team scenarios with ten items per scenario, each requiring identification of the appropriate role. In academic and work team samples, the Team Role Test was positively related to peer ratings of team role performance (Mumford et al.,
2008). Furthermore, the SJT demonstrated incremental validity beyond cognitive ability and Big
Five traits in predicting role performance (Mumford et al., 2008).
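The incremental validity findings above rest on hierarchical regression: regress the criterion on cognitive ability alone, add the SJT score, and examine the change in R². A minimal sketch on synthetic data (all scores and coefficients are fabricated for illustration, not values from the cited studies):

```python
# Illustrative hierarchical regression on synthetic data: does an SJT score
# add predictive power (delta R^2) beyond a cognitive ability test?
# All scores and coefficients below are made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n = 500
cognitive = rng.normal(size=n)                          # cognitive ability scores
sjt = 0.5 * cognitive + rng.normal(scale=0.8, size=n)   # SJT correlates with ability
performance = 0.4 * cognitive + 0.3 * sjt + rng.normal(size=n)

def r_squared(predictors, criterion):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(criterion)), predictors])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    residuals = criterion - X @ beta
    return 1 - residuals.var() / criterion.var()

r2_step1 = r_squared(cognitive[:, None], performance)                 # ability only
r2_step2 = r_squared(np.column_stack([cognitive, sjt]), performance)  # ability + SJT
delta_r2 = r2_step2 - r2_step1   # incremental criterion-related validity
print(round(r2_step1, 3), round(r2_step2, 3), round(delta_r2, 3))
```

With a real dataset, the same comparison would use validated test scores and job performance criteria, and the significance of the R² change would be tested with an F-test.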
TABLE 37.1
Examples of the KSAOs and Personality Traits Important for Team Selection

Knowledge

Knowledge of teamwork skills. Definition: understanding of the necessary underpinnings and behavioral requirements of effective team performance. Related/subsidiary constructs: understanding of teamwork, familiarity with teamwork, knowledge of teamwork KSAs. Validation/measurement: assessed via the Teamwork KSA test; validation data show that this variable predicts effective teamwork (Hirschfeld et al., 2006; McClough & Rogelberg, 2003; Stevens & Campion, 1999).

Knowledge of team roles. Definition: knowledge of team roles and their situational contingencies. Validation/measurement: assessed via the Team Role Test; validation data show that this test predicts role performance (Mumford et al., 2008).

Skills

Adaptability. Definition: ability of team members to adjust their strategies in response to task demands by reallocating team resources. Related/subsidiary constructs: compensatory behavior, backing-up behavior, dynamic reallocation of function, mutual adjustment, workload balancing. Validation/measurement: best assessed in a work sample or other simulation; most likely a function of past experience (so not easily trained); some data suggest that adaptability improves teamwork (Salas, Nichols, & Driskell, 2007).

Interpersonal. Definition: ability of team members to optimize the quality of team member interactions through resolution of dissent, motivational reinforcement, and cooperative behaviors. Related/subsidiary constructs: morale building, conflict resolution, negotiation, cooperation, consulting with others, interpersonal trust, social perception, persuasion, helping others. Validation/measurement: may be assessed through a combination of paper-and-pencil and behavioral measures; some validation data suggest that interpersonal skills predict teamwork (e.g., Morgeson et al., 2005).

Team management/leadership. Definition: ability of team members to direct and coordinate activities; assign tasks; organize workflow among members; and plan, organize, and establish a positive climate. Related/subsidiary constructs: task motivation, goal-setting, planning and task coordination, establishing roles and expectations, instructing others, planning, organizing. Validation/measurement: best assessed in a work sample or other simulation, although paper-and-pencil instruments may add value; not easily trained because it probably develops with experience; some validation indicates that individual leadership skills are associated with teamwork effectiveness (e.g., Burke et al., 2006).

Assertiveness. Definition: capacity of team members to communicate effectively by sharing ideas clearly and directly in interpersonal situations. Related/subsidiary constructs: task-related assertiveness; a component of extraversion. Validation/measurement: can be assessed via paper-and-pencil tests, but behavioral measures are better; some validation data exist (Pearsall & Ellis, 2006; Smith-Jentsch, Salas, & Baker, 1996).

Mutual performance monitoring. Definition: ability of team members to accurately monitor and assess the work of others; to give, seek, and receive task-clarifying feedback in a constructive manner; and to offer advice. Related/subsidiary constructs: accepting suggestions/criticism, giving suggestions/criticism, intrateam feedback, monitoring and giving feedback, cross-checking, error correction, team maintenance. Validation/measurement: best assessed through a combination of paper-and-pencil and behavioral measures; has been linked to team performance (e.g., Marks & Panzer, 2004).

Communication. Definition: ability to clearly and accurately articulate and exchange information among team members using accepted terminology, acknowledge receipt of information, and clarify messages when needed. Related/subsidiary constructs: active listening, information exchange, closed-loop communication, information sharing, open exchange, consulting with others. Validation/measurement: best assessed through a combination of paper-and-pencil and behavioral measures; closed-loop communication has been shown to predict teamwork (e.g., Bowers, Pharmer, & Salas, 2000).

Cross-boundary. Definition: external, task-related actions directed to other teams or the larger organizational context. Related/subsidiary constructs: organizational awareness, organizational resourcefulness, building relationships with other teams. Validation/measurement: paper-and-pencil measure developed by Druskat and Kayes (1999).

Attitudes

Preference for teamwork. Definition: inclination and desire to be part of a team; willingness to engage with other people in pursuit of task success; appreciation for the importance of teamwork in accomplishing challenging tasks. Related/subsidiary constructs: team/collective orientation, importance of teamwork, appreciation for teamwork, desire to work in a team, collectiveness, preference for teamwork. Validation/measurement: has been assessed with paper-and-pencil measures; some evidence suggests that a collective orientation leads to better teamwork (Driskell & Salas, 1992) and that those who enjoy working in a team engage in less social loafing (Stark, Shaw, & Duffy, 2007) and have better team performance (Helmreich & Foushee, 1993).

Self-efficacy for teamwork. Definition: degree to which individuals believe that they have the requisite knowledge, skills, and other attributes to be a successful team member. Related/subsidiary constructs: teamwork self-efficacy. Validation/measurement: has been measured with paper-and-pencil measures (e.g., McClough & Rogelberg, 2003); some data support the link to effective teamwork (e.g., Tasa, Taggar, & Seijts, 2007).

Other Characteristics: Personality

Conscientiousness. Definition: extent to which a person is self-disciplined and organized. Related/subsidiary constructs: need for achievement, ambition, responsibility, dependability. Validation/measurement: has been assessed with paper-and-pencil measures; positively related to contextual performance in team settings (Morgeson et al., 2005).

Extraversion. Definition: extent to which an individual is social, outgoing, and talkative. Related/subsidiary constructs: enthusiasm, optimism, assertiveness, dominance, gregariousness. Validation/measurement: has been assessed with paper-and-pencil measures; positively related to contextual performance in team settings (Morgeson et al., 2005).

Agreeableness. Definition: extent to which an individual is gentle and cooperative. Related/subsidiary constructs: likeability, interpersonal facilitation, trustworthiness, tolerance, courteousness. Validation/measurement: has been assessed with paper-and-pencil measures; positively related to contextual performance in team settings (Morgeson et al., 2005).

Emotional stability. Definition: extent to which an individual is calm and poised. Related/subsidiary constructs: neuroticism (negative relationship), adjustment, lack of nervous tendencies, absence of anxiety, security. Validation/measurement: has been assessed with paper-and-pencil measures; positively (but only marginally) related to contextual performance in team settings (Morgeson et al., 2005).
TEAM-LEVEL CONSIDERATIONS
Thus far, we have discussed the individual-level KSAOs needed for team functioning, which
assumes that teams whose members score higher on taskwork and teamwork competencies will
perform better. However, “when individuals form groups the effects of a valid selection procedure
can be nullified by any lack of cooperation within groups and by bottlenecks, shirking, and social
loafing” (Schneider, Smith, & Sipe, 2000, p. 99). Therefore, it is critical that the overall team con-
text be considered in selection for team membership. In the sections below, we discuss team size,
person-group fit, and team composition.
TEAM SIZE
Because too few members can result in unreasonable work demands and too many members can
produce unnecessary redundancy, an important consideration in team selection involves determin-
ing an appropriate team size. Although larger teams are generally advantaged in terms of resources
(e.g., information, skills, division of labor), they are disadvantaged by heightened coordination dif-
ficulties (e.g., Steiner, 1972). Therefore, one prescription is to staff teams with the smallest number
required to do the work, but determining the optimal figure is contingent on team and task type
(Steiner, 1972; Sundstrom, DeMeuse, & Futrell, 1990). Different team responsibilities across stud-
ies may account for the mixed pattern of findings regarding team size (e.g., Campion, Medsker, &
Higgs, 1993; Mullen, Symons, Hu, & Salas, 1989). To illustrate, a meta-analysis by Stewart (2006)
found that the overall relationship between team size and performance was very small, but modera-
tion effects revealed stronger positive results for project and management teams as compared to
production teams. Because project and management teams involve unstructured tasks and interac-
tion with external constituencies, more team members may be desirable when the environment is
complex (Stewart, 2006).
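Steiner's (1972) argument, roughly that actual productivity equals potential productivity minus process loss, can be made concrete with a toy model; the functional forms and coefficients below are our own assumptions, not Steiner's parameterization:

```python
# Toy model of Steiner's (1972) idea that actual productivity equals potential
# productivity minus process loss. The linear member gain and the per-link
# coordination cost are assumed parameters, chosen only for illustration.

def pairwise_links(n):
    # Number of coordination channels among n members: n(n-1)/2.
    return n * (n - 1) // 2

def actual_productivity(n, per_member=10.0, loss_per_link=0.9):
    potential = per_member * n                        # resources grow with members
    process_loss = loss_per_link * pairwise_links(n)  # coordination overhead grows faster
    return potential - process_loss

# With these made-up coefficients, adding members helps at first and then hurts.
best_size = max(range(1, 25), key=actual_productivity)
print(best_size)
```

Raising `loss_per_link` (tasks with heavy coordination demands) shrinks the optimal size, which mirrors the prescription to staff teams with the smallest number required to do the work.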
PERSON-GROUP FIT
Subsumed under the broad, multilevel construct of person-environment (PE) fit, person-group
(PG) or person-team fit refers to the compatibility between members and their groups (Werbel &
Johnson, 2001). Compared with other types of fit (e.g., person-job, person-organization, and person-supervisor fit), PG fit has received the least research attention, but interest has been growing
in recent years (Kristof-Brown, Zimmerman, & Johnson, 2005b). Two general categories of PG fit
have been identified. Supplementary PG fit occurs when the individual and the workgroup share similar personality, goals, values, and abilities. For example, a person with an engineering back-
ground may join a team of other engineers. In contrast, complementary PG fit occurs when mem-
bers have different competencies, offsetting others’ weaknesses and offering resources that support
each other (Werbel & Johnson, 2001). To illustrate, a member with a marketing background may fill
a gap in a team composed of engineers. Although PG fit concerns the match between individuals and groups, fit research typically examines individual-level dependent variables.
Existing studies on PG congruence have looked at fit on a variety of content domains such as
values congruence (e.g., Adkins, Ravlin, & Meglino, 1996; DeRue & Morgeson, 2007), goal con-
gruence (e.g., Kristof-Brown & Stevens, 2001), and personality traits (e.g., Kristof-Brown, Barrick,
& Stevens, 2005a). Among these various content dimensions, PG value congruence (supplementary
fit) appears to have the strongest correlations with various outcomes given the relative constancy of
value systems (Kristof-Brown et al., 2005b). With respect to complementary fit on personality traits,
there is some evidence that extraverts are more attracted to teams of introverts, whereas introverts
are more attracted to teams of extraverts (Kristof-Brown et al., 2005a).
Current research points to various advantages of PG fit for the individual and team. For exam-
ple, PG value congruence contributed to increased satisfaction with work and social relation-
ships, improved performance on interpersonal dimensions, and reduced tardiness and absenteeism
(Adkins et al., 1996). Additionally, similarity between the individual and team on perceived self
and team mastery goals as well as self and team performance goals led to increased interpersonal
contributions to the workgroup (Kristof-Brown & Stevens, 2001). Self-team performance goal con-
gruence also improved satisfaction with work and the team (Kristof-Brown & Stevens, 2001). The
PG fit-outcome relationship can be characterized as reciprocal and cyclical, in that improved PG
fit enhances individual and group outcomes, which then results in better perceived PG fit (DeRue
& Morgeson, 2007). It is important for researchers to measure the perceptions of team members in
assessing PG fit, as studies have shown the greater salience of perceived PG fit as opposed to actual
PG fit in determining individual outcomes (Kristof-Brown & Stevens, 2001).
Demonstrating its importance, a recent meta-analysis by Kristof-Brown et al. (2005b) established
that PG fit (broadly defined) taps an independent conceptual domain distinct from other types of fit.
Interestingly, PG fit predicted outcomes such as work satisfaction and overall performance as well as more established dimensions of fit did (Kristof-Brown, Jansen, & Colbert, 2002; Kristof-Brown
et al., 2005b). Specifically, the meta-analytic results revealed that PG fit was positively correlated
with job satisfaction, organizational commitment, supervisor satisfaction, overall performance, and
contextual performance, but negatively correlated with intention to quit. Coworker satisfaction and
group cohesion exhibited particularly strong relationships with PG fit. Because of the limited num-
ber of studies examining PG fit, the unique relationships of supplementary and complementary fit
could not be examined in this meta-analysis (Kristof-Brown et al., 2005b).
Individual member characteristics have been shown to be important predictors of PG fit. In
particular, individual performance and growth satisfaction of team members were found to posi-
tively predict person-team congruence on values and person-role demands-abilities fit (DeRue &
Morgeson, 2007). In addition, the number of previous companies worked for (job switching) and the length of overall work experience moderated the effect of supplementary PG goal congruence on satisfaction with work and the team (Kristof-Brown et al., 2002). Specifically, individuals who had worked for many companies placed greater emphasis on person-organization fit, whereas individuals with longer work experience prioritized person-job fit, reducing the weight given to PG fit when evaluating satisfaction with work and the team.
Hollenbeck (2000) discussed various ways in which individual traits can be matched
with team type to improve team performance. For example, to achieve internal person-team fit, it
is recommended that researchers and practitioners select individuals high on cognitive ability for
teams characterized by broad and undefined roles, but select individuals who are relatively high
on openness to experience for teams that constantly need to change and adapt to the environment
(Hollenbeck, 2000). Additionally, functional team structures, which are defined by roles that are
Selection for Team Membership 811
narrow and low in scope, require agreeable members, whereas self-managing teams are better suited
to highly conscientious members. Finally, misaligned team structures, which occur when the team
structure is not well matched to the environment, need emotionally stable individuals to handle the
stress of associated problems (Hollenbeck, 2000).
Given the advantages gleaned from PG fit, it is important for managers and practitioners to
consider the match between individuals and the groups to which they are assigned. By focusing
only on individual KSAs when forming a team or when selecting a member into a pre-existing
team (Baker & Salas, 1992; Leach et al., 2005), organizations neglect the overall team context.
Measuring individual-level teamwork skills (Baker & Salas, 1992; O’Neil et al., 2003) is necessary,
but not sufficient, for team selection, as the interaction between individual characteristics and the
team environment must be taken into account. One available tool is the Team Selection Inventory,
which assesses an individual’s preferred team climate as compared to the team’s current climate to
determine person-team fit (Burch & Anderson, 2004).
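Instruments of this kind typically compare a candidate's preferred profile with the team's actual profile. The following is a minimal sketch of a generic difference-score operationalization of person-team fit; the dimension values are hypothetical and this is not the Team Selection Inventory's actual scoring procedure:

```python
import math

def profile_fit(preferred, actual):
    """Toy person-team fit score: negative Euclidean distance between a
    candidate's preferred team-climate profile and the team's current
    profile. Higher (closer to zero) means better fit. A generic
    difference-score sketch, not the TSI's scoring algorithm."""
    return -math.dist(preferred, actual)

# Hypothetical ratings on four climate dimensions (1-5 scales)
candidate_prefs = [4.5, 3.0, 5.0, 2.5]
team_climate    = [4.0, 3.5, 4.5, 3.0]
print(round(profile_fit(candidate_prefs, team_climate), 2))  # -1.0
```

Difference scores are only one of several fit operationalizations; polynomial regression approaches are often preferred in the fit literature because collapsing two profiles into a single distance discards information about the direction of misfit.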
TEAM COMPOSITION
Because team effectiveness depends, in part, on the compatibility of team members, it is important
not only to seek the best individuals for the team but also to find the best combination of individuals
in terms of their mix of skills, personality traits, abilities, and experience levels. Indeed, a recent meta-
analysis concluded, “who is included in the team matters…individual characteristics emerge to
form a collective construct that links with higher team-level performance” (Stewart, 2006, p. 44).
Composition is a broad term referring to configurations of attributes within small groups (Levine
& Moreland, 1990). Because PG fit and team composition derive from distinct literatures and employ
different methodologies, it is important to differentiate between them. Whereas the PG fit literature
examines individual-level criteria, team composition studies aggregate member characteristics to
the group level and investigate their impact on group-level outcomes (Kristof-Brown et al., 2005b).
Although team composition has been recognized as a significant predictor of team effectiveness
for years (e.g., Mann, 1959; Hackman, 1987), there is growing acknowledgement that the relation-
ship between member characteristics and team performance is complex and multifaceted. Similar to
the contingent teamwork competencies in Figure 37.1, the emerging conceptual framework reflects
a contingency perspective by suggesting that how and why composition variables influence team
outcomes will depend on a multiplicity of factors. Below, we discuss the aggregation method used,
the individual differences assessed, the particular outcomes studied, and the nature of the team
task as contingency factors in team composition (e.g., Argote & McGrath, 1993; Jackson, May, &
Whitney, 1995).
Aggregation Method
Team composition research is complicated by the various ways that individual scores can be com-
bined to arrive at a group score. The most popular and straightforward approach is to simply aver-
age each member’s responses. More complex than aggregating individual characteristics in a linear
fashion, diversity is concerned with the “distribution of differences among the members of a unit
with respect to a common attribute” (Harrison & Klein, 2007, p. 1200). The standard deviation of
individual-level scores is one common operationalization of team-level diversity (e.g., Harrison & Sin,
2006). Additional methods for operationalizing team composition include selecting the maximum
or minimum individual team member score (e.g., Chan, 1998; Kozlowski & Klein, 2000). Studies
have demonstrated that results differ, depending on the type of aggregation used, and that each
captures a unique aspect of team composition (e.g., Barrick, Stewart, Neubert, & Mount, 1998; Day,
Arthur, Miyashiro, Edwards, & Hanson, 2004).
Steiner’s (1972) task typology provided guidance on which aggregation method to use and
has been the most commonly used approach to specifying the appropriate operationalization in
the team composition literature. According to Steiner (1972), mean aggregation is best suited
for additive tasks, in which group performance is the sum of each member’s contribution (e.g.,
shoveling snow). Minimum scores are deemed appropriate for conjunctive tasks where the weakest
member determines team performance (e.g., mountain climbing), and maximum scores are deemed
appropriate for disjunctive tasks where the most competent member determines team performance
(e.g., problem-solving; Steiner, 1972). However, recent studies have been critical of this rationale
(e.g., Day et al., 2004), and a meta-analysis found that stronger effects were not observed when the
operationalization matched the task type of Steiner’s typology (Bell, 2007).
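The aggregation methods and Steiner's task-type mapping discussed above can be made concrete in a short sketch. The scores and names below are illustrative only, and, as the meta-analytic evidence just cited suggests, following the mapping is a heuristic rather than a guarantee of stronger prediction:

```python
import statistics

# Ways of combining individual member scores into a team-level score
AGGREGATORS = {
    "mean": statistics.mean,   # average member standing
    "min": min,                # weakest member
    "max": max,                # strongest member
    "sd": statistics.pstdev,   # diversity: spread across members
}

# Steiner's (1972) task typology mapped to its suggested operationalization
STEINER_RULE = {
    "additive": "mean",        # e.g., shoveling snow
    "conjunctive": "min",      # e.g., mountain climbing
    "disjunctive": "max",      # e.g., problem solving
}

def team_score(member_scores, task_type):
    """Aggregate member scores using the method Steiner's typology suggests."""
    return AGGREGATORS[STEINER_RULE[task_type]](member_scores)

ability = [95, 110, 120, 130]
print(team_score(ability, "conjunctive"))  # 95 -- the weakest link
print(team_score(ability, "additive"))     # 113.75 -- the team mean
```

Note that each operationalization answers a different question about the team, which is why results differ depending on the aggregation used.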
Because Steiner’s (1972) task taxonomy focused exclusively on the way in which group members’
contributions combine into a team outcome, additional variables must be considered in determining
the appropriate method of aggregation, including the predictor and outcome variables being assessed
as well as team and task type. For example, in a sample of business student teams, Mohammed and
Angell (2003) found that diversity on agreeableness, neuroticism, and extraversion affected oral
presentation scores, but mean cognitive ability positively affected written reports. Reflecting these
findings, a recent meta-analysis concluded that the best aggregation method depended on the
composition variable of interest and that no single operationalization emerged as superior for all
composition variables (Bell, 2007). To illustrate, the strongest relationships with team performance
were observed for conscientiousness when it was operationalized as the team mean, but for
agreeableness when it was operationalized as the team minimum (one disagreeable member was
enough to be a disruptive force; Bell, 2007).
Individual Differences
Clearly, many types of individual differences can exist in a team, including demographic (e.g.,
gender, age, ethnicity), task-related (e.g., ability, experience, functional background, tenure in the
field), and psychological (e.g., attitudes, personality, values). Therefore, in addition to aggregation
method, the member characteristics being assessed figure prominently in the contingency frame-
work describing how and why composition variables influence team outcomes. On the basis of a
TTA, it is important to assess which differences are likely to account for the greatest amount of
variance in predicting team performance. We briefly review the major findings for the individual
differences most researched in the team composition literature: demographics, ability, expertise,
and personality.
Given the current and projected demographic changes in the American workforce (Fullerton &
Toossi, 2001), there has been sustained research interest in readily detectable, social-category dif-
ferences in ethnicity, gender, and age (e.g., Mannix & Neale, 2005). This area of research focuses
exclusively on variability as an aggregation method. Reflecting the fact that diversity is often viewed
as a “double-edged sword,” competing views of team diversity have been espoused (e.g., Milliken &
Martins, 1996). Pessimistically, member heterogeneity is predicted to adversely impact commu-
nication, cohesion, and ultimately team performance because members are motivated to maintain
their social identities and prefer similarity in their interactions (e.g., Byrne, 1971; Tajfel, 1978).
Optimistically, member heterogeneity promotes creativity and quality decision-making because of
the various perspectives brought to a problem and opportunities for knowledge sharing (e.g., Cox,
Lobel, & McLeod, 1991).
Rather than resolving this debate, five published meta-analyses have consistently
found virtually no relationship between heterogeneity on demographic variables and
team performance (Bell, 2007; Bowers, Pharmer, & Salas, 2000; Horowitz & Horowitz, 2007;
Stewart, 2006; Webber & Donahue, 2001). Nevertheless, diversity scholars have shown surpris-
ingly high agreement in their recommendations concerning how to move this area of study forward.
For example, qualitative reviews have recommended extending beyond demographic differences
to incorporate more task-relevant forms of diversity that relate to the fundamental purposes of the
team (e.g., Mannix & Neale, 2005; van Knippenberg & Schippers, 2007). In addition, researchers
have been strongly advised to explore moderating influences rather than focus solely on main effects
(e.g., Webber & Donahue, 2001; Williams & O’Reilly, 1998). Time is one moderator that has proved
fruitful in explaining some of the inconsistent research findings in this literature. Specifically,
the effects of demographic diversity on team processes have been shown to weaken over time (or
with greater group tenure), whereas the effects of deep-level diversity (e.g., job-related attitudes)
strengthen over time (e.g., Harrison, Price, Gavin, & Florey, 2002). Similarly, a recent meta-analysis
found that the negative effects of race and sex diversity on team performance diminished the longer
a team was together (Bell, Villado, Lukasik, Briggs, & Belau, 2007).
Although much of the research has focused on a single demographic characteristic at a
time (e.g., gender), individuals hold multiple identities simultaneously (e.g., Hispanic female under 30).
Receiving increased attention in the diversity literature, faultline theory explores the hypothetical
dividing lines that may split a group into subgroups (Lau & Murnighan, 1998; Thatcher, Jehn, &
Zanutto, 2003). Recent evidence shows that the more member attributes align with one another (e.g.,
all male members of a work group are over 45, whereas all female members are under 30), the more
groups experience intragroup conflict and lower satisfaction (Lau & Murnighan, 2005).
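The intuition behind faultline strength can be illustrated with a toy pairwise-alignment index: the more often two members' similarity on one attribute coincides with their similarity on another, the cleaner the split into subgroups. This is a deliberate simplification for illustration, not the Fau measure of Thatcher et al. (2003), and the member data are hypothetical:

```python
from itertools import combinations

def attribute_alignment(team, attr_a, attr_b):
    """Toy faultline indicator: fraction of member pairs whose similarity
    on attr_a matches their similarity on attr_b. A value of 1.0 means the
    two attributes split the team into the same subgroups (a strong
    potential faultline); values near 0.5 mean the attributes cross-cut."""
    pairs = list(combinations(team, 2))
    agree = sum((x[attr_a] == y[attr_a]) == (x[attr_b] == y[attr_b])
                for x, y in pairs)
    return agree / len(pairs)

# The example above: all male members over 45, all female members under 30
aligned = [{"sex": "M", "age": "45+"}, {"sex": "M", "age": "45+"},
           {"sex": "F", "age": "<30"}, {"sex": "F", "age": "<30"}]
# Cross-cutting attributes weaken the faultline
crossed = [{"sex": "M", "age": "45+"}, {"sex": "M", "age": "<30"},
           {"sex": "F", "age": "45+"}, {"sex": "F", "age": "<30"}]
print(attribute_alignment(aligned, "sex", "age"))            # 1.0
print(round(attribute_alignment(crossed, "sex", "age"), 2))  # 0.33
```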
Whereas demographic findings have generally been mixed and inconclusive, cognitive ability
has yielded the most robust results in team composition research, replicating across field mainte-
nance teams (Barrick et al., 1998), student laboratory groups (Day et al., 2004), human resource
teams (Neuman & Wright, 1999), military tank crews (Tziner & Eden, 1985), and hierarchical
decision-making teams (LePine, Hollenbeck, Ilgen, & Hedlund, 1997). Mirroring the strong
positive relationship between general mental ability and individual-level performance (Schmidt,
2002), several meta-analyses have concluded that teams with smarter members do better (Bell,
2007; Devine & Philips, 2001; Stewart, 2006). When different operationalizations of cognitive
ability are compared (e.g., mean, maximum, minimum, variance), the mean has emerged as the
strongest predictor of team performance across several task types (e.g., Day et al., 2004; Devine &
Philips, 2001).
Expanding beyond general mental ability, research has also investigated expertise as a task-
relevant composition variable, which focuses on specific types of knowledge and experience that
members bring to the team, as well as educational specialization and functional background.
Although the results for cognitive ability were notably stronger, a recent meta-analysis found a
small positive relationship between expertise (mean-aggregated member experience and education)
and team performance (Stewart, 2006). However, most of the work in this area focuses on exper-
tise diversity, which can result in either increased team learning or conflict (Milliken & Martins,
1996; Williams & O’Reilly, 1998). Moderating factors such as collective team identification (van
der Vegt & Bunderson, 2005) and member debate and dialogue (Simons, Pelled, & Smith, 1999)
have been shown to promote positive outcomes from expertise diversity. Although Webber and
Donahue (2001) found that highly job-related diversity (functional, educational, and industry back-
ground) was not related to team outcomes, a more recent meta-analysis found a positive relationship
with both the quality and quantity of team performance (Horowitz & Horowitz, 2007). Similarly,
a meta-analysis by Bell et al. (2007) established that functional background and organizational
tenure diversity (but not educational diversity) were positively associated with team performance
(Bell et al., 2007).
In addition to cognitive ability and expertise, there is growing interest in the role of personal-
ity as a form of deep-level psychological diversity in the team composition literature. Much of the
existing work has focused on the Five-Factor Model (conscientiousness, extraversion, agreeable-
ness, neuroticism, and openness to experience) and examines the aggregate mean level of traits
(e.g., Mohammed, Mathieu, & Bartlett, 2002; Neuman & Wright, 1999). Meta-analytic results con-
sistently conclude that teams composed of conscientious and agreeable members perform better
(Bell, 2007; Peeters, van Tuijl, Rutte, & Reymen, 2006; Stewart, 2006). In addition, a meta-analysis
by Bell (2007) found that extraversion and openness to experience also positively predicted team
performance. Not surprisingly, these personality traits generally exhibited stronger relationships
with performance for organizational teams as compared to laboratory groups (Bell, 2007; Peeters
et al., 2006).
Whereas mean aggregation assumes that more of a trait is always better, there are individual
differences for which a balanced pattern across members may be more desirable. Of the research
focusing on personality diversity, extraversion (tendency to be sociable, talkative, and dominant;
Costa & McCrae, 1992) has received the most attention (e.g., Humphrey, Hollenbeck, Meyer, &
Ilgen, 2007; Neuman, Wagner, & Christiansen, 1999). Although too few extraverts within a group
may result in inadequate levels of intrateam communication, too many may pursue social interac-
tions at the expense of task demands and experience power struggles for leadership. Therefore,
diversity on extraversion may lead to more positive outcomes because roles are complementary, with
some members talking/leading and others listening/following. Several studies have found favorable
results for variability on extraversion (e.g., Barry & Stewart, 1997; Mohammed & Angell, 2003;
Neuman et al., 1999), but meta-analytic results have not been supportive (Bell, 2007; Peeters et al.,
2006). In general, meta-analyses investigating member heterogeneity on personality characteristics
have not yielded strong findings (e.g., Bell, 2007; Stewart, 2006).
Several researchers have begun to expand beyond Big Five personality traits to consider how the
composition of variables such as time urgency (Mohammed & Angell, 2004), task-related attitudes
(e.g., Harrison et al., 2002), collectivism (e.g., Tyran & Gibson, 2008), and preference for teamwork
(e.g., Mohammed & Angell, 2003) affect team functioning. Although the number of studies is quite
limited, meta-analytic results suggest that composing teams with members who enjoy working in
teams and have a collectivistic orientation is positively related to team performance (Bell, 2007).
Outcome Variables
In addition to aggregation method and individual differences, the type of outcome variable
being assessed also affects the relationship between team composition and team performance.
Demonstrating that differences in outcomes reflect a major source of variation across teams,
Mohammed et al. (2002) found that different predictors emerged for each of the three types of team
performance measured. Specifically, ability was significantly related to technical-administrative
task performance, and extraversion, neuroticism, and ability were related to leadership task perfor-
mance. Agreeableness predicted contextual performance. Thus, a single optimal composition for
work teams did not emerge.
Team member heterogeneity may prove to be a liability or an asset, depending on the dependent
variable being assessed. According to Mannix and Neale (2005), diverse teams are more appro-
priate for experimenting, problem-solving, and innovating, whereas homogenous teams are more
desirable for producing, executing, and convergent thinking. Supportive of these predictions, meta-
analytic results revealed that homogeneous teams outperformed heterogeneous teams for physical
work tasks requiring low cognitive demand (Bowers et al., 2000). In addition, although the over-
all relationship between team member heterogeneity and performance was near zero in the meta-
analysis by Stewart (2006), moderator tests found some support for the notion that heterogeneous
teams were better suited for creative tasks than production tasks. Similarly, functional background
and educational diversity yielded stronger effects with performance when innovation was the crite-
rion compared to efficiency as the criterion (Bell et al., 2007). Thus, the potentially positive effects
of work group diversity on group performance are more likely to emerge in teams performing
relatively complex tasks that require information processing, creativity, and collaborative decision-
making where the exchange and integration of diverse task-related information may stimulate thor-
ough consideration of ideas (van Knippenberg, De Dreu, & Homan, 2004).
Team and Task Type
Finally, several meta-analyses point to team and task type as moderators of composition effects (e.g.,
Webber & Donahue, 2001). For example, Bowers et al. (2000) stated, “It appears that the significant
effects found in many of the included studies can be attributed to the type and difficulty of
the task used in the investigation” (p. 305). Bell et al. (2007) found that the diversity (functional
background and organizational tenure)-performance relationship was strengthened for design and
cross-functional teams compared to other types of teams. However, some meta-analyses have failed
to find evidence for task/team moderated effects (e.g., Horowitz & Horowitz, 2007; Webber &
Donahue, 2001), and others have been hindered in the attempt because of sparse research. To illus-
trate, Horowitz and Horowitz (2007) could not test whether interdependence was a contingency
factor in the team diversity-team outcome link because of the lack of low interdependence teams
across studies. In addition, although there was clear support for moderation, Bell (2007) was unable
to distinguish between study setting and team type because knowledge work teams were mostly
studied in the laboratory, whereas physical work teams were mostly investigated in the field.
DISCUSSION
IMPLICATIONS FOR RESEARCH
TTA
Despite widespread consensus among team researchers regarding its importance (e.g., Baker et al.,
1998; Blickensderfer et al., 2000; Lorenzet et al., 2003), the paucity of research on TTA is indeed
surprising. Research has been deficient in offering empirically tested methodologies for practitioner
use. Since Baker et al. (1998) referred to TTA as “lost but hopefully not forgotten,” there has been
little follow-up research activity (p. 79). Therefore, efforts to examine untested teamwork taxono-
mies, develop new tools, and determine which methods are most appropriate for collecting specific
information top the research agenda in this area (Lorenzet et al., 2003).
Individual-Level Considerations
Although there are many theoretically derived variables that have been hypothesized and investi-
gated as important contributors to team effectiveness, few studies have been conducted to validate
the predictive power of these attributes in a selection context. Moreover, studies that assess the
combinatorial contributions of individual and team-level factors are required in order to optimize
the prediction of effective teamwork. Because many aspects of team functioning cannot be easily
captured via paper-and-pencil measures, efforts to develop behaviorally based assessment tools to
capture observed team actions objectively and reliably are also sorely needed. Fortunately, new
technologies are beginning to emerge as candidates for simulating team environments realistically,
including role players, video-based systems, and virtual world technologies (e.g., Cannon-Bowers
et al., 2007), but they must be validated for team selection.
Team-Level Considerations
Although PG fit is the most nascent type of fit, a small body of work has already demonstrated its
positive impact on a number of individual-level outcomes (e.g., Kristof-Brown et al., 2005b). In
addition, given that meta-analytic moderator analyses could not be conducted because of too few
studies (Kristof-Brown et al., 2005b), further PG fit research is clearly warranted.
Still in a formative stage of development, team composition has consistently been identified as a
promising avenue for future studies (e.g., Ilgen, Hollenbeck, Johnson, & Jundt, 2005; Kozlowski &
Bell, 2003; Kozlowski & Ilgen, 2006). Although meta-analytic results have been straightforward
regarding mean-aggregated characteristics (e.g., Bell, 2007; Stewart, 2006), results have been
far less conclusive regarding how to improve the mix of competencies in a team or how to select
new team members while considering existing team member KSAOs. In addition, the question
of whether heterogeneity or homogeneity of member traits is superior has not been satisfactorily
addressed (e.g., Bowers et al., 2000; Webber & Donahue, 2001).
As Bell et al. (2007, p. 14) cautioned:

Diversity should not be discussed using “blanket” statements suggesting that it is either good, bad, or
unrelated to team performance without reference to the specific diversity variable of interest, type of
team, type of performance assessed, and the amount of time the team has been together.
Because the team composition literature has been criticized as being “conceptually scattered”
(McGrath, 1984, p. 256) and “atheoretical” (Levine & Moreland, 1990, p. 594), well-developed
models adopting a contingency and multilevel perspective are needed to help clarify the complex
patterns of variables deemed important in this literature. Still a rarity in research practice, a
comprehensive “meso” approach to team staffing
involves not only multiple levels, but also cross-level interactions (Ployhart, 2004). The meso para-
digm concerns the simultaneous investigation of two levels of analysis where the processes between
levels are articulated (House, Rousseau, & Thomas-Hunt, 1995).
According to Stewart (2006), “some traits matter more than others … yet our understanding
of which traits are most beneficial for collective performance of teams is still evolving” (p. 45).
Therefore, researchers should be encouraged to explore new task-relevant constructs and expand
beyond examining the traditionally considered characteristics of demographic variables, cognitive
ability, expertise, and Big Five traits. In addition, research simultaneously examining multiple char-
acteristics, as is done in faultline studies, should increase. Future work should also seek to specify
the team and task types, as well as contextual factors, that affect optimal team size.
IMPLICATIONS FOR PRACTICE
Individual-Level Considerations
Evaluating potential team members on generic team KSA requirements and personality factors
has shown success as a selection tool (e.g., Hirschfeld et al., 2006; McClough & Rogelberg, 2003;
Morgeson et al., 2005; Mumford et al., 2008; Stevens & Campion, 1999). Because these paper-and-
pencil tests are relatively easy to administer and add significantly to the prediction of team member
effectiveness, we would recommend including them in a team selection context. However, given that
many of the attributes that best predict team effectiveness are not easily captured via paper-and-
pencil tests (e.g., communication skills, adaptability), we also advise implementing work sample
tests that place the applicant in a realistic team setting when feasible. Obviously, this is a more costly
and complicated process, justifiable only for high-stakes hires.
Team-Level Considerations
Convergent meta-analytic results offer some guidance to practitioners in their quest to staff teams
effectively in organizations. Team composition has been shown to strongly influence team perfor-
mance across many studies, with some effects rivaling those found for process variables such as
conflict and cohesion (e.g., Bell, 2007; Stewart, 2006). Therefore, when possible, practitioners should
utilize team composition as a tool to increase team effectiveness. With regard to what characteris-
tics to focus on, both taskwork and teamwork competencies have been shown to contribute unique
variance as predictors of team performance (Bell, 2007). Specifically, multiple meta-analyses have
confirmed that teams with smart, conscientious, and agreeable members perform better (e.g., Bell,
2007, Peeters et al., 2006; Stewart, 2006). Individual meta-analyses have also found that higher
mean levels of expertise (Stewart, 2006), openness to experience, and extraversion (Bell, 2007) are
also related to higher team performance.
Although it is recommended that the heterogeneity/homogeneity of member characteristics be
explored in team selection (e.g., Klimoski & Jones, 1995; McClough & Rogelberg, 2003), the
inconsistency and complexity of current research findings preclude the kind of straightforward
prescriptions that are appealing to practitioners. Whereas moderated results are attractive to researchers in
specifying the conditions under which diversity will aid or hinder team performance, the number of
contingencies to be considered significantly complicates the feasibility of interventions to compose
teams. In addition, composition research has not adequately addressed many of the individual dif-
ference variables that may be of substantial import in work settings. For example, when forming
military teams, rank or status must be considered to avoid violating political protocols and creating
interpersonal mismatches. Therefore, a significant disconnect remains between the current state of
research and practitioners’ ability to optimally assign members to teams based on the accumulated
body of knowledge. Bridging this gap is a key agenda item for future work.
The legal issues underlying selection for team membership must also be considered. Interestingly,
there may be an inherent conflict between Equal Employment Opportunity laws written to protect
the individual and team selection procedures designed to improve group outcomes. Whereas the legal
perspective emphasizes standardization and the importance of evaluating all applicants according to
a common set of metrics, the team contingency perspective emphasizes customization and the value
of member compatibility as well as skill heterogeneity. Is it legally defensible for an employer to reject
a candidate who has the same competencies as other team members and to select another candidate with
different competencies? What are the legal ramifications when selection for team membership is seen
as promotion or special placement? These questions have yet to be fully explored and resolved.1
CONCLUSIONS
Understanding how to form superior teams is the key to harnessing selection as a tool for improving
team performance. Given the importance of teams in many modern organizations, it is surprising
1 The authors would like to thank Nancy T. Tippins for this addition.
that the state of the science and practice in team selection is still fairly rudimentary. Although
there is no shortage of variables that have been hypothesized to affect team performance, specific
studies validating predictors of team effectiveness in a selection context are relatively rare. In addi-
tion, developing mechanisms to determine how a potential employee will fit into a particular team
requires a level of precision that has yet to be developed, in terms of what competencies are needed
within the team and how to measure them in applicants. Therefore, more work is needed regarding
the categories of attributes that are necessary to optimize team functioning—those that are held
by individual members and those that transcend individual members and exist at the team level.
Despite the difficulties, we hope that team researchers will focus on identifying and empirically
validating the competencies needed for selection for team membership.
REFERENCES
Adkins, C. L., Ravlin, E. C., & Meglino, B. M. (1996). Value congruence between co-workers and its relation-
ship to work outcomes. Group & Organization Studies, 21, 439.
Argote, L., & McGrath, J. E. (1993). Group processes in organizations: Continuity and change. International
Review of Industrial and Organizational Psychology, 8, 333–389.
Arthur, W., Jr., Edwards, B. D., Bell, S. T., Villado, A. J., & Bennett, W., Jr. (2005). Team task analysis:
Identifying tasks and jobs that are team-based. Human Factors, 47(3), 654–669.
Baker, D. P., & Salas, E. (1992). Principles for measuring teamwork skills. Human Factors, 34(4), 469–475.
Baker, D. P., Salas, E., & Cannon-Bowers, J. (1998). Team task analysis: Lost but hopefully not forgotten. The
Industrial-Organizational Psychologist, 35(3), 79–83.
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality
to work-team processes and team effectiveness. Journal of Applied Psychology, 83(3), 377–391.
Barry, B., & Stewart, G. L. (1997). Composition, process, and performance in self-managed groups: The role
of personality. Journal of Applied Psychology, 82(1), 62–78.
Barsness, Z. I., Tenbrunsel, A. E., Michael, J. H., & Lawson, L. (2002). Why am I here? The influence of group
and relational attributes on member-initiated team selection. In H. Sondak (Ed.), Toward phenomenology
of groups and group membership (pp. 141–171). New York, NY: Elsevier Science.
Bell, S. T. (2007). Deep-level composition variables as predictors of team performance: A meta-analysis.
Journal of Applied Psychology, 92(3), 595–615.
Bell, S. T., Villado, A. J., Lukasik, M. A., Briggs, A., & Belau, L. (2007). Revisiting the team demographic
diversity and performance relationship: A meta-analysis. Paper presented at the annual conference of the
Society for Industrial and Organizational Psychology, New York, NY.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential
and evidential bases. Journal of Applied Psychology, 74, 478–494.
Blickensderfer, E., Cannon-Bowers, J. A., Salas, E., & Baker, D. P. (2000). Analyzing knowledge require-
ments in team tasks. In J. M. Schraagen, S. F. Chipman, & V. L. Shalin (Eds.), Cognitive task analysis
(pp. 431–450). Philadelphia, PA: Lawrence Erlbaum.
Bowers, C. A., Morgan, B. B., Jr., Salas, E., & Prince, C. (1993). Assessment of coordination demand for air-
crew coordination training. Military Psychology, 5, 95–112.
Bowers, C. A., Pharmer, J. A., & Salas, E. (2000). When member homogeneity is needed in work teams: A
meta-analysis. Small Group Research, 31(3), 305.
Burch, G. S. J., & Anderson, N. (2004). Measuring person-team fit: Development and validation of the team
selection inventory. Journal of Managerial Psychology, 19(4), 406.
Burke, C. S., Stagl, K. C., Klein, C., Goodwin, G. F., Salas, E., & Halpin, S. (2006). What type of leadership
behaviors are functional in teams? A meta-analysis. The Leadership Quarterly, 17, 288–307.
Byrne, D. E. (1971). The attraction paradigm. New York, NY: Academic Press.
Campion, M. A., Medsker, G. J., & Higgs, A. C. (1993). Relations between work group characteristics and
effectiveness: Implications for designing effective work groups. Personnel Psychology, 46(4), 823–847.
Cannon-Bowers, J. A., Bowers, C. A., & Sanchez, A. (2007). Using synthetic learning environments to train
teams. In V. I. Sessa & M. London (Eds.), Work group learning: Understanding, improving, and assess-
ing how groups learn in organizations (pp. 315–347). Mahwah, NJ: Lawrence Erlbaum.
Cannon-Bowers, J. A., Salas, E., & Blickensderfer, E. (1998, April). On training crews. In R. J. Klimoski
(Chair), When is a work team a crew—and does it matter? Paper presented at the 13th annual meeting of
the Society of Industrial and Organizational Psychology, Dallas, TX.
Selection for Team Membership 819
Cannon-Bowers, J. A., Tannenbaum, S. I., Salas, E., & Volpe, C. E. (1995). Defining competencies and estab-
lishing team training requirements. In R. A. Guzzo, E. Salas, et al. (Eds.), Team effectiveness and deci-
sion making in organizations (pp. 333–380). San Francisco, CA: Jossey-Bass.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of
analysis: A typology of composition models. Journal of Applied Psychology, 83(2), 234–246.
Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor
Inventory (NEO-FFI): Professional manual. Lutz, FL: Psychological Assessment Resources.
Cox, T. H., Lobel, S. A., & McLeod, P. L. (1991). Effects of ethnic group cultural differences on coop-
erative and competitive behavior on a group task. The Academy of Management Journal, 34(4),
827–847.
Day, E. A., Arthur, W., Miyashiro, B., Edwards, B. D., & Hanson, T. (2004). Criterion-related validity of sta-
tistical operationalizations of group general cognitive ability as a function of task type: Comparing the
mean, maximum, and minimum. Journal of Applied Social Psychology, 34(7), 1521–1549.
DeRue, D. S., & Morgeson, F. P. (2007). Stability and change in person-team and person-role fit over time: The
effects of growth satisfaction, performance, and general self-efficacy. Journal of Applied Psychology,
92(5), 1242–1253.
Devine, D. J. (2002). A review and integration of classification systems relevant to teams in organizations.
Group Dynamics: Theory, Research, and Practice, 6, 291–310.
Devine, D. J., & Philips, J. L. (2001). Do smarter teams do better: A meta-analysis of cognitive ability and team
performance. Small Group Research, 32(5), 507.
Driskell, J. E., & Salas, E. (1992). Collective behavior and team performance. Human Factors, 34, 277–288.
Druskat, V. U., & Kayes, D. C. (1999). The antecedents of competence: Toward a fine-grained model
of self-managing team effectiveness. In E. A. Mannix, M. A. Neale, & R. Wageman (Eds.),
Research on managing groups and teams: Groups in context (Vol. 2, pp. 201–231). Greenwich,
CT: JAI Press.
Fullerton, H. N., Jr., & Toossi, M. (2001). Labor force projections to 2010: Steady growth and changing com-
position. Monthly Labor Review, 124(11), 21–38.
Gladstein, D. L. (1984). Groups in context: A model of task group effectiveness. Administrative Science
Quarterly, 29(4), 499–517.
Hackman, J. R. (1987). The design of work teams. In J. Lorsch (Ed.), Handbook of organizational behavior
(pp. 315–342). Englewood Cliffs, NJ: Prentice-Hall.
Hackman, J. R. (2002). Leading teams: Setting the stage for great performances. Boston, MA: Harvard
Business School Press.
Halfhill, T. R., Huff, J. W., Sundstrom, E., & Nielsen, T. M. (2003). Group personality composition and work
team effectiveness: Key factor in staffing the team-based organization. Advances in Interdisciplinary
Studies of Work Teams, 9, 147–168.
Harrison, D. A., & Klein, K. J. (2007). What’s the difference? Diversity constructs as separation, variety, or
disparity in organizations. The Academy of Management Review, 32(4), 1199–1228.
Harrison, D. A., Price, K. H., Gavin, J. H., & Florey, A. T. (2002). Time, teams, and task performance: Changing
effects of surface-and deep-level diversity on group functioning. Academy of Management Journal,
45(5), 1029–1045.
Harrison, D. A., & Sin, H. (2006). What is diversity and how should it be measured? In A. M. Konrad, P. Prasad,
& J. K. Pringle (Eds.), Handbook of workplace diversity (pp. 191–216). Thousand Oaks, CA: Sage.
Helmreich, R. L., & Foushee, H. C. (1993). Why crew resource management? Empirical and theoretical bases
of human factors in training and aviation. In E. Wiener, B. G. Kanki, & R. L. Helmreich (Eds), Cockpit
resource management (pp. 3–45). San Diego, CA: Academic Press.
Heslin, R. (1964). Predicting group task effectiveness from member characteristics. Psychology Bulletin, 62,
248–256.
Hirschfeld, R. R., Jordan, M. H., Feild, H. S., Giles, W. F., & Armenakis, A. A. (2006). Becoming team players:
Team members’ mastery of teamwork knowledge as a predictor of team task proficiency and observed
teamwork effectiveness. Journal of Applied Psychology, 91(2), 467–474.
Hollenbeck, J. R. (2000). A structural approach to external and internal person-team fit. Applied Psychology:
An International Review, 49(3), 534–549.
Horowitz, S. K., & Horowitz, I. B. (2007). The effects of team diversity on team outcomes: A meta-analytic
review of team demography. Journal of Management, 33(6), 987–1015.
House, R., Rousseau, D. M., & Thomas-Hunt, M. (1995). The meso paradigm: A framework for the inte-
gratition of micro and macro organizational behavior. Research in Organizational Behavior, 17,
71–114.
820 Handbook of Employee Selection
Huffcutt, A. L., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment
of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86(5),
897–913.
Humphrey, S. E., Hollenbeck, J. R., Meyer, C. J., & Ilgen, D. R. (2007). Trait configurations in self-managed
teams: A conceptual examination of the use of seeding for maximizing and minimizing trait variance in
teams. Journal of Applied Psychology, 92(3), 885.
Ilgen, D. R., Hollenbeck, J. R., Johnson, M., & Jundt, D. (2005). Teams in organizations: From input-
process-output models to IMOI models. Annual Review of Psychology, 56(1), 517–543.
Jackson, S. E., May, K. E., & Whitney, K. (1995). Understanding the dynamics of diversity in decision-making
teams. In R. A. Guzzo, E. Salas, & Associates (Eds.), Team effectiveness and decision making in organi-
zations (pp. 204–261) San Francisco, CA: Jossey Bass.
Jones, R. G., Stevens M. J., & Fischer, D. L. (2000). Selection in team contexts. In J. F. Kehoe (Ed.), Managing
selection in changing organizations (pp. 210–241). San Francisco, CA: Jossey-Bass.
Keck, S., & Tushman, M. (1993). Environmental and organizational context and executive team structure.
Academy of Management Journal, 36(6), 1314–1344.
Kichuk, S. L., & Wiesner, W. H. (1998). Work teams: Selecting members for optimal performance. Canadian
Psychology, 39(1–2), 23–32.
Kilduff, M., Angelmar, R., & Mehra, A. (2000, January). Top management-team diversity and firm perfor-
mance: Examining the role of cognitions. Organization Science, 11, 21–34.
Kirksey, J., &. Zawacki, R. A. (1994). Assessment center helps find team-oriented candidates. Personnel
Journal, 73(5), 92.
Klimoski, R., & Jones, R. G. (1995). Staffing for effective group decision-making: Key issues in matching
people and teams. In R. A. Guzzo, E. Salas, & Associates (Eds.), Team effectiveness and decision making
in organizations (pp. 291–332) San Francisco, CA: Jossey Bass.
Kozlowski, S. W. J., & Bell, B. S. (2003). Work groups and teams in organizations. In W. C. Borman,
D. R. Ilgen, & R. J. Klimoski (Eds.), Handbook of psychology: Industrial and organizational psychology
(pp. 333–375). London, England: Wiley.
Kozlowski, S. W. J., & Ilgen, D. R. (2006). Enhancing the effectiveness of work groups and teams. Psychological
Science in the Public Interest, 7(3), 77–124.
Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations:
Contextual, temporal, and emergent processes. In S. W. Kozlowski & K. J. Klein (Eds.), Multilevel the-
ory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90).
London, England: Wiley.
Kristof-Brown, A., Barrick, M. R., & Stevens, C. K. (2005a). When opposites attract: A multi-sample demon-
stration of complementary person-team fit on extraversion. Journal of Personality, 73, 935–958.
Kristof-Brown, A. L., Jansen, K. J., & Colbert, A. E. (2002). A policy-capturing study of the simultaneous
effects of fit with jobs, groups, and organizations. Journal of Applied Psychology, 87(5), 985–993.
Kristof-Brown, A. L., & Stevens, C. K. (2001). Goal congruence in project teams: Does the fit between
members’ personal mastery and performance goals matter? Journal of Applied Psychology, 86(6),
1083–1095.
Kristof-Brown, A. L., Zimmerman, R. D., & Johnson, E. C. (2005b). Consequences of individual’s fit at work:
A meta-analysis of person-job, person-organization, person-group, and person-supervisor fit. Personnel
Psychology, 58(2), 281–342.
Langfred, C. (2007). The downside of self-management: A longitudinal study of the effects of conflict on
trust, autonomy, and task interdependence in self-managing teams. Academy of Management Journal,
50, 885–900.
Lau, D. C., & Murnighan, J. K. (1998). Demographic diversity and faultlines: The compositional dynamics of
organizational groups. The Academy of Management Review, 23(2), 325–340.
Lau, D., & Murnighan, J. K. (2005). Interactions with groups and subgroups: The effects of demographic fault-
lines. Academy of Management Journal, 48, 645–659.
Leach, D. J., Wall, T. D., Rogelberg, S. G., & Jackson, P. R. (2005). Team autonomy, performance, and mem-
ber job strain: Uncovering the teamwork KSA link. Applied Psychology: An International Review, 54,
1–24.
Lepine, J. A., Hollenbeck, J. R., Ilgen, D. R., & Hedlund, J. (1997). Effects of individual differences on the
performance of hierarchical decision-making teams: Much more than g. Journal of Applied Psychology,
82(5), 803–811.
Levine, J. M., & Moreland, R. L. (1990). Progress in small group research. Annual Review of Psychology,
41(1), 585–634.
Selection for Team Membership 821
Lorenzet, S. J., Eddy, E. R., & Klein, G. D. (2003). The importance of team task analysis for team human
resource management. In M. M. Beyerlein, D. A. Johnson, & S. T. Beyerlein (Eds.), Team-based organiz-
ing (Vol. 9, pp. 113–145). New York, NY: Elsevier Science.
Mann, R. D. (1959). A review of the relationships between personality and performance in small groups.
Psychological Bulletin, 56(4), 241.
Mannix, E., & Neale, M. A. (2005). What differences make a difference? Psychological Science in the Public
Interest, 6(2), 31–55.
Marks, M. A., & Panzer, F. J. (2004). The influence of team monitoring on team processes and performance.
Human Performance, 17, 25–41.
McClough, A. C., & Rogelberg, S. G. (2003). Selection in teams: An exploration of the teamwork knowledge,
skills, and ability test. International Journal of Selection and Assessment, 11, 56–66.
McGrath, J. E. (1984). Groups: Interaction and performance. Englewood Cliffs, NJ: Prentice-Hall.
Milliken, F. J., & Martins, L. L. (1996). Searching for common threads: Understanding the multiple effects of
diversity in organizational groups. The Academy of Management Review, 21(2), 402–433.
Mohammed, S., & Angell, L. C. (2003). Personality heterogeneity in teams: Which differences make a differ-
ence for team performance? Small Group Research, 34(6), 651.
Mohammed, S., & Angell, L. (2004). Surface- and deep-level diversity in workgroups: Examining the mod-
erating effects of team orientation and team process on relationship conflict. Journal of Organizational
Behavior, 25, 1015–1039.
Mohammed, S., Mathieu, J. E., & Bartlett, A. L. (2002). Technical-administrative task performance, leader-
ship task performance, and contextual performance: Considering the influence of team- and task-related
composition variables. Journal of Organizational Behavior, 23(7), 795–814.
Morgan, B. B., & Lassiter, D. L. (1992). Team composition and staffing. In R. W. Swezey & E. Salas (Eds.),
Teams: Their training and performance (pp. 75–100). Westport, CT: Ablex.
Morgan, B., Salas, E., & Glickman, A. (1993, July). An analysis of team evolution and maturation. Journal of
General Psychology, 120(3), 277–291.
Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The impor-
tance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58(3),
583–611.
Mullen, B., Simons, C., Hu, L. T., & Salas, E. (1989). Group size, leadership behavior, and subordinate satisfac-
tion. The Journal of General Psychology, 116(2), 155.
Mumford, T. V., van Iddekinge, C. H., Morgeson, F. P., & Campion, M. A. (2008). The team role test:
Development and validation of a team role knowledge situational judgment test. Journal of Applied
Psychology, 93, 250–267.
Neuman, G. A., Wagner, S. H., & Christiansen, N. D. (1999). The relationship between work-team personality
composition and the job performance of teams. Group and Organization Management, 24, 28–45.
Neuman, G. A., & Wright, J. (1999). Team effectiveness: Beyond skills and cognitive ability. Journal of Applied
Psychology, 84(3), 376–389.
O’Neil, H. F., Wang, S., Lee. C., Mulkey, J., & Baker, E. L. (2003). Assessment of teamwork skills via a
teamwork questionnaire. In H. F. O’Neil & R. S. Perez (Eds.), Technology applications in education: A
learning view (pp. 283–304). Mahwah, NJ: Lawrence Erlbaum.
Pearsall, M. J., & Ellis, A. P. J. (2006). The effects of critical team member assertiveness on team performance
and satisfaction. Journal of Management, 32, 575–594.
Peeters, M. A. G., van Tuijl, H., Rutte, C. G., & Reymen, I. (2006). Personality and team performance: A meta-
analysis. European Journal of Personality, 20(5), 377–396.
Ployhart, R. E. (2004). Organizational staffing: A multilevel review, synthesis, and model. Research in Personnel
and Human Resources Management, 23, 121–176.
Price, K., Harrison, D., & Gavin, J. (2006). Withholding inputs in team contexts: Member composition,
interaction processes, evaluation structure, and social loafing. Journal of Applied Psychology, 91(6),
1375–1384.
Saavedra, R., Earley, P. C., & van Dyne, L. (1993). Complex interdependence in task performing groups.
Journal of Applied Psychology, 78, 61–72.
Salas, E., Burke, C. S., Fowlkes, J. E., & Priest, H. A. (2004). On measuring teamwork skills. In J. Thomas
(Ed.), Comprehensive handbook of psychological assessment. Vol. 4: Industrial/organizational assess-
ment (pp. 427–442). Hoboken, NJ: Wiley.
Salas, E., Cannon-Bowers, J. A., & Johnston, J. H. (1997). How can you turn a team of experts into an expert
team? Emerging training strategies. In G. K. Zsambok & C. E. Zsambok (Eds.), Naturalistic decision
making (pp. 359–370). Mahwah, NJ: Lawrence Erlbaum.
822 Handbook of Employee Selection
Salas, E., Nichols, D. R., & Driskell, J. E. (2007). Testing three team training strategies in intact teams: A meta-
analysis. Small Group Research, 38, 471–488.
Schmidt, F. L. (2002). The role of general cognitive ability and job performance: Why there cannot be a debate.
Human Performance, 15, 187–210.
Schneider, B., Smith, D. B., & Sipe, W. P. (2000). Personnel selection psychology. In S. W. Kozlowski &
K. J. Klein (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions,
and new directions (pp. 91–120). London, England: Wiley.
Simons, T., Pelled, L. H., & Smith, K. A. (1999). Making use of difference: Diversity, debate, and decision
comprehensiveness in top management teams. Academy of Management Journal, 42(6), 662–673.
Smith-Jentsch, K. A., Salas, E., & Baker, D. P. (1996). Training team performance-related assertiveness.
Personnel Psychology, 49, 909–936.
Stark, E. M., Shaw, J. D., & Duffy, M. K. (2007). Preference for group work, winning orientation, and social
loafing behavior in groups. Group and Organization Management, 32(6), 699–723.
Steiner, I. D. (1972). Group processes and productivity. New York, NY: Academic Press.
Stevens, M. J., & Campion, M. A. (1994). The knowledge, skill, and ability requirements for teamwork:
Implications for human resource management. Journal of Management, 20(2), 503.
Stevens, M. J., & Campion, M. A. (1999). Staffing work teams: Development and validation of a selection test
for teamwork settings. Journal of Management, 25(2), 207.
Stewart, G. L. (2006). A meta-analytic review of relationships between team design features and team perfor-
mance. Journal of Management, 32(1), 29.
Sundstrom, E., De Meuse, K. P., & Futrell, D. (1990). Work teams: Applications and effectiveness. American
Psychologist, 45(2), 120–133.
Tajfel, H. (1978). Differentiation between social groups: Studies in the social psychology of intergroup rela-
tions. New York, NY: Academic Press.
Tannenbaum, S. I., Salas, E., & Cannon-Bowers, J. A. (1996). Promoting team effectiveness. In M. A. West
(Ed.), Handbook of work group psychology (pp. 503–529). Chichester, England: Wiley.
Tasa, K., Taggar, S., & Seijts, G. H. (2007). The development of collective efficacy in teams: A multilevel and
longitudinal perspective. Journal of Applied Psychology, 91(1), 17–27.
Thatcher, S. M. B., Jehn, K. A., & Zanutto, E. (2003). Cracks in diversity research: The effects of diversity
faultlines on conflict and performance. Group Decision and Negotiation, 12(3), 217–241.
Tyran, K. L., & Gibson, C. B. (2008). Is what you see, what you get?: The relationship among surface-and
deep-level heterogeneity characteristics, group efficacy, and team reputation. Group & Organization
Management, 33(1), 46.
Tziner, A., & Eden, D. (1985). Effects of crew composition on crew performance: Does the whole equal the
sum of its parts? Journal of Applied Psychology, 70(1), 85–93.
van der Vegt, G., & Bunderson, J. S. (2005). Learning and performance in multidisciplinary teams: The impor-
tance of collective team identification. The Academy of Management Journal, 48(3), 532–547.
van Knippenberg, D., De Dreu, C. K., & Homan, A. C. (2004). Work group diversity and group performance:
An integrative model and research agenda. Journal of Applied Psychology, 89(6), 1008–1022.
van Knippenberg, D., & Schippers, M. C. (2007). Work group diversity. Annual Review of Psychology, 58,
515–541.
van Welie, M., & van der Veer, G. C. (2003). Pattern languages in interaction design: Structure and organiza-
tion. Proceedings of Interact, 3, 1–5.
Webber, S. S., & Donahue, L. M. (2001). Impact of highly and less job-related diversity on work group cohe-
sion and performance: A meta-analysis. Journal of Management, 27(2), 141.
Wellins, R. S., Byham, W. C., & Dixon, G. R. (1994). Inside teams: How 20 world-class organizations are
winning through teamwork. San Francisco, CA: Jossey-Bass.
Wellins, R. S., Byham, W. C., & Wilson, J. M. (1991). Empowered teams, creating self managing working
groups and the improvement of productivity and participation. San Francisco, CA: Jossey-Bass.
Werbel, J. D., & Johnson, D. J. (2001). The use of person-group fit for employment selection: A missing link
in person-environment fit. Human Resource Management, 40(3), 227–240.
Williams, K. Y., & O’Reilly, C. A. (1998). Demography and diversity in organizations: A review of 40 years of
research. Research in Organizational Behavior, 20(S77), 140.
38 Selecting Leaders
Executives and High Potentials
George C. Thornton III, George P. Hollenbeck,
and Stefanie K. Johnson
Despite the importance of leaders at the top levels of organizations, there has been relatively little
research on executive selection. There is a large literature on leadership concepts (Den Hartog &
Koopman, 2001) and leadership development (Day, 2001), but surprisingly little empirical research
on selecting leaders into top-level positions. Recent books on selection (Evers, Anderson, &
Voskuijl, 2005) provide little or no guidance on the selection of executive leaders, despite the fact
that selection of personnel is a major area of practice in industrial-organizational (I-O) psychology,
and executive selection is extremely important in any organization.
The majority of our understanding of executive selection has come from applied research and
experience. For example, Hollenbeck (1994) summarized observations from his experience and
eight books on the selection of chief executive officers (CEOs). Sessa and Taylor (2000) summarized
the results of a series of studies conducted at the Center for Creative Leadership in the 1990s using
simulations and surveys of executives. The purpose of this chapter is to review research and practice
in the selection of executive leaders and those who have high potential for these positions and to
comment on these developments over the past decades on the basis of our observations.
We begin the chapter by defining our focal group of executives and high potentials. To clarify
our focus, we make distinctions between leadership behaviors, leaders, and management. Next we
describe a few of the attributes thought to be important for the effectiveness of top leaders and
review some techniques to assess these attributes in high potentials and executive candidates. Then,
we describe the importance of an integrated process of recruitment, development, and management
of high potentials and executives in the context of an organization’s internal culture and exter-
nal environment. Next, mechanisms for integrating evaluations of performance and potential are
described. We then take up special topics such as the fit of executive leaders with the organization’s
needs and culture. We discuss what may be the most difficult and challenging issue: an evaluation
of whether leader selection methods work. We conclude with a review of past and present
executive selection research, discussing the roles (actual and potential) of I-O psychologists in
executive selection.
Executives, defined by level in the organization and by participation in the company’s execu-
tive compensation plan, generally make up less than 2% of the total employee population in large
organizations. According to Silzer (2002), executives include general managers, corporate offi-
cers, and heads of major organizational functions and business units; senior executives are in the
top tier—corporate officers, executive committee members, and CEOs. For example, at General
Electric in 2007, among its 320,000 employees, 5,000 were executives, 600 were senior executives,
and 185 were corporate officers (http://gecapsol.com/cms/servlet/cmsview/GE_Capital_Solutions/
prod/en/index.html). Highly decentralized organizations may have many company presidents (e.g.,
Johnson and Johnson with about 180 companies; S. Peters, personal communication, July 2, 2007).
In smaller organizations, the half dozen or so executives who have responsibility for major segments
of the business are, of course, a much smaller percentage of the employee population.
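The head counts reported above imply the following workforce shares. This is a quick arithmetic check using only the figures cited in the text (the figures themselves are as reported, not independently verified):

```python
# Executive head counts at General Electric in 2007, as reported in the text.
total_employees = 320_000
executives = 5_000
senior_executives = 600
corporate_officers = 185

# Share of the total workforce at each level.
exec_share = executives / total_employees        # ~1.56%, consistent with "less than 2%"
senior_share = senior_executives / total_employees
officer_share = corporate_officers / total_employees

print(f"Executives:         {exec_share:.2%}")
print(f"Senior executives:  {senior_share:.2%}")
print(f"Corporate officers: {officer_share:.2%}")
```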
High potential for executive jobs is also an organization-specific term, but one perhaps used more
narrowly than executive. Classifying employees as high potential grows out of an organization’s efforts
to ensure continuity in its supply of executives through succession planning, replacement
planning for key positions, and talent management. High potentials are typically high-
performing managers and individual contributors. At Royal Dutch Shell, “potential is the
estimate of the highest job that an individual will be able to perform in his or her future career”
(Podratz, Hungerford, Schneider, & Bui, 2007). High potentials are those thought to be able
to reach executive jobs. Depending on the resources devoted by the organization to leadership
development, high potentials may be identified quite early in their careers or only after they have
had a chance to demonstrate their potential.
Executive jobs have changed dramatically since the 1950s when management was associated
with large, stable organizations and consisted of the classic functions of planning, organizing, and
controlling. Hemphill’s classic studies of executive behavior arrived at ten dimensions of manage-
ment at the executive level (Campbell, Dunnette, Lawler, & Weick, 1970) that only faintly resemble
the way executive work is described today. Now, those classic management functions must be aug-
mented with complex, diverse, and situation-specific leadership behaviors required by dynamic,
global, competitive business environments. Bass (1990) captured the essence of the distinction
between manager and leader: “Leaders manage and managers lead, but the two activities are not
synonymous” (p. 383). To manage, the executive carries out the classic functions; to lead, the
executive behaves in ways that inspire and influence the behavior of others.
Members throughout the organization may carry out leadership behaviors, and a full
description of a modern understanding of leadership behaviors is beyond the scope of this chapter
(see Salancik, Calder, Rowland, Leblebici, & Conway, 1975, for the distinction between leadership
and leader). Today, the simplest answer to the question of “What do executives do?” may be
“Whatever it takes.” Lengthy executive position descriptions have given way to outcome-oriented
objectives, relating to what the executive is expected to contribute to the strategic mission of the
organization.
Today, organizations and positions are seen as dynamic and rapidly changing. Executives are
expected to change the jobs they are in and to be changed by those jobs. What executives
do to achieve the required outcomes is less a function of a written job description and more about
how what they do fits into the climate and culture of the organization. The higher the level of execu-
tive position, the more the incumbents shape the position to their preferences, talents, and abilities
to advance the mission of the organization. Selecting an executive has changed from asking, “What
must the executive do?” to “What must get done?” “What does it take to get that done?” is
typically defined by a list of competencies or human attributes believed to underlie executive success.
A distinction is sometimes drawn between competencies and human attributes, but it is not always
precise. Competencies are
often stated in terms of broad organizational objectives (e.g., global perspective), but some com-
petency lists include human attributes (e.g., adaptability). Most organizations have competency
lists or models for their executive positions (e.g., see the 17-competency Executive Skills Profile,
Frisch, 1998). Lists may vary from few (6–8) to many (60–70). A typical set of competencies
that has been used in talent management at Johnson and Johnson includes: integrity and credo-based
actions, strategic thinking, big-picture orientation with attention to detail, ability to inspire
and motivate, intellectual curiosity, collaboration and teaming, sense of urgency, prudent risk
taking, self-awareness and adaptability, and a results- and performance-driven orientation. A challenge facing anyone
doing selection is to translate these competencies into operationally defined personal attributes that
are related to leadership effectiveness, make a difference in performance of executives, and can
be assessed. The process of translating organizational competencies to measurable human attributes
is carried out by various methods traditionally known as job analysis techniques (Brannick &
Levine, 2002). Subject matter experts may be asked to rate the relevance of a list of attributes needed
for the completion of job tasks and responsibilities, to rate the importance of behaviors known to be
examples of attributes, and/or to infer directly the importance of human attributes to successful
leadership performance.
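The SME rating step described above amounts to a simple aggregation: collect importance ratings for each candidate attribute, average them, and retain attributes above a cutoff. The sketch below is illustrative only; the attribute names, the 1–5 rating scale, and the cutoff are hypothetical and not drawn from any particular job-analysis method:

```python
from statistics import mean

# Hypothetical SME importance ratings (1 = not important, 5 = critical)
# for a handful of candidate executive attributes.
sme_ratings = {
    "strategic thinking": [5, 4, 5, 4],
    "adaptability":       [4, 4, 3, 5],
    "technical depth":    [2, 3, 2, 2],
}

CUTOFF = 3.5  # retain attributes whose mean importance exceeds this (arbitrary)

def retained_attributes(ratings, cutoff):
    """Average each attribute's SME ratings and keep those above the cutoff."""
    means = {attr: mean(vals) for attr, vals in ratings.items()}
    return {attr: m for attr, m in means.items() if m > cutoff}

print(retained_attributes(sme_ratings, CUTOFF))
# "strategic thinking" (mean 4.5) and "adaptability" (mean 4.0) are retained;
# "technical depth" (mean 2.25) falls below the cutoff.
```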
In the next section, we discuss some of the attributes thought to be associated with successful
executives and high potential leaders. The scores of human attributes that have been associated with
effective leadership literally range from “a to z” (Bass, 1990). Our review suggests to us that five
of these have been emphasized recently and bear further discussion: intelligence, social/emotional
intelligence, personality, self-efficacy, and integrity.
INTELLIGENCE
Executives must have a fairly high level of intelligence to run complex organizations. The well-
documented relationship of job complexity to cognitive abilities (Ones, Viswesvaran, & Dilchert,
2005) suggests that intelligence is important for executive success. But, the type of intelligence
and the relationship of intelligence to leader effectiveness have been debated for decades. Korman
(1968) concluded that verbal intelligence predicts performance of supervisors but not higher-level
managers. Menkes (2005) found that cognitive skills, such as analyzing and anticipating business
trends, differentiate star executives from their peers. Some have argued that whereas crystallized
intelligence (i.e., knowledge of facts) is important at lower levels, fluid intelligence (akin to
creativity) is important at executive levels. Although a certain (probably considerable) amount of
cognitive ability is important, additional levels may not be related to executive performance. In fact, some
evidence suggests a curvilinear relation: both very low and very high levels of intelligence may be detrimental
to leadership success (Bass, 1990). The search for elements of cognitive ability important to execu-
tive leadership (and other areas of effective performance) has led to specification of different types
of intelligence.
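The curvilinear (inverted-U) relation described above can be expressed as a simple quadratic model in which effectiveness peaks at some optimum level and declines in either direction. The function below is purely hypothetical; the optimum and penalty coefficients are illustrative, not estimates from any study:

```python
def predicted_effectiveness(intelligence, optimum=120.0, penalty=0.0005):
    """Hypothetical inverted-U: effectiveness peaks at an optimum level of
    intelligence and declines as scores depart from it in either direction."""
    return 1.0 - penalty * (intelligence - optimum) ** 2

# Effectiveness is highest near the optimum and lower at both extremes.
for iq in (90, 105, 120, 135, 150):
    print(iq, round(predicted_effectiveness(iq), 4))
```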
SOCIAL/EMOTIONAL INTELLIGENCE
Emotional intelligence may be linked to leadership success (Goleman, 2000). Emotional intelli-
gence is defined as a person’s ability to successfully manage his or her own emotions and under-
stand the emotions of others (Salovey & Mayer, 1990). Goleman (1998) suggested that individuals
with high levels of emotional intelligence are more likely to be identified as star performers, and
McClelland (1996) found that managers with higher levels of emotional intelligence outperform
those with lower levels of emotional intelligence by 20%. Similarly, social skill (Riggio, Riggio,
Salinas, & Cole, 2003) and social intelligence (Riggio, Murphy, & Pirozzolo, 2002; Zaccaro, 2001)
have been related to leadership success. Considerable disagreement exists over the definition of
these types of intelligence.
PERSONALITY
The Five-Factor Model has become one of the most widely used and popular conceptualizations of
personality (e.g., McCrae & Costa, 1989). Sometimes known as “The Big Five,” these attributes include
conscientiousness, agreeableness, openness to experience, neuroticism (sometimes called anxiety),
and extraversion. Variables in the model, especially extraversion and conscientiousness, have been
associated with leadership effectiveness (Judge, Bono, Ilies, & Gerhardt, 2002; Lord, DeVader, &
Alliger, 1986). Personality traits may be particularly predictive of success for top-level leaders
given the amount of autonomy that is characteristic of such positions (Barrick & Mount, 1991).
Indeed, researchers have demonstrated the importance of personality to top-level leaders in
organizations such as Sears (Bentz, 1990). On the flip side, Hogan and Hogan (2001) suggested that
“dark side” personality traits (e.g., paranoia and passive aggressiveness) can also be used to predict
leadership failure.
SELF-EFFICACY
Self-efficacy (Smith & Foti, 1998; Chemers, Watson, & May, 2000) and the related construct of
self-esteem (e.g., Atwater, Dionne, Avolio, Camobreco, & Lau, 1999) are also important for leader
success. Self-esteem is seen as a more generalized perception of the self, whereas self-efficacy is
usually conceptualized as one’s confidence in his or her ability to perform well in a specific domain.
In the case of leadership, the most relevant type of self-efficacy would be self-efficacy for one’s
ability to lead. Chemers et al. (2000) showed that leadership efficacy was related to leadership ratings made by superior officers, peers, and trained observers. Further, individuals’ leadership efficacy
may impact team performance through its effects on followers’ self-efficacy beliefs (Hoyt, Murphy,
Halverson, & Watson, 2003).
INTEGRITY
Integrity also deserves mention. The relevance of integrity to leadership has been highlighted in
recent years with corporate scandals such as Enron and WorldCom. Meta-analytic research has
supported the relationship between leader integrity and leadership performance (Judge & Bono,
2001). The importance of integrity to leadership has also given rise to a new model of leadership
called Authentic Leadership (Avolio & Gardner, 2005). The construct of ethical leadership is also
related to leader integrity (Brown & Trevino, 2006; Brown, Trevino, & Harrison, 2005). These
authors define ethical leadership as “the demonstration of normatively appropriate conduct through
personal actions and interpersonal relationships” (Brown et al., 2005, p. 120). They demonstrate that
ethical leadership is related to perceptions of leader honesty and fairness, trust in leader, and other
perceptions of effective leadership. Moreover, they suggest that ethical leadership can also influence
follower ethical behavior insofar as leaders are role models for their followers (Brown & Trevino,
2006; Brown et al., 2005).
ASSESSMENT TECHNIQUES
Methods to assess these and other attributes range from using internal performance appraisal data
to elaborate testing and assessment. In the following section, we summarize research on several
different types of assessment techniques and, where data exist, discuss their use in executive selec-
tion. The strength of the relationship between a specific attribute measured by a given assessment
technique and a specific criterion of leadership effectiveness is typically only moderate. That is, the
correlation is seldom over .35 to .40. Combinations of multiple measures of multiple attributes often
yield correlations in the range of .50 to .60. In other words, approximately 12–36% (i.e., .35² to .60²) of the variation in leadership effectiveness is predicted from one or more human attributes. Readers seeking more information on specific tests, questionnaires, and assessment methods in I-O psychology will find the handbook by Thomas (2004) quite useful.
Selecting Leaders 827
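The squared-correlation arithmetic above can be sketched in a few lines (a minimal illustration, not part of the handbook; the function name is our own):

```python
# Illustrative sketch: the proportion of criterion variance explained by a
# predictor is the squared validity coefficient, r**2.
def variance_explained(r: float) -> float:
    """Proportion of criterion variance explained by a predictor
    whose correlation (validity) with the criterion is r."""
    return r ** 2

# Single attribute measures: r is seldom over .35 to .40.
single = [variance_explained(r) for r in (0.35, 0.40)]     # ~.12 to ~.16

# Composites of multiple measures: r of roughly .50 to .60.
composite = [variance_explained(r) for r in (0.50, 0.60)]  # ~.25 to ~.36
print(single, composite)
```

This makes plain why even a composite correlation of .60 leaves roughly two-thirds of the variance in leadership effectiveness unexplained.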
BIODATA
There is little question that one’s experiences during childhood, adolescence, education, military
training, and initial work up to the time of being considered to have high potential as an executive,
or being chosen for an executive position, are relevant to later success. Considerable research has
shown that systematic measures of biographical experiences can be highly predictive of performance in various jobs (Stokes, Mumford, & Owens, 1994), including supervisory and managerial jobs (Stokes & Cooper, 1994). Individual experiences that predict different performance criteria vary
considerably, but commonly predictive generic dimensions of life history include academic achieve-
ment, family background, economic stability, and financial responsibility. However, at the executive
level, systematically gathering information about individuals’ early life experiences can be prob-
lematic because the relevance may not be apparent. Formally scored biodata questionnaires are not
used frequently, although their prevalence as a selection device may be increasing. In a 2006 survey
of staffing directors, 60% reported they used application forms and 23% (up from 13% in 2004)
used a biographical data form (Howard et al., 2007).
PERSONALITY QUESTIONNAIRES
In 1991, Tett, Jackson, and Rothstein reported that personality tests, and in particular measures
of the Five-Factor Model of personality, are frequently used in selection contexts. Furthermore,
research has demonstrated that personality measures more specific than the broad Big Five factors can be more predictive of performance (Pulakos, Borman, & Hough, 1988). Yet, despite the relevance of personality characteristics to leadership, personality questionnaires are rarely used to select executives
(Hogan & Kaiser, 2005). In Howard et al.’s (2007) survey of organizations, 65% of the organiza-
tions had never used a personality inventory for selection.
Considerable controversy exists over the extent to which responses to self-report personality
tests are influenced by contaminants such as self-enhancement biases and faking. Contrasting opin-
ions on these matters are expressed in a series of articles in the autumn and winter 2007 issues of
Personnel Psychology. Morgeson et al. (2007) raised questions over the utility of self-report per-
sonality questionnaires for selection purposes given their low correlation with job performance and
the inherent flaws of self-report data. In response to this article, other authors argue for the utility
of personality measures in selection, particularly when more sophisticated weighting schemes and
conceptualizations of personality are used (Ones, Dilchert, Viswesvaran, & Judge, 2007; Tett &
Christiansen, 2007).
Integrity measures, both personality-based and overt scales, have demonstrated high levels of
validity and are used with some frequency in selection contexts (Ones, Viswesvaran, & Schmidt,
1993), although 85% of Howard et al.’s (2007) respondents had never used integrity tests for execu-
tive selection. Yet, there is some evidence that integrity tests could be useful at the executive level.
Ones et al.’s (1993) meta-analysis demonstrated that integrity tests were equally good at predicting
job performance for jobs ranging from low to medium to high complexity, and integrity tests were
better at predicting counterproductive work behaviors for jobs that were high in complexity.
Although related to job performance, measures of self-efficacy (Judge & Bono, 2001) and
emotional intelligence (Law, Wong, & Song, 2004) have had little use in selection contexts, pos-
sibly because of the transparency of these measures. There is some evidence supporting the use
of measures of social skills, a construct related to emotional/social intelligence, for selection (e.g.,
Morgeson, Reider, & Campion, 2005; Riggio & Taylor, 2000), but additional research is needed before such measures are used to select top-level leaders. In particular, the move toward ability-based measures of
emotional intelligence could be fruitful (e.g., Daus & Ashkanasy, 2005). Ability measures of emo-
tional intelligence are psychometrically similar to general intelligence tests and assess skills using
performance tasks rather than self-report measures of behavior.
1 Thanks to the following for their generous sharing of time and knowledge about current practice: Seymour Adler (AON),
Stewart Crandell (PDI), P. Richard Jeanneret (Valtera), and Matt Paese (DDI).
MULTISOURCE APPRAISAL
Multisource or 360º performance appraisals are often used for selection and development of high
potentials; they are also used for screening of executive candidates. As the name implies, managers
are rated on a questionnaire by their supervisors, subordinates, peers, and themselves, and even by internal customers, external customers, vendors, or suppliers (Dalessio, 1998). Estimates of the usage
of 360º appraisals range from 12% to 29% of all organizations (Church, 2000) and up to 90% of
Fortune 500 companies (Atwater & Waldman, 1998). Although 360º appraisals have primarily been
used as a feedback and development tool for managers (Goodstone & Diamante, 1998), Tornow
(1993) suggested that the method can also be used for appraisal, selection, and promotion. Indeed,
Halverson, Tonidandel, Barlow, and Dipboye (2005) found that ratings and self-other agreement on
360º ratings predicted promotion rate throughout one’s career in the United States Air Force.
Despite the trend toward using 360º appraisals for administrative purposes (Dalton, 1996;
London & Smither, 1995), there is controversy over the application of 360º appraisals to selection
(Craig & Hannum, 2006; Toegel & Conger, 2003). Specifically, writers have expressed several
concerns: self-enhancement by the manager who has a strong motivation to convey that he or
she has been performing well (Craig & Hannum, 2006); raters who know the assessment is for
administrative purposes may not wish to negatively impact the focal manager (DeNisi & Kluger,
2000); and employment decisions based on ratings from unknown and minimally trained raters may
not be legally defensible. In fact, Morgeson, Mumford, and Campion (2005) noted that one of the
leading organizational consultants in the use of 360º appraisals, the Center for Creative Leadership,
restricts their use to developmental purposes.
ASSESSMENT CENTERS
The assessment center (AC) method has been used for evaluating executive potential for over
50 years (Thornton & Byham, 1982; Thornton & Rupp, 2006). Originally validated with longitudinal
studies as an indicator of early management potential of men and women (Howard & Bray, 1988),
the method has been used to assess executives in numerous industries and countries. The unique
aspect of the AC method is the combination of features that involve observation of overt behavior
in multiple simulations of organizational challenges by multiple trained observers who integrate
evaluations in consensus discussion, statistical formulae, or a combination of both. Some ACs involve
the consideration of information from other techniques such as cognitive ability tests, personality
questionnaires, multisource feedback, or a background interview. Older and recent surveys show
that large numbers of organizations use ACs for selecting executives and high potentials (Thornton
& Krause, 2009). Howard (1997) reported that Development Dimensions International assessed
over 350 executives in 1996, and the number is surely larger in recent years. Executive ACs involve
dimensions such as global awareness and strategic vision, calling for strategic decisions such as
launching joint ventures, managing the talent pool, or promoting a turnaround. A recent study in
Germany found that an AC added incremental validity over cognitive ability tests in predicting
executive success in a public organization (Krause, Kersting, Heggestad, & Thornton, 2006).
LEADERSHIP QUESTIONNAIRES
The reader may be surprised that we have not reviewed leadership behavior and style questionnaires
(Clark & Clark, 1990). To be sure, there are scores of self-report questionnaires developed over the years,
such as the Leader Behavioral Description Questionnaire and the Leadership Opinion Questionnaire.
The respondent is typically asked to indicate how often he or she does certain behaviors, such as
directing the work of subordinates or providing them support. In addition, some general self-report
personality questionnaires have scales measuring leadership potential. Although these instruments
have been useful in research, in helping individuals gain insight into their leadership styles, and in
counseling and training settings, they have not been applied extensively in executive selection. The
older questionnaires typically do not cover the broader set of leader characteristics deemed important
in recent leadership studies, and they all suffer from the potential biasing effects of self-enhancement.
executives, the CEO, and the board. The reviewers ordinarily have a number of prior reviews of
performance and potential along with assessments of development needs available. The board of
directors of an organization has become increasingly involved in an organization’s selection (as
well as compensation and evaluation) of the CEO as a result of the recent raft of corporate scandals
and “a subsequent stream of regulations and guidelines” (Nadler, Behan, & Nadler, 2006, p. 174).
Boards are involved in reviewing the performance of not only the CEO and senior executives, but
also current high potentials and staff who are in the pipeline to become executives in the next several years. Board involvement in executive selection and talent management is equally characteristic of the not-for-profit and private sectors, although not-for-profits are typically smaller organizations with
less extensive and less formal review processes.
Another example of the integration of evaluations of performance and potential is General Electric’s
widely copied process called Session C. This is a formal annual review that combines records of past accomplishments in delivering quantifiable results (including financial results), input from 360-degree feedback questionnaires, and observations from mentoring relationships and developmental programs to arrive at judgments of performance and potential. Integration of these data is carried out by committees
including the CEO, top executives, and HR staff (http://gecapsol.com/cms/servlet/cmsview/GE_
Capital_Solutions/prod/en/index.html), typically ascending the organization with review sessions beginning at the division level and extending up to the executive committees and board level.
Corporate responses to the well-documented limitations of performance reviews (e.g., Dipboye,
Smith, & Howell, 1994) have improved the process, but those limitations still exist and have not
decreased the popularity of the process. Arguably, changes in the 21st century organization (e.g.,
globalization, multiple managers, greater spans of control, increased use of teams, virtual reporting
relationships) have made effective performance reviews even more difficult. Reviews of perfor-
mance and potential may be supplemented by other techniques, but they remain today, as 25 years
ago, the core of the selection process.
SUMMARY
The evidence for the validity, relevance, and accessibility of these techniques for the selection
of executives and high potentials is mixed. Moreover, perhaps the most widely used process for
selecting executives from outside the organization involves and is managed by external executive
recruiters (Howard et al., 2007), a process not discussed in detail here. Executive recruiters
typically use interviews and references as the data gathering methods, provide descriptions of
candidates, and make their own recommendations to the organization. Wackerle (2001) described this process in detail. When there are internal and external candidates for an executive position in
an organization, there may well be much more information available about the internal employee
(e.g., performance appraisals, test results). The differences raise questions of the comparability and
fairness of the selection decisions. Equity may be introduced if an external source assesses both
internal and external candidates, a process being carried out more frequently (Howard, 2001).
There does not appear to be any one best method for executive selection, and evidence of the
prevalence of one selection technique or the lack of use of another does not provide evidence of
the measure’s validity or utility. Each of the measures discussed here may be useful for selecting
executives. Organizations must examine the qualities they are looking for, their staffing strategy
and philosophy, and their past success with different measures when choosing their selection plan.
In addition, consideration must be given to a leader’s fit with the organization.
levels, a broader array of characteristics of individuals and jobs are essential; fit becomes multidi-
mensional on both sides. Fit has dominated writing on executive selection in recent years. Fitting the
executive to the strategic demands of a position (e.g., turnaround, start-up, harvesting) has become
standard practice. Increasingly, students of organizational success have recognized that success
depends on well-functioning teams rather than a single, dominant leader.
TEAM
Therefore, finding executives who fit the team is essential for the success of the executive and the
organization. At the CEO level, Hollenbeck (1994) argued that successful selection depends upon
the fit among three sets of variables: those of the individual, the organization, and the external envi-
ronment or strategic demands. Sessa and Taylor (2000) discussed characteristics of the candidate,
the organization, and its strategy. Moses and Eggebeen (1999) argued that executive selection is
influenced by the notion of fit between the individual and the needs and climate of the organization
as they change over time (e.g., from a large, stable organization such as the earlier AT&T or IBM to
a faster-paced, versatile, constantly evolving organization such as the later AT&T or IBM).
Although our earlier discussion of what executives do suggests they may face different demands
across situations and time, Sternberg and Vroom (2002) suggested that the leader, the situation, and
the leader by situation interaction are all likely to impact leadership success. Fiedler’s early work on
Contingency Theory (Fiedler, 1967) suggests that the extent to which task-oriented or relationship-oriented leaders will be successful in a particular situation depends on situation favorability: positive leader-member relations, a structured task, and position power. Path-goal theory (House, 1971)
suggests that leaders should alter their behavior on the basis of characteristics of followers and the
situation (task, authority system, workgroup).
OUTCOME
The type of performance outcome that is required for the task may also be an important situational
contingency. For example, charismatic and transformational leadership may lead to greater creativity,
but not overall productivity (Johnson & Dipboye, 2008; Sosik, 1997). Another variable that has been of interest is the level of stress or uncertainty in the situation. Fiedler and Macaulay’s (1999) review of the literature on Cognitive Resource Theory (Fiedler, 1995) suggests that experience is
more important than intelligence for leader success under stressful situations. Similarly, Yun, Faraj,
and Sims (2005) found that empowering leadership was more important in low-stress situations and
when team members were more experienced, whereas directive leadership was more effective in
high-stress situations and when team members were less experienced. Charismatic leadership theory
also suggests that leaders who express a vision, demonstrate sensitivity, and exhibit unconventional
behavior (e.g., self-sacrifice) are particularly effective during crisis situations (Halverson, Holladay,
Kazama, & Quiñones, 2004; Halverson, Murphy, & Riggio, 2004), when leading organizational
change initiatives (Zaccaro & Banks, 2004), and when promoting organizational growth in
entrepreneurial firms (Baum, Locke, & Kirkpatrick, 1998).2
FOLLOWERS
Follower characteristics may also impact what type of leader will be effective in a given situa-
tion. Followers’ satisfaction and perceptions of their abilities (Hersey & Blanchard, 1988); need for
autonomy (Yun, Cox, & Sims, 2006); openness to experience (Groves, 2005); motives (Wofford,
2 However, it should be noted that doubt has been cast on the importance of charismatic leadership. For example, in his
popular book, Good to Great, Collins (2001) suggested that the effects of charismatic leadership on organizations are short-lived and do not lead to overall organizational success.
Whittington, & Goodwin, 2001); and achievement orientation, self-esteem, and need for structure
(Ehrhart & Klein, 2001) have been shown to moderate the effectiveness of leaders’ behavior. From a
slightly different point of view, rather than matching a leader to situational demands, organizations
might seek to hire leaders who can adapt to any situational demand, highlighting the importance of
adaptability (Pulakos, Arad, Donovan, & Plamondon, 2000) and social/emotional intelligence. For
example, Goleman (2000) proposed that leaders should change their style on the basis of situational demands and that leaders high in emotional intelligence are better able to do so than leaders lower in emotional intelligence. Similarly, Hooijberg and Schneider (2001) suggested that
executive leaders high in social intelligence may be better able to adapt to differences in followers’
attitudes, personalities, and motives.
Identifying high potentials is explicitly an exercise in fit within a given organization. When placing
high potentials in positions, consideration should be given to how much the position will increase the
leadership capability of the individual and the organization (McCall, 1998), in addition to how well
the high potential will perform. These findings provide guidance for analyzing the situation in the organization that executives are being chosen to lead. Whereas traditional job analysis focused on responsibilities to be accomplished, competencies to attain, and human attributes of leaders, additional
efforts to understand situational demands may improve selection of executives and high potentials.
Taxonomies and measures of situational demands are fertile areas for research by I-O psychologists.
CULTURE
In addition to differences in the variables just reviewed, the culture in which a leader’s organization
is embedded can also impact leadership effectiveness. That is to say, certain leadership traits and
behaviors will be perceived more positively in certain cultures than others. Considerable evidence
for cross-cultural differences in leadership effectiveness has come from the work on the GLOBE
project (e.g., Dickson, Den Hartog, & Mitchelson, 2003; Javidan, Dorfman, de Luque, & House,
2006). For example, Javidan et al. (2006) reported that more individualistic cultures rate contingent-
reward leaders higher than they are rated in collectivistic cultures. Collectivistic cultures also prefer
leaders who avoid conflict and promote harmony. Despite the long list of differences, there are also
some universally positive leader characteristics including integrity, planning, and team building
skills. Needless to say, however, the national culture should have an influence on what types of
leaders are selected for an organization. Moreover, the type of selection practices used to hire execu-
tives should reflect the national culture from which an organization is hiring (Dipboye & Johnson,
in press).
DOES IT WORK?
In the previous sections, we reviewed several techniques used in executive selection. The amount and quality of published supportive evidence varied considerably across techniques. This raises the question: Does executive selection work? This question is particularly important given the marked increase
in executive turnover since the 1990s (Lucier, Spiegel, & Schuyt, 2002) and the increase in the
use of outside selection rather than internal promotion (Murphy & Zabojnik, 2004). In a unique
study, Russell (2001) reported a longitudinal study of performance among 98 top-level executives.
Information from interviews and questionnaires was integrated into ratings on several performance
dimensions by teams of executives and researchers via a process akin to the wrap-up discussion in
an AC. Competency ratings in problem-solving and people orientation predicted subsequent fiscal
and nonfiscal performance trends. This sort of research is difficult for several reasons. First, as
Hollenbeck (1994) pointed out, there are several inherent difficulties of CEO selection itself: each
CEO position is unique and may be changing in an uncertain future, the selection decision is unique,
the decision-makers may never have made such a decision and are probably not trained to do so, the
process is probably not completely open, and outside forces may come into play.
Second, it is difficult to conduct a good study to determine if the process was successful.
Hollenbeck (1994) offered a partial list of explanations: the long time involved to select one per-
son, high secrecy surrounding this high-stakes choice, and difficulty for credible researchers to
get access to the expensive process. There are also technical research problems precluding classic
criterion validation studies: small sample sizes, restricted range in measures of key variables such as intelligences, resistance of candidates to submit to onerous and sensitive assessment procedures,
inherent limitations (e.g., faking, biased rating by self and others) of some promising constructs,
difficulty of accessing a comparison group of individuals not selected, and complexity of any cri-
terion measure. The difficulty of finding a meaningful criterion of effectiveness of selecting high
potentials and executive leaders bedevils researchers and practitioners. Appealing as it may appear,
an index of organizational performance as a complex criterion measure has proven to be a conten-
tious topic in the leadership literature. The lack of published empirical studies of the accuracy of
executive selection procedures may be lamentable, but hardly surprising.
Furthermore, there has been a debate over the extent to which leadership impacts organiza-
tional performance. For example, Pfeffer and Salancik (1978) and others (e.g., Meindl, Ehrlich, &
Dukerich, 1985) have argued that leadership has a minimal impact on organizational performance.
Lieberson and O’Connor (1972) found that leadership had a smaller impact on organizational per-
formance than the factors of industry or company. Despite these arguments, other researchers have
demonstrated the enormous potential for leaders to affect organizational performance (see Day &
Lord, 1988, for a review). Weiner and Mahoney (1981) found that CEO change accounted for 43.9%
of the variance in organizational profitability and 47% of the variance in stock prices. As Svensson
and Wood (2006) suggested, organizational performance can be based on leader skillfulness or
serendipity.
However, the relationship between leadership and organizational performance (or lack thereof)
does not suggest that leader selection efforts are necessarily successful (or unsuccessful). The CEO’s
team may play a large role in success or failure. If all leader selection efforts were successful, then
there would be no variance in the impact of leadership selection on performance. In fact, evidence
of the effect of leader succession and change on organizational performance suggests just the oppo-
site. Many organizations are making bad leadership selection decisions, just as others are making
good selection decisions. There is a definite need for research on the use of different selection meth-
ods on organizational and leadership performance to further address this issue.
CONCLUSIONS
There are many ways executives get the job done. There is no agreed upon list of executive com-
petencies or attributes. Competencies that are commonly listed include strategic global perspec-
tive, strategic thinking, performance orientation, and emphasis on people development. To get the
job done, the executive will have a pattern of human attributes needed by the organization at the
point in time of selection. No single attribute or simple profile of attributes is related to executive
effectiveness. Rather, the needed attributes form a unique profile that includes some forms of intelligence, personality characteristics and values, as well as experience, knowledge, and effective interpersonal skills. Organizations use various methods to assess these attributes. The quality of tests, personality questionnaires, and interviews has improved over the years. But these procedures are used in different ways at each stage of the process: They are used in more formal, systematic, quantitative ways when screening candidates into pools of high potentials, but in more informal and variable ways during the integration of information at the time of selection into executive ranks. The most common
method of selecting executives remains the performance review process by higher-level executives
and the board of directors. In larger companies, a process patterned after GE’s Session C has
become common.
The rigor of these final steps of leader selection varies considerably. Observers of these pro-
cesses have lamented the lack of consistency and sophistication shown by many organizations.
Suggestions have been made for more systematic processes of determining organization needs,
assessing competencies in candidates, and matching competencies to needs, but little research has
been conducted to evaluate these methods. I-O psychologists have helped articulate and evaluate
the attributes related to leader effectiveness and have been involved in designing programs to screen
candidates into high-potential pools and to develop leadership and managerial skills. In addition, I-O
psychologists have specialized in individual assessment of external candidates. But, with few excep-
tions of psychologists who consult with CEOs and boards, they have not played extensive roles in
the final stages of executive selection among internal candidates.
RESEARCH OPPORTUNITIES
Future involvement of I-O psychologists could include further articulation of the competencies and
attributes needed for diverse organizational challenges (e.g., defining and assessing characteristics
related to long-term success, such as character); specification of organizational and environmental
variables that need to be considered in determining fit; understanding how the complex process of
executive selection will be done differently in every job and every organization; and training CEOs,
top executive teams, and boards of directors in processes of matching candidates to demands of
positions. Consulting firms are becoming more involved in assessing internal and external candi-
dates for CEO positions, and these assignments may provide opportunities for I-O psychologists
to apply more scientific methods to the selection of CEOs (A. Howard, personal communication,
March 3, 2008).
On the basis of our observations of the field of executive selection, we note that there was much
systematic research in the 1960s to 1980s on early identification of management potential and executive selection, but not as much recent published research. Various assessment techniques (e.g., biodata, ACs, and cognitive ability tests) were evaluated for selection, but over the past two decades the emphasis has been placed on development. Considering the noted scandals in executive ranks, selection may
be gaining renewed importance as the cost of executive failure becomes higher. The concern for
fit is probably the most significant development in executive selection in the last 20 years. More
research is needed into the judgmental processes that are needed to combine complex patterns of
information about candidates on the one hand with the complex changing patterns of organizational
and situational demands on the other hand. At the risk of appearing nihilistic, we suggest that the
traditional statistical methods used by I-O psychologists to study relationships of predictor scores
and criterion measures may not be up to the task of understanding the processes of executive selec-
tion at the highest levels of organizations. Studies of these complex processes of executive selection
may call for different research methods to study executive selection, including clinical methods of
judgment, policy capturing with executive recruiters, and a return to dormant complex statistical
validation strategies such as synthetic validation (McPhail, 2007).
REFERENCES
Atwater, L. E., Dionne, S. D., Avolio, B., Camobreco, F. E., & Lau, A. W. (1999). A longitudinal study of
the leadership development process: Individual differences predicting leadership effectiveness. Human
Relations, 52, 1543–1562.
Atwater, L. E., & Waldman, D. (1998). Accountability in 360 degree feedback. HR Magazine, 43, 96–104.
Avolio, B., & Gardner, W. (2005). Authentic leadership development: Getting to the root of positive forms of
leadership. The Leadership Quarterly, 16, 315–338.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A
meta-analysis. Personnel Psychology, 44, 1–26.
Bass, B. M. (1990). Bass and Stogdill’s handbook of leadership: Theory, research, and managerial applica-
tions (3rd ed.). New York, NY: Free Press.
Baum, R. J., Locke, E. A., & Kirkpatrick, S. A. (1998). A longitudinal study of the relation of vision and vision
communication to venture growth in entrepreneurial firms. Journal of Applied Psychology, 83, 43–54.
836 Handbook of Employee Selection
Bentz, V. J. (1990). Contextual issues in predicting high-level leadership performance. In K. E. Clark & M. B.
Clark (Eds.), Measures of leadership (pp. 131–143). West Orange, NJ: Leadership Library of America.
Brannick, M. T., & Levine, E. L. (2002). Job analysis. Thousand Oaks, CA: Sage.
Brown, M. E., & Trevino, L. K. (2006). Ethical leadership: A review and future directions. The Leadership
Quarterly, 17, 595–616.
Brown, M. E., Trevino, L. K., & Harrison, D. A. (2005). Ethical leadership: A social learning perspective
for construct development and testing. Organizational Behavior and Human Decision Processes, 97,
117–134.
Byham, W. C., Smith, A. B., & Paese, M. J. (2002). Grow your own leaders. Pittsburgh, PA: DDI Press.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., III, & Weick, K. E., Jr. (1970). Managerial behavior, perfor-
mance, and effectiveness. New York, NY: McGraw-Hill.
Charan, R., Drotter, S., & Noel, J. (2001). The leadership pipeline: How to build the leadership-powered
company. San Francisco, CA: Jossey-Bass.
Chemers, M. M., Watson, C. B., & May, S. (2000). Dispositional affect and leadership effectiveness: A
comparison of self-esteem, optimism and efficacy. Personality and Social Psychology Bulletin, 26,
267–277.
Church, A. H. (2000). Do higher performing managers actually receive better ratings? A validation of multi-
rater assessment methodology. Consulting Psychology Journal: Practice and Research, 52, 99–116.
Clark, K. E., & Clark, M. B. (1990). Measures of leadership. West Orange, NJ: Leadership Library of
America.
Collins, J. (2001). Good to great: Why some companies make the leap . . . and others don’t. New York, NY:
HarperCollins.
Craig, S. B., & Hannum, K. (2006). Research update: 360-degree performance assessment. Consulting
Psychology Journal: Practice and Research, 58, 117–122.
Dalessio, A. T. (1998). Using multi-source feedback for employee development and personnel decisions. In
J. W. Smither (Ed.), Performance appraisals: State-of-the-art in practice (pp. 278–330). San Francisco,
CA: Jossey-Bass.
Dalton, M. (1996). Multirater feedback and conditions for change. Consulting Psychology Journal: Practice
and Research, 48, 12–16.
Daus, C. S., & Ashkanasy, N. M. (2005). The case for the ability-based model of emotional intelligence in
organizational behavior. Journal of Organizational Behavior, 26, 453–466.
Day, D. V. (2001). Leadership development: A review in context. The Leadership Quarterly, 11, 581–613.
Day, D. V., & Lord, R. G. (1988). Executive leadership and organizational performance: Suggestions for a new
theory and methodology. Journal of Management, 14, 453–464.
Den Hartog, D. N., & Koopman, P. L. (2001). Leadership in organizations. In N. Anderson, D. S. Ones, H. H.
Sinangil, & C. Viswesvaran (Eds.), Handbook of industrial, work, and organizational psychology. Vol. 2:
Organizational Psychology (pp. 167–187). Thousand Oaks, CA: Sage.
DeNisi, A. S., & Kluger, A. N. (2000). Feedback effectiveness: Can 360-degree appraisals be improved?
Academy of Management Executive, 14, 129–139.
Dipboye, R. L., & Johnson, S. K. (in press). A cross-cultural perspective on employee selection. In D. Stone,
E. F., Stone-Romero, & E. Salas (Eds.), The influence of cultural diversity on human resources practices.
Mahwah, NJ: Lawrence Erlbaum.
Dipboye, R. L., Smith, C. S., & Howell, W. C. (1994). Understanding industrial and organizational psychol-
ogy: An integrated approach. Orlando, FL: Harcourt Brace.
Dickson, M. W., Den Hartog, D. N., & Mitchelson, J. (2003). Research on leadership in a cross-cultural context:
Making progress and raising new questions. The Leadership Quarterly, 14, 729–768.
Ehrhart, M. G., & Klein, K. J. (2001). Predicting followers’ preferences for charismatic leadership: The influ-
ence of follower values and personality. The Leadership Quarterly, 12, 153–179.
Evers, A., Anderson, N., & Voskuijl, O. (2005). Handbook of personnel selection. Malden, MA: Blackwell.
Fiedler, F. (1967). A theory of leadership effectiveness. New York, NY: McGraw-Hill.
Fiedler, F. (1995). Cognitive resources and leadership performance. Applied Psychology: An International
Review, 44, 5–28.
Fiedler, F. E., & Macaulay, J. L. (1999). The leadership situation: A missing factor in selecting and training
managers. Human Resource Management Review, 8, 335–350.
Frisch, M. H. (1998). Designing the individual assessment process. In R. Jeanneret & R. Silzer (Eds.), Individual
psychological assessment (pp. 135–177). San Francisco, CA: Jossey-Bass.
Goleman, D. (1998). Working with emotional intelligence. New York, NY: Bantam Books.
Goleman, D. (2000). Leadership that gets results. Harvard Business Review, 78, 78–90.
Selecting Leaders 837
Goodstone, M. S., & Diamante, T. (1998). Organizational use of therapeutic change: Strengthening multisource
feedback systems through interdisciplinary coaching. Consulting Psychology Journal: Practice and
Research, 50, 152–163.
Groves, K. S. (2005). Linking leader skills, follower attitudes, and contextual variables via an integrated model
of charismatic leadership. Journal of Management, 31, 255–277.
Halverson, S. K., Holladay, C. L., Kazama, S. K., & Quiñones, M. A. (2004). Self-sacrificial behavior in crisis
situations: The competing roles of behavioral and situational factors. The Leadership Quarterly, 15,
263–275.
Halverson, S. K., Murphy, S. E., & Riggio, R. E. (2004). Charismatic leadership in crisis situations: A labora-
tory investigation of stress and crisis. Small Group Research, 35, 495–514.
Halverson, S. K., Tonidandel, S., Barlow, C., & Dipboye, R. L. (2005). Self-other agreement on a 360-degree
leadership evaluation. In S. Reddy (Ed.), Perspectives on multirater performance assessment (pp. 125–
144). Nagarjuna Hills, Hyderabad, India: ICFAI Books.
Hersey, P., & Blanchard, K. H. (1984). Management of organizational behavior (4th ed.). Englewood Cliffs,
NJ: Prentice-Hall.
Highhouse, S. (2002). Assessing the candidate as a whole: A historical and critical analysis of individual psy-
chological assessment for personnel decision making. Personnel Psychology, 55, 363–396.
Hogan, R., & Hogan, J. (2001). Assessing leadership: A view from the dark side. International Journal of
Selection and Assessment, 9, 40–51.
Hogan, R., & Kaiser, R. B. (2005). What we know about leadership. Review of General Psychology, 9,
169–180.
Hollenbeck, G. P. (1994). CEO selection: A street-smart review. Greensboro, NC: Center for Creative
Leadership.
Hooijberg, R., & Schneider, M. (2001). Behavioral complexity and social intelligence: How executive leaders
use stakeholders to form a systems perspective. In S. J. Zaccaro & R. J. Klimoski (Eds.), The nature of
organizational leadership (pp. 104–131). San Francisco, CA: Jossey-Bass.
House, R. J. (1971). A path-goal theory of leader effectiveness. Administrative Science Quarterly, 16,
321–339.
Howard, A. (1997). A reassessment of assessment centers: Challenges for the 21st century. Journal of Social
Behavior and Personality, 12, 13–52.
Howard, A. (2001). Identifying, assessing, and selecting senior leaders. In S. J. Zaccaro & R. Klimoski (Eds.),
The nature and context of organizational leadership (pp. 305–346). San Francisco, CA: Jossey-Bass.
Howard, A., & Bray, D. W. (1988). Managerial lives in transition: Advancing age and changing times. New
York, NY: Guilford Press.
Howard, A., Erker, S., & Bruce, N. (2007). Selection forecast 2006/2007. Pittsburgh, PA: Development
Dimensions International.
Hoyt, C. L., Murphy, S. E., Halverson, S. K., & Watson, C. B. (2003). Group leadership: Efficacy and effective-
ness. Group Dynamics: Theory, Research, & Practice, 7, 259–274.
Javidan, M., Dorfman, P. W., de Luque, M. S., & House, R. J. (2006). In the eye of the beholder: Cross cultural
lessons in leadership from project GLOBE. Academy of Management Perspectives, 20, 67–90.
Jeanneret, P. R., & Silzer, R. (Eds.). (1998). Individual psychological assessment. San Francisco, CA: Jossey-
Bass.
Johnson, S. K., & Dipboye, R. L. (2008). Effects of task charisma conduciveness on the effectiveness of char-
ismatic leadership. Group & Organization Management, 33, 77–106.
Judge, T. A., & Bono, J. E. (2001). Relationship of core self-evaluation traits—self-esteem, generalized self-
efficacy, locus of control, and emotional stability—with job satisfaction and job performance. Journal of
Applied Psychology, 86, 80–92.
Judge, T. A., Bono, J. E., Ilies, R., & Gerhardt, M. W. (2002). Personality and leadership: A quantitative and
qualitative review. Journal of Applied Psychology, 87, 765–780.
Korman, A. K. (1968). The prediction of managerial performance: A review. Personnel Psychology, 21,
295–322.
Krause, D. E., Kersting, M., Heggestad, E. D., & Thornton, G. C., III. (2006). Incremental validity of assess-
ment center ratings over cognitive ability tests: A study at the executive management level. International
Journal of Selection and Assessment, 14, 360–371.
Law, K. S., Wong, C., & Song, L. J. (2004). The construct and criterion validity of emotional intelligence and
its potential utility for management studies. Journal of Applied Psychology, 89, 483–496.
Lieberson, S., & O’Connor, J. F. (1972). Leadership and organizational performance: A study of large corpora-
tions. American Sociological Review, 37, 117–130.
London, M., & Smither, J. W. (1995). Can multi-source feedback change perceptions of goal accomplish-
ment, self-evaluations, and performance related outcomes? Theory-based applications and directions for
research. Personnel Psychology, 48, 803–839.
Lord, R. G., DeVader, C. L., & Alliger, G. M. (1986). A meta-analysis of the relation between personality traits
and leadership perceptions: An application of validity generalization procedures. Journal of Applied
Psychology, 71, 402–410.
Lucier, C., Spiegel, E., & Schuyt, R. (2002). Why CEOs fall: The causes and consequences of turnover at the top.
Retrieved December 17, 2007, from http://www.boozallen.com/media/file/CEO-Turnover-Study-sb28.pdf
McCall, M. W. (1998). High flyers: Developing the next generation of leaders. Boston, MA: Harvard Business
School Press.
McCall, M. W., & Hollenbeck, G. P. (2001). Developing global executives. Boston, MA: Harvard Business
School Press.
McClelland, D. C. (1996). Does the field of personality have a future? Journal of Research in Personality, 30,
429–434.
McCrae, R. R., & Costa, P. T. (1989). The structure of interpersonal traits: Wiggins’s circumplex and the Five-
Factor model. Journal of Personality and Social Psychology, 56, 586–595.
McPhail, S. M. (2007). Alternative validation strategies: Developing new and leveraging existing evidence.
San Francisco, CA: Jossey-Bass.
Meindl, J. R., Ehrlich, S. B., & Dukerich, J. M. (1985). The romance of leadership. Administrative Science
Quarterly, 30, 78–102.
Menkes, J. (2005). Executive intelligence: What all great leaders have. New York, NY: HarperCollins.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007).
Reconsidering the use of personality tests in selection contexts. Personnel Psychology, 60, 683–729.
Morgeson, F. P., Mumford, T. V., & Campion, M. A. (2005). Coming full circle: Using research and practice
to address 27 questions about 360-degree feedback programs. Consulting Psychology Journal: Practice
and Research, 57, 196–209.
Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The impor-
tance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58,
583–611.
Moses, J. L., & Eggebeen, S. L. (1999). Building room at the top. In A. Kraut & A. K. Korman (Eds.), Evolving
practices in human resource management: Responses to a changing world of work (pp. 201–225). San
Francisco, CA: Jossey-Bass.
Murphy, K. J., & Zabojnik, J. (2004). CEO payment and appointments: A market-based explanation for recent
trends. American Economic Review, 94, 192–196.
Nadler, D. A., Behan, B. A., & Nadler, M. B. (Eds.). (2006). Building better boards: A blueprint for effective
governance. San Francisco, CA: Jossey-Bass.
Ones, D. S., Dilchert, S., Viswesvaran, C., & Judge, T. A. (2007). In support of personality assessment in orga-
nizational settings. Personnel Psychology, 60, 999–1027.
Ones, D. S., Viswesvaran, C., & Dilchert, S. (2005). Cognitive ability in personnel selection decisions. In A.
Evers, N. Anderson, & O. Voskuijl (Eds.), Handbook of personnel selection (pp. 143–173). Malden, MA:
Blackwell.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities:
Findings and implications for personnel selection and theories of job performance. Journal of Applied
Psychology, 78, 679–703.
Pfeffer, J., & Salancik, G. R. (1978). The external control of organizations: A resource dependence perspective.
New York, NY: Harper & Row.
Piotrowski, M. (2007, April). Individual assessment today: What works, and what doesn’t work! Presentation at
the Annual Meeting of the Society for Industrial and Organizational Psychology, New York, NY.
Podratz, L. T., Hungerford, M. K., Schneider, J. R., & Bui, T. (2007, April) Global high-potential assessment:
Supporting a global talent pipeline. Practice Forum presented at the 22nd Annual Conference of the
Society for Industrial and Organizational Psychology, New York, NY.
Prien, E. P., Schippmann, J. S., & Prien, K. O. (2003). Individual assessment as practiced in industry and con-
sulting. Mahwah, NJ: Lawrence Erlbaum.
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace:
Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85, 612–624.
Pulakos, E. D., Borman, W. C., & Hough, L. M. (1988). Test validation for scientific understanding: Two
demonstrations of an approach for studying predictor-criterion linkages. Personnel Psychology, 41,
703–716.
Riggio, R. E., Murphy, S. E., & Pirozzolo, F. J. (2002). Multiple intelligences and leadership. Mahwah, NJ:
Lawrence Erlbaum.
Riggio, R. E., Riggio, H. R., Salinas, C., & Cole, E. J. (2003). The role of social and emotional communica-
tion skills in leader emergence and effectiveness. Group Dynamics: Theory, Research, and Practice, 7,
83–103.
Riggio, R. E., & Taylor S. J. (2000). Personality and communication skills as predictors of hospice nurse per-
formance. Journal of Business and Psychology, 15, 351–359.
Russell, C. J. (2001). A longitudinal study of top-level executive performance. Journal of Applied Psychology,
86, 560–573.
Ryan, A. M., & Sackett, P. R. (1987). A survey of individual assessment practices by I-O psychologists.
Personnel Psychology, 40, 455–488.
Salancik, G. R., Calder, B. J., Rowland, K. M., Leblebici, H., & Conway, M. (1975). Leadership as an out-
come of social structure and process: A multidimensional analysis. In J. G. Hunt & L. L. Larson (Eds.),
Leadership frontiers (pp. 81–101). Kent, OH: Kent State University Press.
Salovey, P., & Mayer, J. D. (1990). Emotional intelligence. Imagination, Cognition, and Personality, 9,
185–211.
Sessa, V. I. (2001). Executive promotion and selection. In M. London (Ed.), How people evaluate others in
organizations (pp. 91–110). Mahwah, NJ: Lawrence Erlbaum.
Sessa, V. I., & Taylor, J. J. (2000). Executive selection. San Francisco, CA: Jossey-Bass.
Silzer, R. (2002). The 21st century executive: Innovative practices for building leadership at the top. San
Francisco, CA: Jossey-Bass.
Smith, J. A., & Foti, R. J. (1998). A pattern approach to the study of leader emergence. The Leadership
Quarterly, 9, 147–160.
Sosik, J. J. (1997). Effects of transformational leadership and anonymity on idea generation in computer-
mediated groups. Group & Organization Management, 22, 460–485.
Sternberg, R. J., & Vroom, V. (2002). Theoretical letters: The person versus the situation in leadership. The
Leadership Quarterly, 13, 301–323.
Stokes, G. S., & Cooper, L. A. (1995). Selection using biodata: Old notions revisited. In G. S. Stokes, M. D.
Mumford, & W. A. Owens (Eds.), Biodata handbook (pp. 311–349). Palo Alto, CA: CPP Books.
Stokes, G. S., Mumford, M. D., & Owens, W. A. (Eds.). (1995). Biodata handbook. Palo Alto, CA: CPP
Books.
Svensson, G., & Wood, G. (2006). Sustainable components of leadership effectiveness in organizational perfor-
mance. Journal of Management Development, 25, 522–534.
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at the crossroads: A response to Morgeson, Campion,
Dipboye, Hollenbeck, Murphy, & Schmitt (2007). Personnel Psychology, 60, 967–993.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job performance: A
meta-analytic review. Personnel Psychology, 44, 703–742.
Thomas, J. C. (2004). Comprehensive handbook of psychological assessment. Vol. 4: Industrial and organiza-
tional assessment. Hoboken, NJ: Wiley.
Thornton, G. C., III, & Byham, W. C. (1982). Assessment centers and managerial performance. New York,
NY: Academic Press.
Thornton, G. C., III, & Krause, D. E. (2009). Comparison of practices in selection vs. development assess-
ment centers: An international survey. International Journal of Human Resource Management, 20,
478–498.
Thornton, G. C., III, & Rupp, D. E. (2006). Assessment centers in human resource management: Strategies for
prediction, diagnosis, and development. Mahwah, NJ: Lawrence Erlbaum.
Toegel, G., & Conger, J. A. (2003). 360-degree assessment: Time for reinvention. Academy of Management
Learning & Education, 2, 297–311.
Tornow, W. W. (1993). Perceptions or reality: Is multi-perspective measurement a means or an end? Human
Resource Management, 32, 211–219.
Wackerle, F. W. (2001). The right CEO: Straight talk about making tough CEO selection decisions. San
Francisco, CA: Jossey-Bass.
Weiner, N., & Mahoney, T. A. (1981). A model of corporate performance as a function of environmental, orga-
nizational, and leadership processes. Academy of Management Journal, 24, 453–470.
Wofford, J. C., Whittington, J. L., & Goodwin, V. L. (2001). Follower motive patterns as situational moderators
for transformational leadership effectiveness. Journal of Managerial Issues, 13, 196–211.
Yun, S., Cox, J., & Sims, H. P. (2006). The forgotten follower: A contingency model of leadership and follower
self-leadership. Journal of Managerial Psychology, 21, 374–388.
Yun, S., Faraj, S., & Sims, H. (2005). Contingent leadership and effectiveness of trauma resuscitation teams.
Journal of Applied Psychology, 90, 1288–1296.
Zaccaro, S. J. (2001). The nature of executive leadership: A conceptual and empirical analysis of success.
Washington, DC: APA Books.
Zaccaro, S. J., & Banks, D. (2004). Leader visioning and adaptability: Building the gap between research and
practice on developing the ability to manage change. Human Resource Management, 43, 367–380.
Part 8
Milestones in Employee Selection
Walter C. Borman and Benjamin Schneider,
Section Editors
39 The Management Progress Study and Its Legacy for Selection
Ann Howard
The Management Progress Study (MPS) is a longitudinal study of Bell System telephone company
managers that actively collected data from the mid-1950s to the mid-1980s. AT&T’s primary moti-
vation for initiating the study was to generate knowledge about how managers’ lives and careers
develop over time. This long-term pursuit of basic human resources research in a corporate setting
remains a unique and outstanding achievement. However, it was founding director Douglas W.
Bray’s primary research method—the management assessment center—that made the study a clas-
sic in personnel selection. The foundations for this method came primarily from the military, but
Bray adapted it to organizational settings in a way that has endured for more than 50 years.
This chapter describes the beginnings of MPS, its methodology, and major findings that bear
on selection, including a brief review of how successful managers change and develop over time.
A discussion of what has been preserved from the MPS assessment center model, what has been
enhanced, and what has been sidetracked or overlooked helps frame the MPS legacy for selection.
The chapter closes with hypotheses about the durability of specific research findings.
EARLY DEVELOPMENTS
The roots of the AC method are usually traced to the testing and observation of German mili-
tary officers in World War I. Air Force officer candidates participated in tests, interviews, physical
exercises, discussion groups, and simulations, such as explaining a problem to a group of soldiers.
Psychologists observed the candidates and provided evaluation data to a senior military officer, who
made the final selection among candidates (Martin, 1941).
The German program was among the first to use both multiple assessment techniques and mul-
tiple assessors. It also made some attempts to obtain behavioral samples and brought in the idea
of holistic assessment rather than trait measurement. However, the simulations were crude and
inconsistent across applications, and there was a lack of standardization in administration and
recording observations. Moreover, the target of measurement (overall character of leaders) and
some of the measures used (handwriting analysis, facial expression) were misguided. The validity
of the program was not examined (Thornton & Byham, 1982).
The idea of bringing together multiple techniques and multiple assessors also grew among personality
researchers at the Harvard Psychological Clinic in the 1930s. Dr. Henry Murray, a physician turned
psychologist, thought that better diagnoses of normal personality would result from bringing diverse
specialists together to make a holistic evaluation. Murray and his colleagues also developed unique
instruments to study personality such as the Thematic Apperception Test, or TAT. However, their
evaluations of these early techniques focused on the extent to which they elicited rich material for
interpretation of personality rather than their relationship to external criteria (Murray, 1938).
The OSS assessment program of World War II demonstrated the potential of simulations for
eliciting behavior in a wide range of situations, and all methods contributed significantly to one
or more final dimension ratings. The overall effectiveness rating from assessments at two locations
had an average correlation of .45 with a field performance appraisal by overseas staff; the median
correlation was .37 at Station S and .53 at Station W, each corrected for range restriction (OSS
Assessment Staff, 1948).
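The correlations cited here were corrected for range restriction, which arises because validity is estimated only on the selected (restricted) group. As a hedged sketch of what such a correction involves (the formula and numbers below are illustrative; they are not the OSS staff's actual computation), Thorndike's Case II adjustment for direct restriction on the predictor can be written as:

```python
import math

def restriction_corrected_r(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    r_restricted -- validity coefficient observed in the selected (restricted) group
    sd_ratio     -- unrestricted SD / restricted SD of the predictor (> 1 under restriction)
    """
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1 - r_restricted**2 + (r_restricted**2) * u**2)

# Hypothetical numbers for illustration only: an observed r of .30 when the
# applicant-pool SD is 1.5 times the selected-group SD.
corrected = restriction_corrected_r(0.30, 1.5)  # ≈ .43
```

The corrected value always exceeds the observed value when the predictor's variance has been curtailed by selection, which is why restricted-sample validities understate operational validity.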
In 1948, while finishing his doctorate at Yale, Douglas W. Bray read Assessment of Men, the
book describing the OSS experience, and became excited by the possibility of evaluating people by
observing their live behavior. He had already recognized the power of watching people in action
from his own World War II experience evaluating the performance of bombardiers. In his first
post-PhD position on Princeton’s Study of Education, he proposed using the AC method to study
students, but the university was reluctant to implement such a bold initiative. It was not until 1956
that Bray finally got the opportunity to put an AC into action when he accepted a new position at
AT&T.
The MPS assessment center differed widely from its military predecessors in terms of what was measured and the methods used. Instead
of enduring physical activities or keeping one’s identity as a spy under duress, participants engaged
in activities directly related to the challenges of a middle manager in a large organization.
The variables measured were qualities considered important to managerial success. Bray con-
solidated a list of 25 qualities from three sources: the management and psychological literature,
behavioral scientists outside of the Bell System, and senior personnel executives inside of the
System (Bray, Campbell, & Grant, 1974). These variables, later called dimensions, clustered into
seven factors common to both samples (Bray & Grant, 1966): administrative skills (e.g., organizing
and planning), interpersonal skills (e.g., leadership skills), intellectual ability (e.g., general mental
ability), stability of performance (e.g., tolerance of uncertainty), work involvement (e.g., primacy of
work), advancement motivation (e.g., need for advancement), and independence (e.g., low need for
superior approval).
In addition to simulations, the exercises included background interviews, a personal history
questionnaire, biographical essays, an expectations inventory, cognitive tests (School and College
Ability Test, Critical Thinking in Social Science Test, Contemporary Affairs Test), personality
and motivation inventories (Edwards Personal Preference Schedule, Guilford-Martin Inventory,
Sarnoff’s Survey of Attitudes Toward Life, Q sort, Bass version of California F-scale), and projec-
tive tests (Murray’s TAT, Rotter Incomplete Sentences Blank, Katkovsky’s Management Incomplete
Sentences Test).
However, it was the business simulations that drew the most attention. There were three of these:
an in-basket, a business game, and a leaderless group discussion. The in-basket technique was first
developed by the Educational Testing Service to measure curriculum effectiveness at the U.S. Air
Force Air College (Frederiksen, Saunders, & Wand, 1957). The MPS assessee received memos,
reports, and records of telephone calls that might well accumulate on the desk of a mid-level man-
ager. In responding to these items, the assessee had to initiate meetings, delegate tasks, make deci-
sions, and engage in planning.
Also new was the business game, which used Tinker Toys to simulate a manufacturing situa-
tion. A team bought parts and assembled them to make one of five toys, which they sold back to the
assessors. The prices of the raw materials and the completed toys fluctuated throughout the game.
In a group discussion exercise, six assessees had to decide which of their direct reports should be
promoted to fill a vacancy. Each participant made a short presentation recommending his candidate.
Then the group had to discuss and agree on which person to promote.
Assessors, most of whom were psychologists, wrote reports that described individual partici-
pants’ behavior in each simulation, summarized interview information, and interpreted the
projective tests. They read their reports aloud in an integration session at which test scores and
all other information on the candidate were also revealed. Each assessor individually rated the
25 dimensions for each participant and then reviewed each rating with the group. If assessors’
ratings disagreed by at least two points on a five-point scale, they discussed the evidence until they
came to a consensus rating.
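The integration session described above amounts to a simple disagreement rule over assessor ratings. A minimal sketch follows; the dimension names, ratings, and function names are hypothetical illustrations, not MPS data:

```python
def needs_discussion(ratings, threshold=2):
    """True when assessors' 1-5 ratings of a dimension differ by at least `threshold` points."""
    return max(ratings) - min(ratings) >= threshold

def flag_dimensions(dimension_ratings):
    """Map each dimension name to whether its ratings must be discussed to consensus."""
    return {dim: needs_discussion(r) for dim, r in dimension_ratings.items()}

# Hypothetical ratings from three assessors:
flags = flag_dimensions({
    "organizing and planning": [4, 4, 3],  # spread of 1: individual ratings stand
    "leadership skills": [2, 4, 3],        # spread of 2: discuss until consensus
})
```

The two-point trigger trades efficiency for reliability: small disagreements are tolerated as normal rater variance, while larger ones force the evidence back on the table.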
The assessment staff then made predictions about each man’s future with the Bell System. Three
types of predictions were made: Will this person be advanced to the entry level of middle manage-
ment (third level) within 10 years? Should he be? And will he stay with the Bell System for the rest
of his career? The “will” prediction took into account not only latent potential but also how decision-
makers in the company would react to the participant. Thus, “will” was the best prediction of
whether the person would be promoted, whereas “should” was the truer measure of management
potential.
The extensive array of data collected on each participant enabled deep insight into each man’s
background, lifestyle, work interests, personality, motivation, and abilities. Differences between the
college and noncollege samples were pronounced from the start (Howard & Bray, 1988). For exam-
ple, each background interview was scored on nine life themes, such as marital-familial, financial-
acquisitive, religious-humanism, and occupational (Rychlak & Bray, 1967). The more experienced
The Management Progress Study and Its Legacy for Selection 847
and mostly married noncollege men were more invested in their marital family and in their work
life, whereas the college men were more likely to focus on improving their minds and bodies (ego-
functional life theme) and on providing service to the community. Personality and motivation measures
showed the college men to be considerably more ambitious and impulsive than their noncollege
counterparts. Although the two groups were similar in administrative and interpersonal skills, the college
men far surpassed the noncollege in cognitive ability. Factor scores computed from the assessors’
dimension ratings showed the college men much higher on the intellectual ability and advancement
motivation factors. The noncollege men scored higher on the independence factor, meaning that
they were freer of others’ influence and better able to stand on their own.
The assessors predicted that more college than noncollege men had true potential for middle
management, which was not surprising in light of the differences in intellectual ability and advance-
ment motivation. However, they found nearly half of the college men (47%) and more than two-
thirds of the noncollege men (69%) as lacking middle-management potential, and they rated still
more as questionable. These findings indicated that the Bell System was in great need of a better
way of selecting higher-level managers (Howard & Bray, 1988).
The predictions about rising into middle management were for 10 years after the assessment;
thus, evidence to verify the accuracy of the predictions accumulated slowly. Although individual
results of the assessment were kept confidential, the activities of the center were not a secret. When
Michigan Bell executives heard about the AC, they asked if the method could be used to help select
foremen. Bray understood that operational use would require disengaging the method from its psy-
chological underpinnings and making it accessible to laypeople. He removed all of the personality,
motivational, and other psychological techniques and left only the mental ability test to complement
the simulations. Bray and his staff then trained the managers to conduct the assessment, observe
behavior in the simulations, write reports, make judgments about the relevant dimensions, and rate
candidates’ potential performance as a foreman. The first operational center opened in Michigan
Bell in 1958.
The AC method quickly spread throughout the Bell System and was adapted to additional man-
agement positions. The centers were staffed entirely by managers, who went through assessor
training programs. The orientation of the assessment switched from clinical judgments to a more
behavioral approach; assessors were encouraged to use the behavior observed in the exercises to
predict future behavior on the job, not to try to be amateur psychologists. The dimensions sought
were determined by job analyses of target jobs rather than educated guesses, although many of the
original dimensions endured.
[Figure: Bar chart of the percentage of noncollege and college participants reaching middle management, plotted separately for those given a “Yes” assessment prediction and those given a “No (or ?)” prediction; values shown are 9%, 22%, 40%, and 58%.]
Follow-up interviews covered the participants’ attitudes, expectations, and other aspects of their work and personal lives. The interview program also
included those who had left the Bell System. In conjunction with the interviews, participants filled
out questionnaires about their expectations and attitudes toward management. The research-
ers also gathered others’ views about the participants’ work assignments and job performance.
Personnel representatives provided this information initially, but the responsibility eventually
shifted to each participant’s immediate supervisor. The supervisor interviews were coded for
various work conditions but were less fruitful as validation criteria than was progression in
management.
By the time 8 years had passed, it became clear that the interviews could not fully evaluate
how the men’s lives and careers had changed as they fulfilled their destiny as business managers.
AT&T therefore decided to put those participants still employed in the Bell System (about 75% of
the total group) through another AC duplicating or paralleling the one they had undergone 8 years
previously. Richard Campbell served as administrator of the data collection while Bray continued
as director of MPS and Donald Grant supervised the data analyses. Following this second AC, the
interview program continued on a 3-year rather than an annual basis.
As the 20th year of the study approached, the participants were in their 40s and 50s, many
careers had plateaued, and thoughts of retirement had begun to stir. It was an ideal time to gather
in-depth information about the later stages of the MPS participants’ careers. Doug Bray once again
determined that an AC was the best way to develop the deepest understanding, and he hired me as a
new MPS collaborator. My lucky break, which came as I was completing my PhD at the University
of Maryland, stemmed from an article about ACs that I originally wrote for a graduate class and was
published in the Academy of Management Journal (Howard, 1974), thanks to the encouragement of
Benjamin Schneider, one of my professors. The article stimulated an invitation to an AC conference
at the Center for Creative Leadership, where I became acquainted with Doug, who had also read the
article. When there was an unexpected job opening at AT&T a few months later, Doug recruited me,
and I began my new position in August 1975.
The third MPS AC, which began in 1976, repeated only about one-third of the tests and question-
naires given at the previous two centers. The rest of the assessment included various new exercises
focused on midcareer and midlife that Doug Bray and I developed together (Bray & Howard, 1983).
Bray remained the overall MPS director while I assumed the roles of administrator and researcher;
Don Grant had by now retired from AT&T and taken an academic position.
One more round of MPS interviews was conducted at year 25 before the breakup of the Bell
System ended data collection for the study. Table 39.1 shows the 25-year design and the sample sizes
for each AC.
The Management Progress Study and Its Legacy for Selection 849
TABLE 39.1
MPS Design and Sample

                                                            Sample
Year   Design                                      College  Noncollege  Total
0      MPS:0 assessment                            274      148         422
1–7    Annual interviews with participants and
       company representatives; annual
       expectations and attitude questionnaires
       by participants; terminator interviews
8      MPS:8 reassessment (parallel to original)   167      142         309
9–19   Triennial interviews with participants,
       bosses, and terminators; participant
       questionnaires
20     MPS:20 reassessment (midlife and
       midcareer)                                  137      129         266
25     Interviews with participants, bosses, and
       terminators; participant questionnaires
TABLE 39.2
MPS Management Levels at Year 25a

         College        Noncollege        Total
Level     N     %        N     %          N     %
7         1     1        0     0          1     0
6         4     3        0     0          4     1
5        10     7        0     0         10     4
4        32    23        5     4         37    13
3        62    45       40    29        102    37
2        24    17       61    44         85    31
1         6     4       32    23         38    14
Total   139   100      138   100        277   100

a Includes retirees.
college group once again demonstrated successful prediction from the original assessment: 44%
of the “yes” group attained the fourth level of management compared to 25% of the “no or ?”
prediction group.
Although the original assessment ratings remained valid across time, the question remained
whether a later assessment would have even greater validity. The year-8 reassessment data
confirmed that it did. Figure 39.2 illustrates why greater prediction occurred with time. The
graph shows average ratings on a higher-order general effectiveness factor, which comprised all but
a few of the dimensions. The results are shown by management level attained by year 20.
The pattern of results shown in Figure 39.2 illustrates the cliché, “The rich get richer, and the
poor get poorer” (Howard & Bray, 1988). The men who reached the fifth level or higher gained
significantly in managerial effectiveness between years 0 and 8, the fourth- and third-level men
showed little change, but the men who remained in lower levels showed significant declines. The
sharper slope of the line from the year-8 reassessment illustrates the gains for prediction of later
advancement.
The lesson for selection is that assessment information needs to be updated periodically to
account for individual change. Moreover, it should not be assumed that changes over time are due
simply to managers growing at differential rates. They are also likely to suffer differential declines
in ability and motivation as bad habits become ingrained and the flames of youthful ambition are
extinguished by the wet blanket of reality.
[Line graph: transformed factor scores (1.0–4.0) for MPS:0 and MPS:8, plotted by level attained at MPS:20 (1 through 5 & 6).]
FIGURE 39.2 General effectiveness factor over time by management level at year 20.
Within these broad categories were more specific strong predictors. For example, the mental
ability tests had more predictive power than the test of contemporary affairs. Among the interper-
sonal skills, oral communications was particularly useful in predicting later advancement.
Ambition was by far the most significant motivational characteristic bearing on later advance-
ment, and goal-setting was a key indicator. Goals were coded from assessment interviews in which
participants were asked to state the highest management level they wanted to attain in their career.
These goals could range from 1 (the lowest level of management) to 7 (a telephone company president).
The participants’ goals showed a substantial correlation with level attained by year 25 (r = .54),
even when the influence of cognitive ability was controlled (r = .43). Correlations with advancement
were significant for the college and noncollege groups separately, so that these groups’ differing
expectations of success did not fully explain the results. Perhaps the final career goal represents a
personal sizing up of one’s potential in relation to the organizational environment (Howard, 2005).
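For readers who want the mechanics, the “controlled” coefficient reported above is a first-order partial correlation. Assuming the standard formula (the specific zero-order correlations with cognitive ability were not reported here), it removes the linear influence of cognitive ability (C) from the relationship between goal level (G) and attained level (L):

```latex
% Partial correlation of goal level (G) with attained level (L),
% controlling for cognitive ability (C):
r_{GL \cdot C} \;=\; \frac{r_{GL} - r_{GC}\, r_{LC}}
{\sqrt{\left(1 - r_{GC}^{2}\right)\left(1 - r_{LC}^{2}\right)}}
```

With r_GL = .54 and a partial of .43, most of the goal–advancement relationship was evidently independent of cognitive ability.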
Lifestyles also showed some relationship to later advancement. For both educational groups,
those destined for high-level management were concerned about developing and improving them-
selves, mentally and physically. They were also more involved in their work.
Among the college graduates, academic success in high school and college was a good predictor
of advancement, as was attending a college of better quality. College major also made a difference;
the higher-level men were more likely to have been humanities or social science majors than busi-
ness or technical majors (Howard, 1986). Apparently the verbal and interpersonal skills, breadth of
interests, and inclinations toward leadership that accompany a liberal arts education trumped the
technical competence acquired in more specialized degrees.
PREDICTING TURNOVER
A precursor to success in Bell System management—or in any organization—is whether or not a
manager stays on the payroll. In the 1950s, men joining large organizations were typically expected
to stay until retirement. Nevertheless, nearly half of the college graduates (128 of 274) had left the
Bell System by the time of the MPS:20 assessments. Most of the terminations occurred in the first
few years of employment: 25% were gone within 4 years and nearly 40% within 10. Terminations
among the noncollege participants were negligible, given that they had been employed by the com-
pany more than 9 years on average before attending the AC. But the MPS:20 staff was able to suc-
cessfully predict which of the noncollege men were likely to retire early (Howard, 1988).
The staff at the original AC predicted that 70% of the college recruits would stay with the com-
pany, an overestimate of the 53% who actually did stay. Yet the assessors were still able to differen-
tiate those who would stay or leave far better than chance. Figure 39.3 shows their batting average
(in each case p < .01) for all terminations as well as by termination type. The subgroups included
56% who were discharged or induced to leave (forced terminators) and 44% who left voluntarily
(Howard & Bray, 1988).
The forced terminators’ lower scores on administrative and cognitive skills differentiated them
from the voluntary terminators and the stayers. They were rated significantly lower on decision-
making and planning and organizing in the in-basket, and they had lower mental ability test scores.
In addition, they showed unrealistic expectations about their future, lower resistance to stress, and
less endurance on the Edwards Personal Preference Schedule. Thus, the forced terminators were not
only less capable than those who stayed or left voluntarily, but they also lacked the resources that
would help them cope with a difficult situation.
The voluntary terminators showed equal abilities to those who stayed, but they were significantly
different from stayers and forced terminators in motivational characteristics. Dimension ratings
identified them as higher on need for advancement, but their ambition had an insistent and rigid
quality, accompanied by lower ratings on the ability to delay gratification and on behavior flexibil-
ity. Their high need for status was not matched by an intrinsic interest in the work itself, and they
had lower standards for the quality of their work. The voluntary terminators had lower interview
ratings on need for security and Bell System value orientation, which flagged less loyalty to the
company. When rapid advancement did not materialize, these men, being relatively uncaring about
their work or their company, departed.

[FIGURE 39.3 Percentage terminated among those “predicted to leave (or ?)” versus those “predicted to stay,” for all terminations, voluntary terminations, and forced terminations; bar values shown: 66%, 40%, 37%, 29%, 21%, 20%.]
MANAGERIAL ABILITIES
Evaluation of changes in managerial abilities must rely on data between MPS:0 and MPS:8 because
some key exercises were lacking at MPS:20. Figure 39.4 shows changes in three ability factors
(based on specific scores or ratings rather than dimensions), separated by management level at
MPS:20.
There was a surprising difference among the level groups in the administrative ability factor.
Those who ended up at year 20 in the lowest two management levels showed declines in administra-
tive ability in their early years, whereas those at the fifth and sixth levels showed a marked increase.
The future executives had developed their administrative capabilities over the first 8 years to a
rather extraordinary degree compared with their cohorts.
None of the level groups demonstrated better interpersonal ability at MPS:8 than they had
at the beginning, and most were rated more poorly. The best that could be said for the future
executives was that they held their own. Although motivation might be a key to the declines, the
results suggest that organizations need to continually reinforce interpersonal skills in development
planning.
All level groups gained on the cognitive ability factor, although those who would ultimately
reach third level and higher gained more. In this case, continuing one’s self-education in worldly
affairs was mostly a function of having gone to college.
[FIGURE 39.4 Scores on three ability factors (administrative, interpersonal, cognitive) at year 0 and year 8, plotted by MPS:20 level (1 through 5).]
The longitudinal ability data suggest that organizations should not make selection decisions with
the assumption that managers will likely improve their interpersonal skills over time; unless an
intervention is planned, they are likely to get worse. Although some managers might get better at
administrative skills, this is most likely to occur as they move up the ladder and take on greater
responsibilities. As a matter of course, only cognitive ability can be expected to increase as manag-
ers continue to expand their knowledge.
[Line graph: assessor ratings (1.0–5.0) at years 0, 8, and 20 of the study, plotted separately for MPS:20 levels 1 through 5+.]
The selection message in these data is that ambition is more characteristic of young managers
than of older, experienced ones. Although setting high goals at the beginning of one’s career
was a powerful predictor of later advancement, it is likely to lose its potency as time goes by.
However, reinforcing the drive to advance with promotions along the way helps to maintain its
importance.
MANAGEMENT STYLE
The higher-level men at MPS:20 showed a clear preference for leadership and advancement, but this
did not necessarily mean they were heavy-handed in executing their roles. On the Bass version of
the California F-scale, shown in Figure 39.6, the lower-level men grew into greater authoritarianism
over time. Perhaps they insisted on doing things by the book with strict enforcement of the rules.
By contrast, the men at the fourth level and higher ended up lowest on authoritarianism by MPS:20.
They had likely developed a broader perspective and learned to operate with more uncertainty and
greater flexibility.
Over the same period of time, the lower-level men became more nurturant on the Edwards
Personal Preference Schedule, whereas the higher-level men became less so. Figure 39.7 shows that
on the Bell System norms, the first-level men were at the 81st percentile by MPS:20, whereas the
fifth- and sixth-level men were only at the 34th percentile. Apparently the lower-level men were
more likely to offer sympathetic understanding and help to others, whereas the top executives were
much less likely to do so.
The combination of low authoritarianism and low nurturance suggests that the highest-level men
wanted to be flexible and objective in their leadership style but avoid emotional involvement, a style
I previously described as “cool at the top.”
They are like steely-eyed Solomons, attempting to size up situations in a rational way without the
distractions of emotional entanglements. They achieve grace not through sympathy and compassion,
but through a determination to be open to all the alternatives and to make decisions that are ultimately
objective, practical, and fair. (Howard, 1984, p. 21)
When selecting executives, then, it is important to recognize that overly nurturant behavior can be
a liability rather than an asset. The job often requires tough decisions, and executives need to be
able to sleep at night.
[FIGURE 39.6 Authoritarianism (Bass version of the California F-scale): percentile scores at MPS:0 and MPS:20, by level attained at MPS:20 (1 through 5 & 6).]

[FIGURE 39.7 Nurturance (Edwards Personal Preference Schedule): percentile scores at MPS:0 and MPS:20, by level attained at MPS:20 (1 through 5 & 6).]
In the late 1970s, as work on MPS:20 was in progress, the question arose as to whether the
MPS data could be generalized beyond its White male population. As a result of discussions with
AT&T executives, Doug Bray and I began research parallel to MPS that I named the Management
Continuity Study, or MCS. As before, participants were eligible only if they were considered viable
candidates for middle management and beyond. More than 20 years after the start of MPS, times
had indeed changed. When the telephone companies provided their recent hires as participants,
one-half were women and one-third represented ethnic minority groups. Table 39.3 shows the pri-
mary differences in the composition of the two samples.
As with previous AC results, there were no gender differences in the overall ratings of man-
agement potential of the MCS participants. Assessors predicted that 51% of the men and 53% of
the women “will” make middle management within 10 years and 39% of the men and 45% of the
women “should” do so. They thought 11% of the men and 13% of the women should rise beyond
the third level of management. None of these gender differences was statistically significant. The
termination predictions also did not differ by gender, although the reason for leaving the company
did. Assessors believed that significantly more of the men (23% of those leaving) would be forced
to do so compared with only 4% of the women.
Although overall management potential was the same for the MCS men and women, their constel-
lations of abilities, interests, motivations, and management style showed many interesting nuances
that can be only briefly summarized here (for details, see Howard & Bray, 1988). Table 39.4 gives a
snapshot of some of these differences. What is important to note is that values, interests, personality,
and motivation best differentiated the men and women (in a discriminant function analysis), but the
sexes were seen as equal on the major abilities and in psychological adjustment.
There were ethnic group differences in overall management potential, although interpreta-
tion was complicated by interactions with gender and small sample sizes of some ethnic groups.
Only the Black sample (N = 79) was large enough to compare with the Whites (N = 237). It is
not known to what extent affirmative action created selection disparities between these two
groups.
Blacks were at the greatest disadvantage on cognitive abilities, which relied on paper-and-pencil
tests. On the simulations, they lagged somewhat behind Whites in administrative skills, but the two
groups were equal on interpersonal skills. There were no overall differences between the groups on
the motivation factors. Thus, if only cognitive tests had been used for selection, the adverse impact
would have been considerably greater than it would be in an AC, which looks at a broader range of
abilities and other characteristics. Unfortunately, the breakup of the Bell System precluded collect-
ing criterion data that would have allowed examination of the validity—and potential differential
validity—of the MCS assessment.
PRACTICAL ADVANTAGES
ACs, particularly those that rely primarily on simulations, offer organizations several additional
practical advantages as selection tools (Howard, 2006).
TABLE 39.3
AT&T Longitudinal Research Samples

MPS                          MCS
Assessed 1956–1960           Assessed 1977–1982
6 telephone companies        18 telephone companies
All White males              One-half female; one-third ethnic minority
College and noncollege       All college graduates
N = 422                      N = 344
TABLE 39.4
Management Continuity Study Gender Differences

Women Higher                 Men Higher
Verbal skills                Quantitative skills
Interpersonal sensitivity    Assertiveness
Work standards               Desire to advance
Poise                        General information
Self-objectivity             Expectations of success
Flexibility                  Authority orientation
1. Live behavior: Because participants demonstrate live behavior, there is little opportunity
for them to fake the demonstration of skills. Decision-makers also do not need to speculate
from test scores about what participants’ behavior would be like.
2. Diagnosis for development: A significant advantage of simulations over other selection
methods is their usefulness for development. The dimensions are primarily aimed at skills
and abilities that can be learned, and demonstrations of live behavior in simulations provide
excellent material for credible feedback. If relatively stable personality and motivation char-
acteristics are also measured in the assessment, participants can learn how they manifest
these qualities in different situations and how to keep them from getting out of control.
3. Future orientation: Simulations can be directed at future jobs or challenges, producing
information that can only be speculated about from current job performance.
4. Realistic job preview: Because they simulate real-life challenges representative of a par-
ticular position or role, ACs form a realistic job preview for candidates. Gaining this per-
spective can encourage those who are not a good fit to opt out of the selection process. This
experience is especially important for advancement to the first level of management, where
candidates might not realize what leadership really entails. It can also be a revelation to
senior-level candidates who do not appreciate the difference between operational and stra-
tegic leadership.
5. Plasticity: AC simulations are fundamentally plastic; that is, they can be designed to rep-
resent various organizational challenges and measure specific dimensions of interest.
DISADVANTAGES OF ACS
The major drawback of ACs is the effort and expense involved in conducting them. Although mod-
ern middle-management ACs are typically much shorter (1 day) than the MPS model (3.5 days)
and take advantage of computer technology, they still entail considerable administrative and staff-
ing costs. This drawback has meant that organizations are more selective about when to use ACs.
Measurement of live behavior is particularly advantageous in circumstances such as the following:
Interaction Role-Plays
MPS had only one brief role-play: the in-basket interview was conducted by a role player acting
as the participant’s boss. However, role-plays have become an important staple in AC practice.
Participants review background information on a peer, internal or external customer, direct report,
prospective client, or other important contact. They then meet with a trained role player to resolve
a problem or gain commitment to a course of action. Role-plays are often structured to bring out a
primary competency such as coaching, influencing others, or managing conflict.
structure, or morale. The analysis is often followed by a presentation of the results to one or more
role players or external observers.
Media Interview
Typically confined to executive assessments, these exercises provide participants with background
information about a situation that has attracted media attention. Candidates then meet with a media
representative (a trained role player) to answer probing questions and defend their organization’s
actions.
Visionary Address
Participants make an inspirational talk on some selected problem or issue.
Fact-Finding Exercise
The participant is given a brief description of a hypothetical situation. The task is to seek informa-
tion from a resource person and make a decision about that situation within a limited period.
Each simulation in MPS and early operational centers had its own unique setting. However, the
modern assessment is more likely to be of a “day in the life” variety, in which the participant takes
on one organizational role and faces numerous challenges and tasks related to that role. Such centers
are more realistic for participants and also more efficient, given that participants have to understand
only one background situation before proceeding.
The context in which ACs take place has also changed. Delivery of assessment materials has
advanced from papers wrapped in large envelopes to online messages in e-mail, voice mail, and video
mail. In a simulated desktop environment, participants can schedule appointments, create to-do lists,
and bring new documents into the system. Computer-enhanced ACs also facilitate other efficiencies.
For example, they shorten administration time and facilitate assessor scoring and reporting. Although
participants might still come to an established setting (more likely an office than an estate), other par-
ticipants can avoid traveling to a site altogether and participate by computer in a virtual AC.
AC professionals have also improved the dimensions in order to get reliable ratings on reason-
ably independent characteristics. Definitions of dimensions are more precise and often include key
behaviors. For example, the MPS definition of “decision-making” was simply “How ready is this
person to make decisions, and how good are the decisions that are made?” A modern definition
would include more detailed guidance on what to look for, such as identifying issues, problems, and
opportunities; gathering, comparing, and interpreting data to draw conclusions; using relevant deci-
sion criteria to choose an effective option; and taking action within a reasonable time. The number
of dimensions for an individual AC has been reduced, and the ones that remain are often tied to
specific business challenges.
In summary, ACs have been enhanced in many ways since their beginnings in MPS, but this
is as it should be. The fundamental plasticity of ACs is what keeps them alive and relevant to
modern organizational challenges. What is missing is convincing research to demonstrate that the
various innovations work successfully and do not compromise the method’s validity. For example,
although we might assume that more precisely defined dimensions will increase assessor reliability
and therefore enhance validity, research might reveal instead that a broadly interpreted predictor
will better match a broadly interpreted criterion.
Another sidetrack has been the research demonstrating that AC ratings vary more by exercise
than by dimension, with the resulting recommendation that ACs should be focused on exercises
or tasks rather than dimensions. This research stemmed from use of a multitrait-multimethod
(MTMM) analysis of AC data, with dimensions equated to traits and exercises equated to
methods (cf. Lievens & Conway, 2001). These models could not have been constructed from
MPS or its immediate successors because assessors rated dimensions only after all information
from all exercises was reviewed in an integration session (within-dimension method). Later
centers had assessors rate dimensions by exercise as an additional step before the final
integration (within-exercise method); data from these centers were used to fill the cells of the
MTMM matrix.
As I have argued extensively elsewhere (Howard, 2008), a center where each exercise is used
to rate all or nearly all dimensions is not good measurement and is not representative of AC prac-
tice. Bray selected techniques for MPS believing that each would add a different perspective and
facilitate rating specific dimensions; he did not expect that each technique would be equally adept
at measuring nearly every dimension. Simulations are designed to bring out behaviors relative to
particular dimensions. For example, the in-basket elicits decision-making, planning and organizing,
and delegation, but the leaderless group discussion brings out emergent leadership and interpersonal
skills. The trouble begins when assessors are asked to rate too many dimensions from one exercise.
For example, if the in-basket is used to measure interpersonal skills, where the evidence is bound
to be weak at best, it is no wonder the assessors take their cues from other dimensions they can rate
well from the exercise.
Other findings in psychology suggest that situations (exercises) and individual characteristics
(dimensions) interact in determining behavior. Indeed, where methods other than MTMM have
been used to study the elements of ACs, exercises and dimensions have been shown to account for
meaningful variance in AC performance. One generalizability analysis showed that 60% of the
variance in participants’ AC performance could be attributed to individuals and constructs and only
11% to assessors and exercises (Arthur, Woehr, & Maldegen, 2000).
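As a rough sketch of what such a generalizability analysis partitions (an illustrative decomposition, not necessarily the exact model Arthur et al. fit, and the assignment of interaction terms to each share is an assumption here), the observed rating variance is split into components for persons (p), dimensions (d), assessors (a), and exercises (e), with each reported percentage being a set of components’ share of the total:

```latex
% Illustrative variance decomposition for AC ratings
\sigma^{2}_{\text{total}}
= \underbrace{\sigma^{2}_{p} + \sigma^{2}_{d} + \sigma^{2}_{p \times d}}_{\approx 60\%:\ \text{individuals and constructs}}
\; + \;
\underbrace{\sigma^{2}_{a} + \sigma^{2}_{e} + \sigma^{2}_{a \times e}}_{\approx 11\%:\ \text{assessors and exercises}}
\; + \; \sigma^{2}_{\text{residual}}
```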
Aside from the research issues, the recommendation to rate exercises instead of dimensions in an
AC makes no practical sense. Many tasks and jobs change quickly in today’s economy, which would
quickly render task-based assessment results obsolete. More importantly, exercise ratings cannot be
generalized to other situations. Because ACs are often used to simulate untried roles—including
some that may not yet exist—and build a leadership pipeline for as yet undetermined positions, it
is important to be able to generalize about the human qualities needed for various organizational
challenges.
An important principle lost in this sidetrack is that behaviors, not exercises, are the currency
of ACs; exercises serve only to elicit behaviors. Dimensions are logical clusters of behaviors; they
organize the behavioral information so that conclusions can be drawn about participants’ readiness
for future positions and responsibilities.
be spared from their primary jobs. This turnaround opens the door to incorporating other measure-
ment techniques into the AC. Figure 39.8 illustrates (with multiple regression data) how different
techniques used at the original MPS assessment contributed to the prediction of management level
20 years later.
Although interpersonal skills measured in simulations were highly predictive of advancement in
management, that prediction was enhanced by including educational level (a biodata item), ambition
(composed of information from the interview and a personality test), and cognitive ability (from a
mental ability test). Instead of being competitive about their favorite method (e.g., “my test works
better than your simulation”), researchers and practitioners should remember that different
techniques have advantages for measuring different human characteristics. Wise psychologists should
become better aware of how techniques work together to select the best candidates for organiza-
tional purposes.
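The incremental contribution that Figure 39.8 depicts can be expressed generically (the exact model specification and predictor entry order were not given here) as the gain in squared multiple correlation when additional predictors join the simulation measures:

```latex
% Incremental validity of added predictors over simulations alone
\Delta R^{2} \;=\; R^{2}_{\text{simulations} + \text{biodata} + \text{ambition} + \text{cognitive}}
\;-\; R^{2}_{\text{simulations only}}
```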
Another technique from MPS that has been overlooked is the use of staff specialists. MPS was
privileged to have two clinical psychologists—Joseph F. Rychlak and Walter Katkovsky—who
stayed with the study throughout its three assessments and also participated in the MCS assess-
ments. Rychlak was a specialist in interviewing and coding life themes from those discussions;
Katkovsky was a specialist in interpreting the various projective measures. By keeping these
two aimed at their specialty assignments, MPS and MCS gained reliability and accuracy that
would not be available if they had spread the tasks among a variety of assessors. Other ACs often
have assessors who are particularly interested in or adept at evaluating certain exercises (e.g.,
in-baskets) or performing certain role-plays (e.g., a media interview). Although specialists can reduce
administrative flexibility in some situations, the advantages of cultivating them can far exceed the
disadvantages.
Another practice that today’s ACs should take advantage of is the strategic use of cross-situational
dimensions (Howard, 2008). Because MPS made no attempt to rate dimensions after exercises,
assessors could draw information from any technique to support their ratings. ACs that have asses-
sors rate dimensions by exercise—the large majority of those used in practice today—often over-
look the opportunity to build a dimension rating from diverse behaviors elicited from different
exercises. Yet, some complex dimensions cannot be measured well within a single exercise.
Cross-situational dimensions have key behaviors that are best elicited by different types
of stimuli. For example, in the dimension “accelerating change,” a key behavior, “identifies
opportunities for improvement,” is best elicited by covertly planting opportunity cues among
various exercises in the design. Another key behavior, “catalyzes change,” is best measured by
an exercise requiring an action plan, whereas a third key behavior, “addresses change resistance,”
is best measured in a role-play. Even ACs using the within-exercise method can emulate
the more free-wheeling within-dimension MPS design by having assessors make key behavior
ratings from the relevant exercises and rolling them up into a cross-situational dimension rat-
ing at the integration session. Such a design will add precision to the measurement of complex
dimensions.
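The roll-up just described — exercise-level key-behavior ratings combined into a single cross-situational dimension rating at the integration session — can be made concrete in a few lines. This is a hypothetical illustration, not the MPS procedure or any published scoring rule: the behavior names come from the "accelerating change" example above, while the 5-point scale and the simple averaging rule are assumptions.

```python
# Hypothetical sketch: rolling exercise-level key-behavior ratings up into a
# cross-situational dimension rating. Scale and aggregation rule are assumed.
from statistics import mean

# Each key behavior is rated only in the exercise(s) designed to elicit it.
key_behavior_ratings = {
    "identifies opportunities for improvement": [4, 3],  # cues planted in two exercises
    "catalyzes change": [5],                             # action-plan exercise
    "addresses change resistance": [2],                  # role-play
}

def dimension_rating(behavior_ratings):
    """Average each key behavior across exercises, then average the behaviors."""
    return mean(mean(scores) for scores in behavior_ratings.values())

print(dimension_rating(key_behavior_ratings))  # → 3.5
```

Real integration sessions typically reconcile discrepant ratings through assessor discussion rather than mechanical averaging; the averaging rule here only makes the aggregation logic explicit.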
862 Handbook of Employee Selection
• Aging and experience: We hypothesized that aging and experience accounted for the find-
ings when the MPS sample changed between years 0 and 20 and the MCS sample at year
0 resembled the MPS group at year 0. These changes included an increase in verbal skills
and motivational shifts. Whereas the MPS men’s intrinsic motivation increased over time
(higher need for achievement and work standards), their promotion orientation declined.
That is, as time went by they were more willing to delay promotions and had more realistic
expectations. The personalities of the MPS men seemed to toughen over time; they showed
more resistance to stress, less restraint on hostility, and less inclination to take the blame
when something went wrong. The men also underwent lifestyle changes; the role of friends
declined in favor of more involvement with the marital family and recreational activities.
• Culture changes: A culture change was indicated when the MPS participants changed over
20 years and the MCS group resembled them at year 20 rather than at year 0. A primary
theme here was less respect for institutions. The MPS:20 and MCS:0 participants showed
less deference toward people in authority, lower needs for superior approval, and less reli-
gious involvement. There was also less identification with “masculine” qualities among
the men.
• Both aging and culture changes: We hypothesized that aging and culture changes were at
work when the MPS participants changed between years 0 and 20 but the MCS partici-
pants differed from them at both ages. Factors falling within this hypothesis often involved
having increased control over one’s life. That is, compared with the MPS:0 findings, the
MPS:20 participants, and to a lesser extent the MCS:0 participants, had greater need for
autonomy, less need for peer approval, less involvement with the parental family, lower
needs for security, and less interest in service to the community. Both groups also showed
a tendency toward less effective interpersonal skills and less behavior flexibility. Greater
pessimism about their future with the organization also appeared to be a culture change
and a characteristic that grew with maturity. The MPS:20 and MCS:0 groups had a lower
motivation for advancement, less forward-striving, and lower career expectations.
The MPS:20 group also showed declines on various scales of a Bell System question-
naire of attitudes toward work and the organization. The MCS:0 participants generally
scored lower than the MPS:0 sample but were inconsistent in their average scores relative
to the group at MPS:20. However, these attitudinal data might be artifacts of the Bell
System questionnaire; other research has typically found higher levels of job satisfaction
for older workers (Hedge, Borman, & Lammlein, 2006).
• Generation differences: Where the MPS group changed little over 20 years but the MCS
group looked significantly different, we hypothesized a generation effect. That is, being
young at a particular point in time might have created qualities unique to a generation.
One prominent but worrisome difference was the younger generation’s reduced interest in
leadership. They were less identified with management, less interested in leading others,
had poorer communication skills, and were more nonconforming. At the same time they
had a greater need for emotional support, on the giving end (nurturance) and the receiving
end (succorance).
The AT&T studies compared the post-World War II generation (MPS) and the Baby
Boom generation (MCS). By now, another generation or two have ushered in new values
and motivations, and more cultural changes have taken place. The results that might be
enduring, then, are those that we isolated as due in whole or in part to aging and experi-
ence. If these findings hold up under the scrutiny of additional research, they have strong
implications for how to develop managers and what to be aware of when selecting from
different age groups.
Some key facts to remember are that the average manager does not improve spontane-
ously; organizations need to work at developing the skills they want. Moreover, there is a
tendency for the rich to get richer and the poor to get poorer. Developing and promoting
your best management candidates will not be enough; organizations also need to attend
to managers lower on the totem pole lest they fall into bad habits or lose their desire to
excel.
In retrospect, the MPS stimulated more than 50 years of AC practice and research with compel-
ling implications for personnel selection. At the moment, there seems to be no reason why it cannot
encourage 50 more.
REFERENCES
Arthur, W. A., Jr., Woehr, D. J., & Maldegen, R. (2000). Convergent and discriminant validity of assessment
center dimensions: A conceptual and empirical re-examination of the assessment center construct-related
validity paradox. Journal of Management, 26, 813–835.
Bray, D. W. (1976, March–April). Identifying managerial talent in women. Atlanta Economic Review, 38–43.
Bray, D. W. (1990). Reminiscing in tempo. Autobiography prepared for the Society for Industrial and
Organizational Psychology. Available online at http://www.siop.org/Presidents/Bray.aspx
Bray, D. W., Campbell, R. J., & Grant, D. L. (1974). Formative years in business: A long-term AT&T study of
managerial lives. New York, NY: Wiley.
Bray, D. W., & Grant, D. L. (1966). The assessment center in the measurement of potential for business man-
agement. Psychological Monographs, 80 (17, Whole No. 625).
Bray, D. W., & Howard, A. (1983). The AT&T longitudinal studies of managers. In K. W. Schaie (Ed.),
Longitudinal studies of adult psychological development. New York, NY: Guilford.
Frederiksen, N., Saunders, D. R., & Wand, B. (1957). The in-basket test. Psychological Monographs, 71
(Whole No. 438).
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., III, & Bentson, C. (1987). Meta-analysis of assessment
center validity. Journal of Applied Psychology, 72, 493–511.
Grant, D. L., & Bray, D. W. (1969). Contributions of the interview to assessment of managerial potential.
Journal of Applied Psychology, 53, 24–34.
Grant, D. L., Katkovsky, W., & Bray, D. W. (1967). Contributions of projective techniques to assessment of
managerial potential. Journal of Applied Psychology, 51, 226–232.
Hedge, J. W., Borman, W. C., & Lammlein, S. E. (2006). The aging workforce: Realities, myths, and implica-
tions for organizations. Washington, DC: American Psychological Association.
Howard, A. (1974). An assessment of assessment centers. Academy of Management Journal, 17, 115–134.
Howard, A. (1984, August). Cool at the top: Personality characteristics of successful executives. In P. D. Lifton
(Chair), Industrial assessment and personality psychology. Symposium conducted at the meeting of the
American Psychological Association, Toronto, Canada.
Howard, A. (1986). College experiences and managerial performance (Monograph). Journal of Applied
Psychology, 71, 530–552.
Howard, A. (1988). Who reaches for the golden handshake? The Academy of Management Executive, 2(2),
133–144.
Howard, A. (1995). A framework for work change. In A. Howard (Ed.), The changing nature of work (pp. 3–44).
San Francisco, CA: Jossey-Bass.
Howard, A. (2005). Subconscious and conscious motives in long-term managerial success. In G. P. Latham
(Chair), The effects of subconscious trait and state motivation on performance. Symposium presented
at the 20th annual conference of the Society for Industrial and Organizational Psychology, Los Angeles,
CA.
Howard, A. (2006). Best practices in leader selection. In J. A. Conger & R. E. Riggio (Eds.), The practice of
leadership: Developing the next generation of leaders. San Francisco, CA: Jossey-Bass.
Howard, A. (2008). Making assessment centers work the way they are supposed to. Industrial and Organizational
Psychology, 1, 101–107.
Howard, A., & Bray, D. W. (1988). Managerial lives in transition: Advancing age and changing times.
New York, NY: Guilford.
Howard, A., & Bray, D. W. (1990). Predictions of managerial success over long periods of time: Lessons
from the management progress study. In K. W. Clark & M. B. Clark (Eds.), Measures of leadership
(pp. 113–130). West Orange, NJ: Leadership Library of America.
International Task Force on Assessment Center Guidelines. (2000). Guidelines and ethical considerations for
assessment center operations. Public Personnel Management, 29, 315–331.
Lievens, F., & Conway, J. M. (2001). Dimensions and exercise variance in assessment center scores: A large-
scale evaluation of multitrait-multimethod studies. Journal of Applied Psychology, 86, 1202–1222.
Martin, A. H. (1941). Examination of applicants for commissioned rank. In L. Farago & L. F. Gittler (Eds.),
German psychological warfare. New York, NY: Committee for National Morale.
Murray, H. A. (1938). Explorations in personality. New York, NY: Oxford University Press.
Office of Strategic Services Assessment Staff. (1948). Assessment of men. New York, NY: Holt, Rinehart &
Winston.
Riesman, D., Glazer, N., & Denney, R. (1950). The lonely crowd: A study of the changing American character.
New Haven, CT: Yale University Press.
Rychlak, J. F., & Bray, D. W. (1967). A life-theme method for scoring of interviews in the longitudinal study of
young business managers. Psychological Reports (Monograph Supplement, 1-V21).
Thornton, G. C., III, & Byham, W. C. (1982). Assessment centers and managerial performance. New York,
NY: Academic Press.
Thornton, G. C., III, & Rupp, D. E. (2006). Assessment centers in human resource management: Strategies for
prediction, diagnosis, and development. Mahwah, NJ: Lawrence Erlbaum.
U. S. Bureau of the Census. (1975). Historical statistics of the United States: Colonial times to 1970. Washington,
DC: U.S. Department of Commerce.
Vernon, P. E., & Parry, J. B. (1949). Personnel selection in the British forces. London, England: University of
London Press.
Wallace, P. A. (Ed.). (1976). Equal employment opportunity and the AT&T case. Cambridge, MA: MIT Press.
Whyte, W. H., Jr. (1956). The organization man. New York, NY: Simon & Schuster.
40 Project A: 12 Years of R & D
John P. Campbell and Deirdre J. Knapp
This chapter1 is about personnel selection and classification research on a scale never before
attempted in terms of (a) the types and variety of information collected, (b) the number of jobs that
were considered simultaneously, (c) the size of the samples, and (d) the length of time that individu-
als were followed as they progressed through the organization.
The effort, commonly known as Project A, was sponsored by the U.S. Army Research Institute
for the Behavioral and Social Sciences (ARI). For contract management reasons the research pro-
gram was conducted as two sequential projects: Project A (1982–1989) and Career Force (1990–
1994), which worked from a single overall design (described subsequently).
Collectively, these projects attempted to evaluate the selection validity and classification effi-
ciency of systematically sampled domains of prediction information for different selection and clas-
sification goals for the entire enlisted personnel system of the U.S. Army, using various alternative
decision rules (i.e., “models”). Pursuing such ambitious objectives required the development of a
comprehensive battery of new tests and inventories, the development of a wide variety of training
and job performance measures for each job in the sample, four major worldwide data collections
involving thousands of Army enlisted job incumbents for one to two days each, and the design and
maintenance of the resulting database.
The truly difficult part was the never-ending need to develop a consensus among all of the project
participants regarding literally hundreds of choices among measurement procedures, analysis meth-
ods, and data collection design strategies. Although many such decisions were made in the original
design stage, many more occurred continuously as the projects moved forward, driven by the target
dates for the major data collections, which absolutely could not be missed. The fact that all major
parts of the projects were completed within the prescribed time frames and according to the speci-
fied research design was a source of wonder for all who participated.
ORIGINS OF PROJECT A
Project A resulted from pressing organizational needs from within the Department of Defense
(DoD), and not from unsolicited proposals for research grants. The post-Vietnam development of
the all-volunteer force, the comprehensive force modernization programs of the services during the
1970s, periods of high attrition, and the fallout from the misnorming of forms 6/7 of the Armed
Services Vocational Aptitude Battery (ASVAB) all created great pressure on the military services
to make certain that their selection and classification procedures were as valid as possible. The
culmination was a 1980 Congressional mandate to conduct a more thorough validation of ASVAB
for selection and classification purposes. Prior to 1980, virtually all validation evidence was based
1 Much of this chapter was drawn from Campbell and Knapp (2001) and from the technical reports generated by Project
A and Career Force. We are indebted to all of the individuals from the Army Research Institute (ARI), the Human
Resources Research Organization (HumRRO), the American Institutes for Research (AIR), the Personnel Decisions
Research Institute (PDRI), and the University of Minnesota who contributed their technical expertise, initiative and
effort, personal discipline, and peer leadership to these projects. It was the experience of a lifetime.
on training grades as criteria. The DoD’s formal response was the Joint-Service Job Performance
Measurement/Enlistment Standards Project (JPM).
Project A was the Army’s contribution to JPM. Moreover, the Army viewed the Congressional
mandate as an opportunity to address a much larger set of personnel research questions than just
the validation of ASVAB against job performance. In September 1982, a contract for Project A was
signed with the Human Resources Research Organization (HumRRO) and its subcontractors, the
American Institutes for Research (AIR) and Personnel Decisions Research Institute, Inc. (PDRI).
The overall design of the Project A program was intended to be fundamentally different from
the conventional paradigm that dominated personnel research from 1906 to 1982, which consisted
of computing the correlation between a single predictor score, or a predictor composite score, and a
single criterion measure of performance obtained from a sample of job incumbents. Literally thou-
sands of such estimates have been generated over the past 100+ years (e.g., Ghiselli, 1973; Hunter
& Hunter, 1984; Nathan & Alexander, 1988; Schmidt, 1988; Schmidt & Hunter, 1998; Schmitt,
Gooding, Noe, & Kirsch, 1984).
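Computationally, the conventional paradigm described above reduces to a single Pearson correlation between one predictor score and one criterion score. A minimal sketch follows; the scores and sample size are invented for illustration and are not Project A data.

```python
# Minimal illustration of the conventional validation paradigm: one Pearson
# correlation between a predictor composite and a single criterion measure.
# All values below are invented for illustration.
from statistics import mean, pstdev

def pearson_r(x, y):
    """Population Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

predictor = [50, 55, 60, 65, 70, 75]          # e.g., test composite scores
criterion = [3.1, 3.0, 3.6, 3.4, 4.0, 4.2]    # e.g., performance ratings

print(round(pearson_r(predictor, criterion), 2))  # → 0.92, the "validity coefficient"
```

The thousands of estimates cited above are, in essence, thousands of runs of this one calculation on different predictor-criterion pairs.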
There are probably legitimate reasons why single investigators, each generating one bivariate distribution at a time, served as the dominant paradigm through most of our history. For one, the recurring problem of how best to select individuals for a particular job in a particular organization is a very real one, and a rational management will devote resources to solving such a problem.
Also, certain structural and technological factors have worked against the establishment of long-
term coordinated research projects that dealt with large parts of the personnel system at one time.
For example, the field of industrial and organizational psychology is not very large and the supply
of research labor is limited. When the basic outline of Project A/Career Force was proposed, there
was no single organization or university group that had the resources necessary to carry it out.
Coalitions of organizations had to form. Also, until fairly recently, there were no means available
for coordinating the efforts of researchers who are geographically scattered. Neither was there a
technology for acquiring and maintaining a large central database that could be accessed and ana-
lyzed efficiently from remote locations.
In general, the dominant paradigm came to be so because of the constraints imposed by technol-
ogy, the structural characteristics of the research enterprise itself, and the contingencies built into
the reward structures for individual investigators.
ENABLING OF PROJECT A
Fortunately, along with the Army’s need to address enlisted selection and classification as a system,
there were concomitant developments in the structure and technology of the personnel research
enterprise. For example, advances in computerized database management and electronic communi-
cation made it possible to design, create, edit, update, and maintain a very large database that could
be accessed from anywhere. What is routine now was new and liberating in 1982.
Advances in computerization also permitted use of new testing technologies, and the development of powerful linear programming algorithms made the estimation of classification efficiency
and the comparison of alternative selection/classification strategies using the entire Army database
a very manageable analytic problem. Certainly, the development of confirmatory techniques within
the general domain of multivariate analysis models opened up several powerful strategies for gen-
eralizing research findings from a sample of jobs to a population of jobs and from the specific mea-
sures that were used to a latent structure of constructs.
Finally, the realization in industrial and organizational psychology during the 1970s that one of
our fundamental tasks is to learn things about an appropriately defined population, and not to learn
more and more specific things about specific samples, changed the field’s approach to the estimation
of selection validity and classification efficiency. Meta-analysis and corrections for attenuation and
restriction of range were no longer novel games to play. They were a necessary part of statistical
estimation.
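The corrections mentioned here have standard closed forms: dividing an observed validity by the square root of criterion reliability disattenuates it for criterion unreliability, and the Thorndike Case II formula adjusts for direct range restriction on the predictor. The sketch below uses illustrative values; the numbers and the order in which the corrections are applied are assumptions, not Project A estimates.

```python
# Hedged sketch of two standard psychometric corrections: disattenuation for
# criterion unreliability and Thorndike Case II range-restriction correction.
# All numeric values are illustrative assumptions.
import math

def correct_attenuation(r_xy, r_yy):
    """Disattenuate an observed validity for criterion unreliability r_yy."""
    return r_xy / math.sqrt(r_yy)

def correct_range_restriction(r, u):
    """Thorndike Case II; u = restricted SD / unrestricted SD of the predictor."""
    U = 1.0 / u
    return (U * r) / math.sqrt(1 + (U**2 - 1) * r**2)

r_obs = 0.25   # observed validity in the incumbent sample
r_yy = 0.60    # criterion (e.g., rating) reliability
u = 0.70       # ratio of restricted to unrestricted predictor SD

r_corrected = correct_range_restriction(correct_attenuation(r_obs, r_yy), u)
print(round(r_corrected, 2))  # → 0.44
```

The point of the passage stands out in the arithmetic: a modest observed coefficient can imply a substantially larger population-level validity once measurement error and selection on the predictor are taken into account.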
In sum, the intent was to design a research program that would be directly useful for meeting the
system’s needs, both as they existed initially and as changes took place. Simultaneously, everyone
hoped that by considering an entire system and population of jobs at once, and by developing mea-
sures from a theoretical/taxonomic base, the science of industrial and organizational psychology
would also be served.
The specific objectives of the research program were as follows:
1. Identify the constructs that constitute the universe of information available for selection/classification into entry-level skilled jobs, given no prior job experience, and develop predictor measures for those constructs identified as "best bets."
2. Develop multiple measures of entry-level and noncommissioned officer (NCO) job
performance.
3. Develop a general model of performance for entry-level skilled jobs and for NCO jobs.
4. Validate existing selection measures (i.e., ASVAB) against training and job performance.
5. On the basis of the “best bet” constructs, validate a battery of new predictor measures.
6. Estimate validity and incremental validity of the new predictors.
7. Estimate the degree of differential prediction across (a) major domains of predictor infor-
mation, (b) major factors of job performance, and (c) different types of jobs.
8. Develop new analytic methods to evaluate optimal selection and classification.
9. Compare the marginal utility of performance across jobs.
10. Develop a fully functional research database that includes all archival research data on the
three cohorts of new Army accessions included in the research program.
The initial set of 19 military occupational specialties (MOS) included only 5% of Army jobs but represented 44% of the soldiers recruited in FY81. An independent (i.e., without considering the Career Management Field [CMF] designation) cluster analysis
of MOS (based on task content similarity) was carried out via Army subject matter expert (SME)
judgments to evaluate the representativeness of the sample of 19 and to make changes in the composition of the sample if it was judged appropriate to do so. Two jobs were added, and Table 40.1 shows the resulting MOS (n = 21) that were studied over the course of Project A. "Batch A" MOS received the most attention in that soldiers in these jobs were administered a full array of first- and second-tour job performance measures, including hands-on work sample tests, written job knowledge tests, and Army-wide and MOS-specific ratings. Soldiers in "Batch Z" were not measured as extensively with regard to the job performance criterion measures.

TABLE 40.1
Sampled Military Occupational Specialties (MOS)

Batch A MOS:
  11B Infantryman
  13B Cannon crewman
  19E/K Armor tank crewman
  31C Single-channel radio operator
  63B Light-wheel vehicle mechanic
  71L Administrative specialist
  88M Motor transport operator
  91A/B Medical specialist/medical NCO
  95B Military police

Batch Z MOS:
  12B Combat engineer
  16S MANPADS crewman
  27E Tow/dragon repairer
  29E Communications-electronics radio repairer
  51B Carpentry/masonry specialist
  54B NBC specialist
  55B Ammunition specialist
  67N Utility helicopter repairer
  76Y Unit supply specialist
  94B Food service specialist
  96B Intelligence analyst
FIGURE 40.1 Project A: Overall data collection design and major research samples. [Figure: a timeline spanning 1983–1993 showing file data analyses of the 81/82 cohort, administration of the trial predictor battery and first-tour performance measures to the 83/84 concurrent sample (CVI, N = 9,500), and administration of second-tour performance measures to that cohort (CVII, N = 1,000).]
• Concurrent validation second-tour (CVII) sample: The same teams that administered the
first-tour performance measures to the LVI sample administered second-tour performance
measures at the same location and during the same time periods to a sample of junior
NCOs from the 83/84 cohort who were in their second tour of duty (4–5 years of service).
• Longitudinal validation second-tour (LVII) sample: This sample included members of the
86/87 cohort from the Batch A MOS who were part of the LVP (predictors), LVT (training
performance measures), and LVI (first-tour job performance measures) samples and who
reenlisted for a second tour (6–7 years of total service). The second-tour (NCO) perfor-
mance measures were administered at 15 U.S. posts, multiple locations in Germany, and
two locations in Korea.
the population of enlisted MOS and then sample appropriately from it. The appropriate constructs
were those that were judged to maximize the expected linkages with the population of perfor-
mance components, not just the performance components encompassed by the 21 MOS in the
sample. The next steps were to construct measures for each sampled construct that was above a
threshold of criticality, given the measurement purposes. Use of available commercial measures
was not considered given the requirements of the military enlistment testing system, although
some such tests (e.g., subtests from the Employee Aptitude Survey, Differential Aptitude Test)
were used as marker measures in the development of the new measures. Accordingly, the intent
was to develop a predictor battery that was maximally useful for selection and classification into
an entire population of jobs, and that provided maximal incremental information beyond that
provided by the ASVAB.
CRITICAL CONSTRUCTS
After a thorough literature review (including available meta-analyses), the research team identified
a list of 53 potentially useful predictor variables. A sample of 35 personnel selection experts was
then asked to estimate the expected correlations between each predictor construct and an array of
potential performance factors. All of the information was then used to arrive at a final set of vari-
ables (k = 25) for which new measures would be constructed.
The measure development process included (a) a large-scale field test that also administered a set of established marker tests for several of the constructs (see Peterson et al., 1990); (b) development of software for a computerized battery of perceptual/psychomotor tests, as well as a portable testing station and a special response pedestal permitting various responses (e.g., one-hand tracking, two-hand coordination); (c) development of several paper-and-pencil cognitive tests; and (d) development of two inventories. One
inventory assessed relevant vocational interests, and the second focused on major dimensions of
personality and biographical history. This process resulted in the following experimental predictor
battery that was first used in the Concurrent Validation (CVI) in 1985.
Spatial visualization—rotation
Assembling objects test: Select the alternative that depicts the components assembled
correctly.
Object rotation test: Is the figure represented the same as the test object, only rotated?
Spatial visualization—scanning
Maze test: Determine which of the four entrances leads to a correct pathway.
Spatial orientation
Orientation test: Rotate the frame to match correct orientation of a picture.
Map test: From directions to only one landmark infer directions to another.
Reasoning test: Given a series of figures, identify the figure that should appear next.
Psychomotor Precision
Psychomotor precision encompasses two of the ability constructs identified by Fleishman and his
associates: control precision and rate control (Fleishman, 1967).
Target tracking test 1: As a target moves at a constant rate along a path consisting of horizontal
and vertical lines, a single joystick is used to keep crosshairs centered on the target. The mean
distance from the crosshairs to the center of the target, computed several times each second,
constitutes overall accuracy.
Target shoot test: A target moves in an unpredictable manner. A joystick is used to move the
crosshairs into the center of the target and “fire.” Scored on accuracy and “time to fire.”
Multilimb coordination
Target tracking test 2: The subject manipulates two sliding resistors to control movement of
the crosshairs: one in the horizontal plane and the other in the vertical plane.
Number operations
Number memory test: Successive arithmetic operations appear on the screen until a solution is
presented and the subject must indicate whether the solution presented is correct.
Movement judgment
Cannon shoot test: A “cannon” appears and can “fire” a shell, which travels at a constant
speed, at a moving target. Subject must fire so that the shell intersects the target.
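As a concrete illustration of the tracking score described above — the mean crosshair-to-target distance across repeated samples — here is a minimal sketch. The coordinates and number of samples are invented; the actual battery sampled positions several times per second on dedicated response hardware.

```python
# Illustrative sketch of the target tracking accuracy score: mean Euclidean
# distance between crosshairs and target over repeated samples (lower = better).
# The coordinate data are invented, not from the Project A battery.
import math

def tracking_accuracy(crosshair_positions, target_positions):
    """Mean crosshair-to-target distance across paired position samples."""
    dists = [math.dist(c, t) for c, t in zip(crosshair_positions, target_positions)]
    return sum(dists) / len(dists)

crosshairs = [(0, 0), (3, 4), (6, 8)]   # sampled crosshair positions
target     = [(0, 0), (0, 0), (0, 0)]   # sampled target positions

print(tracking_accuracy(crosshairs, target))  # mean of 0, 5, 10 → 5.0
```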
TABLE 40.2
Experimental Battery: Composite Scores and Constituent Basic Scores

ASVAB Composites (4)
  Quantitative: Math knowledge; Arithmetic reasoning
  Technical: Auto shop; Mechanical comprehension; Electronics information
  Speed: Coding speed; Number operations
  Verbal: Word knowledge; Paragraph comprehension; General science

Spatial Test Composite (1)
  Spatial: Assembling objects test; Object rotation test; Maze test; Orientation test; Map test; Reasoning test

Computer-Administered Test Composites (8)
  Psychomotor: Target tracking 1 distance; Target tracking 2 distance; Cannon shoot time score; Target shoot distance
  Movement time: Pooled movement time
  Perceptual speed: Perceptual speed & accuracy (DT); Target identification (DT)
  Basic speed: Simple reaction time (DT); Choice reaction time (DT)
  Perceptual accuracy: Perceptual speed & accuracy (PC); Target identification (PC)
  Basic accuracy: Simple reaction time (PC); Choice reaction time (PC)
  Number speed and accuracy: Number speed (operation DT); Number memory (PC)
  Short-term memory: Short-term memory (PC); Short-term memory (DT)

ABLE Composites (7)
  Achievement orientation: Self-esteem; Work orientation; Energy level
  Leadership potential: Dominance
  Dependability: Traditional values; Conscientiousness; Nondelinquency
  Adjustment: Emotional stability
  Cooperativeness: Cooperativeness
  Internal control: Internal control
  Physical condition: Physical condition

AVOICE Composites (8)
  Rugged outdoors: Combat; Rugged individualism; Firearms enthusiast
  Audiovisual arts: Drafting; Audiographics; Aesthetics
  Interpersonal: Medical services; Leadership guidance
  Skilled/technical: Science/chemical; Computers; Mathematics; Electronic communication
  Administrative: Clerical/administrative; Warehouse/shipping
  Food service: Food service professional; Food service employee
  Protective services: Fire protection; Law enforcement
  Structural/machines: Mechanics; Heavy construction; Electronics; Vehicle operator

JOB Composites (3)
  High job expectations: Pride; Job security; Serving others; Ambition
  Job routine: Routine
  Job autonomy: Autonomy

Note: DT, decision time; PC, proportion correct.
factor analyses guided by considerable prior theory and empirical evidence (McHenry, Hough,
Toquam, Hanson, & Ashworth, 1990; Peterson et al., 1990). A final set of 31 composites was identi-
fied and is shown in Table 40.2.
Collectively, the Experimental Battery and the ASVAB were intended to be a comprehensive
and representative sample of predictor measures from the population of individual differences that
are relevant for personnel selection and classification, and which can be measured in a standardized
fashion at the time of organizational entry.
PERFORMANCE CRITERIA
The general goals of training performance and job performance measurement were to define, or
model, the total domain of performance in some reasonable way and then develop reliable and valid
measures of each major factor. The general procedure for criterion development followed a basic
cycle of a comprehensive literature review, initial instrument construction based on the job analyses
previously described, pilot testing, instrument revision, field-testing, and proponent (i.e., manage-
ment) review. The specific measurement goals were as follows:
INITIAL THEORY
Criterion development efforts were guided by a model that viewed performance as truly multidi-
mensional. For the population of Army entry-level enlisted positions, there were two major types
of performance components: (a) those that reflect specific technical tasks or specific job behaviors
that are not required for other jobs and (b) components that are defined and measured in the same
way for every job (i.e., Army-wide), such as first aid and basic weapons proficiency, contributions to
teamwork, continual self-development, support for the norms and customs of the organization, and
perseverance in the face of adversity. The working model of total performance with which Project
A began viewed performance as multidimensional within these two broad categories.
well. The more substantive differences occur because during their second tour, soldiers begin to
take on substantial and critical leadership responsibilities.
The second-tour job analysis results identified six additional MOS-specific leadership dimen-
sions and three Army-wide leadership dimensions. A set of supervisory performance rating scales
was created to measure the following dimensions: acting as a role model, communication, per-
sonal counseling, monitoring subordinate performance, organizing missions/operations, personnel
administration, and performance counseling/correcting. Because it proved infeasible to collect peer
ratings from second-tour soldiers in CVII, only supervisor ratings were collected in LVII.
On the basis of a review of the literature and consideration of feasibility, two additional methods
were developed for assessing NCO performance. The first was a set of assessment center-like role-
play exercises, and the second was a written situational judgment test.
The format for the simulations was for the examinee to play the role of a supervisor. A trained con-
federate, who also scored the performance of the examinee on several specific dimensions within
the three major tasks, played the subordinate.
longitudinal second-tour samples (CVII and LVII). For the first- (entry-level enlisted) and second-
tour (junior NCOs), the best fitting models determined independently in the concurrent and longi-
tudinal sample were virtually identical. Also, the best fitting model in one sample (i.e., concurrent or
longitudinal) fit the data equally well in the other sample.
Because there were far fewer criterion measures, a similar confirmatory procedure was not used
to model training performance. Instead, expert judgment was used to group the training perfor-
mance criteria into composites that paralleled the latent factors in the first-tour and second-tour
performance models. The expert judgment based factors were then checked against the criterion
intercorrelation matrix estimated from the training validation (LVT) sample. The prescribed model
fit the data better than any alternative.
In summary, the modeling analyses produced a specification of the factor scores (i.e., latent
variables) that defined the latent structure of performance at each of three organizational levels, or
career stages: EOT performance, first-tour job performance, and second-tour job performance. It
was the performance factor scores at each of these three points that constituted the criterion scores
for all subsequent validation analyses.
1. Core Technical Proficiency (CTP): Represents the proficiency with which the soldier per-
forms the tasks that are “central” to the MOS. These tasks represent the core of the job
and are its primary definers. This construct does not include the individual’s willingness to
perform the task.
2. General Soldiering Proficiency (GSP): In addition to the technical content specific to an
MOS, individuals in every MOS also are responsible for being able to perform a variety of
Army-wide tasks (e.g., first aid, land navigation).
3. Effort and Leadership (ELS): This construct reflects the degree to which the individual
exerts effort over the full range of job tasks, perseveres under adverse or dangerous condi-
tions, and demonstrates leadership and support toward peers.
FIGURE 40.3 Models of first- and second-tour job performance and associated scores. (The second-tour model adds a sixth factor, Leadership (LDR), measured by an administrative promotion rate score, the Army-wide leading/supervising ratings factor, individual scores from each of the three role-plays, and the situational judgment test total score.)
4. Maintaining Personal Discipline (MPD): MPD reflects the degree to which the individual
adheres to Army regulations and traditions, exercises personal self-control, demonstrates
integrity in day-to-day behavior, and does not create disciplinary problems.
5. Physical Fitness and Military Bearing (PFB): PFB represents the degree to which the
individual maintains an appropriate military appearance and stays in good physical
condition.
Note that the first two factors are represented by the hands-on work sample test and the job knowl-
edge tests, whereas the last three factors are each represented by rating scales and administrative
measures. Again, this factor solution represented the best fitting a priori model in the concurrent
and longitudinal cohort samples and was cross-validated from one cohort to the other with no loss
in the accuracy of fit. It was a very stable representation. It also led to further developments in per-
formance modeling, most notably the eight-factor model described by J. P. Campbell and his col-
leagues (e.g., Campbell, McCloy, Oppler, & Sager, 1993) and the core technical versus contextual
performance model proposed by Borman and Motowidlo (1993).
component. The other five factors are very similar in content to the first-term model, as shown
in Figure 40.3, which presents the factors and associated basic criterion scores for the fi rst- and
second-term models side-by-side. Elements of the two models that differ are identified in italics.
TABLE 40.3
Zero-Order Correlations of First-Tour Job Performance (LVI) Variables With
Second-Tour Job Performance (LVII) Variables: Weighted Average Across MOS
                                              LVI:CTP          LVI:GSP          LVI:ELS        LVI:MPD        LVI:PFB        LVI:NCOP
LVII: Core technical proficiency (CTP)        .44 (.55/.59)    .41 (.49/.55)    .25 (.30/.33)  .08 (.10/.11)  .02 (.02/.03)  .22 (.26/.29)
LVII: General soldiering proficiency (GSP)    .51 (.60/.68)    .57 (.67/.76)    .22 (.26/.29)  .09 (.11/.12)  −.01 (−.01/−.01)  .19 (.22/.25)
LVII: Effort and achievement (EA)             .10 (.11/.12)    .17 (.18/.20)    .45 (.49/.53)  .28 (.30/.33)  .32 (.35/.38)  .43 (.46/.50)
LVII: Leadership (LEAD)                       .36 (.39/.42)    .41 (.44/.47)    .38 (.41/.45)  .27 (.29/.32)  .17 (.18/.20)  .41 (.44/.48)
LVII: Maintain personal discipline (MPD)      −.04 (−.04/−.05) .04 (.04/.05)    .12 (.13/.15)  .26 (.29/.32)  .17 (.19/.21)  .16 (.18/.20)
LVII: Physical fitness and bearing (PFB)      −.03 (−.03/−.04) −.01 (−.01/−.01) .22 (.24/.27)  .14 (.15/.17)  .46 (.51/.56)  .30 (.33/.36)
LVII: Rating of overall effectiveness (EFFR)  .11 (.14/.16)    .15 (.19/.22)    .35 (.45/.49)  .25 (.32/.36)  .31 (.40/.44)  .41 (.53/.68)
Total pairwise N values range from 333 to 413. Correlations corrected for range restriction. Correlations between matching
variables are in bold. Leftmost coefficients in parentheses are corrected for attenuation in the future criterion. Rightmost
coefficients in parentheses are corrected for attenuation in both criteria. ELS, effort and leadership; NCOP, NCO potential—a
single scale.
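The corrections reported in parentheses in the table note follow standard psychometric formulas (disattenuation for criterion unreliability and Thorndike's Case II range-restriction correction). A minimal sketch, with illustrative reliability and restriction values rather than Project A's actual ones:

```python
import math

def correct_future_criterion(r, ryy):
    """Disattenuate an observed r for unreliability in the future criterion only."""
    return r / math.sqrt(ryy)

def correct_both_criteria(r, rxx, ryy):
    """Disattenuate an observed r for unreliability in both criteria."""
    return r / math.sqrt(rxx * ryy)

def correct_range_restriction(r, u):
    """Thorndike Case II correction; u = unrestricted/restricted predictor SD."""
    return r * u / math.sqrt(1 + r * r * (u * u - 1))

# Illustrative values only (the reliabilities of .64 and .81 are made up)
print(round(correct_future_criterion(0.44, 0.64), 2))      # 0.55
print(round(correct_both_criteria(0.44, 0.81, 0.64), 2))   # 0.61
print(round(correct_range_restriction(0.44, 1.5), 2))      # 0.59
```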
CRITERION-RELATED VALIDATION
TYPES OF INFORMATION
Project A/Career Force produced a very large archive of validation results. The following is a basic
list of the major categories of information.
1. For the CVI and LVI samples (i.e., the prediction of first-tour performance), the basic
data set consisted of the zero-order correlations of each basic predictor score with each of
the five performance factors. This intercorrelation matrix was computed for each MOS in
each of the CVI and LVI samples. Then for each sample, the full prediction equation (e.g.,
each predictor variable was included) was evaluated for each criterion factor, using several
different kinds of predictor weights. For comparative purposes, the estimates were cor-
rected for restriction of range and for criterion unreliability.
2. The incremental validity of each predictor domain over ASVAB was evaluated for pre-
dicting each of the five performance factors in CVI and LVI.
3. The same basic validities and the incremental validities were estimated for the CVII and
LVII samples, except that the six factors in the second-tour (NCO) performance model
were used as criteria.
4. For the LVII sample, the ASVAB, the Experimental Battery, and assessment of perfor-
mance during the first tour were weighted via hierarchical regression to estimate the
incremental validities over ASVAB produced by adding the Experimental Battery first
and then the first-tour performance assessment to the prediction of performance in the
second tour.
880 Handbook of Employee Selection
5. The LVI sample data were used to estimate the overall classification gains (compared to
gains from selection only) when particular predictor batteries were used to make optimal
assignments to MOS.
6. Using a nine-factor measure of the perceived leadership environment for enlisted person-
nel that was administered during CVI, the moderator effects of the perceived leadership
environment on the relationship between cognitive ability and first-tour performance and
the relationship between personality and first-tour performance were estimated.
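The hierarchical-regression logic in item 4 can be sketched with synthetic data: fit nested predictor blocks and track the increase in the squared multiple correlation as each block is added. Everything below (block sizes, effect weights) is invented for illustration; with nested ordinary least-squares models, the in-sample R-squared can only increase as blocks are added.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 from an ordinary least-squares fit (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
n = 500
asvab = rng.normal(size=(n, 4))      # four ASVAB factor scores (synthetic)
exp_batt = rng.normal(size=(n, 6))   # Experimental Battery scores (synthetic)
tour1 = rng.normal(size=(n, 2))      # first-tour performance composites (synthetic)
# Criterion loads on all three blocks plus noise (weights invented)
y = (asvab @ rng.normal(size=4) + 0.5 * (exp_batt @ rng.normal(size=6))
     + 0.7 * (tour1 @ rng.normal(size=2)) + rng.normal(size=n))

r2_a = r_squared(asvab, y)                                   # ASVAB alone
r2_ax = r_squared(np.hstack([asvab, exp_batt]), y)           # + Experimental Battery
r2_ax1 = r_squared(np.hstack([asvab, exp_batt, tour1]), y)   # + first-tour performance
print(round(r2_ax - r2_a, 3), round(r2_ax1 - r2_ax, 3))      # incremental R^2 values
```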
TABLE 40.4
Comparison of Multiple Correlations Averaged Across Batch A MOS When the
ASVAB Factors, the Spatial, Cognitive Computer-Based Scores, ABLE Composites,
and AVOICE Composites Are Used as Predictors
Criterion          ASVAB Factors   Spatial Composite   Computer-Based Scores   ABLE-Based Scores   AVOICE Basic Scores
                   LV/CV/LVII      LV/CV/LVII          LV/CV/LVII              LV/CV/LVII          LV/CV/LVII
CTP                63/63/64        57/55/57            50/53/53                27/26/24            41/35/41
GSP                67/65/63        64/63/58            57/57/48                29/25/19            40/34/29
ELS (AE)           39/31/29        32/25/27            34/26/09                20/33/13            25/24/09
MPD                22/16/15        14/12/15            15/12/12                22/32/06            11/13/06
PFB                21/20/16        10/10/13            17/11/03                31/37/17            15/12/09
LDR (LVII only)    63              55                  49                      34                  35
Results corrected for range restriction and adjusted for shrinkage. Decimals omitted. CTP, core technical proficiency; CV,
concurrent validation; ELS (AE), effort and leadership (achievement and effort); GSP, general soldiering proficiency; LDR,
leadership; LV, longitudinal validation (first-term); LVII, longitudinal validation (second-term); MPD, maintain personal
discipline; PFB, physical fitness and bearing.
incremented ASVAB for core technical components that involved firing hand-operated weapons
(e.g., the TOW missile) (Walker & Rumsey, 2001).
An estimate of the maximum validity attainable from ASVAB plus the Experimental Battery for
predicting Core Technical Proficiency is shown in Table 40.5. The “reduced equation” is comprised
of the four ASVAB factors plus six to eight scores (depending on the MOS) from the Experimental
Battery that were chosen a priori by the psychologist SMEs as the most likely to increment ASVAB.
The results are shown for unit weights, zero-order validity coefficients as weights, and multiple
regression weights with the R values adjusted for shrinkage and corrected for unreliability in Core
Technical Proficiency.
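For readers unfamiliar with shrinkage adjustment: the sample multiple R capitalizes on chance, so an adjusted value estimates what R would be in the population. The sketch below uses the common Wherry formula with made-up values; the chapter's tables cite Rozeboom (1978, formula 8), a related but distinct adjustment.

```python
def wherry_adjusted_r2(r2, n, k):
    """Wherry shrinkage adjustment: estimate of the population R^2 from
    a sample R^2 based on n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up values: sample R = .70 (R^2 = .49), n = 100 cases, k = 10 predictors
adj = wherry_adjusted_r2(0.49, 100, 10)
print(round(adj, 4), round(adj ** 0.5, 2))   # 0.4327 0.66
```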
For the prediction of second-tour performance, as shown in Table 40.4, the estimated validi-
ties of the cognitive ability tests for predicting Core Technical Proficiency and General Soldiering
Proficiency were virtually identical for first tour (2–3 years after enlistment) and for second tour
(6–7 years after enlistment). Overall, the validities did not degrade, as some have speculated they
should (e.g., Henry & Hulin, 1987). The only real change was for predicting Effort and Leadership/
Effort and Achievement, but the nature of this performance dimension in fact changed between first
tour and second tour. For second-tour, the leadership components formed a separate factor.
As noted previously, the first-tour performance factors had substantial correlations with the sec-
ond-tour performance factors and displayed considerable convergent and divergent validity. The
assessment of first-tour performance (i.e., in LVI) also provided considerable incremental validity
over ASVAB and the Experimental Battery for predicting the Effort/Achievement and Leadership
performance factors in the second tour, but not for the Core Technical Proficiency factor, as shown in
Table 40.6. This reflects, in part, the increased importance of these factors for the NCO position.
Estimating potential classification gains from new predictors is a complex issue that depends on
a number of contextual parameters (e.g., quotas, assignment priorities, selection ratio, differential
recruitment costs across MOS, and variability in selection validities across jobs). The Project A
database for LVI was used to model a multivariate normal population, and Monte Carlo procedures
were then used to generate estimates of potential classification gains from using various forms of
the Experimental Battery plus ASVAB. Quotas for the nine MOS were set proportional to 1993
accessions.
There are two cross-validation issues in estimating classification gains. One concerns the weights
used for the predictor equation for each MOS (e.g., ordinary least-squares weights are sample specific
to some degree), and the second concerns the sample specificity of the differential assignments
themselves. During Project A/Career Force, a new index of classification efficiency labeled “mean
average performance” (MAP) was developed and Monte Carlo methods were used to provide an
unbiased estimate of the gains in aggregate performance resulting from classification, as compared
with selection alone (Rosse, Campbell, & Peterson, 2001). Using a test battery for each MOS that
TABLE 40.5
Comparison of LVI Estimates of Maximum Predictive Validity, Averaged Over
Batch A MOS, When Unit Weights, Zero-Order Validities, or Multiple Regression
Weights Are Used to Weight Predictors (Criterion Is Core Technical Proficiency)
Comparison                          Unit Weights   Validity Weights   Adjusted R   Corrected Ra
Full equation (all predictors)      .57            .70                .70          (.77)
Reduced equation (for selection)    .67            .72                .72          (.79)
TABLE 40.6
Multiple Correlations for Predicting Second-Tour Job Performance (LVII)
Criteria From ASVAB and Various Combinations of ASVAB, Selected Experimental
Battery Predictors, and First-Tour (LVI) Performance Measures: Corrected for
Restriction of Range and Criterion Unreliability
Predictor Composite
Adjusted R values from Rozeboom (1978; formula 8). Decimals omitted. A, ASVAB factors (quantitative,
speed, technical, verbal). X, experimental battery (spatial, rugged/outdoors interests from AVOICE, achieve-
ment orientation, adjustment, physical condition, internal control, cooperativeness, dependability, and leader-
ship from ABLE). 1, the LVI “can do” composite (CTP + GSP) for CTP and “will do” composite (ELS +
MPD + PFB) for effort/achievement and leadership. See Table 40.3 for LVI performance factor labels.
was selected, on a priori grounds, to be potentially useful for classification purposes, the aggregate
gain in MAP for Core Technical Proficiency was .14 standard deviation (SD) units if all accessions
must be classified, and .22 SD units if 5% could remain unassigned. In an organization the size of
the Army, such gains would have enormous utility.
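The logic of a classification-gain estimate can be sketched with a toy two-MOS simulation; the actual Project A work modeled nine MOS with a full multivariate normal population and the MAP index, and all numbers below are invented. With equal quotas, assigning the people with the largest predicted-score differences to the job they favor maximizes aggregate predicted performance, and the gain is reported against a quota-respecting random assignment:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                                   # synthetic accession cohort
# Correlated predicted-performance scores for two hypothetical MOS
cov = [[1.0, 0.6], [0.6, 1.0]]
scores = rng.multivariate_normal([0.0, 0.0], cov, size=n)
half = n // 2                                # equal quotas for the two MOS

# "Selection only" baseline: assignment ignores the predictor profile
perm = rng.permutation(n)
random_map = (scores[perm[:half], 0].sum() + scores[perm[half:], 1].sum()) / n

# Classification: with equal quotas, sending the largest (MOS1 - MOS2)
# differences to MOS 1 maximizes aggregate predicted performance
order = np.argsort(scores[:, 0] - scores[:, 1])[::-1]
optimal_map = (scores[order[:half], 0].sum() + scores[order[half:], 1].sum()) / n

gain = optimal_map - random_map              # classification gain in SD units
print(round(gain, 2))
```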
Parallel to Project A, ARI sponsored Project B, which developed the Enlisted Personnel
Assignment System (EPAS). EPAS uses linear programming strategies to make optimal MOS assign-
ments that maximize a specified function (aggregate performance, training cost savings, number of
individuals above a performance minimum, etc.) given a set of constraints to be accommodated
(MOS quotas, priorities, selection ratios, etc.). Together, the Project A database and the EPAS algo-
rithm provided an unparalleled test bed for estimating the effects of various selection and classifica-
tion strategies (Konieczny, Brown, Hutton, & Stewart, 1990).
In addition to the above, many additional data collections and analyses were carried out pertain-
ing to such issues as (a) estimating the differential utility of performance gains across jobs, (b) the
differential criticality of specific performance factors across jobs, (c) the prediction of attrition,
(d) the influence of reward preferences on performance prediction, and (e) the influence of race and
gender on performance assessments. The reader should consult Campbell and Knapp (2001) and the
articles and technical reports referenced therein for the details.
RATING METHOD
The rating method represents a complex process of information processing and social cognition that is
rife with opportunities for biased and error-filled judgments (e.g., Morgeson & Campion, 1997).
It has a bad press. However, Project A rating measures yielded reasonable distributional properties,
had reasonable single-rater reliabilities across cohorts, and produced a factor structure that was highly
replicable (virtually to the second decimal place). One conclusion might be that ratings are a valu-
able measurement method if (a) the dimensions to be rated are carefully defined and meaningful to
the raters, (b) there is considerable rater training, (c) the setting ensures that raters will give sufficient
time and attention to the rating task, and (d) the goals of the rater are commensurate with the goals
of the researchers. That is, the measurement is for research purposes, not operational performance
appraisal, and the raters accept the goal of doing their best to assess performance on the dimensions as
defined by the researchers. Although minimal, such conditions probably go far beyond those found in most studies.
construct simply cannot do it. They must resort to combining the verbal specifications for the indi-
vidual specific factors. Now, anyone can add up the “scores” on the specific factors to obtain a total
score; and a weighted sum may indeed be necessary for certain specific decision purposes (Schmidt
& Kaplan, 1971). MacKenzie, Podsakoff, and Jarvis (2005) referred to such a score as a formative
score, in contrast to a reflective score, which is intended to represent a substantive latent variable. For
confirmatory and reliability estimation purposes, the two require different measurement models
(see Chapter 2, this volume), and MacKenzie et al. (2005) discussed how model misspecifications
can lead to faulty inference.
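The distinction matters in practice: a formative total score is simply defined by how its components are combined, so the weights (not a latent variable) carry the meaning. A minimal illustration with made-up factor scores and weights:

```python
import numpy as np

# Hypothetical standardized specific-factor scores for one soldier
# (labels follow the first-tour model; values are made up)
factors = np.array([0.8, 0.2, -0.1, 0.5, 0.3])   # CTP, GSP, ELS, MPD, PFB

# A formative total score is *defined* by its combination rule; the weights
# come from the decision purpose, not from a latent variable (weights invented)
weights = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
formative = float(weights @ factors)

# Unit weighting is the simplest formative rule
unit = float(factors.sum())
print(round(formative, 3), round(unit, 1))   # 0.375 1.7
```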
ROLE OF PERSONALITY
In retrospect, the development and validation of the ABLE was one of the primary reasons for
the resurgence of research on personality for selection purposes, along with the subsequent meta-
analysis reported by Barrick and Mount (1991) that covered nonmilitary occupations. However, the
ABLE results, which differed in substantive ways between the concurrent and longitudinal valida-
tions, also introduce cautions about the reactivity to experience of some types of items and the
subsequent use of concurrent designs for validating personality measures.
CONCLUSIONS
The experiences of Project A argue again and again for the necessity of placing the measures and
variables used in a particular study within a model of the relevant latent structure that is specified as
well as possible. It facilitates the interpretation of results, the integration of findings across studies,
the identification of future research needs, the use of the findings for unanticipated applications, and
the generalization of findings to other settings.
Continually trying to improve our models of relevant domains, as well as the interrelationships
among them, is as critical for (good) practice as it is for (good) science. We began Project A with high
respect for our practice and our science. The respect for both and the appreciation for how they are
so strongly interrelated were even greater at the end.
REFERENCES
Ackerman, P. L. (1989). Within-task intercorrelations of skilled performance: Implications for predicting
individual differences? (A commentary on Henry & Hulin, 1987). Journal of Applied Psychology, 74,
360–364.
Austin, J. T., Humphreys, L. G., & Hulin, C. L. (1989). A critical reanalysis of Barrett et al. Personnel
Psychology, 42, 583–596.
Barrett, G. V., & Alexander, R. A. (1989). Rejoinder to Austin, Humphreys, and Hulin: Critical reanalysis of
Barrett, Caldwell, and Alexander. Personnel Psychology, 42, 597–612.
Barrett, G. V., Caldwell, M. S., & Alexander, R. A. (1985). The concept of dynamic criteria: A crucial reanaly-
sis. Personnel Psychology, 38, 41–56.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A
meta-analysis. Personnel Psychology, 44, 1–26.
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual
performance. In N. Schmitt, W. C. Borman, & Associates (Eds.), Personnel selection in organizations
(pp. 71–98). San Francisco, CA: Jossey-Bass.
Brogden, H. E. (1946). An approach to the problem of differential prediction. Psychometrika, 11, 139–154.
Campbell, J. P. (1990). An overview of the Army selection and classification project (Project A). Personnel
Psychology, 43, 231–239.
Campbell, J. P., & Knapp, D. J. (2001). Exploring the limits in personnel selection and classification. Mahwah,
NJ: Lawrence Erlbaum.
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt &
W. C. Borman (Eds.), Frontiers in industrial/organizational psychology: Personnel selection and clas-
sification (pp. 35–71). San Francisco, CA: Jossey-Bass.
Fleishman, E. A. (1967). Performance assessment based on an empirically derived task taxonomy. Human
Factors, 9, 349–366.
Ghiselli, E. E. (1973). The validity of aptitude tests in personnel selection. Personnel Psychology, 26,
461–477.
Henry, R. A., & Hulin, C. L. (1987). Stability of skilled performance across time: Some generalizations and
limitations on utilities. Journal of Applied Psychology, 72, 457–462.
Holland, J. L. (1966). The psychology of vocational choice. Waltham, MA: Blaisdell.
Hulin, C. L., Henry, R. A., & Noon, S. L. (1990). Adding a dimension: Time as a factor in the generalizability
of predictive relationships. Psychological Bulletin, 107, 328–340.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Knapp, D. J. (2006). The U.S. Joint-Service Job Performance Measurement Project. In W. Bennett,
C. E. Lance, & D. J. Woehr (Eds.), Performance measurement: Current perspectives and future chal-
lenges (pp. 113–140). Mahwah, NJ: Lawrence Erlbaum.
Konieczny, F. B., Brown, G. N., Hutton, J., & Stewart, J. E. (1990). Enlisted Personnel Allocation System:
Final report (ARI Technical Report 902). Alexandria, VA: U.S. Army Research Institute for the
Behavioral and Social Sciences.
MacKenzie, S. B., Podsakoff, P. M., & Jarvis, C. B. (2005). The problem of measurement model misspecifi-
cation in behavioral and organizational research and some recommended solutions. Journal of Applied
Psychology, 90, 710–730.
McCloy, R. A., Campbell, J. P., & Cudeck, R. (1994). A confirmatory test of a model of performance determi-
nants. Journal of Applied Psychology, 79, 493–504.
McHenry, J. J., Hough, L. M., Toquam, J. L., Hanson, M. A., & Ashworth, S. (1990). Project A validity results:
The relationship between predictor and criterion domains. Personnel Psychology, 43, 335–354.
Morgeson, F. P., & Campion, M. A. (1997). Social and cognitive sources of potential inaccuracy in job analy-
sis. Journal of Applied Psychology, 82, 627–655.
Nathan, B. R., & Alexander, R. A. (1988). A comparison of criteria for test validation: A meta-analytic investi-
gation. Personnel Psychology, 41, 517–536.
Peterson, N. G., Hough, L. M., Dunnette, M. D., Rosse, R. L., Houston, J. S., Toquam, J. L., & Wing, H. (1990).
Project A: Specification of the predictor domain and development of new selection/classification tests.
Personnel Psychology, 43, 247–276.
Reynolds, D. H., Bayless, A., & Campbell, J. P. (2001). Criterion reliability and the prediction of future perfor-
mance from prior performance. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in personnel
selection and classification. Mahwah, NJ: Lawrence Erlbaum.
Rosse, R. L., Campbell, J. P., & Peterson, N. G. (2001). Personnel classification and differential job assign-
ments: Estimating classification gains. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in
personnel selection and classification. Mahwah, NJ: Lawrence Erlbaum.
Schmidt, F. L. (1988). Validity generalization and the future of criterion related validity. In H. Wainer &
H. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychol-
ogy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., & Kaplan, L. B. (1971). Composite vs. multiple criteria: A review and resolution of the contro-
versy. Personnel Psychology, 24, 419–434.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analyses of validity studies published
between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37,
407–422.
Scholarios, D. M., Johnson, C. D., & Zeidner, J. (1994). Selecting predictors for maximizing the classification
efficiency of a battery. Journal of Applied Psychology, 79, 412–424.
Viswesvaran, C., Schmidt, F. L., & Ones, D. S. (1993, April). Theoretical implications of a general factor in
job performance criteria. Paper presented at the 8th Annual Conference of the Society for Industrial and
Organizational Psychology, San Francisco, CA.
Walker, C. B., & Rumsey, M. G. (2001). Application of findings: ASVAB, new aptitude tests, and personnel
classification. In J. P. Campbell & D. J. Knapp (Eds.), Exploring the limits in personnel selection and
classification. Mahwah, NJ: Lawrence Erlbaum.
Wigdor, A. K., & Green, B. F. (1991). Performance assessment for the workplace (Vols. I-II). Washington, DC:
National Academy Press.
Zeidner, J., Johnson, C. D., & Scholarios, D. (1997). Evaluating military selection and classification systems in
the multiple job context. Military Psychology, 9, 169–186.
41 The Dictionary of Occupational Titles and the Occupational Information Network
MILESTONES
The Dictionary of Occupational Titles (DOT; U.S. Department of Labor [DoL], 1991a, 1991b) and
its successor, the Occupational Information Network (O*NET TM; Peterson, Borman, Mumford,
Jeanneret, & Fleishman, 1999), have played a unique role for the United States, providing a single,
standard system for analyzing and presenting information about occupations for a wide variety of
applied uses. Although not perfect (Miller, Treiman, Cain, & Roos, 1980; Committee on Techniques
for the Enhancement of Human Performance: Occupational Analysis, 1999), these systems are both
milestones of behavioral science achievement. Designed at significantly different periods of time
with different technologies and in response to different workplace dynamics, the DOT and the
O*NET have been simultaneously shaped by the nation's societal needs and by the state of the art in
occupational analysis in response to those needs. Industrial-organizational (I-O) psychology has
played a major, but not the only, role in the development of these systems. In this chapter we discuss
the development and implementation of these systems.
ORGANIZING FRAMEWORK
I-O psychologists most often approach occupational analysis from the framework of obtaining
information about a single occupation or a fairly homogeneous set of occupations within one or
a few organizations and for one or a few specific applications, such as employee selection, perfor-
mance appraisal, or compensation. National occupational systems necessarily take a much broader
view, both from the point of view of the population of occupations to be included and the purposes
to be served by the system. The DOT and the O*NET have as their intended population all of the
occupations in the U.S. economy, and their intended uses are more universal in nature; for example,
the provision of job descriptions in narrative terms for general lay use, the development and mainte-
nance of an organizing framework for all occupations, and use in matching applicants with jobs at
state employment agencies across the nation.
The DOT evolved fairly gradually over its history, and its successor, the O*NET, seemingly
represents a sharp departure from the DOT. However, if occupations are viewed as multidimen-
sional variables subject to a broad range of independent variables (e.g., technological, demographic,
cultural, and legal), and occupational analysis as the science and art of describing occupations,
then the DOT and O*NET do not appear so different. Figure 41.1 shows one way of depicting
FIGURE 41.1 Framework for conceptualizing the changing nature of work and occupational analysis. (From
the National Academy of Sciences, The changing nature of work, National Academies Press, Washington,
DC, 1999. With permission.)
occupations and occupational analysis in a larger causal network (from Committee on Techniques
for the Enhancement of Human Performance: Occupational Analysis, 1999, p. 15). The independent
variables shown there (law, demographics, markets, technology) wax and wane in terms of their
effect on the nature and structure of occupations and occupational analysis systems. Research on
and experience in the conduct of occupational analysis itself obviously more directly affects the
nature of occupational analysis systems.
A matrix is a convenient device to conceptualize occupational analysis in general. Figure 41.2
shows a matrix with rows representing single occupations or categories of more than one occu-
pation (e.g., file clerk, police officer, retail sales representative) and columns representing vari-
ous attributes of occupations (e.g., tasks, abilities, skills, education required). The main focus
of I-O psychologists has been to define and measure the various types of attributes best used to
describe occupations—tasks, skills, and abilities. An example of this might be the Position Analysis
Questionnaire (McCormick, Jeanneret, & Mecham, 1972). Economists, sociologists, and national
government agencies like the DoL tend to focus at a more macro level on the larger structure of
occupations. Here the focus is on the number of occupations, their titles, and the best ways to group
the occupations for statistical summaries; for example, the Standard Occupational Classification
(U.S. Department of Commerce, 1980). The cell entries in the matrix represent "scores" for each
occupational category on each occupational attribute, however they might be obtained. Finally, note
that if hierarchical arrangements are imposed on the rows or columns, or both, then various aggre-
gations of the occupational information can be obtained and used for different purposes.

FIGURE 41.2 Matrix representation of occupational analysis. (Adapted from the National Academy of
Sciences, The changing nature of work, National Academies Press, Washington, DC, 1999. With permission.)
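The occupation-by-attribute matrix, with a hierarchical grouping imposed on the rows, can be sketched as follows (occupations, attributes, scores, and groups are all illustrative):

```python
import numpy as np

# Rows: occupations; columns: occupational attributes (all values invented)
occupations = ["file clerk", "police officer", "retail sales representative"]
attributes = ["data complexity", "people complexity", "things complexity"]
scores = np.array([
    [5.0, 2.0, 3.0],
    [4.0, 6.0, 4.0],
    [3.0, 5.0, 2.0],
])

# A hierarchical grouping imposed on the rows permits aggregation of the
# occupational information for other purposes (groups are invented)
groups = {"clerical/sales": [0, 2], "protective service": [1]}
group_means = {name: scores[rows].mean(axis=0) for name, rows in groups.items()}
print(dict(zip(attributes, group_means["clerical/sales"])))
```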
As noted above, the DOT and the O*NET intentionally incorporate the population of the rows
and the columns of this matrix. Although many of the occupational analyses conducted by individual
I-O psychologists for research or applied purposes cover the population of the columns (attributes) or
a large part of that population, almost none cover the population of the rows (occupations). Instead,
they focus on a single row or a few such rows of this matrix. The extent to which an occupational
analysis system (e.g., the DOT or the O*NET) adequately includes, defines, and measures the
elements in the populations of rows and columns ultimately determines its value in meeting societal
needs.
DOT
In the 1930s, the United States was in the throes of the Great Depression. One of the governmental
responses to this crisis was the Wagner-Peyser Act of 1933, an act “to provide for the establishment
of a national employment system and for cooperation with the States in the promotion of such
system, and for other purposes.” This act established the U.S. Employment Service (USES) to “assist
in coordinating the State public employment services throughout the country and in increasing their
usefulness by developing and prescribing minimum standards of efficiency … [and] promoting
uniformity in their administrative and statistical procedure.” Funds allocated were to be used, among
other things, for “job search and placement services to job seekers including counseling, testing,
occupational and labor market information, assessment, and referral to employers” (Section 3(a)).
Each state service pursued these ends in their own ways, creating difficulties in communication
between state services and between the states and the USES (Droege, 1988). In reaction, a panel
including several prominent psychologists was formed to oversee the development of an occupational
research program. In the five years from 1934 to 1939, over 50,000 individual job analyses were
carried out using procedures initially developed at the Minnesota Stabilization Research Institute
(Paterson & Darley, 1936). These analyses were conducted in work sites via job observation and
interview of job holders and supervisors.
The first edition of the DOT (DoL, 1939), including 17,500 occupations, was primarily based
on those analyses. The content of these job analysis reports, and derivative ratings of other occu-
pational attributes based on those reports, constituted the DOT attributes, or columns, of our con-
ceptual matrix shown in Figure 41.2. The initial set of DOT “rows” in the idealized occupational
matrix were the 17,500 occupations organized into 550 occupational groups that “reflected occupa-
tional relationships based on work performed, job content, and skill level” (Droege, 1988, p. 993).
In essence, this method of (a) conducting on-site job analyses employing observation and interview
techniques, resulting in (b) a written job analysis report (often referred to as a JAR), which was
the basis for (c) additional ratings of job attributes by job analysts (who may or may not have been
involved in the initial job analysis), remained constant over the entire history of the DOT.
Droege (1988) summarized the somewhat checkered organizational history of the occupational
research program at the USES and the state employment services up to the mid-1980s. Cutbacks
in national office staff after World War II, in the early 1950s, and in the early 1980s led to vari-
ous forms of Federal-state cooperative organizations responsible for conducting the occupational
research necessary to update and maintain the DOT. An important component of these organizational
systems was the Occupational Analysis Field Centers (OAFCs), each based in a state (the number of
these centers varied over time, ranging from 4 to 12), which were primarily responsible for DOT
updating activities. Over this time period the second (DoL, 1949), third (DoL, 1965), and
fourth (DoL, 1980) editions of the DOT were produced as well as various supplements and other
890 Handbook of Employee Selection
publications. In 1991, a revision of the fourth edition was published (DoL, 1991a). Each of these edi-
tions had changes in the occupational groupings and associated coding systems and in the informa-
tion or “attributes” about the occupations; these changes were primarily incremental in nature.
Among the occupational attributes recorded for each DOT occupation were the following:
• Worker functions: A set of three ratings depicting the worker’s level of involvement with
the famous “data,” “people,” and “things” (Fine, 1968a, 1968b, 1988; DoL, 1972)
• Work fields: One chosen from a set of 100 technological and socioeconomic objectives,
such as “material moving,” “cooking-food preparing,” and “researching” with associated
definitions and verbs
• Work devices: A list of machines, tools, equipment, and work aids used on the job
• Materials, products, subject matter, and services (MPSMS): One or more of these chosen
from an organization of 48 groups with 328 constituent categories, such as a “marine life”
group with “finfish,” “shellfish,” and “other marine life” categories
1 Notably, the number of occupations will vary considerably because of the deletion, combining, or adding of occupations.
The Dictionary of Occupational Titles and the Occupational Information Network 891
Collectively then, the hierarchical, occupational category system embedded in the DOT code and
the definitions and ratings of the occupational attributes constitute the actual “rows” and “columns”
for the DOT when it is thought about in terms of our idealized occupational analysis matrix. For the
fourth edition, there were 12,099 rows or occupations and about 100 “occupational attribute” scores,
depending on how you count, in addition to the textual definition of the occupation itself. The job
analysis process that generated this information comprised activities such as the following:
• Preparing for job analysis studies: Researching the industry, selecting establishments,2
contacting establishments, initial meeting with establishment officials, arranging for an
authorized study, obtaining establishment data to facilitate job analysis, preparing a pre-
liminary staffing table, taking the establishment tour
• Conducting job analysis studies: Establishing a job analysis study schedule, the
observation-interview job analysis method, determining number of jobs, meeting with
establishment official at end of study
• Writing job description: Parts of a job description, determining detail needed in job and
task descriptions, breaking a job down into tasks, flag statements, estimating time percent-
ages of tasks, preparing job summaries
The primary product of this process was the Job Analysis Report (JAR). As Droege (1988) stated,
“The job analysis report serves the dual purpose of structuring the job analysis and recording the
resulting data” (p. 1007). He also provided an example of a completed JAR. In it, the analyst docu-
mented the basic identifying information (title, industry, Standard Occupational Classification code,
and a single sentence summary of the job), the work devices used; the MPSMS provided; informa-
tion about the type and length of training, licenses required, methods of promotion, and supervision
on the job; and the data collection methods used in the job analysis. With this basic information, the
analyst then made ratings of all those attributes described above under structure and content (worker
functions, GED, aptitudes, etc.).
2 It appears that sampling of establishments was not done with any formal sampling technique. Droege (1988) indicated the
job analysis studies were done in “a number of establishments representing key segments of all industries” (p. 1003), and these
were selected from lists drawn up by job analysts on the basis of their experience in industry research, industrial directories,
and classified pages of telephone directories. Factors that influenced the actual selection of an establishment included type
of product or service, number of employees, location, and policy or history of the establishment in allowing studies.
Nongovernmental Use
The NRC report noted that the fourth edition of the DOT had sold over 100,000 copies in the first
fourteen months after its publication in 1977, compared to over 40,000 copies of the third edition
in the comparable time period. (The third edition sold almost 150,000 copies in total.) These were
copies directly purchased from the DoL, and were over and above the approximately 35,000 copies
distributed free of charge to all government agencies. In their survey of purchasers, they found that
37% were in career or vocational counseling work, 18% did library work, and 15% did management/
compensation work. Eight percent were in employment placement whereas 7% each were in voca-
tional education, rehabilitation counseling, and work having to do with projections/occupational
information dissemination. Five percent were purchased by teachers/researchers (5000 copies!).
(Figures are from Table 4-2, Miller et al., 1980, p. 52). It is clear that counselors, librarians, and
placement workers heavily used the DOT to get their work done, and it also played a major role for
personnel workers in business.
Most users (88%) reported that discontinuance of the DOT would have at least a minor disrup-
tive effect on their work, and 50% or more of teacher/researchers (67%), occupational information
disseminators (58%), vocational educators (50%), and rehabilitation counselors (50%) reported that
it would have a major disruptive effect (Table 4-5, Miller et al., 1980, p. 56).
This brief review demonstrates the degree of penetration of the DOT into the U.S. occupational
information realm by 1980. Still, despite its widespread use, the DOT was not without criticism, and
its shortcomings were well documented by the NRC, whose general recommendations included the following:
• The occupational analysis program should concentrate on the fundamental activity of job
analysis and research and development strategies for improving procedures, monitoring
changes in job content, and identifying new methodologies.
• A permanent, professional research unit of high quality should be established to conduct
such research and improvement efforts.
• An outside advisory committee should be established, representing employers, unions,
relevant academic disciplines, and the public, to review and make recommendations about
occupational analysis program activities.
Note that these recommendations are primarily organizational in nature. As briefly mentioned ear-
lier, the way in which the DOT job analysis information had been collected over the years was
continually subject to the vagaries of funding fluctuations, which led to organizational changes,
especially in the responsibility for collecting job analysis information. These recommendations
seem responsive to that history. Unfortunately, the recommendations were not followed as assidu-
ously as they might have been.
More specific recommendations were made in the areas of data collection procedures, measure-
ment of occupational characteristics, classification, other research, and organization and adminis-
tration (a total of nineteen such recommendations). A few of these are paraphrased below to give
the flavor of the full set.
• Continue on-site observation and interviews as the major mode of data collection, but
experiment with other methods.
• Select establishments and work activities for job analysis according to a general sampling
plan tailored to occupational analysis. (As mentioned in a footnote above, selection of
establishments was done primarily by analysts within industries assigned to the different
field centers with virtually no overall guidance on how to study or select establishments
within industries.)
• Devise procedures to monitor and identify changes in the job content of the economy.
• Review and replace, if appropriate, the worker trait and worker function scales. (The
committee argued that these measures did not reflect “conceptually central aspects of
occupational content,” they were not developed in accord with current psychometric
standards and had lower than desirable reliability, and some of the scales had limited
use.)
• Investigate cross-occupational linkages indicating possible transferability of skills or
experience.
• Use a standard classification, compatible with or identical to the standard system
implemented by the Office of Federal Statistical Policy and Standards (or its successor
developers).
• Give research priority to developing criteria for defining occupations—the aggregation
problem (i.e., methods for determining the homogeneity or heterogeneity of a target occu-
pation, and, therefore, the number and/or sources of individual job analyses that it might be
necessary to study and “aggregate” to adequately describe an occupation).
• Collection and dissemination of occupational information should be continuous, not epi-
sodic (coinciding with production of new editions).
• DOT data collection and development procedures should be documented and publicly
described, and its data made available to the public.
No doubt all of these recommendations were seriously received, and some were at least partially
implemented. However, some of these issues were not squarely faced until the development of a new
kind of occupational information system was seriously contemplated.
the effective education, training, counseling and employment of the American workforce. The DOT
should be restructured to accomplish its purpose by providing a database system that identifies, defines,
classifies and describes occupations in the economy in an accessible and flexible manner. Moreover,
the DOT should serve as a national benchmark that provides a common language for all users of
occupational information. (APDOT, 1993, p. 13)
APDOT’s specifications for a new occupational analysis system included the following themes:
• Use of a common occupational classification system that facilitates linkage to other occu-
pational and labor-market databases and represents collapsing, merging, and restructuring
of the workplace with fewer than the 12,000 DOT occupations
• Multiple windows of job descriptors intended for different user needs
• A “common language” of descriptors useful across occupations in their definition and
scaling
• Improved data collection, with more systematic sampling and use of structured job analy-
sis questionnaires
• Hierarchical organization so that users could ease into the system from the top down or
aggregate from the bottom up
• A relational electronic database that could be queried dynamically in a manner that facili-
tates access for a variety of users
Worker Characteristics
These cross-occupation descriptors were conceptualized to represent the enduring characteristics
of individuals that are relevant to job performance and/or the capacity to acquire knowledge and
skills necessary for work (National Center for O*NET Development DoL, 2007). The abilities for
the prototype and the current version of O*NET include 52 descriptors that address enduring human
capabilities that are not substantially affected by experience (e.g., Oral Comprehension, Deductive
Reasoning, Spatial Orientation, and Near Vision) originating from work by Fleishman (1975). The
occupational interests are six descriptors based on Holland’s (1976) taxonomy of interests (i.e.,
Realistic, Investigative, Artistic, Social, Enterprising, and Conventional) that represent preferences
for work environments and outcomes.
[Figure 41.3 arrays the content-model domains along two axes: worker-oriented versus job-oriented,
and cross-occupation versus occupation-specific, with labor market information and occupational
outlook constituting the workforce characteristics.]
FIGURE 41.3 Occupational Information Network (O*NET) content model. (Adapted from the O*NET
Center Online.)
The work values are 21 descriptors consisting of specific
needs that the work can satisfy originating from Dawis and Lofquist’s (1984) Theory of Work
Adjustment. They include occupational reinforcers such as Ability Utilization, Recognition, and
Autonomy.
Worker Requirements
These cross-occupation descriptors are work-related attributes that can be developed by educa-
tion, training, and experience (National Center for O*NET Development DoL, 2007). Basic skills
are 10 developed capabilities that facilitate acquisition of information and learning (e.g., Writing,
Mathematics, Critical Thinking). Cross-functional skills are currently 25 developed capacities rele-
vant to the performance of activities occurring across occupations. The prototype included 36 cross-
functional skills that were restructured for the current version of O*NET (Mumford, Peterson, &
Childs, 1999). The knowledges are 33 descriptors that organize sets of principles and facts applying
to general domains of knowledge (e.g., Economics and Accounting, Mechanical, Biology, and Law
and Government). Education covers educational experiences required to perform an occupation in
terms of level (i.e., high school, vocational school, college, etc.) and subjects (e.g., English/Language
Arts, Basic Math, Humanities, etc.). The educational portion of the current O*NET is fairly
similar to its prototype version. The work styles are 16 descriptors that represent personal charac-
teristics relevant to how well a person performs work, traditionally referred to as personality and/or
temperament variables. The prototype included a list of 17 descriptors that underwent only minor
modifications in the operational version.
Experience Requirements
These descriptors are both cross-occupation and occupation-specific (National Center for O*NET
Development DoL, 2007). They begin with an indication of the amount of different types of experi-
ence required to be hired (e.g., related work experience, on-the-job training, apprenticeship). Next
are the basic and cross-functional skills required to be hired for the occupation. These are the same
skills referred to in worker requirements. Finally, this part of the content model includes licenses,
certificates, and registrations relevant to the occupation.
Occupational Requirements
These work-oriented descriptors describe the requirements of each occupation
(National Center for O*NET Development DoL, 2007). The generalized work activities (GWAs) in
the current version of O*NET include 41 descriptors that address general work behaviors that occur
across occupations (e.g., Getting Information, Analyzing Data or Information, Performing General
Physical Activities, and Resolving Conflicts and Negotiating with Others). The GWA taxonomy was
substantially influenced by the theory behind and content of the PAQ (McCormick et al., 1989) for
nonsupervisory jobs and, for supervisory and management jobs, by several managerial dimension
systems summarized in Borman and Brush (1993). The prototype GWAs included 42 descriptors
that are not very different from the current version (Jeanneret, Borman, Kubisiak, & Hanson,
1999).3
Organizational context descriptors reflect characteristics of organizations in which the occupa-
tions are embedded. They cover a variety of areas such as the level of employee empowerment,
skill variety, reward systems, and organizational culture. There were some adjustments between the
prototype and current version of organizational context items (Arad, Hanson, & Schneider, 1999).
Finally, the work context descriptors cover physical and social factors that influence the work envi-
ronment. The current version has 57 work context descriptors. They are organized somewhat
differently than in the prototype version but are otherwise fairly similar (Strong, Jeanneret, McPhail,
Blakely, & D’Egidio, 1999).
Workforce Characteristics
This part of the content model contains information from the BLS regarding wages, level of employ-
ment, and employment outlook for each occupation (National Center for O*NET Development DoL,
2007).
Occupation-Specific Information
This portion of the content model serves as the primary location of occupation-specific information
(National Center for O*NET Development DoL, 2007). It includes tasks, tools, and technology rel-
evant to each occupation. The tasks are work behaviors more specific than GWAs and detailed work
activities. The O*NET prototype included tasks, but not tools and technology.
Hierarchical Structure
It is important to note that the descriptors in each domain of the content model represent the level at
which data are collected and sit at the most specific level of a hierarchical structure. For example,
the ability Deductive Reasoning is one of the “Idea Generation and Reasoning” abilities that belong
to the “Cognitive” abilities, which in turn belong to the “Abilities” domain. Finally, the abilities are
a domain of descriptors belonging to the “Worker Characteristics” (see Figure 41.3).
3 The current version of the O*NET content model also includes detailed work activities consisting of statements describing
work within particular GWAs.
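The hierarchy just described can be represented as a simple nested structure. The sketch below is illustrative only; the category and descriptor names come from the text, but the data structure and helper function are assumptions for exposition, not part of O*NET's own tooling.

```python
# A minimal sketch of the content-model hierarchy described above.
# Names come from the chapter text; the structure itself is an illustration.
content_model = {
    "Worker Characteristics": {
        "Abilities": {
            "Cognitive": {
                "Idea Generation and Reasoning": ["Deductive Reasoning"],
            },
        },
    },
}

def path_to(tree, target, trail=()):
    """Return the chain of ancestor categories for a leaf descriptor."""
    for name, subtree in tree.items():
        if isinstance(subtree, dict):
            found = path_to(subtree, target, trail + (name,))
            if found:
                return found
        elif target in subtree:
            return trail + (name,)
    return None
```

Calling `path_to(content_model, "Deductive Reasoning")` walks the tree from "Worker Characteristics" down through "Abilities" and "Cognitive" to "Idea Generation and Reasoning," mirroring the example in the text.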
OTHER REVISIONS
Other revisions to the O*NET’s prototype involved adjustments to the rating scales (i.e., data collec-
tion instruments) and the content model itself on the basis of postprototype reviews and additional
research (e.g., Hubbard et al., 2000). Significant changes to the content model are discussed in the
content model discussion above. On the basis of analyses of the prototype data and other research,
the basic structure of the descriptor rating scales was modified. For example, the prototype GWA
questionnaire presented a level, an importance, and a frequency scale for each GWA. This
was changed so that the importance rating comes first, and a level rating is made only if the
importance rating is greater than 1 (“Not Important”). Finally, the frequency scale was deleted. Figure 41.4
shows these scales for the first GWA in the current questionnaire. These changes were designed to
improve the reliability and validity of ratings partly by making the rating process more straightfor-
ward and reducing the raters’ burden.
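The revised rating logic can be sketched as follows. This is an illustrative reconstruction of the skip rule described above, not O*NET's actual data collection software; the function name and return format are assumptions.

```python
def collect_gwa_rating(importance, level=None):
    """Record one GWA rating: importance (1-5) comes first, and a level
    rating (0-7) is kept only when importance exceeds 1 ("Not Important")."""
    if not 1 <= importance <= 5:
        raise ValueError("importance must be on the 1-5 scale")
    if importance == 1:
        # Level is skipped per the questionnaire instructions.
        return {"importance": 1, "level": None}
    if level is None or not 0 <= level <= 7:
        raise ValueError("a 0-7 level rating is required when importance > 1")
    return {"importance": importance, "level": level}
```

The conditional structure is the point: making level contingent on importance removes one rating for every activity a respondent judges unimportant, which is how the redesign reduced rater burden.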
[Figure 41.4 shows the questionnaire item: the importance rating appears first, followed by the
instruction “If you marked not important, skip LEVEL below and go on to the next activity,” and a
1–7 level scale anchored at “Highest level.”]
FIGURE 41.4 The first GWA from the current O*NET incumbent data collection questionnaires. (Adapted
from the O*NET Data Collection Program: Office of Management and Budget Clearance Package Supporting
Statement and Data Collection Instruments, 2008.)
For incumbent ratings, the mean level rating for a particular descriptor on a particular scale for a
particular occupation is flagged for suppression if the mean importance rating is too low. Also, mean
importance and level ratings are flagged if there is too little agreement among raters. With these
ratings collected for 480 of the 812 occupations, the median importance and level interrater
reliabilities were .87 and .90, respectively, on the basis of eight raters per occupation.4
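The suppression rules just described can be sketched as a pair of checks on the distribution of ratings. The threshold values below are hypothetical, chosen purely for illustration; O*NET's published flagging rules differ in detail.

```python
from statistics import mean, stdev

def flag_for_suppression(importance_ratings, level_ratings,
                         min_importance=1.5, max_sd=1.5):
    """Flag a descriptor's mean ratings, in the spirit of the rules described
    in the text. Thresholds are hypothetical, not O*NET's actual values."""
    flags = set()
    if mean(importance_ratings) < min_importance:
        # Importance too low to trust (or publish) the level rating.
        flags.add("suppress_level")
    if stdev(importance_ratings) > max_sd or stdev(level_ratings) > max_sd:
        # Too little agreement among raters.
        flags.add("low_agreement")
    return flags
```

With eight incumbent raters per occupation, checks like these determine which mean ratings reach the published database and which are withheld.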
Soon the update phase of O*NET’s data collection efforts will be complete, and a database
consisting of incumbent ratings for most of the descriptors (analyst ratings for the abilities and skills)
for 812 O*NET-SOC occupations will be available (O*NET Center, 2007a). Another planned effort
includes collecting incumbent/analyst data on occupations that are economically important, pro-
jected to grow in employment, or are being substantially affected by technology and innovation
(Tsacoumis, 2007). Additionally, data are being collected from occupational experts on new and
emerging occupations.
O*NET DATABASE
The online database is a publicly available, free, and very user-friendly tool (O*NET Center, 2007b;
see http://online.onetcenter.org). It can be accessed in several different ways. For example, O*NET
Online supports searching for job descriptions by name, O*NET SOC number, or job family. Once
an occupation is identified, summary, detailed, or customized descriptive reports can easily be gen-
erated depending on desired detail or the domains of the content model on which the user wishes to
focus. Additional services and products offered include self-assessment tools on interests and abili-
ties that individuals can use to match themselves with potential occupations and a downloadable
file version of the O*NET database for use by human resource, counseling, and other professionals
and related researchers.
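Because the downloadable database is distributed as delimited text files, it can be queried with ordinary scripting tools. The sketch below filters an in-memory excerpt; the column names and data values are assumptions modeled on the tab-delimited layout of the O*NET files, not data quoted from them.

```python
import csv
import io

# Hypothetical excerpt imitating the tab-delimited layout of an O*NET data file;
# column names and values are illustrative assumptions.
sample = (
    "O*NET-SOC Code\tElement Name\tScale ID\tData Value\n"
    "39-9011.00\tDependability\tIM\t4.68\n"
    "39-9011.00\tSelf-Control\tIM\t4.56\n"
    "39-9011.00\tInnovation\tIM\t2.85\n"
)

rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
# Keep work styles whose mean importance rating is at least 4.5 on the 1-5 scale.
important = [r["Element Name"] for r in rows
             if r["Scale ID"] == "IM" and float(r["Data Value"]) >= 4.5]
```

In practice an HR professional or researcher would point the reader at a downloaded file rather than an in-memory string; the filtering logic is the same.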
APPLICATIONS OF O*NET
Frequency of Use
In 2008 the DoL ETA reported information regarding usage of O*NET products. At that time,
O*NET Online (http://online.onetcenter.org) was being visited an average of 775,000 times per
month. As of September 2007, 2,179 individuals and organizations provided certifying information
indicating their intent to use O*NET data and other products (DoL ETA, 2008). These users repre-
sented categories such as educational services, computer system design and program services, and
government and public administration. At that time, 5,309 websites were linked to O*NET Online.
Between January 2002 and September 2007 there were 59,232 downloads of the O*NET database.
O*NET has also been referred to extensively in literature. As of September 2007, DoL ETA (2008)
showed 58 journal publications and 35 books and book chapters referencing O*NET.
General Application
There are several general descriptions of O*NET discussing its characteristics and potential usefulness.
Early descriptions were written by, or with contributions from, researchers who worked on O*NET’s
development (e.g., Committee on Techniques for the Enhancement of Human Performance: Occupational
Analysis, 1999; Peterson et al., 1999; Peterson et al., 2001) and focused to a considerable extent on
how well O*NET addressed APDOT’s original mandate. Other authors discussing the overall area
of job analysis also gave O*NET considerable attention (e.g., Sackett & Laczo, 2003). In a job
analysis textbook, Brannick et al. (2007) provided a multipage description of O*NET, mentioning its
advantages relative to matching people to jobs and comparing jobs on cross-occupational attributes
4 On the basis of (a) a study conducted by Tsacoumis & Van Iddekinge (2006) supporting the validity of analyst skill
ratings (collected via the same method as analyst ability ratings), (b) the idea that some skills might also be better under-
stood by analysts, and (c) practical considerations (i.e., cost), the U.S. Department of Labor ETA (2008) has decided to
use analyst instead of incumbent skill ratings.
TABLE 41.1
A Comparison of Some Occupations in Terms of the Level of GWA
Performing General Physical Activitiesa
O*NET-SOC Occupation: O*NET Level Rating
Tree trimmers and pruners: 88
Light or delivery service truck drivers: 84
Carpet installers: 77
Pest control workers: 66
Short-order cooks: 54
Home health aides: 44
Childcare workers: 37
Security guards: 24
Telemarketers: 16
O*NET-SOC titles and level ratings from O*NET Online 12.0 (2008).
a Performing general physical activities is defined in the O*NET content model as performing
physical activities that require considerable use of your arms and legs and moving your whole
body, such as climbing, lifting, balancing, walking, stooping, and handling materials.
(i.e., descriptors). These authors also indicate that the O*NET content model and database could be
useful as (a) a source of information for developing competency models, (b) generating descriptions
of jobs for the purpose of identifying characteristics to assess with selection instruments, and (c)
serving as an early step in the development of more detailed job analysis information.
Tables 41.1 and 41.2 show two examples of how O*NET data might be presented and used
for different purposes (O*NET Online 12.0, 2008). Table 41.1 shows the level of the GWA
Performing General Physical Activities (on a scale from 0 to 100)5 needed for the performance of
nine occupations that a high school graduate might consider. A career counselor might use such
information when advising a client at this educational level who has physical fatigue issues. Table 41.1
shows that some occupations require much less physical activity than others. Table 41.2 shows the
work style O*NET descriptors with importance ratings of 50 or greater (on a scale from 0 to 100)6
for childcare workers. It shows the work styles rated to be most important to performance in this
occupation. A human resources (HR) professional might use this information to identify potential
commercially offered personality instruments that could be helpful as part of a selection system for
this occupation.
5 O*NET GWA level ratings are made on a 7-point scale and transformed to a 100-point scale for presentation.
6 O*NET work styles importance ratings are made on a 5-point scale and transformed to a 100-point scale for presentation.
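The transformations mentioned in footnotes 5 and 6 are linear rescalings. A minimal sketch, assuming the level scale runs 0–7 and the importance scale 1–5 (as in O*NET's scales reference), with standardized score = (rating − low) / (high − low) × 100:

```python
def standardize(rating, low, high):
    """Rescale an O*NET rating to the 0-100 presentation scale:
    (rating - low) / (high - low) * 100."""
    return (rating - low) / (high - low) * 100.0

# Illustrative values (not taken from the O*NET database):
level_score = standardize(6, 0, 7)       # a level rating of 6 out of 7
importance_score = standardize(4, 1, 5)  # an importance rating of 4 -> 75.0
```

Under these assumed scale bounds, the published 0–100 numbers in Tables 41.1 and 41.2 map back to the original 7-point and 5-point raw means by inverting the same formula.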
TABLE 41.2
Important Work Styles for Childcare Workers With Importance Ratings of 50 or Higher
Work Style (O*NET Importance Rating): Definition of O*NET Work Style
Dependability (92): Job requires being reliable, responsible, and dependable, and fulfilling obligations.
Self-control (89): Job requires maintaining composure, keeping emotions in check, controlling anger, and avoiding aggressive behavior, even in very difficult situations.
Concern for others (86): Job requires being sensitive to others’ needs and feelings and being understanding and helpful to others on the job.
Integrity (86): Job requires being honest and ethical.
Cooperation (82): Job requires being pleasant with others on the job and displaying a good-natured, cooperative attitude.
Stress tolerance (81): Job requires accepting criticism and dealing calmly and effectively with high-stress situations.
Social orientation (78): Job requires preferring to work with others rather than alone, and being personally connected with others on the job.
Adaptability/flexibility (74): Job requires being open to change (positive or negative) and to considerable variety in the workplace.
Attention to detail (69): Job requires being careful about details and thorough in completing tasks.
Leadership (69): Job requires a willingness to lead, take charge, and offer opinions and direction.
Initiative (68): Job requires a willingness to take on responsibilities and challenges.
Independence (65): Job requires developing one’s own ways of doing things, guiding oneself with little or no supervision, and depending on oneself to get things done.
Persistence (58): Job requires persistence in the face of obstacles.
Innovation (57): Job requires creativity and alternative thinking to develop new ideas for and answers to work-related problems.
Achievement/effort (54): Job requires establishing and maintaining personally challenging achievement goals and exerting effort toward mastering tasks.
Analytical thinking (50): Job requires analyzing information and using logic to address work-related issues and problems.
O*NET descriptors and importance ratings from O*NET Online 12.0 (2008).
Reiter-Palmon, Brown, Sandall, Buboltz, and Nimps (2006) described use of the O*NET content
model to structure and develop a web-based job analysis process that collects specific tasks and
links them to O*NET abilities and skills. Taylor, Li, Shi, and Borman (2008) examined the
transportability of job information across countries. They compared data collected in the United States
(i.e., from the O*NET database) with incumbents’ responses to the O*NET GWA, basic and
cross-functional skills, and work styles questionnaires for three jobs (i.e., first-line supervisor,
office clerk, and computer programmer) in three other countries (i.e., New Zealand, China,
and Hong Kong). Results showed only modest differences, suggesting that this type of job
information is generally transportable across countries.
O*NET has also been used in research applications that may have been somewhat less expected.
For Glomb, Kammeyer-Mueller, and Rotundo (2004), O*NET data were one of the sources of
information used to compare occupations in terms of emotional and cognitive demands and to assess
these demands as factors related to compensation. Crouter, Lanza, Pirretti, Goodman, and Neebe
(2006) demonstrated how O*NET data could be used to understand the work life circumstances
of parents in the family research context. Liu, Spector, and Jex (2005) used O*NET data for the
descriptor Autonomy in the work values domain of the content model as one source of information
to investigate the relationship between job stressors and physical well-being and psychological
strain. In a related effort, Dierdorff and Ellington (2008) showed that significant variance in work-
family conflict can be accounted for by two characteristics of an individual’s occupation (i.e.,
interdependence and responsibility for others). Data from scales in the work context domain of the
content model were used to generate scores on each characteristic for the occupations included in
their analyses.
One particular research effort demonstrates O*NET’s content model as a useful taxonomy of
occupational descriptors. While developing a system for selecting individuals for team settings,
Morgeson, Reider, and Campion (2005) used selected domains from the content model to con-
duct a job analysis and to develop questions for a structured selection interview. With the goal of
identifying the knowledge, skills, abilities, and other attributes (KSAOs) required for successful
performance in team settings, job incumbents completed O*NET skill, ability, and work styles
surveys regarding the level and importance of these descriptors to their work. Responses were used
as part of the job analysis effort to identify KSAOs important to performance in this study’s work
context (i.e., a Midwestern steel corporation). This job analysis information was used to develop
questions for a structured selection interview. Content for a number of these questions was taken
from the O*NET skills and work styles domains (e.g., Active Listening, Social Perceptiveness, Time
Management, Cooperation, and Stress Tolerance).
Another study demonstrates the usefulness of the populated O*NET database. LaPolice, Carter,
and Johnson (2008) used O*NET data in the component validation context to show that O*NET
descriptors can be used to predict literacy scores for O*NET-SOC occupations. Researchers used
mean level ratings for these occupations from four content domains (i.e., abilities, skills, knowledge,
and GWAs). Within each domain they identified descriptors relevant to three kinds of literacy identi-
fied in the National Adult Literacy Survey (NALS): (a) prose (involving text in sources such as editori-
als, news stories, poems, and fiction), (b) document (involving text in sources such as job applications,
payroll forms, maps, tables, etc.), and (c) quantitative (involving arithmetic operations, numbers in
printed materials, etc.). Literacy scores for each occupation were calculated from responses on the
NALS for individuals who identified their occupation in the survey’s background questionnaire.
Results showed that mean NALS prose, document, and quantitative literacy scores for these occupa-
tions were predicted by descriptor level ratings in the O*NET database with multiple correlation
coefficients, corrected for shrinkage, of .81, .79, and .80, respectively. This study also demonstrates
how O*NET data can be used at a level above that represented by the descriptors in each content
domain. LaPolice, Carter, and Johnson (2008) identified a practically useful level of aggregation they
referred to as “literacy” derived from four content domains. Other possibilities in this direction might
include aggregating O*NET descriptors to address important broad competencies other than literacy
(e.g., Interpersonal Skills, Cognitive Flexibility, Technological Awareness, and Leadership).
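The shrinkage correction mentioned above can be illustrated with the Wherry formula for the adjusted squared multiple correlation; this is a generic sketch, and the sample size and predictor count below are invented for illustration, not taken from LaPolice, Carter, and Johnson (2008).

```python
import math

def wherry_adjusted_r(r, n, k):
    """Shrinkage-corrected multiple correlation via the Wherry formula.

    r: observed multiple correlation
    n: number of cases (here, occupations)
    k: number of predictors (descriptor ratings)
    """
    adj_r2 = 1 - (1 - r**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_r2, 0.0))

# Invented illustration: with many cases relative to predictors,
# an observed R of .85 shrinks only slightly.
print(round(wherry_adjusted_r(0.85, 400, 20), 3))  # → 0.841
```

When the number of occupations is large relative to the number of descriptor predictors, shrinkage is small, which is consistent with the sizable corrected correlations reported above.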
Other Uses
Beyond the direct uses of O*NET data and associated tools that the O*NET Center makes available,
the DoL ETA (2007) keeps track of organizations that use O*NET data to design, implement, and
sustain a variety of practical applications that are described in the “O*NET in Action” section of its website.
They focus on matching people to work in different environments from different perspectives.
• Several programs are using O*NET to help students understand and prepare for the world
of work. Toward this end, it helps them identify occupations that match their interests and
capabilities and evaluate the educational, training, and practical experiences required
to prepare for those occupations (e.g., Boys & Girls Clubs of America and New York
CareerZone).
• Other programs focus on assisting individuals to find work matching their skills and other
characteristics. They include interventions that focus on older workers, some of whom
are reentering the workforce, disabled workers, and workers seeking new employment
The Dictionary of Occupational Titles and the Occupational Information Network 905
voluntarily or involuntarily (i.e., because of events such as large layoffs and plant closures).
Typically, O*NET occupation descriptions are combined with national or local labor mar-
ket information to help individuals identify job opportunities and/or potential alternative
occupations (e.g., Alabama’s Comprehensive Labor Market Information System, Maryland
Division of Rehabilitation Services, and Catholic Community Services in Baton Rouge).
• Lehigh Carbon Community College used O*NET and state and national labor market data,
among other tools, to help a curriculum design effort to understand the needs of business
and industry.
• A large staffing service used O*NET occupation descriptions to help understand the needs
of its corporate customers in terms of in-demand skill-sets.
These research and applied examples demonstrate that O*NET is being used. Additionally, it appears
to have reinvigorated research by providing a complete, up-to-date taxonomy of occupations
described by a complete, up-to-date taxonomy of occupational descriptors.
CONCLUSIONS
O*NET and the DOT are two large-scale efforts to provide comprehensive descriptions of all occu-
pations in the U.S. economy. Three questions are relevant to evaluating their success. First, is the
taxonomy of occupations complete and reasonably valid? Second, is the taxonomy of occupational
descriptors complete and are the methods of assessing occupations on those descriptors reasonably
valid? Finally, is the information presented in a manner that is accessible to its users?
As noted at the beginning of this chapter, the DOT and the O*NET were developed in different
eras in response to different societal needs and with different technologies. The DOT was, in our
opinion, a highly successful response to the challenges of its era. Its shortcomings with regard to
these three questions became apparent only as the times changed and are embodied in the content
of the report produced by APDOT (APDOT, 1993), the panel that was specifically charged with
identifying the desired characteristics of a system to replace the DOT. Whether the current version
of O*NET represents an affirmative answer to these questions can be captured by the extent to
which O*NET satisfies APDOT’s charge for a redesigned system. APDOT’s first requirement was
for a new occupation classification system that could be efficiently linked to other occupational and
labor market databases. OMB’s establishment of the SOC (i.e., 949 occupations, 812 of which will
be described in O*NET) as the across-government standard substantially addresses the need for a
comprehensive taxonomy of occupations. A big advantage is that the Bureau of Labor Statistics will
track employment for the same occupations that O*NET describes. The assessment of agreement in
ratings across incumbents within an occupation could and should be used as one way of monitoring
whether an occupational category is and has remained a single “occupation.”
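One simple way to quantify such agreement is a within-group agreement index of the rWG type (James, Demaree, & Wolf, 1984), which compares the observed variance of incumbent ratings to the variance expected if respondents answered randomly; the ratings below are invented for illustration, not O*NET data.

```python
from statistics import variance

def rwg(ratings, scale_points=5):
    """Single-item within-group agreement (rWG-style index).

    Observed rater variance is compared with the variance of a uniform
    (no-agreement) null distribution on an A-point scale: (A**2 - 1) / 12.
    Negative values are conventionally truncated to zero.
    """
    null_var = (scale_points**2 - 1) / 12
    return max(0.0, 1 - variance(ratings) / null_var)

# Invented incumbent importance ratings for one descriptor in one occupation.
print(rwg([4, 4, 5, 4, 4]))  # high agreement, near 1
print(rwg([1, 3, 5, 2, 4]))  # low agreement, truncated to 0.0
```

Occupations whose incumbents show persistently low agreement on core descriptors would be candidates for review as possibly heterogeneous categories.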
The APDOT also stipulated that the new system should include an improved sampling approach
and modern and efficient methods of data collection. The current O*NET procedures that focus on
collecting most descriptor data via incumbent questionnaires and identifying incumbents from a
representative sample of organizations satisfy these requirements. The current system’s favorable
data collection response rates also support the representativeness of its sample.
APDOT emphasized the need for a comprehensive taxonomy of cross occupation descriptors
that is hierarchically organized and describes occupations from multiple windows (i.e., with dif-
ferent content domains). The extensive O*NET content model represents substantial evidence that
these requirements have been met.
Finally, O*NET Online, a completely free-of-charge database accessible through the Internet,
substantially achieves the APDOT vision that the data should be presented in a relational electronic
database that could be queried dynamically in a manner that facilitates access for a variety of users.
The O*NET usage information presented above also documents the high degree of access that has
been achieved.
However, there are two questions that have not been completely addressed. First, can or will
O*NET be sufficiently sensitive to identify the emergence and disappearance of occupations?
Monitoring labor market information and actively seeking to develop descriptions of emerging occu-
pations can address this question. O*NET is currently involved in the latter. The second question is
validity. A great deal of evidence supporting the validity of O*NET data collection instruments was
generated during the prototype development and data collection. Subsequent development efforts
have enhanced its validity. However, more research comparing O*NET data to other sources of
occupational information will further the investigation of its validity and likely identify opportuni-
ties for its improvement.
The O*NET was conceived as an open system, continuously evolving to reflect changes in the
occupational structure and in the ways that occupations can be usefully described. As long as this
stricture is followed, there is ample reason to believe that a national occupational database will
continue to stand as a milestone of I-O psychology’s achievements.
REFERENCES
Advisory Panel for the Dictionary of Occupational Titles (APDOT). (1993). The new DOT: A database of occu-
pational titles for the twenty-first century (Final Report). Washington, DC: Employment Service, U.S.
Department of Labor Employment and Training Administration.
Arad, S., Hanson, M. A., & Schneider, R. J. (1999). Organizational context. In N. G. Peterson, M. D. Mumford,
P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The
development of O*NET (pp. 147–174). Washington, DC: American Psychological Association.
Borman, W. C., & Brush, D. H. (1993). More progress toward a taxonomy of managerial performance require-
ments. Human Performance, 6, 1–21.
Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job and work analysis: Methods, research, and appli-
cations for human resource management. Thousand Oaks, CA: Sage.
Byrum, C. N., & Tsacoumis, S. (2006). O*NET analyst occupational abilities ratings: Analysis Cycle 623
results. Alexandria, VA: Human Resources Research Organization.
Committee on Techniques for the Enhancement of Human Performance: Occupational Analysis. (1999). The
changing nature of work: Implications for occupational analysis. Washington, DC: National Academy
Press.
Converse, P. D., Oswald, F. L., Gillespie, M. A., Field, K. A., & Bizot, E. B. (2004). Matching individuals
to occupations using abilities and the O*NET: Issues and an application in career guidance. Personnel
Psychology, 57, 451–487.
Crouter, A. C., Lanza, S. T., Pirretti, A., Goodman, W. B., & Neebe, E. (2006). The O*NET jobs classification
system: A primer for family researchers. Family Relations, 55, 461–472.
Dawis, R. V., & Lofquist, L. H. (1984). A psychological theory of work adjustment: An individual-differences
model and its applications. Minneapolis, MN: University of Minnesota Press.
Dierdorff, E. C., & Ellington, J. K. (2008). It’s the nature of work: Examining behavior-based sources of work-
family conflict across occupations. Journal of Applied Psychology, 93 (4), 883–892.
Donsbach, J., Tsacoumis, S., Sager, C., & Updegraff, J. (2003). O*NET analyst occupational abilities ratings:
Procedures (DFR-03-22). Alexandria, VA: Human Resources Research Organization.
Droege, R. C. (1988). Department of Labor job analysis methodology. In S. Gael (Ed.), The job analysis hand-
book for business, industry, and government (Vol. 2, pp. 993–1018). New York, NY: Wiley.
Dunnette, M. D. (1999). Introduction. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman
(Eds.), An occupational information system for the 21st century: The development of O*NET (pp. 3–7).
Washington, DC: American Psychological Association.
Dye, D., & Silver, M. (1999). The origins of O*NET. In N. G. Peterson, M. D. Mumford, P. R. Jeanneret, &
E. A. Fleishman (Eds.), An occupational information system for the 21st century: The development of
O*NET (pp. 9–19). Washington, DC: American Psychological Association.
Fine, S. A. (1968a). The use of the Dictionary of Occupational Titles as a source of estimates of educational and
training requirements. Journal of Human Resources, 3, 363–375.
Fine, S. A. (1968b). The 1965 Third Edition of the Dictionary of Occupational Titles—Content, Contrasts, and
Critique. Kalamazoo, MI: Upjohn Institute for Employment Research.
Fine, S. A. (1988). Functional job analysis scales: A desk aid. Milwaukee, WI: Author.
Fleishman, E. A. (1975). Manual for Ability Requirement Scales (MARS). Bethesda, MD: Management
Research Institute.
Fleishman, E. A. (1988). The ability requirements scales. In S. Gael (Ed.), The job analysis handbook for busi-
ness, industry, and government (pp. 917–925). New York, NY: Wiley.
Glomb, T. M., Kammeyer-Mueller, J. D., & Rotundo, M. (2004). Emotional labor demands and compensating
wage differentials. Journal of Applied Psychology, 89(4), 700–714.
Harvey, R. J., & Wilson, M. A. (2000). Yes Virginia, there is an objective reality in job analysis. Journal of
Organizational Behavior, 20, 829–854.
Holland, J. L. (1976). Vocational preferences. In M. D. Dunnette (Ed.), Handbook of industrial and organiza-
tional psychology (pp. 521–570). Chicago, IL: Rand McNally.
Hubbard, M., McCloy, R., Campbell, J., Nottingham, J., Lewis, P., Rivkin, D., & Levine, J. (2000). Revision
of O*NET data collection instruments (Revised Version). Raleigh, NC: National O*NET Consortium;
National Center for O*NET Development; Employment Security Commission.
Jeanneret, P. R., Borman, W. C., Kubisiak, U. C., & Hanson, M. A. (1999). Generalized work activities. In
N. G. Peterson, M. D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information
system for the 21st century: The development of O*NET (pp. 105–125). Washington, DC: American
Psychological Association.
Liu, C., Spector, P. E., & Jex, S. M. (2005). The relation of job control with job strains: A comparison of mul-
tiple data sources. Journal of Occupational and Organizational Psychology, 78, 325–336.
McCormick, E. J., Jeanneret, P.R., & Mecham, R.C. (1972). A study of job characteristics and job dimensions as based
on the Position Analysis Questionnaire (PAQ). Journal of Applied Psychology Monograph, 56, 347–368.
McCormick, E. J., Mecham, R. C., & Jeanneret, P. R. (1989). Technical manual for the Position Analysis
Questionnaire (2nd ed.). Palo Alto, CA: Consulting Psychologist Press.
Miller, A. R., Treiman, D. J., Cain, P. S., & Roos, P. A. (Eds.). (1980). Work, jobs and occupations: A critical
review of the Dictionary of Occupational Titles. Washington, DC: National Academy Press.
Morgeson, F. P., Reider, M. H., & Campion, M. A. (2005). Selecting individuals in team settings: The impor-
tance of social skills, personality characteristics, and teamwork knowledge. Personnel Psychology, 58,
583–611.
Mumford, M. D., Peterson, N. G., & Childs, R. A. (1999). Basic and cross-functional skills. In N. G. Peterson, M.
D. Mumford, P. R. Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st
century: The development of O*NET (pp. 49–69). Washington, DC: American Psychological Association.
National Center for O*NET Development. (2006). Updating the O*NET-SOC taxonomy: Summary and impli-
cations. Retrieved July 23, 2008, from http://www.onetcenter.org/reports/UpdatingTaxonomy.html
National Center for O*NET Development, U.S. Department of Labor. (2007). The O*NET® content model
detailed outline with descriptions. Retrieved January 30, 2008, from http://www.onetcenter.org/dl_files/
ContentModel_DetailedDesc.pdf
O*NET Center (2007a). O*NET center data collection. Retrieved January 30, 2008, from http://www.
onetcenter.org/dataPublication.html
O*NET Center (2007b). O*NET Resource center. Retrieved January 30, 2008, from http://www.onetcenter.org
O*NET Online 12.0. O*NET Center. Retrieved July 30, 2008, from http://online.onetcenter.org/find
O*NET Resource Center. Data publication schedule. Retrieved July 31, 2008, from http://www.onetcenter.org/
dataPublication.html
Paterson, D. G., & Darley, J. G. (1936). Men, women, and jobs. Minneapolis, MN: University of Minnesota Press.
Peterson, N. G., Borman, W. C., Mumford, M. D., Jeanneret, P. R., & Fleishman, E. A. (Eds.). (1999). An
occupational information system for the 21st century: The development of O*NET. Washington, DC:
American Psychological Association.
Peterson, N. G., Mumford, M. D., Borman, W. C., Jeanneret, P. R., Fleishman, E. A., Levin, K. Y., Campion, M. A.
et al. (2001). Understanding work using the occupational information network (O*NET): Implications
for practice and research. Personnel Psychology, 54, 451–492.
Reiter-Palmon, R., Brown, M., Sandall, D. L., Buboltz, C., & Nimps, T. (2006). Development of an O*NET
web-based job analysis and its implementation in the U.S. Navy: Lessons learned. Human Resources
Management Review, 16, 294–309.
Sackett, P. R., & Laczo, R. M. (2003). Job and work analysis. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski
(Eds.), Handbook of psychology: Industrial and organizational psychology (Vol. 12, pp. 21–37).
Hoboken, NJ: John Wiley & Sons.
Strong, M. H., Jeanneret, P. R., McPhail, S. M., Blakely, B. B., & D’Egidio, E. L. (1999). Work context:
Taxonomy and measurement of the work environment. In N. G. Peterson, M. D. Mumford, P. R.
Jeanneret, & E. A. Fleishman (Eds.), An occupational information system for the 21st century: The
development of O*NET (pp. 127–145). Washington, DC: American Psychological Association.
Tsacoumis, S. (2007, May). The feasibility of using O*NET to study skill changes. Paper presented at the
Workshop on Research Evidence Related to Future Skills Demands sponsored by Center for Education
National Research Council, Washington, DC.
Tsacoumis, S., & Van Iddekinge, C. H. (2006). A comparison of incumbent and analyst ratings of O*NET
skills. Alexandria, VA: Human Resources Research Organization.
U.S. Department of Commerce. (1980). Standard Occupational Classification manual. Washington, DC: Author.
U.S. Department of Labor. (1939). Dictionary of occupational titles. Washington, DC: U.S. Government
Printing Office.
U.S. Department of Labor. (1949). Dictionary of occupational titles. (2nd ed.). Washington, DC: U.S. Government
Printing Office.
U.S. Department of Labor. (1965). Dictionary of occupational titles. (3rd ed.). Washington, DC: U.S. Government
Printing Office.
U.S. Department of Labor. (1972). Handbook for analyzing jobs. Washington, DC: U.S. Government Printing
Office.
U.S. Department of Labor. (1980). Dictionary of occupational titles. (4th ed.). Washington, DC: U.S. Government
Printing Office.
U.S. Department of Labor. (1991a). Dictionary of occupational titles. (4th. ed., rev.). Washington, DC: U.S.
Government Printing Office.
U.S. Department of Labor. (1991b). The revised handbook for analyzing jobs. Washington, DC: U.S. Government
Printing Office.
U.S. Department of Labor, Bureau of Labor Statistics. (2004, December). Standard Occupational Classification
(SOC) user guide. Retrieved January 31, 2008, from http://www.bls.gov/soc/socguide.htm
U.S. Department of Labor, Employment and Training Administration. (2005, September). O*NET data col-
lection program: Office of management and budget clearance package supporting statement. Retrieved
January 31, 2008, from http://www.onetcenter.org/dl_files/ContentModel_DetailedDesc.pdf
U.S. Department of Labor, Employment and Training Administration. (2007, February). O*NET in action.
Retrieved December 11, 2007, from http://www.doleta.gov/programs/onet/oina.cfm
Wagner-Peyser Act of 1933, P.L. 73-30, 48 Stat. 113, 73d Cong. (1933).
42 Situational Specificity and Validity Generalization
Most investigators would agree that a well-designed test will have nonzero validity against a well-
measured criterion in work contexts that require whatever the test measures for success. If “validity
generalization” were limited to this inference, then there would be no reason for this chapter. Indeed,
the authors of this chapter subscribe to this inference. But validity generalization is not limited to this
inference. Validity generalization (VG) is often in the business of slaying the situational specificity
dragon. A somewhat famous quote from the developers of VG illustrates this point:
The evidence from these two studies appear to be the last nail required for the coffin of the situational
specificity hypothesis. (Schmidt, Hunter, Pearlman, & Rothstein-Hirsch, 1985, p. 758)
In our opinion, the situational specificity hypothesis continues to thrive. It is in support of that opinion
that this chapter is written.
There is little question that soundly measured knowledge, skills, and abilities (KSAs) and
personality generally predict soundly measured, organizationally relevant criteria. In this sense,
validity can be said to generalize. Whether the validity for a given type of predictor (e.g., critical
intellectual skills) against a given class of criterion (e.g., job performance) is generally the same—or
cross-situationally consistent—is another issue. The cross-situational consistency hypothesis has
endured a long history of theoretical and empirical debate, the roots of which can be found in
dialectics such as the person-situation debate (e.g., Epstein, Mischel, & Peake, 1982). The emergence
of meta-analysis as a popular method for testing the consistency of predictive validities across a set
of separate studies accelerated and transformed the debate into one of a more quantitative and
methodological nature.
Basically, meta-analysis made it possible for organizational researchers to apply increasingly
sophisticated quantitative maneuvers so as to accurately assess the predictive validity of test scores
and, more importantly, the consistency of these estimates over studies. Perhaps the best-established
product of these efforts is VG analysis in which validity coefficients for the same, or similar, predictor-
criterion variable pairs from different study samples are assessed in terms of cross-situational
(organizational) consistency (e.g., Pearlman, Schmidt, & Hunter, 1980; Hunter & Schmidt, 1998;
Schmidt & Hunter, 1977; Schmidt, Hunter, & Raju, 1988).
The VG approach attempts to control statistical artifacts that distort validity coefficient esti-
mates, thus enhancing the comparability of these estimates across situations (Hunter & Schmidt,
1998; Schmidt & Hunter, 1977; Schmidt et al., 1993). However, critics argue that VG paints an
incomplete picture of the cross-situational consistency of predictor-criterion pairs (see James,
Demaree, & Mulaik, 1986; James, Demaree, Mulaik, & Ladd, 1992; Algera, Jansen, Roe, & Vijn,
1984). This claim is made in light of two observations: (a) VG analysis rarely accounts for all of the
between-sample variance in validity coefficient estimates, sometimes accounting for very little variance
for certain predictor-criterion pairs and/or job types (e.g., Murphy, 2000; Ones, Viswesvaran, &
Schmidt, 2003; Salgado, Anderson, Moscoso et al., 2003); and (b) the formulas used in VG analyses
fail to explicitly (statistically) incorporate measured situational variables (e.g., authority structure,
interpersonal interactions, social climate), despite extensive histories of situational influence on
many of the predictor-criterion pairs frequently cited in the VG literature (e.g., Ghiselli, 1959, 1966,
1973; Peters, Fisher, & O’Connor, 1982). It is further noteworthy that closer inspection of VG equa-
tions reveals hidden assumptions and perhaps the potential for more than just a trace or hint of
cross-situational variability (James et al., 1992).
The following discussion poses philosophical and empirical questions of relevance to the
situational specificity-cross-situational consistency debate in the context of VG. For example, is
situational specificity truly a prerequisite for understanding the relationship between KSA (or
personality) scores and job-related behaviors? Are statistical corrections sufficient? In other words,
what are the scientific and practitioner-oriented advantages to including situational factors in the
estimation of cross-situational consistency in predictor-criterion relationships? The purpose of this
chapter is to address these questions by tracing the development of predictive validity estimation
from the early situational-specificity dominated approach to the new contender, cross-situational
consistency. This discussion includes a comparison of conceptual and methodological assumptions
underlying situational specificity and cross-situational consistency. Ultimately, we seek to identify
the advantages and disadvantages of each approach and critically assess their collaborative strength
in illuminating patterns of behavior in organizational settings.
SITUATIONAL SPECIFICITY
Validity studies conducted in the mid- to late-20th century offered modest hope for the utility of
personality and KSAs as predictors of crucial outcome variables (i.e., job performance) in applied
settings. Of particular interest were validity coefficients for cognitive ability tests, which tended to
be small to modest and inconsistent across job types (Ghiselli, 1959, 1966, 1973). As a result of
inconsistency, the situational specificity hypothesis (SSH) drew a large following and dominated
the literature for several decades (e.g., Murphy, 2000; Schmidt & Hunter, 1998). The adoption of the
SSH in fact reflected a more general academic movement toward situational specificity of behavior
(Mischel, 1968; Epstein et al., 1982). However, in the context of meta-analysis, the SSH posits
that the underlying population validities for any particular predictor-criterion pair will vary across
populations (studies) because of the differential impact of certain situational attributes on the (typi-
cally linear) relationship between the predictor and criterion (e.g., James et al., 1986, 1992; Murphy,
2000; Schmidt et al., 1993).
Support for the SSH is easily found in the literature. For example, several studies have revealed
the moderating role of job complexity on validity estimates of cognitive ability and KSAs (e.g.,
Hunter & Hunter, 1984; Levine, Spector, Menon, Narayanan, & Cannon-Bowers, 1996; Russell,
2001; Salgado et al., 2003; Schmidt et al., 1993). Hunter and Hunter (1984) found that cognitive abil-
ity demonstrated a higher validity for predicting job performance and training success for occupa-
tions involving greater task complexity. Salgado et al. (2003) found that the empirical validity for
general mental ability ranged from .12 for police to .34 for sales occupations when predicting supervi-
sor ratings of job performance. Schmidt et al. (1993) demonstrated that the variability of validity
estimates for various measures of cognitive ability (i.e., general, verbal, quantitative, reasoning,
perceptual speed, memory, and spatial and mechanical) was at least partially a function of job type.
For example, the standard deviation of the validity estimates (i.e., SDr) for reasoning ability predict-
ing job performance was .04 for jobs involving stenography, typing, and filing and .19 for production
and stock clerks.
In regard to personality and empirical validities, more extensive evidence of situational specificity
can be found, especially in the case of team-oriented organizations (Barrick & Mount, 1991, 1993;
Mount et al., 1998; Stewart, 1996; Stewart & Carson, 1995). For example, Barrick and Mount (1993)
found that validity coefficients representing relationships between key personality traits (conscien-
tiousness and extraversion) and job performance (as rated by supervisors) varied over managers as a
function of managers’ perceived level of autonomy on the job. Validities tended to increase in pro-
portion with the amount of perceived autonomy. Additionally, Mount et al. (1998) showed that some
Big-Five traits were more valid than other Big-Five traits, but the dominant trait varied as a function
of the degree of social and interpersonal interaction required at work. To illustrate, agreeableness and
extraversion had stronger validities against performance for employees engaged in team-oriented
jobs (e.g., highest mean validities were .24 and .20, respectively) relative to employees engaged in
clerical and “cubicle” jobs in which only dyadic interactions occur (i.e., newspaper employees in the
circulation department; banking employees in loan operations; telemarketing representatives) (e.g.,
highest mean validities were .16 and .16, respectively). In contrast, the opposite was true for consci-
entiousness. Specifically, dyadic jobs yielded greater validity estimates (e.g., highest mean validity
was .29) than team-oriented jobs (e.g., highest mean validity was .19). Moreover, even when validi-
ties are examined for one specific job type (e.g., sales), validities for the extraversion-sales effective-
ness relationship vary across organizations with only 54% of their variance being accounted for by
statistical artifacts (Barrick & Mount, 1991; Stewart, 1996).
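Figures such as the 54% of variance accounted for by statistical artifacts come from computations of roughly the following bare-bones form; the validity coefficients and sample sizes below are invented, and only sampling error is handled, so this is a sketch rather than a full Hunter-Schmidt analysis.

```python
def bare_bones(rs, ns):
    """Bare-bones meta-analysis in the Hunter-Schmidt tradition (sketch).

    Returns the sample-size-weighted mean validity, the observed
    between-study variance of validities, and the proportion of that
    variance attributable to sampling error alone.
    """
    total_n = sum(ns)
    mean_r = sum(r * n for r, n in zip(rs, ns)) / total_n
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    mean_n = total_n / len(ns)
    var_error = (1 - mean_r**2) ** 2 / (mean_n - 1)  # expected sampling-error variance
    return mean_r, var_obs, var_error / var_obs

# Invented validities for one predictor-criterion pair across five organizations.
mean_r, var_obs, pct = bare_bones([0.05, 0.30, 0.12, 0.40, 0.20],
                                  [80, 120, 60, 150, 90])
print(round(mean_r, 3), round(var_obs, 4), round(pct, 2))
```

When the proportion falls well short of 1.0, as in the extraversion-sales effectiveness example, the residual between-study variance invites situational explanations.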
The illustrative studies above suggest active involvement of situational moderators in relationships
between personality or KSA predictors and job performance. Perhaps even more important is the
fact that a considerable number of prominent situational attributes have not been examined, at least
extensively, as potential moderators of predictor-criterion relationships in meta-analyses. Included
here are environmental, organizational, climate, and culture variables such as authority structure,
standardization of job tasks and procedures, reward processes, leadership styles, organizational
norms and values, and stress. We address the issue of potential situational moderators later in this
chapter.
The critical point for now is that findings supporting the SSH led to certain conclusions for
scientists and practitioners interested in the utility of personality and cognitive ability scores and
a convenient method of assessing predictive validity. To ascertain the efficacy with which scores
obtained from a particular type of measure (e.g., general mental ability) predict a particular criterion
(e.g., job performance), a local validity estimate is warranted. In other words, for every organization
interested in employing general mental ability measures for the purposes of selection testing, pro-
motion, etc., a separate validity study, yielding an individualized validity coefficient, is advisable.
Indeed, it was this point of view that dominated validity research for most of the 1960s and 1970s.
As will be discussed later, this opinion has not been steadfast.
The reservoir of situational factors yet to be taken into account may seem overwhelming and
difficult to incorporate into studies. This problem is exacerbated by the fact that very little effort has
been made to cultivate conceptual models that could incorporate situational variables as explanations
of specificity of validity coefficient estimates (James et al., 1992). Therefore, it is not surprising that
some prefer to question the viability of a situational specificity approach in terms of its ability to
effectively explain the situational complexities in meta-analytic results. Indeed, rather than address
these issues directly by assessing potential situational moderators, a more indirect approach was
adopted by a series of ongoing research studies (Hunter & Hunter, 1984). This approach asked the
question “To what extent does the situation need to be statistically controlled in order to attain a
reasonable level of comparability across organizations and samples?” This question prompted other
questions, such as “Are there other between-sample sources of variance that can be localized and
removed in a parsimonious and systematic manner?” It was then thought that perhaps identifying
and partitioning variance attributable to these nonsituational factors would help to elucidate the
amount of variance left unaccounted for and, therefore, potentially attributable to situational factors. In
other words, rather than directly exploring the impact of situational factors on the stability of validities
through explanatory and statistical models, the presence of these factors is merely inferred based on
the amount of variance remaining. This indirect, process-of-elimination technique emerged as what
we now know as VG analysis (James et al., 1992).
VG
VG developed out of a need to increase the precision of validity coefficient estimates for similar
or identical predictor-criterion pairs—a need that had been unfulfilled during the critical period of
situational specificity. VG analysis is essentially a special version of meta-analysis and has been
described as an inferential version of its more descriptive counterpart (Murphy, 2000). That is,
whereas meta-analysis provides an overall depiction of the relationship between a predictor and
criterion for a set of samples, VG goes one step further to ascertain the degree to which certain
factors contribute to the consistency of this relationship across samples. Typically, these estimates
are computed separately for different job types (e.g., clerical, mechanical, managerial) (Schmidt &
Hunter, 1977). The factors to be considered are statistical artifacts such as unreliability of predictor and
criterion scores, range restriction in predictor scores, and sampling error (e.g., Schmidt & Hunter,
1977; Schmidt & Hunter, 1998; Schmidt et al., 1988; Schmidt et al., 1993).
A VG analysis begins with the standard meta-analytic procedure of calculating the mean (weighted
by sample size), over studies, of validity coefficients for a particular predictor-criterion pair and the
variance among validities around these respective means. Each study represents one sample from
its respective population (i.e., organization, situation), and it is assumed that (a) the true validity for a
particular predictor-criterion pair may be equal across populations but that (b) differences among studies
in statistical biases (e.g., predictor and/or criterion reliability, range restriction, and sampling error)
may distort and restrict the size of the observed validity. In an attempt to identify and effectively
model the impact of these biases on estimates of true validity, the following structural equation—in
which Greek symbols represent population parameters—is generally used in VG analysis:
r_k = ρ_k α_k^(1/2) φ_k^(1/2) ξ_k + e_k (42.1)

where ρ_k is the population correlation between the unrestricted true scores for the predictor and
criterion in situation k (i.e., the true validity); r_k represents the observed validity coefficient—i.e.,
the correlation between a predictor X_k and a criterion Y_k for a random sample of n_k individuals
from population (i.e., organization, situation) k; α_k is the unrestricted population reliability for the
criterion in situation k; φ_k is the unrestricted population reliability for the predictor in situation k;
ξ_k reflects the degree of range restriction in the predictor in situation k; and e_k is the sampling error
inherent in r_k.
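As a rough numerical sketch of Equation 42.1 (all parameter values below are hypothetical), the expected observed validity in situation k is simply the true validity attenuated by the square roots of the two reliabilities and by the range-restriction factor; sampling error e_k is omitted because its expectation is zero:

```python
import math

def expected_observed_r(rho, alpha, phi, xi):
    """Expected observed validity under Equation 42.1, ignoring sampling
    error: r_k = rho_k * alpha_k**0.5 * phi_k**0.5 * xi_k."""
    return rho * math.sqrt(alpha) * math.sqrt(phi) * xi

# Hypothetical situation: true validity .50, criterion reliability .60,
# predictor reliability .80, range-restriction factor .85.
r_obs = expected_observed_r(rho=0.50, alpha=0.60, phi=0.80, xi=0.85)
print(round(r_obs, 3))  # the artifacts shrink .50 to roughly .29
```

Even with a substantial true validity, plausible artifact values cut the expected observed coefficient nearly in half, which is one reason uncorrected validities tend to cluster in the .20–.30 range.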
Once the statistical artifact population estimates are inserted in the equation and ρ_k is estimated
for each k, the next step in the VG analysis is to estimate the variance among the ρ_k, referred to as
V(ρ), and determine whether or not this variance is small enough to justify generalization of the
validity to all situations. V(ρ) is calculated based on the following estimation equation (see James
et al., 1992):

V(ρ) = [V(r) − V(r′)] / Π (42.2)

where V(ρ) refers to variance in population (true) validities; V(r) is the between-situation variance in
the observed validities; V(r') is the expected between-situation variance in validities due to statistical
artifacts; and Π is an additional correction for mean reliabilities and range restriction across situations.
In essence, the amount of variance attributable to statistical artifacts is subtracted from the total
observed variance, and the remaining variance, termed “residual variance,” represents true variance
in validities that is unaccounted for (i.e., by statistical artifacts).
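The subtraction logic can be sketched as a bare-bones computation (the validities and sample sizes below are invented, and only sampling error is treated as an artifact; reliability and range-restriction corrections are omitted for brevity):

```python
# Hypothetical observed validities and sample sizes for one
# predictor-criterion pair across eight studies.
rs = [0.05, 0.25, 0.38, 0.20, 0.45, 0.10, 0.30, 0.52]
ns = [  68,  120,   85,  200,   55,  150,   95,   70]

N = sum(ns)
r_bar = sum(n * r for n, r in zip(ns, rs)) / N                 # weighted mean validity
v_obs = sum(n * (r - r_bar) ** 2 for n, r in zip(ns, rs)) / N  # observed variance V(r)
v_err = (1 - r_bar ** 2) ** 2 * len(rs) / N                    # expected sampling-error variance
v_res = max(v_obs - v_err, 0.0)                                # residual ("unaccounted") variance

print(round(v_err / v_obs, 2))  # share of V(r) attributable to sampling error alone
```

In this invented example, sampling error accounts for about 45% of the observed variance; the residual is attributed, by elimination, to uncorrected artifacts or to situational factors.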
A primary step of a VG analysis is to determine whether or not cross-situational consistency in
validities has been achieved. Basically, if the estimate of V(ρ) is approximately equal to 0, the ρ_k are
deemed equal across situations (i.e., generalizable), whereas V(ρ) > 0 suggests a potential situational
moderator. To this end, two rules have emerged that elaborate on the term “approximately equal”
by imposing predetermined, theoretically justified critical values, and it is above these values that
Situational Specificity and Validity Generalization 913
V(ρ) must not extend in order for cross-situational consistency to be established. One example is the
“75% rule,” in which 75% of the total variance in validity estimates (i.e., V(r)) must be accounted
for by statistical artifacts to effectively rule out the SSH. The remaining 25% is presumed to be
caused by unmeasured artifacts (e.g., clerical and programming errors; Hermelin & Robertson,
2001; Schmidt & Hunter, 1977). This rule has been highly debated in the literature for being
insensitive to situational moderators (e.g., James et al., 1986).
An additional technique for assessing cross-situational consistency is the 90% confidence rule,
in which the lower bound of the validity distribution (e.g., the 10th percentile) is compared to a
minimal validity coefficient value (e.g., 0, .01, .1). If the value at the 10th percentile is greater than
the minimal value, one can say with 90% confidence that the validity of the scores will generalize to
other populations. The 90% rule, in essence, is simply an acknowledgement from cross-situational
advocates that variance in validities is typically nonzero and, therefore, as long as 90% of the
true validities lie in the positive range, validities can be said to at least minimally generalize across
situations. It has been pointed out that this latter rule is more lenient and, therefore, oriented toward
detecting cross-situational consistency, whereas the more strict 75% rule is directed toward testing
and (potentially) ruling out situational specificity (James et al., 1992; Murphy, 2000).
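The two rules can be sketched side by side (all VG results below are hypothetical); the example also shows that the two rules can disagree for the same data:

```python
import math

def seventy_five_rule(v_obs, v_artifact):
    """75% rule: artifacts must account for at least 75% of the observed
    variance in validities for the SSH to be ruled out."""
    return v_artifact / v_obs >= 0.75

def credibility_lower_bound(rho_bar, v_rho):
    """10th percentile of the true-validity distribution, assuming the
    rho_k are normally distributed (z for the 10th percentile is -1.28)."""
    return rho_bar - 1.28 * math.sqrt(v_rho)

# Hypothetical VG results: observed variance .010, artifact variance .007,
# mean true validity .38, residual (true) variance .004.
print(seventy_five_rule(0.010, 0.007))           # False: SSH not ruled out
print(credibility_lower_bound(0.38, 0.004) > 0)  # True: generalizes per the 90% rule
```

With these invented numbers, artifacts account for only 70% of the observed variance (failing the 75% rule), yet the 10th percentile of the true-validity distribution is still well above zero (passing the 90% confidence rule).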
EVALUATING VG
On the basis of the equations above, it can be deduced that there are no substantive situational variables
explicitly taken into account in VG analysis. Only basic statistical properties of the measurement
scores are corrected. There are no situational variables measured and incorporated into the model
to ascertain which situational variables influence a particular predictor-criterion relationship and the
validity coefficient that corresponds to this relationship. Furthermore, it is worth noting that a logical
assumption underlying this residualized model of SSH testing is that situational factors and
statistical artifacts are independent in the population, an assumption that has only recently received
critical attention (James et al., 1992; Raju, Anselmi, Goodman, & Thomas, 1998; Thomas & Raju,
2004). Elaboration of this assumption and its implications will be addressed later in the Conclusions.
Several investigators would say that the VG approach has been a success (e.g., Schmidt et al.,
1993). First of all, consider that the criterion-related validity estimates of individual traits, such
as general mental ability and personality, for predicting job performance were typically cited from
the 1950s to the late 1970s as being in the .20–.30 range, maxing out at .40 (Ghiselli, 1966). With the
advent of the statistical corrections applied in VG analyses, these estimates have increased by as much
as 120%, a result of a shift in the maximum percentage of variance in criterion (true) scores
accounted for by predictor (true) scores from 16% to upwards of 36% (Hunter & Hunter, 1984).
Similar trends have been observed for other types of predictor-criterion pairs, such as integrity tests
predicting counterproductive work behaviors, which demonstrate a mean estimated population
validity coefficient of .41 (Ones et al., 1993).
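The magnitude of these shifts can be reproduced with the standard disattenuation and range-restriction formulas (the observed validity, criterion reliability, and restriction ratio below are hypothetical but in the range commonly reported in this literature):

```python
import math

def correct_for_criterion_unreliability(r, ryy):
    """Disattenuate the observed validity for criterion unreliability."""
    return r / math.sqrt(ryy)

def correct_for_range_restriction(r, u):
    """Thorndike Case II correction, where u is the ratio of the
    unrestricted to the restricted predictor standard deviation."""
    return r * u / math.sqrt(1 + r ** 2 * (u ** 2 - 1))

r_obs = 0.25  # hypothetical observed validity, in the range Ghiselli reported
r_unrel = correct_for_criterion_unreliability(r_obs, ryy=0.52)
r_full = correct_for_range_restriction(r_unrel, u=1.5)
print(round(r_unrel, 2), round(r_full, 2))  # 0.35 0.48
```

An observed .25 thus becomes roughly .48 after the two corrections, nearly quadrupling the implied proportion of criterion variance explained (from about 6% to about 24%).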
But how successful has VG been in determining the extent to which these estimates are
generalizable across populations (i.e., organizations)? Strong proponents of VG posit that essentially
all of the variance of validity coefficients can be accounted for by the statistical artifacts incorporated
into the VG model such that, for example, a general mental ability measure will be equally predictive
of job performance for a particular job type in all organizations (Schmidt et al., 1988; Schmidt et al.,
1993). Indeed, VG corrections, particularly the correction for sampling error, have accounted for up
to 100% of the variance (sometimes more than 100%) in validity estimates across studies for certain
job types (e.g., Salgado et al., 2003; Schmidt et al., 1993). On the other hand, corrections have
accounted for as little as 16% of the variance in validities in other meta-analytic samples (Ones
et al., 2003). For
the SSH to be disconfirmed, there should be a negligible amount of residual variance after the variance
originating from statistical artifacts has been removed. According to the 75% rule, no more than
25% of the total variance should remain unexplained; the latter example would therefore indicate
situational specificity. Nonetheless, as previously mentioned, proponents of cross-situational consistency
generally use the 90% confidence rule to frame results in terms of the degree to which validities
are applicable across studies. Thus, the presence of any residual variance, even as high as 25%, is
perceived by these investigators as conveying little practical significance. Instead, they argue that
the predictive utility of the KSA or trait score applies, at least to some degree, to all organizations
(e.g., Murphy, 2000; Schmidt et al., 1993). In other words, despite evidence in support of situational
specificity (i.e., violation of the 75% rule), simultaneous and yet conceptually conflicting support for
cross-situational consistency can be obtained (i.e., via the 90% confidence rule) with any residual
variance being dismissed as arising from methodological or statistical issues. Indeed, the extent to
which situational specificity is explored ends here. A salient and distinguishing feature of the cross-
situational approach in VG analysis, and meta-analytic validity studies in general, is that situational
causes of meaningful differences in validities among populations are not fully addressed, and it is
this issue on which much of the following concerns are based.
In a VG analysis of validities for general mental ability predicting job performance, there were 50
more samples (3,394 more employees) in typing and filing occupations than in samples engaged in
production and stock occupations. Yet, the proportions of variance in validities explained by VG
corrections are 377% for production and stock clerks (the smaller sample) and 76% for typing and
filing clerks (the larger sample). Similar findings emerged for other predictors of performance such
as verbal ability, quantitative ability, and perceptual speed. These findings suggest that situational
factors may exert impact on validity estimates but that most VG studies based on small samples
have insufficient power to detect these effects, a concern that has been voiced elsewhere (e.g.,
James et al., 1986; James, Demaree, Mulaik, & Mumford, 1988; Murphy, 2000). Thus, adequate
power (i.e., sufficiently large sample sizes) is advised when attempting to test the SSH. Insufficient
power to detect moderation should preclude an interpretation in favor of or against the SSH.
Therefore, it appears that for each predictor-criterion pair, there is a range of
validities that do not necessarily represent only sampling error and methodological differences
among studies; rather, one natural and substantive dimension on which these validities differ
is job type. A subsequently reasonable question to pose is “Why?” Perhaps the origin of these
varying validities, after removing statistical artifacts, lies in situational factors such as interpersonal
communication, supervision over other employees versus subordinate, and reward structure, which
inherently vary across job types. However, certain jobs may be more or less vulnerable to situational
factors on the basis of the degree of complexity in one’s social and interpersonal environment (e.g.,
little communication required, self-employment, small group size, steepness of organizational hierarchy).
Further examination of validity and cross-situational consistency as a gradient may illuminate the
moderating impact of situational factors, serving as a more sensitive and informative “test” than the
dichotomous, all-or-none 75% rule.
model surpassed the traditional VG model in accuracy. Furthermore, the Raju et al. (1998) model
demonstrated comparable properties to the James et al. (1992) model in accurately estimating
validity coefficients, albeit with slightly more stable estimates (i.e., lower variance in estimates
across samples). This latter finding can be perceived as bolstering support for the SSH in James
et al.’s estimation because not only does the procedure provide levels of estimate accuracy similar to
other methods that have exceeded traditional VG techniques in accuracy, but the residual variances
of these estimates, which can be interpreted as arising from situational influences, also increase. Of
course, neither model identifies which situational variables serve to moderate validities, nor in
what way.
Although inclusion of a conceptual model in VG analysis has not been attempted, a study by
Burke, Rupinski, Dunlap, and Davison (1996) addressed the substantive nature of the indepen-
dence assumption violation inherent in VG analysis by testing James et al.’s (1992) proposition that
restrictiveness of climate contributes to violation of the assumption via its dual impact on statistical
artifacts. Burke et al. (1996) examined the impact of organizational climate on the range restrictions
and internal consistencies (coefficient alpha) for the predictor (job satisfaction) and criterion (job
performance), respectively, across 537 sales organizations. Organizational climate—specifically,
restrictiveness-of-climate—was measured via aggregated (store-level) employee perceptions (self-report)
on five first-order factors (goal emphasis, management support, nonmonetary reward orientation,
means emphasis general, means emphasis specific) of the higher-order concern for employees factor
of an organizational climate measure (Burke, Borucki, & Hurley, 1992).
Contrary to expectations, Burke et al. (1996) found that a more restrictive climate was associated
with more, rather than less, variability in the predictor and criterion scores, although only a small
subset of these correlations was significant. Furthermore, restrictiveness of climate demonstrated
only three significant correlations with internal consistency (i.e., coefficient alpha values) for the
predictor and criterion measurement scores, and one of these was in the opposite of the proposed
direction. It is worth noting that, for internal consistency, the two significant correlations in the
hypothesized direction both occurred for one particular first-order factor—management support.
This factor, relative to the remaining factors, most closely matched the restrictiveness-of-climate
conceptualization provided by James et al. (1992) in which trust and encouragement from manage-
ment serve as the integral forces behind the level of restrictiveness and, subsequently, the range of
individual differences manifested in the workplace. Furthermore, the predictor included in the analysis
was job satisfaction. This is perhaps not the most appropriate predictor to use in selection testing,
which usually relies on KSAs or personality. Finally, the sample was limited to sales organizations.
Therefore, although this study provides a first step toward examining the magnitude of influence that
situational factors exert on statistical artifacts and validities, additional studies are warranted.
A theoretical framework that has been proposed for situational specificity pertains to restrictiveness
of climate (James et al., 1992). Although this framework was introduced earlier in this review, it was
largely for the purpose of discussing potential assumption violations in VG. However, its possible
merit as an explanatory framework is based on the implication that the predictive validities of
lower-level (i.e., individual-level) traits, KSAs, and personality are affected by the restrictiveness of
the organizational climate. An example is the conscientiousness-job performance relationship. For organizations
with nonrestrictive climates, in which reward structures are typically built around merit and effort,
it may be that the mean relationship between conscientiousness and job performance tends toward the
upper range of validities typically found in this type of research. On the other hand, these values
may drop to near zero, or at least nonsignificance, in a highly restrictive organization because of an
emphasis on tenure and seniority, rather than just hard work, in the organization’s reward structure.
If there is a substantial range in restrictiveness in the general population of organizations included
in a meta-analysis, then there may also be a similarly wide range in validities across studies. If the
variance in these validities can be statistically accounted for by variations in scores on an organiza-
tional-level measure of restrictiveness of climate, then the variance in validities may be statistically
controlled. All that is necessary is to include the situational variable(s) in the VG equations and
analyses. This, of course, requires a revised VG model, but such a model is easily developed.
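As a rough sketch of what such a revised model could look like (the study-level data below are entirely invented), corrected validities can be regressed on an organizational restrictiveness score via sample-size-weighted least squares, so that moderator-explained variance is estimated directly rather than inferred from a residual:

```python
import numpy as np

# Hypothetical study-level inputs: corrected validity, sample size (used as
# the weight), and an organizational restrictiveness-of-climate score.
rho_hat = np.array([0.45, 0.38, 0.22, 0.41, 0.18, 0.30])
n       = np.array([ 120,   80,  150,   60,  200,   90], dtype=float)
climate = np.array([ 1.2,  2.0,  4.1,  1.5,  4.8,  3.0])  # higher = more restrictive

# Weighted least squares: solve the normal equations for intercept and slope.
X = np.column_stack([np.ones_like(climate), climate])
W = np.diag(n)
intercept, slope = np.linalg.solve(X.T @ W @ X, X.T @ W @ rho_hat)
print(round(slope, 3))  # a negative slope: validities shrink as restrictiveness rises
```

With the moderator in the model, the question shifts from "how much variance is left over?" to "how much variance does the situational variable itself explain?", which is the direct test the chapter calls for.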
CONCLUSIONS
In responding to the critiques of VG posed by James et al. (1986), Schmidt et al. (1988, p. 666)
stated, “they [James et al., 1986] did not recognize that it is the higher level pattern of findings across
numerous VG analyses that is important in revealing the underlying reality.” Indeed, replications of
findings are crucial to building a theory or stable explanation concerning a particular behavioral
phenomenon (or set of phenomena). In this sense, many studies have demonstrated that the VG approach is effective
in improving cross-situational consistency in validity coefficients through statistical corrections for
measurable artifacts (i.e., reliability, range restriction, and sampling error) (e.g., Schmidt et al., 1993;
Schmidt & Hunter, 1998). However, numerous VG studies also demonstrate a pattern of inconsistency,
particularly across jobs, in the degree to which adjustments in statistical artifacts improve validities.
Moreover, in these studies, it is often the case that little variance is accounted for by corrections for
statistical artifacts. Specifically, examination of individual meta-analyses reveals that there is a size-
able amount of unexplained variance in validities after controlling for statistical artifacts and that the
amount of variance accounted for is also a function of job type (Algera et al., 1984; Salgado et al.,
2003; Schmidt et al., 1993). These patterns, when taken together, suggest that variation in statistical
artifacts is likely responsible for only a portion of the total fluctuation in validity estimates across
studies. Additional factors remain unmeasured and need specification, particularly contextual (i.e.,
situational) factors. At this point, research suggests that differences in job type and task complexity
moderate the relationship between personality/cognitive ability and job behaviors (Barrick & Mount,
1991, 1993; Mount et al., 1998; Russell, 2001; Salgado et al., 2003; Schmidt et al., 1993; Stewart,
1996; Stewart & Carson, 1995). It is likely that many more situational variables are operative.
Unfortunately, up to this point, the impact of situational factors on validities tends to be examined
indirectly. VG studies typically direct their attention toward what are considered to be more influential
causes of variability in validity estimates—namely, statistical artifacts. The degree of situational
specificity is addressed by examining the residual variance after removing what is alleged to be
attributable to the statistical artifacts. It seems plausible that the results of VG analyses, as helpful as
they may be for illuminating the impact of statistical artifacts on validity estimates, are incomplete
and oftentimes misleading because of both the lack of explicit inclusion of situational variables in
analyses and the indirect impact of situational variables on validity mean and variance estimates.
Situational moderators at the meta-analytic level are not only possible but likely.
However, until situational factors are explicitly measured and included in VG analyses, conclusions
regarding situational specificity cannot be safely and soundly made.
REFERENCES
Algera, J. A., Jansen, P. G. W., Roe, R. A., & Vijn, P. (1984). Validity generalization: Some critical remarks on
the Schmidt-Hunter procedure. Journal of Occupational Psychology, 57, 197–210.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-
analysis. Personnel Psychology, 44, 1–26.
Barrick, M. R., & Mount, M. K. (1993). Autonomy as a moderator of the relationships between the big five
personality dimensions and job performance. Journal of Applied Psychology, 78, 111–118.
Burke, M. J., Borucki, C. C., & Hurley, A. E. (1992). Reconceptualizing psychological climate in a retail
service environment: A multiple-stakeholder perspective. Journal of Applied Psychology, 77, 717–729.
Burke, M. J., Rupinski, M. T., Dunlap, W. P., & Davison, H. K. (1996). Do situational variables act as substan-
tive causes of relationships between individual difference variables? Two large-scale tests of “common
cause” models. Personnel Psychology, 49, 573–598.
Epstein, M. W., Mischel, W., & Peake, P. K. (1982). Beyond déjà vu in the search for cross-situational consis-
tency. Psychological Review, 89, 730–755.
Ghiselli, E. E. (1959). The generalization of validity. Personnel Psychology, 12, 397–402.
Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York, NY: Wiley.
Ghiselli, E. E. (1973). The validity of aptitude tests. Personnel Psychology, 26, 461–477.
Hermelin, E., & Robertson, I. T. (2001). A critique and standardization of meta-analytic validity coefficients in
personnel selection. Journal of Occupational and Organizational Psychology, 74, 253–277.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance.
Psychological Bulletin, 96, 72–98.
Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs. random effects meta-analysis models: Implications for
cumulative research knowledge. International Journal of Selection and Assessment, 8, 275–292.
James, L. R., Demaree, R. G., & Mulaik, S. A. (1986). A note on validity generalization procedures. Journal of
Applied Psychology, 71, 440–450.
James, L. R., Demaree, R. G., Mulaik, S. A., & Ladd, R. T. (1992). Validity generalization in the context of
situational models. Journal of Applied Psychology, 77, 3–14.
James, L. R., Demaree, R. G., Mulaik, S. A., & Mumford, M. D. (1988). Validity generalization: A rejoinder to
Schmidt, Hunter, and Raju (1988). Journal of Applied Psychology, 73, 673–678.
Levine, E. L., Spector, P. E., Menon, S., Narayanan, L., & Cannon-Bowers, J. (1996). Validity generalization
for cognitive, psychomotor, and perceptual tests for craft jobs in the utility industry. Human Performance,
9, 1–22.
Mischel, W. (1968). Personality and assessment. New York, NY: John Wiley.
Mount, M. K., Barrick, M. R., & Stewart, G. L. (1998). Five-factor model of personality and performance in
jobs involving interpersonal interactions. Human Performance, 11, 145–165.
Murphy, K. R. (2000). Impact of assessments of validity generalization and situational specificity on the science
and practice of personnel selection. International Journal of Selection and Assessment, 8, 194–206.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validi-
ties: Findings and implications for personnel selection and theories of job performance. Journal of
Applied Psychology, 78, 679–703.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (2003). Personality and absenteeism: A meta-analysis of integ-
rity tests. European Journal of Personality, 17, S19–S38.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job
proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373–406.
Peters, L. H., Fisher, C. D., & O’Connor, E. J. (1982). The moderating effect of situational control of per-
formance variance on the relationships between individual differences and performance. Personnel
Psychology, 35, 609–621.
Raju, N. S., Anselmi, T. V., Goodman, J. S., & Thomas, A. (1998). The effect of correlated artifacts and true
validity on the accuracy of parameter estimation in validity generalization. Personnel Psychology, 51,
453–465.
Raju, N. S., & Burke, M. J. (1983). Two new procedures for studying validity generalization. Journal of Applied
Psychology, 68, 382–395.
Raju, N. S., Burke, M. J., Normand, J., & Langlois, G. M. (1991). A new meta-analytic approach. Journal of
Applied Psychology, 76, 432–446.
Russell, C. J. (2001). A longitudinal study of top-level executive performance. Journal of Applied Psychology,
86, 560–573.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., de Fruyt, F., & Rolland, J. P. (2003). Criterion validity
of General Mental Ability measures for different occupations in the European community. Journal of
Applied Psychology, 88, 1068–1081.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychol-
ogy: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124,
262–274.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generaliza-
tion. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1982). Progress in validity generalization: Comments on Callender
and Osburn and further developments. Journal of Applied Psychology, 67, 835–845.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Rothstein-Hirsh, H. (1985). Forty questions about validity gen-
eralization and meta-analysis. Personnel Psychology, 38, 697–798.
Schmidt, F. L., Hunter, J. E., & Raju, N. S. (1988). Validity generalization and situational specificity: A second
look at the 75% rule and Fisher’s z Transformation. Journal of Applied Psychology, 73, 665–672.
Schmidt, F. L., Law, K., Hunter, J. E., Rothstein, H. R., Pearlman, K., & McDaniel, M. (1993). Refinements
in validity generalization methods: Implications for the situational specificity hypothesis. Journal of
Applied Psychology, 78, 3–12.
Stewart, G. L. (1996). Reward structure as a moderator of the relationship between extraversion and sales
performance. Journal of Applied Psychology, 81, 619–627.
Stewart, G. L., & Carson, K. P. (1995). Personality dimensions and domains of service performance: A field
investigation. Journal of Business and Psychology, 9, 365–378.
Thomas, A., & Raju, N. S. (2004). An evaluation of James et al.’s (1992) VG estimation procedure when artifacts
and true validity are correlated. International Journal of Selection and Assessment, 12, 299–311.
43 Employee Selection in Europe
Psychotechnics and the
Forgotten History of Modern
Scientific Employee Selection
Jesús F. Salgado, Neil R. Anderson, and
Ute R. Hülsheger
INTRODUCTION
First of all, we must commence with a seemingly naïve clarification. Europe, as a whole, is a con-
tinent consisting of dozens of independent countries, with many different languages, legal regula-
tions, cultures, histories, and, indeed, employee selection frameworks (Shimmin, 1989; see also Myors
et al., 2008). For illustrative purposes, it should be noted that Europe has more countries than North,
Central, and South America combined. The situation, therefore, is totally different
from that in the United States in relation to the cultural, legal, and national influences upon employee
selection procedures (Shimmin, 1989). Nevertheless, in spite of this heterogeneity, there was an
important exchange of ideas across Europe during the 20th century. Three basic mechanisms were
used for this exchange. The Congresses of the International Psychotechnics Association (later called
the International Association of Applied Psychology) were the first and most important mechanism. The
exchange of national journals was the second mechanism. Finally, research visits to the psychotech-
nical laboratories across Europe were constant during the first 30 years of the 20th century.
To present a complete and comprehensive account of the history of employee selection in Europe
is a titanic job beyond the scope of the present chapter. Consequently, in the following pages, we
will present a general but limited review of the main historical milestones of European employee
selection during the 20th and into the 21st century. We will divide the chapter into three parts:
(a) The European Origin of Scientific Employee Selection: The Years of Success (1900–1945);
(b) The Fall of the European Employee Selection After World War II: The Years of Decline and
Stagnation (1945–1980); and (c) The Resurgence of European Employee Selection: The Years of
Optimism and Expansion (Since 1980).
process of the development of the scientific and practical discipline of employee selection. It seems
that the sad destiny of the work of the first European contributors to employee selection is obscurity.
For example, two recent and very well documented historical accounts of employee selection and
I-O psychology outside of the United States ignore practically all of the European contributions
between 1900 and 1940 (Vinchur, 2007; Warr, 2007). Reading these accounts, it seems that the
European contributions were a mere anecdote. However, the reality was very different.
Some years before Münsterberg began his experiments on employee selection, researchers had
already conducted personnel examinations in several European countries. For example, starting in
1901, the Italian psychologist Ugo Pizzoli carried out professional examinations of the apprentices
in Modena (Italy) (Baumgarten, 1949; Carroll, 1951). In France, scientific employee selection began
in 1900 when Édouard Toulouse, director of the Experimental Psychology Laboratory at the École
des Hautes Études (School of Advanced Studies), proposed a project for professional selection to the
Ministry of Work and asked Jean Marie Lahy to conduct it. Lahy began his task in 1903, and
the first articles were published in 1905 in the Revue Scientifique and in 1910 in the Bulletin de
l’Inspection du Travail (Lahy, 1922; Toulouse, 1927). Between 1905 and 1908, Lahy took the
first steps toward developing a job analysis method and carried out preliminary experiments on the
selection of streetcar operators. In the same years, Imbert, professor at the University of Montpellier
(France), was developing a method for selecting engineers and reported his proposals during the
Congress of Psychology at Strasbourg in 1909 (Lahy, 1922). Lahy also proposed a general method for
employee selection that is, in synthesis, very similar to the current method for validating employee
selection procedures (later termed “orthodox personnel selection” by Guion, 1976). The first step
of the method consisted of determining the psychophysiological features characteristic of highly
proficient workers. To do so, two groups of workers are examined: the first consisting of elite
workers and the second of mediocre workers. Both groups complete tests measuring the psychological
functions and aptitudes demanded by the occupation in order to obtain group norms. The second
step of the method, following the suggestion of the Russian
psychologist Rossolimo, consists of applying the same tests to the job candidates and classifying
them according to the norms in order to determine their “psychological profile.” Lahy applied this
method for selecting typists in 1912 and for selecting army gunners in 1917 during World War I
(Lahy, 1922). In 1907, the Zeitschrift für angewandte Psychologie [Journal of Applied Psychology],
edited by Otto Lipmann and William Stern, appeared in Germany. This was the first journal in
the world (together with the Rivista di Psicologia Applicata, edited since 1905 by G. C. Ferrari in
Italy) devoted to the application of psychological knowledge, with special focus on work-related
issues. In summary, a complete program for employee selection, and more generally for work and
organizational psychology, had been operating in many European countries since 1907, five years
before Münsterberg began his experiments.
In fact, Münsterberg first described his studies on the selection of drivers and his program
for industrial psychology—the so-called psychotechnics—in 1912 when he was lecturing as an
exchange professor at the University of Berlin. In April 1912, during the Fifth German Congress of
Experimental Psychology in Berlin, Münsterberg briefly mentioned his experience with two tramway
drivers, two telephone employees, two civil servants, and with other workers from various industries
as a comment after Karl Marbe’s presentation (Münsterberg, 1912a; Menzerath, 1913). The same
year, Münsterberg published in Leipzig (Germany) the book Psychologie und Wirtschaftsleben: Ein
Beitrag zur angewandten Experimental Psychologie (Münsterberg, 1912b), which was subsequently
translated into English under the title Psychology and Industrial Efficiency and first published
in Boston in 1913. His points of view were subsequently developed in his book Grundzüge der
Psychotechnik [Compendium of Psychotechnics], which appeared in 1914. Germany was therefore
the first country to receive Münsterberg’s work on employee selection, German was the first
language of his studies, and it was from European and German applied psychologists that he
received his first feedback, comments, and support. Moreover, the work by William
Stern on individual differences formed the basis of the main ideas of Münsterberg’s vocational selection.
Employee Selection in Europe 923
Stern (1903) was also the creator of the term “psychotechnics” and published the seminal
book Die differentielle Psychologie in ihren methodischen Grundlagen [Differential Psychology in
its Methodological Foundations] (1911), which served as basis for Münsterberg’s program. William
Stern proposed that four basic tasks were the object of differential psychology: (a) the variation
theory, testing the variation of features within many individuals; (b) the correlation theory, testing
the correlation between various features; (c) psychography, investigating individuals’ characteristic
features; and (d) the comparison theory for comparing individuals.
The complete European psychotechnical model had been in use since the 1920s (Giese, 1925;
Chleusebairgue, 1934). According to Giese (1925), two types of psychotechnics could be distin-
guished: subject psychotechnics and object psychotechnics. Subject psychotechnics consisted of
selection and placement of employees, training, vocational and career counseling, and personnel
advising and guidance. Object psychotechnics, on the other hand, consisted of studies on fatigue,
time and motion, accident prevention, and advertising.
Regarding the use of psychological procedures for examining individuals, Lahy began using
cognitive tests for selecting drivers in France as early as 1908. Walter Moede and Curt Piorkowski, Otto
Lipmann, and William Stern used similar tests in Germany from 1914 onwards (Grundlach, 1999).
Agostino Gemelli used psychophysiological measures for selecting military pilots during World
War I in Italy (Lahy, 1922; Baumgarten, 1949). Emilio Mira used attention and perceptual tests
for selecting drivers in Spain from 1918 (Mira, 1922). Pierre Bovet carried out personnel selection
programs for mechanics and dressmakers (Lahy, 1922) and Walther (1929) used tests for selecting
machine tending and manual workers (wrappers and packers) in Switzerland. Cyril Burt, members
of the National Institute of Industrial Psychology (NIIP; e.g., Winifred Spielman, G. Miles), and
members of the National Board of Health (e.g., Eric Farmer) developed and used many cogni-
tive tests in the United Kingdom as early as 1914 (see Kwiatkowski, Duncan, & Shimmin, 2006;
Shimmin & Wallis, 1994, for a comprehensive account of the NIIP). Many more examples from
other European countries could be cited, but our central point remains the same—that the founda-
tions of scientific employee selection are European in origin and not North American, and these
principles, which remain in common use today, have been around for over a century now.
A consequence of the popularity of psychological procedures for employee selection was that
hundreds of criterion validity studies were carried out and reported during the first half of the
20th century in European countries, and they were largely used in civil and military contexts (e.g.,
see the journals Industrielle Psychotechnik, Le Travail Humain, Journal of National Institute of
Industrial Psychology, and the Proceedings of the International Association of Psychotechnics,
mentioning just a few of them). Three criteria were generally used in those validity studies: job
performance, training proficiency, and accidents.
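The core computation in such criterion validity studies, then as now, is the correlation between a predictor score and a criterion measure. Below is a minimal modern sketch using invented data; the scores and ratings come from no historical study.

```python
# Criterion-related validity as a Pearson correlation between a predictor
# (test score) and a criterion (e.g., rated job performance). Data invented.
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

test_scores = [55, 60, 65, 70, 80, 85]        # hypothetical predictor scores
performance = [2.0, 2.5, 2.4, 3.1, 3.6, 3.9]  # hypothetical criterion ratings

validity = pearson_r(test_scores, performance)  # about 0.98 for this sample
```

Training proficiency or accident counts would simply replace the performance ratings as the criterion series in the same computation.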
In 1920, following an idea by Bovet, Director of the Jean-Jacques Rousseau Institute of Geneva, the
first International Conference of Psychotechnics was held in Geneva (Switzerland) under Eduard
Claparède’s presidency (see the comprehensive collection of the 13 volumes edited by Grundlach,
1998). Claparède was also the first president of the International Association of Psychotechnics
founded that same year. Later, in 1952, the association was renamed, becoming the current International
Association of Applied Psychology (IAAP). In this early period, psychotechnics was primarily
associated with the use of apparatus tests aimed at assessing individual characteristics. In the early
1920s, various European companies started selling psychotechnical testing equipment while lead-
ing psychotechnicians constructed their own devices on the basis of their own ideas regarding the
psychological requirements of various vocations and the most suitable instruments for measuring
them. The use of this sophisticated equipment contributed significantly to the reputation of psycho-
technics as a solid and scientific enterprise. In the early years, the most well-known psychotechnicians
considered the development of instruments that would enable them to assess vocational aptitude as
precisely as possible to be one of their primary tasks. This is strikingly close to the present-day role
of test and product developers in psychological consultancies. Further relevant European
precursors in the field of employee selection are listed in Table 43.1.
924 Handbook of Employee Selection
TABLE 43.1
List of Representative Psychologists in Europe Working on Personnel Selection Before 1980

Belgium: Christiaens, Arthur G.; Brabant, V.; Decroly, Ovide J.; Piret, Roger; Coetsier, Leo; Biegel, Rebekka A.
Czechoslovakia: Forster, William; Seracky, Frank; Vana, Joseph
Denmark: Eriksen, Erik C.
France: Bacqueyrisse, Louis; Bonnardel, Raymond; Camus, Pierre; Faverge, Jean Marie; Goguelin, Pierre; Lahy, Jean Marie; Levy-Leboyer, Claude; Montmollin, Marie; Pacaud, Suzanne; Palmade, Guy
Germany: Amthauer, Rudolf; Arnold, Willen; Couvé, Raymond; Giese, Fritz; Klemm, Otto; Lipmann, Otto; Moede, Walter; Piorkowski, Curt; Poppelreuter, Walther; Rupp, Hans; Schuler, Heinz; Selz, Otto; Simoneit, Max; Stern, Eric; Stern, William; Tramm, Karl
Italy: Boganelli, Eleuterio; Bonaventura, Enzo; Diez Gasca, Maria; Faneli, Carlo; Ferrari, Giulio C.; Filippini, Azeglio; Gemelli, Agostino; Pizzoli, Ugo; Ponzo, Mario; Sanctis, Sante de; Spaltro, Enzo
Latvia: Möller, Johan
The Netherlands: Algera, Jan; De Wolff, Charles; Drenth, Peter; Roe, Robert A.
Poland: Wojciechowski, Jean; Joteiko, Josephine; Lehmann, Alfred; Studencki, Stanislas; Suchorzewski, Henri
Romania: Nestor, Jacob-Marius
Soviet Union: Kolodnaja, A.; Levitof, N.; Luria, A. R.; Rossolimo, G.; Spielrein, I.; Toltchinsky, A.
Switzerland: Baumgarten, Franzisca; Bovet, Pierre; Claparède, Eduard; Dachler, Peter; Fontégne, Julien; Suter, Jules; Walther, Leon
United Kingdom: Anstey, Edgar; Bartlett, Frederic C.; Bartram, Dave; Burt, Cyril; Farmer, Eric; Fletcher, Clive
A good example of the advances in psychotechnics for employee selection purposes was the proposal
by the German applied psychologist Lipmann (1922), who suggested that occupations varied according
to their cognitive demands and provided a scheme for classifying jobs. Lipmann’s proposal was very
popular in Europe during the 1920s and was well known amongst American industrial psychologists of
that era (e.g., Bingham, Kornhauser, Viteles). According to Lipmann, differences between occupations
were not solely due to different mental functions but also due to the different intensity with which these
were used. Lipmann proposed a classification of occupations on the basis of the nature of the object on
which the work is performed. Consequently, occupations could be divided into three groups depending
on whether the action is performed on things, people, or concepts (ideas). In essence, this is similar to Fine’s
scheme of functional job analysis, thereby being a very early precursor of the scheme for classifying
occupations used by the Dictionary of Occupational Titles (DOT). Examples of occupations involving
things are carpenters and watchmakers; professions involving people include physicians and attorneys;
and occupations involving concepts include philosophers and mathematicians. The Spanish
psychologist Mira expanded this classification by incorporating the relation between these dimensions
and cognitive abilities. Thus, according to Mira, things are related to spatial intelligence, people are
related to verbal intelligence, and concepts are related to abstract intelligence.
A cursory review of the main advances in European employee selection in its first years suggests
the following as critical events of the period.
1. According to the exhaustive review by Grundlach (1999), in 1911 a captain of the German
army named Meyer wrote an article in the journal Archiv für Psychologie [Archives of
Psychology] in which he suggested using psychological examinations for recruiting soldiers
(Grundlach, 1999). In 1915, Moede and Piorkowski established the first laboratory in which
psychologists examined truck drivers for the Preussische Gardekorps [Prussian Guard
Corps] in Berlin. At the end of the war, there were 17 laboratories with tens of thousands
of individuals examined (see Table 43.2 for a summary), and Moede had been appointed as
the chief of the laboratories for the psychological assessment of the truck reserve sections
within the inspection of motor troops (Leiter der psychologischen Prüfungslaboratorien
der Kraftfahr-Ersatz-Abteilungen bei der Inspektion der Kraftfahrtruppen). Two simula-
tion tests were used for assessing the drivers. Together with the selection of drivers, psy-
chological assessment was used for selecting pilots, radiotelegraphers, observers, and those
responsible for sound-localization services from 1916 onwards.
Together with Moede and Piorkowski, other well-known psychologists such as Eric Stern,
Klemm, Lipmann, Selz, and Wertheimer participated in the army selection program. From
1916 onwards during World War I, William Stern conducted a series of experiments for
selecting female tramway drivers (Stern, 1917; see also Sachs, 1920). He used tests of gen-
eral intelligence, reaction time, concentrated attention, and speed of response. In Germany,
psychotechnical laboratories and testing rooms existed in large organizations such as the Berlin
Post Office, the State Railways, A.E.G., Siemens, and Osram from 1922 onwards. Selection tests
were used for occupations as diverse as salesmen, telephonists, and railway workers (engine
drivers, plate-layers, supervisors, etc.; Miles, 1922).

TABLE 43.2
Some Results of the Aptitude Testing Program for Assessing Drivers in the German Army
Between 1915 and 1918

Laboratory    Total        Highly         Moderately     Tentatively    Sum of         Unqualified   Withdrawal of
              Individuals  Qualified (%)  Qualified (%)  Qualified (%)  Qualified (%)  (%)           Driver’s License (%)
Berlin        1,650        8              76             10             94             6             5
Stettin         983        10             62             25             97             3             9
Posen           134        10             43             29             82             18            35
Breslau       1,657        4              66             20             90             10            6
Düsseldorf    2,063        13             65             15             93             7             10
Köln          1,194        6              56             25             87             13            12
Hannover        350        22             56             12             90             10            11
Apolda          248        3              59             27             89             11            27
Mannheim        998        20             54             15             89             11            8
Danzig          158        7              52             29             88             12            4
Frankfurt       816        29             53             10             92             8             8
Total        10,251        11.7           62.7           17             91.4           8.6           11.4

Source: Adapted from Moede, W., Lehrbuch der Psychotechnik [Textbook of Psychotechnics], p. 433, Springer, Berlin, 1932.

Therefore, the German army
used scientific psychological employee selection at least 2 years before the American army
started its famous project with the Alpha and Beta tests (it must be remembered that this
project really began in 1918). Furthermore, German psychologists used multiple proce-
dures for employee selection such as intelligence tests, psychomotor tests, simulations, and
sensorial tests, rather than concentrating mainly on intelligence testing, as was the case
in the American army.
It is interesting to note that at the Conference of Applied Psychology, held in Berlin in
October 1922, apart from discussions about statistical methods, training in psychological
testing, measurement of personality, and cooperation between researchers and practitio-
ners, it was also discussed
what should be done with those who were rejected by the tests and the conclusion was
that for the interests of the workers and of the community it appears equally essential that
those candidates who are rejected should receive further help and advice. (Miles, 1922,
p. 192)
It is obvious that this discussion is a precursor of the current pragmatic concerns and
research into fairness, applicant reactions, and organizational justice in employee selection
procedures.
In 1913, Jules Suter, professor at the University of Zürich (Switzerland), intro-
duced psychotechnical examinations at the Bally Manufacturing Company (Heller,
1929; Baumgarten, 1949). Between 1916 and 1920, psychotechnic examinations using
Münsterberg’s and William Stern’s procedures for selecting drivers were conducted by
Claparède and Fontégne in Geneva (Switzerland), by Tramm in Berlin (Germany), by
Mira in Barcelona (Spain), by Sbowoda in Vienna (Austria), and by Houdmont, Defailly, and
Chopinet in Brussels (Belgium) (see Perez-Creus, 1947, for a more complete description
of the studies with drivers and also Sollier, 1927).
2. Pilot selection during World War I is another relevant example of the early contribution
of European psychologists. For example, Gemelli conducted psychophysiological exami-
nations for selecting military pilots in Italy as early as 1915, as did H. G. Anderson in
Great Britain, Camus in France, Selz, W. Benary, and Kronfeld in Germany, and
Lehmann in Denmark (Eriksen, 1927; Ferrari, 1939; Grundlach, 1999; Lahy, 1922; Perez-
Creus, 1947). The typical European model consisted of (a) an examination of the emotional
responses and their influence on the respiratory and cardiac rhythm, (b) an examination
of the psychomotor reaction times, (c) a measure of the intensity and duration of the emo-
tional responses, and (d) a measure of fatigability. Two approaches were used in this case:
• A global laboratory examination in which the candidate had to perform the job of a pilot
• An examination of specific functions required for doing the job
For comparative purposes, it should be remembered that the first studies with
American pilots were conducted in June 1917, when the U.S. “National Research Council
Committee on Aviation tried out tests reported as promising by French and Italian psy-
chologists, and other tests as well, on 75 candidates at the Massachusetts Institute of
Technology Ground School” as acknowledged by Hennon, one of those responsible for the
American aviation program (Hennon, 1919, p. 105). According to Damos (2007):
The United States had no airlines, no civil airports, and no civil pilots prior to 1917, and was
ranked 14th internationally in terms of aviation … The first American-made aircraft did
not see service in Europe until August, 1918. Thus, America had little in terms of aircraft,
production facilities, flying training schools, or pilots at the beginning of the war. … In early
1917, the U.S. Army decided to adopt a modified version of the Canadian and British flying
training programs (Damos, 2007, p. 12).
to distinguish between highly important and less important aspects. In Spain, Mallart
(1925) classified aptitudes according to the main professional groups.
4. An important characteristic of personnel selection under the psychotechnic model was
that a large network of laboratories, institutes and, most importantly, personnel was cre-
ated for supporting the efforts of these researchers. Thus, by 1926 there were up to 250
psychotechnic institutes in Germany; 11 in Czechoslovakia; 5 in Austria; 6 in Switzerland
and Hungary; 5 in Russia; 3 in Spain; and a similar number in Belgium, The Netherlands,
Poland, Finland, Italy, the Scandinavian countries, the United Kingdom, and the Baltic
Republics (i.e., Latvia, Estonia, Lithuania). Seven international congresses of psychotechnics
were held in the 14 years following the foundation of the International Association of
Psychotechnics in Geneva in 1920.
5. Notable advances in employee selection were made all over Europe, and not only in the
wealthier countries. For example, Duck conducted studies on engineers, clerical workers,
and construction workers in Innsbruck (Austria) (Perez-Creus, 1947). In Czechoslovakia,
between the two World Wars, work on employee selection was conducted at the Institute
of Psychotechnics in Prague, and the best-known Czech psychotechnician in those years
was William Forster, who designed a reaction time apparatus for assessing distributed
attention in 1928. This apparatus attracted remarkable interest and remained in use for many
years afterwards (e.g., in Spain) for assessing drivers. The members of the Institute
mainly focused on conducting employee selection for industry, the police, the army, and public
administration (Paulik, 2004). They also conducted selection processes for screening
students at secondary schools and universities. Together with the Institute of Psychotechnics
in Prague, other psychotechnic laboratories were founded in private industries, mainly for
personnel selection purposes: for example, the psychotechnic laboratory of the Prague
electrical company (founded in 1925), the laboratory of the Prague company Koh-i-noor
(founded in 1935), the laboratory of the textile industrial school at Liberec (founded in
1923), and the laboratory at the Vitkovice ironworks (founded in 1923).
Moreover, psychotechnic tests were used in large companies such as Bat’a in Zlin (dur-
ing 1937–1939) (see Paulik, 2004, for a detailed account). In general, the objective of this
employee selection effort was to improve productivity and to reduce injuries and accidents
by assessing the applicants’ aptitude. The success of this assessment program is illustrated
by a report given in 1932 at the Vitkovice Ironworks, documenting a 61% decrease in
workplace injuries (Paulik, 2004). According to an article published in the Journal of
the National Institute of Industrial Psychology (Anonymous, 1926a), entitled “Industrial
Psychology in Europe,” psychological tests were used by various large firms and to select
employees on the state railways in Austria. In Finland, from 1922 onwards, rail-
way employees were selected on the basis of tests and airmen were given psychological
tests in the army laboratories (Anonymous, 1926a). In Italy, the Psychological Laboratory
of the Catholic University of Milan, founded by Gemelli in 1921, carried out different
validity studies and developed tests for the selection of spinners in artificial silk mills,
women workers in wool industries, and staff in two large shoe factories; research was also
done on the selection of airplane pilots (Ferrari, 1939). Also in Italy, Maria Diez-Gasca
conducted studies on the selection of women for millinery, and Marzi conducted studies
for selecting smiths, welders, turners, mechanics, motor drivers, electricians, fitters, and
switchboard operators (Ferrari, 1939). In Riga (Latvia), firemen were selected using tests
(Anonymous, 1926a). In Warsaw (Poland), there was a special testing center for engine
drivers (see Wojciechowski, 1927). Tests were particularly used on the European railways;
for instance, Nestor (1933a, 1933b) reported that as early as 1917 tests for engine drivers
were introduced in Dresden by Ullbricht, Schreiber, and Glasel, and that in 1925, Couvé
and Heydt used tests taken by the whole staff, including office workers (by the end of
1930, 75,000 persons had been tested; see also Anonymous, 1926b). Nestor (1933b; see
also Viteles, 1938, for Russia) reported that in 1924, Karner introduced cognitive tests for
apprentices on the Austrian Federal Railways; in 1925, the Swiss Federal Railways intro-
duced ability tests for workshop apprentices. Also in 1925, the selection of engine drivers
and stokers was introduced in Russia (by Kolodnaja), and the Dresden Method together
with Rossolimo’s tests were used. In 1926, tests were introduced in Czechoslovakia for
selecting engine drivers, motor-trolley drivers, and office workers at various railway com-
panies. Since 1927, cognitive and psychomotor ability tests were introduced in Latvia (by
Möller), Poland (by Wojciechowski), Finland (by Hjelt, Vuolle, and Petterson), Yugoslavia,
Italy (by Filippini), Belgium, Sweden (Anderberg, 1936), Hungary, Norway, France (by
Lahy), and Romania (by Nestor).
6. Together with many paper-and-pencil tests, European psychotechnicians designed and
created many sets of apparatus and devices for assessing mental, psychomotor, and
sensorial aptitudes, and a robust industry grew up around the production and sale
of these apparatus. Film technology was also used to achieve a more naturalistic setting
(e.g., see Erisman & Moers, 1926; Giese, 1925, for illustrations and photographs of many
of these devices). During the 1930s, motor coordination tests received considerable attention in
Europe. Hopkins (1939, 1944), reporting on his visit to Berlin in 1936, mentioned that the
German air force used complex coordination tests and reaction tests for selecting pilots;
this information was later confirmed by Fitts (1946). Hopkins (1939) also mentioned that
84 psychologists were employed in the 15 army psychological laboratories that existed
in Germany. In the United Kingdom, Bartlett and Craik (1939) developed a psychomotor
test called the “Reid machine” for the Royal Air Force that later became known as the
Sensory Motor Apparatus (SMA). Williams (1940) reported the validity of the SMA as
being moderately high.
7. The examination of leaders and administrators was conducted as long ago as 1914 in France
and 1925 in the former USSR (Henry, 1914; Jurowskaja, 1925). Pryns Hopkins described
a very interesting test used by the Hungarian army since 1933 in a paper presented at a
meeting of the British Psychological Society on April 3, 1938, and subsequently published
as an article (Hopkins, 1939). The tests
were conducted in a hall which resembled a gymnasium. Here ladders, bars, boards, building
blocks, ropes, etc., were lying about, and two officers who were examining a candidate for
commissioned rank were requiring him to instruct three men to use this material for the
construction of a bridge suitable for them to cross. The candidate was first left alone to
consider the situation. He then informed the two officers how he would proceed with his
instructions, and listened to their comments. Thereupon he summoned his three men and,
just as if he were their officer, gave them the orders and the supervision necessary for the
execution of their task. (Hopkins, 1939, p. 60)
This is pointedly not far removed from modern-day assessment center exercises still in use
by the British Army for officer selection (Anderson, Lievens, van Dam, & Born, 2006).
Subsequently, the two officers judged the candidate’s suitability for promotion. Another
test for selecting sergeants was taken in
a hall equipped with desks at which the candidates sat for a group test before a screen, on which
were thrown pictures of an air raid, anti-aircrafts guns going into action, the extinguishing
of fires, care for the wounded, etc. After viewing any one of such films, the candidates had to
give a report of what they had seen and to answer questions upon it. (Hopkins, 1939, p. 61)
Many other interesting tests and simulations used by the Hungarian army were described
in Hopkins’ article, and it is obvious again that the modern assessment center method is a
descendant of these procedures.
European leadership in employee selection during the first three decades of the 20th century
was acknowledged by the most highly regarded American scholars, such as Viteles (1926, 1928, 1930, 1932),
Kornhauser (1929–1930), Kitson (1922), and Bingham (1919). In their writings, they reviewed the
extensive work carried out in European countries and compared it with the developmental level of
American employee selection at that time.
However, a few years after these reviews, European employee selection started to dramatically
lose its leading role. Indeed, it has never quite recovered since that time, and the indisputable cur-
rent leader of employee selection research is certainly the United States. However, it is also fair
to acknowledge that researchers such as Bingham, Viteles, Kitson, and Kornhauser had special
characteristics: they read research published in several languages, including German, French, and
Russian (Viteles); they visited the European research centers; and they were close colleagues of
the leading European psychologists of the time, such as Lahy, Moede, Lipmann, Baumgarten, Mira,
Myers, and Spielrein, to mention just a few.
Several political reasons can be put forward to explain the decline of European personnel
selection at the end of the 1930s. In the USSR, psychological testing was forbidden as capitalist
ideology, psychotechnic offices were closed, and many industrial psychologists disappeared
in Stalin’s purges. In Germany, Austria, Poland, and the Czech Republic, many research
institutes were closed, and the most relevant researchers and professors were condemned to exile or
worse (for example, Stern, Lewin, Baumgarten, and many others during the Nazi period). Some of
them chose suicide, including Klemm and Lipmann; Lipmann ended his life when the
Nazis closed his institute and he was fired (Baumgarten, 1933, 1940). Similarly, Rebecca Biegel,
director of the psychotechnical laboratory of the Postal Office in Holland, took her own life,
thereby avoiding death in a concentration camp (Baumgarten, 1949).
Such personal histories make for salutary reading, and it is perhaps somewhat shocking for
readers of this handbook to discover the tragic fates of some of these individuals who were
undoubtedly persecuted under different regimes merely for being scientists or practitioners in
I-O psychology. In Italy, as another example, Enzo Bonaventura was persecuted because of his
Jewish origin and religious beliefs, and he had to abandon the country and flee to Palestine.
Bonaventura (1934) had been one of the first Italian contributors to the examination of emotions
for personnel selection purposes. In Spain, the fascist rebellion against the Spanish Republic and
its eventual victory meant that the most reputable Spanish industrial psychologists, such as Mira,
Rodrigo, and Germain, had no choice but to disappear into exile, whilst others were fired from
their jobs (e.g., Mallart).
The selection system for soldiers and officers of the Spanish Republican army (1937–1939) was
described by Mira (1943), who was Chief of the Psychological and Psychiatric Division of the Spanish
Republican Army. The use of cognitive and psychomotor tests as well as personality questionnaires
and interviews formed the basis for the selection decision.
The beginning of World War II produced an almost complete paralysis of research efforts in most
continental European countries and claimed legendary individuals such as Lahy, who died in Paris
during the Nazi occupation of the city. Consequently, the period between 1935 and 1945 practically marked the
end of the successful period in several European countries (with perhaps the only exception being
Great Britain).
However, this panoramic view should not suggest that German personnel selection during the
Nazi period was inefficient or nonexistent. This is simply not true. Two important figures con-
tinued their research and practice under Nazi support. One of them was Moede, probably the
most important German industrial psychologist between 1915 and 1945. Moede was the edi-
tor of the journal Industrielle Psychotechnik between 1924 and 1944. The second important
German researcher under the Third Reich was Max Simoneit. Mira (1943) considered the book
by Simoneit (1938) as the main source of information about military officer selection in the Nazi
army. Incidentally, Simoneit was a member of the group that plotted the July 20, 1944, attempt
on Hitler’s life. According to several authors (Ansbacher, 1941; Fitts, 1946; Mira, 1943; Simoneit,
1938), the most important key in the selection and promotion of officers in the German army
was the assessment of personality and character. A selection team consisting of two officers, a
military physician, three psychologists, and a psychiatrist was assigned to each candidate. The
team collected information regarding biographical background, expression analysis, psychological
testing, behavior efficiency assessment, and a final examination. Biographical information was
collected using interviews. Expression analysis consisted of the analysis of nonverbal behavior
and communication, facial expression, verbal content, and graphological (handwriting) analysis.
The psychological testing consisted of a job simulation task in which the candidate could consult
books, ask for supplementary information, and then choose what he thought was the best method for
solving a task-related problem. After finishing the task, the candidate was asked about the solving
process and about the reasoning process. On the basis of those explanations, his performance on
each task was evaluated. The assessment of behavior efficiency was done by placing the candidate
in an experimental situation with some degree of risk and adventure. The final examination, called
Führerprobe [Command Test], a test of leadership capabilities, was the most complex procedure.
The assessment lasted for 2 days, and in this time, the candidate was subjected to interrogations,
athletic tests, and manual and mental work.
An important part was a task in which the candidate had to lead a group of unknown soldiers,
ordering them to carry out displeasing tasks. In this last part of the selection process, the team made
notes not only about the reaction of the future officer but also about the reactions of the soldiers,
because the German directorate assumed that the soldiers’ reactions were a good indicator of the
officer’s leadership ability. Finally, the can-
didate was confronted with future colleagues to informally discuss various topics under the obser-
vation of the examining team. Putting together all of this information and test results, a diagnosis
of the candidate’s abilities was given. As can be seen, this assessment procedure is once again very
similar to our current assessment centers, and the British and U.S. armies very soon imitated it (see
Highhouse, 2002, for further details).
difficult to maintain. In the first Congress of the International Association of Psychotechnics (IAP)
held after World War II, Baumgarten (see Grundlach, 1998) described a devastating pan-
orama: Dunajervski, Syrkin, Spielrein, Gastew, and Kolodnaja were dead or imprisoned in Russia;
Stern, Lipmann, Giese, Poppelreuter, and Klemm in Germany; Lahy and Weinberg in France; and
Studencki, Segal, Lipszycowa, and Przedborska in Poland. Claparède and Myers were also dead.
Out of 27 members of the Director Committee of the IAP, 11 were dead and 2 had been expelled
from the Director Committee for ideological reasons. Another source of difficulty was obtaining
publications, especially scientific journals. Information on books was also hard to come by, let
alone the books themselves, which many could not afford to buy. In fact, these kinds of difficulties
persisted until the 1990s. In this regard, the Congresses of the
IAP were perhaps the best opportunity for many personnel selection researchers to share their find-
ings and to get to know what was being researched across the world, especially across European
countries. Consequently, an examination of the proceedings of the IAP gives a fascinating insight
into developments during this period in European selection psychology.
A cursory review shows that between 1949 and 1958 four congresses were held in Bern, Paris,
Gothenburg, and Rome and that, for example:
• Studies on the selection of railway workers and drivers were made in France and Italy.
• German and Italian psychologists reported the status of personnel selection in their
countries.
• The classification of occupations was another issue of interest for Belgian, French, and
German researchers.
• Personality in personnel selection was the subject of several studies conducted by British,
French, and Spanish psychologists.
• The measurement of criteria and methodological advances in employee selection was
another theme of interest for psychologists from Belgium, France, Italy, Spain, and the
United Kingdom.
• Other issues of interest were the validity of interviews, the selection of engineers, military
pilot selection, and aptitude tests and their relation with accidents at work.
Of crucial importance in this period were collaborations between the American armed forces
and their European counterparts (e.g., Belgian, British, Dutch, German, Norwegian, Spanish, and
Swedish armies, air forces, and navies). In Holland, Buiten and Hofstee (1971), Fokkema (1958),
and Helbing (1964) conducted validity studies for the selection of pilots, fighter control officers, and
soldiers using cognitive and personality tests. In Spain, a series of validity studies was conducted
with the U.S. Air Force Cognitive Battery for selecting pilots and drivers (e.g., Germain, Pinillos,
Ramo, & Pascual, 1958, 1959). Riggs (1958) in the Royal Norwegian Air Force and Anttila (1968) in
Finland conducted studies for selecting pilots and air mechanics, respectively.
These collaborative efforts were frequent during the 1950s and 1960s and continued in the next
decades. However, and rather curiously, there was practically no contact with American selection
researchers working at universities or in industry. In other words, this collaboration was limited to
psychologists working in military settings in Europe and the United States. The renowned American
I-O psychologist Walter Bingham wrote in 1951:
During the third week of July the Thirteenth International Congress of Psychology was convened in
Stockholm; and following it, in Gothenburg was held the Tenth International Congress of Applied
Psychology (or call it “Psychotechnique” if you are speaking French). It was this latter congress in
which I had an active role and in which you who look at Personnel Psychology might be expected to
have a special interest.
As to your real interest in such international affairs I am not certain since almost none of you
attended; or, I ought to say, almost none who live in the USA. Readers of this journal in countries
scattered over six continents took part; but our homeland, which furnishes employment to half of the
Employee Selection in Europe 933
A few further relevant contributions that appeared in this period must also be mentioned.
Probably, the most important was the book by Vernon and Parry (1949), in which these two British
psychologists gave an account of the employee selection program conducted in the British armed
forces during the war, the largest research program conducted in Europe. Even today,
it is a model of good practice and research. Together with this book, Vernon, Parry, Burt, and other
British psychologists published a series of articles in which they reported the essence of the job
done and reported the validity studies conducted (e.g., see Parry, 1947; Vernon, 1947a). They had
also made important contributions to methodology, including developing formulas for correcting
validity coefficients for range restriction in the predictor and criterion, which, unfortunately,
remained widely unknown in these years (Burt, 1942a, 1942b; Vernon, 1946). Much of this work was
subsequently transferred to civilian organizations and, especially, to the British Civil Service (see
Anstey, 1977).
For example, assessment center technology was introduced in the Civil Service and for selecting
police officers. The origins of these procedures remain visible in assessment centers conducted
today in both of these contexts (Feltham, 1988).
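Corrections of this kind remain in everyday use in validity research. As an illustration only, the following sketches the now-standard univariate (Thorndike Case II) correction for direct range restriction on the predictor; the function name is ours, and this is not a reconstruction of Burt's or Vernon's original formulations.

```python
from math import sqrt

def correct_for_range_restriction(r_restricted, sd_applicants, sd_incumbents):
    """Thorndike Case II correction for direct range restriction on the predictor.

    r_restricted:  validity observed in the restricted (selected) sample
    sd_applicants: predictor SD in the unrestricted applicant pool
    sd_incumbents: predictor SD in the restricted (hired) sample
    """
    u = sd_applicants / sd_incumbents  # ratio of unrestricted to restricted SD
    return (r_restricted * u) / sqrt(1 + r_restricted**2 * (u**2 - 1))

# Example: an observed validity of .30 among hires, whose predictor SD is half
# that of the applicant pool, corrects to roughly .53 in the applicant population.
corrected = correct_for_range_restriction(0.30, sd_applicants=2.0, sd_incumbents=1.0)
```

When the two standard deviations are equal (no restriction), the formula returns the observed coefficient unchanged.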
In France, Bonnardel, the leading French personnel selection psychologist of the period,
conducted many validity studies and developed new tests (see Salgado et al., 2003, for a large list of
Bonnardel's validity studies). Other French psychologists with relevant contributions to personnel
selection were Pacaud, Goguelin, and Palmade (see Pacaud, 1971; Palmade 1948). Studies on
the selection of engineers, supervisors, drivers, and mechanics were conducted, new tests were
developed, and new laboratories were created (e.g., the psychological laboratory at Peugeot) by this
group of highly active French researchers. The selection of railway workers was another focus of
interest in this historical period, and studies were conducted in France, Spain,
Belgium, and other European countries. For example, in Romania, Craciunescu (1969) conducted a
validity study of cognitive tests for selecting railway mechanics.
In Belgium, studies were done by Miret-Alsina (1958) at the Sabena Company for selecting civil-
ian pilots. In the Scandinavian countries, Trankell (1959) also developed a procedure for selecting
pilots for the Scandinavian Airlines System (SAS). The procedure required the individual to hold a
pencil in each hand and to place one dot in each of a series of circles connected by straight lines.
One series of circles was used for the right hand and the other for the left hand. The individual had
to move the left and right hands alternately from circle to circle, following the lines. The test was
validated using the results of a training course. This selection process for SAS included other tests
such as verbal intelligence, maturity, and tact. In the 1960s and 1970s, a projective test, the Defense
Mechanism Test (DMT), developed by Kragh at the University of Lund (Sweden), received a great
deal of attention. Several studies were published reporting its validity (see Kragh, 1960, for a com-
plete description of the test and the validity studies). The DMT displays several pictures through
a tachistoscope, with the exposure time gradually increased. The individual must provide a
description of what she/he has seen. The responses are assessed in terms of the defense mechanisms
revealed.
The 1960s and 1970s can best be conceptualized as the obscure period of European employee
selection, because apparently little research was being done in European countries and notably little
interest in this research was shown in the universities. For example, only a couple of presentations
were given at the 16th and 17th International Congresses of Applied Psychology, held in Holland
(1968) and Belgium (1971). This is, of course, a generalization, and a few exceptions can be found
in different countries. The most important were the researchers of the University of Amsterdam
(Holland), who carried out a relatively large number of doctoral
dissertations and internal reports on the validity of psychological tests (see Salgado et al., 2003, for
a list of these contributions), with Drenth and Roe as the most significant researchers. In Germany,
Amthauer (1970) conducted a large series of validity studies with the I-S-T 70 intelligence test.
The consequences of this lack of research and interest were that the discipline disappeared from
many university psychology programs, few students sought careers in this occupational field, there
was practically no interchange between researchers and practitioners, and, most importantly, few
jobs and opportunities were offered to psychologists working as practitioners in employee selection
(Warr, 2007).
validity study of the selection system used in the Belgian armed forces; Busch-Jensen from
Denmark addressed the question of the values associated with selection procedures; and Bartram
and Dale reported a validity study on personality measures for predicting the training proficiency
of British military pilots. In another presentation, Bartram described “Micropat,” a computerized
test battery for pilots; Horia Pitaru presented a study on the validity of selection tests for pre-
dicting job performance in data processing occupations in Romania; Poppleton from the United
Kingdom reported on the development and validation of a test for the measurement of sales
aptitude; and Schuler and Stehle presented an empirical study on the validity of the assessment
center method (see IAAP, 1982).
Today, European employee selection researchers contribute to the main journals in the world,
such as Journal of Applied Psychology, Personnel Psychology, Human Performance, Applied
Psychology: An International Review, and some of them serve on the editorial boards or even as
editors. In Europe, employee selection researchers publish articles in journals such as
International Journal of Selection and Assessment, Journal of Occupational and
Organizational Psychology, Zeitschrift für Personalpsychologie, European Journal of Work
and Organizational Psychology, or Journal of Organizational Behavior. In 1993, a European
researcher (Neil R. Anderson) founded the first journal in the world devoted entirely to employee
selection research, the International Journal of Selection and Assessment, today among the lead-
ing journals in the world for this discipline according to the Thomson Institute for Scientific
Information (ISI).
Employee selection is currently an important research area in European work and organizational
psychology (Roe & van den Berg, 2003), and examples of active researchers include Derous, de
Corte, de Fruyt, De Witte, and Lievens in Belgium; Rolland and Steiner in France; Höft, Hülsheger,
Hornke, Kanning, Kersting, Krause, Marcus, Moser, and certainly Schuler in Germany; Aramburu-
Zabala, Garcia-Izquierdo, Gorriti, Moscoso, Osca, Sáez, and Salgado in Spain; Anderson, Bartram,
Cunningham-Snell, Robertson, Silvester, Smith, and Warr in the United Kingdom; Bertolino in
Italy; Nicolaou in Greece; Kleinmann, König, and Klehe in Switzerland; and Born, Evers, Roe, te
Nijenhuis, van Dam, van den Berg, van Vianen, and Voskuijl in The Netherlands. They have
participated in symposia and congresses and published their research with such well-known and
respected American
scholars as Wally Borman, James L. Farr, Milton Hakel, Michael Harris, Scott Highhouse, Tim
Judge, Rich Klimoski, Frank J. Landy, Gary Latham, Kevin R. Murphy, Deniz S. Ones, Mitchell
Rothstein, Paul R. Sackett, Juan I. Sanchez, Frank Schmidt, Paul E. Spector, Dirk D. Steiner, George
Thornton, and Vish Viswesvaran, to mention just a few.
Among the most recent contributions of European employee selection, several edited books
must be mentioned, including International Handbook of Selection and Assessment (Anderson &
Herriot, 1997), Handbook of Industrial, Work and Organizational Psychology (Anderson, Ones,
Sinangil, & Viswesvaran, 2001), Handbook of Personnel Selection (Evers, Anderson, & Voskuijl,
2005), and La Psychologie Appliquée a la Gestion des Resources Humaines (Levy-Leboyer, Huteau,
Louche, & Rolland, 2001). Relevant contributions made by European researchers include studies
on the multimodal interview (Schuler, 2002; Schuler & Funke, 1989), on the construct validity of
assessment centers (Lievens & Van Keer, 2001; Scholz & Schuler, 1993), on interview validity and
constructs assessed by interviews (Salgado & Moscoso, 2002; Schuler, 2002), the meta-analyses
of the validity of cognitive ability testing (Bertua et al., 2005; Hülsheger, Maier, & Stumpp, 2007;
Salgado & Anderson, 2003; Salgado et al., 2003), the examination of the validity of situational
tests (Lievens, Buyse, & Sackett, 2005; Lievens & Sackett, 2006), Internet-based recruitment and
testing (Bartram, 2000; Lievens & Harris, 2003), the contributions on fairness reactions to employee
selection procedures (see the whole issue of the International Journal of Selection and Assessment
edited by Neil R. Anderson, 2004), the meta-analyses of the validity of personality measures
(Salgado, 1997, 2002, 2003), the studies on typical and maximum performance (see the whole
issue of Human Performance edited by Ute-Christine Klehe, Neil R. Anderson, & Chockalingam
Viswesvaran, 2007), and the studies on the adverse impact of selection procedures (De Corte,
Lievens, & Sackett, 2006; see also Annual Review articles by Hough & Oswald, 2000, and Sackett
& Lievens, 2008, for more comprehensive reviews of the European contributions).
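The meta-analyses cited above build on the Schmidt-Hunter psychometric meta-analysis procedure (discussed critically in the European literature by Algera, Jansen, Roe, & Vijn, 1984). As a rough sketch only, not a reproduction of any of the cited analyses, the "bare-bones" core of that procedure computes a sample-size-weighted mean validity and subtracts expected sampling-error variance from the observed variance; the function name is ours.

```python
def bare_bones_meta_analysis(rs, ns):
    """Bare-bones psychometric meta-analysis of validity coefficients.

    rs: observed validity coefficients from individual studies
    ns: corresponding sample sizes
    Returns the sample-size-weighted mean r and the residual variance
    left after subtracting expected sampling-error variance.
    """
    n_total = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / n_total
    var_observed = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / n_total
    n_bar = n_total / len(ns)                       # average sample size
    var_error = (1 - r_bar**2) ** 2 / (n_bar - 1)   # expected sampling-error variance
    return r_bar, max(var_observed - var_error, 0.0)

# Example: three studies whose validities differ by no more than sampling error
# predicts, so the residual variance is zero and the mean validity generalizes.
mean_r, residual_var = bare_bones_meta_analysis([0.20, 0.30, 0.40], [100, 100, 100])
```

A residual variance near zero is the signal, in this framework, that a validity coefficient generalizes across settings, which is the logic behind the international validity-generalization findings summarized below.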
In summary, in our opinion, the common characteristics of the current personnel selection field
across the European countries are the following:
1. A large cultural diversity, expressed in the use of different techniques and tools across
the European countries with differing frequencies of use. For example, cognitive tests are
more often used in the United Kingdom and Spain than in Germany or Italy.
2. An international generalization of the validity of cognitive ability tests across the European
Union (EU).
3. Great interest in the use of personality testing and studies showing validity coefficients
similar to those found in the United States.
4. Great interest in applicant reactions and the ethics of personnel selection. The social con-
text of the selection process is viewed as highly relevant.
5. Construct validity of the personnel selection procedures is crucial.
6. Growing interest in possible employment discrimination due to various selection tools
and procedures.
7. A confluence of the national employment laws within the framework of the EU (currently
consisting of 25 countries with over 400 million individuals), with an emphasis on the
rights of the nationals of the EU countries.
8. A deeply held concern for the researcher-practitioner divide.
9. Concern for the “turbulent” socioeconomic environment, which implies frequent and
quick changes in jobs and organizations. In this context, competency models are viewed
as a possible line of advance together with new conceptualizations of personnel selection
as a “theater model” (i.e., this model “aims at developing competences between selection,
learning in practice, and direction,” and puts “the emphasis on the principle of employ-
ment and career opportunity”; Roe & van den Berg, 2003, p. 275).
10. Finally, Roe and van den Berg (2003) identified six principles which provide direction and
legitimacy to selection practices in Europe: (a) meritocracy, (b) risk avoidance, (c) employ-
ment and career opportunity, (d) fair chance, (e) two-sidedness, and (f) involvement.
CONCLUSIONS
In this chapter we have summarized three main periods of development over the history of
European personnel selection research and practice. In so doing, we have brought to light several
vitally important contributions by scholars who seem to have been unfairly underacknowledged
within Europe but especially in North America. Our retrospective review is, of course, necessarily
a narrative one, influenced by our own interpretations of historical precedent.
However, we suspect that many North American colleagues may never have even heard of some
of the names cited in this chapter, let alone been aware of some of the persecution and hardships
endured by many European I-O psychologists in the past. These personal histories make salutary
reading for us all. We believe the early establishment of the fundamental bases of employee
selection theory, methods, and techniques in Europe some 100 years ago remains impactful today.
We have argued that modern-day methods can be directly traced back to these historic forebears
and, indeed, that the principles of scientific personnel selection were established during this
period. For a science that relies so heavily on the measurement of past behavior to predict future
behavior, it is curious and ironic that present-day accounts of selection in many of the popular
textbooks fail to take heed of these historic contributions. Despite periods of growth, difficult
periods, and periods of stagnation, our overall conclusion is that European employee selection
is alive and well. It is experiencing a second youth, and we look forward to future developments
with keen interest.
ACKNOWLEDGMENTS
Jesús F. Salgado’s work for this chapter was partially supported by grant SEB1098-2005 from
Ministerio de Educación y Ciencia (Spain) and grant PSI2008-03592/PSIC from Ministerio de
Ciencia e Innovación (Spain). We thank Nancy T. Tippins, James L. Farr, and Wally Borman for
their valuable comments on an earlier version of this chapter.
A chapter on European employee selection coauthored by a Spaniard, an Englishman, and a
German would have been unthinkable 60 years ago. We dedicate this chapter to the bravery, forti-
tude, and honor of former colleagues who suffered incalculably for the science and practice of I-O
psychology and employee selection. We also dedicate the chapter to our American and European
colleagues who are contributing to the progress of our science.
REFERENCES
Algera, J., Jansen, P. G. W., Roe, R. A., & Vijn, P. (1984). Validity generalization: Some critical remarks at the
Schmidt-Hunter procedure. Journal of Occupational Psychology, 57, 197–210.
Altink, W. M., Roe, R. A., & Greuter, M. A. (1991). Recruitment and selection in The Netherlands. European
Review of Applied Psychology, 41, 35–45.
Amthauer, R. (1970). Intelligenz-Struktur-Test (I-S-T 70). Handanweisung für die Durchführung und
Auswertung [Intelligence-structure-test (I-S-T 70). Manual]. Göttingen, Germany: Hogrefe.
Anderberg, R. (1936). Psychotechnische Rekrutierungsmethoden bei den schwedischen Staatsbahnen.
[Psychotechnical methods for recruitment in Swedish railways]. Industrielle Psychotechnik, 13,
353–383.
Anderson, N., & Herriot, P. (Eds.). (1997). International handbook of personnel selection. Chichester, England:
Wiley.
Anderson, N., Lievens, F., van Dam, K., & Born, M. (2006). A construct-driven investigation of gender differ-
ences in a leadership-role assessment center. Journal of Applied Psychology, 91, 555–566.
Anderson, N., Ones, D. S., Sinangil, H. K., & Viswesvaran, C. (Eds.). (2001). International handbook of indus-
trial, work and organizational psychology. London, England: Sage.
Anonymous. (1926a). Industrial psychology in Europe. Journal of the National Institute of Industrial
Psychology, 3, 264–267.
Anonymous. (1926b). Selection tests on the German railways. Journal of the National Institute of Industrial
Psychology, 3, 201–204 [This paper is based on a lecture given by Dr. Glasel to the Railway Student’s
Association at Dresden in May 1926].
Ansbacher, H. L. (1941). German military psychology. Psychological Bulletin, 38, 370–392.
Anstey, E. (1977). A 30-year follow-up of the CSSB procedure, with lessons for the future. Journal of
Occupational Psychology, 50, 149–159.
Anttila, H. (1968). Studies in the selection of air mechanics. In IAAP, Proceedings of the 17th International
Congress of Applied Psychology, July 18–22, Amsterdam, The Netherlands.
Bartlett, F. C., & Craik, K. J. (1939). Report on the Reid Machine (Report Nº 59). London, England: Flying
Personnel Research Committee, Royal Air Force, Ministry of Defense.
Bartram, D. (1987). The development of an automated testing system for pilot selection: The Micropat Project.
Applied Psychology: An International Journal, 36, 279–298.
Bartram, D. (2000). Internet recruitment and selection: Kissing frogs to find princes. International Journal of
Selection and Assessment, 8, 261–274.
Baumgarten, F. (1933). Otto Lipmann—Psychologist. Personnel Journal, 12, 324–327.
Baumgarten, F. (1940). Otto Klemm, professeur de psychologie appliqué à l’Université de Leipzig (1884–1937)
[Otto Klemm, professor of applied psychology at the University of Leipzig (1884–1937)]. Le Travail
Humain, 8, 96–98.
Bingham, W. V. (1919). Army personnel work. With some implications for education and industry. Journal of
Applied Psychology, 3, 1–10.
Bingham, W. V. (1951). Today … and yesterday. Nº16. In Scandinavia. Personnel Psychology, 4, 429–438.
Bonaventura, E. (1934). L'examen de l'émotivité dans la sélection des conducteurs des véhicules rapides [The
sensitivity test in the selection of drivers of fast motor vehicles]. In H. Grundlach (Ed.). (1998). Applied
Psychology: The first-thirteenth Congress Proceedings of the International Association of Applied
Psychology, (Vol. 8, pp. 225–229). London, England: Routledge.
Bruchon-Schweitzer, M., & Ferrieux, D. (1991). Une enquête sur le recrutement en France [An investigation of
recruitment in France]. European Review of Applied Psychology, 41, 9–17.
Buiten, B., & Hofstee, W. K. B. (1971). The prediction of adjustment to military life. In R. Piret (Ed.),
Proceedings of the 17th International Congress of Applied Psychology, July 18–22 (pp. 253–258).
Brussels, Belgium.
Burt, C. (1942a). The value of statistics in vocational psychology (I). Occupational Psychology, 16, 164–174.
Burt, C. (1942b). The value of statistics in vocational psychology (II). Occupational Psychology, 17, 25–33.
Carroll, J. B. (1951). Personnel psychology abroad. The International Association of Psychotechnology.
Personnel Psychology, 4, 127–133.
Chleusebairgue, A. (1934). Psicología del trabajo professional [Psychology of professional work]. Barcelona,
Spain: Labor.
Chmiel, N. (2000). History and context for work and organizational psychology. In N. Chmiel (Ed.),
Introduction to work and organizational psychology. A European perspective (pp. 1–19). Oxford,
England: Blackwell.
Craciunescu, R. (1969). Validarea examenului psihologic in transportul feroviar [The validation of psychological
examination in the railways transport]. Revista de Psihologie, 15, 405–415.
Damos, D. L. (2007). Foundations of military pilot selection systems: World War I. (Technical Report 1210).
Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Dany, F., & Torchy, V. (1994). Recruitment and selection in Europe: Policies, practices and methods. In
C. Brewster & A. Hegewisch (Eds.), Policy and practice in European human resource management: The
Price-Waterhouse-Cranfield Survey. London, England: Routledge.
De Corte, W., Lievens, F., & Sackett, P. R. (2006). Predicting adverse impact and multistage mean criterion
performance in selection. Journal of Applied Psychology, 91, 523–537.
De Wolff, C. J., & van den Bosch, G. (1984). Personnel selection. In P. J. D. Drenth, H. Thierry, H. Willems, & C. J.
de Wolff (Eds.), Handbook of work and organizational psychology (Vol. 1). Chichester, England: Wiley.
Eriksen, E. C. (1927). New psychological test for pilots. In H. Grundlach (Ed.). (1998). Applied psychology:
The first-thirteenth Congress Proceedings of the International Association of Applied Psychology (Vol. 4,
pp. 353–369). London, England: Routledge.
Erismann, T., & Moers, M. (1926). Psicología del trabajo professional [Professional work psychology].
Barcelona, Spain: Labor.
Evers, A., Anderson, N., & Voskuijl, O. (Eds.). (2005). International handbook of personnel selection. London,
England: Blackwell.
Feltham, R. (1988). Validity of a police assessment center: A 1–19 year follow-up. Journal of Occupational
Psychology, 61, 129–144.
Ferrari, C. A. (1939). Industrial psychology in Italy. Journal of Occupational Psychology, 13, 141–151.
Fitts, P. M. (1946). German applied psychology during World War II. American Psychologist, 1, 151–161.
Fokkema, S. D. (1958). Aviation psychology in The Netherlands. In J. B. Parry & S. D. Fokkema (Eds.),
Aviation psychology in western Europe (pp. 58–69). Amsterdam, The Netherlands: Swets & Zeitlinger.
Gemelli, A. (1917, August). The application of psychophysiological methods to the examination of candidates
for aviation service. Rivista di Psicologia.
Germain, J., Pinillos, J. L., Ramo, M., & Pascual, M. (1958). Estudios sobre la selección de conductores en el
ejército del aire [Studies on selection of drivers in air army]. Revista de Psicología General y Aplicada,
13, 777–790.
Germain, J., Pinillos, J. L., Ramo, M., & Pascual, M. (1959). Selección de pilotos en el ejército del aire español
[Selection of pilots in the Spanish air army]. Revista de Psicología General y Aplicada, 14, 75–114.
Giese, F. (1925). Handbuch psychotechnischer Eignungprüfungen [Handbook of psychotechnical tests of
ability]. Halle, Germany: C. Marhold.
Grundlach, H. (Ed.). (1998). Applied psychology: The first-thirteenth Congress Proceedings of the International
Association of Applied Psychology (Vol. 1–13). London, England: Routledge.
Grundlach, H. (1999). El factor humano y el ingreso de la psicología y la psicotecnia en la guerra [The human
factor and the introduction of psychology and psychotechnics in the war]. Persona, 2, 163–179.
Guion, R. M. (1976). Recruiting, selection and job placement. In M. D. Dunnette (Ed.), Handbook of industrial
and organizational psychology. Chicago, IL: Rand McNally.
Helbing, J. C. (1964). Results of the selection of R.N.A.F air traffic and fighter control officers. In A. Cassie,
S. D. Fokkema, & J. B. Parry (Eds.), Aviation psychology. Studies on accident liability, proficiency criteria
and personnel selection. The Hague, The Netherlands: Mouton.
Heller, W. J. (1929). Industrial psychology and its development in Switzerland. Personnel Journal, 8,
435–441.
Henmon, V. A. C. (1919). Air service tests of aptitude for flying. Journal of Applied Psychology, 3, 103–109.
Henry, R. A. (1914). Le socialisme et l'art de commander a l'industrie [Socialism and the art of management in
industry]. Paris, France: Gauthier-Villars.
Herriot, P. (1989). Selection as a social process. In M. Smith & I. T. Robertson (Eds.), Advances in selection
and assessment methods (pp. 171–187). Chichester, England: Wiley.
Highhouse, S. (2002). Assessing the candidate as a whole: A historical and critical analysis of individual
psychological assessment for personnel decision-making. Personnel Psychology, 55, 363–396.
Hopkins, P. (1939). Psychological tests in the army and air force of foreign countries. Occupational Psychology,
13, 59–63.
Hopkins, P. (1944). Observations on army and air force selection and classification procedures in Tokyo,
Budapest, and Berlin. Journal of Psychology, 17, 31–37.
Hough L. M., & Oswald, F. L. (2000). Personnel selection: Looking toward the future—remembering the past.
Annual Review of Psychology, 51, 631–664.
Hülsheger, U. R., Maier, G. W., & Stumpp, T. (2007). Validity of general mental ability for the prediction of
job performance and training success in Germany: A meta-analysis. International Journal of Selection
and Assessment, 15, 3–18.
IAAP. (1982). 20th International Congress of Applied Psychology. Book of Abstracts. Edinburgh, Scotland:
July 25–31.
Jurowsjaja, M. A. (1925). Psichologieeskij profil administratora [The psychological profile of managers]. In
Sammelband “Inteligentnyj Trud” [Monograph “Works on Intelligence”]. Moscow, Russia.
Kitson, H. C. (1922). Second international conference of psychotechnics applied to vocational guidance and to
scientific management. Journal of Applied Psychology, 6, 418–424.
Kornhauser, A.W. (1929). Industrial psychology in England, Germany and the United States. Personnel Journal,
8, 421–434.
Kragh, U. (1960). The defense mechanism test: A new method of diagnosis and personnel selection. Journal of
Applied Psychology, 44, 303–309.
Kwiatkowski, R., Duncan, D. C., & Shimmin, S. (2006). What have we forgotten—and why? Journal of
Occupational and Organizational Psychology, 79, 183–201.
Lahy, J. M. (1922). La conférence psychotechnique de Genève [The conference on psychotechnics in Geneva].
Journal de Psychologie Normale et Pathologique, 19, 65–79.
Landy, F. (1992). Hugo Münsterberg: Victim or visionary? Journal of Applied Psychology, 77, 787–802.
Levy-Leboyer, C. (1994). Selection and assessment in Europe. In H. C. Triandis, M. D. Dunnette, & L. M. Hough
(Eds.), Handbook of industrial and organizational psychology (Vol. 4., 2nd ed.). Palo Alto, CA: Consulting
Psychologist Press.
Levy-Leboyer, C., Huteau, M., Louche, C., & Rolland, J. P. (Eds.). (2001). Psychologie appliquée a la gestion
des resources humaines [Applied psychology and human resources management]. Paris, France: Editions
d’Organization.
Lievens, F., Buyse, T., & Sackett, P. R. (2005). The operational validity of a video-based situational judgment
test for medical college admissions: Illustrating the importance of matching predictor and criterion con-
struct domains. Journal of Applied Psychology, 90, 442–452.
Lievens, F., & Harris, M. M. (2003). Research on Internet recruitment and testing: Current status and future
directions. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organiza-
tional psychology (Vol. 18, pp. 131–165). Chichester, England: Wiley.
Lievens, F., & Sackett, P. R. (2006). Video-based vs. Written situational judgment tests: A comparison in terms
of predictive validity. Journal of Applied Psychology, 91, 181–188.
Mallart, J. (1925). Problemas fundamentales de la investigación psicotécnica [Fundamental problems of psy-
chotechnical research]. Madrid, Spain: Instituto de Reeducación Profesional.
Menzerath, P. (1913). Le Ve Congrès de Psychologie Expérimentale (Berlin, 16–19 avril 1912) [The Fifth
Congress of Experimental Psychology (Berlin, 16–19 April 1912)]. L'Année Psychologique, 19,
236–256.
Miles, G. H. (1922). The Berlin conference in applied psychology. Journal of the National Institute of Industrial
Psychology, 1, 190–192.
Mira, E. (1943). Psychiatry in war. New York, NY: Norton.
Miret-Alsina, F. (1958). Aviation psychology in the Sabena. In A. Cassie, S. D. Fokkema, & J. B. Parry
(Eds.), Aviation psychology. Studies on accident liability, proficiency criteria and personnel selection
(pp. 22–35). The Hague, The Netherlands: Mouton.
Münsterberg, H. (1912a). Comment. In F. Schumann (Ed.), Bericht über den V. Kongress für experiment-
elle Psychologie in Berlin von 16. bis 20. April 1912 [Proceedings of the V Congress of Experimental
Psychology in Berlin, from 16–20 April, 1912] (pp. 115–116). Leipzig, Germany: J. A. Barth.
Münsterberg, H. (1912b). Psychologie und Wirtschaftsleben. Ein Beitrag zur angewandten Experimental-
psychologie [Psychology and business. A contribution to applied experimental psychology]. Leipzig,
Germany: J. A. Barth.
Münsterberg, H. (1913). Psychology and industrial efficiency. Boston, MA: Houghton Mifflin.
Münsterberg, H. (1914). Grundzüge der Psychotechnik [Compendium of psychotechnics]. Leipzig, Germany:
J. A. Barth.
Myors, B., Lievens, F., Schollaert, E., van Hoye, G., Cronshaw, S., Mladinic, A., et al. (2008). International
perspectives on the legal environment for selection. Industrial and Organizational Psychology:
Perspectives on Science and Practice, 1, 206–246.
Nestor, J. M. (1933a). Vocational tests on the European railways. I. Human Factor, 7, 11–23.
Nestor, J. M. (1933b). Vocational tests on the European railways. II. Human Factor, 7, 51–58.
Pacaud, S. (1971). Le diagnostic du potential individual: I: Le probléme general Le personnel d’exécution. In
M. Reuchlin (Ed.), Traité de psychologie appliquée (Vol. 4, pp. 5–66) [Treatise on applied psychology].
Paris, France: Presses Universitaires de France.
Palmade, G. (1948). La psychotechnique [Psychotechnics]. Paris, France: Press Universitaires de France.
Parry, J. B. (1947). The selection and classification of R.A.F. air crew. Occupational Psychology, 21,
158–169.
Paulik, K. (2004). The history of the psychology of work and organization in Czech and Slovak industry.
European Psychologist, 9, 170–179.
Perez-Creus, J. (1947). Orientación y selección profesional [Orientation and professional selection]. Madrid,
Spain: La Abeja.
Piorkowski, C. (1915). Beitrage zur psychologischen Methodologie der wissenschaftlichen Berufsberantung
[Psychological methods of scientific career counseling]. In Beihefte zur Zeitschrift für angewande
Psychologie [Supplement of the Journal of Applied Psychology] (Vol. 11). Leipzig, Germany: J. A. Bart.
Prieto, J. M., Blasco, R., & Quintanilla, I. (1991). Recrutement et selection du personnel en Espagne [Recruitment
and personnel selection in Spain]. European Review of Applied Psychology, 42, 47–62.
Riggs, E. (1958). Aviation psychology in the Royal Norwegian Air Force. In J. B. Parry & S. D. Fokkema
(Eds.), Aviation Psychology in western Europe. Amsterdam, The Netherlands: Swets & Zeitlinger.
Robertson, I. T., & Makin, P. J. (1986). Management selection in Britain: A survey and critique. Journal of
Occupational Psychology, 59, 45–57.
Roe, R., & Van den Berg, P. T. (2003). Selection in Europe: Context, development and research agenda.
European Journal of Work and Organizational Psychology, 12, 257–287.
Sachs, H. (1920). Studien zur Eignungsprüfung der Strassenbahnfahrer [Studies on aptitude tests of tram
drivers]. In Zeitschrift für angewande Psychologie [Journal of Applied Psychology] (Vol. 17). Leipzig,
Germany: J. A. Barth.
Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 387–417.
Salgado, J. F. (1997). The Five Factor Model of personality and job performance in the European Community
(EC). Journal of Applied Psychology 82, 1, 30–43.
Salgado, J. F. (2002). The Big Five personality dimensions and counterproductive behaviors. International
Journal of Selection and Assessment, (Deniz Ones, Guest Editor) 10, 117–125.
Salgado, J. F. (2003). Predicting job performance by FFM and non-FFM personality measures. Journal of
Occupational and Organizational Psychology, 76, 323–346.
Salgado, J. F., & Anderson, N. (2003). Validity generalization of GMA tests across the European Community
Countries. European Journal of Work and Organizational Psychology, 12, 1–17.
Salgado, J. F., Anderson, N., Moscoso, S., Bertua, C., De Fruyt, F., & Rolland, J. P. (2003). GMA measures
and the prediction of Job Performance and training success for different occupations in the European
Community. Journal of Applied Psychology, 88, 1068–1081.
Salgado, J. F., & Moscoso, S. (2002). Comprehensive meta-analysis of the construct validity of the employment
interview. European Journal of Work and Organizational Psychology, 11, 299–324.
Scholz, G., & Schuler, H. (1993). Das nomologische Netzwek des Assessment Centers: Eine Metaanlyse
[The nomological network of the assessment center: A meta-analysis]. Zeitschrift für Arbeits- und
Organisationspsychologie, 37, 73–85.
Schuler, H. (1993). Social validity of selection situations: a concept and some empirical results. In H. Schuler,
J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Individual and organizational per-
spectives (pp. 11–26). Mahwah, NJ: Lawrence Erlbaum.
Schuler, H. (2002). Das Einstellungs-interview. Göttingen, Germany: Hogrefe.
Schuler, H., & Funke, U. (1989). The interview as a multi-modal procedure. In R. W. Eder & G. F. Ferris (Eds.),
The employment interview (pp. 183–193). Newbury Park, CA: Sage.
Employee Selection in Europe 941
Seisdedos, N. (1993). Personnel selection, questionnaires, and motivational distortion: an intelligent attitude of
adaptation. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel selection and assessment: Industrial
and organizational perspectives. Hillsdale, NJ: Lawrence Erlbaum.
Shackleton, N. J., & Newell, S. (1991). Management selection: A comparative survey of methods used in top
Britain and French companies. Journal of Occupational Psychology, 64, 23–96.
Shimmin, S. (1989). Selection in a European context. In P. Herriot (Ed.), Assessment and selection in organiza-
tions (pp. 109–118). Chichester, England: Wiley.
Shimmin, S., & Wallis, D. (1994). Fifty years of occupational psychology in Britain. Leicester, England: British
Psychological Society.
SHL. (1989). Validation review. Surrey, England: Saville & Holdsworth.
SHL. (1996). Validation review II. Surrey, England: Saville & Holdsworth.
Simoneit, M. (1938). Leitgedanken über die psychologische Untersuchung des Offizier-Nachwuchs in der
Wehrmacht [Principles of the psychological study of the officer recruit relationship in the army]. Berlin,
Germany: Bernard & Graefe.
Smith, M. (1991). Recruitment and selection in the UK with some data on Norway. European Review of Applied
Psychology, 41, 27–34.
Smith, M., & Abrahansen, M. (1992). Patterns of selection in six countries. The Psychologist, 205–207.
Spector, P. E. (2000). Industrial and organizational psychology. Research and practice (2nd ed.). New York,
NY: Wiley.
Stern, W. (1900). Über Psychologie der individuellen Differentielle [On the Psychology of Individual
Differences]. Leipzig, Germany: J.A. Barth.
Stern, W. (1903). Angewandte Psychologie. Beiträge zur Psychologie der Aussage. Mit besonderer
Berücksichtigung von Problemen der Rechtspflege, Pädagogik, Psychiatrie und Geschichtsforschung,
1, 4–45.
Stern, W. (1911). Die differentielle Psychologie in ihren methodischen Grundlagen [Differential psychology
and its methodological foundations]. Leipzig, Germany: J. A. Barth
Stern, W. (1917). Über eine psychologische Eignungsprüfung für Strassenbahnfahrerinnen [On the psychologi-
cal aptitude test of female tramway drivers]. Zeitschrift für angewande Psychologie (Vol. 31). Leipzig,
Germany: J. A. Bart
Toulouse, E. (1927). Discours [Speech]. In H. Grundlach (Ed.), (1998), Applied psychology: The first-thirteenth
Congress Proceedings of the International Association of Applied Psychology (Vol. 4, pp. 21–27).
London, England: Routledge.
Trankell, A. (1959). The psychologist as an instrument of prediction. Journal of Applied Psychology, 43,
170–175.
Van Gineken, M. (1918). De rechte man op de rechte plaats [The right man in the right place]. Amsterdam, The
Netherlands: R. K. Boek-Centrale.
Vernon, P. E. (1946). Statistical methods in the selection of navy and army personnel. Journal of the Royal
Statistical Society, 8, 139–153.
Vernon, P. E. (1947). Research on personnel selection in the Royal Navy and the British army. American
Psychologist, 2, 35–51.
Vernon, P. E., & Parry, J. B. (1949). Personnel selection in the British forces. London, England: University of
London Press.
Vinchur, A. J. (2007). A history of psychology applied to employee selection. In L. L. Koppes (Ed.), Historical
perspectives in industrial and organizational psychology (pp. 193–218). Mahwah, NJ: Lawrence
Erlbaum.
Viteles, M. (1926). Psychology in industry. Psychological Bulletin, 23, 631–681.
Viteles, M. (1928). Psychology in industry. Psychological Bulletin, 25, 309–340.
Viteles, M. (1930). Psychology in industry. Psychological Bulletin, 27, 567–635.
Viteles, M. (1932). Industrial psychology. New York, NY: Norton.
Viteles, M. S. (1938). Industrial psychology in Russia. Occupational Psychology, 12, 85–103.
Walther, L. (1929). Technopsychology in a Swiss industry. Personnel Journal, 8, 1–18.
Warr, P. (2007). Some historical developments in I-O psychology outside the United States. In L. L. Koppes
(Ed.), Historical perspectives in industrial and organizational psychology (pp. 81–107). Mahwah, NJ:
Lawrence Erlbaum.
Williams, G. O. (1940). Flying aptitude tests (Report Nº 152). London, England: Flying Personnel Research
Committee, Royal Air Force, Ministry of Defense.
This page intentionally left blank
44 Employee Selection: Musings About Its Past, Present, and Future
Robert M. Guion
A friend went on an overseas tour with a group of people he had not known before. One man in the
group made himself a tad obnoxious by starting nearly every statement with, “Ya know what I don’t
like?” In an edited book like this, many people (including perhaps the editors) anticipate the content
of the final chapter to be a listing of things liked or, with potentially greater entertainment value,
things not liked in the chapters preceding it. This is not the role I’m taking.
These chapters offer a lot to think about. Some of them, at some point, have led me to put
a chapter down, stare off into space, and consider some implications that may never have been
intended. Some of the resulting notions and questions may be trivially semantic; others seem (to
me) important to real-life employment practice and its underlying logic. This chapter has become
a collection of musings. Some of them were stimulated by the content, even by offhand paren-
thetic remarks, in the various chapters. Some of them were things I started musing about, without
a lot of awareness that I was doing so, as I went about working on a revision of my own book on
assessment for personnel decisions—things that began to come into clearer (or at least less foggy)
focus: changes (or needs for changes) in the last few years in the thinking of some people in the
field of personnel selection. I hope they stimulate other musings, maybe even some answers to my
questions and uncertainties.
PSYCHOMETRIC MUSINGS
WHAT DO WE MEAN BY “MEASUREMENT” OR “PSYCHOMETRICS”?
In most disciplines using psychological measurement, and most of the accompanying college
coursework, the term psychometrics refers to the theory of test construction and to the statis-
tical evaluation of the result and its use. I suspect that more attention is given to the statistics
than to the theory, and I think the term should include a broader array of assessment procedures.
Maybe it should also apply to the act of measurement. As a matter of fact, in some organiza-
tions only the people who administer tests, especially individually administered tests, are called
psychometricians.
This book seems unique in that several of its chapters describe or comment on actual selec-
tion procedures, most notably Chapter 18, this volume. However, even these chapters seem to
describe overall processes of selection, of which test administration is surely a part, rather than
focusing on principles of test administration. Any exceptions seem to concentrate on electronic
means of administration, not on more general principles. Some principles are clearly stated or
at least inescapably implied; they include well-known things like standardization of procedures,
instructions, scoring, and normative interpretation. I’ve seen much that is new about theory and
statistics in these chapters, but I have not picked up any new principles of testing as such, even
where technological changes have made some of the old ones less certain than they used to be.
944 Handbook of Employee Selection
adaptive nor maladaptive traits—merely “preferences or habits.” Selection people should recognize
that work-related habits may be strong influences on the way—and on how well—work gets done.
Constructs with similar influences are precisely what are needed for effective selection. Maybe the
advice to physicians (heal thyself) would apply to us; we may be too strongly influenced by our own
acquired habit of meeting requirements for job-relatedness (as defined by validity coefficients) to
recognize that good selection can require “preferences or habits” more specifically related to work
in general than to specific jobs. Or is that a heresy?
This musing began while reading Lievens and Chan (Chapter 16) and revisiting the common sense
but scientifically challenged material on various adjectives preceding the word “intelligence,” any
of them likely to be composed of multiple constructs. It led me to wonder what would happen if
we were to choose people on the basis of behavioral samples instead of signs—if we were finally to
take seriously the advice of Wernimont and Campbell (1968). I see several possible consequences,
some of them desirable. We would have to identify in somewhat general terms the behaviors to
be sampled and establish a sensible taxonomy covering the full array. We would need to develop
another new array of assessment tools that sample, at least abstractly, those behaviors of interest.
Some contemporary competency measures might fall into that array, but I suspect that eight catego-
ries of them (Bartram, 2005) would not cover it all. Another effect would be a change in the concept
of job-relatedness from construct matching to content sampling. Another personally undesirable
consequence is that I would have to write a whole new book abandoning the central theme in Guion
(1998) about forming predictive hypotheses. Sections on validity in the Standards and Principles
would have to be entirely new, not just new steps along the evolutionary path. With these changes,
would selection decisions be any better?
scores reflect a specified construct rather than a different one, and (d) an intended interpretation or
inference from scores that can be supported by appropriate evidence and theory. This end point,
the “new definition,” is often considered a “unitarian” view, but is it? It’s not very unitary. Many
different inferences can be drawn from the same scores—as many as the different interpretations
intended by different test users. Validities of the inferences surely differ in the kinds and levels of
evidence and theory available to support them—hence, many validities. These many validities can-
not be enthroned on a single restricting adjective, nor can the many sources of validity evidence be
grouped with just a few adjectives. The number of potentially intended interpretations or inferences
from scores, and the number of validities fitting them, may be less than infinite, but it is surely more
than a few. I hope the trinitarian metaphor and the damage it has done will soon be interred, but
I don’t hold much hope that the unitarian metaphor will be any more viable.
In Chapter 21 by Borman, Bryant, and Dorio, it was properly noted that BARS is a system, “even
a philosophy.” Pat Smith and I developed a BARS system for our local police department. It was
designed as a system; all command officers took part in its development and continued use. The
intended purposes were, first, use by sergeants in performance observation and evaluation; second,
training and development at the patrol officer and supervisory levels; and third, to help supervising
sergeants provide performance feedback; it got used for other personnel activities as well. It contin-
ued to be used, properly and without professional monitoring, for 10 years—until the presence of
computers in every patrol car required some new anchoring behaviors and subsequent rescaling. It
was never intended to serve as a criterion for validating selection procedures. With its developmen-
tal and administrative baggage, it would not have served that function very well.
1 Oops! I’d better muse about a three-dimensional matrix; a criterion such as quality of work would not be predicted by the
same variables for welders and bookkeepers. Or so I think. Maybe later, such as 10 years later.
important behaviors. (This leads me to offer a new but widely applicable aphorism: “Decide what
you want to do before you start doing it.”) This is an obvious requirement in developing simulations
or other work samples, and it is, I think, just as important for describing the behavioral content by
which latent traits are reflected. Content might be described in terms of intended uses or methods of
measurement. In employment testing, the description begins in work or job analysis as knowledge,
skill, ability, or other characteristics (KSAOs), which might be described as important behaviors
or perhaps constructs defined with varying precision.2 The point, for me, of Kehoe and Murphy’s
Chapter 5 is that someone evaluating a test needs to identify a clear target it is supposed to hit. The
clarity of the target depends on how well it is described or defined, and the evaluation of the test
depends on how well the target is sampled or replicated. If targets are job behaviors, sampling can
probably be straightforward enough that agreement on its adequacy or appropriateness is an easy
call. If the targets are constructs, sampling may be harder to do, but it is still necessary. When the
attributes being measured are said to be compound constructs, the problem usually gets more dif-
ficult because the literature is sparser.
The special contribution of Chapter 5 is its emphasis on justifying decisions made in a particu-
lar situation (a local situation) by generalizing from previous research. I think this means research
done anywhere, locally or somewhere else. Kehoe and Murphy refer to these justifications as local
generalizations—needed to justify local decisions, not about candidates, but about procedures to be
used in making candidate choices. Local generalization, insofar as it focuses on assessment proce-
dures, seems to be based on some empirical evidence and also on judgment.
Judgment matters, too. Kehoe and Murphy are joined by many other chapter authors in referring
often to the importance of judgment in employment testing. Such testing is typically dependent on
the judgments of job experts in defining the work to be done and the personal attributes required in
doing it. What is less typical, and may signal a vital change in the way validity is viewed, is tying
that dependency to the concept of local generalization. I wonder if we can go so far as to suggest
that, no matter what sort of evidence is used to support claims of validity, it must be interpreted in
local terms where the assessment is done and the employment decisions are made. If that is a rea-
sonable interpretation, and it seems reasonable to me, it also raises new questions about the kinds of
trait-work generalizations we should seek.
I see a practical implication in tying the new definition to local generalization. The threat, at least
the possibility, of litigation is always with us. Cronbach’s (1988) notion of a validity “argument”
(p. 3) really anticipated the new definition by calling for the organization and integration of all rel-
evant evidence and theory to try to convince colleagues, clients, and courts that the use of a specific
test for a specific purpose in a specific setting is a good thing (i.e., valid).
2 It is fascinating—and instructive—to note that most of these chapters referred to the primary importance of work or job
analysis for all sorts of purposes.
replications; validity has been more concerned about contamination by consistent or systematic
errors. That’s a well-established way of thinking, but it’s not very useful. Whether error is consistent
or inconsistent depends too much on the procedures used in estimating it. Generalizability analysis
examines sources of variance without categorizing them.
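To make that point concrete (a sketch of my own, not an example from the chapter): in the simplest persons-by-items design, a generalizability analysis just estimates variance components from the two-way ANOVA mean squares, attaching no prior label of "systematic" or "random" error to any of them. The function name and data below are hypothetical.

```python
import numpy as np

def variance_components(scores):
    """One-facet G-study: estimate person, item, and residual
    variance components from a persons x items score matrix."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    ss_p = n_i * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_i = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    # expected-mean-squares solutions; negative estimates truncated at zero
    return {
        "person": max((ms_p - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }

# purely additive toy data: person effects plus item effects, no residual
scores = np.array([[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]])
vc = variance_components(scores)
```

With purely additive data the residual component comes out zero; real data would spread the leftover variance across whatever facets were actually sampled, which is exactly the agnosticism described above.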
Including either generalizability theory or classical reliability as part of the new validity has
some elements of “déjà vu all over again.” A slightly older validity concept also held that validity is
a property of the inferences to be made from the scores, not a property of a test. Scores, of course,
are based on responses of examinees to items in the test. Thorndike (1949) listed four categories of
individual differences in responses that potentially lead to inconsistency and measurement error;
the nature of the measuring instrument was not his concern. Stanley (1971) moved Thorndike’s con-
tribution further by adding measurement error attributable to poor test administration or to scoring
errors. Zickar, Cortina, and Carter (Chapter 19) have pared the list down to three, but that was not
just condensing it. They included errors attributable to the items used, those due to raters or scorers
(and probably those administering the tests), and those due to chance factors related to timing—that
is, to events unlikely to have happened if testing had been done at a different time. They elaborated
these conditions, but the first needs no elaboration to make the point: at least to some limited extent,
unreliability is a function of the test itself. Musing about the implications of that one sentence could
fill a chapter.
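One small illustration of that last sentence (my sketch, not the authors'): coefficient alpha is computed entirely from the item-score variances, so the particular items chosen for the test directly determine one component of its unreliability. The data here are made up.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha from a persons x items score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# hypothetical data: two items in perfect agreement across three examinees
parallel = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
alpha = coefficient_alpha(parallel)  # 1.0: no item-attributable error
```

Swap in items that agree less well and alpha drops, with nothing else about the examinees or the testing occasion changed.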
test theory from other measurement theories. His answer was an uncertain, “Er, do you mean scaling?”
(L. Tucker, personal communications, late 1960s/early 1970s). No, I did not, but I couldn’t then articulate
my meaning very well. I should have answered that scaling, especially the scaling of attitudes and values
by methods created by Thurstone and Likert, falls within the boundaries of classical test theory with its
true scores and error scores, and I wanted advice on finding alternatives beyond those boundaries.
I still do. Alternative measurement theories seem to have been around. Most of them involve
mathematical modeling, highly endorsed in many of these chapters, including information theory
and its measures of structure and uncertainty, signal detection theory, and models of sequential pro-
cesses (Coombs, Dawes, & Tversky, 1970). Although signal detection theory expanded psychophys-
ics, and although structure and uncertainty bear some resemblance to consistent and inconsistent
variance, these approaches to measuring psychological variables regrettably seem outside of the
typical psychometric boundary. So are measurement and classification procedures now used in
neural and brain physiology research.
Do these various approaches to quantification differ so much from each other that their underlying
assumptions and rationale constitute different measurement theories? If we try to fit all of them under
the classical test theory umbrella (as this handbook may be doing), are we routinely overlooking pos-
sibilities for new or expanded approaches? Is classical test theory just one psychometric theory among
many, or is it essentially complete, with testing and scaling methods kept to itself?
I didn’t get answers to questions like these from Woolf’s (1961) volume, and I didn’t find them in
these chapters or in the larger literature. The quantification approaches chronicled in these chapters
and in that literature always seem to be marked by some consistencies that match or recall various
definitions of true scores, and they are always plagued by inconsistencies that seem close to the
concept of error scores. Maybe classical test theory, as expanded, is the only measurement theory
that is alive, well, and growing. The concept of validity has evolved, and so has the whole of classi-
cal test theory. Gulliksen (1950) took it further than Spearman, and it has continued to change since
then. Yet I still wonder: Do those of us in psychometrics have something to learn from the way other
psychologists measure their variables?
From general observation, I think the closest many company psychologists come to research is moni-
toring the work of outside consultants hired to do it. That research seems (I have no systematically
acquired data) to be limited to topics in human resources—no psychophysics, and I don’t recall even
the currently popular neurosciences having much impact on either practice or research in employee
selection or other aspects of the human resource domain, let alone organizational problems in marketing
or product development or other areas outside of that domain. This restriction of scientific interest, of
course, is the pattern in the larger discipline of psychological science, in the larger galaxy of “the sci-
ences,” and in the scholarly disciplines in general. Specializing starts in graduate school if not earlier,
and collegiality seems limited to people with similar specialties. It isn’t only with meteorologists that
psychologists talk about the weather; that’s all they have in common with scholars in other specialties
as well. Although the phenomenon of specialization is everywhere, it seems to be a rare company
psychologist who combines research and its application even within a narrowly prescribed field.
Academic researchers in “applied” psychology seem also to avoid the practitioner role. Some of
them in earlier times would take on consulting or even summer adjunct jobs to stay familiar with issues,
problems, and outlooks in actual employing organizations. Their goals were to do “practical” research.
Even avowedly applied journals are now filled, sometimes exclusively, with articles stemming from or
contributing to theories of interest to the researcher if not to organizational decision-makers.
multidimensional criteria are becoming more widely accepted, and I’m happy to note that several
chapters at least show some willingness to consider nonlinearity and profile matching.
would happen if the conference method of management were resurrected. The method, used in the
1940s and 1950s, brought together people with different views about something—such as different
ideas on how to use test scores in making selection decisions—with a conference leader trained and
experienced in moving people toward a consensus. Retrieving this old notion might not only reduce
the tensions between selection researchers and decision-makers, but also those among decision-
makers. It might lead over the long term to less confrontational conversations on other issues. It
would be more productive than muttering about getting one’s own way without considering the qual-
ity of other points of view. Remember the Carl Rogers prescription for conflict reduction: Express the
other person’s ideas, in your own words, to his or her satisfaction (Rogers & Roethlisberger, 1952).
Suppose some people supported top-down decisions, and others objected that such a procedure is
too mechanical and leaves no room for discretion. An alternative might be to use relatively narrow
score bands for a “top-band down” policy. Perhaps bands in an expectancy chart could be used, as
Tippins, Papinchock, and Solberg (Chapter 17) said, so that “a person in one range of scores may be
hired without any other education or experience credentials while another person with a score in a
lower range may be hired only if he or she has certain kinds of experience.” (That seems to me to be
a good idea.) Or should the scores simply be given to decision-makers as “one source of job-relevant
information that they use according to their professional judgment” (an alternative often suggested,
although I believe the internal consultant should offer more guidance than that). Clearly, different
ways to use test scores require different levels of judgment from decision-makers, from no room at
all for judgment (as in strict top-down selection) to a minor role for assessments.
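A hedged sketch of the "top-band down" idea (my construction; the chapter gives no formulas): one common banding rule sets the band width from the standard error of the difference between two observed scores, so scores inside the band are treated as statistically indistinguishable. The numbers below are invented for illustration.

```python
import math

def band_width(sd, reliability, z=1.96):
    """Width of a score band based on the standard error of the
    difference (SED) between two observed scores."""
    sem = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    return z * sem * math.sqrt(2.0)          # SED = SEM * sqrt(2)

def top_band(scores, sd, reliability):
    """Candidates whose scores fall within one band of the top score."""
    cutoff = max(scores) - band_width(sd, reliability)
    return [s for s in scores if s >= cutoff]

candidates = [90, 80, 77, 60]  # hypothetical test scores
band = top_band(candidates, sd=10.0, reliability=0.80)
```

Within the band, a decision-maker's judgment about experience or credentials can break ties, which is roughly the middle ground between strict top-down selection and unguided discretion that the paragraph describes.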
the proprietary assessment procedures of the large firms? Would their technical merits (or demerits)
be known and understood by client organizations? Would professional organizations or regulatory
agencies insist (successfully) that the proprietary tests meet the same standards of excellence that
researchers (in or out of large research organizations or publishing houses) are expected to meet?
Would the very size of these consulting organizations make possible and facilitate the large-scale
research that could solve many currently intractable psychometric questions? If so, would they fol-
low professional scholarly traditions and share the results or hold them close and proprietary for
the sake of competitive advantage? These questions flowed freely as I read parts of Chapter 17 by
Tippins et al.
WILL OUR CONSTRUCTS BE BROADER AND MORE GENERAL OR MORE NARROWLY FOCUSED?
The trend, clearly reflected in most of these chapters, is toward broader, less fractionated constructs.
Roughly half a century ago, we sought ever-smaller factors of human attributes; now we seek com-
petencies that are as atheoretical as factor analyzing any matrix of correlations we can put together.
They are, nevertheless, large hunks of what people must be able to do if they are to do their work
well. For example, for sales clerks a necessary competence may require getting things done fast for
customers, although speed and customer satisfaction with the service may not be compatible. Fifty
years ago we wanted factors to be small and orthogonal; now we move toward such inclusive com-
petencies that they overlap conceptually and are therefore correlated.
Even within factor-speak, hierarchical factor structures deny orthogonality. Carroll’s (1993)
comprehensive three-stratum theory of cognitive abilities is invoked in enough of these chapters to
show that leaders in employment testing are really taking it seriously. At the bottom are narrow, cor-
related abilities. The intermediate level consists of broader, more general abilities broad enough to
encompass some of the abilities in the lowest stratum but not quite broad enough to encompass them
all. These broader constructs are correlated well enough to allow a single general level of cognitive
ability to emerge. Carroll didn’t specify any particular number of factors at the low or intermediate
levels, but he extended the variety of intermediate factor content far beyond that of earlier hierarchi-
cal models, and his highest level of generality is far broader than the older concept of g.
Ones, Dilchert, Viswesvaran, and Salgado (Chapter 12) are committed to the importance of
general mental ability (a construct probably somewhat less inclusive than Carroll’s third stratum),
but they acknowledged that people at the same level of general mental ability differ in more specific
abilities “due to differential ‘investment’ of their cognitive capacity (guided by other personal char-
acteristics as well as idiosyncratic developmental and educational experiences)” (my italics). I would
add work experience to the idiosyncratic experiences. Among them, this thought may return us to a
narrower focus on kinds or categories, as well as degrees, of individual differences.
Without reliance on factors, it appears that some traits that seem undiluted (homogeneous in measurement) are being combined more often to form compound traits. This was explicit in the chapter
by Hough and Dilchert on personality (Chapter 14), in which such examples as integrity or customer
service orientation were cited. It was at least implicit in discussions of work-related competencies or
cognitive abilities such as creativity. Psychomotor abilities may combine some basic physical traits
(e.g., static strength, dynamic strength, and stamina) in a compound trait simply headed strength.
Compound traits, defined as composites of homogeneous traits that may not be noticeably corre-
lated, may be better predictors than any of their component parts. (But stop to muse: Isn’t that what
we’ve done all along with multiple regression but without the assumption of optimal weights?)
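That parenthetical muse can be checked directly (a toy simulation of my own, with arbitrary numbers): a unit-weighted composite of homogeneous components typically predicts a criterion nearly as well as the regression-weighted composite, whose in-sample correlation is optimal by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.standard_normal(n)                        # component trait 1
x2 = rng.standard_normal(n)                        # component trait 2
y = 0.4 * x1 + 0.3 * x2 + rng.standard_normal(n)   # criterion

unit = x1 + x2                                     # unit-weighted compound trait
X = np.column_stack([x1, x2, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # optimal least-squares weights
optimal = X @ beta

r_unit = np.corrcoef(unit, y)[0, 1]
r_optimal = np.corrcoef(optimal, y)[0, 1]
```

In-sample, the regression weights cannot do worse than unit weights, but the gap is usually small when the components all point the same way, which is the long-standing argument for simple composites.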
In short, although concepts of predictor traits have become broader, they still include the foun-
dation of narrower, more homogeneous parts. I think Pat Smith had it right. Compared with the
past, the trend leads to a wider range of choices between excessively broad and excessively narrow
options, and we can choose as the problem at hand directs us. That is good. The availability of many
options frees us from the constraints imposed by having just a few, even if it befuddles us.
We continue to learn about electronic means of assessment (Reynolds & Dickter, Chapter 8), and if new electronic tools keep
begetting new ones, we'll never run out of new things to learn.
Despite all this learning, we do not seem to have learned much about integrating effective assessment with effective decision-making. Too many decision-makers make employment decisions only
now and then, and they find the assessment information unfamiliar and unfathomable. Too many
assessment experts accept the limitations of the occasional makers of selection decisions and try to
make their lives easier (a purposefully ambiguous reference: Whose lives?). Once again, what would
happen if testers made a concerted effort to be sure the decision makers genuinely understood the
implications of score differences? The answer, I suspect, is that the decisions themselves would be more
valid because they would take fuller account of the implications of effective assessment and the
scores it produces.
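One hedged illustration of what "understanding the implications of score differences" could look like: under a bivariate-normal assumption, a validity coefficient translates directly into expectancies a decision maker can read. The function name, the validity of .50, and the "above average" bar below are my own hypothetical choices, not the chapter's.

```python
# Hypothetical expectancy sketch: under a bivariate-normal model with both
# variables standardized, criterion | predictor z ~ Normal(r*z, sqrt(1 - r^2)).
from statistics import NormalDist

def p_success(z_score: float, validity: float, bar: float = 0.0) -> float:
    """P(criterion > bar | predictor = z_score), standardized scales."""
    cond = NormalDist(mu=validity * z_score,
                      sigma=(1 - validity ** 2) ** 0.5)
    return 1 - cond.cdf(bar)

# What a one-SD score difference means at validity .50, against an
# "above average performance" bar:
for z in (-1.0, 0.0, 1.0):
    print(f"z = {z:+.1f}: P(above-average performance) = {p_success(z, 0.50):.2f}")
```

A table like this (roughly .28, .50, and .72 for the three score levels) is one way testers could make score differences genuinely interpretable to occasional decision makers.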
A FINAL MUSING
I had not realized before trying to pin them down how many questions I have about employee selec-
tion that I can’t answer. I’ve offered tentative answers to many of them. Some of these answers may
be based on accumulated experience and reading, but some of them may be based on my own cog-
nitive habits. Many of them I have no answers for at all and, indeed, many of the questions may be
unanswerable. That is for research and practice over the next decade or so to find out.
REFERENCES
Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1–15.
Bartram, D. (2000). Internet recruitment and selection: Kissing frogs to find princes. International Journal of
Selection and Assessment, 8, 261–274.
Boring, E. G. (1961). The beginning and growth of measurement in psychology. In H. Woolf (Ed.), Quantification:
A history of the meaning of measurement in the natural and social sciences (pp. 108–127). Indianapolis,
IN: Bobbs-Merrill.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge
University Press.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction.
Englewood Cliffs, NJ: Prentice-Hall.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity
(pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum Associates.
Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385–398.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.
Helmreich, R. L., Sawin, L. L., & Carsrud, A. L. (1986). The honeymoon effect in job performance: Temporal
increases in the predictive power of achievement motivation. Journal of Applied Psychology, 71,
185–188.
Jenkins, J. G. (1946). Validity for what? Journal of Consulting Psychology, 10, 93–98.
Murphy, K. R. (1989). Is the relationship between cognitive ability and job performance stable over time?
Human Performance, 2, 183–200.
Rogers, C. R., & Roethlisberger, F. J. (1952). Barriers and gateways to communication. Harvard Business
Review, 30, 46–52.
Smith, P. C. (1985). Global measures: Do we need them? Division 14 Scientific Contribution Award address presented at the meeting of the American Psychological Association, Los Angeles, CA.
Spearman, C. (1927). The abilities of man. New York, NY: Macmillan.
Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 356–442). Washington, DC: American Council on Education.
Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal
of Applied Psychology, 88, 500–517.
Thorndike, R. L. (1949). Personnel selection: Test and measurement techniques. New York, NY: Wiley.
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52,
372–376.
Woolf, H. (Ed.). (1961). Quantification: A history of the meaning of measurement in the natural and social
sciences. Indianapolis, IN: Bobbs-Merrill.