
MODEL SELECTION METHODS FOR

ITEM RESPONSE MODELS

A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL OF EDUCATION
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF


DOCTOR OF PHILOSOPHY

BENJAMIN A. STENHAUG
JUNE, 2021
© 2021 by Ben Alan Stenhaug. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-


Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/yt267zd9190

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ben Domingue, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael Frank

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Daniel Bolt

Approved for the Stanford University Committee on Graduate Studies.

Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.


ACKNOWLEDGMENTS

First, I’d like to thank my friends and family, near and far, for their support.
Without exaggeration, this dissertation was made possible by my fiancé, Emma,
who offered unconditional support, patience, and zest throughout the process. I’m
grateful to Jason Fullen, Seth Saeugling, Stacy Day, Nate Uzlik, Sean Legler, and
others for their hundreds of phone calls that joyfully interrupted my work. Thank
you to my roommates and friends, Casey Ulrich and Charlotte Sivanich, who pro-
vided a happy home, nightly Euchre games, and the background sound of remote
teaching. And, thank you to my family—Kalin, Krista, and Bruce—for everything.
I am deeply grateful for the mentorship that I’ve received. I am proud to be
Ben Domingue's first doctoral advisee. Ben's insight, generosity, and commitment
to creating an intellectual community provided the backbone of my experience. Ben
made learning and working a joy; some of my favorite memories from graduate
school are whiteboarding together. I am grateful to Mike Frank, who opened my
eyes to the value of measurement beyond education. Mike’s mentorship fundamen-
tally changed how I think about science, and I’m in awe of his simultaneous com-
mitment to high standards for research and kindness towards researchers. I thank
Dan Bolt for his insight, support, and generosity in joining my committee from the
great state of Wisconsin. I thank sean reardon for his engagement at each milestone
of the graduate program, including teaching methodology in a way that focuses on
building deep intuition. Thank you to Margot Gerritsen for enthusiastically sup-
porting my various endeavors and for chairing my dissertation committee. Thank
you to Nilam Ram, who significantly contributed to my growth and work given our
short time collaborating. Thank you to Kate McKinney and Nadia Ahmed, who
brought compassion and clarity to this process.

Many others—more than I can possibly thank—have contributed to my expe-


rience and development during my time at Stanford. Working on Stanford’s Data
Science for Social Good program with Mike Sklar, Emily Flynn, Kevin Koy, Chris
Mentzel, Suzanne Tamang, Mike Baiocchi, Balasubramanian Narasimhan, Chiara
Sabatti, John Chambers, Emmanuel Candès, and others was truly a highlight of
my time at Stanford. Thank you to the friends and colleagues that shared the ex-
perience of being a graduate student. My interactions with Sam Trejo, Emily Mor-
ton, Marissa Thompson, Lief Esbenshade, David Lang, Mike Sklar, Klint Kanopka,
Daniel Angell, Sita Syal, and many others brought great joy. I’d be remiss without
noting that Mike and Klint's thought partnership also significantly contributed to
my research.
I am thankful for the various financial support that I received. Thank you
to Stanford's Data Science Institute, the Institute of Education Sciences (Grant
R305B140009), the Karr Family Graduate Fellowship, and the Spencer Foundation
(Grant 201700082) for supporting my time at Stanford financially. This support
gave me the privilege and freedom to research and ultimately write this dissertation
on topics that I found personally interesting and important. Thank you to Luis
Garza and Kinedu, Inc. for providing data that made part of this dissertation pos-
sible. I also wish to thank Hadley Wickham, Phil Chalmers, and all other contribu-
tors to open source software that I used for data analysis and computation.

CONTENTS

Page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1 TREADING CAREFULLY: AGNOSTIC IDENTIFICATION AS THE
FIRST STEP OF DIF DETECTION 5

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Descriptive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Agnostic Identification Methods . . . . . . . . . . . . . . . . . . . . . . . . 19
Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2 PREDICTIVE FIT METRICS FOR ITEM RESPONSE MODELS 48


Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Item Response Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Out-of-sample for Item Response Data . . . . . . . . . . . . . . . . . . . . 55
Predictive Fit Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Predictive Fit in Practice via Cross-validation . . . . . . . . . . . . . . . . 70
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3 THE STRUCTURE OF DEVELOPMENTAL VARIATION
IN EARLY CHILDHOOD 82

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Study 1: The Structure of Developmental Variation Across Individuals . . 90
Study 2: The Dimensionality of Within-Child Variability Increases Across
Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Supplemental Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

LIST OF TABLES

Table Page
1.1 Three best practices for DIF detection . . . . . . . . . . . . . . . . . 12
1.2 The 24 items in the verbal aggression data come from crossing four
situations, three actions, and two types. For example, the first row
corresponds to the item that asks whether an individual would want
to curse if a bus didn’t stop for them. . . . . . . . . . . . . . . . . . . 13
1.3 Proportion of affirmative responses by gender in the verbal aggres-
sion data. Some of the items have a greater proportion of affirmative
responses by males, and others have a greater proportion by females.
Proportions are an intuitive but imperfect way of making across-
item group difference comparisons. . . . . . . . . . . . . . . . . . . . 14
1.4 Logits of affirmative responses by gender for the train situation for
the verbal aggression data. Logit differences are an improvement
over proportions for making across-item group difference compar-
isons, but they are still imperfect because there is no item response
model underlying the estimation. . . . . . . . . . . . . . . . . . . . . 15
2.1 Simulation study 1 results. Counts are of the winning model ac-
cording to the theoretical predictive fit metrics, ELPL-MR and
ELPL-MP, and the model selected by BIC, AIC, and LRT. With the
1PL and 2PL DGM, the predictive fit metrics find that the model
with the same parameterization is the prediction-maximizing model
and the model selection methods usually select this model. With
the 3PL DGM, ELPL-MR tends to find that the 2PL model offers
the best predictive fit, while ELPL-MP tends to find that the 3PL
model offers the best predictive fit. LRT selects models consistent
with ELPL-MP, while AIC and BIC select models more consistent
with ELPL-MR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2 Fit of seven models to a subset of the 2015 PISA data according to
six metrics. Models are ordered by number of parameters. Consis-
tent with results from the simulation studies, metrics based on the
missing responses prediction task prefer models with fewer parame-
ters (i.e., less flexible). LRT is the p-value for the model compared
to the model in the previous row, which is why the first value is NA.
Each of these comparisons yielded a p-value <10^-10, thus the most
flexible model was selected. * indicates the winning model according
to the metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.1 Model performance as measured by out-of-sample accuracy. Higher-


dimensional models perform better. . . . . . . . . . . . . . . . . . . . 91

3.2 Descriptive information for the SWYC data. The SWYC contains
many versions which correspond to the age of the child. Each ver-
sion has exactly 10 milestones, which we mapped to Kinedu’s four
milestone categories. Our data contain varying numbers of children. . 117

3.3 Model performance as measured by out-of-sample accuracy for the


SWYC data. The 3F model performs best. . . . . . . . . . . . . . . . 117


LIST OF FIGURES

Figure Page

1.1 GLIMMER for the verbal aggression data. The total performance
difference (from either ability differences or DIF), d˜j , for each item
is shown. Distributions—as opposed to point estimates—are shown
to help the analyst reason about uncertainty. Distributions are cal-
culated by drawing 10,000 imputations from the item parameter
covariance matrix. There is no consistent performance difference
across items, indicating that the Fundamental DIF Identification
Problem is difficult for this data. . . . . . . . . . . . . . . . . . . . . 18

1.2 AOAA results depicted in a GLIMMER . . . . . . . . . . . . . . . . 21

1.3 AOAA-OAT results for the verbal aggression data. On the left, the
GLIMMER is identified by setting the group means equal. Green
represents items that AOAA-OAT did not find to contain DIF (i.e.,
anchor items). On the right, the final AOAA model is identified by
fixing the anchor items equivalent across groups. Distributions are
shown to give a sense of variability and are estimated via 10,000
imputations from the item parameter covariance matrix. . . . . . . . 24

1.4 The search path of MAXGI for the verbal aggression data. µfemale is
fixed to 0 and the goal is to identify the value of µmale that maxi-
mizes the Gini coefficient. Maximizing the Gini coefficient leads to a
small minority of items containing DIF. . . . . . . . . . . . . . . . . . . 25

1.5 The search path of MINBC for the verbal aggression data. µfemale
is fixed to 0 and the goal is to identify the value of µmale that mini-
mizes the total area between item characteristic curves. As a result,
the total amount of DIF on the test is minimized and as much
performance difference as possible is explained by group ability
differences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.6 Results by DIF detection method. The male group mean verbal
aggression, µ̂male , according to each method is graphed as a vertical
line, which are superimposed over the performance difference, d˜j , for
each item. The scale is set by fixing µ̂female = 0. . . . . . . . . . . . . 29

1.7 The relationship between target ability and probability of correct


response for a 12-item test where the last 6 items contain DIF. Nui-
sance ability is fixed to the group mean. The reference group has
the same item characteristic curve for each item; the focal group has
lower probabilities of correct responses the more the item requires
the nuisance ability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

1.8 GLIMMER for one replication using the same item parameters that
generated Figure 1.7. The six DIF-free items (items 1-6) show a con-
stant performance difference. As expected, the other six items (items
7-12) show an increasingly large performance difference. . . . . . . . . 34

1.9 Performance rates across 100 replications for each AGI method and
number of DIF items. Top row: AOAA-OAT nearly always chooses
all of the non-DIF items as anchors (the anchor hit rate) while
AOAA-AS and AOAA do much better for fewer DIF items. Bottom
row: All of the methods perform similarly as far as avoiding includ-
ing items with DIF in the anchor set (DIF avoidance rate). DIF
avoidance rates are slightly lower for the two DIF item condition
because the item with 20◦ of DIF was frequently incorrectly included
in the anchor set (and it was one of only two items with DIF in this
condition). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.10 Achievement gap residual distributions across 100 replications for
each AGI method and number of DIF items. . . . . . . . . . . . . . . 37

2.1 Understanding the two out-of-sample item response matrices, Ỹ^MR
and Ỹ^MP. With missing responses, the unit of observation is a sin-
gle item response and the person's other responses can be used to
estimate ability. With missing persons, the unit of observation is the
person’s response vector and there are no responses with which to
estimate ability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.2 Simulation study 2 results. Each point corresponds to the prediction-


maximizing model according to the predictive fit metrics, ELPL-MR
and ELPL-MP, or the model selected by BIC, AIC, or LRT for one
of 2000 replications. The 3PL model was most likely to offer the
best fit and be selected with more persons and at lower mean ability
(guessing is more prominent). For the predictive fit metrics, ELPL-
MP preferred more flexible models than ELPL-MR. For the model
selection methods, LRT and AIC preferred more flexible models than
BIC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

2.3 Simulation study 3 results for the predictive fit metrics, ELPL-MR
and ELPL-MP. Each point corresponds to the prediction-maximizing
model from one of 1000 replications. The fixed guessing parameter
varies across columns; the fixed number of persons varies across rows.
The 3PL model was most likely to offer the best fit with greater item
discrimination, more difficult items, and more persons. ELPL-MP
preferred the 3PL model more often than ELPL-MR did. . . . . . . . 68

2.4 Simulation study 3 results for the model selection methods, BIC,
AIC, and LRT. Each point corresponds to the selected model from
one of 1000 replications. The fixed guessing parameter varies across
columns; the fixed number of persons varies across rows. The 3PL
model was selected more often with greater item discrimination,
more difficult items, and more persons. BIC selected models con-
sistent with ELPL-MR, while AIC and LRT selected models more
consistent with ELPL-MP. . . . . . . . . . . . . . . . . . . . . . . . . 69

2.5 Simulation study 4 results. Each point corresponds to the prediction-
maximizing model according to the predictive fit metrics, ELPL-MR
and ELPL-MP, or the model selected by BIC, AIC, or LRT for one
of 2000 replications. The prediction-maximizing and selected model
was more likely to be the 2F 2PL with lower correlation between
factors and more persons. LRT nearly always selected the 2F 2PL
model even in replications where both ELPL-MP and ELPL-MR
identified the 1F 2PL model as prediction-maximizing. AIC selected
models largely consistent with ELPL-MP. BIC selected models more
closely aligned to ELPL-MR. . . . . . . . . . . . . . . . . . . . . . . 71

3.1 Number of the 414 milestones completed by age with percentile


curves. Points represent individual children. . . . . . . . . . . . . . . 89

3.2 Distribution of discrimination parameters for the factors of the 1F,


2F, 3F, 4F, and 5F models. Columns of subplots show models, rows
show factors, and distributions are the density of discrimination pa-
rameter estimates, colored by broad milestone categories. In each of
the models, linguistic milestones load heavily on the 1st factor. Addi-
tional factors tend to be composed of other milestone categories—for
example, physical milestones tend to load heavily on the 2nd factor.
As expected, the typical discrimination decreases for later factors.
Arrows track the location of two milestones, crawling (physical) and
saying at least 4 words (linguistic), across each of the factors. . . . . 92

3.3 Gain of higher-dimensional models over 1F model. Gain is defined as


the proportion of the distance between the 1F model’s performance
and 100% that the model achieves. The top panel shows that when
each model is fit to the full dataset, higher-dimensional models per-
form particularly well for older age groups. The bottom panel shows
that when each model is fit separately to each age group, higher-
dimensional models perform particularly well for older age groups. . . 94

3.4 Each panel corresponds to a step from Study 2. In the first step, we
used the survey data to develop a measurement model. The first fac-
tor is mainly physical and the second factor is mainly linguistic. In
the second step, we used the measurement model to estimate factor
scores for each child-timepoint in the app data. As expected, both
factors are highly associated with age. In the third step, we modeled
longer-term developmental trends separately for each child. Here, we
illustrate this step by showing the trends for a single child. In the
fourth step, we extract the deviations (i.e., residuals) from the devel-
opmental trends. Here, we show the deviations (i.e., residuals) for
that same child. These deviations allow us to examine age-related
differences in within-person coupling of factor scores. . . . . . . . . . 96

3.5 Coupling parameters from 2 to 18 months old. Association between


1-unit deviation from the developmental pathway for one factor with
deviation from the other factor’s developmental path. Left panel
shows factor 1 as the dependent variable and factor 2 as the inde-
pendent variable. Right panel is the inverse. As the differentiation
hypothesis suggests, we find decreasing coupling over the age span.
Red shading is 95% confidence interval for exploratory sample as
calculated by 1000 bootstrapped simulations. . . . . . . . . . . . . . 97

3.6 Example item characteristic curves for 9 of the 414 milestones from
a 1F model fit to the survey data. Babbling is unrelated to a child’s
development whereas other milestones such as knowing three or more
numbers are highly related to development. . . . . . . . . . . . . . . 118

3.7 The raw proportion of milestones reached by age in months of the


child. This proportion stays relatively constant over the age-span
because as children develop the milestones that parents respond to
become more advanced. As a result, ceiling effects for older children
are avoided. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

3.8 Coupling parameters from 2 to 18 months old from a 3f measure-


ment model. Association between 1-unit deviation from the develop-
mental pathway for one factor with deviation from the other factor’s
developmental path. Top row shows the relationship between factor
1 and factor 2. Middle row shows the relationship between factor 1
and factor 3. Bottom row shows the relationship between factor 2
and factor 3. Red shading is 95% confidence interval for exploratory
sample as calculated by 1000 bootstrapped simulations. As the dif-
ferentiation hypothesis suggests, we find decreasing coupling over the
age span in the relationship between factor 1 and factor 2 as well as
between factor 1 and factor 3. . . . . . . . . . . . . . . . . . . . . . . 120

3.9 Gain of higher-dimensional models over 1F model for the SWYC


data. Gain is defined as the proportion of the distance between the
1F model’s performance and 100% that the model achieves. The top
panel shows results from when each model is fit to the full dataset.
The bottom panel shows results from when each model is fit sepa-
rately to each age group. We did not find a consistent relationship
between age (i.e., instrument version) and gain from higher dimen-
sional models for the SWYC data. . . . . . . . . . . . . . . . . . . . 121

3.10 The relationship between the number of children for the SWYC
version and gain between a 5F and 1F model. As expected, the 5F
model performs better when fit to larger sample sizes. This con-
founding is one possible reason that we did not find a relationship
between age group and gain over the 1F model in the previous figure. 121

INTRODUCTION

How skilled is a student at solving equations? Is a child's development on
track? Are males more verbally aggressive, on average, than females? Which crimi-
nal offenders are experiencing psychopathy? These questions and many more can
be answered by interpreting a statistical model fit to survey or assessment data
(e.g., Hambleton et al., 1982; Sheldrick et al., 2019; Smits, De Boeck, and Vanstee-
landt, 2004; Cooke et al., 2005). Item response theory offers a suite of such models,
known as item response models. Fundamentally, item response models view the
probability of an individual responding affirmatively (or correctly) to an item as
a function of the individual’s factors and the item’s parameters (Embretson and
Reise, 2013). How many factors should represent the individual? Should each item
have a guessing parameter? What mathematical function links the individual’s fac-
tors and the item’s parameters to the probability? Should individuals from different
groups have different item parameters? The many possible answers to each of these
questions constitute different item response models.

Many item response models are possible for any data set, and different models
frequently lead to different conclusions. As an extreme example, one group of re-
searchers reported that a psychopathy instrument used for criminal risk assessment
contained significant bias against North Americans compared to Europeans (Cooke
et al., 2005). A different group of researchers countered that the results were driven
by their flawed model selection process (Bolt, Hare, and Neumann, 2007). The
Standards for Educational and Psychological Testing require that evidence of model
fit must be brought to bear, especially when decisions are made based on empirical
data (AERA, 2014). What exactly does it mean for a model to fit item response
data? What makes for valid evidence? And, what process should researchers follow
to arrive at this evidence? These are the broad questions that thread through my
dissertation’s three chapters.

In the first chapter, I consider model selection in the context of identifying
items that may contain bias. I warn against overlooking the model identification
problem at the beginning of most methods for detecting potentially biased items.
I suggest the following three-step process for flagging potentially biased items: (1)
begin by examining raw item response data, (2) compare the results of a variety of
methods, and (3) interpret results in light of the possibility of the methods failing.
I develop new methods for these steps, including GLIMMER, a graphical method
that enables analysts to inspect their raw item response data for potential bias
without making strong assumptions. I illustrate this process using data from a ver-
bal aggression instrument and find that it’s impossible to tell whether males, on
average, are more verbally aggressive than females. For example, one method con-
cludes that males are 0.5 standard deviations more verbally aggressive than females,
while another concludes that the difference is 0.001 standard deviations.

In the second chapter, I advocate for measuring an item response model's
fit by how well it predicts out-of-sample data instead of whether the model could
have produced the data. The fact that item responses are cross-classified within
persons and items complicates this discussion. Accordingly, I consider two separate
predictive tasks for a model. The first task, “missing responses prediction,” is for
the model to predict the probability of an affirmative response from in-sample per-
sons responding to in-sample items. The second task, “missing persons prediction,”
is for the model to predict the vector of responses from an out-of-sample person. I
derive a predictive fit metric for each of these tasks and conduct a series of simu-
lation studies to describe their behavior. For example, I find that defining predic-
tion in terms of missing responses, greater average person ability, and greater item
discrimination are all associated with the 3PL model producing relatively worse
predictions, and thus lead to greater minimum sample sizes. Further, I compare
the prediction-maximizing model to the model selected by AIC, BIC, and likeli-
hood ratio tests. In terms of predictive performance, likelihood ratio tests often
select overly flexible models, while BIC tends to select overly parsimonious mod-
els. Lastly, I use PISA data to demonstrate how to use cross-validation to directly
estimate the predictive fit metrics in practice (PISA, 2015).
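
To make the two prediction tasks concrete, here is a minimal sketch (my own illustration in Python, not code from the dissertation) of how an item response matrix might be split for each task: individual cells are masked for "missing responses" prediction, whereas whole rows (persons) are held out for "missing persons" prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
responses = rng.integers(0, 2, size=(500, 20))  # fake 0/1 responses: 500 persons, 20 items

# Missing responses prediction: mask a random 10% of individual cells.
cell_mask = rng.random(responses.shape) < 0.10
train_mr = np.where(cell_mask, -1, responses)   # -1 marks held-out responses
heldout_cells = responses[cell_mask]

# Missing persons prediction: hold out 10% of persons (entire response vectors).
heldout_ids = rng.choice(responses.shape[0], size=50, replace=False)
train_mp = np.delete(responses, heldout_ids, axis=0)
heldout_rows = responses[heldout_ids]

print(train_mr.shape, heldout_cells.shape, train_mp.shape, heldout_rows.shape)
```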

In the third chapter, I develop new methods for comparing multidimensional
item response models and apply them to empirically explore early childhood devel-
opment. Despite both the abundance of theory and the important implications for
parenting, teaching, and health practice, there is surprisingly little large-scale em-
pirical work on early childhood development (e.g., Flavell, 1963; Gelman and Meck,
1983). My coauthors and I combine cross-sectional survey data and longitudinal
mobile app data provided by thousands of parents as their children developed to
address this gap. In particular, the mobile app data, provided by Kinedu, Inc., is
the result of over 10,000 parents repeatedly reporting on their child’s achievement
of collections of age-specific developmental milestones. We find that multiple fac-
tors best represent early child development. For example, a two-factor model, where
the 1st factor is mainly physical and the 2nd factor is mainly linguistic, better cap-
tures developmental variation than a one-factor model. Further, we find evidence
for the differentiation hypothesis, which suggests that the structure of a child’s de-
velopment is unitary early in infancy but becomes more complex with age. These
findings indicate that measures of developmental variation should move beyond as-
sumptions that differences and progression of children’s development can be repre-
sented as a homogeneous process, and toward multidimensional representations.

References

AERA. 2014. Standards for Educational and Psychological Testing. American Ed-
ucational Research Association, American Psychological Association ….
Bolt, Daniel M, Robert D Hare, and Craig S Neumann. 2007. “Score Metric Equiv-
alence of the Psychopathy Checklist–Revised (PCL-r) Across Criminal Of-
fenders in North America and the United Kingdom: A Critique of Cooke,
Michie, Hart, and Clark (2005) and New Analyses.” Assessment 14 (1): 44–
56.
Cooke, David J, Christine Michie, Stephen D Hart, and Danny Clark. 2005. “As-
sessing Psychopathy in the UK: Concerns about Cross-Cultural Generalis-
ability.” The British Journal of Psychiatry 186 (4): 335–41.
Embretson, Susan E, and Steven P Reise. 2013. Item Response Theory. Psychology
Press.

Flavell, John H. 1963. “The Developmental Psychology of Jean Piaget.”
Gelman, Rochel, and Elizabeth Meck. 1983. “Preschoolers’ Counting: Principles
Before Skill.” Cognition 13 (3): 343–59.
Hambleton, Ronald K, and others. 1982. “Applications of Item Response Models to
NAEP Mathematics Exercise Results.”
PISA, OECD. 2015. “PISA: Results in Focus.” Organisation for Economic Co-
operation and Development: OECD.
Sheldrick, R Christopher, Lauren E Schlichting, Blythe Berger, Ailis Clyne, Pen-
sheng Ni, Ellen C Perrin, and Patrick M Vivier. 2019. “Establishing New
Norms for Developmental Milestones.” Pediatrics 144 (6).
Smits, Dirk JM, Paul De Boeck, and Kristof Vansteelandt. 2004. “The Inhibition
of Verbally Aggressive Behaviour.” European Journal of Personality 18 (7):
537–55.

CHAPTER 1
TREADING CAREFULLY: AGNOSTIC IDENTIFICATION AS THE
FIRST STEP OF DIF DETECTION

With Benjamin W. Domingue and Michael C. Frank


(In review at Psychological Methods)

Abstract

Differential item functioning (DIF) is a popular technique within the item-response
theory framework for detecting test items that are biased against particular demo-
graphic groups. The last thirty years have brought significant methodological ad-
vances in detecting DIF. Still, typical methods—such as matching on sum scores
or identifying anchor items—are based exclusively on internal criteria and there-
fore rely on a crucial piece of circular logic: items with DIF are identified via an
assumption that other items do not have DIF. This logic is an attempt to solve
an easy-to-overlook identification problem at the beginning of most DIF detec-
tion. We explore this problem, which we describe as the Fundamental DIF Identi-
fication Problem, in depth here. We suggest three steps for determining whether
it is surmountable and DIF detection results can be trusted. (1) Examine raw
item response data for potential DIF. To this end, we introduce a new graphical
method for visualizing potential DIF in raw item response data. (2) Compare the
results of a variety of methods. These methods, which we describe in detail, include
commonly-used anchor item methods, recently-proposed anchor point methods, and
our suggested adaptations. (3) Interpret results in light of the possibility of DIF
methods failing. We illustrate the basic challenge and the methodological options
using the classic verbal aggression data and a simulation study. We recommend
best practices for cautious DIF detection.

Introduction

Measures from surveys and assessments are in widespread use throughout
social and biomedical science. Education researchers use end-of-year assessments
to measure educational opportunity across communities (Reardon, Kalogrides, &
Ho, 2019), psychologists use surveys to understand personality (Vernon, 2014), and
medical researchers develop symptoms questionnaires that are widely used by clin-
icians (Amtmann et al., 2010). Such data has a general structure: individuals give
categorical responses to a set of questions. The use of such data has a general goal:
to better understand individuals' specific abilities through the measurement of un-
observable latent variables rather than relying solely on observations (Borsboom,
2006).
A popular paradigm in service of this goal is item response theory (IRT)
(Hambleton, Swaminathan, & Rogers, 1991). In IRT, each individual’s response
to each item on a measurement instrument is a function of the individual’s latent
variables and the item’s parameters. The simplest IRT model, the Rasch model,
specifies the probability of individual i responding affirmatively to item j as

Pr(y_ij = 1) = σ(θ_i + b_j) (1.1)

where θ_i is the individual's latent variable (i.e., trait or ability), b_j is the item's
easiness, and σ(x) = e^x / (1 + e^x) is the standard logistic function (Thissen & Steinberg,
1986).
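
To make Equation 1.1 concrete, here is a minimal sketch (in Python, my own illustration rather than code from this dissertation) of the Rasch model's mapping from a person's ability and an item's easiness to a response probability; the parameter values are made up.

```python
import numpy as np

def rasch_probability(theta, b):
    """Probability of an affirmative response under the Rasch model (Equation 1.1):
    Pr(y_ij = 1) = sigma(theta_i + b_j)."""
    return 1.0 / (1.0 + np.exp(-(theta + b)))

# Illustrative (made-up) values: a person of average ability (theta = 0)
# responding to a relatively easy item (b = 1) and a relatively hard item (b = -1).
print(rasch_probability(0.0, 1.0))   # approximately 0.73
print(rasch_probability(0.0, -1.0))  # approximately 0.27
```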

Frequently, categorical variables accompany item response data. As one exam-
ple, the gender of the test-taker is typically collected as part of the administration
of college entrance exams (Cai, Lu, Pan, & Zhong, 2019; Casey, Nuttall, Pezaris,
& Benbow, 1995). Other common “group” variables are male-female, high-low so-

cioeconomic status, and rural-urban; in the general case they are referred to as the
reference (“ref”) and focal (“foc”) group (Holland & Thayer, 1986). With a group
variable, an improved measurement model might allow for the possibility of sepa-
rate parameters for each group. In particular, the multigroup Rasch model allows
the probability of individual i responding affirmatively to item j to vary as a func-
tion of individual i’s group membership,

Pr(y_ij = 1) = σ(θ_i + b^group_j). (1.2)

The notation b^group_j indicates that each item has a separate easiness parameter for
each group: b^ref_j for persons in the reference group and b^foc_j for persons in the focal
group. The multigroup model allows for the possibility of differential item functioning
(DIF); an item that contains DIF functions differently across groups and thus has the
potential to cause bias (Camilli & Shepard, 1994). An item does not suffer from DIF
when b^ref_j = b^foc_j. On the other hand, if, for example, b^ref_j > b^foc_j, the item
contains DIF "against" the focal group. One way to think about DIF is that,
conditional on ability, an item is DIF-free if persons have the same probability of
responding correctly regardless of their group membership.1
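
As a small numerical illustration of Equation 1.2 (again with made-up parameter values, not estimates from any data in this dissertation), the sketch below shows the operational meaning of DIF described above: two people with identical ability receive different response probabilities on an item whose easiness differs across groups.

```python
import numpy as np

def multigroup_rasch_probability(theta, b_by_group, group):
    """Probability of an affirmative response under the multigroup Rasch model
    (Equation 1.2), where the item's easiness depends on group membership."""
    return 1.0 / (1.0 + np.exp(-(theta + b_by_group[group])))

# Hypothetical item containing DIF against the focal group: b_ref > b_foc.
b_item = {"ref": 0.5, "foc": -0.5}

theta = 0.0  # two people with identical ability but different group membership
print(multigroup_rasch_probability(theta, b_item, "ref"))  # approximately 0.62
print(multigroup_rasch_probability(theta, b_item, "foc"))  # approximately 0.38
```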

Items that contain DIF can invalidate the entire measurement process. As
one example, Pae and Park (2006) found that 22 out of 33 items from the English
reading portion of a Korean college entrance exam contained DIF across gender.
They concluded that the cumulative effect of these 22 items significantly contam-
inates test-level scores, potentially leading to unfair admission decisions. As such,
effective DIF detection methods are crucial to the field of measurement, and psy-
chometricians have long been in search of effective DIF detection methods (Mill-
sap, 2012). DIF detection methods typically test the hypothesis that bref
j = bfoc
j .

1 This is most easily seen by graphing the item characteristic curves (the mapping of ability to
probability of correct response) for each group and observing that they overlap.

The most common approach is to use a likelihood ratio test (LRT) to compare the base-
line model—where b^ref_j and b^foc_j are constrained to be equal—to a more flexible

model where that constraint is removed (Thissen, Steinberg, & Wainer, 1993). Crit-
ically, the item parameters for all of the other items are usually required to be
equal across groups. As another example, a more recent and complex approach,
the “lasso DIF method,” begins with a baseline model where every item parame-
ter is allowed to vary across groups (Magis, Tuerlinckx, & De Boeck, 2015). The
final model is found by lasso penalizing the baseline model such that a model where
some items have equal parameters across groups is obtained. Critically, each per-
son’s sum score is used as the estimate of their latent ability throughout this pro-
cess.
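
The following sketch shows only the final likelihood-ratio comparison described above, assuming the constrained and unconstrained models have already been fit (for example, with an IRT package) and their maximized log-likelihoods are in hand; the numeric values are hypothetical.

```python
from scipy.stats import chi2

def lrt_dif_test(loglik_constrained, loglik_free, df_difference=1):
    """Likelihood ratio test for DIF on a single studied item.

    loglik_constrained : maximized log-likelihood with b_ref_j = b_foc_j
    loglik_free        : maximized log-likelihood when the studied item's easiness
                         may differ across groups
    df_difference      : number of extra parameters in the free model
                         (1 for a single Rasch easiness parameter)
    """
    statistic = 2 * (loglik_free - loglik_constrained)
    p_value = chi2.sf(statistic, df=df_difference)
    return statistic, p_value

# Hypothetical log-likelihoods (not real estimates from any data set).
stat, p = lrt_dif_test(loglik_constrained=-5120.4, loglik_free=-5114.1)
print(stat, p)  # 12.6 and a p-value of roughly 0.0004
```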

As is common, both of these DIF detection methods are based exclusively on
internal criteria. As a result, they use the circular logic of looking for DIF items
while assuming other items do not contain DIF (Camilli & Shepard, 1994). The
LRT DIF method makes this assumption explicitly. The lasso DIF method makes
this assumption more subtly by using the sum score—which would be contaminated
by DIF items—as an estimate of ability. Researchers have noticed this circularity,
but have mostly described it indirectly by pointing out inflated type I errors in sim-
ulation studies (Stark, Chernyshenko, & Drasgow, 2006). Andrich and Hagquist
(2012) refer to items incorrectly flagged as having DIF due to other items contain-
ing DIF as being caused by “artificial DIF.” We argue that type I errors and the
artificial DIF that can cause them are more clearly seen as consequences of an iden-
tification problem at the root of DIF detection. This identification problem may
be easy to overlook, but solving it is no easy task. In fact, we see it as the antago-
nist of any DIF detection method and name it accordingly: the Fundamental DIF
Identification Problem.

The Fundamental DIF Identification Problem

Thus far, we have focused on item parameters, but ability parameters must be
determined as well.2 The multigroup Rasch model typically assumes that persons from
the reference group are distributed θ_i^ref ∼ N(0, 1) and persons from the focal group
are distributed θ_i^foc ∼ N(µ^foc, (σ^foc)^2). Setting the mean and variance for the
reference group abilities to 0 and 1 simply determines the scale. The Fundamental
DIF Identification Problem is that the focal group mean ability, µ^foc, must be
determined along with b^ref_j and b^foc_j for each item. A model cannot freely estimate all of

these parameters (i.e., the model is under-identified). To make this lack of identifi-

W
ability concrete, consider that the model has no way to disentangle the difference
between (a) the focal group having higher ability and (b) every item containing
bias against the reference group.3
IE
What we’re calling the Fundamental DIF Identification Problem is both ac-
knowledged and frequently overlooked by the literature (Zumbo, 2007). We argue
that it has largely been communicated in a way that doesn’t fully capture its im-
portance. Camilli and Shepard (1994) describe it as the requirement that “param-
eters must be ‘equated’ or scaled in the same metric” (p. 62). Hambleton, Swami-
nathan, and Rogers (1991), Embretson and Reise (2000), and Millsap (2012) de-
scribe it as the need to create a common scale for linking across groups. As one
example of the Fundamental DIF Identification Problem being overlooked, Cooke,
Michie, Hart, and Clark (2005) concluded that a psychopathy instrument used for
criminal risk assessment contained significant bias against North Americans as com-
pared to Europeans. Bolt, Hare, and Neumann (2007) pointed out that they made
2 As is common, we use marginal maximum likelihood estimation (MMLE), in which case it's
only the group mean abilities, not the individual abilities, that are estimated in model fitting
(Bock & Aitkin, 1981).
3 Mathematically, all IRT models with µ̂^foc + c and b̂^ref_j − b̂^foc_j + c are equivalent for any value of
c.

the mistake of beginning DIF detection with an unidentified model, and the ex-
change culminated in a legal battle that was picked up the New York Times (Carey,
2010).

The most trustworthy ways of addressing the Fundamental DIF Identification
Problem use external information based on the context in which the item response
data was gathered (Camilli & Shepard, 1994). For example, in a large randomized
experiment, the groups might be determined equivalent at baseline, and the ana-
lyst can safely assume that µfoc = µref on any instruments administered before the
experiment’s intervention. Or, an item like “2 + 2” might seem so innocuous that
the analyst faithfully assumes that b^foc_j = b^ref_j for that item.4 In the equating litera-
ture, the former is a “non-common item random groups” design and the latter is a
“common-item nonequivalent groups” design (Cook & Paterson, 1987; Topczewski,
Cui, Woodruff, Chen, & Fang, 2013).5 However, in most cases the analyst is not in
a position to make one of these assumptions. The multigroup model is unidentified
and the analyst has no knowledge with which to make an identifying assumption; it
is in this sense that we say they are "agnostic" about how to resolve the Fundamen-
tal DIF Identification Problem. The analyst has nothing to hold onto, so to speak.
We refer to any method that the analyst turns to in this case as an “agnostic iden-
tification” (AGI) method, as opposed to the more general title of DIF detection
method.6

At present “little evidence is available to guide applied researchers through


4 On the other hand, Angoff (1993) reports that test developers are often “confronted by DIF
results that they cannot understand; and no amount of deliberation seems to help explain why
some perfectly reasonable items have large DIF values” (p. 19). It’s unclear whether this obser-
vation indicates that seemingly innocuous items sometimes contain DIF or if it indicates a more
basic failure of the DIF detection method altogether.
5 With one of these assumptions in hand, the remainder of DIF detection is straightforward:
each of the other items can be checked for DIF using well-validated methods such as a likelihood
ratio test (LRT), which we use throughout this paper (Thissen, Steinberg, & Wainer, 1993).
6 To be sure, we refer to overcoming the Fundamental DIF Identification Problem without any a priori
assumptions as AGI. DIF detection, on the other hand, describes the complete process. In this
way, (we argue that) AGI is (perhaps the most important) part of DIF detection.
