Benjamin A. Stenhaug JUNE, 2021

First, I’d like to thank my friends and family, near and far, for their support.
Without exaggeration, this dissertation was made possible by my fiancé, Emma,
who offered unconditional support, patience, and zest throughout the process. I’m
grateful to Jason Fullen, Seth Saeugling, Stacy Day, Nate Uzlik, Sean Legler, and
others for their hundreds of phone calls that joyfully interrupted my work. Thank
you to my roommates and friends, Casey Ulrich and Charlotte Sivanich, who provided
a happy home, nightly Euchre games, and the background sound of remote
teaching. And, thank you to my family—Kalin, Krista, and Bruce—for everything.
I am deeply grateful for the mentorship that I’ve received. I am proud to be
Ben Domingue’s first doctoral advisee. Ben’s insight, generosity, and commitment
to creating an intellectual community provided the backbone of my experience. Ben
made learning and working a joy; some of my favorite memories from graduate
school are whiteboarding together. I am grateful to Mike Frank, who opened my
eyes to the value of measurement beyond education. Mike’s mentorship fundamentally
changed how I think about science, and I’m in awe of his simultaneous commitment
to high standards for research and kindness towards researchers. I thank
Dan Bolt for his insight, support, and generosity in joining my committee from the
great state of Wisconsin. I thank sean reardon for his engagement at each milestone
of the graduate program, including teaching methodology in a way that focuses on
building deep intuition. Thank you to Margot Gerritsen for enthusiastically sup-
porting my various endeavors and for chairing my dissertation committee. Thank
you to Nilam Ram, who significantly contributed to my growth and work given our
short time collaborating. Thank you to Kate McKinney and Nadia Ahmed, who
brought compassion and clarity to this process.

Many others, more than I can possibly thank, have contributed to my experience
and development during my time at Stanford. Working on Stanford’s Data
Science for Social Good program with Mike Sklar, Emily Flynn, Kevin Koy, Chris
Mentzel, Suzanne Tamang, Mike Baiocchi, Balasubramanian Narasimhan, Chiara
Sabatti, John Chambers, Emmanuel Candès, and others was truly a highlight of
my time at Stanford. Thank you to the friends and colleagues that shared the ex-
perience of being a graduate student. My interactions with Sam Trejo, Emily Mor-
ton, Marissa Thompson, Lief Esbenshade, David Lang, Mike Sklar, Klint Kanopka,
Daniel Angell, Sita Syal, and many others brought great joy. I’d be remiss without
noting that Mike and Klint’s thought partnership also significantly contributed to
my research.
I am thankful for the various financial support that I received. Thank you
to Stanford’s Data Science Institute, the Institute of Education Sciences (Grant
R305B140009), the Karr Family Graduate Fellowship, and the Spencer Foundation
(Grant 201700082) for supporting my time at Stanford financially. This support
gave me the privilege and freedom to research and ultimately write this dissertation
on topics that I found personally interesting and important. Thank you to Luis
Garza and Kinedu, Inc. for providing data that made part of this dissertation possible.
I also wish to thank Hadley Wickham, Phil Chalmers, and all other contributors
to open source software that I used for data analysis and computation.


How skilled is a student at solving equations? Is a child’s development on
track? Are males more verbally aggressive, on average, than females? Which criminal
offenders are experiencing psychopathy? These questions and many more can
be answered by interpreting a statistical model fit to survey or assessment data
(e.g., Hambleton et al., 1982; Sheldrick et al., 2019; Smits, De Boeck, and Vanstee-
landt, 2004; Cooke et al., 2005). Item response theory offers a suite of such models,
known as item response models. Fundamentally, item response models view the
probability of an individual responding affirmatively (or correctly) to an item as
a function of the individual’s factors and the item’s parameters (Embretson and
Reise, 2013). How many factors should represent the individual? Should each item
have a guessing parameter? What mathematical function links the individual’s factors
and the item’s parameters to the probability? Should individuals from different
groups have different item parameters? The many possible answers to each of these
questions constitute different item response models.

Many item response models are possible for any data set, and different models
frequently lead to different conclusions. As an extreme example, one group of re-
searchers reported that a psychopathy instrument used for criminal risk assessment
contained significant bias against North Americans compared to Europeans (Cooke
et al., 2005). A different group of researchers countered that the results were driven
by their flawed model selection process (Bolt, Hare, and Neumann, 2007). The
Standards for Educational and Psychological Testing require that evidence of model
fit must be brought to bear, especially when decisions are made based on empirical
data (AERA, 2014). What exactly does it mean for a model to fit item response
data? What makes for valid evidence? And, what process should researchers follow
to arrive at this evidence? These are the broad questions that thread through my
dissertation’s three chapters.

In the first chapter, I consider model selection in the context of identifying
items that may contain bias. I warn against overlooking the model identification
problem at the beginning of most methods for detecting potentially biased items.
I suggest the following three-step process for flagging potentially biased items: (1)
begin by examining raw item response data, (2) compare the results of a variety of
methods, and (3) interpret results in light of the possibility of the methods failing.
I develop new methods for these steps, including GLIMMER, a graphical method
that enables analysts to inspect their raw item response data for potential bias
without making strong assumptions. I illustrate this process using data from a verbal
aggression instrument and find that it’s impossible to tell whether males, on
average, are more verbally aggressive than females. For example, one method concludes
that males are 0.5 standard deviations more verbally aggressive than females,
while another concludes that the difference is 0.001 standard deviations.

In the second chapter, I advocate for measuring an item response model’s
fit by how well it predicts out-of-sample data instead of whether the model could
have produced the data. The fact that item responses are cross-classified within
persons and items complicates this discussion. Accordingly, I consider two separate
predictive tasks for a model. The first task, “missing responses prediction,” is for
the model to predict the probability of an affirmative response from in-sample per-
sons responding to in-sample items. The second task, “missing persons prediction,”
is for the model to predict the vector of responses from an out-of-sample person. I
derive a predictive fit metric for each of these tasks and conduct a series of simu-
lation studies to describe their behavior. For example, I find that defining predic-
tion in terms of missing responses, greater average person ability, and greater item
discrimination are all associated with the 3PL model producing relatively worse
predictions, and thus lead to greater minimum sample sizes. Further, I compare
the prediction-maximizing model to the model selected by AIC, BIC, and likeli-
hood ratio tests. In terms of predictive performance, likelihood ratio tests often
select overly flexible models, while BIC tends to select overly parsimonious mod-
els. Lastly, I use PISA data to demonstrate how to use cross-validation to directly
estimate the predictive fit metrics in practice (PISA, 2015).

In the third chapter, I develop new methods for comparing multidimensional
item response models and apply them to empirically explore early childhood development.
Despite both the abundance of theory and the important implications for
parenting, teaching, and health practice, there is surprisingly little large-scale
empirical work on early childhood development (e.g., Flavell, 1963; Gelman and Meck,
1983). My coauthors and I combine cross-sectional survey data and longitudinal
mobile app data provided by thousands of parents as their children developed to
address this gap. In particular, the mobile app data, provided by Kinedu, Inc., is
the result of over 10,000 parents repeatedly reporting on their child’s achievement
of collections of age-specific developmental milestones. We find that multiple factors
best represent early child development. For example, a two-factor model, in which
the first factor is mainly physical and the second factor is mainly linguistic, better
captures developmental variation than a one-factor model. Further, we find evidence
for the differentiation hypothesis, which suggests that the structure of a child’s de-
velopment is unitary early in infancy but becomes more complex with age. These
findings indicate that measures of developmental variation should move beyond assumptions
that differences and progression of children’s development can be represented
as a homogeneous process, and toward multidimensional representations.


AERA. 2014. Standards for Educational and Psychological Testing. American Educational
Research Association American Psychological Association ….
Bolt, Daniel M, Robert D Hare, and Craig S Neumann. 2007. “Score Metric Equiv-
alence of the Psychopathy Checklist–Revised (PCL-r) Across Criminal Of-
fenders in North America and the United Kingdom: A Critique of Cooke,
Michie, Hart, and Clark (2005) and New Analyses.” Assessment 14 (1): 44–
Cooke, David J, Christine Michie, Stephen D Hart, and Danny Clark. 2005. “As-
sessing Psychopathy in the UK: Concerns about Cross-Cultural Generalis-
ability.” The British Journal of Psychiatry 186 (4): 335–41.
Embretson, Susan E, and Steven P Reise. 2013. Item Response Theory. Psychology

Flavell, John H. 1963. “The Developmental Psychology of Jean Piaget.”
Gelman, Rochel, and Elizabeth Meck. 1983. “Preschoolers’ Counting: Principles
Before Skill.” Cognition 13 (3): 343–59.
Hambleton, Ronald K, and others. 1982. “Applications of Item Response Models to
NAEP Mathematics Exercise Results.”
Pisa, OECD. 2015. “Pisa: Results in Focus.” Organisation for Economic Co-
Operation and Development: OECD.
Sheldrick, R Christopher, Lauren E Schlichting, Blythe Berger, Ailis Clyne, Pen-
sheng Ni, Ellen C Perrin, and Patrick M Vivier. 2019. “Establishing New
Norms for Developmental Milestones.” Pediatrics 144 (6).

Smits, Dirk JM, Paul De Boeck, and Kristof Vansteelandt. 2004. “The Inhibition
of Verbally Aggressive Behaviour.” European Journal of Personality 18 (7):


With Benjamin W. Domingue and Michael C. Frank

(In review at Psychological Methods)


Differential item functioning (DIF) is a popular technique within the item response
theory framework for detecting test items that are biased against particular demographic
groups. The last thirty years have brought significant methodological advances
in detecting DIF. Still, typical methods, such as matching on sum scores
or identifying anchor items, are based exclusively on internal criteria and therefore
rely on a crucial piece of circular logic: items with DIF are identified via an
assumption that other items do not have DIF. This logic is an attempt to solve
an easy-to-overlook identification problem at the beginning of most DIF detection.
We explore this problem, which we describe as the Fundamental DIF Identification
Problem, in depth here. We suggest three steps for determining whether
it is surmountable and DIF detection results can be trusted. (1) Examine raw
item response data for potential DIF. To this end, we introduce a new graphical
method for visualizing potential DIF in raw item response data. (2) Compare the
results of a variety of methods. These methods, which we describe in detail, include
commonly-used anchor item methods, recently-proposed anchor point methods, and
our suggested adaptations. (3) Interpret results in light of the possibility of DIF
methods failing. We illustrate the basic challenge and the methodological options
using the classic verbal aggression data and a simulation study. We recommend
best practices for cautious DIF detection.


Measures from surveys and assessments are in widespread use throughout
social and biomedical science. Education researchers use end-of-year assessments
to measure educational opportunity across communities (Reardon, Kalogrides, &
Ho, 2019), psychologists use surveys to understand personality (Vernon, 2014), and
medical researchers develop symptoms questionnaires that are widely used by clin-
icians (Amtmann et al., 2010). Such data has a general structure: individuals give
categorical responses to a set of questions. The use of such data has a general goal:
to better understand individuals’ specific abilities through the measurement of unobservable
latent variables rather than relying solely on observations (Borsboom,
2006).
A popular paradigm in service of this goal is item response theory (IRT)
(Hambleton, Swaminathan, & Rogers, 1991). In IRT, each individual’s response
to each item on a measurement instrument is a function of the individual’s latent
variables and the item’s parameters. The simplest IRT model, the Rasch model,
specifies the probability of individual i responding affirmatively to item j as

$$\Pr(y_{ij} = 1) = \sigma(\theta_i + b_j) \tag{1.1}$$

where $\theta_i$ is the individual’s latent variable (i.e., trait or ability), $b_j$ is the item’s easiness,
and $\sigma(x) = \frac{e^x}{1 + e^x}$ is the standard logistic function (Thissen & Steinberg,
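
As a minimal sketch of the Rasch model in Equation 1.1, the following Python snippet evaluates the response probability for a few persons on a single item. The function names and the numeric values are illustrative assumptions and do not come from the analyses in this dissertation.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function, sigma(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def rasch_probability(theta: float, easiness: float) -> float:
    """Pr(y_ij = 1) = sigma(theta_i + b_j), as in Equation 1.1."""
    return logistic(theta + easiness)


# Hypothetical abilities and item easiness, chosen only for illustration.
abilities = [-1.0, 0.0, 1.0]
item_easiness = 0.5
probabilities = [rasch_probability(theta, item_easiness) for theta in abilities]
```

Because the item parameter is an easiness (it is added to, not subtracted from, the latent variable), a larger value of either argument raises the probability of an affirmative response.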

Frequently, categorical variables accompany item response data. As one example,
the gender of the test-taker is typically collected as part of the administration
of college entrance exams (Cai, Lu, Pan, & Zhong, 2019; Casey, Nuttall, Pezaris,
& Benbow, 1995). Other common “group” variables are male-female, high-low socioeconomic
status, and rural-urban; in the general case they are referred to as the
reference (“ref”) and focal (“foc”) group (Holland & Thayer, 1986). With a group
variable, an improved measurement model might allow for the possibility of sepa-
rate parameters for each group. In particular, the multigroup Rasch model allows
the probability of individual i responding affirmatively to item j to vary as a func-
tion of individual i’s group membership,

$$\Pr(y_{ij} = 1) = \sigma(\theta_i + b_j^{\text{group}}). \tag{1.2}$$

The notation $b_j^{\text{group}}$ indicates that each item has a separate easiness parameter for
each group: $b_j^{\text{ref}}$ for persons in the reference group and $b_j^{\text{foc}}$ for persons in the focal
group. The multigroup model allows for the possibility of differential item functioning
(DIF); an item that contains DIF functions differently across groups and thus
has the potential to cause bias (Camilli & Shepard, 1994). An item does not suffer
from DIF when $b_j^{\text{ref}} = b_j^{\text{foc}}$. On the other hand, if, for example, $b_j^{\text{ref}} > b_j^{\text{foc}}$, the
item contains DIF “against” the focal group. One way to think about DIF is that,
conditional on ability, an item is DIF-free if persons have the same probability of
responding correctly regardless of their group membership.[1]
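
To illustrate the multigroup Rasch model in Equation 1.2, the sketch below compares the response probabilities of the reference and focal groups at matched ability levels. The helper `icc_gap`, the dictionary keys, and the parameter values are hypothetical and serve only to show the definition of DIF in action.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def multigroup_rasch_probability(theta: float, easiness_by_group: dict, group: str) -> float:
    """Pr(y_ij = 1) = sigma(theta_i + b_j^group), as in Equation 1.2."""
    return logistic(theta + easiness_by_group[group])


def icc_gap(easiness_by_group: dict, thetas: list) -> float:
    """Largest absolute gap between the two groups' probabilities at equal ability."""
    return max(
        abs(
            multigroup_rasch_probability(theta, easiness_by_group, "ref")
            - multigroup_rasch_probability(theta, easiness_by_group, "foc")
        )
        for theta in thetas
    )


# Hypothetical item parameters: the first item is DIF-free, the second is
# easier for the reference group, i.e., it contains DIF against the focal group.
dif_free_item = {"ref": 0.3, "foc": 0.3}
dif_item = {"ref": 0.3, "foc": -0.2}
ability_grid = [x / 10 for x in range(-30, 31)]

gap_dif_free = icc_gap(dif_free_item, ability_grid)
gap_dif = icc_gap(dif_item, ability_grid)
```

For the DIF-free item the gap is zero at every ability level, whereas for the second item persons in the focal group have a lower probability of an affirmative response than equally able persons in the reference group.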

Items that contain DIF can invalidate the entire measurement process. As
one example, Pae and Park (2006) found that 22 out of 33 items from the English
reading portion of a Korean college entrance exam contained DIF across gender.
They concluded that the cumulative effect of these 22 items significantly contam-
inates test-level scores, potentially leading to unfair admission decisions. As such,
effective DIF detection methods are crucial to the field of measurement, and psychometricians
have long been in search of them (Millsap, 2012). DIF detection methods typically
test the hypothesis that $b_j^{\text{ref}} = b_j^{\text{foc}}$.
The most common approach is to use a likelihood ratio test (LRT) to compare the baseline
model, where $b_j^{\text{ref}}$ and $b_j^{\text{foc}}$ are constrained to be equal, to a more flexible
model where that constraint is removed (Thissen, Steinberg, & Wainer, 1993). Critically,
the item parameters for all of the other items are usually required to be
equal across groups. As another example, a more recent and complex approach,
the “lasso DIF method,” begins with a baseline model where every item parameter
is allowed to vary across groups (Magis, Tuerlinckx, & De Boeck, 2015). The
final model is found by lasso penalizing the baseline model such that a model where
some items have equal parameters across groups is obtained. Critically, each person’s
sum score is used as the estimate of their latent ability throughout this process.

[1] This is most easily seen by graphing the item characteristic curves (the mapping of ability to
probability of correct response) for each group and observing that they overlap.
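
The likelihood ratio comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this chapter: the two log-likelihoods are hypothetical numbers standing in for fitted baseline and flexible models, and the test uses one degree of freedom because freeing a single item's easiness across groups in the multigroup Rasch model adds one parameter.

```python
import math


def lrt_statistic(loglik_baseline: float, loglik_flexible: float) -> float:
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_flexible - loglik_baseline)


def chi2_upper_tail_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 df, Pr(X > x) = Pr(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) with Z standard normal.
    """
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def lrt_flags_dif(loglik_baseline: float, loglik_flexible: float, alpha: float = 0.05):
    """Return (statistic, p-value, flagged) for a one-parameter LRT."""
    statistic = lrt_statistic(loglik_baseline, loglik_flexible)
    p_value = chi2_upper_tail_df1(statistic)
    return statistic, p_value, p_value < alpha


# Hypothetical log-likelihoods; in practice these come from the two fitted models.
statistic, p_value, flagged = lrt_flags_dif(loglik_baseline=-5012.4, loglik_flexible=-5008.1)
```

Note that the validity of this test rests on the baseline model itself: the remaining items are constrained to be equal across groups, which is exactly the assumption that the surrounding text examines.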


As is common, both of these DIF detection methods are based exclusively on
internal criteria. As a result, they use the circular logic of looking for DIF items
while assuming other items do not contain DIF (Camilli & Shepard, 1994). The
LRT DIF method makes this assumption explicitly. The lasso DIF method makes
this assumption more subtly by using the sum score, which would be contaminated
by DIF items, as an estimate of ability. Researchers have noticed this circularity,
but have mostly described it indirectly by pointing out inflated type I errors in simulation
studies (Stark, Chernyshenko, & Drasgow, 2006). Andrich and Hagquist
(2012) refer to items incorrectly flagged as having DIF due to other items contain-
ing DIF as being caused by “artificial DIF.” We argue that type I errors and the
artificial DIF that can cause them are more clearly seen as consequences of an iden-
tification problem at the root of DIF detection. This identification problem may
be easy to overlook, but solving it is no easy task. In fact, we see it as the antago-
nist of any DIF detection method and name it accordingly: the Fundamental DIF
Identification Problem.

The Fundamental DIF Identification Problem

Thus far, we have focused on item parameters, but ability parameters must be
determined as well.[2] The multigroup Rasch model typically assumes that persons
from the reference group are distributed $\theta_i^{\text{ref}} \sim N(0, 1)$ and persons from the focal
group are distributed $\theta_i^{\text{foc}} \sim N(\mu^{\text{foc}}, \sigma^{\text{foc}})$. Setting the mean and variance for the
reference group abilities to 0 and 1 simply determines the scale. The Fundamental
DIF Identification Problem is that the focal group mean ability, $\mu^{\text{foc}}$, must be determined
along with $b_j^{\text{ref}}$ and $b_j^{\text{foc}}$ for each item. A model cannot freely estimate all of
these parameters (i.e., the model is under-identified). To make this lack of identifiability
concrete, consider that the model has no way to disentangle the difference
between (a) the focal group having higher ability and (b) every item containing
bias against the reference group.
[3]
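
The lack of identifiability can be checked numerically. The sketch below, which uses hypothetical parameter values, raises the focal group mean by a constant `c` while lowering every focal easiness by the same `c`, and confirms that the response probabilities for the focal group do not change. Since the data depend on the parameters only through these probabilities, no amount of data can distinguish the two parameter sets.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def focal_probabilities(mu_foc: float, sigma_foc: float, b_foc: list, z_grid: list) -> list:
    """Response probabilities for focal-group persons on every item.

    A focal person at standardized ability z has theta = mu_foc + sigma_foc * z,
    so Pr(y = 1) = logistic(theta + b_foc[j]) for item j.
    """
    return [[logistic(mu_foc + sigma_foc * z + b) for b in b_foc] for z in z_grid]


def max_abs_difference(probs_a: list, probs_b: list) -> float:
    """Largest absolute entrywise difference between two probability tables."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(probs_a, probs_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical focal-group parameters, chosen only for illustration.
mu_foc, sigma_foc = 0.4, 1.2
b_foc = [-0.5, 0.0, 0.8]
z_grid = [z / 4 for z in range(-12, 13)]

original = focal_probabilities(mu_foc, sigma_foc, b_foc, z_grid)

# (a) a more able focal group combined with (b) uniformly harder items for that group.
c = 0.7
shifted = focal_probabilities(mu_foc + c, sigma_foc, [b - c for b in b_foc], z_grid)

gap = max_abs_difference(original, shifted)  # zero up to floating-point rounding
```

Any value of `c` gives the same result, which is why an additional assumption is needed before the group difference and the item-level differences can be separated.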
What we’re calling the Fundamental DIF Identification Problem is both ac-
knowledged and frequently overlooked by the literature (Zumbo, 2007). We argue
that it has largely been communicated in a way that doesn’t fully capture its importance.
Camilli and Shepard (1994) describe it as the requirement that “parameters
must be ‘equated’ or scaled in the same metric” (p. 62). Hambleton, Swaminathan,
and Rogers (1991), Embretson and Reise (2000), and Millsap (2012) describe
it as the need to create a common scale for linking across groups. As one
example of the Fundamental DIF Identification Problem being overlooked, Cooke,
Michie, Hart, and Clark (2005) concluded that a psychopathy instrument used for
criminal risk assessment contained significant bias against North Americans as compared
to Europeans. Bolt, Hare, and Neumann (2007) pointed out that Cooke et al. made
the mistake of beginning DIF detection with an unidentified model, and the exchange
culminated in a legal battle that was picked up by the New York Times (Carey,

[2] As is common, we use marginal maximum likelihood estimation (MMLE), in which case it’s
only the group mean abilities, not the individual abilities, that are estimated in model fitting
(Bock & Aitkin, 1981).
[3] Mathematically, all IRT models with $\hat{\mu}^{\text{foc}} + c$ and $\hat{b}_j^{\text{ref}} - \hat{b}_j^{\text{foc}} + c$ are equivalent for any value of $c$.

The most trustworthy ways of addressing the Fundamental DIF Identification
Problem use external information based on the context in which the item response
data was gathered (Camilli & Shepard, 1994). For example, in a large randomized
experiment, the groups might be determined equivalent at baseline, and the analyst
can safely assume that $\mu^{\text{foc}} = \mu^{\text{ref}}$ on any instruments administered before the
experiment’s intervention. Or, an item like “2 + 2” might seem so innocuous that
the analyst faithfully assumes that $b_j^{\text{foc}} = b_j^{\text{ref}}$ for that item.[4] In the equating literature,
the former is a “non-common item random groups” design and the latter is a
“common-item nonequivalent groups” design (Cook & Paterson, 1987; Topczewski,
Cui, Woodruff, Chen, & Fang, 2013).[5] However, in most cases the analyst is not in
a position to make one of these assumptions. The multigroup model is unidentified
and the analyst has no knowledge with which to make an identifying assumption; it
is in this sense that we say they are “agnostic” about how to resolve the Fundamental
DIF Identification Problem. The analyst has nothing to hold onto, so to speak.
We refer to any method that the analyst turns to in this case as an “agnostic identification”
(AGI) method, as opposed to the more general title of DIF detection
At present “little evidence is available to guide applied researchers through

[4] On the other hand, Angoff (1993) reports that test developers are often “confronted by DIF
results that they cannot understand; and no amount of deliberation seems to help explain why
some perfectly reasonable items have large DIF values” (p. 19). It’s unclear whether this observation
indicates that seemingly innocuous items sometimes contain DIF or if it indicates a more
basic failure of the DIF detection method altogether.
[5] With one of these assumptions in hand, the remainder of DIF detection is straightforward:
each of the other items can be checked for DIF using well-validated methods such as a likelihood
ratio test (LRT), which we use throughout this paper (Thissen, Steinberg, & Wainer, 1993).
[6] To be sure, we refer to overcoming the Fundamental DIF Identification Problem without any a priori
assumptions as AGI. DIF detection, on the other hand, describes the complete process. In this
way, (we argue that) AGI is (perhaps the most important) part of DIF detection.

