Preview: Benjamin A. Stenhaug JUNE, 2021
A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL OF EDUCATION
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

BENJAMIN A. STENHAUG
JUNE, 2021
© 2021 by Ben Alan Stenhaug. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This dissertation is online at: http://purl.stanford.edu/yt267zd9190
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Michael Frank
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Daniel Bolt
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
ACKNOWLEDGMENTS
First, I’d like to thank my friends and family, near and far, for their support.
Without exaggeration, this dissertation was made possible by my fiancé, Emma,
who offered unconditional support, patience, and zest throughout the process. I’m
grateful to Jason Fullen, Seth Saeugling, Stacy Day, Nate Uzlik, Sean Legler, and
others for their hundreds of phone calls that joyfully interrupted my work. Thank
you to my roommates and friends, Casey Ulrich and Charlotte Sivanich, who pro-
vided a happy home, nightly Euchre games, and the background sound of remote
W
teaching. And, thank you to my family—Kalin, Krista, and Bruce—for everything.
I am deeply grateful for the mentorship that I’ve received. I am proud to be Ben Domingue’s first doctoral advisee. Ben’s insight, generosity, and commitment to creating an intellectual community provided the backbone of my experience. Ben made learning and working a joy; some of my favorite memories from graduate school are whiteboarding together. I am grateful to Mike Frank, who opened my eyes to the value of measurement beyond education. Mike’s mentorship fundamentally changed how I think about science, and I’m in awe of his simultaneous commitment to high standards for research and kindness towards researchers. I thank
Dan Bolt for his insight, support, and generosity in joining my committee from the
great state of Wisconsin. I thank sean reardon for his engagement at each milestone
of the graduate program, including teaching methodology in a way that focuses on
building deep intuition. Thank you to Margot Gerritsen for enthusiastically supporting my various endeavors and for chairing my dissertation committee. Thank
you to Nilam Ram, who significantly contributed to my growth and work given our
short time collaborating. Thank you to Kate McKinney and Nadia Ahmed, who
brought compassion and clarity to this process.
my research.
I am thankful for the various financial support that I received. Thank you to Stanford’s Data Science Institute, the Institute of Education Sciences (Grant R305B140009), the Karr Family Graduate Fellowship, and the Spencer Foundation (Grant 201700082) for financially supporting my time at Stanford. This support gave me the privilege and freedom to research and ultimately write this dissertation on topics that I found personally interesting and important. Thank you to Luis Garza and Kinedu, Inc. for providing data that made part of this dissertation possible. I also wish to thank Hadley Wickham, Phil Chalmers, and all other contributors to open source software that I used for data analysis and computation.
CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
Introduction
Descriptive Analysis
Agnostic Identification Methods
Simulation Study
Discussion
References
Appendix
Simulation Studies
Predictive Fit in Practice via Cross-validation
Discussion
References
Introduction
Study 1: The Structure of Developmental Variation Across Individuals
Study 2: The Dimensionality of Within-Child Variability Increases Across Development
Discussion
Materials and Methods
References
Supplemental Information
LIST OF TABLES

1.1 Three best practices for DIF detection

1.2 The 24 items in the verbal aggression data come from crossing four situations, three actions, and two types. For example, the first row corresponds to the item that asks whether an individual would want to curse if a bus didn’t stop for them.

1.3 Proportion of affirmative responses by gender in the verbal aggression data. Some of the items have a greater proportion of affirmative responses by males, and others have a greater proportion by females. Proportions are an intuitive but imperfect way of making across-item group difference comparisons.

1.4 Logits of affirmative responses by gender for the train situation in the verbal aggression data. Logit differences are an improvement over proportions for making across-item group difference comparisons, but they are still imperfect because there is no item response model underlying the estimation.

2.1 Simulation study 1 results. Counts are of the winning model according to the theoretical predictive fit metrics, ELPL-MR and ELPL-MP, and the model selected by BIC, AIC, and LRT. With the 1PL and 2PL DGMs, the predictive fit metrics find that the model with the same parameterization is the prediction-maximizing model, and the model selection methods usually select this model. With the 3PL DGM, ELPL-MR tends to find that the 2PL model offers the best predictive fit, while ELPL-MP tends to find that the 3PL model offers the best predictive fit. LRT selects models consistent with ELPL-MP, while AIC and BIC select models more consistent with ELPL-MR.

2.2 Fit of seven models to a subset of the 2015 PISA data according to six metrics. Models are ordered by number of parameters. Consistent with results from the simulation studies, metrics based on the missing responses prediction task prefer models with fewer parameters (i.e., less flexible). LRT is the p-value for the model compared to the model in the previous row, which is why the first value is NA. Each of these comparisons yielded a p-value < 10^-10, thus the most flexible model was selected. * indicates the winning model according to the metric.
3.2 Descriptive information for the SWYC data. The SWYC contains many versions which correspond to the age of the child. Each version has exactly 10 milestones, which we mapped to Kinedu’s four milestone categories. Our data contain varying numbers of children.
LIST OF FIGURES

1.1 GLIMMER for the verbal aggression data. The total performance difference (from either ability differences or DIF), d̃_j, for each item is shown. Distributions—as opposed to point estimates—are shown to help the analyst reason about uncertainty. Distributions are calculated by drawing 10,000 imputations from the item parameter covariance matrix. There is no consistent performance difference across items, indicating that the Fundamental DIF Identification Problem is difficult for this data.
1.2 AOAA results depicted in a GLIMMER

1.3 AOAA-OAT results for the verbal aggression data. On the left, the GLIMMER is identified by setting the group means equal. Green represents items that AOAA-OAT did not find to contain DIF (i.e., anchor items). On the right, the final AOAA model is identified by fixing the anchor items equivalent across groups. Distributions are shown to give a sense of variability and are estimated via 10,000 imputations from the item parameter covariance matrix.
1.4 The search path of MAXGI for the verbal aggression data. µ_female is fixed to 0 and the goal is to identify the value of µ_male that maxi-

1.5 The search path of MINBC for the verbal aggression data. µ_female is fixed to 0 and the goal is to identify the value of µ_male that minimizes the total area between item characteristic curves. As a result, the total amount of DIF on the test is minimized and as much performance difference as possible is explained by group ability differences.

1.6 Results by DIF detection method. The male group mean verbal aggression, µ̂_male, according to each method is graphed as a vertical line, which is superimposed over the performance difference, d̃_j, for each item. The scale is set by fixing µ̂_female = 0.
1.8 GLIMMER for one replication using the same item parameters as generated Figure 1.7. The six DIF-free items (items 1-6) show a constant performance difference. As expected, the other six items (items 7-12) show an increasingly large performance difference.

1.9 Performance rates across 100 replications for each AGI method and number of DIF items. Top row: AOAA-OAT nearly always chooses all of the non-DIF items as anchors (the anchor hit rate) while AOAA-AS and AOAA do much better for fewer DIF items. Bottom row: All of the methods perform similarly as far as avoiding including items with DIF in the anchor set (DIF avoidance rate). DIF avoidance rates are slightly lower for the two DIF item condition because the item with 20◦ of DIF was frequently incorrectly included in the anchor set (and it was one of only two items with DIF in this condition).

1.10 Achievement gap residual distributions across 100 replications for each AGI method and number of DIF items.

gle item response and the person’s other responses can be used to estimate ability. With missing persons, the unit of observation is the person’s response vector and there are no responses with which to estimate ability.
2.3 Simulation study 3 results for the predictive fit metrics, ELPL-MR and ELPL-MP. Each point corresponds to the prediction-maximizing model from one of 1000 replications. The fixed guessing parameter varies across columns; the fixed number of persons varies across rows. The 3PL model was most likely to offer the best fit with greater item discrimination, more difficult items, and more persons. ELPL-MP preferred the 3PL model more often than ELPL-MR did.

2.4 Simulation study 3 results for the model selection methods, BIC, AIC, and LRT. Each point corresponds to the selected model from one of 1000 replications. The fixed guessing parameter varies across columns; the fixed number of persons varies across rows. The 3PL model was selected more often with greater item discrimination, more difficult items, and more persons. BIC selected models consistent with ELPL-MR, while AIC and LRT selected models more consistent with ELPL-MP.

2.5 Simulation study 4 results. Each point corresponds to the prediction-maximizing model according to the predictive fit metrics, ELPL-MR and ELPL-MP, or the model selected by BIC, AIC, or LRT for one of 2000 replications. The prediction-maximizing and selected model was more likely to be the 2F 2PL with lower correlation between factors and more persons. LRT nearly always selected the 2F 2PL model even in replications where both ELPL-MP and ELPL-MR identified the 1F 2PL model as prediction-maximizing. AIC selected models largely consistent with ELPL-MP. BIC selected models more closely aligned to ELPL-MR.
3.4 Each panel corresponds to a step from Study 2. In the first step, we used the survey data to develop a measurement model. The first factor is mainly physical and the second factor is mainly linguistic. In the second step, we used the measurement model to estimate factor scores for each child-timepoint in the app data. As expected, both factors are highly associated with age. In the third step, we modeled longer-term developmental trends separately for each child. Here, we illustrate this step by showing the trends for a single child. In the fourth step, we extract the deviations (i.e., residuals) from the developmental trends. Here, we show the deviations for that same child. These deviations allow us to examine age-related differences in within-person coupling of factor scores.

3.6 Example item characteristic curves for 9 of the 414 milestones from a 1F model fit to the survey data. Babbling is unrelated to a child’s development whereas other milestones such as knowing three or more numbers are highly related to development.
The bottom panel shows results from when each model is fit separately to each age group. We did not find a consistent relationship between age (i.e., instrument version) and gain from higher dimensional models for the SWYC data.

3.10 The relationship between the number of children for the SWYC version and gain between a 5F and 1F model. As expected, the 5F model performs better when fit to larger sample sizes. This confounding is one possible reason that we did not find a relationship between age group and gain over the 1F model in the previous figure.
INTRODUCTION
Item response models specify the probability of an individual responding affirmatively (or correctly) to an item as a function of the individual’s factors and the item’s parameters (Embretson and Reise, 2013). How many factors should represent the individual? Should each item have a guessing parameter? What mathematical function links the individual’s factors and the item’s parameters to the probability? Should individuals from different groups have different item parameters? The many possible answers to each of these questions constitute different item response models.
Many item response models are possible for any data set, and different models frequently lead to different conclusions. As an extreme example, one group of researchers reported that a psychopathy instrument used for criminal risk assessment contained significant bias against North Americans compared to Europeans (Cooke et al., 2005). A different group of researchers countered that the results were driven by a flawed model selection process (Bolt, Hare, and Neumann, 2007). The
Standards for Educational and Psychological Testing require that evidence of model
fit must be brought to bear, especially when decisions are made based on empirical
data (AERA, 2014). What exactly does it mean for a model to fit item response
data? What makes for valid evidence? And, what process should researchers follow
to arrive at this evidence? These are the broad questions that thread through my
dissertation’s three chapters.
without making strong assumptions. I illustrate this process using data from a verbal aggression instrument and find that it’s impossible to tell whether males, on average, are more verbally aggressive than females. For example, one method concludes that males are 0.5 standard deviations more verbally aggressive than females,
fit by how well it predicts out-of-sample data instead of whether the model could
have produced the data. The fact that item responses are cross-classified within
persons and items complicates this discussion. Accordingly, I consider two separate
predictive tasks for a model. The first task, “missing responses prediction,” is for
the model to predict the probability of an affirmative response from in-sample per-
sons responding to in-sample items. The second task, “missing persons prediction,”
is for the model to predict the vector of responses from an out-of-sample person. I
derive a predictive fit metric for each of these tasks and conduct a series of simu-
lation studies to describe their behavior. For example, I find that defining predic-
tion in terms of missing responses, greater average person ability, and greater item
discrimination are all associated with the 3PL model producing relatively worse
predictions, and thus lead to greater minimum sample sizes. Further, I compare
the prediction-maximizing model to the model selected by AIC, BIC, and likeli-
hood ratio tests. In terms of predictive performance, likelihood ratio tests often
select overly flexible models, while BIC tends to select overly parsimonious mod-
els. Lastly, I use PISA data to demonstrate how to use cross-validation to directly
estimate the predictive fit metrics in practice (PISA, 2015).
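The missing-responses task described above can be sketched numerically. The snippet below is an illustrative sketch only (it is not code from the dissertation): it simulates Rasch responses, holds out a random subset of individual person-item cells, and computes the mean held-out log predictive density, the quantity underlying a metric like ELPL-MR. For brevity it scores the held-out cells with the true parameters; in practice the model would be re-fit on the remaining cells, as in cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 500, 20

# Simulate Rasch responses: P(y = 1) = sigmoid(theta_i + b_j).
theta = rng.normal(0.0, 1.0, n_persons)   # person abilities
b = rng.normal(0.0, 1.0, n_items)         # item easiness parameters
p = 1.0 / (1.0 + np.exp(-(theta[:, None] + b[None, :])))
y = rng.binomial(1, p)

# Missing responses prediction: hold out roughly 10% of individual cells.
holdout = rng.random((n_persons, n_items)) < 0.10

# Mean log predictive density over the held-out cells (closer to 0 is better).
log_density = y * np.log(p) + (1 - y) * np.log(1 - p)
elpl_mr_estimate = log_density[holdout].mean()
print(elpl_mr_estimate)
```

Missing-persons prediction differs only in what is held out: entire rows (response vectors) rather than individual cells, so no responses from a held-out person are available to inform their ability.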
empirical work on early childhood development (e.g., Flavell, 1963; Gelman and Meck,
1983). My coauthors and I combine cross-sectional survey data and longitudinal mobile app data provided by thousands of parents as their children developed to address this gap. In particular, the mobile app data, provided by Kinedu, Inc., is the result of over 10,000 parents repeatedly reporting on their child’s achievement of collections of age-specific developmental milestones. We find that multiple factors best represent early child development. For example, a two-factor model in which the first factor is mainly physical and the second factor is mainly linguistic captures developmental variation better than a one-factor model. Further, we find evidence for the differentiation hypothesis, which suggests that the structure of a child’s development is unitary early in infancy but becomes more complex with age. These findings indicate that measures of developmental variation should move beyond assumptions that the differences and progression of children’s development can be represented as a homogeneous process, and toward multidimensional representations.
References
AERA. 2014. Standards for Educational and Psychological Testing. American Educational Research Association, American Psychological Association ….
Bolt, Daniel M, Robert D Hare, and Craig S Neumann. 2007. “Score Metric Equivalence of the Psychopathy Checklist–Revised (PCL-R) Across Criminal Offenders in North America and the United Kingdom: A Critique of Cooke, Michie, Hart, and Clark (2005) and New Analyses.” Assessment 14 (1): 44–56.
Cooke, David J, Christine Michie, Stephen D Hart, and Danny Clark. 2005. “Assessing Psychopathy in the UK: Concerns about Cross-Cultural Generalisability.” The British Journal of Psychiatry 186 (4): 335–41.
Embretson, Susan E, and Steven P Reise. 2013. Item Response Theory. Psychology Press.
Flavell, John H. 1963. “The Developmental Psychology of Jean Piaget.”
Gelman, Rochel, and Elizabeth Meck. 1983. “Preschoolers’ Counting: Principles
Before Skill.” Cognition 13 (3): 343–59.
Hambleton, Ronald K, and others. 1982. “Applications of Item Response Models to
NAEP Mathematics Exercise Results.”
PISA, OECD. 2015. “PISA: Results in Focus.” Organisation for Economic Co-operation and Development: OECD.
Sheldrick, R Christopher, Lauren E Schlichting, Blythe Berger, Ailis Clyne, Pensheng Ni, Ellen C Perrin, and Patrick M Vivier. 2019. “Establishing New Norms for Developmental Milestones.” Pediatrics 144 (6).
Smits, Dirk JM, Paul De Boeck, and Kristof Vansteelandt. 2004. “The Inhibition
of Verbally Aggressive Behaviour.” European Journal of Personality 18 (7):
537–55.
CHAPTER 1
TREADING CAREFULLY: AGNOSTIC IDENTIFICATION AS THE
FIRST STEP OF DIF DETECTION
Abstract
vances in detecting DIF. Still, typical methods—such as matching on sum scores or identifying anchor items—are based exclusively on internal criteria and therefore rely on a crucial piece of circular logic: items with DIF are identified via an assumption that other items do not have DIF. This logic is an attempt to solve an easy-to-overlook identification problem at the beginning of most DIF detection. We explore this problem, which we describe as the Fundamental DIF Identification Problem, in depth here. We suggest three steps for determining whether it is surmountable and DIF detection results can be trusted. (1) Examine raw item response data for potential DIF. To this end, we introduce a new graphical method for visualizing potential DIF in raw item response data. (2) Compare the results of a variety of methods. These methods, which we describe in detail, include commonly-used anchor item methods, recently-proposed anchor point methods, and our suggested adaptations. (3) Interpret results in light of the possibility of DIF methods failing. We illustrate the basic challenge and the methodological options using the classic verbal aggression data and a simulation study. We recommend best practices for cautious DIF detection.
Introduction
unobservable latent variables rather than relying solely on observations (Borsboom, 2006).
A popular paradigm in service of this goal is item response theory (IRT) (Hambleton, Swaminathan, & Rogers, 1991). In IRT, each individual’s response to each item on a measurement instrument is a function of the individual’s latent variables and the item’s parameters. The simplest IRT model, the Rasch model, specifies the probability of individual i responding affirmatively to item j as

P(y_ij = 1) = σ(θ_i + b_j),

where θ_i is the individual’s latent variable (i.e., trait or ability), b_j is the item’s easiness, and σ(x) = e^x / (1 + e^x) is the standard logistic function (Thissen & Steinberg, 1986).
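As a concrete illustration (a sketch for the reader, not code from the dissertation), the response probability can be computed directly from the definition above:

```python
import numpy as np

def rasch_prob(theta, easiness):
    """Rasch model: P(affirmative) = sigma(theta_i + b_j), with b_j an easiness."""
    return 1.0 / (1.0 + np.exp(-(theta + easiness)))

# A person of average ability facing an item of average easiness: P = 0.5.
print(rasch_prob(0.0, 0.0))  # 0.5
# Higher ability or an easier item both raise the probability, symmetrically.
print(rasch_prob(1.0, 0.0))  # ~0.73
print(rasch_prob(0.0, 1.0))  # ~0.73
```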
socioeconomic status, and rural-urban; in the general case they are referred to as the reference (“ref”) and focal (“foc”) group (Holland & Thayer, 1986). With a group variable, an improved measurement model might allow for the possibility of separate parameters for each group. In particular, the multigroup Rasch model allows the probability of individual i responding affirmatively to item j to vary as a function of individual i’s group membership,
P(y_ij = 1) = σ(θ_i + b_j^g(i)),

where g(i) denotes individual i’s group, so that there is a separate easiness parameter for each group: b_j^ref for persons in the reference group and b_j^foc for persons in the focal group. The multigroup model allows for the possibility of differential item functioning (DIF); an item that contains DIF functions differently across groups and thus
has the potential to cause bias (Camilli & Shepard, 1994). An item does not suffer from DIF when b_j^ref = b_j^foc. On the other hand, if, for example, b_j^ref > b_j^foc, the item contains DIF “against” the focal group. One way to think about DIF is that, conditional on ability, an item is DIF-free if persons have the same probability of responding affirmatively regardless of group membership.1
Items that contain DIF can invalidate the entire measurement process. As
one example, Pae and Park (2006) found that 22 out of 33 items from the English
reading portion of a Korean college entrance exam contained DIF across gender.
They concluded that the cumulative effect of these 22 items significantly contam-
inates test-level scores, potentially leading to unfair admission decisions. As such,
effective DIF detection methods are crucial to the field of measurement, and psy-
chometricians have long been in search of effective DIF detection methods (Mill-
sap, 2012). DIF detection methods typically test the hypothesis that bref
j = bfoc
j .
1 This is most easily seen by graphing the item characteristic curves (the mapping of ability to probability of correct response) for each group and observing that they overlap.
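To make the definition concrete, here is a small sketch (illustrative code, not from the chapter) of the multigroup Rasch probability and the DIF condition:

```python
import numpy as np

def multigroup_rasch_prob(theta, b_ref, b_foc, group):
    """Multigroup Rasch model: the easiness parameter depends on group membership."""
    b = b_ref if group == "ref" else b_foc
    return 1.0 / (1.0 + np.exp(-(theta + b)))

theta = 0.3  # condition on the same ability for both groups

# DIF-free item: b_ref == b_foc, so conditional probabilities are identical.
p_ref = multigroup_rasch_prob(theta, b_ref=0.5, b_foc=0.5, group="ref")
p_foc = multigroup_rasch_prob(theta, b_ref=0.5, b_foc=0.5, group="foc")
print(p_ref == p_foc)  # True

# Item with DIF against the focal group: b_ref > b_foc, so the item is
# harder for focal-group persons of the same ability.
p_ref = multigroup_rasch_prob(theta, b_ref=0.5, b_foc=-0.5, group="ref")
p_foc = multigroup_rasch_prob(theta, b_ref=0.5, b_foc=-0.5, group="foc")
print(p_ref > p_foc)  # True
```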
The most common approach is to use a likelihood ratio test (LRT) to compare the baseline model—where b_j^ref and b_j^foc are constrained to be equal—to a more flexible model where that constraint is removed (Thissen, Steinberg, & Wainer, 1993). Critically, the item parameters for all of the other items are usually required to be equal across groups. As another example, a more recent and complex approach, the “lasso DIF method,” begins with a baseline model where every item parameter is allowed to vary across groups (Magis, Tuerlinckx, & De Boeck, 2015). The final model is found by lasso penalizing the baseline model such that a model where some items have equal parameters across groups is obtained. Critically, each person’s sum score is used as the estimate of their latent ability throughout this process.
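The LRT comparison itself is mechanical once the two models are fit. Below is a minimal sketch for a single studied item; the log-likelihood values are made up for illustration (fitting the two models is the hard part and is not shown), and only the standard library is used:

```python
import math

# Hypothetical maximized log-likelihoods from two already-fit models:
# the baseline constrains b_j^ref = b_j^foc; the flexible model frees them.
ll_baseline = -5234.7
ll_flexible = -5230.1

# The LRT statistic is twice the log-likelihood gain; freeing one
# parameter gives one degree of freedom.
lrt_stat = 2.0 * (ll_flexible - ll_baseline)

# Chi-square(1) survival function: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lrt_stat / 2.0))
print(lrt_stat, p_value)  # flag the item as containing DIF if p_value is small
```

For general degrees of freedom, `scipy.stats.chi2.sf` computes the same tail probability.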
LRT DIF method makes this assumption explicitly. The lasso DIF method makes this assumption more subtly by using the sum score, which would be contaminated by any DIF present, as the estimate of latent ability.
Thus far, we have focused on item parameters, but ability parameters must be determined as well.2 The multigroup Rasch model typically assumes that persons from the reference group are distributed θ_i^ref ∼ N(0, 1) and persons from the focal group are distributed θ_i^foc ∼ N(µ^foc, (σ^foc)²). Setting the mean and variance for the reference group abilities to 0 and 1 simply determines the scale. The Fundamental DIF Identification Problem is that the focal group mean ability, µ^foc, must be determined along with b_j^ref and b_j^foc for each item. A model cannot freely estimate all of these parameters (i.e., the model is under-identified). To make this lack of identifiability concrete, consider that the model has no way to disentangle the difference between (a) the focal group having higher ability and (b) every item containing bias against the reference group.3
What we’re calling the Fundamental DIF Identification Problem is both acknowledged and frequently overlooked by the literature (Zumbo, 2007). We argue that it has largely been communicated in a way that doesn’t fully capture its importance. Camilli and Shepard (1994) describe it as the requirement that “parameters must be ‘equated’ or scaled in the same metric” (p. 62). Hambleton, Swaminathan, and Rogers (1991), Embretson and Reise (2000), and Millsap (2012) describe it as the need to create a common scale for linking across groups. As one example of the Fundamental DIF Identification Problem being overlooked, Cooke, Michie, Hart, and Clark (2005) concluded that a psychopathy instrument used for criminal risk assessment contained significant bias against North Americans as compared to Europeans. Bolt, Hare, and Neumann (2007) pointed out that they made
2 As is common, we use marginal maximum likelihood estimation (MMLE), in which case it’s only the group mean abilities, not the individual abilities, that are estimated in model fitting (Bock & Aitkin, 1981).
3 Mathematically, all IRT models with µ̂^foc + c and b̂_j^ref − b̂_j^foc + c are equivalent for any value of c.
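This equivalence is easy to verify numerically. In the sketch below (illustrative code, not from the chapter), raising every focal-group ability by c while lowering every focal-group easiness by c leaves every response probability, and hence the likelihood, unchanged:

```python
import numpy as np

def rasch_prob(theta, easiness):
    # P(affirmative) = sigma(theta + b)
    return 1.0 / (1.0 + np.exp(-(theta + easiness)))

theta_foc = np.linspace(-3.0, 3.0, 13)  # focal-group abilities on some scale
b_foc = np.array([0.8, -0.2, 1.5])      # focal-group easiness parameters
c = 0.5                                 # an arbitrary shift

# Model A: parameters as-is. Model B: every focal ability raised by c and
# every focal easiness lowered by c (i.e., mu_foc + c with b_ref - b_foc + c).
p_a = rasch_prob(theta_foc[:, None], b_foc[None, :])
p_b = rasch_prob(theta_foc[:, None] + c, b_foc[None, :] - c)
print(np.allclose(p_a, p_b))  # True: the data cannot distinguish the two models
```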
the mistake of beginning DIF detection with an unidentified model, and the exchange culminated in a legal battle that was picked up by the New York Times (Carey, 2010).
ture, the former is a “non-common item random groups” design and the latter is a “common-item nonequivalent groups” design (Cook & Paterson, 1987; Topczewski, Cui, Woodruff, Chen, & Fang, 2013).5 However, in most cases the analyst is not in a position to make one of these assumptions. The multigroup model is unidentified and the analyst has no knowledge with which to make an identifying assumption; it is in this sense that we say they are “agnostic” about how to resolve the Fundamental DIF Identification Problem. The analyst has nothing to hold onto, so to speak. We refer to any method that the analyst turns to in this case as an “agnostic identification” (AGI) method, as opposed to the more general title of DIF detection method.6