Structural Equation Modeling: A Multidisciplinary Journal

ISSN: 1070-5511 (Print) 1532-8007 (Online) Journal homepage: http://www.tandfonline.com/loi/hsem20

The Performance of Maximum Likelihood and Weighted Least Square Mean and Variance
Adjusted Estimators in Testing Differential Item Functioning With Nonnormal Trait
Distributions

Youngsuk Suh

To cite this article: Youngsuk Suh (2015) The Performance of Maximum Likelihood and
Weighted Least Square Mean and Variance Adjusted Estimators in Testing Differential
Item Functioning With Nonnormal Trait Distributions, Structural Equation Modeling: A
Multidisciplinary Journal, 22:4, 568-580, DOI: 10.1080/10705511.2014.937669

To link to this article: http://dx.doi.org/10.1080/10705511.2014.937669

Published online: 29 Jan 2015.

Structural Equation Modeling: A Multidisciplinary Journal, 22: 568–580, 2015
Copyright © Taylor & Francis Group, LLC
ISSN: 1070-5511 print / 1532-8007 online
DOI: 10.1080/10705511.2014.937669

Youngsuk Suh
Rutgers, The State University of New Jersey

The relative performance of the maximum likelihood (ML) and weighted least square mean
and variance adjusted (WLSMV) estimators was investigated by studying differential item
functioning (DIF) with ordinal data when the latent variable ($\theta$) was not normally distributed.
As the ML estimator, ML with robust standard errors (labeled MLR in Mplus) was chosen and
implemented with 2 link functions (logit vs. probit). The Type I error and power of $\chi^2$ tests
were evaluated under various simulation conditions including the shape of the $\theta$ distributions
for the reference and focal groups. Type I error was better controlled with MLR estimators
than WLSMV. The error from WLSMV was inflated when there was a large difference in the
shape of the $\theta$ distribution between the 2 groups. In general, the power remained quite stable
across different distribution conditions regardless of the estimators. WLSMV and MLR-probit
showed comparable power, whereas MLR-logit performed the worst.

Keywords: differential item functioning, limited information, nonnormality, ordinal response

Correspondence should be addressed to Youngsuk Suh, Department of Educational Psychology, Graduate School of Education, Rutgers, The State University of New Jersey, 10 Seminary Place, Room 323, New Brunswick, NJ 08901. E-mail: yssuh327@gmail.com

Ordered categorical measures are commonly used in educational and psychological research. An example is a Likert scale item with five response categories (e.g., 1 = strongly disagree, 2 = disagree, 3 = neither disagree nor agree, 4 = agree, 5 = strongly agree). Several item response theory (IRT) models can be applied to such categorical data. IRT models have been widely used to model data from educational settings (e.g., large-scale achievement or aptitude tests and professional credential or licensure exams). However, more recently, these models have been adopted in other areas such as clinical studies (e.g., panic disorder [Woods & Thissen, 2006]; for more references, see Reise & Waller, 2009). Some IRT models can be presented in the framework of the confirmatory factor analytic (CFA) model, and the relationship between the two models’ parameters has been explored in the psychometric literature (e.g., Forero & Maydeu-Olivares, 2009; Kamata & Bauer, 2008).

A common estimation method in CFA and IRT models is full information maximum likelihood (ML), which is based on all the information contained in the entire response pattern. In the ML method, the observed variables are continuous and normally distributed. Because the responses for the ordered categorical variables are arranged numerically in ascending order, one can easily treat them as continuous variables and thus use ML for estimating model parameters without examining the categorical nature of the variables. Although in principle ordered categorical variables are assumed to measure an underlying continuous latent trait, the observed responses are discretized with a few response categories. Ordered categorical variables usually violate the normality assumption, which could lead to significant problems when the continuous normal theory ML estimation is applied to these variables (e.g., Muthén & Kaplan, 1985). In addition, the ML estimation is often computationally demanding when the method is applied to a complex model, particularly for large sample sizes and numerous items. These limitations often appear to make ML less preferable
in practice to alternative estimators (i.e., limited information estimators), although ML estimators produce asymptotically efficient parameter estimates in infinite samples (Forero & Maydeu-Olivares, 2009).

Limited information estimators analyze a tetrachoric (or polychoric) covariance and correlation matrix, and assume that the continuous latent variable underlying the discrete observed variable is normally distributed. One advantage of the limited information method is that it is faster than the full information ML, especially in estimating complex models with large sample sizes. In addition, the limited information estimators can be easily implemented in structural equation modeling (SEM) software, such as Mplus. Thus, the limited information estimators are often used in educational and psychological research, and the popularity of SEM approaches has increased among behavioral scientists.

Several studies examined the performance of full information ML and limited information estimators under IRT models, especially when the underlying latent variable is not normally distributed (e.g., Boulet, 1996; DeMars, 2012; Finch, 2010). Investigating the performance of these estimators under nonnormal distribution conditions is critical for two reasons. From a practical point of view, although most psychological and educational latent traits are thought to be normally distributed, trait scores obtained from an estimation procedure tend to be positively skewed in many psychological data (e.g., clinical tests; Reise & Waller, 2009). In addition, negatively skewed distributions might be expected in some educational circumstances (e.g., Sass, Schmitt, & Walker, 2008). Unless there is strong evidence that whether the latent distribution is normally distributed does not make much difference in estimating parameters, estimation methods should be carefully selected. From a theoretical point of view, some statistical estimation methods are intended for use with a normality assumption either for observed variables (i.e., ML) or for latent variables underlying the discrete observed variables (i.e., limited information). Therefore, examining how sensitive the two types of estimators are to the violation of the normality assumption is worthwhile.

Many studies have compared the behavior of the full information ML estimator with that of one or more limited information estimators under latent trait models through simulations and applications. Forero and Maydeu-Olivares (2009, p. 280) summarized major studies on factors affecting the performance of estimation methods for categorical variables. In this article, we focus on summarizing simulation studies that compared the two types of estimators under IRT models, especially when the underlying latent variable was not normally distributed.

Boulet (1996) investigated the accuracy and efficiency of IRT parameter estimates for a two-parameter logistic (2PL) model using two estimators, a full information ML implemented in TESTFACT and a limited information estimator (unweighted least squares [ULS]) implemented in NOHARM. The item parameter estimates from ULS were somewhat more biased than those from the full information ML under a nonnormal distribution of the latent trait variable. However, the ULS method recovered item parameters more accurately than the ML method under a normal trait distribution.

Forero and Maydeu-Olivares (2009) examined the performance of parameter estimates and standard errors in estimating Samejima’s (1969) graded response model with one and three latent traits using ULS and ML and concluded that ULS yielded slightly more accurate parameter estimates, whereas ML yielded slightly more accurate standard errors. Both methods failed in conditions such as few indicators per dimension or highly skewed items. In fact, the authors did not simulate nonnormal trait distributions per se, but generated six item types varying in skewness, kurtosis, or both in terms of the distributions of the item response categories, using normal trait distributions across all conditions.

Finch (2010) investigated the accuracy of item parameter estimates in a two-dimensional IRT model with a simple structure using ML in BILOGMG and two limited information estimators (ULS and robust weighted least squares [RWLS]) in NOHARM and Mplus, respectively. Finch considered normal and nonnormal trait distributions. The limited information estimators examined the two dimensions in the estimation, but the ML estimated the model parameters for each factor separately, because BILOGMG cannot estimate the parameters for multidimensional models. Therefore, the three methods cannot be directly compared. The performance of the three estimators was affected by the trait distribution conditions, as well as other factors, such as intertrait correlations, generating models, and pseudo-guessing conditions. One major finding was that the two limited information estimators were influenced by the distribution of the latent traits, and yielded larger standard errors of the item parameter estimates under skewed conditions.

Most recently, DeMars (2012) examined RWLS (labeled WLSMV in Mplus) and ML. ML was implemented with the expectation-maximization (EM) algorithm and numerical integration under a two-dimensional IRT model. In both estimation methods, the standard (unit) normal distribution was assumed for the two traits (dimensions). In addition to these two methods, a third method was investigated. In this method, ML was implemented with a known trait distribution instead of the standard normal distribution. That is, the true distribution used in generating the nonnormal trait distribution was specified in the Mplus estimation code. The author acknowledged that the third method is not a realistic approach, but could serve as a baseline or best case scenario. The simulation study showed that WLSMV yielded considerable biases under a skewed trait distribution. The ML estimates obtained by marginalizing over a normal distribution were somewhat biased, and the ML by marginalizing over the true (known) latent trait distribution was essentially unbiased.
Although these studies provided insightful implications for the general performance of limited information estimators against ML under nonnormal trait distribution conditions, only one population was assumed across the studies. However, we are often interested in assessing group differences in a measure, which implies more than one underlying population. When a scale (or test) is used to make comparisons across subpopulations (e.g., gender or ethnicity), it should be assumed that the test is measuring the same trait in all the groups being compared. This assumption is often referred to as the measurement invariance assumption. If the assumption is met, comparisons of those scores are acceptable and yield valid results. However, if the assumption does not hold, then such comparisons might not yield meaningful results at best, and result in misleading conclusions at worst. When the violation of measurement invariance occurs at the item level, the item is said to exhibit differential item functioning (DIF) in the context of IRT applications.

The main purpose of this study is to investigate the relative performance of two types of estimators (full information vs. limited information) in testing DIF when the underlying latent variable is not normally distributed. In particular, we compared ML with RWLS to study DIF in ordered categorical items. We chose RWLS as a limited information estimator because this method works effectively in most situations where ordered categorical variables are used with CFA for various sample sizes (Flora & Curran, 2004). In particular, WLSMV (an RWLS estimator) has been recommended for estimating CFA model parameters with categorical variables (Muthén & Muthén, 2010). Several IRT models can analyze ordinal data. Among others, the most commonly used IRT model is Samejima’s (1969) graded response model (GRM; Forero & Maydeu-Olivares, 2009). Therefore, we compared the performance of the two estimators (ML and WLSMV) in testing DIF under the GRM. The Type I error rate and power of the DIF tests obtained from these estimators were examined through a Monte Carlo study that included various simulation conditions.

The remaining sections of the article are set out as follows. In the next section, we provide a description of the GRM, the IRT model analyzed in this study, along with its correspondence in a factor analytic (FA) model. In the following two sections, we describe the estimators and the DIF test procedures. Next, we address the Monte Carlo study performed and report the results. In the final section, we present a brief summary of the findings and discuss the implications of the simulation results.

GRADED RESPONSE MODEL AND FACTOR ANALYTIC MODEL

When ordered categorical items are used, the item responses are ordered, implying the highest category contributes the most to a person’s scale score and the lowest category contributes the least (Baker & Kim, 2004, p. 203). For analyzing such ordinal items, Samejima’s (1969) GRM describes the probability of selecting category $k$ on item $i$ given person $j$’s trait level, $\theta_j$. Assume $k = 1, 2, \ldots, m$ response categories for item $i$. Although different parameterizations of the GRM are available, this article takes the general form as follows:

$$
P(X_i = k \mid \theta_j) =
\begin{cases}
1 - f(a_i\theta_j - d_{ik}) & \text{if } k = 1 \\
f(a_i\theta_j - d_{i,k-1}) - f(a_i\theta_j - d_{ik}) & \text{if } 1 < k < m \\
f(a_i\theta_j - d_{i,k-1}) & \text{if } k = m
\end{cases}
\tag{1}
$$

where $X_i$ represents an item response for item $i$; $a_i$ is a discrimination parameter for item $i$; $d_{ik}$ is an intercept parameter (with a negative sign) associated with category $k$ of the item $i$; $\theta_j$ is the ability or trait measured by the test; and $f$ is a cumulative distribution function (CDF), chosen as either a normal or logistic CDF. To maintain the underlying order of the response categories, the intercept parameters must be ordered (i.e., monotonically increasing), but do not have to be equally spaced. The $m$ categories share a common value of the discrimination parameter $a_i$, and there will always be one fewer intercept parameter ($m - 1$) than there are item response categories due to the model restriction, $\sum_{k=1}^{m} P(X_i = k \mid \theta_j) = 1$. (For a detailed GRM estimation procedure, see Baker & Kim, 2004, pp. 207–210.)

Many IRT models including the GRM can be presented in the FA model framework. In an FA model with categorical variables ($X_{ij}$), the relationship between the latent factor score ($\xi_j$) and the continuous underlying variable ($X_{ij}^*$) is specified as

$$
X_{ij}^* = v_i + \lambda_i \xi_j + \varepsilon_{ij}, \tag{2}
$$

where $\xi_j$ represents a factor or a latent trait score for person $j$, $v_i$ and $\lambda_i$ are an intercept and a factor loading for item $i$, respectively, and $\varepsilon_{ij}$ is a residual for $X_{ij}^*$. It is assumed that the underlying variables ($X^*$s) are multivariate normally distributed. A threshold ($\tau_{ik}$) structure is then added to the FA model to accommodate the categorical nature of the observed response, $X_{ij}$, as follows:

$$
X_{ij} = k \quad \text{if} \quad
\begin{cases}
X_{ij}^* < \tau_{ik} & (k = 1) \\
\tau_{i,k-1} < X_{ij}^* \le \tau_{ik} & (1 < k < m) \\
\tau_{i,k-1} \le X_{ij}^* & (k = m)
\end{cases}
\tag{3}
$$

This equation is very similar to Equation 1 in that there will be $m - 1$ threshold parameters. The intercept $v_i$ and the threshold $\tau_{ik}$ are not simultaneously identified; thus, typically
$v_i$ is fixed to zero in the model (Kamata & Bauer, 2008), and only $\lambda_i$ and $\tau_{ik}$ are estimated.

In this study, the GRM parameters were estimated using robust ML (labeled MLR¹ in Mplus) and WLSMV implemented in Mplus 6 (Muthén & Muthén, 2010). The IRT parameters, $a_i$ and $d_{ik}$, are closely connected to the FA parameters, $\lambda_i$ and $\tau_{ik}$. Because Mplus 6 uses the FA model (Equations 2 and 3) with the negative sign for $\tau_{ik}$, the relation between the FA parameters obtained from WLSMV in Mplus and the IRT parameters is:

$$
a_i = \frac{\lambda_i}{q_i} \quad \text{and} \quad d_{ik} = \frac{\tau_{ik}}{q_i}, \tag{4}
$$

where $q_i = \mathrm{Var}(\varepsilon_i)^{1/2}$. The IRT parameterization is used hereafter.

¹ MLR provides ML estimates with standard errors and a chi-square test statistic (when applicable) that are robust to nonnormality and nonindependence of observations, whereas ML provides ML estimates with conventional standard errors and chi-square test statistic. The MLR standard errors are computed using a sandwich estimator (Muthén & Muthén, 2010).

FULL INFORMATION AND LIMITED INFORMATION

In the full information ML method, observed variables follow a multivariate normal distribution, which is connected to a latent variable or factor ($\theta$ or $\xi$) via a generalized linear model with a link function (probit or logit). ML uses the entire multivariate categorical distribution of the observed variables to estimate model parameters. In other words, ML uses the examinees’ entire response patterns to obtain information about the parameters. ML estimates are obtained iteratively via the EM algorithm. During the E step, the proportion of examinees choosing a certain category is estimated by weighting the likelihood of each response pattern over the prior $\theta$ distribution (typically standard normal) of a given provisional set of parameters. In this step, the resulting distribution is a posterior distribution, which is pulled away from the prior distribution. Therefore, the more information in the likelihood, the less the impact of the prior (normal) distribution (DeMars, 2012). In the M step, the proportions of the responses obtained from the E step are used to get item parameter estimates. If the overall likelihood is unchanged or changed within a certain criterion, the process has terminated. Otherwise, the EM steps are repeated. (See Baker & Kim, 2004, pp. 157–175, for further details.) Several studies on ML estimation showed that using a normal $\theta$ distribution when the true distribution was nonnormal produced less accurate item parameter estimates (e.g., Stone, 1992; Woods & Lin, 2009).

Limited information methods assume that the continuous latent variables ($X_{ij}^*$ in Equation 2) are normally distributed, which implies that the factor ($\xi$ or $\theta$) as well as the residuals have a multivariate normal distribution. Unlike the full information ML, the limited information estimators use only low-order associations (typically univariate and bivariate information) among the observed variables to estimate model parameters (Forero & Maydeu-Olivares, 2009). They are implemented by estimating thresholds, $\tau_{ik}$, in the first stage, and by obtaining polychoric correlations (the correlations among the latent variables $X_{ij}^*$) in the second stage. The model parameters are then estimated from the estimates of $\tau_{ik}$ and polychoric correlations, in a generalized least square procedure with a weight matrix composed of the asymptotic error variance and covariances of the thresholds and polychoric correlations estimated in the first two stages. Different weight matrices can be used. For example, when the diagonal elements, the error variances, of the weight matrix are used, the method is often referred to as diagonally weighted least square, which is WLSMV in Mplus. This method requires robust corrections (adjustments) to the standard errors and test statistics (Satorra & Bentler, 1994). The robust corrections still need the assumption that the underlying continuous variables ($X_{ij}^*$) are normally distributed. That is, the normality assumption is not relaxed on the latent variables. In other words, the robust corrections do not adjust for nonnormality in the underlying continuous variables (Rhemtulla, Brosseau-Liard, & Savalei, 2012). Although ML and WLSMV are based on the normality assumption, the effect of nonnormality might worsen when WLSMV is used, because the normality assumption is an inherent part of the estimation procedure (DeMars, 2012).

CHI-SQUARE DIFFERENCE TESTS

A DIF test applied in this study is the $\chi^2$ difference test (a.k.a., the likelihood ratio test). In a typical DIF study, two groups are considered: a reference (R) group and a focal (F) group. The groups are manifest, such as gender or ethnicity. When the $\chi^2$ difference test is conducted to study DIF, some items are used as anchor items to set a common metric across groups. The metric of the item parameter estimates depends on a set of anchor items, and thus, the anchor items are assumed to be DIF-free. In the context of DIF, the $\chi^2$ test is carried out for one item at a time. The item tested for DIF is referred to as the studied item. For each studied item, the $\chi^2$ difference test compares two hierarchically nested models: a compact model (a simpler model) and an augmented model (a more complex model). The null ($H_0$) and alternative ($H_a$) hypotheses set up for this study are as follows:

$H_0$: $a_{iF} = a_{iR}$ and $d_{ikF} = d_{ikR}$ for all $k$.
$H_a$: At least one parameter for the studied item $i$ is not equal between groups.
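
As a rough sketch of how an omnibus test of these hypotheses might be computed once the compact ($H_0$) and augmented ($H_a$) models have been fitted, the Python snippet below takes the difference in −2 log-likelihood and compares it with a $\chi^2$ critical value. For an item with $m$ categories, $H_0$ constrains one $a_i$ and $m - 1$ intercepts, so the example uses df = 5 for a five-category item. The function names, log-likelihoods, parameter counts, and scaling correction factors are all hypothetical illustrations and not values from this study. The scaled version follows the commonly documented Satorra-Bentler (2001) form for robust ML output; that it matches the exact computation used here is an assumption.

```python
# Upper 5% critical values of the chi-square distribution for small df
# (standard table values).
CHI2_CRIT_05 = {1: 3.8415, 2: 5.9915, 3: 7.8147, 4: 9.4877, 5: 11.0705}


def chi2_difference(loglik_compact, loglik_augmented):
    """Plain likelihood ratio statistic: -2 log L_C - (-2 log L_A)."""
    return -2.0 * loglik_compact - (-2.0 * loglik_augmented)


def scaled_chi2_difference(loglik_compact, loglik_augmented,
                           n_par_compact, n_par_augmented,
                           scf_compact, scf_augmented):
    """Scaled difference statistic (assumed Satorra-Bentler 2001 form).

    The plain statistic is divided by a difference-test scaling correction
    cd = (p0 * c0 - p1 * c1) / (p0 - p1), where p0, c0 belong to the
    compact model and p1, c1 to the augmented model.
    """
    cd = ((n_par_compact * scf_compact - n_par_augmented * scf_augmented)
          / (n_par_compact - n_par_augmented))
    return chi2_difference(loglik_compact, loglik_augmented) / cd


def flags_dif(statistic, df, crit=CHI2_CRIT_05):
    """True when the statistic exceeds the 5% critical value for df."""
    return statistic > crit[df]


# Hypothetical fit results for one studied item with five categories.
plain = chi2_difference(-6510.0, -6503.0)
scaled = scaled_chi2_difference(-6510.0, -6503.0, 52, 57, 1.10, 1.12)
print(plain, flags_dif(plain, 5))
print(round(scaled, 3), flags_dif(scaled, 5))
```

With these made-up inputs the unscaled statistic exceeds the critical value while the scaled one does not, which illustrates why the adjustment can change the DIF decision.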
To test $H_0$, the two nested models are defined with different constraints: (a) a compact model in which all item parameters for the studied item are set equal between groups, and (b) an augmented model in which none of the studied item parameters are constrained to be equal. In both models, all item parameters for the anchor items are set equal between groups. The $\chi^2$ test statistic is then the difference between the values of −2 times the log-likelihood for the compact model ($-2 \log L_C$) and −2 times the log-likelihood for the augmented model ($-2 \log L_A$); that is, $\chi^2 = -2 \log L_C - (-2 \log L_A)$, which is approximately $\chi^2$ distributed, with degrees of freedom ($df$) equal to the difference in the number of free parameters. Statistical significance indicates the presence of DIF. This test is an omnibus test in that the test statistic would be significant if DIF occurs in any one item parameter or in any combination of item parameters for the studied item. If this omnibus test is significant, then subsequent tests can be easily conducted to investigate whether the DIF is due to unequal $a_i$s or unequal $d_{ik}$s. In this study, only the omnibus test is considered because the Type I error rate and the power for the subsequent tests largely depend on those for the omnibus test (Woods, 2011).

When MLR or WLSMV is used to obtain $\chi^2$ values for the two nested models being compared, the $\chi^2$ values should be adjusted because the difference in the $\chi^2$ values is not distributed as a $\chi^2$ (Muthén & Muthén, 2010). For MLR, a scaled $\chi^2$ difference test (Satorra & Bentler, 2001) can be calculated by using log-likelihood values and scaling correction factors from the Mplus output, whereas for WLSMV, the DIFFTEST function in Mplus (for more details, see Asparouhov & Muthén, 2006) allows an adjusted $\chi^2$ difference test.

METHOD

Data Simulation

A Monte Carlo study was conducted by including two levels of sample design, five levels of latent trait ($\theta$) distributions for the two groups being compared, six levels of DIF patterns, and two levels of DIF magnitude. The four simulation factors were fully crossed, yielding 120 conditions. In addition, a non-DIF condition was simulated and fully crossed with the sample design and latent trait distribution factors. In the non-DIF condition, it is assumed that DIF does not occur in the studied item (and thus, no DIF pattern exists). One hundred replications were simulated for each condition. Data were generated following the GRM described in Equation 1 using the item parameter values reported in Wang and Su (2004), where a 10-item test with five response categories was considered. These parameters were adopted from the parameter estimates of 4th-, 8th-, and 10th-grade students’ responses to the mathematics tests for the Wisconsin Student Assessment System. Fidalgo and Bartram (2010) also used the 10-item parameters in a DIF simulation study for the GRM. The item parameters used to generate data are presented in Table 1. The intercept parameter ($d_{ik}$) was obtained by multiplying the discrimination parameter by the difficulty parameter originally used by Wang and Su (2004, p. 456) to be consistent with the parameterization used in Equation 1. On the 10-item test, Items 1 through 9 were used as anchor items, and Item 10 was arbitrarily chosen as the studied item, the item tested for DIF. Item responses were generated with normal CDF and logistic CDF in Equation 1 using R version 2.15.1.

TABLE 1
Item Parameters ($a_i$ and $d_{ik}$) of the Graded Response Model in the 10-Item Test

Item    a_i     d_i1     d_i2    d_i3    d_i4
1       1.46    −0.51     0.98    1.42    2.83
2       1.73     0.31     1.56    2.23    3.36
3       1.81    −0.67     0.05    1.65    4.14
4       1.53    −0.86    −0.20    1.22    3.40
5       1.57    −0.60     0.77    1.63    3.66
6       1.58    −0.96     1.00    2.16    3.70
7       1.75     0.02     1.17    2.33    3.82
8       1.48    −0.34     0.46    1.45    3.64
9       1.85    −0.57     1.11    2.35    4.51
10      1.53    −0.55     0.81    1.84    3.58

For the data generated with a logistic CDF, MLR with a logit link was used to estimate the GRM parameters. For the data sets generated with a normal CDF, the WLSMV estimator with a probit link was used to match the link function of the generating model and that of the estimating model and thus to control any artifact produced by mismatching link functions between the generating and estimating models. As the third estimation method, MLR with a probit link was also considered for the data sets generated with a normal CDF. Because WLSMV is implemented with a probit link, we can directly evaluate the effect of the estimator (MLR vs. WLSMV) by using the same link for MLR. By comparing MLR-logit with MLR-probit, we can examine the effect of link function, because they are based on different link functions. For WLSMV with the probit link, the theta parameterization was selected.

Sample design. Two sample designs were simulated for the R and F groups: (a) a balanced sample design (R500/F500), and (b) an unbalanced sample design (R600/F400). These sample designs were selected to resemble the values observed in earlier DIF studies using the GRM (Fidalgo & Bartram, 2010; Wang & Su, 2004; Woods, 2011). In addition, based on previous GRM recovery study results (Reise & Yu, 1990), at least 500 examinees are needed to adequately calibrate the items.
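
The data-generation step described in this section can be sketched roughly as follows. This is a minimal Python illustration rather than the R code used in the study: it evaluates Equation 1 with the Table 1 parameters and draws ordinal responses for a balanced R500/F500 design. It covers only the simplest case (no DIF, normal traits with mean 0 and variance 1 in both groups). The function names and the random seed are hypothetical choices made for this example.

```python
import math
import random

# Table 1 item parameters: (a_i, [d_i1, d_i2, d_i3, d_i4]) for Items 1-10.
ITEMS = [
    (1.46, [-0.51, 0.98, 1.42, 2.83]),
    (1.73, [0.31, 1.56, 2.23, 3.36]),
    (1.81, [-0.67, 0.05, 1.65, 4.14]),
    (1.53, [-0.86, -0.20, 1.22, 3.40]),
    (1.57, [-0.60, 0.77, 1.63, 3.66]),
    (1.58, [-0.96, 1.00, 2.16, 3.70]),
    (1.75, [0.02, 1.17, 2.33, 3.82]),
    (1.48, [-0.34, 0.46, 1.45, 3.64]),
    (1.85, [-0.57, 1.11, 2.35, 4.51]),
    (1.53, [-0.55, 0.81, 1.84, 3.58]),
]


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def grm_probs(theta, a, d, cdf):
    """Category probabilities P(X = 1..m | theta) following Equation 1."""
    # upper[k] is the probability of responding above category k;
    # upper[0] = 1 and upper[m] = 0 close the two end categories.
    upper = [1.0] + [cdf(a * theta - dk) for dk in d] + [0.0]
    return [upper[k] - upper[k + 1] for k in range(len(d) + 1)]


def simulate_group(n, items, cdf, rng, theta_mean=0.0, theta_sd=1.0):
    """Draw normal trait scores and one ordinal response per item."""
    categories = list(range(1, len(items[0][1]) + 2))
    data = []
    for _ in range(n):
        theta = rng.gauss(theta_mean, theta_sd)
        data.append([
            rng.choices(categories, weights=grm_probs(theta, a, d, cdf))[0]
            for a, d in items
        ])
    return data


rng = random.Random(2015)  # arbitrary seed, for reproducibility only
reference = simulate_group(500, ITEMS, logistic_cdf, rng)
focal = simulate_group(500, ITEMS, logistic_cdf, rng)
```

Passing `normal_cdf` instead of `logistic_cdf` gives the probit-type generating model. Simulating DIF would amount to giving the focal group a modified parameter list for Item 10.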
Latent trait distribution. Five levels of latent trait ($\theta$) distributions were evaluated: (a) normal distributions for both groups, (b) a normal distribution for the R group and a nonnormal distribution with low skewness and kurtosis for the F group, (c) a normal distribution for the R group and a nonnormal distribution with high skewness and kurtosis for the F group, (d) nonnormal distributions with low skewness and kurtosis for both groups, and (e) a nonnormal distribution with low skewness and kurtosis for the R group and a nonnormal distribution with high skewness and kurtosis for the F group. The nonnormal distribution with low skewness and kurtosis had skewness and kurtosis coefficients equaling 0.75 and 1.25, respectively, whereas the nonnormal distribution with high skewness and kurtosis was generated with skewness = 1.5 and kurtosis = 3.5. The skewness and kurtosis values in the latter condition were similar to those previously estimated with real data (Woods & Thissen, 2006) and those used in earlier DIF studies regarding nonnormal $\theta$ distributions (e.g., Finch, 2010; Woods, 2011). The former nonnormal condition (skewness = 0.75, kurtosis = 1.25) was included to examine the effect of a less skewed and leptokurtic condition on the GRM model parameter estimation and DIF results. In fact, a skewness of 0.75 has been commonly observed in the IRT literature that examined the effect of nonnormality on the parameter estimation (e.g., Stone, 1992; Tate, 1995). In addition, the skewness and kurtosis values used in this study are similar to those examined by Flora and Curran (2004; skewness = 0.75 and 1.25; kurtosis = 1.75 and 3.75), who conducted confirmatory factor analysis to analyze ordinal variables in a single group. Both nonnormal conditions were generated following the power method described by Fleishman (1978). Using the tabulated values for specific levels of skewness and kurtosis in Fleishman (1978, p. 524), trait variables ($\theta$) were generated to follow a nonnormal distribution with the desired levels of skewness and kurtosis. In this study, only positive skewness distributions were included because in many psychological data (e.g., clinical tests) trait scores tended to be positively skewed (for references, see Reise & Waller, 2009). Regardless of the shapes of the distributions, the mean of the $\theta$ distribution for the R group was 0, whereas the mean for the F group was −0.5. The variance was always 1. These values for the mean and variance of the distribution were selected to resemble values chosen in previous DIF studies using IRT models (Suh & Bolt, 2011; Woods, 2011).

DIF pattern and DIF magnitude. As explained earlier, Items 1 through 9 were used as anchor items (thus, the same item parameters were used across groups to generate data), and only Item 10 was tested for DIF. The item parameters for Item 10 in Table 1 were applied for the R group, whereas the values were changed for the F group to simulate DIF. For the $d_{ik}$ parameters for Item 10, six DIF patterns were manipulated as follows:

• High-balanced DIF (HB): $d_{i1F} = d_{i1R} - s$; $d_{i4F} = d_{i4R} + s$
• Low-balanced DIF (LB): $d_{i1F} = d_{i1R} + s$; $d_{i4F} = d_{i4R} - s$
• High-unbalanced DIF1 (HU1): $d_{i4F} = d_{i4R} + s$
• High-unbalanced DIF2 (HU2): $d_{i3F} = d_{i3R} + s$; $d_{i4F} = d_{i4R} + s$
• Low-unbalanced DIF1 (LU1): $d_{i1F} = d_{i1R} + s$
• Low-unbalanced DIF2 (LU2): $d_{i1F} = d_{i1R} + s$; $d_{i2F} = d_{i2R} + s$,

where $s$ is equal to the DIF magnitude manipulated. In this study, two levels of $s$ were considered: 0.25 and 0.5, indicating small and medium DIF, respectively. Regardless of the DIF patterns, DIF in $a_i$ was also introduced simultaneously with DIF in $d_{ik}$s due to the frequent cooccurrence of these forms of DIF (Suh & Bolt, 2011). The $a_i$ parameter of Item 10 for the F group was set 0.3 higher than for the R group. The DIF magnitude values in $a_i$ and $d_{ik}$ are expected to be encountered in practice and have been frequently chosen in other DIF studies (e.g., Kim & Yoon, 2011; Suh & Bolt, 2011; Wang & Su, 2004). Likewise, the six patterns were selected to reflect empirically observed patterns and commonly studied patterns in the literature (e.g., Fidalgo & Bartram, 2010; Orlando & Marshall, 2002).

DIF conditions were used to investigate the power of the $\chi^2$ difference tests, whereas non-DIF conditions were included to examine the Type I error rate of the tests. For the non-DIF item, the same item parameters of Item 10 in Table 1 were used for both groups. Each data set was analyzed with MLR and WLSMV using Mplus 6 to estimate the GRM model parameters and obtain the values necessary to conduct the $\chi^2$ tests. To identify the model, the mean and the variance of the R group ($\mu_{\theta R}$, $\sigma^2_{\theta R}$) were fixed at 0 and 1, respectively. However, for the F group, the mean and the variance ($\mu_{\theta F}$, $\sigma^2_{\theta F}$) were free to be estimated.

Outcomes

The standard deviation of the item parameter estimates across replications (SD), the mean of the standard errors of the item parameter estimates across replications (MSE), the bias and root mean square errors (RMSEs) of $\mu_{\theta F}$ and $\sigma_{\theta F}$, and the Type I error rate and power of the $\chi^2$ tests were examined. To check whether the number of replications (i.e., 100) used in this study was sufficient to reduce the sampling error to an acceptable level, the SD and the MSE were calculated and compared for each item parameter for each condition. Because $\mu_{\theta F}$ and $\sigma_{\theta F}$ were estimated in each run, the bias and RMSEs of the two parameters were computed for each simulation condition. For example, the bias and the RMSE for $\mu_{\theta F}$ are obtained with $\sum_{r=1}^{100} (\hat{\mu} - \mu) / 100$ and $\sqrt{\sum_{r=1}^{100} (\hat{\mu} - \mu)^2 / 100}$, respectively. The Type I error rate
indicates the probability of detecting DIF when there is in fact no DIF in the studied item (i.e., false positive rates), whereas the power represents the probability of detecting DIF when there is DIF in the studied item (i.e., true positive rates). Therefore, for each non-DIF condition, the Type I error rate was computed as the proportion of the number of significant χ² test statistics (at α = 0.05) out of 100 replications. For each DIF condition, the power was calculated in the same manner. The results are summarized separately for MLR with logit link, MLR with probit link, and WLSMV with probit link in the next section.

RESULTS

SD and MSE of Item Parameter Estimates

The SD and MSE values of each parameter type (ai and di) were averaged across items for each simulation condition and then compared. For example, in the non-DIF condition with the balanced sample design, for the ai parameter estimate using MLR with the probit link, the SD values ranged from 0.078 to 0.081 across the five θ distribution conditions, whereas the MSE values ranged from 0.095 to 0.103. For the di parameter, the SD ranged from 0.098 to 0.103, whereas the MSE ranged from 0.126 to 0.128. The two MLR methods yielded similar results. However, the values tended to be slightly larger than those observed for WLSMV on average. The values in the DIF conditions (DIF patterns and magnitudes) also displayed very similar patterns. In sum, the results showed that the uncertainty of the sampling error (SD) was smaller than the uncertainty of the estimates (MSE) for all parameters obtained from the three estimation methods under all simulation conditions, which indicates that the number of replications (i.e., 100) was sufficient to reduce sampling error.

Bias and RMSE of μθF and σθF

Table 2 displays the bias and the RMSE of μθF and σθF for the three estimators in the non-DIF condition. WLSMV produced the largest bias, and MLR-probit provided larger bias than MLR-logit on average. μθF was better estimated than σθF. Both parameters tended to be overestimated,² especially in the nonnormal conditions. No matter the estimators, when there was a large difference in the shape of the θ distribution between the R and F groups (C3 condition), the bias tended to be the largest. As the second largest bias, the C2 (normal vs. low) and C5 (low vs. moderate) conditions showed similar biases, which were greater than the C4 condition (low vs. low). Finally, when both distributions were normally distributed (C1), the bias tended to be close to 0. The unbalanced sample design (R600/F400) produced a slightly larger bias than the balanced condition (R500/F500) with WLSMV. For the other estimators, no systematic difference was found between the two sample designs. RMSE showed similar patterns with bias. The results from the DIF conditions (DIF patterns and magnitudes) were very similar to those observed in Table 2, indicating the recovery of μθF and σθF was not affected by the level of DIF conditions. The DIF condition results can be obtained from the author on request.

Type I Error

Table 3 displays the Type I error rates of the χ² tests from MLR-logit, MLR-probit, and WLSMV. The three estimators yielded slightly higher error rates in the balanced sample design (R500/F500) than in the unbalanced design (R600/F400). The average error rate from MLR-logit in the unbalanced design was close to the expected value (0.05). The error for WLSMV was substantially inflated when there was a large difference in the shape of the θ distribution between the two groups (C3). Based on the row averages, the C3 condition showed the largest error due to the highly inflated errors with WLSMV. As expected, the C1 condition, equal to the expected value on average, showed the smallest error rate. The other three distribution conditions produced similar results: slightly higher than the expected value.

Power

Tables 4 and 5 show the power³ of the χ² tests. Table 4 presents the results from the small DIF magnitude (s = 0.25) condition, and Table 5 displays the results from the medium DIF condition (s = 0.5). When s = 0.25, MLR-logit appeared to produce lower power (an average of 0.44) than the other estimators. WLSMV showed the highest rejection rates (an average of 0.69), and MLR-probit provided slightly lower rates than WLSMV (an average of 0.67). As the DIF magnitude increased, power increased as expected. The average power for MLR-logit, MLR-probit, and WLSMV was 0.73, 0.81, and 0.80, respectively. In particular, the power from MLR-logit improved more noticeably than the other two estimators.

Under each DIF magnitude condition, the effect of sample designs on power was examined after the power was averaged across the other simulation factors (trait distributions and DIF patterns). Sample designs seem to have a small

² The true value for μθF is –0.5. Therefore, a negative value of bias in Table 2 indicates that the estimate (e.g., –0.6) is smaller than –0.5, but the absolute value of the estimate is greater than 0.5. In this regard, the author used the term overestimated.

³ As shown in Table 3, Type I errors were overestimated in some conditions and underestimated in others. As power is affected by the Type I error rate, power is also over- or underestimated. Because the power should be obtained conditional on controlling the Type I error rate, the term power in this article should be interpreted as the rejection rates (when DIF was present), rather than the intrinsic meaning of power.
TABLE 2
Bias and RMSE of Mean (μθF ) and Standard Deviation (σθF ) in the Non-DIF Conditions

Trait (θ ) Distribution R500/F500 R600/F400

R F MLR-Logit MLR-Probit WLSMV MLR-Logit MLR-Probit WLSMV Average

Mean (μθF )
Bias C1 Normal Normal .00 −.01 .00 −.01 −.02 −.02 −.01
C2 Normal Low −.05 −.06 −.11 −.04 −.07 −.13 −.08
C3 Normal Moderate −.07 −.12 −.23 −.07 −.12 −.24 −.14
C4 Low Low .00 −.02 −.03 −.01 −.02 −.03 −.02
C5 Low Moderate −.04 −.07 −.13 −.04 −.07 −.14 −.08
Average −.03 −.06 −.10 −.03 −.06 −.11 −.07

RMSE C1 Normal Normal .03 .03 .03 .03 .03 .03 .03
C2 Normal Low .06 .07 .11 .05 .07 .13 .08
C3 Normal Moderate .08 .12 .23 .08 .12 .25 .15
C4 Low Low .03 .03 .03 .04 .03 .04 .03
C5 Low Moderate .06 .08 .13 .05 .08 .15 .09
Average .05 .07 .11 .05 .07 .12 .08
Standard deviation (σθF )
Bias C1 Normal Normal .00 .00 .01 .00 .02 .02 .01
C2 Normal Low .06 .08 .13 .06 .08 .15 .09
C3 Normal Moderate .08 .13 .24 .08 .13 .25 .15
C4 Low Low .03 .05 .07 .04 .05 .09 .06
C5 Low Moderate .07 .09 .16 .06 .10 .19 .11
Average .05 .07 .12 .05 .08 .14 .09

RMSE C1 Normal Normal .03 .03 .03 .04 .03 .04 .03
C2 Normal Low .07 .09 .14 .05 .09 .15 .10
C3 Normal Moderate .09 .14 .24 .08 .14 .26 .16
C4 Low Low .05 .06 .08 .06 .06 .10 .07
C5 Low Moderate .08 .10 .17 .07 .10 .19 .12
Average .06 .08 .13 .06 .08 .15 .09

Note. RMSE = root mean square error; DIF = differential item functioning; MLR = maximum likelihood with robust standard errors;
WLSMV = weighted least square mean and variance adjusted.
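
The bias and RMSE entries in Table 2 follow the definitions given in the Outcomes section and can be computed from per-replication estimates in a few lines. The Python sketch below is illustrative only: the function name `bias_and_rmse` and the five example estimates are hypothetical (the study used 100 replications per condition, and its raw estimates are not reported here); only the true value μθF = –0.5 comes from the study design.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Return (bias, RMSE) of per-replication estimates around true_value."""
    n = len(estimates)
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / n                            # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean square error
    return bias, rmse


# Hypothetical estimates of the F-group mean from five replications.
example_estimates = [-0.58, -0.61, -0.55, -0.63, -0.60]
bias, rmse = bias_and_rmse(example_estimates, true_value=-0.5)
```

With a true mean of –0.5, a negative bias means the estimate lies farther from zero than the true value, which is the sense in which the article speaks of overestimation.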

TABLE 3
Type I Error Rates in the Non-DIF Conditions

Trait (θ ) Distribution R500/F500 R600/F400

R F MLR-Logit MLR-Probit WLSMV MLR-Logit MLR-Probit WLSMV Average

C1 Normal Normal .04 .07 .02 .04 .04 .08 .05


C2 Normal Low .09 .04 .08 .06 .08 .06 .07
C3 Normal Moderate .05 .07 .22 .04 .07 .15 .10
C4 Low Low .06 .10 .06 .07 .05 .06 .07
C5 Low Moderate .09 .08 .11 .05 .06 .04 .07
Average .07 .07 .10 .05 .06 .08

Note. DIF = differential item functioning; MLR = maximum likelihood with robust standard errors; WLSMV = weighted least square mean and variance
adjusted.
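
Because each rate in Table 3 is a proportion of significant tests out of 100 replications, it carries binomial sampling error of its own. The Python sketch below computes an empirical rejection rate and the Monte Carlo standard error of a proportion. The function names and the example p-values are hypothetical, and the standard-error formula is the usual binomial result rather than part of the original analysis.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose test is significant at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_replications):
    """Binomial standard error of a rejection rate estimated from replications."""
    return math.sqrt(rate * (1.0 - rate) / n_replications)


# Hypothetical p-values from four replications: two fall below 0.05.
example_rate = rejection_rate([0.01, 0.20, 0.04, 0.50])

# Standard error around the nominal level with 100 replications.
se_nominal = monte_carlo_se(0.05, 100)
```

With 100 replications, the standard error around a true rate of 0.05 is about 0.022, so observed rates between roughly 0.01 and 0.09 fall within two standard errors of the nominal level, whereas the WLSMV rates of .22 and .15 in the C3 condition lie well outside that band.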

or negligible effect on the power. Under the small DIF condition, MLR-logit, MLR-probit, and WLSMV showed a 0.03 decrease, no change, and a 0.01 increase in power, respectively, when moving from the balanced design to the unbalanced design, and the changes became even more minimal under the medium DIF condition.

The effects of different trait distributions and DIF patterns on power are illustrated in Figures 1a and 1b. Figure 1a shows power across different distribution conditions (C1–C5) under the small DIF condition (left side) and the medium DIF condition (right side). Under the small DIF condition, WLSMV showed higher power in the C3 (normal vs. moderate) condition than the other distribution conditions, which is probably attributed to the inflated Type I error observed in the C3 condition (see Table 3). However, this tendency disappeared when the DIF magnitude increased. Overall, the
TABLE 4
Power in the Small DIF (s = 0.25) Conditions

Trait (θ ) Distribution R500/F500 R600/F400

DIF Pattern R F MLR-Logit MLR-Probit WLSMV MLR-Logit MLR-Probit WLSMV Average

High-balanced DIF (HB) C1 Normal Normal .25 .51 .54 .26 .47 .52 .43
C2 Normal Low .24 .43 .44 .24 .44 .56 .39
C3 Normal Moderate .24 .41 .53 .23 .39 .50 .38
C4 Low Low .21 .44 .47 .17 .37 .34 .33
C5 Low Moderate .31 .41 .38 .24 .44 .42 .37
Low-balanced DIF (LB) C1 Normal Normal .65 .96 .91 .60 .93 .80 .81
C2 Normal Low .74 .96 .93 .63 .93 .92 .85
C3 Normal Moderate .71 .96 .98 .73 .92 .94 .87
C4 Low Low .74 .96 .89 .69 .94 .90 .85
C5 Low Moderate .65 .95 .93 .73 .91 .88 .84
High-unbalanced DIF1 (HU1) C1 Normal Normal .24 .32 .50 .17 .46 .51 .37
C2 Normal Low .22 .45 .54 .22 .38 .51 .39
C3 Normal Moderate .19 .52 .62 .19 .41 .58 .42
C4 Low Low .26 .48 .51 .21 .39 .46 .39
C5 Low Moderate .28 .35 .45 .19 .49 .57 .39
High-unbalanced DIF2 (HU2) C1 Normal Normal .21 .38 .46 .22 .44 .52 .37
C2 Normal Low .21 .33 .40 .22 .36 .49 .34
C3 Normal Moderate .22 .38 .54 .22 .39 .60 .39
C4 Low Low .26 .42 .51 .24 .30 .46 .37
C5 Low Moderate .36 .45 .59 .28 .41 .51 .43
Low-unbalanced DIF1 (LU1) C1 Normal Normal .49 .88 .81 .64 .93 .81 .76
C2 Normal Low .54 .84 .73 .61 .95 .89 .76
C3 Normal Moderate .62 .92 .91 .62 .93 .95 .83
C4 Low Low .61 .93 .86 .59 .93 .85 .80
C5 Low Moderate .75 .91 .93 .69 .93 .93 .86
Low-unbalanced DIF2 (LU2) C1 Normal Normal .65 .92 .80 .64 .92 .85 .80
C2 Normal Low .57 .88 .78 .61 .96 .92 .79
C3 Normal Moderate .60 .91 .94 .64 .95 .94 .83
C4 Low Low .67 .92 .87 .67 .94 .88 .83
C5 Low Moderate .73 .97 .95 .62 .97 .96 .87
Average .45 .67 .69 .43 .67 .70

Note. DIF = differential item functioning; MLR = maximum likelihood with robust standard errors; WLSMV = weighted least square mean and variance
adjusted.
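
The Low and Moderate trait distributions listed in the tables were generated with Fleishman's (1978) power method. The Python sketch below shows one way such θ values could be produced. It is an illustration under stated assumptions, not the procedure used in the study: the study took the polynomial coefficients from Fleishman's tabulated values, whereas this sketch solves Fleishman's three moment equations numerically with a Newton iteration; all function names are hypothetical; and the skewness of 0.75 and kurtosis of 1.75 are the Flora and Curran (2004) values cited as similar to those used here, treated as excess kurtosis (0 for a normal distribution).

```python
import random


def fleishman_residuals(coef, skew, excess_kurt):
    """Fleishman (1978) moment equations for Y = -c + b*X + c*X**2 + d*X**3."""
    b, c, d = coef
    f1 = b * b + 6 * b * d + 2 * c * c + 15 * d * d - 1.0          # unit variance
    f2 = 2 * c * (b * b + 24 * b * d + 105 * d * d + 2.0) - skew   # skewness
    f3 = 24 * (b * d + c * c * (1 + b * b + 28 * b * d)
               + d * d * (12 + 48 * b * d + 141 * c * c + 225 * d * d)) - excess_kurt
    return [f1, f2, f3]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_fleishman(skew, excess_kurt, tol=1e-10, max_iter=100):
    """Solve for (b, c, d) by Newton's method with a forward-difference Jacobian."""
    x = [1.0, 0.0, 0.0]  # start at the normal distribution (Y = X)
    h = 1e-7
    for _ in range(max_iter):
        f = fleishman_residuals(x, skew, excess_kurt)
        if max(abs(v) for v in f) < tol:
            return x
        jac = [[0.0] * 3 for _ in range(3)]  # jac[i][j] = d f_i / d x_j
        for j in range(3):
            xh = list(x)
            xh[j] += h
            fh = fleishman_residuals(xh, skew, excess_kurt)
            for i in range(3):
                jac[i][j] = (fh[i] - f[i]) / h
        dj = det3(jac)
        step = []
        for j in range(3):  # Cramer's rule for jac * step = -f
            mj = [row[:] for row in jac]
            for i in range(3):
                mj[i][j] = -f[i]
            step.append(det3(mj) / dj)
        x = [xi + si for xi, si in zip(x, step)]
    raise RuntimeError("Newton iteration did not converge")


def draw_theta(n, skew, excess_kurt, mean=0.0, rng=None):
    """Draw n trait values with variance 1, the given mean, and target shape."""
    rng = rng or random.Random()
    b, c, d = solve_fleishman(skew, excess_kurt)
    a = -c  # keeps the mean of the polynomial at zero before shifting
    draws = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        draws.append(mean + a + b * x + c * x * x + d * x ** 3)
    return draws


# Illustrative F-group sample: 500 examinees, mean -0.5, positively skewed.
theta_f = draw_theta(500, skew=0.75, excess_kurt=1.75, mean=-0.5,
                     rng=random.Random(2015))
```

Solving the equations directly avoids being limited to the skewness and kurtosis pairs that appear in the printed table, at the cost of having to check that the iteration converged for the requested pair.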

power of the three estimation methods tended to be unaffected by different θ distribution conditions, implying the χ² test from the three estimators was not sensitive to the violation of the normality assumption regarding power with one exception (using WLSMV under the C3 condition with the small DIF). The MLR-probit and WLSMV methods performed similarly to one another and showed higher power than MLR-logit. The difference between the two estimators and MLR-logit was larger in the small DIF condition than in the medium DIF condition.

Figure 1b illustrates how power changes across different DIF patterns under each DIF magnitude condition. In general, the χ² tests from all three estimators showed a similar pattern again; that is, the power was higher in the conditions of low-balanced DIF (LB), low-unbalanced DIF1 (LU1), and low-unbalanced DIF2 (LU2) than in the other three conditions (HB, HU1, and HU2). This tendency was more dramatic with the two ML estimators than WLSMV under the small DIF condition. As the DIF magnitude increased, the patterns from the three estimators were generally alike, but not identical. The power also improved as the DIF magnitude increased, as expected, particularly with the use of the MLR-logit estimator. Ceiling effects for LB, LU1, and LU2 might exist, especially with the use of MLR-probit and WLSMV.

DISCUSSION AND CONCLUSION

The main purpose of this simulation study was to investigate the relative performance of ML and WLSMV, as full-information and limited-information estimators, respectively, in testing DIF when the underlying latent variable (θ) was not normally distributed. For the ML estimation in Mplus, MLR with the logit link and MLR with the probit link were chosen. By comparing MLR-probit with WLSMV, we directly evaluated the effect of estimator (MLR vs. WLSMV) because WLSMV also used the same link function (probit). By comparing MLR-logit with MLR-probit, we examined the effect of link function, because they were based on different link
TABLE 5
Power in the Medium DIF (s = 0.5) Conditions

Trait (θ ) Distribution R500/F500 R600/F400

DIF Pattern R F MLR-Logit MLR-Probit WLSMV MLR-Logit MLR-Probit WLSMV Average

High-balanced DIF (HB) C1 Normal Normal .75 .88 .78 .66 .93 .86 .81
C2 Normal Low .69 .87 .77 .74 .86 .76 .78
C3 Normal Moderate .73 .90 .80 .65 .81 .77 .78
C4 Low Low .63 .87 .72 .67 .88 .75 .75
C5 Low Moderate .63 .84 .61 .61 .82 .63 .69
Low-balanced DIF (LB) C1 Normal Normal .99 1.00 1.00 1.00 1.00 1.00 1.00
C2 Normal Low 1.00 .99 .99 1.00 1.00 1.00 1.00
C3 Normal Moderate 1.00 1.00 1.00 .99 1.00 1.00 1.00
C4 Low Low 1.00 1.00 1.00 1.00 1.00 1.00 1.00
C5 Low Moderate .99 1.00 1.00 .98 1.00 1.00 1.00
High-unbalanced DIF1 (HU1) C1 Normal Normal .29 .32 .47 .29 .60 .64 .44
C2 Normal Low .23 .37 .47 .19 .43 .51 .37
C3 Normal Moderate .18 .31 .45 .24 .42 .48 .35
C4 Low Low .32 .48 .54 .33 .40 .53 .43
C5 Low Moderate .36 .51 .57 .32 .47 .50 .46
High-unbalanced DIF2 (HU2) C1 Normal Normal .55 .66 .53 .47 .65 .62 .58
C2 Normal Low .44 .61 .50 .42 .50 .46 .49
C3 Normal Moderate .43 .55 .56 .47 .51 .54 .51
C4 Low Low .56 .61 .53 .54 .62 .54 .57
C5 Low Moderate .45 .69 .63 .55 .56 .46 .56
Low-unbalanced DIF1 (LU1) C1 Normal Normal .93 1.00 .99 .98 .99 .98 .98
C2 Normal Low .94 1.00 .98 .95 1.00 .99 .98
C3 Normal Moderate .97 1.00 .99 .97 1.00 1.00 .99
C4 Low Low .91 .99 .99 .98 1.00 .99 .98
C5 Low Moderate 1.00 .99 .97 .99 1.00 .99 .99
Low-unbalanced DIF2 (LU2) C1 Normal Normal .97 1.00 1.00 .95 1.00 .99 .99
C2 Normal Low .99 1.00 1.00 .97 1.00 1.00 .99
C3 Normal Moderate 1.00 1.00 1.00 1.00 1.00 1.00 1.00
C4 Low Low 1.00 1.00 .99 .98 1.00 1.00 1.00
C5 Low Moderate 1.00 1.00 1.00 .97 1.00 1.00 1.00
Average .73 .81 .79 .73 .82 .80

Note. DIF = differential item functioning; MLR = maximum likelihood with robust standard errors; WLSMV = weighted least square mean and variance
adjusted.

[Figure 1 image not reproduced. Panel (a): power (vertical axis, 0.00 to 1.00) by trait distribution condition (C1–C5) under the small and medium DIF conditions. Panel (b): power by DIF pattern (HB, LB, HU1, HU2, LU1, LU2) under the small and medium DIF conditions. Lines in each panel: MLR-Logit, MLR-Probit, WLSMV.]

FIGURE 1 The effect of (a) trait distributions and (b) DIF patterns on power. Note. DIF = differential item functioning; MLR = maximum likelihood with
robust standard errors; WLSMV = weighted least square mean and variance adjusted.
functions. Samejima’s (1969) GRM was used to analyze ordered categorical data. The Type I error rate and power of the χ² test as well as the recovery of the F group’s mean and standard deviation were evaluated under different simulation conditions: the shape of the θ distributions for the R and F groups, sample design, DIF pattern, and DIF magnitude.

In estimating the mean and the standard deviation of the F group’s θ distribution (μθF and σθF), MLR-logit outperformed the other methods, and MLR-probit performed better than WLSMV. Neither parameter was estimated well when there was a large difference in the shape of the θ distribution between the two groups (C3), especially with WLSMV. When both distributions were normally distributed (C1), the recovery results seemed to be acceptable (less than 0.05 for RMSE). An interesting finding was that the recovery was better when both groups had the same nonnormal distribution with low skewness and kurtosis (C4) than when the R group followed a normal distribution and the F group followed a nonnormal distribution with low skewness and kurtosis (C2). It might be suspected that having the same distribution across groups being compared could be more critical in estimating distribution parameters than having one normal distribution when a multiple group analysis is conducted in Mplus. However, this pattern was not apparent in Type I error and power results. Whether this interesting result can be generalized requires further investigation. For example, it would be interesting to examine whether the condition of same-degree nonnormality between groups provides better estimation accuracy or possibly more controlled Type I error as well as higher power than the combined condition of normal and nonnormal distributions by including other nonnormal conditions, such as moderately skewed versus moderately skewed and normal versus moderately skewed. Because this study relied on Mplus, the result might be attributed to the software we chose. Therefore, it might also be worthwhile to compare the estimators in Mplus with ones in other available software (e.g., IRTLRDIF and IRTPRO) under such simulation conditions. Other simulation factors (sample design, DIF pattern, and DIF magnitude) did not seem to affect the recovery of the parameters.

The χ² tests obtained from the three estimators slightly over- and underestimated the nominal α level (0.05) except the normal versus moderate condition (C3) with WLSMV, in which the Type I error was substantially inflated. This implies that the χ² tests from WLSMV were likely to lead to an invalid conclusion for DIF testing especially when the difference in the shape of θ distribution between the two groups increased. Therefore, if one has no choice but to use the WLSMV estimator in studying DIF, the significance level (α) needs to be selected with caution, particularly when a large difference in the shape of the distributions is expected. Regardless of estimators, the balanced sample design produced a slightly higher error rate than the unbalanced sample design, but no such systematic pattern was observed in the power analysis.

Based on the results from the power analysis, MLR-probit and WLSMV showed similar rejection rates, both higher than MLR-logit. The discrepancy between the two estimators and MLR-logit was larger in the small DIF condition than in the medium DIF condition. The average power increased as the DIF magnitude increased. In particular, the power from MLR-logit improved noticeably, compared to the other two methods. This might imply that the DIF detection from MLR-logit works better when the DIF magnitude is medium (in other words, it does not work well when the DIF magnitude is small) compared to the other methods, or might indicate that a ceiling effect can occur with the power from MLR-probit and WLSMV (i.e., because high power rates from these two estimators were already achieved under some DIF pattern conditions with the small DIF, there was not much room for improving power with the medium DIF).

Regarding the effect of the trait distributions on power, WLSMV showed higher power with the C3 condition than the other distribution conditions, especially when small DIF was present, which is probably due to the inflated Type I error observed in the C3 condition. When medium DIF occurred, all three estimation methods displayed similar patterns, and remained quite stable across different distribution conditions. Overall, MLR-probit and WLSMV showed approximately the same level of performance, except in the C3 condition with the small DIF.

The effect of DIF patterns on power was similar regardless of the estimation methods. The power was higher in the LB, LU1, and LU2 conditions than the other DIF pattern conditions. The LB condition indicates that the first category of Item 10 functions differently against the F group, whereas the last category functions differently against the R group (i.e., di1F = di1R + s; di4F = di4R − s). The LU1 condition indicates that DIF occurs in the first category of Item 10 in favor of the R group (di1F = di1R + s), whereas the LU2 condition indicates that DIF is present in the first and second categories in favor of the R group (di1F = di1R + s and di2F = di2R + s). However, the high-balanced and unbalanced DIF conditions (HB, HU1, and HU2) represent DIF occurring in high categories against the F group. In short, higher power was observed when DIF occurred in low categories (first and second) against the F group. This could be because the means of the R and F groups’ θ distributions were generated to be equal to 0 and –0.5, respectively. These values are closer to the values of the d parameter for the low categories (di1 = –0.5 and di2 = 0.81) in Table 1 than the values for the high categories (di3 = 1.84 and di4 = 3.58). Because more examinees in each group were clustered around the mean of the θ distribution than at the high levels of the distribution, DIF in low categories appeared to have a greater effect on the examinees and consequently could be easily detected with the tests.

In summary, WLSMV did not perform better than the ML methods in recovering the mean and the standard deviation of the F group’s θ distribution (μθF and σθF) especially in the nonnormal conditions. In addition, WLSMV produced
relatively higher error rates on average than the two ML methods, because of the highly inflated error in the C3 condition (normal vs. moderate). These results could be attributed to the fact that the χ² test from ML was more robust against a severe violation of the normality assumption about the trait latent variable. As explained in the literature review, although ML and WLSMV are based on the normality assumption, violating the normality assumption might have more impact on the WLSMV estimation, but be less apparent on the ML estimation. Because power is affected by Type I error, power from WLSMV is expected to be higher than the two ML methods, especially with the C3 condition. In fact, when the DIF magnitude was small, the power from WLSMV was highest in the C3 condition (see Figure 1a). On average, the power from WLSMV was 0.25 higher than MLR-logit, but only 0.02 higher than MLR-probit in the small DIF condition. Comparing WLSMV with MLR-probit only should be more reasonable, because they used the same link function. In this regard, there was a negligible difference (0.02) between WLSMV and MLR-probit in the small DIF condition. As the DIF magnitude increased, MLR-probit showed higher power than WLSMV. The comparison between MLR-logit and MLR-probit for evaluating the effect of different link functions showed that both yielded similar error rates, but the probit link produced higher power than the logit link. Overall, the simulation results suggest that MLR-probit would be the best option in terms of Type I error control and a high power rate. WLSMV is a viable alternative to MLR-probit except under the C3 condition (normal vs. moderate).

This study has several limitations. In terms of estimation methods and test statistics, we considered MLR and WLSMV with the chi-square tests. Comparing these estimators with others such as ULS or MLM in Mplus⁴ would be interesting. Exploring whether alternate statistics such as the Wald test can produce more accurate Type I error rates and power would also be worthwhile. We considered the effect of balanced (R500/F500) and unbalanced design (R600/F400), but not of sample size. The difference between the two designs considered in this study was not big. Therefore, it might be valuable to examine the combined effect of nonnormality and unbalanced design with varying sample sizes (e.g., R1000/F500).

Significance tests such as the χ² tests are sensitive to even a small DIF magnitude when sample sizes are large. Although at least 500 examinees were needed for adequate item calibration in a previous GRM recovery study (Reise & Yu, 1990), another study concluded that WLSMV worked reasonably well with a sample size of 200 under a normal distribution condition for detecting DIF using the DIFFTEST function in Mplus (Kim & Yoon, 2011). Accordingly, the sample size of 500 used in this study could be treated as a relatively large sample size for the WLSMV estimator. Because the statistical power of the χ² test is affected by sample sizes, studying effect size measures can help interpret the test results. Several effect size measures for ordered categorical variables have been presented and compared (see, e.g., Kim, Cohen, Alagoz, & Kim, 2007) for facilitating practical interpretations of DIF test results. Therefore, examining how those effect size measures perform under the simulation conditions included in this study, compared to the χ² tests, would be valuable.

In addition to the effect size measures, alternative fit indices can be considered. For example, the performance of χ² goodness of fit, root mean square error of approximation (RMSEA), and weighted root mean square residual (WRMR) can be evaluated in detecting DIF (e.g., Kim & Yoon, 2011). By comparing the fit indices between a correctly specified model and an incorrectly specified model under different distribution conditions, we can investigate whether such fit indices could serve as viable alternatives to the χ² tests.

Finally, the findings and suggestions are limited to the conditions included in this Monte Carlo study. It is therefore left for a future study to examine more systematically the impact of other simulation conditions, such as different test lengths, distribution differences between the two groups, and different DIF patterns and magnitudes to reflect different scenarios that can be encountered in real testing programs. Results and implications from such studies including this study would provide test practitioners with useful information concerning the relative values of choosing one estimator over the others for studying DIF in ordered categorical data, especially when it is suspected that there might be a large difference in the shape of the θ distribution between the R and F groups or when it is suspected that the trait distributions of one or more groups might not be normally distributed.

⁴ The Bayes estimator is not feasible for the current application in Mplus.

REFERENCES

Asparouhov, T., & Muthén, B. (2006). Robust chi-square difference testing with mean and variance adjusted test statistics. Retrieved from http://www.statmodel.com/download/webnotes/webnote10.pdf
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Boulet, J. R. (1996). The effect of nonnormal ability distributions on IRT parameter estimation using full-information and limited-information methods. Unpublished doctoral dissertation, University of Ottawa, Ottawa, ON, Canada.
DeMars, C. E. (2012). A comparison of limited-information and full-information methods in Mplus for estimating item response theory parameters for nonnormal populations. Structural Equation Modeling, 19, 610–632.
Fidalgo, A. M., & Bartram, D. (2010). A comparison between some generalized Mantel–Haenszel statistics for detecting DIF in data simulated under the graded response model. Applied Psychological Measurement, 34, 600–606.
Finch, H. (2010). Item parameter estimation for the MIRT model: Bias and precision of confirmatory factor analysis-based models. Applied Psychological Measurement, 34, 10–26.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–531.
Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466–491.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275–299.
Kamata, A., & Bauer, D. J. (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling, 15, 136–153.
Kim, E.-S., & Yoon, M. (2011). Testing measurement invariance: A comparison of multiple-group categorical CFA and IRT. Structural Equation Modeling, 18, 212–228.
Kim, S.-H., Cohen, A. S., Alagoz, C., & Kim, S. (2007). DIF detection and effect size measure for polytomously scored items. Journal of Educational Measurement, 44, 93–116.
Muthén, B., & Kaplan, D. (1985). A comparison of some methodologies for the factor-analysis of non-normal Likert variables. British Journal of Mathematical and Statistical Psychology, 38, 171–180.
Muthén, L. K., & Muthén, B. O. (2010). Mplus: Statistical analysis with latent variables user’s guide 6.0. Los Angeles, CA: Muthén & Muthén.
Orlando, M., & Marshall, G. N. (2002). Differential item functioning in a Spanish translation of the PTSD checklist: Detection and evaluation of impact. Psychological Assessment, 14, 163–173.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.
Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133–144.
Rhemtulla, M., Brosseau-Liard, P. E., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373.
Samejima, F. (1969). Estimation of a latent ability using a response pattern of graded scores. Psychometrika Monographs, 34(Suppl. 4).
Sass, D. A., Schmitt, T. A., & Walker, C. M. (2008). Estimating non-normal latent trait distributions within item response theory using true and estimated item parameters. Applied Measurement in Education, 21, 65–88.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variable analysis: Applications to developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514.
Stone, C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1–16.
Suh, Y., & Bolt, D. M. (2011). A nested logit approach for investigating distractors as causes of differential item functioning. Journal of Educational Measurement, 48, 188–205.
Tate, R. L. (1995). Robustness of the school-level IRT model. Journal of Educational Measurement, 32, 145–162.
Wang, W.-C., & Su, Y.-H. (2004). Factors influencing the Mantel and generalized Mantel–Haenszel methods for the assessment of differential item functioning in polytomous items. Applied Psychological Measurement, 28, 450–480.
Woods, C. M. (2011). DIF testing for ordinal items with Poly-SIBTEST, the Mantel and GMH test, and IRT-LR-DIF when the latent distribution is nonnormal for both groups. Applied Psychological Measurement, 32, 511–526.
Woods, C. M., & Lin, N. (2009). Item response theory with estimation of the latent density using Davidian curves. Applied Psychological Measurement, 33, 102–117.
Woods, C. M., & Thissen, D. (2006). Item response theory with estimation of the latent population distribution using spline-based densities. Psychometrika, 71, 281–301.
