Chapter 35

Item-Level Factor Analysis


Brian D. Stucky, Nisha C. Gottfredson, and A. T. Panter
Copyright American Psychological Association. Not for further distribution.

Fundamental research questions in psychology center on establishing the factor structure of new measures and understanding how unobserved constructs assessed by these measures relate to other constructs. The standard factor analytic model assumes that response data are based on continuously measured indicators of constructs, yet the individual items used in social science research rarely meet this assumption. Items appearing in surveys (such as in attitude or knowledge scales), obtained through observation codes, or gathered in interview settings may be dichotomous, ordered categorical, censored counts, or zero-inflated counts. They may also have a variety of other distribution forms. This chapter addresses the historical traditions and current data analytic recommendations for conducting data analysis using latent variable models with item-level indicators.

The general topic of how to describe the factor structure and item-level characteristics of a collection of items emerged from two main theoretical routes (social sciences: item-level factor analysis [IFA]; education: item response theory [IRT]); the connection between these routes has been elucidated formally in the past few decades. In considering analytic options for modeling item-level data, we first discuss issues related to item measurement, response distributions, and estimation methods. We then address appropriate modeling conditions for focusing on individual items versus composites that are formed by the researcher from item subsets, such as item parcels or testlets, and we review methods for assessing measurement invariance across groups. Because most commercially available factor analysis and structural equation modeling (SEM) computer programs allow for estimation approaches and options for conducting item-level analyses, distinctions among computer programs are made when useful. Finally, specific recommendations are provided about appropriate research design conditions for modeling item-level data.

Factor Analysis

Factor analyses of categorical and continuous measures have the same goal: to explain the relationships among the variables by a smaller set of underlying, or latent, variables. Traditionally, factor analysis emphasizes understanding the dimensionality of a set of variables (e.g., exploratory factor analysis [EFA] and confirmatory factor analysis [CFA], respectively) and modeling the causal relationships among those dimensions (e.g., SEM). Although these analytic procedures are applicable for categorical response data, the key difference between the approaches lies in where the investigator places emphasis and the nature of the research questions. In the historical treatment of factor analysis for continuously measured items, item-level characteristics were often seen as problematic (McDonald, 1967, 1999; McDonald & Ahlawat, 1974). For example, the very notion of item difficulty (the probability of endorsement) is conceptually challenging from a

The authors thank members of the PROMIS pediatric network (funded by Grant 5U01AR052181) and Deb Irwin, Michelle Langer, David Thissen, Esi
DeWitt, Jim Varni, Karin Yeatts, and Darren DeWalt for use of these data.

DOI: 10.1037/13619-036
APA Handbook of Research Methods in Psychology: Vol. 1. Foundations, Planning, Measures, and Psychometrics, H. Cooper (Editor-in-Chief)
683
Copyright © 2012 by the American Psychological Association. All rights reserved.

continuously distributed factor analytic perspective, and hence techniques such as data transformations and standardization are commonly used to mask item differences in response distributions and scales.

The IFA approach, on the other hand, appreciates item behavior and characteristics as intrinsically meaningful. Concepts such as item discrimination (the strength of the relationship between an item and its factor) and item difficulty are key tools that psychometricians use to analyze item response data. To further develop the importance of item characteristics, we provide some background on the traditional factor analysis model.

The common factor model (Jöreskog, 1969, 1970) is based on a set of assumptions: that the observed variables y = (y1, . . ., yi) are continuous, normally distributed with mean α and covariance Σ, and that the relationship between observed and latent variables ξ = (ξ1, . . ., ξp) is best described by a linear model:

y = α + Λξ + e,  (1)

where observed responses y are functions of factor loadings Λ, latent variables ξ, item means α, and measurement error e (residuals). Incorporating the covariance matrix of ξ, called Φ, and the covariance matrix of (typically uncorrelated) residuals Ψ, the relationships among y are represented by the covariance matrix Σ:

Σ = ΛΦΛ′ + Ψ.  (2)

Items appearing in research in psychology and the behavioral sciences typically have categorical response options that violate these assumptions (Krosnick, 1999). In these situations, an appropriate alternative to the linear common factor model is IFA (Mislevy, 1986; Muthén, 1978; Wirth & Edwards, 2007). Categorical confirmatory factor analysis (CCFA) assumes that a researcher is focused on the measurement features of responses to categorical items and that these items are "discrete representations of continuous latent responses" (Wirth & Edwards, 2007, p. 59). Items with these characteristics range from dichotomous responses common in education settings (i.e., correct or incorrect, true or false) to polytomous responses, or ordered categorical responses, now common in most psychological disciplines (i.e., Likert-type scales such as agreement, self-description).

It is not statistically appropriate to ignore the categorical nature of items and apply the common factor model to such data; attempting to do so can result in a variety of problems. Perhaps most central, categorical response variables are bounded by the minimum and maximum response options (e.g., 0 for an item that is not endorsed and 1 for an item that is endorsed), and only through nonlinear functions (such as the normal ogive and logistic cumulative distribution functions) is one able to place discrete item responses on the scale of the unobserved continuous latent variable (McDonald & Ahlawat, 1974). If ignored, the model is misspecified, and the linear approximation of the nonlinear relationship will be heavily influenced by the observed variable's mean (McDonald, 1999; Mislevy, 1986). Second, fitting a linear factor analysis model to categorical data often leads to biased parameter estimates (DiStefano, 2002). In addition, factor loadings may be attenuated in cases of sparse response coverage (such as skewness), which is typical of scale items with response options that measure only extreme trait locations. The accumulation of these problems results in untrustworthy model fit statistics.

In parallel with emergent developments for the factor analysis of dichotomous and ordered categorical data in the mid-1980s, important advances in the IRT tradition also allowed for analyses of these types of items (Mislevy, 1986). Examples from early adopters of these two traditions in psychology and related fields include assessment of depression (Childs, Dahlstrom, Kemp, & Panter, 1992; Schaeffer, 1988), need for cognition (Tanaka, Panter, & Winborne, 1988), self-monitoring (Panter & Tanaka, 1987), and psychopathology and personality (e.g., Reise & Waller, 1990), as well as several studies focusing on general issues in item order, scale design, and administration (Panter, Tanaka, & Wellens, 1992; Steinberg, 1994; Waller & Reise, 1989). Review articles and early conference presentations were specifically targeted at communicating these developments in the analysis of item response data to psychological researchers (e.g., Panter, Swygert, Dahlstrom, & Tanaka, 1997; Reise, Widaman, & Pugh, 1993; Steinberg & Thissen, 1995; Thissen, 1992).
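To make the covariance structure in Equation 2 concrete, the short sketch below computes the model-implied covariance matrix for a one-factor, four-item model. The loading values are hypothetical illustrations, not taken from any data set discussed in this chapter.

```python
import numpy as np

# Hypothetical one-factor model with four standardized continuous indicators.
lam = np.array([[0.8], [0.7], [0.6], [0.5]])  # factor loadings, Lambda
phi = np.array([[1.0]])                       # factor variance, Phi
psi = np.diag(1.0 - (lam ** 2).ravel())       # diagonal unique variances, Psi

# Equation 2: Sigma = Lambda Phi Lambda' + Psi
sigma = lam @ phi @ lam.T + psi

# With standardized variables, the diagonal of Sigma is 1 and each
# off-diagonal entry is the product of the two items' loadings
# (e.g., sigma[0, 1] is 0.8 * 0.7, about .56).
```

Under these assumptions the off-diagonal entries are the model-implied correlations, which is what a fitted common factor model tries to reproduce from the observed correlation matrix.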


Thissen and Steinberg (1986) described a taxonomy of models in the psychometric literature to properly analyze categorical response data in a confirmatory analytic setting, allowing items to be linked to underlying latent variables. Many of these models originated in the IRT tradition, including one-, two-, and three-parameter logistic models for binary data, the graded-response model (GRM; Samejima, 1969), the generalized partial credit model (Muraki, 1992) for ordered categorical response items, models for nominal response data (Bock, 1972), and many others. For simplicity of discussion, however, we review the binary–dichotomous item response case (e.g., agree or disagree) as well as a generalization to model polytomous item responses (e.g., Likert-type items).

For dichotomous items, a respondent's location on the continuous underlying trait (ξ in factor analysis notation and θ in IRT notation) is expressed through her choice of response option. In other words, one's location on the latent variable, y*i, is modeled through the common factor and item-specific residual:

y*i = λiξ + ei,  (3)

where the continuous latent response y*i has a variance of 1.0. Assuming that higher scores reflect more of a given trait, the relationship between observed item responses yi, location on the trait y*i, and the threshold parameters is

yi = 1 if y*i ≥ τi
yi = 0 if y*i < τi.  (4)

The answers that respondents provide to a given scale item can be arranged along the latent response distribution y*i by thresholds, τi. Thus, a threshold represents the location on the latent continuum that separates discrete responses to a given item.

The Item Response Theory Tradition

Chapter 36 of this volume reviews traditional IRT modeling and several new directions in the field. We discuss the more common set of IRT models, their assumptions, and their uses to motivate our discussion of the close relation between IFA and IRT.

When binary data are modeled in an IRT context, one often uses logistic functions. While capitalizing on the mathematical convenience of the model (Birnbaum, 1968, p. 400), the logistic function closely follows the cumulative normal distribution. The two-parameter logistic (2PL) model may be written as

T(ui = 1 | θ) = 1 / (1 + exp[−Dai(θ − bi)]).  (5)

In keeping with Lazarsfeld's (1950) classic notation, T traces the probability of a correct response, u = 1, to item i conditional on ability θ, whereas an incorrect response has the probability T(0 | θ) = 1 − T(1 | θ), or T0 = 1 − T1(θ) in more compact notation. The logistic function describes each response probability on the basis of the slope or discrimination parameter (ai), which indicates the strength of association between the item response and the latent dimension, and a difficulty or threshold parameter (bi), which indicates the location on the ability continuum at which an individual's probability of a correct response is 50%. The constant D (approximately 1.7) is commonly used to place the model on the same scale as the normal ogive, the precursor to the logistic representation (Lord, 1952).

For polytomous (ordered categorical) response data, a generalization of the 2PL, Samejima's (1969) GRM, is often used. The GRM describes the probability of a response in category k or higher, where k = 0, 1, . . ., m − 1:

T*(yi = k | θ) = 1 / (1 + exp[−ai(θ − bik)]),  (6)

noting that T*(0 | θ) = 1 and T*(m | θ) = 0. Here the probability of responding in category k is the difference between the probabilities of responding in category k or higher and in category k + 1 or higher:

Ti(k | θ) = Ti*(k | θ) − Ti*(k + 1 | θ).  (7)

The model describes the response process by estimating one less threshold than the number of response alternatives.
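A minimal numerical sketch of Equations 5 through 7 follows; all item parameter values are invented for illustration and do not come from the chapter.

```python
import numpy as np

D = 1.7  # scaling constant placing the logistic close to the normal ogive

def p_2pl(theta, a, b):
    """Equation 5: 2PL probability of a correct/endorsed response."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def grm_probs(theta, a, bs):
    """Equations 6-7: category probabilities under Samejima's graded model.

    bs holds the m - 1 ordered thresholds of an m-category item. The
    cumulative curve T*(k | theta) is the probability of responding in
    category k or higher; each category probability is the difference of
    adjacent cumulative curves.
    """
    cum = [1.0]  # T*(0 | theta) = 1
    cum += [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]
    cum.append(0.0)  # T*(m | theta) = 0
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# When theta equals the difficulty b, the 2PL response probability is .50.
half = p_2pl(theta=0.0, a=1.2, b=0.0)

# Four-category item: the category probabilities are nonnegative and sum to 1.
probs = grm_probs(theta=0.5, a=1.5, bs=[-1.0, 0.0, 1.0])
```

Because the thresholds in `bs` are ordered, the cumulative curves are nested, so the differences in Equation 7 can never be negative.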


Each threshold or cutpoint marks the boundary between responses in categories k and k + 1. Although such a model is useful for a variety of psychological scales with ordinal response category items, many other IRT models characterize item responses (for a review, see Embretson & Reise, 2000).

Relations Between the Two Traditions

The histories and notational differences between Jöreskog's (1969) common factor model and the IRT models popularized by Lord and Novick (1968) seem to indicate some inherent difference across the traditions. Actually, the relation between IRT and CCFA frameworks has been presented by Bartholomew (1983), Muthén (1983), and Muthén and Lehman (1985). Takane and de Leeuw (1987) have provided the mathematical equivalence between the common models. Indeed, under certain conditions, IFA can be conducted entirely in an IRT framework.

Given item parameters in either IRT or IFA, one may translate and back-translate between parameterizations. IRT-equivalent slopes and thresholds may be obtained easily from factor analysis notation:

ai = (λi / √(1 − λi²))D  and  bi = τi / λi,  (8)

where a and b are as previously defined, and τi represents the location on the latent variable scale that distinguishes between responses. The square root of one minus the squared factor loading is the residual standard deviation of y*i (with y*i standardized to have a variance of 1.0, as is common in many IFA approaches). Additionally, if the IRT parameters are in the normal metric, the scaling factor D is not needed and drops out of the equation. With a little algebra, one may translate the IRT parameters back into factor analysis notation loadings (Λ) and thresholds (τ):

λi = (ai/D) / √(1 + (ai/D)²)  and  τi = (ai/D)bi / √(1 + (ai/D)²).  (9)

Actually, most software programs estimate the model in a slope-intercept parameterization. This parameterization utilizes the logit, Dai(θ − bi), and, after multiplying the slope through, provides the intercept, −aibi, which, when D = 1, is the log odds of a correct response at θ = 0. In typical IRT applications, however, intercepts are converted into the common slope-threshold parameterization by dividing the intercept by the negative slope.

As Wirth and Edwards (2007) made clear, there is some practical utility in these conversions. In many instances these translations help identify unreasonable parameter estimates. Item-level data are virtually never perfectly related to the latent trait. More often, items contain some degree of error variance, so when this error variance becomes suspiciously low, one can anticipate that item parameters are untrustworthy. Wirth and Edwards (2007) suggested skepticism when interpreting factor loadings > .95 (and hence a parameters above about 3) and objection to loadings > .97 (and a parameters above about 4). In both instances, such values would be considered near-Heywood (1931) cases.

Estimating Models Within the Two Traditions

If one considers IFA to be the intersection of traditional factor analysis and IRT, then it may be helpful to consider why these closely related techniques evolved from largely separate literatures. In part, the distinction has less to do with differences in the hypothesized models for item responses and more to do with differences in estimation employed within each tradition (Mislevy, 1986). Estimation methods for CCFA traditionally made use of the sample tetrachoric (for binary–dichotomous data) and polychoric (for polytomous–ordered categorical data) correlation matrix (Christofferson, 1975; Muthén, 1978, 1984) in addition to a weight matrix that grew substantially with the number of items. Mixtures of item types, which often occur in research settings in psychology, could also be handled, as could models with many factors (e.g., SEM). Two widely used least squares estimators that make use of the sample tetrachoric–polychoric correlation matrix are (a) diagonally weighted least squares (Jöreskog & Sörbom, 2001) as used in LISREL (Jöreskog & Sörbom, 2001), and (b) mean- and variance-adjusted weighted least squares (WLSMV; Muthén, du Toit, & Spisic, 1997) as used in Mplus (Muthén & Muthén, 1998–2007).


Both estimators use only the diagonal elements of a weight matrix (see Jöreskog, 1990; Muthén, 1984), thereby greatly reducing the burdensome operation of inverting a weight matrix for models with many items (for some other estimators, see Christofferson, 1975; Jöreskog & Sörbom, 2001).

An alternative to weighted least squares estimation is full-information maximum likelihood (FIML), traditionally used in IFA and SEM approaches. Rather than employing a polychoric correlation matrix, FIML takes full advantage of the entire data matrix. Here, the underlying data structure is reproduced by way of the matrix of factor loadings (Λy), covariance matrixes of latent variables (Φξ), and measurement errors (Θδ). The FIML method produces unbiased standard errors for the model parameters (factor loadings, interrelations among latent variables, measurement errors) and model fit indexes, unlike adjusted weighted least squares approaches, which require corrections for biased standard errors and fit indexes (see Satorra & Bentler, 1994). FIML introduces the problem of multiple dimensions, however. Specifically, integration—finding the area of a region defined by a function—must be approximated by a number of quadrature points (e.g., Gauss-Hermite) over the number of dimensions. So, for many reasonable-size models (and quadrature points), computing time is calculated in hours, a problem less often encountered when using least squares estimators.

Alternatively, estimation of IRT parameters is typically conducted using maximum likelihood. Birnbaum's (1968, p. 420) use of joint maximum likelihood (JML) provided the first directly estimated IRT parameters.1 With JML, the parameters describing how the item behaves and the score reflecting a person's level on the underlying trait dimension are simultaneously estimated. Currently, the popularity of JML has been surpassed by the use of marginal maximum likelihood (MML; Bock & Lieberman, 1970), which, rather than attempting to estimate item and person parameters simultaneously, considers person parameters as nuisance parameters (and integrates over them), leaving only the item parameters to be estimated in the marginal distribution. Application of the expectation–maximization (EM) algorithm (Bock & Aitkin, 1981) alleviated much computational time for models with few factors and many items by iteratively estimating trial item parameters, then using these to find the expected number of responses and the proportions of individuals at given levels of the latent variable (E-step), and finally resubstituting these values back into the likelihood equation (M-step). MML/EM is used in widely available IRT software, such as BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003), MULTILOG (Thissen, 2003), and, soon, IRTPRO (Cai, du Toit, & Thissen, 2011).

More recently, advances in parameter estimation have overcome the challenge of dimensionality posed by using MML/EM and have made high-dimensional IRT and IFA models more tractable (e.g., multidimensional IRT [MIRT]; Reckase, 2009). That is, when researchers measure items discretely, they can now model a larger number of underlying factors than was previously possible in traditional unidimensional IRT models. Here, we briefly highlight three recent developments that, with time, may see more use from applied researchers in psychology and the behavioral sciences with high-dimensional data: (a) EM with adaptive quadrature (ADQ), (b) Markov chain Monte Carlo (MCMC), and (c) the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for maximum likelihood estimates (MLEs). For a moderate number of dimensions, ADQ is an attractive alternative to MML/EM. Instead of using fixed-point quadrature, ADQ adapts the number of points needed so that the estimation process becomes more efficient (Schilling & Bock, 2005). More recently, MH-RM (Cai, 2010a) uses a Metropolis-Hastings (Hastings, 1970) Robbins-Monro algorithm (Robbins & Monro, 1951), which has enabled efficient estimation of high-dimensional models that had previously been intractable. (MH-RM will be an estimation option in the software package IRTPRO; Cai et al., 2011.)

1 Rasch (1960) provided a substantially simpler estimation method, conditional maximum likelihood (CML), which capitalizes on the fact that summed item scores are sufficient statistics. CML estimates item difficulty (the only item parameter estimated; discrimination parameters are 1.0) conditional on summed scores.
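The marginal probability that MML maximizes can be sketched with fixed-point Gauss-Hermite quadrature. The three-item 2PL test below is entirely hypothetical; the point is to show the integral over θ being replaced by a weighted sum over quadrature points, and why the point count explodes with added dimensions.

```python
import numpy as np

# Gauss-Hermite rule adapted to a standard normal density: rescale the
# nodes by sqrt(2) and normalize the weights by sqrt(pi).
nodes, weights = np.polynomial.hermite.hermgauss(21)
theta = nodes * np.sqrt(2.0)
w = weights / np.sqrt(np.pi)

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical three-item test: slopes and difficulties.
a_par = [1.0, 1.5, 0.8]
b_par = [-0.5, 0.0, 0.5]

def marginal(pattern):
    """Quadrature approximation of the integral of P(pattern | theta)
    against the standard normal density, i.e., the marginal probability
    of one observed response pattern."""
    cond = np.ones_like(theta)
    for a, b, u in zip(a_par, b_par, pattern):
        p = p_2pl(theta, a, b)
        cond *= p if u == 1 else 1.0 - p
    return float(np.sum(w * cond))

# The marginal probabilities of all 2**3 response patterns sum to 1
# (up to quadrature error). With d latent dimensions, a fixed-point grid
# needs 21**d points, which is the cost that ADQ and MH-RM try to avoid.
patterns = [(u1, u2, u3) for u1 in (0, 1) for u2 in (0, 1) for u3 in (0, 1)]
total = sum(marginal(pat) for pat in patterns)
```

In an actual MML/EM implementation, the E-step uses these per-point conditional likelihoods to compute expected response counts at each θ point before the M-step updates the item parameters.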


Finally, MCMC algorithms allow for the inspection of the quality of other maximum likelihood estimates by constructing a Markov chain with a stationary distribution as its target, at which point samples (i.e., random draws) taken from the chain serve as posterior estimates. Although MCMC is an attractive solution to dimensionality problems in FIML contexts, the efficiency of current MCMC techniques (i.e., the computational burden; Edwards, 2010) makes its use, from an applied researcher's perspective, somewhat limited. Hopefully, however, these new estimation approaches will provide psychologists with the tools to test models whose complexity exceeded the boundaries of past software.

The preceding discussion highlighted two analytic traditions that focus on understanding the underlying factor structure of item-level data with different measurement levels. The less than continuously measured items present more of an analytic challenge than do the continuously measured indicators. In the next section, we consider current research on a related approach that investigators have used to circumvent complexities associated with analyses conducted at the item level.

If Items Are More Complicated to Analyze, Why Not Parcel Them?

Under certain circumstances, a researcher may decide to aggregate (i.e., create a sum score from) a collection of items to serve as factor indicators, rather than using individual items as indicators, before conducting a factor analysis. This practice is known as parceling.2 Parceling involves splitting a relatively large number of items thought to represent a latent variable into a smaller set of sum scores. These sum scores are then used in place of the individual items to identify the latent variable.

Little, Cunningham, Shahar, and Widaman (2002) and Little, Lindenberger, and Nesselroade (1999) described a latent variable's domain as consisting of an infinite pool of potential items, each deviating to some degree from the centroid of the latent variable. Assuming that the expected value of these deviations is zero, and assuming that each item's error is uncorrelated with every other item's error, a parcel that is constructed of a randomly selected sample of items should be, on average, a less biased and more reliable indicator of the latent construct. If, however, items are not conditionally independent from one another after accounting for the common factor (i.e., item responses are related for a reason other than the underlying dimension, such as when two or three items share similar word stems), then parcels of these correlated items will be biased away from the latent variable's centroid. If it makes sense to conceive of sampling from an infinite pool of independently distributed items in a given content domain, then using parcels will represent the latent construct in a more stable, replicable way than if the same number of items were used to represent the same latent construct. Indeed, a frequently cited benefit of parceling is improved indicator reliability (Cattell & Burdsal, 1975). The Spearman–Brown prophecy formula reveals that if each item representing a latent construct consists partially of true score variability and partially of error variance, then the sum of several such items will contain a higher proportion of true score variability than any individual item (Coffman & MacCallum, 2005).

Coffman and MacCallum (2005) advocated using parcels as latent variable indicators when a large number of items is needed to achieve adequate representation of the latent variable domain and when it is implausible to use individual items as factor indicators. For example, if a researcher has a very large number of items to factor analyze, IFA estimation in both of the traditions is more difficult. In such situations, when there are too many items to estimate a latent variable using individual items, researchers are forced to choose between (a) creating parcels so that model estimation is possible (because estimation with many discrete items is computationally intensive) or (b) aggregating the items into a single measured variable (e.g., by summing the items or outputting an estimated factor score).

2 In IRT, aggregated indicators of latent variables are called testlets. In this setting, testlets are formed to alleviate local dependence between correlated items that is irrelevant to the latent variable of interest, often resulting from related test sections or similar item stems (Wainer, Bradlow, & Wang, 2007).


Coffman and MacCallum demonstrated that the parceling method is superior to alternative analysis options, such as path analysis, that do not allow the explicit modeling of unreliability of measurement.

Especially relevant to IFA, many researchers who use parcels do so to meet normality assumptions of their estimation method. When items are Poisson-distributed counts, dichotomous, or ordinal with only a few response categories (e.g., a 4-point Likert-type scale), individual item distributions will badly violate normality assumptions. If normality is assumed, parameters obtained by maximum likelihood (ML) or generalized least squares (GLS) estimation will be downwardly biased. In other words, using this approach will lead to serious consequences: Standard errors for factor loadings will be underestimated, and chi-square tests of model fit will be too high (West, Finch, & Curran, 1995).

West et al. (1995) suggested three potential solutions for handling nonnormal factor indicators. First, they suggested that researchers can use the Satorra–Bentler correction for nonnormal data in conjunction with the ML estimator for a better approximation of the model chi-square statistic, factor loadings, and associated standard errors (Satorra & Bentler, 1994). Second, it may be feasible to use an alternative estimator that does not require items to be normally distributed, such as WLSMV. Flora and Curran (2004) found that WLSMV works well even if the latent distribution underlying the discretely distributed observed variable is not normally distributed. Finally, West et al. (1995) suggested that a researcher may create parcels so that factor indicators more closely approximate a normal distribution for use with ML or GLS.

Hau and Marsh (2004) evaluated the use of parcels as a technique for handling nonnormally distributed items when the WLSMV assumption of continuous latent underlying variables is not plausible. They compared this method to the technique of using nonnormally distributed items as indicators with the Satorra–Bentler (1994) correction. The authors did not find support for the claim that factor loading estimates would be less biased when parcels are used, but they did find that estimates were less variable with this technique. The Satorra–Bentler correction resulted in less biased parameter estimates. Bandalos (2008) compared the parceling strategy to WLSMV estimation with nonnormal items and found that parameter estimates were biased when parcels were used, particularly when unidimensionality of parcels was violated. This result was obtained for both factor loadings and structural parameter estimates between latent factors. Furthermore, Bandalos showed that model fit was overestimated (i.e., thought to be better than it was) if parcels were used when unidimensionality was violated. In contrast, estimates obtained using WLSMV estimation were unbiased, particularly when sample size increased.

Several authors have shown that parceling leads to overly optimistic model fit indexes when multidimensionality is present (Bandalos, 2008; Little et al., 2002). When locally dependent items are combined into a parcel, the irrelevant item correlation is attributed to shared variance because of the common factor, thus masking multidimensionality and artificially inflating model fit. Bandalos pointed out that unplanned multidimensionality is common; method factors are one example of this. In addition to inflating model fit, the masking of a multidimensional factor structure leads to confounded and uninterpretable latent constructs (Hagtvet & Nasser, 2004). Given the potential pitfalls of parceling with locally dependent items, Coffman and MacCallum (2005) and Little et al. (2002) recommended testing the unidimensionality assumption and proceeding with parceling only if the assumption is reasonable.3

3 If violations of unidimensionality exist, the joint probability of item responses is no longer equal to the product of marginal probabilities, which leads to biased parameter estimates (Reckase, 1979) and overestimates of score precision (Thissen, Steinberg, & Mooney, 1989). Methods for detecting violations of unidimensionality may be broadly categorized as those stemming from unidimensional models that are diagnostic tests (e.g., Chen & Thissen, 1997; Yen, 1984), those that introduce additional model parameters and latent variables to account for local dependence (e.g., Bradlow, Wainer, & Wang, 1999; Hoskens & De Boeck, 1997), and those that test the assumption of conditional independence (e.g., Stout, 1987). Diagnostic methods are useful data analytic tools for researchers interested in identifying locally dependent pairs or subsets of items. Modeling the local dependence directly is useful in situations in which local dependence is expected and is a requirement of the test (e.g., the use of testlets in modeling passage-dependent items; Wainer & Kiely, 1987).
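The attenuation problem that motivates polychoric-based estimators such as WLSMV can be seen in a small simulation. All values are invented; the skewed cutpoints mimic a 4-point item that measures mostly one end of the trait.

```python
import numpy as np

rng = np.random.default_rng(1)
n, latent_r = 50_000, 0.6

# Two underlying continuous responses correlated at .6.
z = rng.multivariate_normal([0.0, 0.0],
                            [[1.0, latent_r], [latent_r, 1.0]], size=n)

# Discretize each into 4 ordered categories with skewed cutpoints, as with
# a 4-point Likert-type item whose options cover only one trait extreme.
cuts = [0.5, 1.0, 1.5]
y = np.digitize(z, cuts)  # category codes 0-3 for each column

# Treating the categories as continuous attenuates the correlation relative
# to the underlying variables' correlation, which is the quantity a
# polychoric correlation is designed to recover.
pearson_r = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
```

Fitting a linear factor model to such Pearson correlations is what produces the attenuated loadings and distorted fit statistics described earlier; estimating the model from the tetrachoric–polychoric matrix instead sidesteps the problem.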


parcels are more reliable than individual items and may allow researchers to represent more fully the domains of their latent constructs. Parcels seem to be the best option when too many items are present to estimate a full measurement model. On the other hand, if the relation between items and factors is substantively meaningful, parcels may mask true relations that exist within the data, such as local dependence or a complex factor structure (Little et al., 2002). Furthermore, alternative methods of estimation exist that negate the necessity of parceling in most circumstances, and these alternatives have been empirically demonstrated to produce less biased estimates than those obtained with parceling.

Given these contrasting arguments, Little et al. (2002) suggested using parcels when unidimensionality is ensured, when constructs have been well established and defined, and when the measurement model is not of primary interest. That is, parceling may be a reasonable technique to use when the structural relations among latent variables are more interesting than the measurement model. Alternative methods, such as analyses from the factor analytic tradition, should be considered before proceeding with a parceling strategy.

Complexities in Item Factor Analysis

Measurement Invariance

Testing whether the factor structure of items is similar, or invariant, across independent groups is an important extension of the IFA problem.4 Inferences about group differences are only accurate if the latent construct being measured is invariant across groups.

4. The factor analytic idea of measurement invariance parallels the concept of differential item functioning (DIF) within IRT. DIF exists when, controlling for individuals' true latent variable score, the conditional probability of answering an item correctly is not the same across groups of individuals (i.e., measurement noninvariance). This is a major concern for psychometricians who are responsible for creating bias-free standardized tests.

Types of Invariance

Configural invariance occurs when the same factor structure exists across groups (Thurstone, 1947). If factor loadings are equivalent across groups such that a one-unit increase in the latent variable mean is associated with an identical λ unit increase in the expected value for all items, regardless of group membership, then weak factorial invariance exists (Horn & McArdle, 1992; Millsap, 1997; Millsap & Kwok, 2004; Millsap & Meredith, 2007; Thurstone, 1947; Widaman & Reise, 1997). If weak factorial invariance is met and item intercepts and thresholds are constant across groups, then there is strong factorial invariance (Meredith, 1993; Millsap & Kwok, 2004; Millsap & Meredith, 2007; Steenkamp & Baumgartner, 1998). Strong factorial invariance is desirable because it means that no systematic differences are present in the measurement models across groups. Partial invariance occurs when some, but not all, factor loadings, means, and thresholds are invariant (Millsap & Kwok, 2004). When strong invariance holds and unique factor variances are also equivalent across groups, then strict factorial invariance is present (Meredith, 1993; Millsap & Meredith, 2007). Strict factorial invariance implies that systematic group differences in item means and covariances are solely a function of group differences in factor means and covariances. According to McArdle (2007) and Meredith and Horn (2001), strong invariance should be expected to hold if two subgroups are equivalent with respect to the latent variable of interest; however, strict invariance is not necessarily expected to hold.

Partial Invariance and Other Complications

If measurement invariance is not met in studies involving predictive relations among latent variables, then regression parameter bias will be present (Humphreys, 1986; Millsap, 1998). If item intercepts vary across groups, but the noninvariance is ignored such that the noninvariant measurement model is used to test structural hypotheses, then not only will group differences in regression parameter estimates in the structural part of the model represent any true group differences in the relations between latent variables, but also the structural parameter estimates will be confounded with group differences in measurement.
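Why the invariance hierarchy matters can be made concrete with a small numeric sketch. This is our illustration, not the chapter's; the loading, intercept, and factor-mean values are invented, and the expected item score follows the linear factor model, E[y] = τ + λκ.

```python
import numpy as np

# Illustrative (made-up) measurement parameters for one item:
lam = 0.8                          # factor loading, equal across groups (weak invariance)
tau_ref = 1.0                      # item intercept, reference group
kappa_ref, kappa_foc = 0.0, 0.5    # latent factor means in the two groups

def expected_item_mean(tau, kappa):
    """Expected item score under the linear factor model: tau + lam * kappa."""
    return tau + lam * kappa

# Strong invariance: intercepts equal across groups, so the observed item
# mean difference is exactly lam times the factor mean difference.
diff_strong = expected_item_mean(tau_ref, kappa_foc) - expected_item_mean(tau_ref, kappa_ref)
assert np.isclose(diff_strong, lam * (kappa_foc - kappa_ref))

# Intercept noninvariance: a shifted intercept in the focal group changes the
# observed mean difference even though the factor mean difference is unchanged.
tau_foc = 1.3
diff_biased = expected_item_mean(tau_foc, kappa_foc) - expected_item_mean(tau_ref, kappa_ref)
print(diff_biased - diff_strong)   # ~0.3: measurement difference masquerading as a group effect
```

The residual 0.3 here is pure measurement noninvariance; a structural model that ignores it would attribute that amount to a group difference on the latent variable.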

Thus, it is necessary to test for measurement invariance before proceeding to tests of predictive models and making inferences about true group differences. Millsap (1998) described a formal test of measurement invariance that is necessary to ensure that observed structural differences are due to differences in the population rather than to differences in measurement. Millsap and Tein (2004) extended this work to include simple factor models with ordered categorical or dichotomous response types.

Before providing an IFA application, it is important to note that there are a variety of ways in which item factor analytic methods can be conducted depending on the analyst's goals. In situations in which researchers have little a priori knowledge of the factor structure of a set of items, the data analytic process might begin by fitting unrestricted EFA models where items have as many loadings as factors (see Jöreskog, 1990; Jöreskog & Moustaki, 2001; Muthén, 1984). It is more common in the IFA tradition to begin the data analysis process by fitting models that are restrictions on the general EFA framework (i.e., constraining some or many factor loadings to be zero concurrent with the research hypothesis). Often these models are variations on hierarchical models, where all items receive a loading from a general factor that is assumed to underlie all the items, with subsets of the items receiving loadings that account for a shared association specific to the subset of items but that are above and beyond the relationship accounted for by the general factor.

As an alternative to bifactor or hierarchical models, and useful in situations in which multidimensionality may not be explicitly expected, there is often utility in beginning the data analytic process by fitting a single-factor model and then considering local dependence (LD) statistics as evidence of nuisance or extra dimensionality. This approach may be preferred in situations in which the analyst does not begin with the expectation of multidimensionality. In less common situations, where the analyst has no prior beliefs about the structure of the items, EFA may still provide a satisfactory starting position. In the application that follows, we conduct EFA initially to show the strengths and weaknesses of such an approach and then move into more traditional IFA models.

An Application of IFA

In this final section we provide a brief example of IFAs, involving less than continuously measured items, that merges the two IFA traditions that we have discussed. The data are from the Patient Reported Outcomes Measurement Information System (PROMIS), a multisite project that aims to develop self-reported item banks for clinical research. Although content domains in many areas of health are included as part of this project, for the purposes of our example, we focus on the emotional distress domain.

Anxiety and depressive symptoms items were split between two test administration forms. For brevity, we report only the results of Form 1, which had 759 children respond to 10 anxiety items and 10 depressive symptom items (Irwin et al., 2010). All items had the same 5-point response scale with the options never (0), almost never (1), sometimes (2), often (3), and almost always (4).

Our first step was to examine the factor structure of the individual items to determine whether there was a single dimension underlying them (a precondition for unidimensional IRT). Initial EFA models were fit to the Form 1 data, as shown in Table 35.1. The factor structure generally revealed depressive symptoms and anxiety factors but not simple structure. A close inspection reveals many instances of items loading (in part) on the incorrect factor. Had commonly used techniques for deciding the number of factors been used (scree plot, fit indexes, magnitude of factor loadings, inspection of eigenvalues), two factors would have been extracted.

Taking an IFA approach to this problem would consider item-level characteristics (e.g., item content and the location of the item on the scale) in identifying subsets of locally dependent items while resolving dimensionality concerns. In such instances, a bifactor model, which estimates two loadings for each item (a nonzero general factor loading and a group-specific loading; here, depressive symptoms and anxiety), serves as an excellent compromise between an EFA and a traditional simple-structure CFA.
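The eigenvalue-inspection heuristic mentioned above can be sketched in a few lines. This is our illustration, not the PROMIS analysis: the data are simulated under clean simple structure, and Pearson correlations of continuous scores stand in for the polychoric correlations an ordinal analysis would use. The sample size and the factor correlation of .55 echo the Form 1 example; the loadings are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_items = 759, 20                              # sample size mirrors Form 1; data are simulated

# Two correlated factors (e.g., anxiety and depressive symptoms).
phi = np.array([[1.0, 0.55], [0.55, 1.0]])        # factor correlation as in Table 35.1's note
factors = rng.multivariate_normal([0.0, 0.0], phi, size=n)

# Simple structure: items 0-9 load on factor 1, items 10-19 on factor 2.
loadings = np.zeros((n_items, 2))
loadings[:10, 0] = 0.7
loadings[10:, 1] = 0.7
errors = rng.standard_normal((n, n_items)) * np.sqrt(1 - 0.7**2)
y = factors @ loadings.T + errors                 # standardized item scores

# Eigenvalues of the item correlation matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(y, rowvar=False)))[::-1]
n_factors = int(np.sum(eigvals > 1.0))            # Kaiser criterion: eigenvalues > 1
print(n_factors)                                  # 2
```

With clean structure the heuristic recovers two factors; as the chapter's example shows, it says nothing about the smaller, LD-induced dimensionality that an IFA-based inspection can still detect.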

In this case, a bifactor model is particularly relevant because the goal of the analysis is both to determine whether depressive symptoms and anxiety are best treated as separate dimensions, and hence scales, while also identifying sources of LD (i.e., violations of unidimensionality). Finally, detecting LD in this manner allows the researcher to control the dimensionality of the scale. If LD is detected in either pairs or subsets of items, selecting one item from each grouping, and setting aside the LD-inducing items, should result in a unidimensional set of items, as the residuals of the offending items will no longer covary with the selected item.

Table 35.1
Exploratory Factor Loadings for 20 PROMIS Anxiety (A) and Depression (D) Items

Item | Factor 1 | Factor 2
A1. I got scared really easy. | .88 | −.12
A2. I felt afraid. | .89 | −.10
A3. I worried about what could happen to me. | .65 | .14
A4. It was hard for me to stop worrying. | .59 | .24
A5. I woke up at night scared. | .55 | .25
A6. I worried when I was away from home. | .42 | .27
A7. I was afraid that I would make mistakes. | .34 | .37
A8. I felt nervous. | .40 | .27
A9. It was hard for me to relax. | .30 | .42
A10. I felt afraid or scared. | .64 | .23
D1. I wanted to be by myself. | −.08 | .47
D2. I felt that no one loved me. | .04 | .76
D3. I cried more than usual. | .39 | .44
D4. I felt alone. | .06 | .75
D5. I felt like I couldn’t do anything right. | .08 | .75
D6. I felt so bad that I didn’t want to do anything. | .12 | .64
D7. I felt everything in my life went wrong. | .06 | .78
D8. Being sad made it hard for me to do things with my friends. | .28 | .58
D9. It was hard to do school work because I felt sad. | .25 | .59
D10. I felt like crying. | .42 | .43

Note. Model fit using Crawford-Ferguson varimax oblique rotation and mean- and variance-adjusted weighted least squares estimation. The correlation between factors is 0.55. χ2(88) = 529, comparative fit index = 0.91, Tucker-Lewis index = 0.97, root-mean-square error of approximation = 0.08.

A modified bifactor model was then fit using Mplus with WLSMV estimation. This model is considered a modified version of the traditional bifactor model because potential sources of LD identified in the expanded EFA were modeled as subfactors (i.e., three or more items receiving an additional factor loading) and correlated errors, which represent correlations between the residuals after accounting for the covariance occurring for the general and domain-specific factor (Table 35.2).5 Goodness-of-fit indexes suggested that the augmented bifactor model fit the data well: χ2(76, N = 621) = 247.83, comparative fit index = .95, Tucker-Lewis index = .99, root-mean-square error of approximation = .06.

5. If using FIML estimation, the Gibbons and Hedeker (1992) method would apply. In this context, integration would occur over eight orthogonal dimensions (1 general, 2 domain specific, 1 subfactor, and 4 error correlations), which with four quadrature points per dimension amounts to 65,536 points. Cai (2010b) considered a similar analytic problem, and using a prototype of IRTPRO with the MH-RM algorithm, reached model convergence in under 3 minutes compared with an ML/EM solution with adaptive quadrature, which took more than 4 hours.

Used in this fashion, the bifactor model provides information regarding the intended dimensionality and the presence of nuisance multidimensionality. The PROMIS researchers were initially interested in determining whether these data provided evidence that emotional distress was a single dimension, or whether distinguishable individual variation occurred between the anxiety items and then again between the depressive symptoms items. The fact that substantial loadings differed significantly from zero on the group-specific factor for the depressive symptoms items in Table 35.2 indicated that the covariation among the item responses could not be adequately explained with the theory that a single emotional distress dimension of individual differences underlies responses to all of the items. It is a curiosity of the data that the general factor in Table 35.2 is anxiety-dominated negative affect, leaving little unique variance for the anxiety group-specific factor. This fact is noted by comparing the ratio of the general factor to the domain-specific factors and demonstrated in the large number of nonsignificant loadings on the anxiety factor.

The analysis also highlights the precision with which IFA can detect nuisance dimensionality.
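The role of the correlated errors in this modified bifactor model can be sketched as follows. This is our illustration, not the chapter's computation: the four loadings echo a few Table 35.2 entries, but the logic is generic. With orthogonal factors, the model-implied correlation between two standardized items is the sum over factors of their loading products; a doublet residual correlation adds covariance beyond what the general and group-specific factors explain.

```python
import numpy as np

# Hypothetical standardized loadings for four items: column 0 is the general
# factor, column 1 a group-specific (depressive symptoms) factor.
lam = np.array([
    [0.66, 0.24],   # "I felt like crying."          (values echo Table 35.2)
    [0.64, 0.23],   # "I cried more than usual."
    [0.59, 0.50],   # "I felt alone."
    [0.60, 0.47],   # "I felt that no one loved me."
])

# Orthogonal factors: implied item correlations are lam @ lam.T off the diagonal.
implied = lam @ lam.T
unique_var = 1.0 - np.diag(implied)        # residual variances of standardized items

# A doublet residual correlation for the "crying" pair (items 0 and 1) adds
# covariance beyond what the general and specific factors account for.
resid_corr = 0.49                          # as reported for the crying doublet
extra = resid_corr * np.sqrt(unique_var[0] * unique_var[1])
implied_ld = implied.copy()
implied_ld[0, 1] += extra
implied_ld[1, 0] += extra

print(round(implied[0, 1], 3), round(implied_ld[0, 1], 3))  # 0.478 0.733
```

The jump from .478 to .733 for the crying pair is the "asking the same question twice" surplus: correlation the factor structure alone cannot reproduce, which is why the pair is flagged as locally dependent.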

Table 35.2
Factor Loadings and Residual Correlations for an Augmented Bifactor Model Fitted to the Items on Form 1

Item stem | General factor | Anxiety | Depressive symptoms | Afraid/scared | Doublet residual correlation
I felt afraid. | .68 | .11 | | .67 |
I got scared really easy. | .64 | .30 | | .38 |
I felt afraid or scared. | .76 | .15 | | .20 |
It was hard for me to stop worrying. | .73 | .11 | | | .32
I worried about what could happen to me. | .70 | .10 | | |
I woke up at night scared. | .72 | .40 | | |
I was afraid that I would make mistakes. | .70 | −.39 | | |
It was hard for me to relax. | .68 | −.15 | | |
I worried when I was away from home. | .64 | .10 | | |
I felt nervous. | .63 | −.19 | | |
I felt everything in my life went wrong. | .61 | | .56 | |
I felt like I couldn’t do anything right. | .62 | | .55 | |
I felt so bad that I didn’t want to do anything. | .57 | | .48 | |
I felt alone. | .59 | | .50 | | .27
I felt that no one loved me. | .60 | | .47 | |
Being sad made it hard for me to do things with my friends. | .72 | | .28 | | .26
It was hard to do school work because I felt sad. | .66 | | .31 | |
I felt like crying. | .66 | | .24 | | .49
I cried more than usual. | .64 | | .23 | |
I wanted to be by myself. | .27 | | .30 | |

Note. Anxiety, depressive symptoms, and afraid/scared are orthogonal group-specific factors. Entries italicized in the original are less than 2 standard errors from 0. From “An Item Response Analysis of the Pediatric PROMIS Anxiety and Depressive Symptoms Scales,” by D. E. Irwin, B. D. Stucky, M. M. Langer, D. Thissen, E. M. DeWitt, J. S. Lai, J. W. Varni, K. Yeatts, and D. A. DeWalt, 2010, Quality of Life Research, 19, p. 600. Copyright 2010 by Springer Science+Business Media. Reprinted with permission.

Researchers who rely on EFA to determine the dimensionality of the items may have settled on the set of depressive symptoms and anxiety items originally hypothesized; however, a closer inspection reveals that the general and group-specific factors are not (all) conditionally independent. In Table 35.2, note that a cluster of items involves being “scared or afraid” with responses that are more correlated than expected given the general factor and the anxiety-specific factor, and four more pairs of items have significant residual correlations. Beginning at the top of Table 35.2, the pairs of items modeled with residual correlations are about “worrying,” “feelings of loneliness,” “sadness,” and “crying.” Items in these pairs or triplets are (in part) like asking the same question twice. In each instance, including a single item on the scale is sufficient, and providing both (or all three in the case of the triplet) would violate assumptions of unidimensionality.

Conducting IFA in this careful manner is useful for identifying and eliminating violations of local independence. However, this modeling approach is not without its own complexities. In the present application, knowledge of the factor structure served as a foundation for later unidimensional IRT parameter calibration (i.e., after setting aside locally dependent items, the factors anxiety and depressive symptoms were separately fit with unidimensional IRT models). If the modified bifactor model is considered in a MIRT framework, however, many difficult interpretation and scoring issues remain. If IRT-based scores are desired, then, including residual correlations, eight dimensions require integration, and hence, eight possible scores. The computation of scores for such models is underdeveloped. Although MIRT scoring is slowly gaining some use, uses typically do not involve cases with small subsets of locally dependent items.
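The "keep one item from each locally dependent grouping" strategy described above can be sketched as a screening pass over residual correlations. This is our illustration, not the chapter's procedure: the observed correlations, the unidimensional-model loadings, and the flagging threshold are all invented.

```python
import numpy as np

# Observed correlations for three items, where the first two (e.g., a
# "crying" doublet) are locally dependent; values are invented.
obs = np.array([
    [1.00, 0.74, 0.45],
    [0.74, 1.00, 0.43],
    [0.45, 0.43, 1.00],
])
loadings = np.array([0.70, 0.68, 0.65])     # assumed single-factor loadings

# Residual correlations: what the single factor fails to reproduce.
residual = obs - np.outer(loadings, loadings)
np.fill_diagonal(residual, 0.0)

# Flag pairs whose residual correlation exceeds a screening threshold, then
# keep one item from each flagged cluster and set the others aside.
threshold = 0.10
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3) if residual[i, j] > threshold]
print(pairs)        # [(0, 1)] -> the locally dependent doublet
dropped = {j for _, j in pairs}
retained = [i for i in range(3) if i not in dropped]
print(retained)     # [0, 2] -> one doublet item plus the unrelated item
```

In practice this screening role is played by model-based LD statistics (e.g., the Chen and Thissen indices cited earlier) rather than a raw threshold, but the selection logic, flag the cluster and retain a single representative, is the same.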

With current scoring techniques, the scores per factor must be interpreted as conditional on the model's other latent variables. Interpretation of such scores remains limited. When IFA is conducted to explore or identify multidimensionality that often occurs in psychological data, it serves as a useful alternative to traditional EFA models. Had our previous analyses concluded after the EFA, we would have missed a great deal of LD. The situation shown in this data example is not rare. In well-constructed, expert-reviewed scales, LD is often missed, and it may be too minor to be identified via an EFA but large enough to affect item calibration. With recent advances in both efficient algorithms and computational speed, we expect IFA to continue to grow as more researchers become aware of the benefits of considering both scale dimensionality and item effects.

References

Bandalos, D. L. (2008). Is parceling really necessary? A comparison of results from item parceling and categorical variable methodology. Structural Equation Modeling, 15, 211–240. doi:10.1080/10705510801922340

Bartholomew, D. J. (1983). Latent variable models for ordered categorical data. Journal of Econometrics, 22, 229–243. doi:10.1016/0304-4076(83)90101-X

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51. doi:10.1007/BF02291411

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443–459. doi:10.1007/BF02293801

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168. doi:10.1007/BF02294533

Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75, 33–57.

Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.

Cai, L., du Toit, S. H. C., & Thissen, D. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Chicago, IL: Scientific Software International.

Cattell, R. B., & Burdsal, C. A. (1975). The radial parcel double factoring design: A solution to the item-vs.-parcel controversy. Multivariate Behavioral Research, 10, 165–179. doi:10.1207/s15327906mbr1002_3

Chen, W. H., & Thissen, D. (1997). Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

Childs, R. A., Dahlstrom, W. G., Kemp, S., & Panter, A. T. (1992). Item response theory in personality assessment: The MMPI-2 Depression Scale (Report 92-1). Chapel Hill: Thurstone Psychometric Laboratory, University of North Carolina at Chapel Hill.

Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40, 5–32. doi:10.1007/BF02291477

Coffman, D. L., & MacCallum, R. C. (2005). Using parcels to convert path analysis models into latent variable models. Multivariate Behavioral Research, 40, 235–259. doi:10.1207/s15327906mbr4002_4

DiStefano, C. (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9, 327–346. doi:10.1207/S15328007SEM0903_2

Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75, 474–497.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466–491. doi:10.1037/1082-989X.9.4.466

Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. doi:10.1007/BF02295430

Hagtvet, K. A., & Nasser, F. M. (2004). How well do item parcels represent conceptually-defined latent constructs? A two-facet approach. Structural Equation Modeling, 11, 168–193. doi:10.1207/s15328007sem1102_2

Hastings, W. K. (1970). Monte Carlo simulation methods using Markov chains and their applications. Biometrika, 57, 97–109. doi:10.1093/biomet/57.1.97

Hau, K. T., & Marsh, H. W. (2004). The use of item parcels in structural equation modeling: Non-normal data and small sample sizes. British Journal of Mathematical and Statistical Psychology, 57, 327–351. doi:10.1111/j.2044-8317.2004.tb00142.x

Heywood, H. B. (1931). On finite sequences of real numbers. Proceedings of the Royal Society: Series A, 134, 486–501. doi:10.1098/rspa.1931.0209

Horn, J. L., & McArdle, J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18, 117–144.

Hoskens, M., & De Boeck, P. (1997). A parametric model for local dependence among test items. Psychological Methods, 2, 261–277. doi:10.1037/1082-989X.2.3.261

Humphreys, L. G. (1986). An analysis and evaluation of test and item bias in the prediction context. Journal of Applied Psychology, 71, 327–333. doi:10.1037/0021-9010.71.2.327

Irwin, D. E., Stucky, B. D., Langer, M. M., Thissen, D., DeWitt, E. M., Lai, J. S., . . . DeWalt, D. A. (2010). An item response analysis of the pediatric PROMIS anxiety and depressive symptoms scales. Quality of Life Research, 19, 595–607.

Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202. doi:10.1007/BF02289343

Jöreskog, K. G. (1970). A general method for analysis of covariance structures. Biometrika, 57, 239–251.

Jöreskog, K. G. (1990). New developments in LISREL: Analysis of ordinal variables using polychoric correlations and weighted least squares. Quality and Quantity, 24, 387–404. doi:10.1007/BF00152012

Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36, 347–387.

Jöreskog, K. G., & Sörbom, D. (2001). LISREL user's guide. Chicago, IL: Scientific Software International.

Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567. doi:10.1146/annurev.psych.50.1.537

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer (Ed.), Measurement and prediction. Princeton, NJ: Princeton University Press.

Little, T. D., Cunningham, W. A., Shahar, G., & Widaman, K. F. (2002). To parcel or not to parcel: Exploring the question, weighing the merits. Structural Equation Modeling, 9, 151–173. doi:10.1207/S15328007SEM0902_1

Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indicators for multivariate measurement and modeling with latent variables. Psychological Methods, 4, 192–211. doi:10.1037/1082-989X.4.2.192

Lord, F. M. (1952). A theory of test scores. New York, NY: Psychometric Society.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

McArdle, J. J. (2007). Five steps in the structural factor analysis of longitudinal data. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 99–130). Mahwah, NJ: Erlbaum.

McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, No. 15.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.

McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, 82–99.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. doi:10.1007/BF02294825

Meredith, W., & Horn, J. L. (2001). The role of factorial invariance in modeling growth and change. In A. G. Sayer & L. M. Collins (Eds.), New methods for the analysis of change (pp. 203–240). Washington, DC: American Psychological Association. doi:10.1037/10409-007

Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychological Methods, 2, 248–260. doi:10.1037/1082-989X.2.3.248

Millsap, R. E. (1998). Group differences in regression intercepts: Implications for factorial invariance. Multivariate Behavioral Research, 33, 403–424. doi:10.1207/s15327906mbr3303_5

Millsap, R. E., & Kwok, O. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115. doi:10.1037/1082-989X.9.1.93

Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 131–152). Mahwah, NJ: Erlbaum.

Millsap, R. E., & Tein, J. Y. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39, 479–515. doi:10.1207/S15327906MBR3903_4

Mislevy, R. J. (1986). Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 11, 3–31. doi:10.2307/1164846

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. doi:10.1177/014662169201600206

Muthén, B. O. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551–560. doi:10.1007/BF02293813

Muthén, B. (1983). Latent variable structural equation modeling with categorical data. Journal of Econometrics, 22, 43–65. doi:10.1016/0304-4076(83)90093-3

Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115–132. doi:10.1007/BF02294210

Muthén, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus (Mplus Web Note No. 4). Retrieved from http://www.statmodel.com/examples/webnote.shtml

Muthén, B. O., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Unpublished manuscript. Retrieved from http://www.gseis.ucla.edu/faculty/muthen/psychometrics.htm

Muthén, B. O., & Lehman, J. (1985). Multiple group IRT modeling: Application to item bias analysis. Journal of Educational Statistics, 10, 133–142. doi:10.2307/1164840

Muthén, L. K., & Muthén, B. O. (1998–2007). Mplus user's guide (5th ed.). Los Angeles, CA: Muthén & Muthén.

Panter, A. T., Swygert, K., Dahlstrom, W. G., & Tanaka, J. S. (1997). Factor analytic models for item-level personality data. Journal of Personality Assessment, 68, 561–589. doi:10.1207/s15327752jpa6803_6

Panter, A. T., & Tanaka, J. S. (1987, April). Statistically appropriate methods for analyzing dichotomous data: Assessing self-monitoring. Paper presented at the meeting of the Eastern Psychological Association, Arlington, VA.

Panter, A. T., Tanaka, J. S., & Wellens, T. R. (1992). The psychometrics of order effects. In S. Sudman & N. Schwarz (Eds.), Context effects in social and psychological research (pp. 249–264). New York, NY: Springer-Verlag.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230. doi:10.2307/1164671

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1990). Fitting the two-parameter model to personality data. Applied Psychological Measurement, 14, 45–58. doi:10.1177/014662169001400105

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566. doi:10.1037/0033-2909.114.3.552

Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407. doi:10.1214/aoms/1177729586

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.

Satorra, A., & Bentler, P. M. (1994). Corrections to test statistic and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Analysis of latent variables in developmental research (pp. 399–419). Newbury Park, CA: Sage.

Schaeffer, N. C. (1988). An application of item response theory to the measurement of depression. In C. C. Clogg (Ed.), Sociological methodology (pp. 271–307). Washington, DC: American Sociological Association.

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, 533–555.

Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–107. doi:10.1086/209528

Steinberg, L. (1994). Context and serial-order effects in personality measurement: Limits on the generality of "measuring changes the measure." Journal of Personality and Social Psychology, 66, 341–349. doi:10.1037/0022-3514.66.2.341

Steinberg, L., & Thissen, D. (1995). Item response theory in personality research. In P. E. Shrout & S. Fiske (Eds.), Personality research, methods, and theory: A Festschrift honoring Donald W. Fiske (pp. 161–181). Hillsdale, NJ: Erlbaum.

Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589–617. doi:10.1007/BF02294821

Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408. doi:10.1007/BF02294363

Tanaka, J. S., Panter, A. T., & Winborne, W. C. (1988). Dimensions of the need for cognition: Subscales and gender differences. Multivariate Behavioral Research, 23, 35–50. doi:10.1207/s15327906mbr2301_2

Thissen, D. (1992, August). Item response theory in psychological research. Invited address at the 100th Annual Convention of the American Psychological Association, Washington, DC.

Thissen, D. (2003). MULTILOG 7 user's guide. Chicago, IL: Scientific Software International.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577. doi:10.1007/BF02295596

Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26, 247–260. doi:10.1111/j.1745-3984.1989.tb00331.x

Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York, NY: Cambridge University Press.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185–201. doi:10.1111/j.1745-3984.1987.tb00274.x

Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the absorption scale. Journal of Personality and Social Psychology, 57, 1051–1058. doi:10.1037/0022-3514.57.6.1051

West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with non-normal variables: Problems and remedies. In R. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 56–75). Newbury Park, CA: Sage.

Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association. doi:10.1037/10222-009

Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79. doi:10.1037/1082-989X.12.1.58

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145. doi:10.1177/014662168400800201

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG3 user's guide. Chicago, IL: Scientific Software International.