
David A. Kenny June 5, 2020.

Measuring Model Fit


Go to my three webinars on Measuring Model Fit in SEM (small charge): click here.
Go to my three PowerPoints on Measuring Model Fit in SEM (small charge): click here.
Fit refers to the ability of a model to reproduce the data (i.e., usually the variance-covariance
matrix).  A good-fitting model is one that is reasonably consistent with the data and so does
not necessarily require respecification.  Not surprisingly, there is considerable debate as to
what is meant by "reasonably consistent with the data."  Also, a good-fitting measurement
model is required before interpreting the causal paths of the structural model.
The major reason for computing a fit index is that the chi square is statistically significant,
but the researcher still wants to claim that the model is a "good fitting" model. Note that if the
model is saturated or just-identified, then most (but not all) fit indices cannot be computed,
because the model perfectly reproduces the data.
It should be noted that a good-fitting model is not necessarily a valid model. For instance, a
model in which none of the estimated parameters is significantly different from zero is a "good-
fitting" model. Conversely, a model in which all of the parameters are
statistically significant can be a poor-fitting model. Additionally, models with
nonsensical results (e.g., paths that are clearly the wrong sign) and models with poor
discriminant validity or Heywood cases can be “good-fitting” models.  Parameter estimates
must be carefully examined to determine if one has a reasonable model.  Also it is important
to realize that one might obtain a good-fitting model, yet it is still possible to improve the
model and remove specification error.  Finally, having a good-fitting model does not prove
that the model is correctly specified.
How Large a Sample Size Do I Need?
Rules of Thumb
          Ratio of Sample Size to the Number of Free Parameters
                    Tanaka (1987): 20 to 1 (Most analysts now think that is unrealistically high.)
                    Goal: Bentler & Chou (1987): 5 to 1
                    Several published studies do not meet this goal.
          Sample Size
                    200 is seen as a goal for SEM research
                    Lower sample sizes can be used for
                              Models with no latent variables
                              Models where all loadings are fixed (usually to one)
                              Models with strong correlations
                              Simpler models
                      For models for which there is an upper limit on N (e.g., countries or years
as the unit), 200 might be an unrealistic standard.
Power Analysis
          Best way to determine if you have a large enough sample is to conduct a power
analysis.
                       Either use the Satorra and Saris (1985) method or conduct a simulation.
                       To test your power to detect a poor fitting model, you can use Preacher and
Coffman's web calculator.
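The RMSEA-based power computation that such calculators implement, following MacCallum,
Browne, and Sugawara (1996), can be sketched in a few lines of Python. This is a minimal
sketch, not the calculator's own code; the function name and example values are illustrative,
and scipy is assumed to be available:

```python
# Power to reject close fit (true RMSEA = rmsea_a vs. H0: RMSEA = rmsea0),
# in the spirit of MacCallum, Browne, and Sugawara (1996).
from scipy.stats import ncx2

def rmsea_power(df, n, rmsea0=0.05, rmsea_a=0.08, alpha=0.05):
    ncp0 = (n - 1) * df * rmsea0 ** 2     # noncentrality under H0 (close fit)
    ncp_a = (n - 1) * df * rmsea_a ** 2   # noncentrality under the alternative
    crit = ncx2.ppf(1 - alpha, df, ncp0)  # critical chi-square under H0
    return 1 - ncx2.cdf(crit, df, ncp_a)  # probability of exceeding it under Ha

print(rmsea_power(df=40, n=200))  # e.g., power for a df = 40 model with N = 200
```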
The Chi Square Test: χ2
For models with about 75 to 200 cases, the chi square test is generally a reasonable measure
of fit.  But for models with more cases (400 or more), the chi square is almost always
statistically significant.  Chi square is also affected by the size of the correlations in the
model: the larger the correlations, the poorer the fit.  For these reasons alternative measures
of fit have been developed.  (Go to a website for computing p values for a given chi square
value and df.)
Sometimes chi square is more interpretable if it is transformed into a Z value.  The following
approximation can be used:
Z =  √(2χ2) - √(2df - 1)
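A small Python check of this approximation, with the exact chi-square p value shown for
comparison (the chi square and df values are invented for illustration; scipy is assumed):

```python
from math import sqrt
from scipy.stats import norm, chi2

chisq, df = 85.3, 50                      # illustrative values
z = sqrt(2 * chisq) - sqrt(2 * df - 1)    # the approximation above
p_approx = norm.sf(z)                     # one-sided p from the Z value
p_exact = chi2.sf(chisq, df)              # exact p for the same chi square and df
print(z, p_approx, p_exact)
```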
An old measure of fit is the chi square to df ratio or χ2/df.  A problem with this fit index is that
there is no universally agreed upon standard as to what is a good and a bad fitting model. 
Note, however, that two currently very popular fit indices,  TLI and RMSEA, are largely
based on this old-fashioned ratio.
The chi square test is too liberal (i.e., yields too many Type I errors) when variables have non-
normal distributions, especially distributions with kurtosis.  Moreover, with small sample
sizes, there are too many Type I errors.
Introduction to Fit Indices
The terms in the literature used to describe fit indices are confusing, and I think confused.  I
prefer the following terms (but they are unconventional): incremental, absolute, and
comparative, which are used on the pages that follow. 
Incremental Fit Index
An incremental (sometimes called in the literature relative) fit index is analogous to R2 and so
a value of zero indicates having the worst possible model and a value of one indicates having
the best possible.  So the researcher's model is placed on a continuum. In terms of a formula,
it is
Fit of Worst Possible Model - Fit of My Model
_____________________________________________
Fit of Worst Possible Model - Fit of Best Possible Model
The worst possible model is called the null or independence model. For confirmatory factor
analysis, the usual convention is to allow all the variables in the model to have variation but
no correlation and to have free means and variances. The degrees of freedom for this null
model are k(k – 1)/2 where k is the number of variables in the model.  
The null model when there are causal paths would have all exogenous variables correlated
but the endogenous variables uncorrelated with each other and with the exogenous
variables. The degrees of freedom for this null model are [k(k – 1) - p(p - 1)]/2 where k is the
number of variables in the model and p the number of exogenous variables or indicators in
the model. 
For growth-curve models, the null model should set the means as equal, i.e., no growth. See
Wu, West, and Taylor (2009) for details on measuring fit for these models. Also, O’Boyle and
Williams (2011) suggest two different null models for the measurement and structural
models. Note too that different computer programs define the null or independence model in
different ways. Check the df of the null model, as it will indicate which null model was fitted.
Absolute Fit Index
An absolute measure of fit presumes that the best fitting model has a fit of zero.  The measure
of fit then determines how far the model is from perfect fit.  These measures are
typically “badness-of-fit” measures in that the bigger the index, the worse the fit.
Comparative Fit Index
A comparative measure of fit is only interpretable when comparing two different models. 
This term is unique to this website in that these measures are more commonly
called absolute fit indices.  However, it is helpful to distinguish them from absolute indices,
which do not require a comparison between two models. One advantage of a comparative fit
index is that it can be computed for the saturated model, and so the saturated model can be
compared to non-saturated models.
Controversy about Fit Indices
There is considerable controversy about fit indices.  Some researchers do not believe that fit
indices add anything to the analysis (e.g., Barrett, 2007) and only the chi square should be
interpreted.  The worry is that fit indices allow researchers to claim that a misspecified
model is not a bad model.  Others (e.g., Hayduk, Cummings, Boadu, Pazderka-Robinson,
& Boulianne, 2007) argue that cutoffs for a fit index can be misleading and subject to
misuse.  Most analysts believe in the value of fit indices, but caution against strict reliance on
cutoffs. 
Also problematic is “cherry picking” a fit index.  That is, you compute many fit indices
and you pick the one index that allows you to make the point that you want to make.  If you
decide not to report a popular index (e.g., the TLI or the RMSEA), you need to give a good
reason why you are not.
Finally, Kenny, Kaniskan, and McCoach (2015) have argued that fit indices should not even
be computed for small degrees of freedom models.  Rather for these models, the researcher
should locate the source of specification error.
Catalogue of Fit Indices
There are now literally hundreds of measures of fit. This page includes some of the major
ones currently used in the literature, but does not pretend to include all the measures.  Though
a bit dated, the book edited by Bollen and Long (1993) explains these indexes and others. 
Also, a special issue of Personality and Individual Differences in 2007 is entirely devoted
to the topic of fit indices.
A key consideration in choice of a fit index is the penalty it places for complexity.  That
penalty for complexity is measured by how much chi square needs to change for the fit index
not to change.
Bentler-Bonett Index or Normed Fit Index (NFI)
This is the very first measure of fit proposed in the literature (Bentler & Bonett, 1980) and it
is an incremental measure of fit.  The best model is defined as a model with a χ2 of zero and the
worst model by the χ2 of the null model.   Its formula is:
χ2(Null Model) - χ2(Proposed Model)
______________________________
χ2(Null Model)
The null model is discussed above. A value between .90 and .95 is now considered marginal,
above .95 is good, and below .90 is considered to be a poor fitting model.  A major
disadvantage of this measure is that it cannot be smaller if more parameters are added to the
model.  Its “penalty” for complexity is zero.  Thus, the more parameters added to the model,
the larger the index.  It is for this reason that this measure is not recommended, but rather one
of the next two is used.
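As a quick Python sketch of the formula (the chi-square values are invented for illustration):

```python
def nfi(chisq_null, chisq_model):
    """Bentler-Bonett Normed Fit Index."""
    return (chisq_null - chisq_model) / chisq_null

print(nfi(900.0, 85.3))  # about .91 with these invented chi squares
```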
Tucker Lewis Index or Non-normed Fit Index (NNFI)
A problem with the Bentler-Bonett index is that there is no penalty for adding parameters. 
The Tucker-Lewis index (also called the non-normed fit index or NNFI), another incremental
fit index, does have such a penalty.  Let χ2/df be the ratio of chi square to its degrees of
freedom, and the TLI is computed as follows:
χ2/df(Null Model) - χ2/df(Proposed Model)
_________________________________
χ2/df(Null Model) - 1
If the index is greater than one, it is set at one.  It is interpreted in the same way as the
Bentler-Bonett index. 
Note that for a given model, a lower chi square to df ratio (as long as it is not less than one)
implies a better fitting model.   Its penalty for complexity is χ2/df.  That is, if the chi square
to df ratio does not change, the TLI does not change.
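In Python, the formula reads as follows (the null-model values continue the invented example
above):

```python
def tli(chisq_null, df_null, chisq_model, df_model):
    """Tucker-Lewis Index, set to one if it exceeds one."""
    ratio_null = chisq_null / df_null     # chi square/df for the null model
    ratio_model = chisq_model / df_model  # chi square/df for the model
    return min((ratio_null - ratio_model) / (ratio_null - 1), 1.0)

print(tli(900.0, 66, 85.3, 50))  # about .94 with these invented values
```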
Note that the TLI (and the CFI that follows) depends on the average size of the correlations in
the data.  If the average correlation between variables is not high, then the TLI will not be
very high. Consider a simple example.  You have a 5-item scale that you think measures one
latent variable. You also have 3 dichotomous experimental variables that you manipulate and
that cause that latent factor.  These three experimental variables create 7 variables when
you allow for all possible interactions. You have equal N in the conditions, and so all their
correlations are zero.  If you run the CFA on just the 5 indicators, you might have a nice TLI
of .95.  However, if you add in the 7 experimental variables, your TLI might sink below .90,
because now the null model will not be so "bad": you have added to the model 7
variables that have zero correlations with each other.  
A reasonable rule of thumb is to examine the RMSEA (see below) for the null model (see
above) and make sure that it is no smaller than 0.158. An RMSEA for the model of 0.05 and a
TLI of .90 together imply that the RMSEA of the null model is 0.158.  If the RMSEA for the null
model is less than 0.158, an incremental measure of fit may not be very informative. So far as
I know, this mathematical fact (that a model whose null model RMSEA is less than 0.158 and
whose RMSEA is 0.05 must have a TLI of less than .90) has never been published, but it is in
fact true.
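The algebra is short: for any model, χ2/df - 1 equals (N - 1) times the squared RMSEA, so the
TLI reduces to 1 minus the squared ratio of the model RMSEA to the null-model RMSEA. A
two-line check in Python (this derivation is mine, not from the text):

```python
# Since chi2/df - 1 = (N - 1) * RMSEA**2 for any model,
# TLI = 1 - (rmsea_model / rmsea_null)**2, which implies:
rmsea_model, tli = 0.05, 0.90
rmsea_null = rmsea_model / (1 - tli) ** 0.5
print(round(rmsea_null, 3))  # 0.158
```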
Comparative Fit Index (CFI)
This incremental measure of fit is directly based on the non-centrality measure.  Let d
= χ2 - df, where df are the degrees of freedom of the model.  The Comparative Fit Index or CFI
equals
d(Null Model) - d(Proposed Model)
_____________________________
d(Null Model)
If the index is greater than one, it is set at one, and if less than zero, it is set to zero. It is
interpreted in the same way as the previous incremental indices. 
If the CFI is less than one, then the CFI is always greater than the TLI.  The CFI pays a
penalty of one for every parameter estimated.  Because the TLI and CFI are highly correlated,
only one of the two should be reported.  The CFI is reported much more often than the TLI,
but I think the CFI’s penalty for complexity of just 1 is too low, and so I prefer the TLI.
Again, the CFI should not be computed if the RMSEA of the null model is less than 0.158, as
otherwise one will obtain too small a value of the CFI.
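In Python, continuing the same invented example (note that the result, about .96, exceeds the
TLI of about .94, consistent with the relation noted above):

```python
def cfi(chisq_null, df_null, chisq_model, df_model):
    """Comparative Fit Index, bounded between zero and one."""
    d_null = chisq_null - df_null      # noncentrality of the null model
    d_model = chisq_model - df_model   # noncentrality of the proposed model
    return min(max((d_null - d_model) / d_null, 0.0), 1.0)

print(cfi(900.0, 66, 85.3, 50))  # about .96 with these invented values
```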
Root Mean Square Error of Approximation (RMSEA)
This absolute measure of fit is based on the non-centrality parameter.  Its computational
formula is:
√(χ2 - df)
__________  
√[df(N - 1)]
where N is the sample size and df the degrees of freedom of the model.  If χ2 is less than df, then
the RMSEA is set to zero.  Like the TLI, its penalty for complexity is the chi square
to df ratio.  The measure is positively biased (i.e., tends to be too large), and the amount of the
bias depends on the smallness of sample size and df, primarily the latter.  The RMSEA is
currently the most popular measure of model fit and is now reported in virtually all papers
that use CFA or SEM; some refer to the measure as the “Ramsey.” 
MacCallum, Browne and Sugawara (1996) have used 0.01, 0.05, and 0.08 to indicate
excellent, good, and mediocre fit, respectively. However, others have suggested 0.10 as the
cutoff for poor fitting models.  These are definitions for the population.  That is, a given
model may have a population value of 0.05 (which would not be known), but in the sample it
might be greater than 0.10.  Use of confidence intervals and tests of PCLOSE can help
understand the sampling error in the RMSEA. There is greater sampling error for
small df and low N models, especially for the former.  Thus, models with small df and low N
can have artificially large values of the RMSEA.  For instance, a chi square of 2.098 (a value
not statistically significant), with a df of 1 and N of 70, yields an RMSEA of 0.126.  For this
reason, Kenny, Kaniskan, and McCoach (2015) argue that the RMSEA should not even be
computed for low df models.
A confidence interval can be computed for the RMSEA. Ideally the lower value of the 90%
confidence interval includes or is very near zero (or no worse than 0.05) and the upper value
is not very large, i.e., less than .08.   The width of the confidence interval is very informative
about the precision in the estimate of the RMSEA.
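Both the point estimate and the interval can be computed by inverting the noncentral
chi-square distribution, the standard approach. A minimal Python sketch, assuming scipy and
reusing the invented values from above:

```python
from math import sqrt
from scipy.optimize import brentq
from scipy.stats import ncx2

def rmsea(chisq, df, n):
    """RMSEA, set to zero when chi square is less than df."""
    return sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def rmsea_ci(chisq, df, n, level=0.90):
    """Confidence interval for the RMSEA via the noncentral chi square."""
    def ncp_for(p):
        # Noncentrality that puts the observed chi square at the p-th percentile.
        f = lambda ncp: ncx2.cdf(chisq, df, ncp) - p
        return 0.0 if f(0.0) < 0 else brentq(f, 0.0, 10.0 * chisq)
    lo = sqrt(ncp_for((1 + level) / 2) / (df * (n - 1)))
    hi = sqrt(ncp_for((1 - level) / 2) / (df * (n - 1)))
    return lo, hi

print(rmsea(85.3, 50, 200), rmsea_ci(85.3, 50, 200))
```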
p of Close Fit (PCLOSE)
This measure is a one-sided test of the null hypothesis that the RMSEA equals 0.05, what is
called a close-fitting model. Such a model has specification error, but “not very much”
specification error.  The alternative, one-sided hypothesis is that the RMSEA is greater than
0.05. So if the p is greater than .05 (i.e., not statistically significant), it is concluded that
the fit of the model is "close."  If the p is less than .05, it is concluded that the model’s fit is
worse than close fitting (i.e., the RMSEA is greater than 0.05). As with any significance test,
sample size is a critical factor, but so too is the model df: with lower df there is less power in
this test.
You can use the RMSEA confidence interval to test any null hypothesis about the RMSEA. 
For instance, if you want to test the one-sided hypothesis that the RMSEA is greater than 0.05
(what is tested with PCLOSE) with a .05 alpha, you would examine the 90 percent confidence
interval of the RMSEA and note whether its lower bound exceeds 0.05. If it does, the
model fits worse than a close-fitting model, one with a population value for the RMSEA
of 0.05.
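PCLOSE itself is a single noncentral chi-square tail probability. A minimal sketch, assuming
scipy and using N - 1 as in the RMSEA formula above:

```python
from scipy.stats import ncx2

def pclose(chisq, df, n, rmsea0=0.05):
    """p of close fit: P(chi square >= observed) when the true RMSEA is .05."""
    ncp0 = (n - 1) * df * rmsea0 ** 2   # noncentrality implied by RMSEA = .05
    return ncx2.sf(chisq, df, ncp0)

print(pclose(85.3, 50, 200))  # illustrative inputs
```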
Standardized Root Mean Square Residual (SRMR)
The SRMR is an absolute measure of fit and is defined as the standardized difference
between the observed correlation and the predicted correlation.  It is a positively biased
measure and that bias is greater for small N and for low df studies.  Because the SRMR is an
absolute measure of fit, a value of zero indicates perfect fit.  The SRMR has no penalty for
model complexity.  A value less than .08 is generally considered a good fit (Hu & Bentler,
1999).
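A minimal sketch of the computation, assuming the observed and model-implied correlation
matrices are available. Conventions differ on whether the diagonal is included; this version
uses only the below-diagonal elements, and the matrices are invented:

```python
import numpy as np

def srmr(observed, implied):
    """Root mean squared difference between observed and implied correlations."""
    obs, imp = np.asarray(observed), np.asarray(implied)
    rows, cols = np.tril_indices_from(obs, k=-1)   # below-diagonal elements
    resid = obs[rows, cols] - imp[rows, cols]      # standardized residuals
    return float(np.sqrt(np.mean(resid ** 2)))

obs = np.array([[1.0, .50, .40], [.50, 1.0, .35], [.40, .35, 1.0]])
imp = np.array([[1.0, .45, .42], [.45, 1.0, .38], [.42, .38, 1.0]])
print(srmr(obs, imp))
```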
Akaike Information Criterion (AIC)
The AIC is a comparative measure of fit and so it is meaningful only when two different
models are estimated.  Lower values indicate a better fit and so the model with the lowest
AIC is the best fitting model.  There are somewhat different formulas given for the AIC in the
literature, but those differences are not really meaningful as it is the difference in AIC that
really matters:
χ2 + k(k + 1) - 2df
where k is the number of variables in the model and df are the degrees of freedom of the
model.  (If means are included in the model, then replace k(k + 1) with k(k + 3).)  Note that
k(k + 1) - 2df equals twice the number of free parameters in the model.  The AIC makes
the researcher pay a penalty of two for every parameter that is estimated.
One advantage of the AIC, BIC, and SABIC measures is that they can be computed for
models with zero degrees of freedom, i.e., saturated or just-identified models.
Bayesian Information Criterion (BIC)
Two other comparative fit indices are the BIC and the SABIC. Whereas the AIC has a penalty
of 2 for every parameter estimated, the BIC increases the penalty as sample size increases:
χ2 + ln(N)[k(k + 1)/2 - df]
where ln(N) is the natural logarithm of the number of cases in the sample. (If means are
included in the model, then replace k(k + 1)/2 with k(k + 3)/2.)  The BIC places a high value
on parsimony (perhaps too high). 
The Sample-Size Adjusted BIC (SABIC)
Like the BIC, the Sample-size adjusted BIC or SABIC places a penalty for adding parameters
based on sample size, but not as high a penalty as the BIC.  The SABIC is not given in Amos,
but is given in Mplus.  Several recent simulation studies (Enders & Tofighi, 2008; Tofighi, &
Enders, 2007) have suggested that the SABIC is a useful tool in comparing models.  Its
formula is
χ2 + ln[(N + 2)/24][k(k + 1)/2 - df]
Again if means are included in the model, then replace k(k + 1)/2 with k(k + 3)/2.
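All three criteria, exactly as given above for models without mean structures, in a short
Python sketch (the inputs are invented for illustration):

```python
from math import log

def free_params(k, df):
    """Number of free parameters: k(k + 1)/2 - df (no mean structure)."""
    return k * (k + 1) // 2 - df

def aic(chisq, k, df):
    return chisq + k * (k + 1) - 2 * df          # penalty of 2 per parameter

def bic(chisq, k, df, n):
    return chisq + log(n) * free_params(k, df)   # penalty of ln(N) per parameter

def sabic(chisq, k, df, n):
    return chisq + log((n + 2) / 24) * free_params(k, df)

print(aic(85.3, 12, 50), bic(85.3, 12, 50, 200), sabic(85.3, 12, 50, 200))
```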
GFI and AGFI (LISREL measures)
These measures are affected by sample size. The current consensus is not to use these
measures (Sharma, Mukherjee, Kumar, & Dillon, 2005).
Hoelter Index
The index states the sample size at which the chi square would not be significant (alpha = .05),
i.e., how small one's sample size would have to be for the result to no longer be
significant.  The index should only be computed if the chi square is statistically significant. 
Its formula is:
[(N - 1)χ2(crit)/χ2] + 1
where N is the sample size, χ2 is the chi square for the model and χ2(crit) is the critical value
for the chi square.  If the critical value is unknown, the following approximation can be used:
[1.645 + √(2df - 1)]2
____________________   + 1
2χ2/(N - 1)
where df are the degrees of freedom of the model.  For both of these formulas, one
rounds down to the nearest integer value.   Hoelter recommends values of at least 200.  Values
of less than 75 indicate very poor model fit.
The Hoelter only makes sense to interpret if N > 200 and the chi square is statistically
significant.   It should be noted that Hu and Bentler (1998) do not recommend this measure.
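When the exact critical value is available, the first formula is simple to apply. A minimal
Python sketch using scipy's chi-square quantile function in place of the approximation (inputs
are invented):

```python
from math import floor
from scipy.stats import chi2

def hoelter(chisq, df, n, alpha=0.05):
    """Largest N at which the observed chi square would not be significant."""
    crit = chi2.ppf(1 - alpha, df)            # exact critical chi-square value
    return floor((n - 1) * crit / chisq) + 1  # round down, then add one

print(hoelter(85.3, 50, 200))
```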
Factors that Affect Fit Indices
Number of Variables
          Anecdotal evidence that models with many variables do not fit
          Kenny and McCoach (2003) show
                    RMSEA improves as more variables are added to the model
                    TLI and CFI are relatively stable, but tend to decline slightly
          We do not fully understand why models with more variables tend to have relatively
poor fit.
Model Complexity
          How much chi square needs to change per df for the fit index not to change:
                                                   Theoretical Value     A&M*      Reisenzein**
                    Bentler and Bonett             0                     0         0
                    CFI                            1                     1         1
                    AIC                            2                     2         2
                    SABIC                          ln[(N + 2)/24]        1.96      1.76
                    Tucker Lewis                   χ2/df                 2.01      1.46
                    RMSEA                          χ2/df                 2.01      1.46
                    BIC                            ln(N)                 5.13      4.93
*A&M: Ajzen, I., & Madden, T. J. (1986). Prediction of goal-directed behavior: Attitudes,
intentions, and perceived behavioral control. Journal of Experimental Social Psychology, 22,
453-474.
**Reisenzein: Reisenzein, R. (1986). A structural equation analysis of Weiner's attribution-
affect model of helping behavior. Journal of Personality and Social Psychology, 50, 1123-
1133.
Note that
                Each changes by a constant amount, regardless of the df change.
                Larger values reward parsimony and smaller values reward complexity. The BIC
rewards parsimony most, and the CFI (after the Bentler and Bonett), the least. The SABIC is
comparable to the RMSEA and the TLI.
Sample Size
Bentler-Bonett fails to adjust for sample size: models with larger sample sizes have smaller
values.  The TLI and CFI do not vary much with sample size.  However, these measures are
less variable with larger sample sizes.
The RMSEA and the SRMR are larger with smaller sample sizes.
Normality
Non-normal data (especially data with high kurtosis) inflate the chi square and absolute measures of fit. 
Presumably, incremental and comparative measures of fit are less affected.
 
References
Barrett, P. (2007). Structural equation modelling: adjudging model fit. Personality and Individual
Differences, 42, 815–824.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness-of-fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Bentler, P. M., & Chou, C. P. (1987). Practical issues in structural modeling. Sociological Methods &
Research, 16, 78-117.
Bollen, K. A., & Long, J. S., Eds. (1993). Testing structural equation models. Newbury Park, CA:
Sage.
Enders, C.K., & Tofighi, D. (2008). The impact of misspecifying class-specific residual variances in
growth mixture models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 75-95.
Hayduk, L., Cummings, G. G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!
Testing! One, two three – Testing the theory in structural equation models! Personality and Individual
Differences, 42, 841-50.
Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity
to underparameterized model misspecification. Psychological Methods, 3, 424–453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015).  The performance of RMSEA in models with
small degrees of freedom.  Sociological Methods & Research, 44, 486-507.
Kenny, D. A., & McCoach, D. B. (2003). Effect of the number of variables on measures of fit in
structural equation modeling. Structural Equation Modeling, 10, 333-351.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of
sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
O'Boyle, E. H., Jr., & Williams, L. J. (2011). Decomposing model fit: Measurement vs. theory in
organizational research using latent variables. Journal of Applied Psychology, 96, 1-12.
Satorra, A., & Saris, W. E. (1985). The power of the likelihood ratio test in covariance structure
analysis. Psychometrika, 50, 83–90.
Sharma, S., Mukherjee, S., Kumar, A., & Dillon, W.R. (2005). A simulation study to investigate the
use of cutoff values for assessing model fit in covariance structure models. Journal of Business
Research, 58, 935-43.
Tanaka, J.S. (1987). "How big is big enough?": Sample size and goodness of fit in structural equation
models with latent variables. Child Development, 58, 134-146.
Tofighi, D., & Enders, C. K. (2007). Identifying the correct number of classes in mixture models. In G.
R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 317-341).
Greenwich, CT: Information Age.  
Wu, W., West, S. G., & Taylor, A. B. (2009). Evaluating model fit for growth curve models:
Integration of fit indices from SEM and MLM frameworks. Psychological Methods, 14, 183-201.
Fit indices for structural equation
modeling
Author: Dr. Simon Moss
Overview
In structural equation modeling, the fit indices establish whether, overall, the model is acceptable.
If the model is acceptable, researchers then establish whether specific paths are significant.
Acceptable fit indices do not imply the relationships are strong. Indeed, high fit indices are often
easier to obtain when the relationships between variables are low rather than high, because with
strong relationships the power to detect discrepancies from predictions is amplified.
Many of the fit indices are derived from the chi-square value. Conceptually, the chi-square value,
in this context, represents the difference between the observed covariance matrix and the
predicted or model covariance matrix.
The fit indices can be classified into several classes. These classes include:
 Discrepancy functions, such as the chi square test, relative chi square, and RMS
 Tests that compare the target model with the null model, such as the CFI, NFI, TLI, and
IFI
 Information theory goodness of fit measures, such as the AIC, BCC, BIC, and CAIC
 Non-centrality fit measures, such as the NCP.
Many researchers, such as Marsh, Balla, and Hau (1996), recommend that individuals utilize a
range of fit indices. Indeed, Jaccard and Wan (1996) recommend using indices from different
classes as well; this strategy overcomes the limitations of each index.

Summary of criteria that researchers often use


A model is regarded as acceptable if:
 The Normed Fit Index (NFI) exceeds .90 (Byrne, 1994) or .95 (Schumacker & Lomax,
2004)
 The Goodness of Fit Index exceeds .90 (Byrne, 1994)
 The Comparative Fit Index exceeds .93 (Byrne, 1994)
 RMS is less than .08 (Browne & Cudeck, 1993)--and ideally less than .05 (Steiger, 1990).
Alternatively, the upper confidence interval of the RMS should not exceed .08 (Hu &
Bentler, 1998)
 The relative chi-square should be less than 2 or 3 (Kline, 1998; Ullman, 2001).
These criteria are merely guidelines. To illustrate, in a field in which previous models generate
CFI values of .70 only, a CFI value of .85 represents progress and thus should be acceptable
(Bollen, 1989).

Discrepancy functions
Chi-square
The chi-square for the model is also called the discrepancy function, likelihood ratio chi-square,
or chi-square goodness of fit. In AMOS, the chi-square value is called CMIN.
If the chi-square is not significant, the model is regarded as acceptable. That is, the observed
covariance matrix is similar to the predicted covariance matrix--that is, the matrix predicted by
the model.
If the chi-square is significant, the model is regarded, at least sometimes, as unacceptable.
However, many researchers disregard this index if both the sample size exceeds 200 or so and
other indices indicate the model is acceptable. In particular, this approach arises because the chi-
square index presents several problems:
 Complex models, with many parameters, will tend to generate an acceptable fit
 If the sample size is large, the model will usually be rejected, sometimes unfairly
 When the assumption of multivariate normality is violated, the chi-square fit index is
inaccurate. The Satorra-Bentler scaled chi-square, which is available in EQS, is often
preferred, because this index penalizes the chi-square for kurtosis.
Relative chi-square
The relative chi-square is also called the normed chi-square. This value equals the chi-square
index divided by the degrees of freedom. This index might be less sensitive to sample size. The
criterion for acceptance varies across researchers, ranging from less than 2 (Ullman, 2001) to less
than 5 (Schumacker & Lomax, 2004).

Root mean square residual


The RMS, also called the RMR or RMSE, represents the square root of the mean of the squared
covariance residuals--the differences between corresponding elements of the observed and
predicted covariance matrices. Zero represents a perfect fit, but the maximum is unlimited.
Because the maximum is unbounded, the RMS is difficult to interpret and consensus has not been
reached on the levels that represent acceptable models. Some researchers utilize the standardized
version of the RMS instead to circumvent this problem.
According to some researchers, RMS should be less than .08 (Browne & Cudeck, 1993)--and
ideally less than .05 (Steiger, 1990). Alternatively, the upper confidence interval of the RMS
should not exceed .08 (Hu & Bentler, 1998).

Indices that compare the target and null models


Comparative fit index (CFI)
The comparative fit index, like the IFI, NFI, BBI, TLI, and RFI, compares the model of interest
with some alternative, such as the null or independence model. The CFI is also known as the
Bentler Comparative Fit Index.
Specifically, the CFI compares the fit of a target model to the fit of an independent model--a
model in which the variables are assumed to be uncorrelated. In this context, fit refers to the
difference between the observed and predicted covariance matrices, as represented by the chi-
square index.
In short, the CFI is based on the ratio of the discrepancy of the target model to the
discrepancy of the independence model. Roughly, the CFI thus represents the extent to which the
model of interest is better than is the independence model. Values that approach 1 indicate
acceptable fit.
CFI is not too sensitive to sample size (Fan, Thompson, and Wang, 1999). However, CFI is not
effective if most of the correlations between variables approach 0--because there is, therefore,
less covariance to explain. Furthermore, Raykov (2000, 2005) argues that CFI is a biased
measure, based on non-centrality.
Incremental fit index (IFI)
The incremental fit index, also known as Bollen's IFI, is also relatively insensitive to sample size.
Values that exceed .90 are regarded as acceptable, although this index can exceed 1.
To compute the IFI, first the difference between the chi square of the independence model--in
which variables are uncorrelated--and the chi-square of the target model is calculated. Next, the
difference between the chi-square of the independence model and the df for the target model is
calculated. The ratio of these two values is the IFI.
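As a Python sketch of that computation (the chi-square values are invented for illustration):

```python
def ifi(chisq_null, chisq_target, df_target):
    """Bollen's Incremental Fit Index."""
    return (chisq_null - chisq_target) / (chisq_null - df_target)

print(ifi(900.0, 85.3, 50))  # about .96 with these invented values
```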

Normed fit index (NFI)


The NFI is also known as the Bentler-Bonett normed fit index. The fit index varies from 0 to 1--
where 1 is ideal. The NFI equals the difference between the chi-square of the null model and the
chi-square of the target model, divided by the chi-square of the null model. In other words, an NFI
of .90, for example, indicates the model of interest improves the fit by 90% relative to the null or
independence model.
When the samples are small, the fit is often underestimated (Ullman, 2001). Furthermore, in
contrast to the TLI, the fit can be overestimated if the number of parameters is increased; the
TLI (NNFI) overcomes this problem.

Tucker Lewis index (TLI) or Non-normed fit index (NNFI)


The TLI, sometimes called the NNFI, is similar to the NFI. However, the index is lower, and
hence the model is regarded as less acceptable, if the model is complex. To compute the TLI:
 First divide the chi square for the target model and the null model by their corresponding
df values--which generates relative chi squares for each model.
 Next, calculate the difference between these relative chi squares.
 Finally, divide this difference by the relative chi square for the null model minus 1.
According to Marsh, Balla, and McDonald (1988), the TLI is relatively independent of sample
size. The TLI is usually lower than the GFI--but values over .90 or over .95 are considered
acceptable (e.g., Hu & Bentler, 1999).

Information theory goodness of fit measures


Akaike Information Criterion
The AIC, like the BIC, BCC, and CAIC, is regarded as an information theory goodness of fit
measure--applicable when maximum likelihood estimation is used (Burnham & Anderson, 1998).
These indices are used to compare different models. The models that generate the lowest values
are optimal. The absolute AIC value is irrelevant--although values closer to 0 are ideal; only the
AIC value of one model relative to the AIC value of another model is meaningful.
Like the chi square index, the AIC also reflects the extent to which the observed and predicted
covariance matrices differ from each other. However, unlike the chi square index, the AIC
penalizes models that are too complex. In particular, the AIC equals χ2/n + 2k/(n - 1). In this
formula, k = .5v(v + 1) - df is the number of free parameters, v is the number of variables, and
n is the sample size.

Browne-Cudeck criterion (BCC) and Consistent AIC (CAIC)


The BCC is similar to the AIC. That is, the BCC and AIC both represent the extent to which the
observed covariance matrix differs from the predicted covariance matrix--like the chi square
statistic--but include a penalty if the model is complex, with many parameters. The BCC bestows
an even harsher penalty than does the AIC.
The BCC equals χ2/n + 2k/(n - v - 2). In this formula, k = .5v(v + 1) - df, where v is the number
of variables and n is the sample size.
The CAIC is similar to the AIC as well. However, the CAIC also confers a penalty if the sample
size is small.

Bayesian Information Criterion (BIC)


The Bayesian Information Criterion is also known as Akaike's Bayesian Information Criterion
(ABIC) and the Schwarz Bayesian Criterion (SBC). This index is similar to the AIC, but the
penalty against complex models is especially pronounced--even more pronounced than for the
BCC and CAIC indices. Furthermore, like the CAIC, a penalty against small samples is included.
The BIC's application to model selection in social research was developed by Raftery (1995).
Roughly, the BIC is the log of a Bayes factor of the target model compared to the saturated
model.

Determinants of which indices to use


Many other indices have also been developed. These indices include the GFI, AGFI, FMIN,
noncentrality parameter, and centrality index. The GFI and, to a lesser extent, the FMIN used to
be very popular, but their use has dwindled recently.
Some indices are especially sensitive to sample size. For example, fit indices overestimate the fit
when the sample size is small--below 200, for example. Nevertheless, RMSEA and CFI seem to
be less sensitive to sample size (Fan, Thompson, and Wang, 1999).
References
Anderson, J. C., & Gerbing, D. W. (1984). The effect of sampling error on convergence,
improper solutions and goodness-of-fit indices for maximum likelihood confirmatory factor
analysis. Psychometrika, 49, 155-173.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,
238-246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Bentler, P. M., & Mooijaart, A. (1989). Choice of structural model via parsimony: A rationale
based on precision. Psychological Bulletin, 106, 315-317.
Bollen, K. A. (1989). Structural equations with latent variables. NY: Wiley.
Bollen, K. A. (1990). Overall fit in covariance structure models: Two types of sample size
effects. Psychological Bulletin, 107, 256-259.
Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance
structures. Multivariate Behavioral Research, 24, 445-455.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen &
J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
Burnham, K. P., & Anderson, D. R. (1998). Model selection and inference: A practical
information-theoretic approach. New York: Springer-Verlag.
Byrne, B. M. (1994). Structural equation modeling with EQS and EQS/Windows. Thousand
Oaks, CA: Sage Publications.
Cheung, G. W. & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255.
Fan, X., Thompson, B., & Wang, L. (1999). Effects of sample size, estimation method, and
model specification on structural equation modeling fit indexes. Structural Equation Modeling,
6, 56-83.
Hipp J. R., & Bollen K. A. (2003). Model fit in structural equation models with censored, ordinal,
and dichotomous variables: testing vanishing tetrads. Sociological Methodology, 33, 267-305.
Hu, L. T., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural
equation modeling: Concepts, issues, and applications (pp. 76-99). Thousand Oaks, CA: Sage.
Kline, R. B. (1998). Principles and practice of structural equation modeling. NY: Guilford
Press.
Jaccard, J., & Wan, C. K. (1996). LISREL approaches to interaction effects in multiple
regression. Thousand Oaks, CA: Sage Publications.
Joreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.),
Testing structural equation models (pp. 294-316). Newbury Park, CA: Sage.
Marsh, H. W., Balla, J. R., & Hau, K. T. (1996). An evaluation of incremental fit indexes: A
clarification of mathematical and empirical properties. In G. A. Marcoulides & R. E. Schumacker
(Eds.), Advanced structural equation modeling techniques (pp. 315-353). Mahwah, NJ: Lawrence
Erlbaum.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory
factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.
Marsh, H. W., & Hau, K. T. (1996). Assessing goodness of fit: Is parsimony always desirable?
Journal of Experimental Education, 64, 364-390.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25,
111-164.
Raykov, T. (2000). On the large-sample bias, variance, and mean squared error of the
conventional noncentrality parameter estimator of covariance structure models. Structural
Equation Modeling, 7, 431-441.
Raykov, T. (2005). Bias-corrected estimation of noncentrality parameters of covariance structure
models. Structural Equation Modeling, 12, 120-129.
Schumacker, R. E., & Lomax, R. G. (2004). A beginner's guide to structural equation modeling,
Second edition. Mahwah, NJ: Lawrence Erlbaum Associates.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25, 173-180.
Steiger J. H. (2000). Point estimation, hypothesis testing and interval estimation using the
RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7,
149-162.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor
analysis. Psychometrika, 38, 1-10.
Ullman, J. B. (2001). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell, Using
multivariate statistics (4th ed., pp. 653-771). Needham Heights, MA: Allyn & Bacon.
