Luo Et Al. - 2019 - Exploratory Factor Analysis (EFA) Programs in R
To cite this article: Lan Luo, Cara Arizmendi & Kathleen M. Gates (2019) Exploratory Factor
Analysis (EFA) Programs in R, Structural Equation Modeling: A Multidisciplinary Journal, 26:5,
819-826, DOI: 10.1080/10705511.2019.1615835
SOFTWARE REVIEW
We provide a brief overview of two R packages that can conduct exploratory factor analysis
(EFA): psych and EFAutilities. After introducing EFA and the exemplar data used in this
paper, we discuss best practices for EFA. Next, we describe the approaches used in the two
packages for EFA. During this explanation, we provide sample code and discuss the usage
and results on two empirical datasets. Finally, we highlight the similarities and distinctions
between the packages in modeling EFA.
Correspondence should be addressed to Lan Luo, University of North Carolina at Chapel Hill, 342 Davie Hall, Chapel Hill, NC 27599. E-mail: lanl27@live.unc.edu
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/hsem.

Factor analysis is generally used to explore the underlying correlations and structure of a set of measured variables (Brown, 2001, p. 184). It is used very often in fields that frequently encourage and analyze latent constructs, such as psychology and sociology. Factor analysis itself generally appears in the form of either exploratory factor analysis (EFA) or confirmatory factor analysis (CFA). EFA allows the observed (i.e., measured) variables to be associated with any and multiple latent factors, whereas CFA imposes restrictions regarding which variables load onto which factors. CFA requires a priori knowledge regarding the theoretical structure, whereas EFA aims to identify this structure in a data-driven manner. This paper focuses on EFA and evaluates the primary approaches available for conducting this analysis within the R software framework (R Core Team, 2018).

EFA is a useful method to describe the shared variability among measured variables, and to investigate potential underlying latent factors through measurable variables. It is useful for data reduction and to uncover unknown patterns of relations. EFA is also especially useful and convenient when one has tried a variety of hypothesized CFA models but none of the proposed models fit well. By giving the model a high level of flexibility, it is able to provide insights that could lead researchers to better measurements of the underlying constructs, in turn influencing the inferences made from the analysis. Because of the highly flexible nature of EFA, open-source software, such as R, is a good platform with which to perform EFA analysis. This paper provides overviews of two packages in R that offer specific functions for EFA: the psych package (Revelle, 2018) and the EFAutilities package (Zhang, Jiang, Hattori, & Trichtinger, 2018). In the following, we first describe the illustrative data used throughout the paper, then provide insights into the options and methods available in the two R packages, followed by a comparison and discussion of the output and results.

ILLUSTRATIVE DATA

We will use two empirical data sets to illustrate the functionality of psych and EFAutilities. One is from the classic Holzinger and Swineford study (Holzinger & Swineford, 1939, N = 301), used in many articles and books on Structural Equation Modeling (SEM) and factor analysis (e.g., Jöreskog, 1971; Mulaik, 2009, p. 341; Tucker & Lewis, 1973) as well as in some manuals for
related software packages (e.g., lavaan; Rosseel, 2012). It consists of mental ability scores of seventh- and eighth-grade students from two different schools. The proposed three latent variables for mental ability are visual ability, textual ability, and speed. The original dataset has scores derived from 26 test questions, but a smaller subset with only 9 test questions is more widely used in the literature (e.g., Jöreskog, 1969). The dataset used in this article is the above-mentioned 9-item data and can be accessed in the data object “HolzingerSwineford1939” provided by the lavaan package (Rosseel, 2012).

The code to select the subset used in this article is as follows:

    library("lavaan")
    data("HolzingerSwineford1939")
    # Keep the nine test-score variables x1 to x9
    variables1 <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9")
    H1939Xs <- HolzingerSwineford1939[variables1]

To better examine how EFA is computed across packages, we will also use the Composite Scores of the Chinese Personality Assessment Inventory (CPAI537; N = 537), provided by the EFAutilities package. This is a part of a large survey on marital satisfaction (Luo et al., 2008). The original dataset consists of data from 537 urban Chinese couples in their first year of marriage, and the subset used in this article includes 28 composite scores of the CPAI (Cheung et al., 1996) for the wives only. This dataset has a sample size somewhat larger than that of HolzingerSwineford1939 (537 versus 301) but contains around three times the number of variables (28 versus 9). By including a more complex dataset, there is a greater probability of observing the potential differences in both model results and computational time between different packages.

BEST PRACTICES FOR EFA

Before evaluating the packages, it is important to understand the best practices and current recommendations regarding EFA. This provides a baseline from which to investigate the extent to which each package contains options that allow the user to adjust parameters and criteria based on emerging research. Two primary issues are of concern when conducting EFA: determining the number of factors and selecting the final structure for how the measured variables relate to the factors.

When little is known in substantive areas, EFA can provide insight into underlying patterns in the data (Bollen, 1989, p. 228). Primarily, EFA provides a suggested number of latent factors (Zwick & Velicer, 1986). A number of methods exist for selecting the number of factors in EFA. Some of the more straightforward approaches are the Kaiser rule (1960), the scree test (Cattell, 1966), and Horn’s parallel analysis (1965). Each of these methods utilizes a stopping rule determining the number of latent factors based on the eigenvalues of the correlation matrix, which can be represented visually on a scree plot. Researchers are usually advised not to use one approach as the single cut-off criterion. For example, Ruscio and Roche (2012) noted that the Kaiser rule has a tendency to over-extract factors. The root-mean-square error of approximation (RMSEA; Browne & Cudeck, 1992), another method to determine the number of factors, is an estimate of model misfit per degree of freedom in the population. Akaike’s information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978) are two other methods that are based on model likelihood, prioritizing parsimony. Different methods may arrive at different optimal numbers of factors. Deciding which solution is best is left up to the researcher, either by exploring multiple solutions to find consensus or by subjective selection. It is suggested that the researcher should consider the reasonable number of factors that are worthwhile to retain, and not focus on getting the “correct” number of factors for the “true” model, which many researchers have suggested does not exist in most cases (e.g., Cattell, 1966; MacCallum, 2003). The two packages being discussed here provide different criteria. The psych package offers three approaches to calculate the number of factors: Very Simple Structure (Revelle & Rocklin, 1979), Minimum Average Partial correlation (Velicer, 1976), and parallel analysis criteria (fa.parallel). EFAutilities offers the Kaiser rule.

The literature on selecting the final structure is a bit less clear and homogeneous. Typically, researchers will consider an observed variable as loading on a given latent factor if the standardized factor loading is above a certain threshold. In practice, this threshold varies and is largely related to the sample size (Stevens, 2002). Tabachnick and Fidell (2007) provided a rule of thumb that for a sample size of 300, a statistically meaningful factor loading would need to be above .32. Peterson (2000) conducted a meta-analysis and found that the most common cutoff value is .40; it is not only the value that one-third of the studies used but also the average cutoff value.

R PACKAGES FOR EFA

This section provides an overview of the fa() function available in the psych package (Revelle, 2018) and the efa() function from the EFAutilities package (Zhang et al., 2018). Other EFA-related functions (e.g., plotting functions) included in those packages will also be discussed.
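
As a rough illustration of the two functions named above, the following sketch fits a three-factor model to the nine-variable Holzinger and Swineford subset with each package. The choice of three factors follows the visual, textual, and speed abilities proposed for these data. The maximum likelihood setting (fm = "ml"), the object names, and the package guard are assumptions made for this example and are not the article's own code. Rotation is left at each function's default.

```r
# Assumes lavaan, psych, and EFAutilities are installed; otherwise nothing is fitted.
pkgs <- c("lavaan", "psych", "EFAutilities")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  # Nine test-score variables (x1 to x9) from the lavaan example data.
  H1939Xs <- lavaan::HolzingerSwineford1939[paste0("x", 1:9)]

  # psych: the number of factors is supplied by the user.
  # The default oblimin rotation relies on the GPArotation package.
  fit_fa <- psych::fa(H1939Xs, nfactors = 3, fm = "ml")

  # EFAutilities: same extraction method, default CF-varimax rotation.
  fit_efa <- EFAutilities::efa(x = as.matrix(H1939Xs), factors = 3, fm = "ml")

  print(fit_fa$loadings)  # rotated loadings from psych
  print(fit_efa)          # output as printed by EFAutilities
}
```

Because the two defaults differ (oblimin versus CF-varimax), the rotated loadings from these two calls are not expected to match. Comparable output requires setting equivalent extraction and rotation options in both functions.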
FIGURE 3 H1939 data independent variables correlation graph by psych.

efa() in the EFAutilities package

The EFAutilities package (Zhang et al., 2018), as indicated by the package name, is a small R package specifically designed for modeling EFA. It computes standard errors for parameter estimates, factor loadings, and correlations under various conditions (e.g., rotation). EFAutilities uses a different default rotation, CF-varimax, from psych’s default, oblique oblimin, although the other possible options in both packages largely overlap. It also contains two datasets, one of which as
TABLE 1 Factor Rotation and Factoring Method Options Available
argument    fa()    efa()
both efa() and fa(), the minimum amount of input from the user for this demonstration was the same: the dataset being analyzed, the number of factors (for fa() only), the factor extraction method, and the rotation method. Although they have different default arguments (e.g., the default rotation criterion for fa() is oblimin and for efa() is CF-varimax; possible options for the main arguments are also shown in Table 1), they produce identical results when set to use the same methods for modeling and rotation, with output differences almost only due to rounding. Both psych and EFAutilities proved to be very useful and powerful packages for EFA.

The biggest difference between the two packages for EFA is probably the method used to determine the number of factors. Indeed, there are a variety of methods to do so; although one will usually get a very similar suggested number of factors regardless of the method, there are times when they will suggest different values. The psych package provides more flexibility in terms of options for deciding the number of factors. It offers three approaches to arriving at the suggested number of latent factors. The EFAutilities package offers only one option for identifying the number of factors. The easy-to-use efa() function calculates the number of factors using the Kaiser method. This difference reflects the preferences of the package authors.

Although the EFAutilities package does not offer alternative approaches for calculating the number of factors, it offers four methods to calculate standard errors for rotated factor loadings and factor correlations, which have been suggested as more justifiable in making EFA decisions (Cudeck & O’Dell, 1994). Standard errors in EFA can facilitate both other EFA decisions (e.g., the number of factors to retain) and the presentation of the final model structure (Zhang, 2014). However, standard errors in EFA have not been routinely examined, likely due to both the difficulty in their estimation and researchers’ unawareness of their availability. By providing the option to calculate standard errors in EFAutilities, their usage is encouraged in arriving at the final model structure.

Computational time was calculated using rbenchmark (Kusnierczyk, 2012). The computational time is the average of 1000 runs. The fa() function was faster than the efa() function for both datasets, and the difference is larger when the sample size increases. It is also not surprising that both functions took a longer time to process when the dataset is larger. Although it is unwise to draw conclusions based on only two datasets, the results here suggest that as the sample size and number of variables being assessed increase, efa(), compared to fa(), will likely take a longer time to process.

In closing, the two packages are similar in that they arrive at the same solution when argument options are set to be equivalent and run on commonly used exemplar data where factor structures are clear. However, they offer differences that users should be mindful of when selecting an approach. In practice, FA rarely is conducted as solely exploratory or solely confirmatory but often falls somewhere in between these extremes. With this in mind, researchers can also utilize other packages, such as lavaan (Rosseel, 2012), to perform FA and adjust models in an exploratory manner. Multiple packages and software can also be used simultaneously to better provide insights on how the model can be improved and better specified, which goes along with the intentions with which EFA is typically conducted.

REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest, Hungary: Akademia Kiado.
Bollen, K. A. (1989). Wiley series in probability and mathematical statistics. Applied probability and statistics section. Structural equations with latent variables. Oxford, England: John Wiley & Sons. doi:10.1002/9781118619179
Brown, J. D. (2001). Using surveys in language programs. Cambridge University Press.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21, 230–258.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276. doi:10.1207/s15327906mbr0102_10
Cheung, F. M., Leung, K., Fan, R., Song, W., Zhang, J., & Zhang, J. (1996). Development of the Chinese personality assessment inventory (CPAI). Journal of Cross-Cultural Psychology, 27, 181–199. doi:10.1177/0022022196272003
Cudeck, R., & O’Dell, L. L. (1994). Applications of standard error estimates in unrestricted factor analysis: Significance tests for factor loadings and correlations. Psychological Bulletin, 115, 475–487. doi:10.1037/0033-2909.115.3.475
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: LEA.
Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202. doi:10.1007/BF02289343
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426. doi:10.1007/BF02291366
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141–151. doi:10.1177/001316446002000116
Kusnierczyk, W. (2012). rbenchmark: Benchmarking routine for R. R package version 1.0.0. Retrieved from https://CRAN.R-project.org/package=rbenchmark
Luo, S., Chen, H., Yue, G., Zhang, G., Zhaoyang, R., & Xu, D. (2008). Predicting marital satisfaction from self, partner, and couple characteristics: Is it me, you, or us? Journal of Personality, 76, 1231–1266. doi:10.1111/j.1467-6494.2008.00520.x
MacCallum, R. C. (2003). Working with imperfect models. Multivariate Behavioral Research, 38, 113–139. doi:10.1207/S15327906MBR3801_5
Mulaik, S. A. (2009). Foundations of factor analysis. Chapman and Hall/CRC.
Peterson, R. A. (2000). A meta-analysis of variance accounted for and factor loadings in exploratory factor analysis. Marketing Letters, 11, 261–275. doi:10.1023/A:1008191211004
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Revelle, W. (2018). psych: Procedures for psychological, psychometric, and personality research. Retrieved from https://CRAN.R-project.org/package=psych
Revelle, W., & Rocklin, T. (1979). Very simple structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14, 403–414. doi:10.1207/s15327906mbr1404_2
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. Retrieved from http://www.jstatsoft.org/v48/i02/
Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282. doi:10.1037/a0025697
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. doi:10.1214/aos/1176344136
Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Hillsdale, NJ: Erlbaum.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Allyn & Bacon.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10. doi:10.1007/BF02291170
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41, 321–327. doi:10.1007/BF02293557
Zhang, G. (2014). Estimating standard errors in exploratory factor analysis. Multivariate Behavioral Research, 49, 339–353. doi:10.1080/00273171.2014.908271
Zhang, G., Jiang, G., Hattori, M., & Trichtinger, L. (2018). EFAutilities: Utility functions for exploratory factor analysis. Retrieved from https://CRAN.R-project.org/package=EFAutilities
Zwick, W. R., & Velicer, W. F. (1986). Factors influencing five rules for determining the number of components to retain. Psychological Bulletin, 99, 432–442. doi:10.1037/0033-2909.99.3.432