Factor Structure of the TOEFL Internet-Based Test
This construct validation study investigated the factor structure of the Test of English as a Foreign Language™ Internet-based test (TOEFL® iBT). An item-level confirmatory factor analysis was conducted for a test form completed by participants in a field study. A higher-order factor model was identified, with a higher-order general factor (ESL/EFL ability) and four first-order factors for reading, listening, speaking and writing. Integrated Speaking and Writing tasks, which require language processing in multiple modalities, defined the target modalities (speaking and writing). These results broadly supported the current practice of reporting a total score and four scores corresponding to the modalities for the test, as well as the test design that permits the integrated tasks to contribute only to the scores of the target modalities.
Address for correspondence: Yasuyo Sawaki, Educational Testing Service, MS 10-R, Center for
Validity Research, Rosedale Road, Princeton, NJ 08541, USA; email: ysawaki@ets.org
© 2009 SAGE Publications (Los Angeles, London, New Delhi and Singapore) DOI:10.1177/0265532208097335
and two studies conducted by Hale and his associates (Hale et al.,
1988; Hale, Rock & Jirele, 1989) studied the factor structure of the
paper-based TOEFL test, which consisted of three sections:
Listening Comprehension, Structure and Written Expression, and
Vocabulary and Reading Comprehension. Despite some differences
in the methodologies employed, these studies identified multiple
correlated factors that included a distinct Listening Comprehension
factor, while the number and makeup of the factors representing the
other sections varied across the studies.
A recent multiple-group confirmatory factor analysis study by Stricker, Rock and Lee (2005) examined the factor structure of a prototype of the TOEFL iBT called LanguEdge for three native language groups (Arabic, Chinese and Spanish). Similar to the TOEFL iBT, this prototype consisted of four sections (Reading, Listening, Speaking and Writing). Item parcels based on multiple-choice items in the Reading and Listening sections, and holistic ratings for individual Speaking and Writing items, were analyzed. A correlated two-factor model, with one factor for Speaking and the other a combination of Reading, Listening and Writing, was identified for all three language groups. A simultaneous analysis of this model for the three groups also suggested invariance of factor loadings and error variances, but differences in the correlations between the two factors, across the three language groups.
Given the similarity of the TOEFL iBT design to that of LanguEdge, the results of the present study may be expected to be similar to those of Stricker et al. (2005) to some extent. However, the design of the TOEFL iBT is not identical to that of LanguEdge, as described subsequently. In addition, Stricker et al. (2005) analyzed item parcels rather than individual items. For these reasons, the results of this study may depart from those of Stricker et al.
The present item-level factor analysis study of the TOEFL iBT investigated the factor structure of the new test with a particular focus on two issues. The first was the structure of the entire test. It is
of theoretical interest to see whether the present study supports the
current consensus in the field, i.e., the multicomponential nature of
English language ability across modalities, for the new test. Such
analyses of the factor structure of the test would shed light on the
appropriateness of the TOEFL iBT score-reporting policy, which
provides four section scores corresponding to the four sections
(Reading, Listening, Speaking and Writing) and a composite score
(TOEFL iBT total score).
I Method
1 Data
The data analyzed in the present study were scored item responses in the Reading, Listening, Speaking and Writing sections of a TOEFL iBT test form administered in a field study conducted from November 2003 to February 2004. The paid participants were recruited from 31 countries in North America, Latin America, Africa, Asia, and Europe that accounted for about 80% of the 2001–2002 TOEFL testing volume (Chapelle, Enright & Jamieson, 2008). All these participants were required to complete a TOEFL iBT test form and a TOEFL Computer-based Test (CBT) form. The order of presentation of the iBT and CBT test forms was counterbalanced.
In total, 2,720 responses were available for the TOEFL iBT test form. A comparison of this sample with the 2002–2003 TOEFL candidates (ETS, 2003) showed that this sample was reasonably representative of the TOEFL population in terms of reported native countries of origin (Chapelle et al., 2008). The five largest groups in the sample were India (14.8%), China (13.9%), South Korea (10.1%), Japan (7.5%) and Taiwan (4.6%). Most participants (75.3%) took the test at overseas test centers.
The participants’ TOEFL CBT scores provide information about their English language ability levels. The CBT test scores are summarized in Table 1 for the participants and for CBT test takers in 2002–2003.
Results of one-sample t-tests for the section and total scores showed
Table 1  Summary statistics for TOEFL CBT and iBT scores (study sample and TOEFL 2002–2003 candidates)

                         Study sample        2002–2003 candidates^a
Section                  M        SD         M       SD
CBT
  Listening              17.67    5.97       20.9    5.3
  Structure/Writing      19.48    6.45       21.7    5.0
  Reading                18.30    5.99       21.8    4.9
  Total                  184.84   55.48      215     4.6
iBT
  Reading                17.04    6.99
  Listening              16.98    6.95
  Speaking               16.97    6.98
  Writing                16.05    6.67
  Total                  67.04    24.58

^a Educational Testing Service (2003).
that all of the means for the field study sample were significantly lower than those for the CBT population (Listening: t = 28.17, p < .05, df = 2719; Reading: t = 30.51, p < .05, df = 2719; Structure & Writing: t = 17.91, p < .05, df = 2719; Total: t = 28.35, p < .05, df = 2719). Moreover, the obtained Cohen’s (1988) d values indicated that the observed effects were medium for the Listening (d = .54), Reading (d = .59) and Total (d = .54) scores, while the effect was small for the Structure and Writing (d = .34) score. This finding needs to be kept in mind when considering the generalizability of the results to the TOEFL test-taking population.
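The comparison above pairs a one-sample t-test with Cohen's d, which divides the mean difference by the sample standard deviation. A minimal sketch, using made-up scores for illustration rather than the study data:

```python
import numpy as np
from scipy import stats

# Made-up section scores for illustration; not the field study data.
sample = np.array([17.2, 15.8, 19.1, 16.4, 18.0, 14.9, 17.5, 16.1])
pop_mean = 20.9  # hypothetical population mean to test against

# One-sample t-test of H0: the sample mean equals the population mean
t, p = stats.ttest_1samp(sample, pop_mean)

# Cohen's d for a one-sample design: mean difference in SD units
d = (sample.mean() - pop_mean) / sample.std(ddof=1)
```

By Cohen's (1988) conventions, |d| around .2 is small, .5 medium and .8 large, which is how the reported d values are labeled.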
d Writing: The Writing section included two tasks. One was an independent writing task that required examinees to support an opinion on a common topic. The other was an integrated writing task that required examinees to read a text, listen to a lecture that pertained to the topic, and then respond to a question on what they had read and heard. For each question, examinees were required to write their answers on a word processor. The total testing time for the Writing section was 50 minutes, with 30 minutes allocated to the independent writing task and 20 minutes to the integrated writing task. Each examinee’s response to each task was scored on a 0–5 scale by two trained raters. The final score on the item was the average of the
scores of the two raters. (If the two raters’ scores were discrepant by
more than one point, a third score by another rater was obtained and
used to determine the final score.)
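The two-rater scoring rule can be sketched as follows. How the third rating is combined in the discrepant case is not specified above, so the tie-breaking step here (averaging the third rating with the closer of the first two) is an assumption for illustration only:

```python
def writing_item_score(r1, r2, r3=None):
    """Final 0-5 score for one Writing task.

    Average of two ratings, unless they differ by more than one
    point; then a third rating resolves the score. The combination
    rule used here (third rating averaged with the closer of the
    first two) is an assumption, not the documented ETS procedure.
    """
    if abs(r1 - r2) <= 1:
        return (r1 + r2) / 2
    if r3 is None:
        raise ValueError("discrepant ratings require a third rating")
    closer = r1 if abs(r1 - r3) <= abs(r2 - r3) else r2
    return (closer + r3) / 2
```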
The section scores were scaled (a transformation of the sum of the
raw item scores) and ranged from 0 to 30. The total score was a sum
of the four scaled section scores and ranged from 0 to 120.
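The score structure (four 0–30 section scores summing to a 0–120 total) can be sketched as below. The operational scaling involves more than a simple linear map, so both the transformation and the raw maxima here are illustrative assumptions:

```python
def scale_section(raw_sum, raw_max):
    # Hypothetical linear rescaling of a raw-score sum to the 0-30
    # scale; the operational transformation is not this simple.
    return round(30 * raw_sum / raw_max)

# Illustrative raw sums and maxima, not the actual form's point counts.
section_scores = {
    "Reading": scale_section(28, 45),
    "Listening": scale_section(24, 34),
    "Speaking": scale_section(19, 24),
    "Writing": scale_section(7, 10),
}
total = sum(section_scores.values())  # the 0-120 composite
```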
The present field study form was representative of the content and format of operational TOEFL iBT test forms, with two exceptions: Unlike the field study test form, each operational TOEFL iBT test form uses separately timed subparts within the Reading and Listening sections and includes additional Reading or Listening sets. The Listening section in the field study form also had more non-science lectures and a different order of administration of the six conversation and lecture sets.
3 Analyses
The present study employed a confirmatory factor analysis (CFA) of items. A CFA with ordinal categorical data is appropriate for factor analysis of item responses where each item is scored dichotomously or polytomously. In this case a polychoric correlation matrix of the item response data is analyzed, unlike a conventional CFA of continuous variables (e.g., test section scores), which analyzes a variance-covariance matrix. Item-level factor analysis has been applied to language assessment data by a few previous researchers (e.g., Carr, 2003; Davidson, 1988; Swinton & Powers, 1980). This approach was particularly useful for conducting a fine-grained analysis of individual items and the relationships of the integrated items to the test sections.
At the outset, descriptive statistics for the items were examined. The main purpose of this analysis was to identify items with extremely high or low item difficulty values, which could be problematic in the calculation of a polychoric correlation matrix (McLeod, Swygert & Thissen, 2001). No item in the present test form was flagged as problematic in this respect. Thus, a polychoric correlation matrix of the item-level data for the 79 items was calculated by means of PRELIS 2.54 (Jöreskog & Sörbom, 2003a).
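PRELIS estimates each polychoric correlation by maximum likelihood under an underlying bivariate-normal model. A minimal two-step sketch of the idea (not the PRELIS implementation) for one pair of ordinal items coded 0..k-1:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric(x, y):
    """Two-step ML estimate of the polychoric correlation between two
    ordinal variables coded 0..k-1 (a sketch, not the PRELIS algorithm)."""
    kx, ky = int(x.max()) + 1, int(y.max()) + 1
    # Step 1: category thresholds from the marginal cumulative proportions
    tx = norm.ppf(np.cumsum(np.bincount(x, minlength=kx)) / len(x))[:-1]
    ty = norm.ppf(np.cumsum(np.bincount(y, minlength=ky)) / len(y))[:-1]
    ax = np.concatenate(([-10.0], tx, [10.0]))  # +/-10 stands in for +/-inf
    ay = np.concatenate(([-10.0], ty, [10.0]))
    counts = np.zeros((kx, ky))
    for i, j in zip(x, y):
        counts[i, j] += 1

    # Step 2: choose rho maximizing the likelihood of the observed table
    def negloglik(rho):
        mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
        ll = 0.0
        for i in range(kx):
            for j in range(ky):
                # cell probability: rectangle mass under the bivariate
                # normal, by inclusion-exclusion on the joint CDF
                pr = (mvn.cdf([ax[i + 1], ay[j + 1]])
                      - mvn.cdf([ax[i], ay[j + 1]])
                      - mvn.cdf([ax[i + 1], ay[j]])
                      + mvn.cdf([ax[i], ay[j]]))
                ll += counts[i, j] * np.log(max(pr, 1e-12))
        return -ll

    return minimize_scalar(negloglik, bounds=(-0.99, 0.99),
                           method="bounded").x
```

Estimating one such correlation per item pair and assembling them into a matrix yields the polychoric correlation matrix that the CFA then analyzes in place of a covariance matrix.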
The CFAs reported in the subsequent sections focused on the factor structure of the entire test. Prior to this analysis, a series of CFAs were conducted to investigate the relationships among the items at the section level. The results of the section-level CFAs showed that the abilities assessed in the four
For a sequential building of the CFA models for the entire test, the relative goodness of fit of competing models was evaluated. When two models were nested, Satorra and Bentler’s (2001) chi-square difference test procedure was used. Chi-square difference test results for model comparisons were always evaluated in conjunction with the goodness-of-fit criteria listed above (i.e., χ²S-B/df, GFI, NNFI, CFI, RMSEA, and ECVI).
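The scaled difference statistic of Satorra and Bentler (2001) is a closed-form correction of the ordinary chi-square difference. A sketch of the formula, with hypothetical fit values rather than any reported here:

```python
from scipy.stats import chi2

def sb_chisq_diff(T0, c0, df0, T1, c1, df1):
    """Satorra-Bentler (2001) scaled chi-square difference test.

    Model 0 is the more restricted (nested) model. T is the ordinary
    ML chi-square, c the scaling correction factor (ML chi-square
    divided by the scaled chi-square), df the degrees of freedom.
    """
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)  # scaling factor for the diff
    trd = (T0 - T1) / cd                      # scaled difference statistic
    ddf = df0 - df1
    return trd, ddf, chi2.sf(trd, ddf)
```

With equal scaling factors (c0 = c1 = 1) this reduces to the ordinary chi-square difference test.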
II Results
As already noted, the relationships between the integrated Speaking
and Writing tasks were specified as paths between the items and all
corresponding factors in all CFA models tested. One exception, how-
ever, was the integrated Writing task. Because of model identification
issues with the Writing section, which contained only two items, the
integrated Writing task could not be completely modeled. Although
this task involved both listening and reading, it was modeled to load only on the Reading and Writing factors. A summary of goodness-of-fit statistics for the five CFA models is shown in Table 2.
1 See Chen, West and Sousa (2006), however, for an example where a bi-factor model was adopted.
Table 2  Summary of goodness-of-fit statistics for the five CFA models

Model                         df     χ²S-B      χ²S-B/df  GFI^a  NNFI^b  CFI^c  RMSEA^d (90% CI)   ECVI^e (90% CI)     Offending estimates
Bi-factor model               2910   5158.48    1.77      .82    .98     .98    .017 (.016–.018)   2.08 (2.01–2.16)    None
Correlated four-factor model  2989   6754.78    2.26      .78    .98     .98    .022 (.021–.022)   2.61^f (2.52–2.70)  None
Single-factor model           3002   11314.98   3.78      .69    .97     .97    .032 (.031–.033)   4.28^f (4.16–4.40)  —
Correlated two-factor model   3001   8684.64    3.55      .74    .97     .97    .026 (.026–.027)   3.31^f (3.21–3.41)  None
Higher-order factor model     2991   6855.01    2.29      .78    .98     .98    .022 (.021–.022)   2.65^f (2.56–2.74)  —

Note: 90% confidence intervals for RMSEA and ECVI appear in parentheses.
^a GFI = Goodness of Fit Index. ^b NNFI = Non-Normed Fit Index. ^c CFI = Comparative Fit Index. ^d RMSEA = Root Mean Square Error of Approximation. ^e ECVI = Expected Cross-Validation Index. ^f The ECVI was significantly larger than that of the saturated model.
Table 3  Factor loadings, error variances and squared multiple correlations

Reading
Item   Loading   Error variance   SMC^b
1      .52^a     .73              .27
2      .56       .68              .32
3      .66       .56              .44
4      .67       .56              .45
5      .46       .79              .21
6      .57       .67              .33
7      .47       .78              .22
9      .71       .49              .51
10     .30       .91              .09
11     .51       .74              .26
12     .72       .48              .52
13     .55       .70              .30
14     .65       .57              .43
15     .38       .86              .14
16     .65       .57              .43
17     .60       .64              .36
18     .53       .72              .28
19     .76       .42              .58
20     .58       .67              .33
21     .66       .57              .43
22     .49       .76              .24
23     .69       .53              .47
24     .66       .57              .43
25     .59       .65              .35
26     .57       .67              .33
27     .58       .66              .34
28     .58       .66              .34
29     .55       .70              .30
30     .43       .82              .18
31     .75       .43              .57
32     .34       .88              .12
33     .64       .59              .41
34     .43       .81              .19
35     .44       .81              .19
36     .58       .66              .34
37     .48       .77              .23
38     .43       .81              .19
39     .67       .55              .45

Listening
Item   Loading   Error variance   SMC^b
1      .60^a     .64              .36
2      .68       .54              .46
3      .70       .51              .49
4      .52       .73              .27
5      .65       .58              .42

Speaking
Item          Reading   Listening   Speaking   Error variance   SMC^b
1 (S)                               .78^a      .39              .62
2 (S)                               .78        .39              .62
3 (R/L/S)^c   .01       .21         .67        .30              .70
4 (R/L/S)^c   .09       .22         .76        .25              .75
5 (L/S)^d               .09         .80        .26              .74
6 (L/S)^d               .14         .71        .33              .67

Writing

Note: All loadings were significant (|t| > 1.96), except for the loading of Speaking 3 (R/L/S) on the Reading factor.
^a Fixed for factor scaling. ^b Squared multiple correlation. ^c Integrated Reading/Listening/Speaking task. ^d Integrated Listening/Speaking task. ^e Integrated Reading/Listening/Writing task.
3 Limitations
Some limitations of the present study should be noted. First, the
sample size was only marginally acceptable for the item-level CFA of
79 items in the test, and too small for separate analyses of subgroups,
such as native language groups and ability groups. Second, this study
used data collected in a field study of the TOEFL iBT. There may be
important differences between the research participants in this study
and examinees taking the TOEFL iBT operationally, including native
languages, ability level, test-taking motivation, and familiarity with
the test. Third, the model fit was not ideal. For all of these reasons, the results of this study should be interpreted with caution, and a replication should be conducted for different groups, with examinees taking the TOEFL iBT in routine test administrations.
Acknowledgements
The authors’ special thanks go to Fred Davidson, Neil Dorans,
Richard M. Luecht and three anonymous reviewers for their careful
review of an earlier version of this manuscript. Any opinions expressed in this article are those of the authors and not necessarily those of Educational Testing Service.
IV References
Bachman, L. F. (2005). Building and supporting a case for test use. Language
Assessment Quarterly, 2, 1–34.
Bachman, L. F., Davidson, F., Ryan, K., & Choi, I.-C. (1995). An investiga-
tion into the comparability of two tests of English as a foreign language:
The Cambridge-TOEFL comparability study. Cambridge: Cambridge
University Press.
Bachman, L. F., & Palmer, A. (1981). The construct validation of the FSI oral
interview. Language Learning, 31, 67–86.
Bachman, L. F., & Palmer, A. (1982). The construct validation of some com-
ponents of communicative proficiency. TESOL Quarterly, 16, 449–465.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford:
Oxford University Press.
Bagozzi, R. P., & Yi, Y. (1988). On the evaluation of structural equation models. Journal of the Academy of Marketing Science, 16, 74–94.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model
fit. In K. A. Bollen, & J. S. Long (Eds.), Testing structural equation mod-
els (pp. 136–162). Newbury Park, CA: Sage.
Carr, N. T. (2003). An investigation into the structure of text characteristics and reader abilities in a test of second language reading. Unpublished doctoral dissertation, University of California, Los Angeles.
Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller,
Jr. (Ed.), Issues in language testing research (pp. 80–107). Rowley, MA:
Newbury House.
Chapelle, C., Enright, M. K., & Jamieson, J. (Eds.) (2008). Building a valid-
ity argument for the Test of English as a Foreign Language. New York:
Routledge.
Chapelle, C., Grabe, W., & Berns, M. (2000). Communicative language pro-
ficiency: Definition and implications for TOEFL 2000 (TOEFL
Monograph Series MS-10). Princeton, NJ: Educational Testing Service.
Chen, F. F., West, S. G., & Sousa, K. H. (2006). Comparison of bifactor and
second-order models of quality of life. Multivariate Behavioral
Research, 41, 189–225.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
(2nd ed.). Hillsdale, NJ: Erlbaum.
Davidson, F. D. (1988). An exploratory modeling of the trait structures of some existing language test datasets. Unpublished doctoral dissertation, University of California, Los Angeles.
Educational Testing Service (2003). TOEFL test and score data summary:
2002–03 test year data. Princeton, NJ: Educational Testing Service.
Farhady, H. (1983). On the plausibility of the unitary language proficiency
factor. In J. W. Oller, Jr. (Ed.), Issues in language testing research
(pp. 11–28). Rowley, MA: Newbury House.
Hale, G. A., Rock, D.A., & Jirele, T. (1989). Confirmatory factor analysis of
the Test of English as a Foreign Language (TOEFL Research Report
No. 32). Princeton, NJ: Educational Testing Service.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., &
Oller, J. W. Jr. (1988). Multiple-choice cloze items and the Test of
English as a Foreign Language (TOEFL Research Report No. 26).
Princeton, NJ: Educational Testing Service.
Hoyle, R. H., & Panter, A. T. (1995). Writing about structural equation models.
In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues,
and applications (pp. 158–176). Thousand Oaks, CA: Sage.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen,
& J. S. Long (Eds.), Testing structural equation models (pp. 294–316).
Newbury Park, CA: Sage.
Jöreskog, K., & Sörbom, D. (2003a). PRELIS (Version 2.54) [Computer
software]. Chicago, IL: Scientific Software International.
Jöreskog, K., & Sörbom, D. (2003b). LISREL (Version 8.54) [Computer
software]. Chicago, IL: Scientific Software International.
Kline, R. B. (1998). Principles and practice of structural equation modeling.
New York: Guilford.
Kunnan, A. (1995). Test taker characteristics and test performance: A struc-
tural equation modeling approach. Cambridge: Cambridge University
Press.
Lewkowicz, J. A. (1997). The integrated testing of a second language. In C.
Clapham, & D. Corson (Eds.), Encyclopedia of language and education,
Vol. 7: Language testing and assessment (pp. 121–130). Dordrecht, The
Netherlands: Kluwer Academic.
Manning, W. H. (1987). Development of cloze-elide tests of English as a
second language (TOEFL Research Report RR-87–18). Princeton, NJ:
Educational Testing Service.
McLeod, L. D., Swygert, K. A., & Thissen, D. (2001). Factor analysis for
items scored in two categories. In D. Thissen, & H. Wainer (Eds.), Test
scoring (pp. 189–216). Mahwah, NJ: Lawrence Erlbaum.
Oller, J. W., Jr. (1976). Evidence of a general language proficiency factor: An expectancy grammar. Die Neuen Sprachen, 76, 165–174.
Oller, J. W., Jr. (1983). A consensus for the eighties? In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 351–356). Rowley, MA:
Newbury House.
Oller, J. W., Jr., & Hinofotis, F. A. (1980). Two mutually exclusive hypothe-
ses about second language ability: Factor analytic studies of a variety of
language subtests. In J. W. Oller, Jr., & K. Perkins (Eds.), Research in
language testing (pp. 13–23). Rowley, MA: Newbury House.
Raykov, T., & Marcoulides, G. A. (2000). A first course in structural equation modeling. Mahwah, NJ: Lawrence Erlbaum.
Rindskopf, D., & Rose, T. (1988). Some theory and applications of confirma-
tory second-order factor analysis. Multivariate Behavioral Research, 23,
51–67.
Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL® Internet-based test (iBT): Exploration in a field trial sample (TOEFL iBT Research Report No. TOEFLiBT-04). Princeton, NJ: Educational Testing Service.