Factor Structure of the TOEFL Internet-Based Test
This construct validation study investigated the factor structure of the Test of English as a Foreign Language™ Internet-based test (TOEFL® iBT). An item-level confirmatory factor analysis was conducted for a test form completed by participants in a field study. A higher-order factor model was identified, with a higher-order general factor (ESL/EFL ability) and four first-order factors for reading, listening, speaking and writing. Integrated Speaking and Writing tasks, which require language processing in multiple modalities, defined the target modalities (speaking and writing). These results broadly supported the current practice of reporting a total score and four scores corresponding to the modalities for the test, as well as the test design that permits the integrated tasks to contribute only to the scores of the target modalities.
Address for correspondence: Yasuyo Sawaki, Educational Testing Service, MS 10-R, Center for
Validity Research, Rosedale Road, Princeton, NJ 08541, USA; email: ysawaki@ets.org
© 2009 SAGE Publications (Los Angeles, London, New Delhi and Singapore) DOI:10.1177/0265532208097335
and two studies conducted by Hale and his associates (Hale et al.,
1988; Hale, Rock & Jirele, 1989) studied the factor structure of the
paper-based TOEFL test, which consisted of three sections:
Listening Comprehension, Structure and Written Expression, and
Vocabulary and Reading Comprehension. Despite some differences
in the methodologies employed, these studies identified multiple
correlated factors that included a distinct Listening Comprehension
factor, while the number and makeup of the factors representing the
other sections varied across the studies.
A recent multiple-group confirmatory factor analysis study by Stricker, Rock and Lee (2005) examined the factor structure of a prototype of the TOEFL iBT called LanguEdge for three native language groups (Arabic, Chinese and Spanish). Similar to the TOEFL iBT, this prototype consisted of four sections (Reading, Listening, Speaking and Writing). Item parcels based on multiple-choice items in the Reading and Listening sections, and holistic ratings for individual Speaking and Writing items, were analyzed. A correlated two-factor model, with one factor for Speaking and the other a combination of Reading, Listening and Writing, was identified for all three language groups. A simultaneous analysis of this model for the three groups also suggested invariance of factor loadings and error variances, but differences in the correlations between the two factors, across the three language groups.
Given the similarity of the TOEFL iBT design to that of LanguEdge, the results of the present study may be expected to be similar to those of Stricker et al. (2005) to some extent. However, the design of the TOEFL iBT is not identical to that of LanguEdge, as described subsequently. In addition, Stricker et al. (2005) analyzed item parcels rather than individual items. For these reasons, the results of this study may depart from those of Stricker et al.
The present item-level factor analysis study of the TOEFL iBT investigated the factor structure of the new test with a particular focus on two issues. The first was the structure of the entire test. It is
of theoretical interest to see whether the present study supports the
current consensus in the field, i.e., the multicomponential nature of
English language ability across modalities, for the new test. Such
analyses of the factor structure of the test would shed light on the
appropriateness of the TOEFL iBT score-reporting policy, which
provides four section scores corresponding to the four sections
(Reading, Listening, Speaking and Writing) and a composite score
(TOEFL iBT total score).
I Method
1 Data
The data analyzed in the present study were scored item responses in the Reading, Listening, Speaking and Writing sections of a TOEFL iBT test form administered in a field study conducted from November 2003 to February 2004. The paid participants were recruited from 31 countries in North America, Latin America, Africa, Asia, and Europe that accounted for about 80% of the 2001–2002 TOEFL testing volume (Chapelle, Enright & Jamieson, 2008). All these participants were required to complete a TOEFL iBT test form and a TOEFL Computer-based Test (CBT) form. The order of presentation of the iBT and CBT test forms was counterbalanced.
In total, 2,720 responses were available for the TOEFL iBT test form. A comparison of this sample with the 2002–2003 TOEFL candidates (ETS, 2003) showed that this sample was reasonably representative of the TOEFL population in terms of reported native countries of origin (Chapelle et al., 2008). The five largest groups in the sample were India (14.8%), China (13.9%), South Korea (10.1%), Japan (7.5%) and Taiwan (4.6%). Most participants (75.3%) took the test at overseas test centers.
The participants’ TOEFL CBT scores provide information about their English language ability levels. The CBT test scores are summarized in Table 1 for the participants and for CBT test takers in 2002–2003.
Results of one-sample t-tests for the section and total scores showed
Table 1  Summary statistics for TOEFL CBT and iBT scores (study sample and TOEFL 2002–2003 candidates)

                         Study sample        2002–2003 candidates^a
Section                  M        SD         M       SD
CBT
  Listening              17.67    5.97       20.9    5.3
  Structure/Writing      19.48    6.45       21.7    5.0
  Reading                18.30    5.99       21.8    4.9
  Total                  184.84   55.48      215     4.6
iBT
  Reading                17.04    6.99
  Listening              16.98    6.95
  Speaking               16.97    6.98
  Writing                16.05    6.67
  Total                  67.04    24.58

^a Educational Testing Service (2003).
that all of the means for the field study sample were significantly lower than those for the CBT population (Listening: t = 28.17, p < .05, df = 2719; Reading: t = 30.51, p < .05, df = 2719; Structure & Writing: t = 17.91, p < .05, df = 2719; Total: t = 28.35, p < .05, df = 2719). Moreover, the obtained Cohen’s (1988) d values indicated that the observed effects were medium for the Listening (d = .54), Reading (d = .59) and Total (d = .54) scores, while the effect was small for the Structure and Writing (d = .34) score. This finding needs to be kept in mind when considering the generalizability of the results to the TOEFL test-taking population.
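The comparison above pairs a one-sample t-test with Cohen's d, which divides the mean difference by the sample standard deviation. A minimal sketch, using made-up scores for illustration rather than the study data:

```python
import numpy as np
from scipy import stats

# Made-up section scores for illustration; not the field study data.
sample = np.array([17.2, 15.8, 19.1, 16.4, 18.0, 14.9, 17.5, 16.1])
pop_mean = 20.9  # hypothetical population mean to test against

# One-sample t-test of H0: the sample mean equals the population mean
t, p = stats.ttest_1samp(sample, pop_mean)

# Cohen's d for a one-sample design: mean difference in SD units
d = (sample.mean() - pop_mean) / sample.std(ddof=1)
```

By Cohen's (1988) conventions, |d| around .2 is small, .5 medium and .8 large, which is how the reported d values are labeled.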
d Writing: The Writing section included two tasks. One was an independent writing task that required examinees to support an opinion on a common topic. The other was an integrated writing task that required examinees to read a text, listen to a lecture that pertained to the topic, and then respond to a question on what they had read and heard. For each question, examinees were required to write their answers on a word processor. The total testing time for the Writing section was 50 minutes, with 30 minutes allocated to the independent writing task and 20 minutes to the integrated writing task. Each examinee’s response to each task was scored on a 0–5 scale by two trained raters. The final score on the item was the average of the
scores of the two raters. (If the two raters’ scores were discrepant by
more than one point, a third score by another rater was obtained and
used to determine the final score.)
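The two-rater scoring rule can be sketched as follows. How the third rating is combined in the discrepant case is not specified above, so the tie-breaking step here (averaging the third rating with the closer of the first two) is an assumption for illustration only:

```python
def writing_item_score(r1, r2, r3=None):
    """Final 0-5 score for one Writing task.

    Average of two ratings, unless they differ by more than one
    point; then a third rating resolves the score. The combination
    rule used here (third rating averaged with the closer of the
    first two) is an assumption, not the documented ETS procedure.
    """
    if abs(r1 - r2) <= 1:
        return (r1 + r2) / 2
    if r3 is None:
        raise ValueError("discrepant ratings require a third rating")
    closer = r1 if abs(r1 - r3) <= abs(r2 - r3) else r2
    return (closer + r3) / 2
```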
The section scores were scaled (a transformation of the sum of the
raw item scores) and ranged from 0 to 30. The total score was a sum
of the four scaled section scores and ranged from 0 to 120.
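The score structure (four 0–30 section scores summing to a 0–120 total) can be sketched as below. The operational scaling involves more than a simple linear map, so both the transformation and the raw maxima here are illustrative assumptions:

```python
def scale_section(raw_sum, raw_max):
    # Hypothetical linear rescaling of a raw-score sum to the 0-30
    # scale; the operational transformation is not this simple.
    return round(30 * raw_sum / raw_max)

# Illustrative raw sums and maxima, not the actual form's point counts.
section_scores = {
    "Reading": scale_section(28, 45),
    "Listening": scale_section(24, 34),
    "Speaking": scale_section(19, 24),
    "Writing": scale_section(7, 10),
}
total = sum(section_scores.values())  # the 0-120 composite
```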
The present field study form was representative of the content and format of operational TOEFL iBT test forms, with two exceptions: Unlike the field study test form, each operational TOEFL iBT test form uses separately timed subparts within the Reading and Listening sections and includes additional Reading or Listening sets. The Listening section in the field study form also had more non-science lectures and a different order of administration of the six conversation and lecture sets.
3 Analyses
The present study employed a confirmatory factor analysis (CFA) of items. A CFA with ordinal categorical data is appropriate for factor analysis of item responses where each item is scored dichotomously or polytomously. In this case a polychoric correlation matrix of the item response data is analyzed, unlike a conventional CFA of continuous variables (e.g., test section scores), which analyzes a variance-covariance matrix. Item-level factor analysis has been applied to language assessment data by a few previous researchers (e.g., Carr, 2003; Davidson, 1988; Swinton & Powers, 1980). This approach was particularly useful for conducting a fine-grained analysis of individual items and the relationships of the integrated items to the test sections.
At the outset, descriptive statistics for the items were examined. The main purpose of this analysis was to identify items with extremely high or low item difficulty values, which could be problematic in the calculation of a polychoric correlation matrix (McLeod, Swygert & Thissen, 2001). No item in the present test form was flagged as problematic in this respect. Thus, a polychoric correlation matrix of the item-level data for the 79 items was calculated by means of PRELIS 2.54 (Jöreskog & Sörbom, 2003a).
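PRELIS estimates each polychoric correlation by maximum likelihood under an underlying bivariate-normal model. A minimal two-step sketch of the idea (not the PRELIS implementation) for one pair of ordinal items coded 0..k-1:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric(x, y):
    """Two-step ML estimate of the polychoric correlation between two
    ordinal variables coded 0..k-1 (a sketch, not the PRELIS algorithm)."""
    kx, ky = int(x.max()) + 1, int(y.max()) + 1
    # Step 1: category thresholds from the marginal cumulative proportions
    tx = norm.ppf(np.cumsum(np.bincount(x, minlength=kx)) / len(x))[:-1]
    ty = norm.ppf(np.cumsum(np.bincount(y, minlength=ky)) / len(y))[:-1]
    ax = np.concatenate(([-10.0], tx, [10.0]))  # +/-10 stands in for +/-inf
    ay = np.concatenate(([-10.0], ty, [10.0]))
    counts = np.zeros((kx, ky))
    for i, j in zip(x, y):
        counts[i, j] += 1

    # Step 2: choose rho maximizing the likelihood of the observed table
    def negloglik(rho):
        mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
        ll = 0.0
        for i in range(kx):
            for j in range(ky):
                # cell probability: rectangle mass under the bivariate
                # normal, by inclusion-exclusion on the joint CDF
                pr = (mvn.cdf([ax[i + 1], ay[j + 1]])
                      - mvn.cdf([ax[i], ay[j + 1]])
                      - mvn.cdf([ax[i + 1], ay[j]])
                      + mvn.cdf([ax[i], ay[j]]))
                ll += counts[i, j] * np.log(max(pr, 1e-12))
        return -ll

    return minimize_scalar(negloglik, bounds=(-0.99, 0.99),
                           method="bounded").x
```

Estimating one such correlation per item pair and assembling them into a matrix yields the polychoric correlation matrix that the CFA then analyzes in place of a covariance matrix.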
The CFAs reported in the subsequent sections focused on the factor structure of the entire test. Prior to this analysis, a series of CFAs were conducted to investigate the relationships among the items at the section level. The results of the section-level CFAs showed that the abilities assessed in the four
For a sequential building of the CFA models for the entire test, the relative goodness of fit of competing models was evaluated. When two models were nested, Satorra and Bentler’s (2001) chi-square difference test procedure was used. Chi-square difference test results for model comparisons were always evaluated in conjunction with the goodness-of-fit criteria listed above (i.e., χ²S-B/df, GFI, NNFI, CFI, RMSEA, and ECVI).
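The scaled difference statistic of Satorra and Bentler (2001) is a closed-form correction of the ordinary chi-square difference. A sketch of the formula, with hypothetical fit values rather than any reported here:

```python
from scipy.stats import chi2

def sb_chisq_diff(T0, c0, df0, T1, c1, df1):
    """Satorra-Bentler (2001) scaled chi-square difference test.

    Model 0 is the more restricted (nested) model. T is the ordinary
    ML chi-square, c the scaling correction factor (ML chi-square
    divided by the scaled chi-square), df the degrees of freedom.
    """
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)  # scaling factor for the diff
    trd = (T0 - T1) / cd                      # scaled difference statistic
    ddf = df0 - df1
    return trd, ddf, chi2.sf(trd, ddf)
```

With equal scaling factors (c0 = c1 = 1) this reduces to the ordinary chi-square difference test.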
II Results
As already noted, the relationships between the integrated Speaking
and Writing tasks were specified as paths between the items and all
corresponding factors in all CFA models tested. One exception, how-
ever, was the integrated Writing task. Because of model identification
issues with the Writing section, which contained only two items, the
integrated Writing task could not be completely modeled. Although
this task involved both listening and reading, it was modeled to load only on the Reading and Writing factors. A summary of goodness-of-fit statistics for the five CFA models is shown in Table 2.
1 See Chen, West and Sousa (2006), however, for an example where a bi-factor model was adopted.
Table 2  Summary of goodness-of-fit statistics for the five CFA models

Model                         df     χ²S-B      χ²S-B/df  GFI^a  NNFI^b  CFI^c  RMSEA^d (90% CI)   ECVI^e (90% CI)     Offending estimates
Bi-factor model               2910   5158.48    1.77      .82    .98     .98    .017 (.016–.018)   2.08 (2.01–2.16)    None
Correlated four-factor model  2989   6754.78    2.26      .78    .98     .98    .022 (.021–.022)   2.61^f (2.52–2.70)  None
Single-factor model           3002   11314.98   3.78      .69    .97     .97    .032 (.031–.033)   4.28^f (4.16–4.40)  —
Correlated two-factor model   3001   8684.64    3.55      .74    .97     .97    .026 (.026–.027)   3.31^f (3.21–3.41)  None
Higher-order factor model     2991   6855.01    2.29      .78    .98     .98    .022 (.021–.022)   2.65^f (2.56–2.74)  —

Note: 90% confidence intervals for RMSEA and ECVI appear in parentheses.
^a GFI = Goodness of Fit Index. ^b NNFI = Non-Normed Fit Index. ^c CFI = Comparative Fit Index. ^d RMSEA = Root Mean Square Error of Approximation. ^e ECVI = Expected Cross-Validation Index. ^f The ECVI was significantly larger than that of the saturated model.
Table 3  Factor loadings, error variances and squared multiple correlations

Reading
Item   Loading   Error variance   SMC^b
1      .52^a     .73              .27
2      .56       .68              .32
3      .66       .56              .44
4      .67       .56              .45
5      .46       .79              .21
6      .57       .67              .33
7      .47       .78              .22
9      .71       .49              .51
10     .30       .91              .09
11     .51       .74              .26
12     .72       .48              .52
13     .55       .70              .30
14     .65       .57              .43
15     .38       .86              .14
16     .65       .57              .43
17     .60       .64              .36
18     .53       .72              .28
19     .76       .42              .58
20     .58       .67              .33
21     .66       .57              .43
22     .49       .76              .24
23     .69       .53              .47
24     .66       .57              .43
25     .59       .65              .35
26     .57       .67              .33
27     .58       .66              .34
28     .58       .66              .34
29     .55       .70              .30
30     .43       .82              .18
31     .75       .43              .57
32     .34       .88              .12
33     .64       .59              .41
34     .43       .81              .19
35     .44       .81              .19
36     .58       .66              .34
37     .48       .77              .23
38     .43       .81              .19
39     .67       .55              .45

Listening
Item   Loading   Error variance   SMC^b
1      .60^a     .64              .36
2      .68       .54              .46
3      .70       .51              .49
4      .52       .73              .27
5      .65       .58              .42

Speaking
Item          Reading   Listening   Speaking   Error variance   SMC^b
1 (S)                               .78^a      .39              .62
2 (S)                               .78        .39              .62
3 (R/L/S)^c   .01       .21         .67        .30              .70
4 (R/L/S)^c   .09       .22         .76        .25              .75
5 (L/S)^d               .09         .80        .26              .74
6 (L/S)^d               .14         .71        .33              .67

Writing

Note: All loadings were significant (|t| > 1.96), except for the loading of Speaking 3 (R/L/S) on the Reading factor.
^a Fixed for factor scaling. ^b Squared multiple correlation. ^c Integrated Reading/Listening/Speaking task. ^d Integrated Listening/Speaking task. ^e Integrated Reading/Listening/Writing task.
3 Limitations
Some limitations of the present study should be noted. First, the
sample size was only marginally acceptable for the item-level CFA of
79 items in the test, and too small for separate analyses of subgroups,
such as native language groups and ability groups. Second, this study
used data collected in a field study of the TOEFL iBT. There may be
important differences between the research participants in this study
and examinees taking the TOEFL iBT operationally, including native
languages, ability level, test-taking motivation, and familiarity with
the test. Third, the model fit was not ideal. For all of these reasons, the results of this study should be interpreted with caution, and a replication should be conducted for different groups, with examinees taking the TOEFL iBT in routine test administrations.
Acknowledgements
The authors’ special thanks go to Fred Davidson, Neil Dorans,
Richard M. Luecht and three anonymous reviewers for their careful
review of an earlier version of this manuscript. Any opinions expressed in this article are those of the authors and not necessarily those of Educational Testing Service.
IV References
Bachman, L. F. (2005). Building and supporting a case for test use. Language
Assessment Quarterly, 2, 1–34.
Bachman, L. F., Davidson, F., Ryan, K., & Choi, I.-C. (1995). An investiga-
tion into the comparability of two tests of English as a foreign language:
The Cambridge-TOEFL comparability study. Cambridge: Cambridge
University Press.
Bachman, L. F., & Palmer, A. (1981). The construct validation of the FSI oral
interview. Language Learning, 31, 67–86.
Bachman, L. F., & Palmer, A. (1982). The construct validation of some com-
ponents of communicative proficiency. TESOL Quarterly, 16, 449–465.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford:
Oxford University Press.
Bagozzi, R. P., & Yi, Y. (1988). On the evaluation of structural equation models. Journal of the Academy of Marketing Science, 16, 74–94.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model
fit. In K. A. Bollen, & J. S. Long (Eds.), Testing structural equation mod-
els (pp. 136–162). Newbury Park, CA: Sage.
Carr, N. T. (2003). An investigation into the structure of text characteristics and reader abilities in a test of second language reading. Unpublished doctoral dissertation, University of California, Los Angeles.
Carroll, J. B. (1983). Psychometric theory and language testing. In J. W. Oller,
Jr. (Ed.), Issues in language testing research (pp. 80–107). Rowley, MA:
Newbury House.
Chapelle, C., Enright, M. K., & Jamieson, J. (Eds.) (2008). Building a valid-
ity argument for the Test of English as a Foreign Language. New York:
Routledge.
Chapelle, C., Grabe, W., & Berns, M. (2000). Communicative language pro-
ficiency: Definition and implications for TOEFL 2000 (TOEFL
Monograph Series MS-10). Princeton, NJ: Educational Testing Service.
Chen, F. F., West, S. G., & Sousa, K. H. (2006). Comparison of bifactor and
second-order models of quality of life. Multivariate Behavioral
Research, 41, 189–225.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
(2nd ed.). Hillsdale, NJ: Erlbaum.
Davidson, F. D. (1988). An exploratory modeling of the trait structures of some existing language test datasets. Unpublished doctoral dissertation, University of California, Los Angeles.
Educational Testing Service (2003). TOEFL test and score data summary:
2002–03 test year data. Princeton, NJ: Educational Testing Service.
Farhady, H. (1983). On the plausibility of the unitary language proficiency
factor. In J. W. Oller, Jr. (Ed.), Issues in language testing research
(pp. 11–28). Rowley, MA: Newbury House.
Hale, G. A., Rock, D.A., & Jirele, T. (1989). Confirmatory factor analysis of
the Test of English as a Foreign Language (TOEFL Research Report
No. 32). Princeton, NJ: Educational Testing Service.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., &
Oller, J. W. Jr. (1988). Multiple-choice cloze items and the Test of
English as a Foreign Language (TOEFL Research Report No. 26).
Princeton, NJ: Educational Testing Service.
Hoyle, R. H., & Panter, A. T. (1995). Writing about structural equation models.
In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues,
and applications (pp. 158–176). Thousand Oaks, CA: Sage.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen,
& J. S. Long (Eds.), Testing structural equation models (pp. 294–316).
Newbury Park, CA: Sage.
Jöreskog, K., & Sörbom, D. (2003a). PRELIS (Version 2.54) [Computer
software]. Chicago, IL: Scientific Software International.
Jöreskog, K., & Sörbom, D. (2003b). LISREL (Version 8.54) [Computer
software]. Chicago, IL: Scientific Software International.
Kline, R. B. (1998). Principles and practice of structural equation modeling.
New York: Guilford.
Kunnan, A. (1995). Test taker characteristics and test performance: A struc-
tural equation modeling approach. Cambridge: Cambridge University
Press.
Lewkowicz, J. A. (1997). The integrated testing of a second language. In C.
Clapham, & D. Corson (Eds.), Encyclopedia of language and education,
Vol. 7: Language testing and assessment (pp. 121–130). Dordrecht, The
Netherlands: Kluwer Academic.
Manning, W. H. (1987). Development of cloze-elide tests of English as a
second language (TOEFL Research Report RR-87–18). Princeton, NJ:
Educational Testing Service.
McLeod, L. D., Swygert, K. A., & Thissen, D. (2001). Factor analysis for
items scored in two categories. In D. Thissen, & H. Wainer (Eds.), Test
scoring (pp. 189–216). Mahwah, NJ: Lawrence Erlbaum.
Oller, J. W., Jr. (1976). Evidence of a general language proficiency factor: An expectancy grammar. Die Neuen Sprachen, 76, 165–174.
Oller, J. W., Jr. (1983). A consensus for the eighties? In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 351–356). Rowley, MA:
Newbury House.
Oller, J. W., Jr., & Hinofotis, F. A. (1980). Two mutually exclusive hypothe-
ses about second language ability: Factor analytic studies of a variety of
language subtests. In J. W. Oller, Jr., & K. Perkins (Eds.), Research in
language testing (pp. 13–23). Rowley, MA: Newbury House.
Raykov, T., & Marcoulides, G. A. (2000). A first course in structural equation modeling. Mahwah, NJ: Lawrence Erlbaum.
Rindskopf, D., & Rose, T. (1988). Some theory and applications of confirma-
tory second-order factor analysis. Multivariate Behavioral Research, 23,
51–67.
Sawaki, Y., Stricker, L., & Oranje, A. (2008). Factor structure of the TOEFL® Internet-based test (iBT): Exploration in a field trial sample (TOEFL iBT Research Report No. TOEFLiBT-04). Princeton, NJ: Educational Testing Service.