Studies in Second Language Acquisition (2023), 45, 291–317

doi:10.1017/S0272263122000225

RESEARCH ARTICLE

Effects of distributed practice on the acquisition of verb-noun collocations

Satoshi Yamagata1*, Tatsuya Nakata2 and James Rogers3
1Kansai University Dai-Ichi Senior High School/Dai-Ichi Junior High School, Japan, and University of
Birmingham, UK; 2Rikkyo University, Japan; 3Meijo University, Japan
*Corresponding author. E-mail: yamagata@kochu.kansai-u.ac.jp or SXY034@student.bham.ac.uk

(Received 16 September 2021; Revised 15 May 2022; Accepted 26 May 2022)

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

Abstract
Given the importance of collocational knowledge for second language learning, how
collocation learning can be facilitated is an important question. The present study examined
the effects of three different practice schedules on collocation learning: node massed,
collocation massed, and collocation spaced. In the node-massed schedule, three collocations
for the same node verb were studied on the same day. In the collocation-massed schedule,
three collocations for the same node verb were studied in different weeks. In the collocation-
spaced schedule, participants encountered multiple collocations for the same node verb
within a single day; at the same time, multiple collocations for the same node verb were
repeated each week. To examine whether the knowledge of studied collocations could be
transferred to unstudied collocations containing the same node, posttests included novel
collocations that were not encountered during the treatment. Results suggested that the
collocation-spaced schedule led to the largest gains for both studied and unstudied
collocations.

Introduction
Collocations refer to frequently recurring word combinations consisting of two content
words (e.g., verb + noun, adjective + noun; Ackermann & Chen, 2013; Shin, 2006).
Collocations can be adjacent (e.g., meet demand, make decisions) or nonadjacent (e.g.,
meet the demand, make important decisions; Boers et al., 2014; Wood, 2020), and are
often characterized by restricted co-occurrence (e.g., make a decision but not do a
decision; Paquot & Granger, 2012). Research suggests that collocational competence
plays a pivotal role in second language (L2) learning, helping learners attain a sufficient
level of accuracy and fluency (González-Fernández & Schmitt, 2015; Siyanova &
Schmitt, 2008). Despite the importance of collocational knowledge, research also
suggests that development of collocational knowledge often lags behind that of single
words (Boers et al., 2014; Laufer & Waldman, 2011; Szudarski, 2017). Considering the
importance of collocational knowledge for L2 learning, research examining how

collocation learning can be facilitated is valuable from both theoretical and pedagogical
perspectives. Research has suggested that L2 collocation learning may be affected by
various factors, such as frequency of exposure (e.g., Pellicer-Sánchez, 2017; Szudarski &
Carter, 2016; Webb et al., 2013), type of input (e.g., Peters, 2016; Toomer & Elgort,
2019; Webb & Chang, 2022; Webb & Kagimoto, 2009), and type of exercise and
instruction (e.g., Boers, Demecheleer et al., 2014; Boers et al., 2017; Eyckmans et al.,
2016).
Another factor that can potentially affect L2 collocation learning is temporal
spacing. Research indicates that distributing practice over a long period often facilitates
the learning of single words (e.g., Bahrick & Phelps, 1987; Kim & Webb, 2022; Nakata &
Webb, 2016). The memory advantage of spacing over no spacing (pure massing) is
referred to as the spacing effect, whereas the advantage of longer temporal spacing over
shorter spacing is referred to as the lag effect (Cepeda et al., 2006). Despite the potential
benefits of spacing for vocabulary learning, research examining its effects on collocation
learning is still limited. The present study aims to fill a gap in existing research by
investigating the effects of spacing on the learning of L2 collocations.

Literature review
Research suggests that distributing practice opportunities over longer periods facilitates
L2 vocabulary learning. For instance, in Nakata and Webb (2016, Experiment 2),
78 Japanese learners studied 20 English–Japanese word pairs under short- or long-
spaced conditions. In the short-spaced condition, a target word was repeated after
approximately 30 seconds, whereas in the long-spaced condition, a target word was
repeated after approximately 3 minutes. Learning was measured by productive and
receptive posttests conducted immediately and 1 week after the treatment. Posttest
results suggested that long spacing was more than twice as effective as short spacing.
Similarly, in Bahrick and Phelps (1987), 35 participants studied 50 English-Spanish
word pairs using one of three spacing intervals: same day, 1 day, and 30 days. Retention
was measured approximately 8 years after the last treatment. Posttest results suggested
that spacing of 30 days was more than twice as effective as same-day spacing. Although
the findings of these studies are useful, most studies have examined the effects of
spacing on the learning of single words; thus, little is known about whether the benefits
of spacing also extend to collocation learning. Two recent studies (Macis et al., 2021;
Snoder, 2017), however, have examined the effects of spacing on L2 collocation learning
and constitute exceptions.
Macis et al. (2021) compared the effects of massing and spacing on learning
25 adjective-noun collocations in incidental (Experiment 1) and deliberate learning
conditions (Experiment 2). Across two experiments, Arabic EFL learners were assigned
to one of five groups: incidental massed, incidental spaced, deliberate massed, deliberate
spaced, and control. Those in the two incidental groups read short stories
containing target collocations and answered comprehension questions (Experiment
1). Participants in the two deliberate groups studied the same target collocations
through concordance lines and then completed matching and multiple-choice exercises
(Experiment 2). In both massed and spaced groups, a given target collocation was
encountered five times over 5 weeks. In the massed groups, a given target collocation
was encountered five times on the same day. In the spaced groups, a given target
collocation was encountered only once a week, and five occurrences were distributed
over 5 weeks. Learning was measured by a fill-in-the-blank posttest, where participants


were asked to supply an appropriate adjective that collocated with the noun (e.g., the
adjective dead for the noun silence). The results showed that the deliberate-spaced
group had the largest gains, followed by the deliberate-massed, incidental-massed, and
incidental-spaced groups. The findings suggest that spacing facilitates learning of not
only single words but also collocations, albeit only in intentional learning.
Snoder (2017) conducted another study that examined the effects of spacing on L2
collocation learning. In it, 59 Swedish learners of English studied 28 verb-noun
collocations under an expanding or intensive condition. In the expanding group, the
treatment was given on days 1, 7, and 16, whereas in the intensive group, the treatment
took place on days 1, 2, and 4. Learning was measured by a posttest that required
participants to provide a verb that collocated with the noun. Posttest results showed no
statistically significant difference between the two groups. One possible explanation for
the inconsistent findings between Macis et al. (2021) and Snoder (2017) may be that
whereas the former compared massing (where practice opportunities for target materials
are concentrated into a single day) and spacing (where practice opportunities for
target materials are distributed over multiple days) and examined the spacing effect, the
latter compared the effects of two spacing schedules (i.e., relatively short vs. relatively
long intervals) and examined the lag effect.
Although the findings of these studies are informative, one potential limitation is
that they examined the learning of only one collocate per node word. For instance,
Snoder (2017) investigated the learning of 28 verb-noun collocations, but there was
only one collocation for each node verb (e.g., carry a risk, entertain hope, score success).
This is unfortunate because collocational development is perhaps facilitated by exposure
to multiple collocations containing the same node. For instance, if Japanese
learners of English first encountered the word break in the collocation break a window,
they may associate break with its first language (L1) translation waru (i.e., to destroy a
physical object; lexical association stage in Jiang, 2004), and may hypothesize that it can
collocate only with concrete nouns. Exposure to collocations such as break a promise or
break the record may allow learners to reconceptualize their knowledge of the meaning
potential of the word (semantic restructuring; Jiang, 2004), helping them to comprehend
or produce novel collocations such as break a rule, break one’s heart, or break the
news. Considering that exposure to multiple collocations with the same node perhaps
facilitates collocational development, it would be useful to examine how spacing of
multiple collocations with the same node affects collocational knowledge.
Multiple collocations for a given node may be introduced in one of three
schedules: massed from the perspective of the same node (hereafter, node massed),
massed from the perspective of individual collocations (hereafter, collocation massed),
and spaced from the perspective of individual collocations (hereafter, collocation
spaced). In the node-massed schedule, practice opportunities for multiple collocations
for the node word (e.g., draw a line, draw tears, draw a conclusion) are concentrated into
a single session, and they are never repeated in subsequent sessions. In the collocation-
massed schedule, multiple collocations for the same node are introduced across
multiple sessions. For instance, learners may encounter draw a line on Day 1, draw
tears on Day 2, and draw a conclusion on Day 3. However, practice opportunities for
individual collocations are concentrated into a single session (e.g., draw a line is studied
only on Day 1). As such, the schedule is massed from the perspective of individual
collocations. In the collocation-spaced schedule, learners encounter multiple collocations
for the same node word within a single session, just as in the node-massed
schedule. At the same time, multiple collocations for the same node are repeated across
multiple sessions. For instance, learners may study draw a line, draw tears, and draw a
conclusion on Days 1, 2, and 3. Although it may be useful to examine how these three
schedules affect collocation learning, the existing spacing studies on collocation
learning (Macis et al., 2021; Snoder, 2017) have not provided evidence regarding their
relative effectiveness.
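To make the contrast between the three schedules concrete, the scheduling logic can be sketched in code. This is an illustrative sketch only; the function names and the three-session simplification are ours, not the authors':

```python
# Illustrative sketch of the three practice schedules, using the paper's
# example node verb "draw". Session counts are simplified to three.
DRAW = ["draw a line", "draw tears", "draw a conclusion"]

def node_massed(items, n_sessions=3):
    # All collocations for the node in one session; never repeated afterward.
    return [list(items) if s == 0 else [] for s in range(n_sessions)]

def collocation_massed(items, n_sessions=3):
    # One collocation per session; each collocation appears in only one session.
    return [[items[s]] for s in range(n_sessions)]

def collocation_spaced(items, n_sessions=3):
    # All collocations for the node in every session.
    return [list(items) for _ in range(n_sessions)]

for schedule in (node_massed, collocation_massed, collocation_spaced):
    print(schedule.__name__, schedule(DRAW))
```

Note that this sketch shows only which collocations appear in which session; in the actual study, the total number of encounters per collocation was equated across groups by distributing the practice stages (see the "Method" section).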
When examining the effects of spacing on learning of multiple collocations with the
same node, it would also be useful to examine not only the extent to which learners
acquired collocations that they were exposed to but also the extent to which learners can
transfer knowledge of studied collocations to novel, unstudied collocations that contain
the same node but were not previously encountered (hereafter, unstudied collocations
refer to novel collocations that contain the same node as the collocations that learners
were exposed to). This is because exposure to multiple collocations with the same node
may help learners to comprehend or produce novel collocations with the same node.
For instance, exposure to multiple collocations with break (e.g., break a window, break a
promise, or break the record) may allow learners to transfer knowledge of these
collocations to novel, unstudied collocations with the same node (e.g., break a rule,
break one’s heart, or break the news).
Considering that exposure to multiple collocations with the same node is instrumental
for collocational development, it would be useful to examine how the distributions
of multiple collocations with the same node affect collocational knowledge.
Furthermore, because learners do not acquire all collocations as individual units but
instead make generalizations about which words can co-occur through repeated
encounters with multiple collocations, examining the effects of spacing on knowledge
of both studied and unstudied collocations would be useful. However, existing studies
examining the effects of spacing on L2 collocation learning (Macis et al., 2021; Snoder,
2017), as well as the majority of previous studies on L2 collocation learning in general
(e.g., Boers et al., 2017; Eyckmans et al., 2016; Pellicer-Sánchez, 2017; Peters, 2016;
Szudarski & Carter, 2016; Toomer & Elgort, 2019; Webb & Chang, 2022), have failed to
investigate the extent to which learners can transfer knowledge of studied collocations
to unstudied collocations.
Investigating effects of spacing on unstudied collocations is also useful because it
allows researchers to examine whether benefits of temporal spacing apply not only to
recall of previously presented materials but also to induction. Specifically, transferring
knowledge of studied collocations to unstudied collocations typically involves induction
because it requires learners to extract the core, underlying meaning of the node,
based on multiple collocations of the same node (unless learners are explicitly taught
the core meaning of the node). Most studies examining effects of spacing on single
words (e.g., Bahrick & Phelps, 1987; Nakata & Webb, 2016), in contrast, have investigated
recall of previously presented materials (i.e., learners are presented with L2
words together with their meanings, and asked to learn them), rather than induction. It
should be noted that some cognitive psychologists argue that although spacing may be
effective for recall, it may not necessarily facilitate inductive learning. As Kornell and
Bjork (2008) state, it is possible that “spacing is the friend of recall, but the enemy of
induction” (p. 585). This is because presenting multiple instances of a particular
category or concept simultaneously (i.e., massing) may help learners identify underlying
conceptual features. In contrast, when multiple instances of a particular concept are
presented after long intervals (spacing), learners may have difficulty noticing underlying
commonalities, thus making induction more difficult. Given that spacing may have
differential effects on recall and induction, it is possible that the learning of collocations,
especially those of unstudied collocations, may benefit from spacing to a lesser degree
than the learning of single words.

The present study
The present study aims to fill this research gap by investigating effects of node-massed,
collocation-massed, and collocation-spaced schedules on knowledge of studied and
unstudied collocations. The treatment was conducted over 3 weeks. In the node-massed
schedule, multiple collocations for the same node verb (draw a conclusion, draw a line,
draw tears) were concentrated into a single day and were not repeated in subsequent
days. In the collocation-massed schedule, multiple collocations for the same node verb
were introduced across the 3 weeks. For instance, participants studied draw a line in
Week 1, draw tears in Week 2, and draw a conclusion in Week 3. However, practice
opportunities for individual collocations were concentrated into a single session (e.g.,
draw a line was studied only in Week 1). In the collocation-spaced group, participants
studied multiple collocations for the same node verb within a single day, just like the
node-massed schedule. At the same time, multiple collocations for the same node verb
were repeated each week. For instance, participants studied three collocations for the
node verb draw (draw a conclusion, draw a line, draw tears) in Weeks 1, 2, and 3 (see
“Method” section for more details).
This study will answer the following research question: To what extent do node-
massed, collocation-massed, and collocation-spaced schedules facilitate knowledge of
studied and unstudied L2 collocations?
The following hypotheses were formulated for this research question:
Hypothesis 1: For the retention of studied collocations, the collocation-spaced
schedule will be more effective than the node-massed and collocation-massed
schedules.
Hypothesis 2: For the knowledge of unstudied collocations, the collocation-spaced
schedule will be the most effective, the collocation-massed schedule will
be the least effective, and the node-massed schedule will fall between the two.
Hypothesis 1 predicts that the collocation-spaced schedule will be most effective for
the retention of studied collocations. This is because whereas practice opportunities for
individual collocations are distributed over 3 weeks in the collocation-spaced schedule,
they are concentrated into a single session in the two massed schedules. Existing studies
have produced inconsistent results regarding effects of spacing on collocation learning.
Specifically, whereas Macis et al. (2021) found benefits of distributed practice for
intentional learning, Snoder (2017) failed to do so. Hypothesis 1 predicts that the
results of this study will be consistent with those of Macis et al. (2021). This is because,
just like the study conducted by Macis and colleagues, the present study involves the
comparison of spacing and massing and examines the spacing effect, whereas Snoder’s
study involved the comparison of two spacing schedules (i.e., relatively short
vs. relatively long intervals) and examined the lag effect.
Hypothesis 2 predicts that for knowledge of unstudied collocations, the collocation-
spaced schedule will be the most effective, followed by the node-massed schedule. In the
node-massed and collocation-spaced schedules, learners are exposed to multiple
collocations for the same node (e.g., draw a line, draw tears, draw a conclusion) within
a single session, which may allow learners to extract the core, underlying meaning of the
node. This in turn may improve the ability to transfer knowledge of studied collocations
to novel, unstudied collocations that contain the same node. In the collocation-massed
schedule, in contrast, multiple collocations with the same node are introduced in
different weeks (e.g., Week 1: draw a line, Week 2: draw tears, Week 3: draw a
conclusion). This may make it difficult for learners to notice underlying commonalities


motivating the use of the node word, resulting in limited ability to transfer knowledge of
studied collocations to unstudied collocations.
Hypothesis 2 also predicts the advantage of the collocation-spaced schedule over the
node-massed schedule for knowledge of unstudied collocations. Existing studies
comparing blocking (i.e., a schedule where only one concept or skill is practiced at a time)
and interleaving (i.e., a schedule where multiple concepts or skills are practiced in alternation)
suggest that blocking may be beneficial for finding commonalities among different
exemplars of a particular concept or category (Carpenter & Mueller, 2013; Kang, 2016),
which might predict the advantage of the node-massed schedule over the collocation-
spaced schedule for unstudied collocations. This is because whereas the node-massed
schedule (where exemplars from only one node are presented each day) is akin to
blocking, the collocation-spaced schedule (where exemplars from multiple nodes are
presented each day) is akin to interleaving. However, although encounters with a given
node are concentrated into a single session in the node-massed schedule, they are
distributed over 3 weeks in the collocation-spaced schedule. Because encounters with a
given node distributed over a longer period in the collocation-spaced schedule may help
consolidate learners’ understanding of the meaning potential of the node, Hypothesis
2 predicts the advantage of the collocation-spaced schedule over the node-massed
schedule for knowledge of unstudied collocations.

Method
Participants
The original pool of participants consisted of 96 first-year Japanese EFL high school
students (15–16 years old). Six students who missed one or more of the pretest,
treatment, or posttest sessions were excluded from analysis, resulting in 90 participants.
All participants had learned English in a formal setting for at least 4 years. Prior to the
experiment, they completed the 1,000- to 5,000-word frequency levels of the Updated
Vocabulary Levels Test (UVLT), Version B (Webb et al., 2017). The average scores are provided in
Table 1. The participants came from three intact classes, each of which was randomly
assigned to one of three groups: node massed (n = 27), collocation massed (n = 31), and
collocation spaced (n = 32). Because a statistically significant difference was found
among the three groups in their total scores on UVLT, F (2, 87) = 8.37, p < .001, η2 = .16
(collocation spaced > node massed [p < .001]; collocation spaced > collocation massed
[p = .039]; collocation massed = node massed [p = .084]), the UVLT score was used as a
covariate in the analysis (see the following text). An a priori power analysis for a mixed
within-between 3 × 2 ANOVA (three groups at two measurement points) showed
that when the effect size was set to be medium (f = .25), a minimum of 64 participants
would be necessary. As a result, the number of participants in the present study (n = 90)
was deemed sufficient.
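As a side note, the reported η² = .16 for the UVLT group difference and the f = .25 assumed in the power analysis sit on different effect-size scales; the standard conversion f = √(η²/(1 − η²)) relates them. The following quick check is ours, not part of the authors' analysis:

```python
import math

# Convert eta-squared to Cohen's f (standard formula: f = sqrt(eta2 / (1 - eta2))).
# Applied here to the reported UVLT group difference (eta2 = .16), which is a
# different analysis from the a priori treatment-effect power calculation (f = .25).
def eta2_to_f(eta2):
    return math.sqrt(eta2 / (1 - eta2))

print(round(eta2_to_f(0.16), 3))  # the UVLT difference corresponds to f ≈ 0.44
```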

Materials
Fifty-four verb-noun collocations (e.g., carry weight, draw a line, take advice) were
chosen as target items. All collocations were incongruent between the L1 (Japanese) and
L2 (English), that is, the translation of the node verb in each collocation was different
from its most common, prototypical translation equivalent (Conklin & Carrol, 2018;
Gyllstad & Wolter, 2016; Szudarski, 2012). Initially, 144 collocations were identified as
candidates for target items. Based on results of a norming test administered to
191 Japanese high school students who did not participate in the actual experiment, the
144 items were narrowed down to 74 (see Appendix S1 in the Online Supplementary
Materials). To identify collocations that were unfamiliar, a pretest was carried out
3 weeks before the treatment with actual participants of the experiment.

Table 1. Proportion of correct responses on the UVLT

                     Node massed                   Collocation massed            Collocation spaced
                     M [95% CI]             SD     M [95% CI]             SD     M [95% CI]             SD
1,000-word levels    81.5% [77.2%, 85.8%]  10.8%   87.1% [83.0%, 91.2%]  11.1%   90.5% [88.3%, 92.7%]   6.1%
2,000-word levels    52.0% [44.9%, 59.1%]  18.0%   59.9% [53.4%, 66.4%]  17.3%   70.7% [66.0%, 75.5%]  13.1%
3,000-word levels    27.7% [19.5%, 35.8%]  20.5%   37.0% [31.2%, 42.8%]  15.7%   41.6% [36.1%, 47.1%]  15.3%
4,000-word levels    25.4% [18.4%, 32.5%]  17.9%   30.9% [26.0%, 35.7%]  13.0%   36.7% [30.1%, 43.3%]  18.3%
5,000-word levels    15.3% [10.1%, 20.5%]  13.2%   22.2% [17.5%, 26.8%]  12.2%   24.8% [19.8%, 29.8%]  13.9%
Total                40.4% [35.0%, 45.8%]  13.6%   47.5% [43.6%, 51.4%]  10.4%   52.9% [49.1%, 56.6%]  10.4%
Two types of tests were given as the pretest: collocation filling and verb filling. In the
collocation-filling test, participants were presented with a short sentence where a target
collocation was deleted and asked to supply the missing verb and noun. To clarify the
meaning of the target collocation, a Japanese translation of the sentence was provided.
To prevent participants from providing alternate, acceptable answers (e.g., run a fever
instead of have a fever), the number of letters was provided as a hint. In addition, based
on a similar procedure utilized in Nakata and Webb (2016), a letter from the word was
sometimes provided as a hint when deemed necessary to help avoid alternative,
acceptable answers. Participants were informed that they could provide multiple
answers if they could think of more than one. An example of a collocation-filling item
is as follows:

もし熱が出れば、できるだけ早く私に言ってくださいね。
If you ( _ _ _ ) a/an ( _ _ _ _ _ ), please tell me as soon as possible.
(Answer: run, fever)
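For illustration only, the letter-count blanks in items like the one above can be generated mechanically. The helper below is our sketch, not the authors' item-construction procedure:

```python
# Build a letter-count hint like "( _ _ _ )" from a target word, as in the
# sample collocation-filling item above. Illustrative sketch only.
def blank(word):
    return "( " + " ".join("_" for _ in word) + " )"

print(blank("run"), blank("fever"))  # ( _ _ _ ) ( _ _ _ _ _ )
```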
Cloze sentences were created so that all words used in each sentence would be among
the most frequent 4,000 word families of the Corpus of Contemporary American English
(COCA). As the results of the UVLT suggest
(see “Participants” section), it is possible that some participants were not familiar with
some words used in these sentences. However, because Japanese translations for all
sentences were provided, potential use of unfamiliar words perhaps did not have major
effects on the results of this study. Vocabulary load analysis also showed that the most
frequent 1,000 word families alone cover 95.4% of running words used in the cloze
sentences.
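The vocabulary-load figure above comes from counting the proportion of running words (tokens) covered by a frequency band. Below is a minimal sketch of that computation, with an invented sentence and a tiny stand-in word list; real analyses count word families rather than surface tokens:

```python
# Token coverage: share of running words found in a known-word list.
# The sentence and the stand-in "first 1,000" list are invented for illustration.
def coverage(tokens, known_words):
    return sum(1 for t in tokens if t.lower() in known_words) / len(tokens)

tokens = "if you run a fever please tell me as soon as possible".split()
first_1000 = {"if", "you", "run", "a", "please", "tell", "me", "as", "soon", "possible"}
print(f"{coverage(tokens, first_1000):.1%}")  # every token but "fever" is covered
```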
In the verb-filling test, the noun of the candidate collocation was given, and participants
were required to fill in the missing verb. After completing the collocation-filling test,
participants were asked not to return to any of its items. Both
pretests required the production, rather than comprehension, of target collocations.
Because research suggests that production of collocations poses more of a challenge for
L2 learners than comprehension (Gyllstad & Wolter, 2016; Henriksen, 2013; Laufer &
Waldman, 2011), this study focused on the development of productive
knowledge of collocations. As a result, productive collocational knowledge was measured
in both pretests and posttests. The pretest is provided in Appendix S2 in the Online
Supplementary Materials. Based on the results of a pilot study involving 80 Japanese
learners recruited from a different high school than the school where the main study was
conducted, participants were given up to 40 minutes to complete the pretest. They were
also instructed to put a circle around the number of the last question they solved if they
were unable to complete the test within the time limit. None of the participants indicated
that they were unable to complete the pretest. Because it was not possible to identify a
sufficient number of target collocations based on the results of the pretest, an additional
pretest with 12 novel collocations was administered 2 weeks before the treatment (see
Appendix S3 in the Online Supplementary Materials). Based on results of the pretest and
additional pretest, 54 target collocations, which consisted of nine node verbs and their six
collocate nouns, were chosen (see Appendix S4 in the Online Supplementary Materials).
Out of the 54 target collocations, 53 were chosen from the first pretest, and only one item
(cut a loss) was chosen from the additional pretest.


The target collocations were divided into two sets of 27 items (nine node verbs and
their three collocate nouns each). One set of items served as studied items, whereas
the other served as unstudied items. Both studied and unstudied items were tested on
the pretest and posttest. However, although studied items were presented and practiced
during the treatment, unstudied items did not appear throughout the treatment.
Unstudied items were included to examine effects of the treatment on the ability to
transfer knowledge of studied collocations to unstudied, novel collocations that contain
the same node. The two sets were created so that they were matched for variables such
as the average pretest score, t-score (Hunston, 2002; Webb et al., 2013), frequency in
COCA, and familiarity ratings of the
nouns by Japanese learners (see Appendix S4 in the Online Supplementary Materials).
Care was also taken to ensure that collocations with similar meanings (e.g., cut class and
cut school) would not be included in the same set. Only collocations that were
semantically motivated by the core meaning of the node verb were used as target
collocations. This is because, otherwise, learners could not be expected to transfer the
knowledge of studied collocations to unstudied collocations. The relationship
between the core meaning of the node verb and the meaning of each target
collocation is provided in Appendix S4 in the Online Supplementary Materials.
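The t-score used to match the two item sets is a standard corpus association measure: t ≈ (O − E)/√O, where O is the observed co-occurrence frequency and E = (f₁ × f₂)/N is the frequency expected if the two words co-occurred by chance (Hunston, 2002). A sketch with invented counts, not actual COCA figures:

```python
import math

# t-score for a verb-noun pair: (observed - expected) / sqrt(observed),
# with expected = f_verb * f_noun / corpus_size under independence.
def t_score(observed, f_verb, f_noun, corpus_size):
    expected = f_verb * f_noun / corpus_size
    return (observed - expected) / math.sqrt(observed)

# Hypothetical frequencies for illustration only:
print(round(t_score(observed=120, f_verb=50_000, f_noun=8_000,
                    corpus_size=100_000_000), 2))
```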

Procedure
Treatment
Three weeks after the pretest and 2 weeks after the additional pretest, the treatment was
conducted over 3 weeks. Three treatment sessions were given each week (on Monday,
Wednesday, and Friday), resulting in nine sessions in total. Each session took approximately
5 to 10 minutes and was conducted during regular class hours. Different target
collocations were introduced in each class, depending on the participants’ group
(i.e., node massed, collocation massed, or collocation spaced). Figure 1 presents target
collocations introduced in each session in the three groups. In the node-massed group,
participants learned three collocations containing the same node verb each day (e.g.,
draw a conclusion, draw a line, draw tears). In the collocation-massed group, multiple
collocations for the same node verb were studied in different weeks. For instance,
participants studied draw a line in Week 1, draw tears in Week 2, and draw a conclusion
in Week 3. In the collocation-spaced group, participants studied multiple collocations
for the same node verb (e.g., draw a conclusion, draw a line, draw tears) within a single
day. At the same time, multiple collocations for the same node were encountered every
week throughout the treatment. For instance, participants studied three collocations for
the node verb draw (draw a conclusion, draw a line, draw tears) in each week of the
treatment.
For each treatment session, materials were presented on a screen in front of the
classroom using presentation software. The treatment session consisted of the following
seven stages: (1) presentation of target collocations, (2) presentation of target
collocations in context, (3) retrieval of target verbs, (4) translation of target collocations,
(5) retrieval of target verbs in context, (6) retrieval of target collocations in context, and
(7) a quiz. See Appendix S5 in the Online Supplementary Materials for further details of
the stages. Three out of the seven stages (Stages 2, 5, and 6) involved a context sentence
containing a target collocation. For a given collocation, the same context sentence was
used for all three stages. This is because a study conducted by Durrant and Schmitt
(2010) suggests that repeating the same context sentence three times may facilitate L2
collocational development more than using three different sentences.

https://doi.org/10.1017/S0272263122000225 Published online by Cambridge University Press


300 Satoshi Yamagata et al.
Figure 1. Target items introduced in each treatment.
Effects of distributed practice: Verb-noun collocations 301

For Stages (1)–(7), the target collocations were presented in a block of three items (for
the node-massed and collocation-massed groups) or nine items (for the collocation-spaced
group), with the whole block cycling through each stage, instead of one collocation going
through all seven stages one by one.
For instance, as shown in Figure 1, in the node-massed group, for the first treatment
session (Wednesday in Week 1), the following three collocations were introduced: run a
fever, run a story, run a finger. At the beginning of the treatment, all three collocations
were presented for Stage (1). After this, the three collocations were practiced in Stage (2).
This was followed by the three collocations practiced in Stage (3), and so forth. To
minimize order effects, the items appeared in a different order for each stage.
As shown in Figure 1, whereas three collocations were practiced each day in the two
massed groups, in the collocation-spaced group, nine collocations were practiced each
day. Please note, however, that when collapsed across all treatment sessions, the
number of encounters was held constant for all three groups. For instance, in the
two massed groups, participants completed all seven stages for the target collocation
run a story in the first treatment session in Week 1 (see Appendix S5 in the Online
Supplementary Materials). In contrast, for the target collocation run a story, partici-
pants in the collocation-spaced group completed Stages (1), (2), and (7) in Week
1, Stages (3) and (4) in Week 2, and Stages (5) and (6) in Week 3. Because each target
collocation was practiced seven times throughout the treatment in all three conditions,
the number of encounters was held constant for all three groups. Because the treatment
was paced by the presentation software, time-on-task was also held constant, and the
only difference was how the practice opportunities were distributed.
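The distribution of the seven stages described above can be sketched in a few lines of code. This is an illustrative sketch rather than the authors' materials; the week-by-stage split for the spaced group follows the *run a story* example in the text, and the group names are labels chosen here.

```python
# Illustrative sketch (not the authors' materials): how the seven practice
# stages for a single target collocation were distributed in each condition.
# The split for the spaced group follows the "run a story" example:
# Week 1 = Stages 1, 2, 7; Week 2 = Stages 3, 4; Week 3 = Stages 5, 6.

def stage_schedule(group):
    """Return {week: [stages]} for one target collocation."""
    if group in ("node_massed", "collocation_massed"):
        # All seven stages are completed within a single session;
        # which week that session falls in depends on the item.
        return {1: [1, 2, 3, 4, 5, 6, 7]}
    if group == "collocation_spaced":
        return {1: [1, 2, 7], 2: [3, 4], 3: [5, 6]}
    raise ValueError(f"unknown group: {group}")

# As stated in the text, the number of encounters per collocation is held
# constant (seven) across all three conditions.
for group in ("node_massed", "collocation_massed", "collocation_spaced"):
    total = sum(len(stages) for stages in stage_schedule(group).values())
    assert total == 7
```

The design choice this makes visible: the conditions differ only in *when* the seven encounters occur, not in how many there are.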

Posttests
Immediately after the last treatment session (Monday in Week 3; Figure 1), participants
took the immediate posttest. It was different from the pretest in four respects. First,
although the number of letters, and sometimes one letter from the word, was provided
as a hint in the pretest (e.g., _ _ _ _ for take), no hint was provided on the posttest.
Second, in the pretest, 98 items (74 in the pretest and 24 in the additional pretest) were
tested in both the collocation-filling and verb-filling tests. In the posttest, only 27 stud-
ied collocations were tested in the collocation-filling test, and 54 target collocations
(27 studied and 27 unstudied) were tested in the verb-filling test. Unstudied items were
not tested in the collocation-filling test (which required learners to provide both the
node verb and collocate noun) because we cannot expect any of the treatments to
contribute to the learners’ ability to successfully provide the correct collocate noun,
which was not encountered during the treatment. Third, a randomized item order
different from the pretest was used for the immediate posttest to minimize order effects.
Fourth, because the posttest involved fewer items than the pretest, the time limit for the
posttest (20 minutes) was shorter than that for the pretest (40 minutes). The time limit
for the posttest was determined based on a pilot study with 80 Japanese learners
recruited from a different high school than the school where the main study was
conducted. Other than these, the immediate posttest was the same as the pretest (see the
immediate posttest in Appendix S6 in the Online Supplementary Materials). Two
weeks after the immediate posttest, a delayed posttest was administered without prior
announcement. This was identical to the immediate posttest except for item order. Two
types of posttests (collocation filling and verb filling) were used in the present study.
This is because administering two posttests with different levels of sensitivity may
provide a more comprehensive picture regarding the incremental nature of collocation
learning (Peters, 2016; Szudarski & Carter, 2016).

Scoring and data analysis
Collocation-filling test
The knowledge of intact, studied collocations was measured by the collocation-filling
test. If participants provided both the verb and noun successfully, it was scored as
correct. Misspelled responses were scored as correct as long as they were recognizable
(e.g., Snoder, 2017; Sonbul & Schmitt, 2013; Toomer & Elgort, 2019). To control for
effects of prior knowledge, for each participant, items answered successfully on the
pretest were treated as missing values and excluded from analysis (e.g., Nakata &
Suzuki, 2019). This resulted in the exclusion of 1.3% of items on average per participant
(node massed: 1.1%, collocation massed: 1.2%; collocation spaced: 1.5%). All analyses
had α levels set at .05.
Responses were analyzed using a mixed-effect logistic regression model with the
lme4 package (version 1.1-27.1; Bates et al., 2015) in R (version 4.1.2; R Core Team,
2021). The response variables were discrete binary data (correct = 1, incorrect = 0).
Treatment (node massed vs. collocation massed vs. collocation spaced) and Test_timing
(immediate vs. delayed posttest) were included as fixed effects. To control for English
proficiency effects, UVLT scores were included as a covariate in the model. Further-
more, to control for recency effect, lag to test (the number of days between the last
occurrence of the item during the treatment and immediate posttest) was also included
as a covariate. For instance, the last occurrence for the target item run a fever during the
treatment was in the first treatment session (Wednesday in Week 1) in the node-
massed group, whereas it was in the seventh treatment session (Wednesday in Week 3)
in the collocation-spaced group (Figure 1). Therefore, lag to test was 19 days for the
node-massed group, and 5 days for the collocation-spaced group. Because the differ-
ence between 17 and 19 days, for instance, may be larger than the difference between
0 and 2 days, lag to test was squared before it was entered into the model. To avoid
multicollinearity and convergence issues, both UVLT scores and squares of lag to
test were centered and standardized before they were entered into the model as
s.UVLT_score and s.Lag_to_test, respectively.
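The covariate preprocessing just described can be sketched as follows. The example lag values are illustrative (the text gives 19 and 5 days for two of the groups; the middle value is a filler), and the use of the population SD is an assumption, since the paper does not specify which SD was used.

```python
# Sketch of the covariate preprocessing described above: lag to test is
# squared, then centered and standardized (z-scored) before entering the
# model. Example lag values (in days) are illustrative; the population SD
# is an assumption.
import statistics

def standardize(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (assumption)
    return [(v - mean) / sd for v in values]

lag_to_test = [19, 12, 5]  # e.g., node massed = 19 days, spaced = 5 days
s_lag_to_test = standardize([d ** 2 for d in lag_to_test])

# After standardization, the covariate has mean 0 and SD 1.
assert abs(statistics.mean(s_lag_to_test)) < 1e-9
assert abs(statistics.pstdev(s_lag_to_test) - 1) < 1e-9
```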
The random effects were fitted using the maximum likelihood method, with random
intercepts for participants and target collocations, and by-item random slopes for the
UVLT scores and lag to test. An interaction between Treatment
and Test_timing was also entered into the model, assuming that the treatment’s effect
was different for the immediate and delayed posttests.

Verb-filling test
Knowledge of verbs in studied and unstudied collocations was measured by the verb-
filling test. The verb-filling test was scored in the same way as the collocation-filling test.
As in the collocation-filling test, items answered correctly on the pretest were treated as
missing values for each participant. This resulted in the exclusion of 2.1% of items on
average per participant (node massed: 2.8%, collocation massed: 1.8%; collocation
spaced: 1.7%).
The model used for the verb-filling test was the same as the one used for the
collocation-filling test, except that collocation type (Studied = 1 and Unstudied = 0)
was included as new fixed and random effects after it was centered and standardized
(s.Collocation_type). Two interactions (Test_timing × Treatment × s.Collocation_type;
Test_timing × s.Collocation_type) were also included. To make the model converge, we
added an additional optimizer and used the double-bar (||) syntax for the random effects,
which excludes correlations among the random-effect parameters. Because the

Table 2. Proportion of correct responses on the pretest

                    Collocation-filling test                Verb-filling test
                    M [95% CI]          SD     Range        M [95% CI]          SD     Range

Node massed         0.6% [0.3%, 1.1%]   2.2%   0%–11.1%     2.8% [2.1%, 3.8%]   6.5%   0%–33.3%
Collocation massed  0.6% [0.3%, 1.1%]   1.7%   0%–7.4%      1.8% [1.3%, 2.6%]   3.1%   0%–13.0%
Collocation spaced  0.8% [0.4%, 1.3%]   1.6%   0%–7.4%      1.7% [1.2%, 2.4%]   3.0%   0%–13.0%

unstudied items did not appear in any of the treatment sessions, dummy coding was
used for s.Lag_to_test of these items.

Results
Pretest
Table 2 shows results of the pretest scores. More detailed information about the pretest
performance is provided in Appendix S7 in the Online Supplementary Materials. The
differences in the pretest scores of the three groups were not statistically significant,
producing negligible effects: collocation-filling pretest, H(2) = 1.88, p = .391, r = .09;
verb-filling pretest, H(2) = 0.14, p = .933, r = .01.

Learning-phase performance
The proportion of correct responses on the quiz given at the end of each treatment
session (Stage 7; see Appendix S5 in the Online Supplementary Materials) was
96.4% (95% CI = [93.6%, 99.2%]; SD = 7.6%) for the node-massed group, 94.4%
(95% CI = [92.0%, 96.9%]; SD = 6.5%) for the collocation-massed group, and 75.8%
(95% CI = [70.2%, 81.5%]; SD = 15.7%) for the collocation-spaced group. The
difference was statistically significant, H(2) = 46.34, p < .001, and a large effect size
(r = .68) was detected, according to Plonsky and Oswald's (2014) benchmarks. Post-hoc
analysis with Bonferroni adjustment revealed a statistically significant difference
between the collocation-spaced and node-massed groups (z = –5.87, p < .001, r = .76
[large effect]), as well as between the collocation-spaced and collocation-massed groups
(z = –5.48, p < .001, r = .70 [large effect]). No statistically significant difference,
however, was found between the two massed groups (z = –1.91, p = .170, r = .25
[medium effect]). The findings suggest that the two massed groups led to higher scores
than the collocation-spaced group during the learning phase.
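The effect size r reported for these nonparametric comparisons follows the common convention r = |z| / √N. A minimal sketch; the N used here (roughly 60 observations in the two groups being compared) is an assumption made only so the example reproduces the reported value.

```python
# Minimal sketch: the effect size r for a z-based pairwise comparison is
# commonly computed as r = |z| / sqrt(N). The N of 60 is an assumption for
# illustration, not a figure taken from the paper.
import math

def effect_size_r(z, n):
    return abs(z) / math.sqrt(n)

# With the reported z = -5.87 and an assumed N of 60, r lands near the
# reported r = .76 (a large effect).
r = effect_size_r(-5.87, 60)
assert 0.70 < r < 0.80
```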

Collocation-filling test
The reliability of the collocation-filling test indexed by Cronbach's alpha was .917 for the
immediate posttest and .904 for the delayed posttest, showing sufficient reliability
(Plonsky & Derrick, 2016). Results for the collocation-filling test are summarized in
Tables 3 to 5, as well as in Figure 2. Table 4 shows fixed and random effects in the mixed-
effect logistic regression model. The significant fixed effect of the collocation-
spaced group suggests that when collapsed across the immediate and delayed posttests,
the collocation-spaced group significantly outperformed the node-massed group. The
odds ratio (OR) of 9.58 indicates that the odds of being able to answer correctly on the


Figure 2. Distributions of scores for the collocation-filling test. [Boxplots of the
proportion of correct responses by treatment group (node massed, collocation massed,
collocation spaced) on the immediate and delayed posttests.]

posttest in the collocation-spaced group were 9.58 times higher than in the node-massed
group, which is considered a large effect size according to guidelines proposed by Chen
et al. (2010), where odds of 1.68/3.47/6.71 are interpreted as small, medium, and large
effects, respectively. The significant fixed effect of the UVLT suggests that higher UVLT
scores were associated with higher posttest scores, with a small effect (OR = 2.36). The
fixed effect of lag to test, however, was not statistically significant, which suggests that
the recency effect (i.e., whether the last encounter with the target collocation was close
to the posttest or not) did not significantly affect learning. Although the fixed effects of
Collocation-massed and Test_timing were also significant, they are not discussed in
detail because an interaction containing these effects was also significant (see the
following text).
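Because the model reports estimates on the log-odds scale, each odds ratio above is simply the exponential of its estimate, read against the Chen et al. (2010) benchmarks. A quick sketch (helper names are illustrative, not from the paper):

```python
# Converting log-odds estimates from the mixed-effects model into odds
# ratios (OR = exp(estimate)) and labeling them against the Chen et al.
# (2010) benchmarks cited in the text: 1.68 / 3.47 / 6.71 for small /
# medium / large effects. Helper names are illustrative.
import math

def odds_ratio(log_odds):
    return math.exp(log_odds)

def chen_benchmark_label(odds):
    if odds >= 6.71:
        return "large"
    if odds >= 3.47:
        return "medium"
    if odds >= 1.68:
        return "small"
    return "negligible"

# The collocation-spaced fixed effect (estimate 2.26) yields OR = 9.58,
# a large effect; the collocation-massed estimate (1.09) yields OR = 2.97,
# a small effect.
assert round(odds_ratio(2.26), 2) == 9.58
assert chen_benchmark_label(odds_ratio(2.26)) == "large"
assert chen_benchmark_label(odds_ratio(1.09)) == "small"
```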

Table 3. Proportion of correct responses on the posttest

                                       Node massed                  Collocation massed           Collocation spaced
                                       M [95% CI]            SD     M [95% CI]            SD     M [95% CI]            SD

Immediate  Collocation filling         14.9% [12.2%, 17.6%]  35.6%  23.0% [20.1%, 25.9%]  42.1%  57.8% [54.5%, 61.2%]  49.4%
           Verb filling: Studied       16.4% [13.7%, 19.2%]  37.1%  28.2% [25.0%, 31.3%]  45.0%  62.1% [58.8%, 65.4%]  48.5%
           Verb filling: Unstudied      4.6% [3.1%, 6.2%]    21.0%   7.8% [5.9%, 9.6%]    26.8%  19.9% [17.2%, 22.5%]  39.9%
Delayed    Collocation filling          6.6% [4.7%, 8.5%]    24.8%  15.9% [13.4%, 18.5%]  36.6%  40.7% [37.4%, 44.1%]  49.2%
           Verb filling: Studied        9.4% [7.1%, 11.6%]   29.2%  20.2% [17.4%, 23.1%]  40.2%  45.6% [42.3%, 49.0%]  49.8%
           Verb filling: Unstudied      2.8% [1.5%, 4.1%]    16.6%   5.7% [4.1%, 7.3%]    23.2%  16.6% [14.1%, 19.1%]  37.2%
Table 4. List of fixed and random effects fitted: Collocation-filling test

Fixed Effects
                                   Estimate  95% CI           SE    z      Odds Ratio  p

Intercept                          −3.38     [−4.10, −2.65]   0.37  −9.09  0.03        <.001
Collocation spaced                  2.26     [1.60, 2.91]     0.33   6.76  9.58        <.001
Collocation massed                  1.09     [0.45, 1.73]     0.33   3.34  2.97        <.001
Test_timing                         1.19     [0.77, 1.61]     0.21   5.57  3.29        <.001
s.UVLT_score                        0.86     [0.60, 1.12]     0.13   6.48  2.36        <.001
s.Lag_to_test                      −0.31     [−0.69, 0.06]    0.19  −1.65  0.73        .098
Collocation spaced × Test_timing   −0.21     [−0.69, 0.27]    0.24  −0.86  0.81        .388
Collocation massed × Test_timing   −0.58     [−1.09, −0.07]   0.26  −2.25  0.56        .024

Model Formula: glmer(Phrase_accuracy ~ Treatment*Test_timing + s.UVLT_score + s.Lag_to_test + (1 | ID) +
(s.UVLT_score + s.Lag_to_test + 1 | Item), Data, family = binomial, control = glmerControl(optimizer = "bobyqa"))

Random Effects
                          Variance  SD

Participants (Intercept)  0.65      0.81
Item (Intercept)          1.71      1.31
  s.UVLT_score            0.11      0.34
  s.Lag_to_test           0.56      0.75

None of the interactions in the model were statistically significant except for the
Collocation massed × Test_timing interaction (OR = 0.56). This significant
interaction indicates that scores for the node-massed group decayed more than those
for the collocation-massed group from the immediate posttest to the delayed posttest,
widening the gap between the two groups. Post-hoc analysis with Tukey’s test was
conducted using the R package lsmeans (Lenth, 2021). Results (Table 5) showed that on
the immediate posttest, the collocation-spaced group significantly outperformed both
the node-massed (OR = 7.77 [large effect]) and collocation-massed groups (OR = 4.66
[medium effect]). The collocation-massed group, in contrast, failed to significantly
outperform the node-massed group, producing a negligible effect (OR = 1.67). On the
delayed posttest, the collocation-spaced group significantly outperformed both the

Table 5. Results of the post-hoc analysis for treatment: Collocation-filling test

Posttest   Comparisons                                Estimate  95% CI          SE    z     Odds Ratio  p

Immediate  Collocation spaced vs. Node massed         2.05      [1.45, 2.64]    0.30  6.73  7.77        <.001
           Collocation spaced vs. Collocation massed  1.54      [1.02, 2.06]    0.27  5.78  4.66        <.001
           Collocation massed vs. Node massed         0.51      [−0.06, 1.07]   0.29  1.75  1.67        .500
Delayed    Collocation spaced vs. Node massed         2.26      [1.60, 2.91]    0.33  6.76  9.58        <.001
           Collocation spaced vs. Collocation massed  1.17      [0.64, 1.71]    0.27  4.29  3.22        <.001
           Collocation massed vs. Node massed         1.09      [0.45, 1.73]    0.33  3.34  2.97        .011


node-massed (OR = 9.58 [large effect]) and collocation-massed groups (OR = 3.22
[small effect]). Unlike on the immediate posttest, the collocation-massed group
significantly outperformed the node-massed group, and a small effect was found
(OR = 2.97). The findings suggest the following order on the collocation-filling test:

Immediate posttest: collocation-spaced > collocation-massed = node-massed


Delayed posttest: collocation-spaced > collocation-massed > node-massed

Verb-filling test
The reliability of the verb-filling test indexed by Cronbach's alpha was .934 for the
immediate posttest and .930 for the delayed posttest, showing sufficient reliability
(Plonsky & Derrick, 2016). Results for the verb-filling test are summarized in Tables 3,
6, and 7, as well as Figure 3. Table 6 shows fixed and random effects in the mixed-effect
logistic regression model. The significant fixed effect of the collocation-spaced group

Figure 3. Distributions of scores for the verb-filling test. [Boxplots of the proportion
of correct responses for studied and unstudied items on the immediate and delayed
posttests, by treatment group.]

Table 6. List of fixed and random effects fitted: Verb-filling test

Fixed Effects
                                                        Estimate  95% CI           SE    z       Odds Ratio  p

Intercept                                               −3.50     [−4.09, −2.90]   0.30  −11.48  0.03        <.001
Collocation spaced                                       1.70     [1.04, 2.36]     0.34    5.04  5.47        <.001
Collocation massed                                       0.49     [−0.16, 1.15]    0.33    1.48  1.63        .139
Test_timing                                              0.71     [0.36, 1.07]     0.18    3.92  2.03        <.001
s.Collocation_type                                       0.91     [0.44, 1.38]     0.24    3.81  2.48        <.001
s.UVLT_score                                             0.88     [0.62, 1.14]     0.13    6.60  2.41        <.001
s.Lag_to_test                                           −0.13     [−0.36, 0.09]    0.11   −1.18  0.88        .239
Collocation spaced × Test_timing                        −0.04     [−0.45, 0.36]    0.21   −0.21  0.96        .834
Collocation massed × Test_timing                        −0.20     [−0.64, 0.24]    0.22   −0.90  0.82        .370
Collocation spaced × s.Collocation_type                  0.24     [−0.14, 0.63]    0.20    1.23  1.27        .220
Collocation massed × s.Collocation_type                  0.28     [−0.13, 0.69]    0.21    1.35  1.32        .178
Test_timing × s.Collocation_type                         0.11     [−0.24, 0.47]    0.18    0.63  1.12        .527
Collocation spaced × Test_timing × s.Collocation_type    0.23     [−0.17, 0.63]    0.21    1.12  1.26        .263
Collocation massed × Test_timing × s.Collocation_type   −0.03     [−0.47, 0.41]    0.22   −0.12  1.03        .903

Model Formula: glmer(Verb_accuracy ~ Treatment*Test_timing*s.Collocation_type + s.UVLT_score + s.Lag_to_test +
(s.Collocation_type + 1 || ID) + (s.UVLT_score + s.Lag_to_test + s.Collocation_type + 1 || Item), Data,
family = binomial, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))

Random Effects
                          Variance  SD

Participants (Intercept)  0.95      0.98
  s.Collocation_type      0.12      0.34
Item (Intercept)          0.82      0.91
  s.UVLT_score            0.08      0.28
  s.Lag_to_test           0.14      0.37
  s.Collocation_type      0.46      0.67

suggests that when collapsed across the immediate and delayed posttests and studied
and unstudied collocations, the collocation-spaced group significantly outperformed
the node-massed group, producing a medium effect (OR = 5.47). The fixed effect of the
collocation-massed group, however, was not statistically significant, and only a negli-
gible effect was observed (OR = 1.63). This suggests that when collapsed across the
immediate and delayed posttests and studied and unstudied collocations, no significant
difference existed between the two massed groups. The significant fixed effect of
Test_timing shows that when collapsed across the three groups, the immediate posttest
scores were significantly higher than the delayed posttest scores, producing a small
effect (OR = 2.03). The significant main effect of collocation type suggests that when
collapsed across the three groups and immediate and delayed posttests, scores for the
studied collocations were significantly higher than those for the unstudied collocations,
producing a small effect (OR = 2.48). The significant fixed effect of the UVLT suggests
that higher UVLT scores were associated with higher posttest scores, with a small effect
(OR = 2.41). The fixed effect of lag to test was not statistically significant, which


suggests that the recency effect did not significantly affect learning. None of the
interactions fitted into the model were statistically significant.
To examine where significant differences lay at immediate and delayed posttests,
post-hoc analysis with Tukey’s test was conducted (Table 7). The results showed that on
the immediate posttest, for studied collocations, the collocation-spaced group signif-
icantly outperformed both the node-massed (OR = 8.41 [large effect]) and collocation-
massed groups (OR = 4.85 [medium effect]). The collocation-massed group, however,
failed to significantly outperform the node-massed group, producing a small effect
(OR = 1.73). For unstudied collocations, the collocation-spaced group significantly
outperformed the collocation-massed group, with a small effect size (OR = 3.13). No
significant difference, however, existed between the collocation-spaced and node-
massed groups (OR = 3.25), or between the collocation-massed and node-massed
groups (OR = 1.04), and no more than small effects were found.
On the delayed posttest, the collocation-spaced group significantly outperformed
other groups for both studied and unstudied collocations, producing small to large
effect sizes (3.19 ≤ OR ≤ 6.96). The difference between the two massed groups,
however, was not statistically significant for either studied (OR = 2.16 [small effect])
or unstudied collocations (OR = 1.23 [negligible effect]). The findings suggest the
following order on the verb-filling test:

Studied collocations
Immediate and delayed posttests: collocation spaced > collocation massed =
node massed
Unstudied collocations
Immediate posttest: collocation spaced ≥ node massed; collocation spaced >
collocation massed; node massed = collocation massed
Delayed posttest: collocation spaced > collocation massed = node massed

Discussion
The present study was the first attempt to examine the effects of spacing on the
knowledge of both studied and unstudied L2 collocations. Hypothesis 1 predicted an
advantage of the collocation-spaced schedule over the two massed schedules for the
retention of studied collocations. It was shown that the collocation-spaced schedule led
to better retention of studied collocations than the massed schedules, regardless of type
(collocation filling or verb filling) or timing of posttest (immediate or delayed),
supporting Hypothesis 1. The collocation-spaced schedule led to superior retention
possibly because it was the only condition that involved spaced retrieval practice of
individual collocations. In other words, whereas retrieval opportunities for a given
collocation were concentrated into a single session in the two massed schedules, they
were distributed over 3 weeks in the collocation-spaced schedule. Retrieval opportu-
nities distributed over a long time perhaps resulted in effortful retrieval, which
facilitates retention according to the desirable difficulty framework (e.g., Bjork, 1994;
Suzuki et al., 2019). It should also be noted that during the treatment, the same context
sentence was repeated three times, instead of using three different contexts (see the
“Method” section). The repetition of the same context perhaps increased the reminding
potential for studied collocations. As a result, retrieval practice in the collocation-
spaced schedule was not only effortful but also successful, which facilitated retention
even more (reminding theory; Benjamin & Tullis, 2010; Koval, 2022).

Table 7. Results of the post-hoc analysis for treatment: Verb-filling test

Posttest   Items      Comparisons                                Estimate  95% CI          SE    z     Odds Ratio  p

Immediate  Studied    Collocation spaced vs. Node massed         2.13      [1.44, 2.82]    0.35  6.04  8.41        <.001
                      Collocation spaced vs. Collocation massed  1.58      [0.96, 2.20]    0.32  5.00  4.85        <.001
                      Collocation massed vs. Node massed         0.55      [−0.11, 1.20]   0.34  1.63  1.73        .899
           Unstudied  Collocation spaced vs. Node massed         1.18      [0.44, 1.93]    0.38  3.11  3.25        .079
                      Collocation spaced vs. Collocation massed  1.14      [0.49, 1.79]    0.33  3.45  3.13        .028
                      Collocation massed vs. Node massed         0.04      [−0.72, 0.80]   0.39  0.11  1.04        1.000
Delayed    Studied    Collocation spaced vs. Node massed         1.94      [1.22, 2.66]    0.37  5.31  6.96        <.001
                      Collocation spaced vs. Collocation massed  1.16      [0.54, 1.79]    0.32  3.64  3.19        .014
                      Collocation massed vs. Node massed         0.77      [0.08, 1.47]    0.35  2.19  2.16        .560
           Unstudied  Collocation spaced vs. Node massed         1.46      [0.64, 2.27]    0.41  3.52  4.31        .022
                      Collocation spaced vs. Collocation massed  1.24      [0.56, 1.92]    0.35  3.59  3.46        .017
                      Collocation massed vs. Node massed         0.21      [−0.63, 1.06]   0.43  0.50  1.23        1.000

A limited advantage of the collocation-massed schedule over the node-massed
schedule for the studied collocations was also found. On the delayed collocation-filling
posttest, the collocation-massed group significantly outperformed the node-massed
group, although the difference was not statistically significant on any other posttests.
The limited advantage of the collocation-massed group was caused possibly by
retrieval-induced facilitation (Chan et al., 2006), according to which retrieval facilitates
retention of not only practiced materials but also unpracticed related materials.
Specifically, in the collocation-massed group, three studied collocations for the same
node verb were distributed over 3 weeks (Week 1: draw a line, Week 2: draw tears, Week
3: draw a conclusion). Encountering draw a conclusion in Week 3, for instance, might
have reactivated knowledge of the two studied collocations introduced in earlier weeks
(Week 1: draw a line, Week 2: draw tears), resulting in retrieval-induced facilitation
from later weeks to earlier weeks. In the node-massed group, in contrast, three studied
collocations for the same node verb were concentrated into a single day. As a result,
retrieval-induced facilitation across weeks was not possible. At the same time, the
advantage of the collocation-spaced group over the collocation-massed group suggests
that effects of retrieval-induced facilitation were rather limited and repeating the same
collocations across multiple sessions facilitates retention more than repeating different
collocations with the same node.
The results of this study regarding Hypothesis 1 (i.e., collocation spaced > node
massed = collocation massed) are consistent with those of Macis et al. (2021, Exper-
iment 2), which showed that for intentional learning, studying collocations over
multiple days facilitated learning, relative to massing them into a single day. However,
this study’s results were not consistent with Snoder (2017), who found that long spacing
did not facilitate the retention of studied collocations. The inconsistent findings may be
due to the amount of spacing used in the studies. Specifically, whereas Macis et al.
(2021) compared spacing and massing (no spacing) and examined the spacing effect as
in the present study, Snoder (2017) compared the effects of two spacing schedules
(i.e., relatively short vs. relatively long intervals) and examined the lag effect.
Hypothesis 2 predicted that for knowledge of unstudied collocations, the
collocation-spaced schedule would be the most effective and the collocation-massed
schedule the least effective. Although results on the verb-filling test showed the advantage of the
collocation-spaced schedule over the other two, no significant difference was found
between the two massed schedules (collocation spaced > node massed = collocation
massed). Hypothesis 2, therefore, was only partially supported. The findings suggest
that the benefits of spacing apply not only to recall of previously presented materials
(i.e., studied words) but also to induction. The collocation-spaced schedule was the
most effective for unstudied collocations possibly because participants encountered
multiple collocations for the same node word every week throughout the treatment. For
instance, in the first treatment session (Wednesday in Week 1; see Figure 1), partic-
ipants in the collocation-spaced group were exposed to three collocations for the node
verb carry (e.g., carry a product, carry a tune, carry weight). This may have allowed
participants to make generalizations about what kinds of nouns the node verb could
take as an object, allowing them to transfer the knowledge of studied collocations to
unstudied collocations. Furthermore, the collocation-spaced group encountered the
same three collocations in the subsequent 2 weeks (Wednesday in Weeks 2 and 3). The
retrieval opportunities for the multiple collocations for the same node verb distributed
over the 3 weeks perhaps consolidated the learners’ understanding of the meaning
potential of carry, resulting in the largest gains in the collocation-spaced group for the
unstudied collocations.


In the node-massed schedule, participants were also exposed to multiple collocations
for the same node word within the same day, as in the collocation-spaced
schedule. This may have allowed learners to reconceptualize their knowledge of the
meaning potential of the node word. However, unlike the collocation-spaced schedule,
in the node-massed schedule, encounters with a given node verb were concentrated into
a single session, and they were never repeated in subsequent sessions. As a result, in the
node-massed schedule, learners’ knowledge of the meaning potential of the node words
perhaps decayed by the time of the posttests, resulting in the lack of significant
difference between the two massed schedules. These findings highlight the value of
distributed retrieval practice not only for studied but also for unstudied collocations.
At the same time, this study did not use a comparison group where multiple
collocations for the same node word were repeated on different days over multiple
weeks (e.g., carry a product is repeated on Mondays, carry a tune is repeated on
Wednesdays, and carry weight is repeated on Fridays over 3 weeks, instead of all three
collocations for carry being repeated on Wednesdays). As a result, it is not clear to what
extent the superiority of the collocation-spaced schedule was due to the fact that
participants encountered multiple collocations for the same node word on the same
day. In future research, it would be useful to include a condition where multiple
collocations for the same node are repeated on different days over multiple weeks.
Results of this study suggested that the collocation-spaced schedule was more
effective than the two massed schedules not only for studied but also unstudied
collocations. The collocation-spaced group, at the same time, may have resulted in
more over-extension errors (i.e., erroneously applying a target node verb to collocations
where a different verb should have been used) than the other two groups. To examine
whether this was the case, an error analysis was conducted. The error analysis indicated
that the collocation-spaced schedule resulted in more over-extension errors than the
two massed schedules (for details, see Appendix S8 in the Online Supplementary
Materials). The findings suggest that although the collocation-spaced schedule enabled
learners to transfer the knowledge of studied collocations to novel, unstudied colloca-
tions, it can be a double-edged sword in the sense that it may lead to over-extension
errors. In future research, it may be useful to examine how over-extension errors may be
reduced.
In this study, all three groups showed improvements on the verb-filling test for
unstudied items on the posttests. The findings suggest that learners were able to transfer
the knowledge of studied collocations to unstudied collocations. One explanation for
the findings is that exposure to multiple collocations with the same node allowed
learners to make comparisons between their existing knowledge of the verb’s semantics
and the range of different uses of the verb in the given collocations, which triggered semantic
restructuring (Jiang, 2004). Another explanation is that learners produced novel
collocations based on L1 translations of studied collocations. Some studied and
unstudied collocations for a given node shared the same L1 translation. For instance,
cut in both cut class (studied collocation) and cut school (unstudied collocation) is
translated into the same Japanese word, saboru. Similarly, the node verbs for the
following studied and unstudied collocations share the same L1 translations: draw
tears and draw laughs (sasou), draw attention and draw a line (hiku), and run an article
and run a story (keisaisuru). For these collocations, learners might have been able to
guess the correct node word based solely on the L1 translations, without understanding
the core meaning of the node.
To examine the effects of overlap of L1 translations among studied and unstudied
items, a follow-up analysis was conducted. The follow-up analysis that included overlap
of L1 translations as fixed and random effects suggested that unstudied items that
shared L1 translations with studied items were more likely to be answered successfully
than those that did not share L1 translations (p = .017, OR = 1.51 [negligible effect]; full
results of the follow-up analysis are presented in Appendix S9 in the Online Supple-
mentary Materials). At the same time, it should be noted that inferences based on L1
translations were probably not always successful because, for some target collocations,
different node verbs shared the same L1 translations. For instance, in all the following
collocations, the node verbs are translated into the same Japanese word, suru: cut a deal,
make a mention, meet death, put emphasis, and take pains. Because all these collocations
required different node verbs, it would not be possible to guess the correct node for
these collocations based solely on L1 translations (suru), and at least some understand-
ing of the meaning potential of the node might have been necessary.
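To make the reported effect size concrete, the following sketch (ours, not part of the study's analysis; the 30% baseline probability is purely illustrative) shows how an odds ratio of 1.51 translates into success probabilities:

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Return the success probability after multiplying the baseline odds by odds_ratio."""
    odds = p_baseline / (1 - p_baseline)   # convert probability to odds
    new_odds = odds * odds_ratio           # shared L1 translation multiplies the odds
    return new_odds / (1 + new_odds)       # convert back to probability

# If unstudied items WITHOUT a shared L1 translation were answered correctly 30% of
# the time, items WITH a shared translation would be answered correctly about 39%
# of the time under OR = 1.51 (illustrative numbers, not the study's raw scores).
p_with_overlap = apply_odds_ratio(0.30, 1.51)
print(round(p_with_overlap, 3))  # → 0.393
```

Even under this odds ratio, the predicted probability rises only from .30 to about .39, consistent with the "negligible effect" label based on Chen et al.'s (2010) benchmarks.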
Although all three groups showed improvements on the verb-filling posttest for
unstudied items, the posttest scores for the unstudied collocations were much lower
than those for the studied collocations in all three groups (Table 3). The relatively low
scores for the unstudied items may be in part due to three factors. First, during the
treatment, learners were exposed to only three collocations per node. This may have
made it difficult for learners to notice the core meaning underlying different uses of the
node word. Second, in this study, target collocations were determined so that the choice
of the node verb could be explained by the core meaning of the node verb (see Appendix
S4 in the Online Supplementary Materials). At the same time, for some collocations
(e.g., meet a need, run an article, take root), the relationship between the meaning of the
collocation and the core meaning of the node verb might have been difficult to
understand. This was perhaps another factor responsible for the relatively low scores
for the unstudied collocations on the verb-filling test. Third, some node verbs had
similar core meanings. For instance, as shown in Appendix S4, the core meanings of
four node verbs (draw, run, meet, and take) involved moving something. Due to the
similarity among these node verbs, learners might have had difficulty in transferring the
knowledge of studied collocations to unstudied collocations, resulting in the relatively
low scores for the unstudied collocations on the posttests.
Results for the quiz given at the end of each treatment session (Stage 7) showed
that the node-massed (96.4%) and collocation-massed groups (94.4%) outperformed
the collocation-spaced group (75.8%) during the learning phase. The results may be
partly attributed to the number of exposures before the quizzes. Specifically, whereas a
quiz was given after six exposures to each collocation in the two massed groups in all
3 weeks, it was given after two (Week 1), four (Week 2), or six exposures (Week 3) in the
collocation-spaced group (see Appendix S5 in the Online Supplementary Materials).
On the posttests, however, the collocation-spaced group significantly outperformed the
massed groups. The findings are consistent with the desirable difficulty framework
(Bjork, 1994; Suzuki et al., 2019), according to which a condition that increases learning
phase performance does not necessarily lead to better long-term retention than a
condition that decreases learning phase performance.
This study also showed wide gaps between the learning phase and posttest perfor-
mance for the two massed groups. Although the average score for the node-massed
group was 96.4% on the quiz given at the end of the learning phase (Stage 7), it dropped
to 14.9% and 6.6% for the immediate and delayed collocation-filling posttests, respec-
tively. Similarly, the collocation-massed group showed a substantial decrease from the
learning phase performance (94.4%) to the posttest performance (immediate colloca-
tion filling: 23.0%; delayed collocation filling: 15.9%). Figures 2 and 3 also indicate that
some participants in the two massed groups scored 0 on the posttest. The substantial
decrease possibly occurred because the two massed groups encountered only three
collocations per treatment session, whereas the collocation-spaced group encountered
nine collocations each day (Figure 1). The larger number of collocations practiced each
day perhaps increased the retrieval effort required of the collocation-spaced group. In
other words, although retrieval practice for the two massed groups was highly success-
ful, it was perhaps not very effortful. This may be partly responsible for the substantial
decrease from the learning phase to the posttest in these two groups.
Although direct comparisons between this study and other studies are difficult due to a
number of methodological differences, posttest scores in this study were relatively high,
compared with other studies involving L2 collocation learning. Boers, Demecheleer,
et al. (2014), for instance, report 4.5% to 11.2% gains on the verb-filling posttest and
8.9% to 13.7% gains on the collocation-filling posttest, after a single treatment session.
These scores were much lower than those obtained by the collocation-spaced group in
this study (62.1% on the immediate verb-filling and 57.8% on the immediate colloca-
tion-filling posttest). The results may demonstrate the value of spacing for collocation
learning. As a case in point, Ferguson et al. (2021) found that a treatment involving
repetition of the same collocations three times at 2-day intervals led to gains similar
to or larger than those in this study (48.0% to 62.5% on the immediate and 38.0% to
64.7% on the delayed posttests).

Pedagogical implications
The findings of this study suggest that introducing spacing in terms of individual
collocations (i.e., collocation-spaced schedule) facilitates the learning of both studied
and unstudied collocations. Pedagogically, the findings suggest that it may be useful for
learners to be exposed to multiple collocations containing the same node regularly. This
study also showed that although the two massed groups significantly outperformed the
collocation-spaced group during the learning phase, the collocation-spaced group
achieved higher posttest scores than the massed groups. Pedagogically, the findings
suggest that learners or instructors should not be discouraged even if the treatment
induces a large number of incorrect responses during learning (desirable difficulty
framework).

Concluding remarks
Although many studies have examined the effects of spacing on vocabulary learning,
most studies have investigated the learning of single words. Related studies that have
compared massing and spacing for collocation learning have so far investigated the learning
of only one collocate per node word (Macis et al., 2021; Snoder, 2017). Thus, it was not
clear how the spacing of multiple collocations with the same node affects the knowledge
of studied and unstudied collocations. The findings of this study are valuable because
they suggest that introducing spacing in terms of individual collocations (collocation-
spaced schedule) may facilitate the learning of both studied and unstudied L2
collocations. At the same time, because this study was the first to examine the role of
spacing for collocation learning in this way, it has several limitations.
First, this study was conducted within an authentic classroom setting. Although
classroom-based research helps increase ecological validity and has its benefits (Rogers
& Cheung, 2021), it is also limited in that experimental manipulations are not as tightly
controlled as in laboratory studies. For instance, during the treatment in this study,
participants were asked to say the correct answers aloud (Stages 3–6 during the
treatment; see Appendix S5 in the Online Supplementary Materials). Although over-
hearing other students’ responses is common in real-world classroom settings, it might
have affected learning. In future research, it may be useful to replicate this study in
laboratory settings. Second, in this study, the collocation-spaced group, which showed
better learning outcomes than the two massed groups, had the highest UVLT scores
among the three groups. Although the UVLT score was used as a covariate in the
analysis to control for English proficiency effects, in future research, it may be useful to
compare groups that are equivalent in their proficiency levels. Considering the value of
collocational knowledge for the appropriate and fluent use of L2 vocabulary, further
research examining the effects of spacing on collocation learning is warranted. Inves-
tigating the effects of spacing on the knowledge of unstudied collocations is also
valuable from a theoretical viewpoint because it allows researchers to examine whether
the benefits of spacing apply not only to recall of previously presented materials but also
to induction.
Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/10.1017/S0272263122000225.

Acknowledgments. This research was supported by JSPS KAKENHI grants (Grant Numbers: 19H00115
and 19K13306) and a Murata Science Foundation grant (Grant Number: M20 海人 08). We greatly
appreciate the invaluable suggestions given by anonymous reviewers and the handling editor, Dr. Luke
Plonsky. We would like to thank Dr. Akira Murakami, Mr. Akihiko Sato, and Dr. Atsushi Mizumoto for their
invaluable advice regarding statistical analyses. We would also like to thank Mr. Yasuaki Morikawa, Ms. Mika
Inoue, and Ms. Sakiho Itami for their help with data collection. Finally, our deepest gratitude goes to the high
school students who took part in the project.

Data Availability Statement. The experiment in this article earned an Open Materials badge for trans-
parent practices. The materials are available at https://osf.io/w2qj7/.

References
Ackermann, K., & Chen, Y. (2013). Developing the academic collocation list (ACL): A corpus driven and
expert-judged approach. Journal of English for Academic Purposes, 12, 235–247.
Bahrick, H. P., & Phelps, E. (1987). Retention of Spanish vocabulary over 8 years. Journal of Experimental
Psychology: Learning, Memory, & Cognition, 13, 344–349.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67, 1–48.
Benjamin, A. S., & Tullis, J. (2010). What makes distributed practice effective? Cognitive Psychology, 61,
228–247.
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe
& A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). MIT Press.
Boers, F., Dang, T. C., & Strong, B. (2017). Comparing the effectiveness of phrase-focused exercises. A partial
replication of Boers, Demecheleer, Coxhead and Webb (2014). Language Teaching Research, 21, 362–380.
Boers, F., Demecheleer, M., Coxhead, A., & Webb, S. (2014). Gauging the effects of exercises on verb–noun
collocations. Language Teaching Research, 18, 54–74.
Boers, F., Lindstromberg, S., & Eyckmans, J. (2014). Some explanations for the slow acquisition of L2
collocations. Vigo International Journal of Applied Linguistics, 11, 41–61.
Carpenter, S. K., & Mueller, F. E. (2013). The effects of interleaving versus blocking on foreign language
pronunciation learning. Memory and Cognition, 41, 671–682.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks:
A review and quantitative synthesis. Psychological Bulletin, 132, 354–380.

Chan, J. C. K., McDermott, K. B., & Roediger, H. L. (2006). Retrieval-induced facilitation: Initially nontested
material can benefit from prior testing of related material. Journal of Experimental Psychology: General,
135, 553–571.
Chen, H., Cohen, P., & Chen, S. (2010). How big is a big odds ratio? Interpreting the magnitudes of odds ratios
in epidemiological studies. Communications in Statistics: Simulation and Computation, 39, 860–864.
Conklin, K., & Carrol, G. (2018). First language influence on the processing of formulaic language in a second
language. In A. Siyanova-Chanturia & A. Pellicer-Sánchez (Eds.), Understanding formulaic language. A
second language acquisition perspective (pp. 62–77). Routledge.
Durrant, P., & Schmitt, N. (2010). Adult learners’ retention of collocations from exposure. Second Language
Research, 26, 163–188.
Eyckmans, J., Boers, F., & Lindstromberg, S. (2016). The impact of imposing processing strategies on L2
learners’ deliberate study of lexical phrases. System, 56, 127–139.
Ferguson, P., Siyanova-Chanturia, A., & Leeming, P. (2021). Impact of exercise format and repetition on
learning verb–noun collocations. Language Teaching Research. Advance online publication. https://doi.
org/10.1177/13621688211038091
González-Fernández, B., & Schmitt, N. (2015). How much collocation knowledge do L2 learners have? The
effects of frequency and amount of exposure. International Journal of Applied Linguistics, 166, 94–126.
Gyllstad, H., & Wolter, B. (2016). Collocational processing in light of the phraseological continuum model:
Does semantic transparency matter? Language Learning, 66, 296–323.
Henriksen, B. (2013). Research on L2 learners’ collocational competence and development: A progress report.
In C. Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use: New
perspectives on assessment and corpus analysis (pp. 29–56). EuroSLA.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge University Press.
Jiang, N. (2004). Semantic transfer and its implications for vocabulary teaching in a second language. The
Modern Language Journal, 88, 416–432.
Kang, S. H. (2016). The benefits of interleaved practice for learning. In J. C. Horvath, J. M. Lodge, & J. Hattie
(Eds.), From the laboratory to the classroom: Translating science of learning for teachers (pp. 79–93).
Routledge.
Kim, S. K., & Webb, S. A. (2022). The effects of spaced practice on second language learning: A meta-analysis.
Language Learning, 72, 269–319.
Kornell, N., & Bjork, R. A. (2008). Learning concepts and categories: Is spacing the “enemy of induction”?
Psychological Science, 19, 585–592.
Koval, N. G. (2022). Testing the reminding account of the lag effect in L2 vocabulary learning. Applied
Psycholinguistics, 43, 1–40.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second-language writing: A corpus analysis of
learners’ English. Language Learning, 61, 647–672.
Lenth, R. (2021). emmeans: Estimated marginal means, aka least-squares means. R package (Version 1.7.1-1)
[Computer software]. https://CRAN.R-project.org/package=emmeans
Macis, M., Sonbul, S., & Alharbi, R. (2021). The effect of spacing on incidental and deliberate learning of L2
collocations. System, 103, 102649. https://doi.org/10.1016/j.system.2021.102649
Nakata, T., & Suzuki, Y. (2019). Effects of massing and spacing on the learning of semantically related and
unrelated words. Studies in Second Language Acquisition, 41, 287–311.
Nakata, T., & Webb, S. (2016). Does studying vocabulary in smaller sets increase learning? The effects of part
and whole learning on second language vocabulary acquisition. Studies in Second Language Acquisition,
38, 523–552.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149.
Pellicer-Sánchez, A. (2017). Learning L2 collocations incidentally from reading. Language Teaching Research,
21, 381–402.
Peters, E. (2016). The lexical burden of collocations: The role of interlexical and intralexical factors. Language
Teaching Research, 20, 113–138.
Plonsky, L., & Derrick, D. J. (2016). A meta-analysis of reliability coefficients in second language research. The
Modern Language Journal, 100, 538–553.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language
Learning, 64, 878–912.

R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R
Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Rogers, J., & Cheung, A. (2021). Does it matter when you review? Input spacing, ecological validity, and the
learning of L2 vocabulary. Studies in Second Language Acquisition, 43, 1138–1156.
Shin, D. (2006). A collocation inventory for beginners. Unpublished doctoral dissertation. Victoria University
of Wellington, New Zealand.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study
perspective. The Canadian Modern Language Review, 64, 429–458.
Snoder, P. (2017). Improving English learners’ productive collocation knowledge: The effects of involvement
load, spacing, and intentionality. TESL Canada Journal, 34, 140–164.
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: Acquisition of collocations under
different input conditions. Language Learning, 63, 121–159.
Suzuki, Y., Nakata, T., & DeKeyser, R. M. (2019). The desirable difficulty framework as a theoretical
foundation for optimizing and researching second language practice. The Modern Language Journal,
103, 713–720.
Szudarski, P. (2012). Effects of meaning- and form-focused instruction on the acquisition of verb-noun
collocations in L2 English. Journal of Second Language Teaching and Research, 1, 3–37.
Szudarski, P. (2017). Learning and teaching L2 collocations: Insights from research. TESL Canada Journal,
34, 205–216.
Szudarski, P., & Carter, R. (2016). The role of input flood and input enhancement in EFL learners’ acquisition
of collocations. International Journal of Applied Linguistics, 26, 245–265.
Toomer, M., & Elgort, I. (2019). The development of implicit and explicit knowledge of collocations: A
conceptual replication and extension of Sonbul and Schmitt (2013). Language Learning, 69, 405–439.
Webb, S., & Chang, A. C.-S. (2022). How does mode of input affect the incidental learning of collocations?
Studies in Second Language Acquisition, 43, 55–77.
Webb, S., & Kagimoto, E. (2009). The effects of vocabulary learning on collocation and meaning. TESOL
Quarterly, 43, 55–77.
Webb, S., Newton, J., & Chang, A. C.-S. (2013). Incidental learning of collocation. Language Learning, 63,
91–120.
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated Vocabulary Levels Test: Developing and validating
two new forms of the VLT. International Journal of Applied Linguistics, 168, 34–70.
Wood, D. (2020). Categorizing and identifying formulaic language. In S. Webb (Ed.), Routledge handbook of
vocabulary studies (pp. 30–45). Routledge.

Cite this article: Yamagata, S., Nakata, T. and Rogers, J. (2023). Effects of distributed practice on the
acquisition of verb-noun collocations. Studies in Second Language Acquisition, 45, 291–317. https://doi.org/10.1017/S0272263122000225



Studies in Second Language Acquisition (2023), 45, 318–347
doi:10.1017/S027226312200016X

RESEARCH ARTICLE

A role for verb regularity in the L2 processing of the Spanish subjunctive mood: Evidence from eye-tracking
Sara Fernández Cuenca1* and Jill Jegerski2
1Wake Forest University, Winston-Salem, NC, USA; 2University of Illinois at Urbana-Champaign, Urbana, IL, USA
*Corresponding author. Email: fernans@wfu.edu

Abstract
The present study investigated the second language processing of grammatical mood in
Spanish. Eye-movement data from a group of advanced proficiency second language users
revealed nativelike processing with irregular verb stimuli but not with regular verb stimuli. A
comparison group of native speakers showed the expected effect with both types of stimuli,
although the effect was slightly more robust with irregular verbs than with regular verbs. We propose
that the role of verb form regularity was due to the greater visual salience of Spanish
subjunctive forms with irregular verbs versus regular verbs and possibly also due to less
efficient processing of rule-based regular inflectional morphology versus whole irregular
word forms. In any case, the results suggest that what appeared to be difficulty with sentence
processing could be traced back to word-level processes, which appeared to be the primary
area of difficulty. This outcome seems to go against theories that suggest that L2 sentence
processing is shallow.

Introduction
The question of whether adult second language (L2) learners can be similar to native
(L1) speakers is fundamental in the study of adult second language acquisition,
including research on nonnative sentence processing. The last decade has seen the
emergence of different theories that seek to explain how sentences are processed in a
nonnative language, with particular attention to identifying areas of difficulty. For one,
Clahsen and Felser (2006, 2018) have suggested that “structural processing is compro-
mised in nonnative comprehension” (Felser & Cunnings, 2012, p. 600) and that an
intrinsic deficit in syntactic and morphosyntactic representations and a failure to
integrate these during real-time language comprehension can best characterize L2
online sentence comprehension. In a closely related theory, Cunnings (2017) has
proposed that the primary limitations lie not in syntax and morphosyntax, but in a
greater susceptibility to interference during retrieval from memory, although the
empirical predictions of the two theories are often the same. However, the Lexical
Bottleneck Hypothesis (Hopp, 2014, 2018) proposes that online effects that appear to
indicate an issue with syntax and morphosyntax in sentence processing can in reality be
indirect effects of difficulty with lexical access (Hopp, 2017a).

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and
the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any
commercial use and/or adaptation of the article.
Research related to the newer theory of Hopp (2014, 2018) examines factors at the
word level and how they might have implications for processing at the phrase and
sentence level, given that lexical access and the representation of many word-level details
(e.g., word class, verb subcategorization, number) logically precede the processing of
syntax and morphosyntax. In one example, lower verb frequency was associated with
slower processing of cleft sentence structure (e.g., It was Andrew who forced Piper to steal
the camera.) among L2 learners but not native speakers (Hopp, 2016). Less exposure to
lower frequency words is generally associated with delayed lexical access in the proces-
sing of individual words (e.g., Brysbaert et al., 2018), an effect that has been widely
observed with L1 language users in work on lexical processing and that can be more
pronounced with L2 learners than with native speakers (e.g., Cop et al., 2015). Hopp’s
(2016) study showed that such frequency-related processing difficulty at the word level
can delay higher level structural processes for establishing sentence word order, so an
apparent struggle with syntactic processing among L2 learners can be traced back to
lexical processing. Other word level factors that have been shown to affect syntactic and
morphosyntactic processing include the cognate status of target words between partic-
ipants’ two languages (Hopp, 2017b; Miller, 2014) and knowledge of individual lexical
items such as gender assignment for nouns (Hopp, 2013; Lemhöfer et al., 2014).
Like this previous work, the present study also examined the role of a lower-level
factor (form regularity of Spanish verbs) in the processing of a higher-level morpho-
syntactic phenomenon (the subjunctive mood in embedded clauses) with the goal of
determining whether any observed difficulty in the processing of mood might be better
accounted for in terms of generalized difficulty processing phrases and sentences
(Clahsen & Felser, 2006, 2018) or as indirect effects of difficulty at the level of individual
words (Hopp, 2014, 2018). The word-level factor of interest, verb form regularity, has
been said to facilitate the acquisition and use of the subjunctive among L2 learners
(Collentine, 1997). Such a role for morphological regularity might be explained in terms
of perceptual salience (change in the verb stem vs. a single vowel) or item-specific
lexical information for verb conjugation class (-ar, -er, -ir), as will be explained in the
background sections that follow on the Spanish subjunctive and its acquisition by L2
learners. In either case, form irregularity would facilitate the processing of a verb so that
it is accurately marked for either indicative or subjunctive mood, which must occur
prior to the processing of the subjunctive at the morphosyntactic level within the
broader sentence. In other words, if the embedded clause verb is more likely to be
accurately processed for mood when it is irregular, this would also improve the chances
of successful processing of the mood dependency between that verb and the main clause
trigger verb (to be discussed in greater detail in the following section). In this sense, verb
regularity is analogous to other word-level factors like word frequency and cognate
status in that it could indirectly affect processing at the phrase level.

The subjunctive mood in Spanish


Grammatical mood is a type of inflectional morphology on verbs that communicates
semantic modality, or the speaker’s attitude toward the propositional content of a
phrase (Bosque, 2012). In Spanish, finite verbs are inflected for one of three moods:
indicative, which represents the unmarked default for assertions; imperative, which is
used for direct commands; and subjunctive, which is used to convey attitudes such as
doubt, desire, and conjecture. The subjunctive mood typically appears in an embedded
clause of a complex sentence and is selected by the lexical semantics of a verb or other
expression in the matrix clause. For example, a main clause like Dudo que… “I doubt
that…” or Espero que… “I hope that…” would trigger subjunctive morphology on the
verb in the embedded clause because the speaker is expressing doubt or desire.
In terms of morphology, finite verbs in Spanish contain a stem and one to three of
the following inflectional suffixes: a thematic vowel that indicates to which of the three
conjugation classes the verb belongs (i.e., -ar, -er, -ir); a composite morpheme for tense,
aspect, and mood; and an agreement morpheme for person and number (ibid.). In
the present tense, which was used in the eye-tracking stimuli for the present study, the
subjunctive mood is not marked with the addition of a suffix but rather by switching
the thematic vowel, so a changes to e and e and i change to a. Hence, the vowel does not
usually indicate mood because two of the three vowels, a and e, which are also by far the
most common, are used to mark both indicative and subjunctive, depending on
the verb. The only way to know which is which is to compare the thematic vowel with
the conjugation class of the individual verb, which means accessing this item-specific
information in the lexicon. Moreover, for regular verbs in the present, the vowel switch
is the only difference between the indicative and subjunctive, so mood is not very salient
to the reader or listener (e.g., escuchan [IND] vs. escuchen [SUBJ] “they listen”). Irregular verbs,
however, show a stem change in addition to the switch of thematic vowel, so the
difference is more readily apparent (e.g., tienen [IND] vs. tengan [SUBJ] “they have”).
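The thematic-vowel switch for regular present-tense forms described above can be expressed as a small rule (an illustrative sketch, not material from the study; the function name and its restriction to third-person plural forms are ours):

```python
def present_subjunctive_3pl(indicative_3pl, conjugation_class):
    """Derive the present-subjunctive third-person plural of a REGULAR Spanish verb
    from its indicative form by switching the thematic vowel (a -> e; e, i -> a)."""
    stem = indicative_3pl[:-2]  # drop thematic vowel + plural -n (the "-an"/"-en" ending)
    if conjugation_class == "ar":
        return stem + "en"      # -ar verbs: a -> e (escuchan -> escuchen)
    elif conjugation_class in ("er", "ir"):
        return stem + "an"      # -er/-ir verbs: e -> a (comen -> coman, viven -> vivan)
    raise ValueError("conjugation_class must be 'ar', 'er', or 'ir'")

print(present_subjunctive_3pl("escuchan", "ar"))  # → escuchen
print(present_subjunctive_3pl("comen", "er"))     # → coman
print(present_subjunctive_3pl("viven", "ir"))     # → vivan
```

Note that an irregular verb such as tener would not be derivable this way: tienen → tengan also involves a stem change, which is precisely the extra cue that makes mood marking more salient with irregular verbs.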
Thus, the Spanish subjunctive is semantically abstract and also linguistically com-
plex because it involves verb morphology, sentence-level semantics, and morphosyn-
tax. Indeed, there is an extensive body of literature in theoretical linguistics that
attempts to explain the subjunctive and its many nuances (see ibid., for an overview),
which include obligatory use in some linguistic contexts and variable use in others
(Gudmestad, 2010, 2012a; Poplack et al., 2018). Not surprisingly, mood tends to be
acquired later by children than many other aspects of grammar, and different contexts
of use appear at different ages. The morphological inflections for mood are first seen
with direct commands and are typically acquired by around age 2 (López Ornat et al.,
1994). The subjunctive appears next with temporal adverbials (e.g., cuando “when…,”
antes de que “before…”) and in sentential complement clauses with volitional predi-
cates (like those used in the stimuli for the present study), both of which show mostly
adult-like use by around age 4 (Blake, 1983; Sánchez-Naranjo & Pérez-Leroux, 2010).
Other types of sentential complement clauses are acquired over the next several years;
for example, by around age 9 for predicates of doubt, attitude, and assertion (Blake,
1983). Moreover, in addition to linguistic development, cognitive development appears
to be a factor in learning uses of the subjunctive mood that rely on epistemic aspects of
semantics, as these rely on the capacity to understand false beliefs (Pérez-Leroux, 1998).
From the perspective of sentence processing, the morphosyntactic relationship
between a trigger expression in the matrix clause and the subjunctive morphology on
the verb in an embedded clause could be classified as a distance dependency, a
phenomenon that is of particular interest in the study of nonnative processing because
of the high processing demands it incurs (Clahsen & Felser, 2006). To our knowledge,
there are no published eye-tracking studies of the L1 processing of Spanish mood.
There is one published study using self-paced reading (Demestre & García-Albea,
2004, Experiment 2), but the target was syntactic ambiguity rather than
morphosyntax per se. The stimulus sentences were also quite different from those for
the present study, as the subjunctive appeared in the past tense in very complex
triclausal sentences that were always grammatical, but in the present study it appeared
in the present tense in biclausal sentences that varied with regard to the grammaticality
of the critical verb. Still, the results are broadly relevant to the present study in that they
showed that experimental manipulations of mood can bring about online effects, at
least with native speakers.

https://doi.org/10.1017/S027226312200016X Published online by Cambridge University Press

L2 processing of the Spanish subjunctive mood 321

The Spanish subjunctive in adult second language acquisition


The acquisition of the Spanish subjunctive mood by adult learners has been the object
of a fair amount of empirical investigation, probably in part because of its notoriety
among students and instructors in the context of formal language instruction (see
Collentine, 2014, for a comprehensive review). As outlined in the previous section,
grammatical mood involves semantic abstraction and linguistic complexity, which
makes it difficult to acquire, plus mood morphemes typically involve the switching of a
single vowel, so they are often nonsalient. Furthermore, subjunctive morphology is
usually redundant and thus not critical to meaningful communication (Lee, 1987;
Terrell et al., 1987), which in turn makes the form more difficult to acquire (Leow, 1993;
VanPatten, 1994, 1996). To illustrate, in a sentence like Espero que escuchen (SUBJ) "I hope
they listen," both the matrix clause verb espero "hope" and the inflectional morphology
on the embedded clause verb escuchen (SUBJ) "listen" convey the speaker's attitude of
wishing for the event to happen.
Despite these obstacles, research has suggested that some uses of the subjunctive can
be acquired to some degree. There is evidence that oral production improves with
immersion experience (Isabelli & Nishida, 2005; Lubbers Quesada, 1998). Studies in the
generative framework have found that performance on interpretation and judgment
measures can be high or even nativelike among adult learners with a very high level of
L2 proficiency, for trigger contexts such as volitional predicates (Borgonovo et al., 2005;
Iverson et al., 2008; Massery, 2009) and negated epistemic and perception predicates
(Borgonovo & Prévost, 2003; Iverson et al., 2008). Variationist work has observed that
the oral production patterns of high-proficiency L2 learners can closely resemble those
of native speakers in terms of frequency and contextual factors that shape variation; the
only point of divergence was with the discourse pragmatic variable of hypotheticality
(Gudmestad, 2012a).
Research on the L2 acquisition of the Spanish subjunctive has also observed that
empirical findings can vary according to the experimental task or measure. First, more
targetlike production seems to occur with a greater focus on form: in written produc-
tion versus an oral interview (Terrell et al., 1987), in a controlled production task versus
an oral interview (Collentine, 1995), and in a verb elicitation task more than in a clause
elicitation task and in both of those more than in a role play (Gudmestad, 2012a).
Second, written comprehension might favor more nativelike subjunctive use as com-
pared to oral production (Geeslin & Gudmestad, 2008; Montrul, 2011). Task effects can
also interact with other variables, for example, verb form regularity (Geeslin &
Gudmestad, 2008; Gudmestad, 2012b). Finally, performance on a number of untimed
and largely form-focused written measures has been found to correlate with general
metalinguistic knowledge of Spanish (Correa, 2011) and with explicit knowledge of
Spanish mood (Gutiérrez, 2017), whereas there was no correlation of explicit knowl-
edge with accuracy in an oral interview.

322 Sara Fernández Cuenca and Jill Jegerski

Thus, the choice of investigative method appears to be important, with untimed and
form-focused tasks being most affected by explicit and metalinguistic knowledge. The
same participants who perform quite well on untimed written assessments can struggle
with a more authentic communicative task like an oral interview. However, there is
evidence that even native speakers might use the subjunctive less in oral production
than in writing (Geeslin & Gudmestad, 2008), so oral production also has limitations as
an experimental measure. Another way to get at this issue is with a real-time measure of
language comprehension such as self-paced reading or eye tracking, which have
relatively realistic time constraints and the potential for focus on meaning over form
(Keating & Jegerski, 2015), yet they also allow the researcher to target a very specific
linguistic form using controlled written stimuli. To our knowledge, only one previous
study has taken this approach.
Cameron (2011, 2017) employed the self-paced reading method to study L2 com-
prehension of the subjunctive with impersonal expressions of certainty such as Es
probable que… “It’s likely that…” and Es cierto que… “It’s true that….” The experi-
mental task was focused on the meaning of the sentence; a picture was displayed for
each stimulus sentence and participants were asked to indicate whether it matched the
meaning of the sentence. A comparison group of native speakers showed longer reading
times following the critical verb in the ungrammatical condition versus in the gram-
matical condition, but no reading time differences according to whether the picture
matched the stimulus sentence. Conversely, three groups of L2 learners at different
proficiency levels all showed sensitivity to the match between the sentence and the
picture, but not to the grammaticality of mood in the sentence. Cameron concluded
that L2 learners do not process the subjunctive like native Spanish users, in line with the
Shallow Structure Hypothesis (Clahsen & Felser, 2006, 2018). Nevertheless, an impor-
tant limitation of this study was that it only included regular verbs, even though there is
evidence that irregular verbs may provide an advantage to L2 learners with the
subjunctive (Collentine, 1997; Gudmestad, 2012b).
The role of verb regularity in the acquisition of the Spanish subjunctive was first
investigated by Collentine (1997), who suggested that verbs with irregular stems were
more likely to be noticed and subsequently acquired by learners because the mood
contrast is more salient with irregular verbs than with regular ones, in which a single
vowel changes to mark mood. Indeed, the participants in Collentine’s experiment took
more time to respond to irregular verb items than to regular ones in a meaning-oriented
scrambled sentence task. A number of studies with a variationist approach have
observed that L2 learners of Spanish use the subjunctive more with irregular verbs
than with regular ones in oral interviews (Lubbers Quesada, 1998), in a written binary
choice paragraph completion task (Gudmestad, 2006), and in three different oral
elicitation tasks (Gudmestad, 2012a). Still, there is also evidence that such effects do
not appear on all experimental measures or at all levels of proficiency (Gudmestad,
2012b), and that the reverse effect can even occur in some contexts (Geeslin &
Gudmestad, 2008). A more recent study by Gallego and Pozzi (2018) found that
irregular morphology influenced recognition and production of the subjunctive in
both the aural and written modality, but at different rates depending on the task.
In sum, the Spanish subjunctive has received a good amount of attention in SLA
research and there is evidence that the choice of experimental task is important.
Nevertheless, only one prior investigation employed a real-time processing measure,
self-paced reading, and it employed stimulus sentences with only regular verbs, which
have been associated with less nativelike performance in studies using other research
methods. In the present study, we addressed these limitations by including stimuli with
both regular and irregular verbs. In addition, this study employed eye tracking, a
more nuanced measure of sentence processing than self-paced reading. More specifically,
because eye tracking provides data on both early processing and later stages, it can
determine whether L2 readers might be slower to integrate different types of linguistic
information during processing than L1 readers (e.g., Felser et al., 2012). Another
advantage of eye tracking is that it has greater ecological validity as an experimental
measure of reading than self-paced reading: rereading is possible in eye tracking but not
in self-paced reading, sentences must be displayed as individual words or phrases in self-
paced reading but not with eye tracking, and participants have to press a button after each
word or phrase during self-paced reading, but not with eye tracking.
Given this background, the present study posited the following two research
questions:

1. Do advanced L2 readers show online sensitivity to Spanish mood while reading
sentences for comprehension? How do they compare to native readers in this
regard?
2. Does the regularity of verbs marked for mood affect online sensitivity among
advanced L2 readers? How do they compare to native readers in this regard?

These two research questions were based on current theoretical debates in L2 sentence
processing, as outlined in the “Introduction” section of this article. More specifically,
these questions were posed to determine whether difficulty in processing Spanish mood
(if evident) is caused by generalized difficulty with the syntax and morphosyntax of the
form, as proposed by the Shallow Structure Hypothesis (Clahsen & Felser, 2006, 2018),
or alternatively, such difficulty is an indirect effect of lower level factors that affect the
processing of individual lexical items, as proposed by the Lexical Bottleneck Hypothesis
(Hopp, 2018). Results that reflect a lack of online sensitivity to mood in all contexts,
regardless of verb regularity, would support theoretical claims of difficulty with syntax
and morphosyntax (Clahsen & Felser, 2006, 2018), whereas a role for verb regularity
would support the claim that the primary difficulty arises at the word level (Hopp,
2018).1

1 It should be noted that the Shallow Structure Hypothesis proposes an increased role for lexical
information as a way of compensating for difficulty in the L2 processing of syntax and morphosyntax and,
on the surface, this claim may sound like it would also be supported by a role for verb regularity in the
processing of the subjunctive because word form information is stored in the lexicon. However, the Shallow
Structure Hypothesis juxtaposes the lexicon and syntax/morphosyntax in this manner when the two are
alternate cues in the interpretation of a single detail of sentence processing (in either a competing or
complementary manner). To illustrate, in the processing of ambiguous relative clauses (Felser et al., 2003;
Papadopoulou & Clahsen, 2003), the “lexical bias” (Felser et al., 2003, p. 457) of the preposition “with,” which
leads to low attachment of relative clauses, is contrasted with language-specific phrase structure principles
that lead to high attachment because both provide cues in the attachment of relative clauses, even though the
former is lexical and the latter syntactic. In the present study, however, verb regularity did not provide any
type of cue regarding the mood of a given verb or of the clause or sentence in which it appeared. Verb
regularity was not a potential substitute or competitor for the syntactic and morphosyntactic information that
is required to process mood because irregular verbs occur in both the subjunctive and the indicative, as do
regular verbs. Hence, it appears that verb regularity in the processing of Spanish mood does not fit with the
type of lexical information that is of interest in the Shallow Structure Hypothesis and that the theory would
therefore not account for a role for verb regularity as an experimental outcome.


Method
Participants
Twenty Spanish native speakers and 20 high-proficiency Spanish L2 speakers were
recruited at a large university in the Midwestern United States (see Table 1). The native
speakers were raised in Spanish-speaking countries and moved to the United States to
pursue college degrees. They did not acquire proficiency in English until after puberty,
although some reported minimal exposure to the language in childhood, typically using
school class time of 1 hour or less per week. The L2 learners were native speakers of
English who learned Spanish after puberty. Proficiency was measured with a modified
version of the DELE standardized Spanish proficiency test (Montrul & Slabakova,
2003). Cronbach’s alpha (a measure of reliability) for the test was .73, which is slightly
lower than in previous research using the same instrument (.83, Montrul & Ionin, 2012;
.84, Montrul et al., 2008).
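The reliability statistic reported here can be sketched as follows (a minimal illustration with an invented score matrix, not the study's data; population variances are used consistently):

```python
# Hedged sketch: Cronbach's alpha for a k-item test,
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one list per participant, one score per test item."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_vars = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Three participants, four binary items (hypothetical data):
demo = [[1, 1, 1, 0],
        [1, 0, 1, 0],
        [0, 0, 1, 0]]
print(round(cronbach_alpha(demo), 2))  # 0.44
```

With real test data of the size used here (20 participants per group, 50 DELE items), the same formula yields values in the range reported in the text.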
The sample size of 40 total participants was chosen based on common practice in L2
eye-tracking studies, but power analysis for linear mixed-effects models was also
conducted using the simr package (Green & MacLeod, 2016) in R (R Core Team,
2021), based on 200 simulations per effect, a moderate effect size of .40, 40 total
participants, and 32 stimulus items per participant. Power to detect the effects of
group, grammaticality, verb regularity, and three interactions of interest (grammaticality × group,
grammaticality × verb regularity, and grammaticality × group × verb regularity) in the
total dwell time data from the critical region of interest was calculated.
Estimates ranged from 79.50% to 100.00% power, with the lowest value corresponding
to the three-way interaction.
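The logic of simulation-based power analysis can be sketched in miniature (a crude illustration in the spirit of what simr does for mixed-effects models in R; the actual analysis used simr, and the one-sample z approximation below, with invented standardized difference scores, is a deliberate simplification):

```python
# Crude Monte Carlo power sketch: simulate many experiments with a true
# standardized effect of d = .40 and n = 40, and count how often the effect
# is detected at alpha = .05 (z approximation with critical value 1.96).
import random
import statistics

def simulated_power(d=0.40, n=40, sims=2000, crit=1.96, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # Per-participant standardized difference scores ~ N(d, 1)
        diffs = [rng.gauss(d, 1.0) for _ in range(n)]
        t = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
        if abs(t) > crit:
            hits += 1
    return hits / sims

print(simulated_power())  # roughly .7 for d = .40 with n = 40
```

Tools like simr apply the same simulate-and-count logic to full mixed-effects models rather than to a single test statistic.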

Table 1. Language background information

                          L1 (n = 20)              Advanced L2 (n = 20)
                          M      SD    range       M      SD    range
Age                       30.90  8.08  20–56       31.20  9.80  21–64
Age of acquisition
  English                  9.90  3.38   6–15         .15   .67   0–3a
  Spanish                   .15   .48   0–2a       13.10  2.50  10–19
DELE scores               48.25  1.51  45–50       45.20  2.87  38–49
Self-ratings: English
  Understanding            8.65  1.26   5–10        9.95   .22   9–10
  Speaking                 7.75  1.25   5–10        9.90   .30   9–10
  Reading                  8.75  1.44   5–10        9.95   .22   9–10
Self-ratings: Spanish
  Understanding            9.75   .71   7–10        8.05   .75   7–9
  Speaking                 9.90   .30   9–10        7.80  1.15   5–9
  Reading                  9.90   .30   9–10        8.25  1.11   6–10

Note: The maximum score was 50 for the DELE and 10 for self-rated proficiency.
a One of the L1 participants reported an Age of Acquisition for Spanish of 1, but the person also reported no other languages
in early childhood. One L1 participant reported an age of 2 for Spanish with early exposure to Basque, but also referred to
Spanish as their "mother tongue." One L2 participant reported an age of 3 for English with early exposure to Arabic, but also
referred to English as their "first language."

Materials
The experimental stimuli were complex sentences with a critical verb in an embedded
clause that required the subjunctive mood because of a trigger verb in the preceding
main clause.2 As illustrated in (1) and (2), the critical verb with mood was either regular
or irregular. The regular verbs required only a switch of thematic vowel to mark the
subjunctive mood (e.g., comen → coman), whereas the irregular verbs required the
vowel switch plus a stem change that included the addition of a voiced velar stop /g/
(e.g., tienen ! tengan). There are two different types of irregularity with Spanish mood
(Gudmestad, 2012b) and we chose this type with the additional consonant because it
meant that word length was the same for both indicative and subjunctive moods.
(1) Regular Verb Stimulus
a. El ministro espera que los ciudadanos aprueben su propuesta. Grammatical
b. El ministro espera que los ciudadanos aprueban su propuesta. Ungrammatical
"The minister hopes that the citizens approve (SUBJ/IND) his proposal."

(2) Irregular Verb Stimulus


a. Los supermercados piden que los clientes traigan sus bolsas. Grammatical
b. Los supermercados piden que los clientes traen sus bolsas. Ungrammatical
"Supermarkets ask that their customers bring (SUBJ/IND) their bags."

The regular and irregular verbs used in the stimulus sentences were both high
frequency (regular: M = 3.40; irregular: M = 3.84 log frequency per million words;
Cuetos et al., 2011) but still differed significantly from each other (t = 2.29, p = .01),
which is typically the case because verb irregularity is associated with high frequency
(e.g., Pinker, 1999). Because frequency differences can lead to differences in processing
speed at the word level that might translate into difficulty in processing morphosyntax
at the sentence level (Hopp, 2016; Jegerski & Fernández Cuenca, 2019), word frequency
of the critical verbs was included as a covariate in the statistical analyses of the eye-
movement data for this study, as will be shown in the “Results” section.
It should also be noted that the regular and irregular verbs differed with regard to the
visual salience of the mood distinction because of the nature of Spanish irregular verbs.
With regular verbs, the mood distinction is expressed with a change of thematic vowel,
a single letter/phoneme, but with irregular verbs, there is a change of thematic vowel
plus a stem change, so the indicative and subjunctive forms differ in terms of multiple
letters/phonemes. Hence, morphological regularity and form salience cannot be teased
apart in Spanish, but both are word-level factors, so this does not affect the fit of the
stimulus design with the theoretical framing of the study.
The trigger verbs for the main clause were chosen based on minimal variation in
their selection of the subjunctive mood and high enough word frequency to ensure that
they would be known to the advanced level L2 participants in this study (within the top
5,000 “core” Spanish words; Davies, 2006). They also had to fit in coherent sentences
with the critical verbs and following the stimulus template. A total of 18 different trigger
verbs appeared a mean of 1.7 times (range 0–5) in each of the two stimulus sets (regular
verb stimuli and irregular verb stimuli), with the result being that 22 of the 32 stimuli for
each verb type (68.8%) had identical trigger verbs to the corresponding items in the
other set and the remainder had trigger verbs that were the same ones repeated a
different number of times (e.g., pedir "to request" appeared once in the irregular verb
stimuli and twice in the regular verb stimuli; 12.5%) or were different verbs that shared
the same critical characteristics of being strong subjunctive triggers and of relatively
high word frequency (18.8%). The lists of trigger verbs in the two sets were of similar log
frequency per million words, as shown by an independent samples t-test, t(62) = .02,
p = .99, that included a log frequency for every stimulus item, even if the trigger verb was
repeated across multiple items.

2 The stimuli did not include the opposite pattern, in which a subjunctive verb appeared in an embedded
clause in the absence of a trigger verb in the main clause. The primary reason for this was that subjunctive verb
forms are much less frequent than indicative forms, with a distribution of around 10% subjunctive and 90%
indicative (Kanwit & Geeslin, 2018). Hence, for the stimulus condition in question, the critical verb in the
ungrammatical condition would also be much less frequent than in the grammatical condition and any
increase in dwell times would therefore be difficult to interpret. For this reason, the stimuli with subjunctive
verbs were always grammatical.
Given the variation associated with some uses of the subjunctive in Spanish, we
normed the stimuli with a different group of 20 native speakers prior to the experiment.
Acceptability judgments on a 5-point Likert scale showed a mean rating of 1.00 (1:
“Completely unacceptable”) for the ungrammatical items and 4.90 (5: “Totally
acceptable”) for the grammatical items. The mean acceptability score and range of
responses for each sentence can be found in the online supplementary materials.
Thirty-two stimuli with regular verbs and 32 with irregular verbs were rotated across
four counterbalanced presentation lists such that each contained 16 stimuli with regular
verbs and 16 with irregular ones (each with eight grammatical and eight ungrammat-
ical). A total of 13 irregular verbs and 23 regular verbs were included and were used one
to three times per list, with a mean of 1.78 stimuli with each irregular verb and 1.38
stimuli with each regular verb in each list. Each sentence appeared only once in either
condition per list (i.e., grammatical or ungrammatical). The 32 target stimuli in each list
were combined with 32 distractors, stimuli for another experiment on nonlocal verbal
number agreement (as illustrated in Example 3), and 64 fillers that were all also 50/50%
grammatical/ungrammatical.3 The fillers were of a visual length consistent with that of
the experimental sentences and contained several different types of grammar errors to
maximize distraction, including erroneous prepositions, number agreement in the
noun phrase, gender agreement, missing complementizers, and definite articles. Two
examples are provided in (4) and (5). The 128 total sentences were presented in
pseudorandom order such that no two sentences of the same type appeared in
succession. All eye-tracking materials and stimulus counterbalancing followed the
recommendations of Keating (2014; Keating & Jegerski, 2015). The complete set of
experimental stimuli can be found in the online supplementary materials.
(3) Distractor with Verbal Number Agreement
a. El paquete que pidió la secretaria llegó esta tarde a las cinco. Grammatical
b. Los paquetes que pidió la secretaria llegó esta tarde a las cinco. Ungrammatical
“The package(s) that the secretary requested arrivedSING this afternoon at five.”

(4) Filler Sentence with an Erroneous Preposition


El veterinario es un profesional que cuida *por los animales.
“A veterinarian is a professional who cares for animals.”

3 In the target stimuli, the critical verb in the embedded clause always appeared in the subjunctive mood in
the grammatical condition and in the indicative in the ungrammatical condition. However, this association
between subjunctive/grammatical and indicative/ungrammatical was not present in the overall eye-tracking
experiment because the distractors and fillers contained many instances of the indicative in both grammatical
and ungrammatical sentences, including in embedded clauses. The subjunctive was not used in any of the
distractors or fillers, as we wanted to avoid additional exposure to the target form, which is more marked than
the default indicative.


(5) Filler Sentence without an Error


En El Salvador, el día de las madres se celebra el 10 de mayo.
“In El Salvador, Mother’s Day is celebrated on May 10.”
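Two of the list-construction constraints described above — the 50% filler ratio and the requirement that no two sentences of the same type appear in succession — can be satisfied by strict alternation of filler and non-filler slots. The sketch below is one such scheme (invented for illustration; the article does not specify the authors' actual randomization script):

```python
# Hedged sketch: with fillers making up exactly half of the 128 trials,
# alternating a shuffled non-filler trial with a shuffled filler guarantees
# that no two sentences of the same type ever appear in succession.
import random

def interleaved_order(others, fillers, rng=None):
    rng = rng or random.Random(7)
    o, f = others[:], fillers[:]
    rng.shuffle(o)
    rng.shuffle(f)
    order = []
    for pair in zip(o, f):          # other, filler, other, filler, ...
        order.extend(pair)
    return order

# 32 targets + 32 distractors interleaved with 64 fillers, as in the article:
others = ([("target", i) for i in range(32)] +
          [("distractor", i) for i in range(32)])
fillers = [("filler", i) for i in range(64)]
order = interleaved_order(others, fillers)
print(len(order))                                              # 128
print(all(a[0] != b[0] for a, b in zip(order, order[1:])))     # True
```

Because every non-filler is separated by a filler, targets can never be adjacent to targets, nor distractors to distractors, regardless of how the two halves are shuffled.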

Procedure
Written instructions and eight practice trials preceded the experiment, which was run
on a desktop-mounted EyeLink 1000 eye tracker (SR Research, 2005) with chin and
forehead rests, a sampling rate of 1000 Hz, and tracking of the right eye only. The
participant sat 39 inches from the display, which was presented on a 22-inch ViewSonic
monitor. The eye tracker was calibrated with a nine-point grid and validated to ensure a
maximum error of .5 degrees. Calibration was conducted before and after the practice
trials, after the participant had completed half of the 128 experimental trials, and
additionally as needed, based on an automatic drift check that was included at the
beginning of each trial. Stimulus sentences and comprehension questions were pre-
sented in black, 24-point Tahoma font on a white background, with each stimulus
presented as a single line of text. Participants proceeded from one screen to the next
using a green button and responded to comprehension questions using buttons marked
“A” and “B” on a Microsoft Sidewinder game controller (the standard response device
that came with the EyeLink 1000 equipment package).
Participants were told that the test targeted reading comprehension and they
answered a meaning-based comprehension question after each stimulus. Participants
were offered breaks from reading after the practice and after they had read half of the
128 sentences. After the reading task, participants completed a language background
questionnaire, the proficiency test, and a debriefing questionnaire. Most of the partic-
ipants for this study also participated in a separate research session for another
experiment, during which they read a second list of the same type of stimuli (the other
16 from the 32 total of each type), but no participant ever read the same stimulus
sentence more than once in any version. The other experiment examined the role of
task goals in sentence processing, specifically reading for comprehension versus
reading to judge acceptability, and the procedure was otherwise identical to that of
the present study. The ordering of the two sessions was split such that half of the
participants in the present study completed this experiment first and half completed the
other experiment first.

Results
Comprehension accuracy
Response accuracy was high overall, as can be seen in Figure 1. Cronbach's alpha
(a measure of reliability) for the two sets of comprehension questions that appeared
with the two different presentation lists was .69 and .78.

Figure 1. Mean response accuracy for poststimulus comprehension questions (SDs in parenthesis).

Data analysis and descriptive statistics

For the eye-movement data, we used a combination of early and late measures of
sentence processing (as defined by Clifton et al., 2007), all for individual words. First
fixation duration is the amount of time spent the first time a participant looks at a word.
It is an early measure that is sensitive to lexical factors such as word frequency and
polysemy (ibid.). Total dwell time is the sum of the durations of all fixations made on the
word in question and therefore a late measure. It is often sensitive to higher-level factors
related to syntax, semantics, and pragmatics (ibid.). Regressions to refers to whether a
stimulus word was fixated using an eye regression from an area to the right of the word,
whereas regressions from refers to whether this area was the starting point for a
regressive eye movement back to an area to the left of the word. Regressions are most
often viewed as a late measure that reflects reanalysis, but they can reflect earlier
processes in some cases, such as when reanalysis is triggered by word-level factors
(ibid.). For the present study, there was no previous eye-tracking research to inform our
predictions regarding the individual eye-movement measures, but the processing of
mood involves morphosyntax and semantics, so it seemed most likely that the effects of
interest would appear in the later measures, total dwell time and regressions.
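The measures defined above can be made concrete with a small sketch (the data structure is invented for illustration: each fixation is recorded as a (region index, duration in ms) pair in chronological order):

```python
# Hedged sketch of the eye-movement measures described in the text:
# first fixation duration, total dwell time, and regressions to a region.
def first_fixation_duration(fixations, region):
    """Duration of the first fixation on the region, or None if never fixated."""
    for reg, dur in fixations:
        if reg == region:
            return dur
    return None

def total_dwell_time(fixations, region):
    """Sum of the durations of all fixations on the region (a late measure)."""
    return sum(dur for reg, dur in fixations if reg == region)

def regressed_to(fixations, region):
    """True if the region was entered via a regression from a later region."""
    furthest = -1
    for reg, dur in fixations:
        if reg == region and furthest > region:
            return True
        furthest = max(furthest, reg)
    return False

# Hypothetical trial: the reader fixates regions 3 and 4, then regresses to 3.
trial = [(3, 210), (4, 180), (3, 250)]
print(first_fixation_duration(trial, 3))  # 210
print(total_dwell_time(trial, 3))         # 460
print(regressed_to(trial, 3))             # True
```

Note how the two time measures dissociate: the first fixation on region 3 contributes 210 ms to both, but the regression adds a further 250 ms only to total dwell time.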
Eye movements were examined for a total of four key words in the stimuli, as
illustrated in Figure 2. These included first fixation duration, total dwell time, and
regressions to and from the critical verb, which was the embedded clause verb in the
subjunctive (grammatical) or indicative mood (ungrammatical), and the two following
words, referred to as the critical verb + 1 and critical verb + 2. We also examined the
regressions to the subjunctive trigger verb in the main clause.
El ministro espera que los ciudadanos aprueben/*aprueban su propuesta.
"The minister expects that the citizens approve (SUBJ/*IND) his/her proposal."
Regions of interest: trigger verb (espera), critical verb (aprueben/*aprueban),
critical verb + 1 (su), and critical verb + 2 (propuesta).

Figure 2. Stimulus regions of interest.

Descriptive statistics for first fixation duration and total dwell time at the critical
verb and the following two words can be found in Figures 3 and 4, and 5 and 6,
respectively. Descriptive statistics for regressions to the critical verb, the following
word, and trigger verb can be seen in Figures 7 and 8. Lastly, descriptive statistics for
regressions from the critical verb and the following two words can be seen in Figures 9
and 10.

Figure 3. Mean first fixation duration for irregular verb stimuli (SDs in parenthesis).

Figure 4. Mean first fixation duration for regular verb stimuli (SDs in parenthesis).
Prior to statistical analysis, we implemented only minimal trimming of the time-
based data to remove absolute outliers prior to transformation (Baayen & Milin, 2010).
Specifically, fixation values below 100 ms were removed from the first fixation duration
and total dwell time data at the recommendation of an anonymous reviewer and
because fixations of less than 50 ms appear not to yield useful information (Inhoff &
Radach, 1998) and fixations of less than 100 ms are rare (Rayner, 1998) and thus
commonly treated as outliers. This affected 2.6% of the total data for first fixation
duration and 1.5% of the data for total dwell time. The remaining time-based data was
then log-transformed to reduce positive skew.
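The trimming and transformation steps described here can be sketched as follows (the duration values are invented for illustration):

```python
# Hedged sketch of the preprocessing described in the text: fixations below
# 100 ms are dropped as outliers, then durations are log-transformed to
# reduce positive skew before model fitting.
import math

def trim_and_log(durations_ms, floor=100):
    kept = [d for d in durations_ms if d >= floor]
    return [math.log(d) for d in kept]

raw = [85, 210, 460, 95, 180]
transformed = trim_and_log(raw)
print(len(raw) - len(transformed))            # 2 values below 100 ms removed
print([round(x, 2) for x in transformed])     # [5.35, 6.13, 5.19]
```

The log transformation compresses the long right tail typical of fixation durations, bringing the data closer to the distributional assumptions of the linear mixed-effects models.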


Figure 5. Mean total dwell time for irregular verb stimuli (SDs in parenthesis).

Figure 6. Mean total dwell time for regular verb stimuli (SDs in parenthesis).

First fixation duration and total dwell time were analyzed with linear mixed-effects
models in R (R Core Team, 2021) using the lme4 package (Bates et al., 2015), keeping
the maximal random effect structure whenever possible (following Barr, 2013).
P values were obtained using Satterthwaite's approximation for degrees of freedom with
the lmerTest package for R (Kuznetsova et al., 2014). Pairwise comparisons were
conducted using the emmeans package (Lenth et al., 2018), which employs the Tukey
method for multiple comparisons. Alpha was set at .05 for all analyses; p = .05 was
treated as significant, and interactions with p values less than .10 were explored as
potentially significant to minimize the likelihood of Type II error (Larson-Hall, 2010).
Regressive eye movements were analyzed using a Bayesian approach, after an initial
attempt to use a mixed-effects logistic regression analysis resulted in most models not
converging. This is a common problem with regression data, which are binary and



L2 processing of the Spanish subjunctive mood 331

Figure 7. Regressions to trigger verb, critical verb, and subsequent word with irregular verb stimuli (proportion of trials; SDs in parentheses).

Figure 8. Regressions to trigger verb, critical verb, and subsequent word with regular verb stimuli (proportion of trials; SDs in parentheses).

typically contain a high number of zeros, and a Bayesian approach can help with the problem of nonconvergence (Hofmeister & Vasishth, 2014; Husain et al., 2014). Following Kimball et al. (2018), we ran Bayesian models with a maximal random effect structure and no priors using the brms package (Bürkner, 2017, 2018). Pairwise comparisons were conducted by examining the probability distributions of the differences that the omnibus Bayesian models indicated were true differences. If a distribution crossed zero, there was no true probabilistic difference between conditions or groups.




Figure 9. Regressions from the critical verb and subsequent words with irregular verb stimuli (proportion of trials; SDs in parentheses).

Figure 10. Regressions from critical verb and subsequent words with regular verb stimuli (proportion of trials; SDs in parentheses).

Standardized effect sizes were estimated independently from the mixed-effects models, calculated as Cohen’s d with a correction for dependence between means in the comparisons that were within-subjects (Morris & DeShon, 2002; Wiseheart, 2014).
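One common form of this correction, from Morris and DeShon (2002), computes the repeated-measures effect size d_RM (mean difference over the SD of the paired differences) and converts it to an independent-groups metric as d_IG = d_RM × √(2(1 − r)), where r is the correlation between conditions. The Python sketch below (function and data names invented; the paper does not specify its exact computational variant) illustrates that conversion:

```python
import math
from statistics import mean, stdev

def cohens_d_within(cond_a, cond_b):
    """Cohen's d for a within-subjects comparison, corrected for the
    dependence between means (Morris & DeShon, 2002)."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    d_rm = mean(diffs) / stdev(diffs)  # repeated-measures effect size
    # Pearson correlation between the two conditions
    ma, mb = mean(cond_a), mean(cond_b)
    r = (sum((a - ma) * (b - mb) for a, b in zip(cond_a, cond_b))
         / ((len(cond_a) - 1) * stdev(cond_a) * stdev(cond_b)))
    return d_rm * math.sqrt(2 * (1 - r))  # independent-groups metric

# Hypothetical dwell times (ms) per participant in two conditions
d = cohens_d_within([310, 340, 320, 350], [300, 320, 310, 330])
```
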
For all primary analyses, the fixed effects were group (L1, L2), verb regularity
(irregular, regular), and grammaticality (grammatical, ungrammatical), and the ran-
dom effects were subject and item. Verb frequency was included as a covariate in all the
models because the irregular verbs were more frequent than the regular verbs, as
discussed in the “Materials” section, and will be mentioned only when significant.
Experimental session was also included as a covariate in the primary (omnibus) models,
but it was never significant, so it is reported in the output tables below but will not be
discussed further. The reference levels for the fixed effects were L2, irregular, and
grammatical, respectively, although with only two levels of each variable, this




designation did not affect the contrasts examined. The maximal model also included
verb regularity, grammaticality, and their interaction as by-subject slopes and gram-
maticality, group, and their interaction as by-item slopes. When a model did not
converge, this random slope structure was simplified incrementally until the model
did converge. For more details, see the R code for all primary analyses in the online
supplementary materials.
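Schematically, the incremental simplification just described amounts to a backoff over candidate random-effects structures. This is a sketch only: the actual models were fit with lme4 in R, and both the formula strings and the `fit` placeholder below are invented for illustration.

```python
# Candidate random-effects structures, from maximal to intercepts-only,
# written as lme4-style formula fragments.
CANDIDATE_STRUCTURES = [
    "(1 + gram * regularity | subject) + (1 + gram * group | item)",
    "(1 + gram + regularity | subject) + (1 + gram + group | item)",
    "(1 + gram | subject) + (1 + gram | item)",
    "(1 | subject) + (1 | item)",
]

class ConvergenceError(Exception):
    pass

def fit(formula):
    """Placeholder for a real model-fitting call (e.g., lme4::lmer in R)."""
    raise NotImplementedError

def fit_with_backoff(fixed_part, fit_fn=fit):
    """Try the maximal structure first; simplify incrementally until a
    model converges."""
    for random_part in CANDIDATE_STRUCTURES:
        try:
            return fit_fn(fixed_part + " + " + random_part)
        except ConvergenceError:
            continue  # drop a slope term and retry
    raise ConvergenceError("no random-effects structure converged")
```
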

First fixation duration


The output of the statistical analyses of the first fixation duration on the critical verb
and two postcritical words can be seen in Table 2. At the critical verb, there was a main
effect of grammaticality (effect size of d = 0.12), no effect of group (d = 0.14), no effect of
verb regularity (d = 0.07), and a significant interaction of grammaticality with group
and of grammaticality with verb regularity. The covariate of frequency showed a
predictable effect, with higher frequency verbs showing shorter fixations. Additional
analyses were conducted separately by group to explore the interactions. The L1 group

Table 2. First fixation duration for critical verb and two subsequent words: Output from linear mixed-effects models

                                             Estimate     SE        t      p

Critical verb
  Intercept                                     5.74     0.08    64.44   0.00
  Grammaticality                                0.11     0.03     2.96   0.00
  Group                                         0.00     0.05     0.13   0.89
  Verb Regularity                               0.03     0.03     0.90   0.36
  Frequency                                    –0.03     0.01    –2.93   0.00
  Session                                      –0.08     0.04    –1.99   0.06
  Grammaticality × Group                       –0.13     0.05    –2.44   0.01
  Grammaticality × Verb Regularity             –0.10     0.05    –1.91   0.05
  Group × Verb Regularity                      –0.03     0.05    –0.62   0.53
  Grammaticality × Group × Verb Regularity      0.12     0.07     1.62   0.10
Critical verb + 1
  Intercept                                     5.71     0.11    51.54   0.00
  Grammaticality                               –0.07     0.05    –1.18   0.24
  Group                                        –0.20     0.08    –2.51   0.01
  Verb Regularity                              –0.14     0.05    –2.59   0.01
  Frequency                                    –0.01     0.01    –0.59   0.55
  Session                                      –0.03     0.04    –0.66   0.51
  Grammaticality × Group                        0.02     0.08     0.32   0.74
  Grammaticality × Verb Regularity              0.07     0.07     0.99   0.32
  Group × Verb Regularity                       0.15     0.07     1.94   0.05
  Grammaticality × Group × Verb Regularity     –0.04     0.11    –0.40   0.68
Critical verb + 2
  Intercept                                     5.48     0.12    45.30   0.00
  Grammaticality                               –0.05     0.05    –1.05   0.29
  Group                                        –0.05     0.06    –0.81   0.41
  Verb Regularity                               0.00     0.04     0.08   0.93
  Frequency                                     0.00     0.01     0.03   0.97
  Session                                      –0.01     0.05    –0.33   0.73
  Grammaticality × Group                        0.04     0.06     0.63   0.52
  Grammaticality × Verb Regularity              0.02     0.06     0.31   0.75
  Group × Verb Regularity                      –0.00     0.06    –0.06   0.94
  Grammaticality × Group × Verb Regularity      0.01     0.09     0.12   0.89




showed no effect of grammaticality, estimate = 0.00, SE = 0.02, t = 1.70, p = 0.08, d = 0.04, or verb regularity, estimate = 0.01, SE = 0.02, t = 0.64, p = 0.51, d = 0.13, no interaction, estimate = –0.03, SE = 0.03, t = –1.02, p = 0.30, and the covariate of frequency showed an effect in the expected direction, estimate = –0.03, SE = 0.01, t = –2.94, p = 0.00. The L2 group showed an effect of grammaticality, estimate = 0.11, SE = 0.04, t = 2.84, p = 0.00, d = 0.19, no effect of verb regularity, estimate = 0.03, SE = 0.04, t = 0.84, p = 0.39, d = 0.03, a borderline significant interaction, estimate = –0.10, SE = 0.05, t = –1.84, p = 0.06, and the covariate of frequency showed an effect in the expected direction, estimate = –0.03, SE = 0.01, t = –2.08, p = 0.03. Pairwise comparisons conducted to explore the potential interaction confirmed a significant effect of grammaticality with the irregular verb stimuli, estimate = 0.11, SE = 0.04, t = 2.50, p = 0.02, d = 0.37, that was not present with regular verb stimuli, estimate = 0.01, SE = 0.04, t = 0.32, p = 0.74, d = 0.02.
At the critical verb + 1, there were main effects of verb regularity (d = 0.07) and group (d = 0.28), as well as a significant interaction of verb regularity with group, but no effect of grammaticality (d = 0.07). Pairwise comparisons revealed a significant main effect of verb regularity for the L2 group, estimate = –0.12, SE = 0.05, t = –2.14, p = 0.01, d = 0.20, that was not present with the L1 group, estimate = 0.00, SE = 0.05, t = –0.15, p = 0.88, d = 0.05. The L2 group showed generally longer first fixation durations for irregular
verbs than regular ones, but this effect is difficult to interpret because the irregular and regular verb stimuli were different items, designed to be compared not directly with each other but in terms of an interaction with grammaticality (for which the carefully controlled stimulus conditions were identical except for the grammaticality of the verb).
Finally, at the critical verb + 2, which was also the last word of the sentence, there
were no significant effects (group: d = 0.07; grammaticality: d = 0.02; verb regularity:
d = 0.08) or interactions. Neither group showed any lingering effect of mood at this
point in the stimulus.

Total dwell time


The output of the statistical analyses of the total dwell time on the critical and two
postcritical words can be seen in Table 3. At the critical verb, there was a main effect of
grammaticality (d = 0.39), a main effect of verb regularity (d = 0.19), no effect of group
(d = 0.04), a significant interaction of grammaticality with verb regularity, a significant
interaction of verb regularity with group, and a significant three-way interaction. The
covariate of frequency showed the expected effect, with higher frequency verbs showing
shorter fixations. Additional analyses were conducted separately by group to explore
the interactions. The L1 group showed an effect of grammaticality, estimate = 0.23, SE
= 0.10, t = 2.30, p = 0.03, d = 0.64, but no effect of verb regularity, estimate = –0.03, SE =
0.05, t = –0.59, p = 0.55, d = 0.06, and no interaction, estimate = 0.03, SE = 0.11,
t = 0.34, p = 0.73. The covariate of frequency showed an effect in the expected direction,
estimate = –0.10, SE = 0.02, t = –3.84, p = 0.00. The L2 group showed a main effect of
grammaticality, estimate = 0.24, SE = 0.06, t = 3.92, p = 0.00, d = 0.23, no effect of verb
regularity, estimate = 0.14, SE = 0.07, t = 1.89, p = 0.06, d = 0.35, and a significant
interaction of grammaticality with verb regularity, estimate = –0.24, SE = 0.08,
t = –2.79, p = 0.00. The covariate of frequency showed an effect in the expected
direction, estimate = –0.16, SE = 0.04, t = –3.97, p = 0.00. Pairwise comparisons
revealed that the L2 group was sensitive to grammatical mood with irregular verb



Table 3. Total dwell time for critical verb and two subsequent words: Output from linear mixed-effects models

                                             Estimate     SE        t      p

Critical verb
  Intercept                                     6.63     0.20    32.28   0.00
  Grammaticality                                0.24     0.07     3.49   0.00
  Group                                         0.04     0.10     0.39   0.69
  Verb Regularity                               0.16     0.06     2.40   0.01
  Frequency                                    –0.13     0.02    –4.74   0.00
  Session                                      –0.01     0.10    –0.15   0.88
  Grammaticality × Group                       –0.01     0.09    –0.10   0.91
  Grammaticality × Verb Regularity             –0.24     0.08    –2.91   0.00
  Group × Verb Regularity                      –0.21     0.08    –2.61   0.00
  Grammaticality × Group × Verb Regularity      0.27     0.11     2.38   0.01
Critical verb + 1
  Intercept                                     5.98     0.21    27.45   0.00
  Grammaticality                                0.17     0.07     2.29   0.02
  Group                                        –0.09     0.09    –1.01   0.31
  Verb Regularity                               0.04     0.08     0.54   0.58
  Frequency                                    –0.03     0.03    –0.84   0.40
  Session                                       0.04     0.09     0.47   0.63
  Grammaticality × Group                       –0.01     0.10    –0.10   0.91
  Grammaticality × Verb Regularity             –0.18     0.10    –1.79   0.07
  Group × Verb Regularity                      –0.05     0.10    –0.58   0.56
  Grammaticality × Group × Verb Regularity      0.19     0.14     1.36   0.17
Critical verb + 2
  Intercept                                     6.04     0.26    22.54   0.00
  Grammaticality                                0.16     0.09     1.67   0.10
  Group                                        –0.14     0.14    –1.02   0.31
  Verb Regularity                               0.07     0.09     0.81   0.41
  Frequency                                     0.00     0.03     0.17   0.85
  Session                                      –0.06     0.12    –0.48   0.62
  Grammaticality × Group                       –0.08     0.13    –0.59   0.55
  Grammaticality × Verb Regularity             –0.11     0.10    –1.03   0.30
  Group × Verb Regularity                      –0.03     0.10    –0.32   0.74
  Grammaticality × Group × Verb Regularity      0.04     0.15     0.32   0.74

stimuli, estimate = 0.24, SE = 0.06, t = 3.96, p = 0.00, d = 0.55, but not with regular verb
stimuli, estimate = 0.00, SE = 0.07, t = 0.02, p = 0.97, d = 0.01, and the covariate of
frequency was predictably significant in both models.
At the critical verb + 1, there was a main effect of grammaticality (d = 0.23), no effect of group (d = 0.15) or verb regularity (d = 0.01), and the interaction of grammaticality with verb regularity approached significance. Follow-up analyses of each verb type were run separately to explore the potential interaction. These revealed a grammaticality effect with the irregular verb stimuli, estimate = 0.17, SE = 0.07, t = 2.48, p = 0.01, d = 0.34, that was not present with the regular verb stimuli, estimate = –0.00, SE = 0.07, t = –0.00, p = 0.07, d = 0.12. Thus, online sensitivity to mood carried over to the postcritical word regardless of group, but only with the irregular verb stimuli.

Finally, at the critical verb + 2 region, there were no significant effects (group: d = 0.30; grammaticality: d = 0.22; verb regularity: d = 0.02) or interactions. Neither group showed any lingering sensitivity to mood with either type of verb.



Regressions to
As previously stated, a Bayesian approach was adopted for the analysis of the regression data. In frequentist approaches that rely on median and average values, confidence intervals are based only on the observed data. In a Bayesian approach, however, a credible interval (CrI) incorporates prior probability distributions to signal an interval within which an unobserved parameter value falls. True differences, which can be considered equivalent to statistically significant differences in frequentist approaches, are signaled by the upper and lower 95% credible interval values not crossing zero.
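This decision rule can be made concrete with a short sketch. The study's Bayesian models were fit with brms in R; the Python below, with invented names, only illustrates the zero-crossing check on a set of posterior draws:

```python
def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws, using simple
    sorted-sample quantiles (no interpolation)."""
    s = sorted(draws)
    tail = (1 - level) / 2
    lo = s[int(tail * len(s))]
    hi = s[min(int((1 - tail) * len(s)), len(s) - 1)]
    return lo, hi

def true_difference(draws, level=0.95):
    """A 'true difference' in the sense used here: the CrI excludes zero."""
    lo, hi = credible_interval(draws, level)
    return lo > 0 or hi < 0
```
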
The output of the statistical analyses of the regressions to the trigger verb, critical verb, and subsequent word can be seen in Table 4. Regressions to the critical verb + 2 word were not possible because it was the last word of the sentence. At the trigger verb, there were no true differences of group, grammaticality, verb regularity, frequency, or session (group: d = 0.17; grammaticality: d = 0.03; verb regularity: d = 0.03), nor were there any interactions.

Table 4. Regressions to the trigger verb, critical verb, and critical verb + 1: Output from Bayesian logistic mixed-effects models

                                             Estimate     SE   l-95% CrI   u-95% CrI   Rhat

Trigger verb
  Intercept                                     0.38     0.71     –1.01       1.77     1.00
  Grammaticality                               –0.10     0.27     –0.63       0.44     1.00
  Group                                        –0.42     0.35     –1.13       0.26     1.00
  Verb Regularity                              –0.28     0.30     –0.86       0.31     1.00
  Frequency                                    –0.17     0.12     –0.41       0.07     1.00
  Session                                      –0.17     0.30     –0.76       0.43     1.00
  Grammaticality × Group                        0.21     0.39     –0.55       0.96     1.00
  Grammaticality × Verb Regularity              0.34     0.36     –0.36       1.05     1.00
  Group × Verb Regularity                       0.35     0.38     –0.39       1.09     1.00
  Grammaticality × Group × Verb Regularity     –0.47     0.52     –1.49       0.55     1.00
Critical verb
  Intercept                                    –0.63     0.57     –1.75       0.50     1.00
  Grammaticality                                0.86     0.33      0.22       1.53     1.00
  Group                                         0.75     0.32      0.11       1.37     1.00
  Verb Regularity                               0.68     0.27      0.16       1.22     1.00
  Frequency                                    –0.07     0.10     –0.26       0.11     1.00
  Session                                      –0.11     0.25     –0.60       0.39     1.00
  Grammaticality × Group                        0.36     0.48     –0.54       1.30     1.00
  Grammaticality × Verb Regularity             –1.16     0.40     –1.95      –0.38     1.00
  Group × Verb Regularity                      –0.98     0.37     –1.69      –0.26     1.00
  Grammaticality × Group × Verb Regularity      0.92     0.55     –0.17       2.00     1.00
Critical verb + 1
  Intercept                                    –1.46     0.83     –3.10      –0.15     1.00
  Grammaticality                                0.54     0.29     –3.10       0.15     1.00
  Group                                        –0.18     0.43     –1.05       0.69     1.00
  Verb Regularity                               0.19     0.32     –0.44       0.84     1.00
  Frequency                                     0.05     0.14     –0.24       0.34     1.00
  Session                                      –0.11     0.36     –0.82       0.58     1.00
  Grammaticality × Group                       –0.38     0.44     –1.23       0.46     1.00
  Grammaticality × Verb Regularity             –0.57     0.42     –1.39       0.25     1.00
  Group × Verb Regularity                      –0.42     0.46     –1.33       0.47     1.00
  Grammaticality × Group × Verb Regularity      0.73     0.62     –0.46       1.96     1.00

Note: True differences are signaled by the lower and upper credible intervals (CrIs) not crossing zero, i.e., both values being positive or negative.




At the critical verb, there were main effects of grammaticality (d = 0.25), group (d =
0.22), and verb regularity (d = 0.06), and interactions of verb regularity with group and
with grammaticality. Additional analyses were conducted separately for each group to
explore the interactions. The L1 group showed an effect of grammaticality, estimate =
1.23, SE = 0.37, 95% CrI [0.52, 1.96], d = 0.34, but no effect of verb regularity, estimate =
–0.29, SE = 0.26, 95% CrI [–0.81, 0.22], d = 0.15, and no interaction, estimate = –0.23,
SE = 0.39, 95% CrI [–1.01, 0.54]. The L2 group also showed a main effect of
grammaticality, estimate = 0.90, SE = 0.32, 95% CrI [0.26, 1.55], d = 0.13, plus they
also showed an effect of verb regularity, estimate = 0.70, SE = 0.28, 95% CrI [0.15, 1.25],
d = 0.06, and an interaction, estimate = –1.15, SE = 0.43, 95% CrI [–1.99, –0.33]. A
follow-up analysis run with verb regularity separately with the L2 data revealed an effect
of grammaticality with the irregular verb stimuli, estimate = 0.87, SE = 0.36, 95% CrI
[0.16, 1.53], d = 0.40, but not with the regular verb stimuli, estimate = –0.27, SE = 0.36,
95% CrI [–1.00, 0.44], d = 0.06.
At the critical verb + 1, there were no true differences of group, grammaticality, or
verb regularity (group: d = 0.14; grammaticality: d = 0.12; verb regularity: d = 0.10), nor
were there any interactions.

Regressions from
The output of the statistical analyses of the regressions from the critical verb and the two
following words can be seen in Table 5. At the critical verb, there were no main effects
(group: d = 0.03; grammaticality: d = 0.12; verb regularity: d = 0.05) or interactions.
At the critical verb + 1, there were effects of grammaticality (d = 0.24) and group (d = 0.31), but no effect of verb regularity (d = 0.03). Moreover, there was an interaction of verb regularity with grammaticality, so additional analyses were conducted for regular and irregular verb stimuli separately. The irregular verb stimuli showed an effect of grammaticality, estimate = 0.97, SE = 0.27, 95% CrI [0.46, 1.50], d = 0.27, but the regular verb stimuli did not, estimate = 0.03, SE = 0.29, 95% CrI [–0.52, 0.59], d = 0.10.
At the critical verb + 2, there were no main effects of group, grammaticality, or verb
regularity (group: d = 0.01; grammaticality: d = 0.07; verb regularity: d = 0.08), nor were
there any interactions.

Results summary
The main results of this experiment were as follows:

• Only the L2 group showed online sensitivity to mood in the early measure of first
fixation duration. This was at the critical verb and occurred only with irregular verbs.
• Both groups showed sensitivity to mood in the later measure of total dwell time with
the irregular verb stimuli. This was at both the critical verb and the following word. In
addition, only the L1 group showed sensitivity to mood with the regular verb stimuli,
and this was only on the critical verb, with no spillover.
• Similarly, both groups showed sensitivity to mood in the regressions to the critical
verb with the irregular verb stimuli and only the L1 group showed the effect with the
regular verb stimuli.
• Both groups showed sensitivity to mood in the regressions from the postcritical word
with the irregular verb stimuli, and neither group showed the effect with the regular
verb stimuli.



Table 5. Regressions from critical verb and two subsequent words: Output from Bayesian logistic mixed-effects models

                                             Estimate     SE   l-95% CrI   u-95% CrI   Rhat

Critical verb
  Intercept                                    –1.02     0.58     –2.15       0.14     1.00
  Grammaticality                                0.18     0.29     –0.40       0.76     1.00
  Group                                         0.20     0.34     –0.46       0.87     1.00
  Verb Regularity                               0.07     0.29     –0.52       0.66     1.00
  Frequency                                    –0.08     0.10     –0.27       0.11     1.00
  Session                                       0.14     0.25     –0.35       0.63     1.00
  Grammaticality × Group                        0.16     0.40     –0.63       0.95     1.00
  Grammaticality × Verb Regularity             –0.19     0.39     –0.98       0.58     1.00
  Group × Verb Regularity                      –0.34     0.39     –1.12       0.43     1.00
  Grammaticality × Group × Verb Regularity      0.26     0.54     –0.82       1.35     1.00
Critical verb + 1
  Intercept                                    –0.50     0.68     –1.84       0.85     1.00
  Grammaticality                                0.97     0.26      0.46       1.49     1.00
  Group                                         0.83     0.33      0.20       1.49     1.00
  Verb Regularity                               0.44     0.30     –0.16       1.04     1.00
  Frequency                                     0.00     0.12     –0.23       0.24     1.00
  Session                                      –0.28     0.29     –0.85       0.27     1.00
  Grammaticality × Group                       –0.49     0.37     –1.25       0.20     1.00
  Grammaticality × Verb Regularity             –0.94     0.37     –1.68      –0.24     1.00
  Group × Verb Regularity                      –0.27     0.37     –0.98       0.46     1.00
  Grammaticality × Group × Verb Regularity      0.85     0.52     –0.15       1.89     1.00
Critical verb + 2
  Intercept                                     1.46     1.31     –1.10       4.03     1.00
  Grammaticality                                0.26     0.32     –0.38       0.88     1.00
  Group                                        –0.06     0.60     –1.21       1.16     1.00
  Verb Regularity                               0.26     0.42     –0.58       1.08     1.00
  Frequency                                     0.23     0.21     –0.18       0.65     1.00
  Session                                      –0.69     0.59     –1.84       0.42     1.00
  Grammaticality × Group                        0.34     0.44     –0.52       1.21     1.00
  Grammaticality × Verb Regularity             –0.11     0.46     –1.00       0.80     1.00
  Group × Verb Regularity                      –0.26     0.44     –1.13       0.59     1.00
  Grammaticality × Group × Verb Regularity     –0.40     0.63     –1.64       0.84     1.00

Note: True differences are signaled by the lower and upper credible intervals (CrIs) not crossing zero, i.e., both values being positive or negative.

• The covariate of verb frequency showed a predictable effect with the time-based
measures of first fixation duration and total dwell time, in which fixation times on the
critical verb were shorter when the verb was of higher frequency. This suggests that
frequency did account for some of the variance in those models and including it as a
covariate potentially helped to clarify some of the results. Frequency did not appear
to make a difference in the models for the regression data.

Discussion
The first research question for the present study asked if advanced L2 speakers of
Spanish were sensitive to grammatical mood during online sentence comprehension.
Our findings suggest that, in the most basic sense, the answer to this question is
affirmative. The L2 participant group in this study showed the effect of interest in five
analyses of four different eye-movement measures: first fixations were longer on the
critical verb, total dwell times were longer on the critical verb and the following word,




there were more trials with regressions to the critical verb, and there were more trials
with regressions from the postcritical word, all with ungrammatical stimuli versus
grammatical stimuli. Of course, the L2 group only showed these effects with the
irregular verb stimuli, a point that will be discussed further in the following text, in
the discussion of the second research question. Nevertheless, the data from this study
show that it is possible for L2 learners to integrate verbal mood morphology during
online sentence comprehension. To our knowledge, this is the first empirical study to
provide such evidence.
The first research question also proposed a comparison of L2 and L1 participant
groups with regard to online sensitivity to Spanish mood. Both groups showed the
expected effect across multiple eye-movement measures and across two stimulus
words, in the case of total dwell time, so they were similar overall. However, one point
of difference was that only the L2 group showed early sensitivity to mood in the first
fixation duration measure. Early sensitivity to mood morphology was not predicted
even with the L1 group, given how many different linguistic factors are involved in the
processing of mood in Spanish, so the fact that it was observed here among L2 learners
suggests that nonnative processing of mood morphology was very efficient. Hence, we
interpret the results of this study as evidence of nativelike sensitivity to mood in L2
sentence processing. At the same time, there was evidence of L1/L2 differences in
processing at the word level, to be discussed next.
The second research question for the present study asked if the regularity of verbs
with mood morphology affected online sensitivity to the form among advanced L2
users. Our findings suggest that verb regularity was indeed very important in this
experiment. The L2 participant group in this study showed robust online sensitivity to
verbal mood morphology, as discussed previously under the first research question, but
only with irregular verb stimuli. There was no evidence of online sensitivity to mood
morphology with regular verbs.
The second research question also proposed a comparison of L2 and L1 participant
groups with regard to the role of verb regularity in the online processing of verbal mood
morphology. Here there was a notable difference between the two groups in terms of the
degree to which verb regularity affected their processing of mood. With the L2 group,
the effect of verb regularity was absolute, meaning there was no evidence of online
sensitivity to mood with regular verb stimuli and robust evidence of online sensitivity to
mood with irregular verbs. The L1 group, however, showed robust online sensitivity to
mood with both verb types. Nevertheless, this sensitivity was slightly more robust with
the irregular verb stimuli than with the regular verb stimuli. In total dwell times, the
effect was present across two words with the irregular verb stimuli, but only on the
critical word with the regular verb stimuli. And with the regressions from the post-
critical word, the effect was only observed with the irregular verb stimuli. Thus, it
appears that verb regularity can affect the L1 processing of mood as well, albeit in a
more subtle manner than with L2 processing. To our knowledge, this is the first study to
observe an apparent role of verb regularity in the native processing of Spanish mood.
The outcome of the present study differs from that of Cameron (2011, 2017), who
conducted the only previous study of the L2 acquisition of the Spanish subjunctive
using a real-time method (self-paced reading) and found no evidence of online
sensitivity to verbal mood among L2 users of Spanish. One very plausible explanation
for the difference is that the present study included stimuli with both regular and
irregular verbs, whereas the Cameron study used only regular verbs. The results of the
two studies could be seen as consistent in this regard, as both found a lack of online
sensitivity to mood with regular verb stimuli among L2 users, even though L1 users




showed the predicted effect. Although this one difference could explain the apparently
different findings of the two studies in and of itself, there are at least two other notable
methodological differences that might have also been important. First, the “trigger”
verbs used in the present study (e.g., querer “to want,” esperar “to hope or expect”) are
more frequently associated with the subjunctive mood than the expressions of
certainty used in the Cameron study (e.g., Es posible que “It is possible that,” Es
probable que “It is probable that”; subjunctive frequency data from Davies, 2006,
p. 142). The native speakers showed the expected online effects in both studies,
however, so the linguistic context does not appear to have been a limiting factor in
any absolute sense. Second, the participants in the present study were likely of higher
proficiency than even the most advanced group in the Cameron study (mean score
45/50 vs. 36/50 on proficiency tests that were both based on the DELE). Further
research is needed to investigate the role of the type of subjunctive mood trigger in the
stimulus sentences and to explore the role of L2 proficiency, while keeping in mind the
important role of verb form regularity.
Turning now to the theoretical implications of the outcome of the present study, we
found no evidence of deficiencies in the nonnative processing of sentence-level syntax
and morphosyntax of the type proposed by the Shallow Structure Hypothesis (Clahsen & Felser, 2006, 2018). Although the theory has not dealt with grammatical mood
specifically, verb morphology and the abstract grammatical details it conveys are of
key interest and mood thus seems to fit. As outlined in the background of this paper, the
processing of grammatical mood involves verb morphology, sentence-level semantics,
and a nonlocal dependency (with three words of separation) between the lexical-
semantics of the verb in the matrix clause and mood morphology on the verb in an
embedded clause. Despite this linguistic complexity and the need to integrate multiple
sources of abstract grammatical information across clauses, the advanced L2 users in
the present study showed nativelike online sensitivity to grammatical mood with
irregular verbs. Thus, L2 sentence processing was not intrinsically or inescapably
shallow, as this would entail a lack of sensitivity in all contexts, regardless of verb
regularity.
Nevertheless, L2 processing did appear to be limited to some degree, as there was no
evidence of online sensitivity to grammatical mood with regular verbs. This stood in
contrast with the results from the L1 group, which showed the expected effects,
although these were not quite as robust as with irregular verbs (present in four analyses
with irregular verbs vs. two measures with regular verbs). The clear pattern of nativelike
processing of mood with irregular verbs and no evidence of online processing of mood
with regular verbs that was observed among the L2 group is consistent with a body of
previous research on the adult L2 acquisition of Spanish mood that has also identified
verb form regularity as an important factor (Collentine, 1997; Gallego & Pozzi, 2018;
Gudmestad, 2006, 2012a; Lubbers-Quesada, 1998). The prior work was conducted
using a variety of offline methods as opposed to a real-time measure like eye tracking
and the differences between regular and irregular verbs were often more graded than in
the present study, but the results were generally similar in that more nativelike behavior
was seen with irregular verbs than with regular ones. The outcome of the present study
was also broadly consistent with the results of one prior study that had also examined
the role of verb regularity in nonnative sentence processing: Pliatsikas and Marinis
(2013) found that native speakers and high-proficiency L2 learners were slower to
process the English past tense with regular verbs than with irregular verbs. However,
the L2 group still showed some degree of online sensitivity to the past tense with regular
verbs, unlike in the present study.




In considering possible explanations for the role of verb regularity, three word-level
factors that might be of importance and that are not mutually exclusive are word
frequency, perceptual salience, and morphological regularity, each of which will be
considered here in turn. The first factor is word frequency: As is typical, the irregular
verbs used in the present study were more frequent than the regular ones and this could
potentially lead to differences in processing speed at the word level that might translate
into difficulty in processing morphosyntax at the sentence level (Hopp, 2016; Jegerski &
Fernández Cuenca, 2019). For this reason, frequency of the critical verbs was included
as a covariate in our analyses. It is therefore not likely that verb frequency played a role
in the different outcomes for regular and irregular verb stimuli in the present study, so
form salience and form regularity (both to be discussed in the following text) were
probably more important. Speaking more broadly, however, it is still possible that the
higher frequency of irregular verbs compared to regular ones is a factor in the
acquisition of Spanish mood (Pinker, 1999), it is just that it does not appear to have
been a factor in this particular experiment, probably because both sets of verbs were of
high frequency (cf. Hopp, 2016; Jegerski & Fernández Cuenca, 2019; Jiang & Botana,
2009) and not very different from each other in relative terms, even though the
difference was statistically significant.
A second factor is perceptual salience. As has been pointed out by a number of
other researchers working on the acquisition of the Spanish subjunctive (Collentine,
1997; Gallego & Pozzi, 2018; Gudmestad, 2006, 2012a; Lubbers-Quesada, 1998),
Spanish mood is more visually and acoustically salient with irregular verbs than with
regular ones. The difference between indicative and subjunctive forms is greater in terms of the number of written letters that vary (e.g., tiene/tenga “s/he has-IND/-SUBJ” vs. habla/hable “s/he speaks-IND/-SUBJ”) because both regular and irregular verbs have a thematic vowel shift to indicate mood (from e/i to a or vice versa), but irregular verbs also have a stem change, so it could be that irregular forms are simply more likely to be
seen and processed. Regarding acoustic salience, the single vowel that most often
distinguishes regular indicative and subjunctive verbs is the /a/ – /e/ contrast, which
occurs in an unstressed, final syllable in spoken Spanish. In English, most vowels
would be neutralized to schwas in that phonetic environment, so English-dominant
bilinguals may tend toward neutralization of such vowels in Spanish as well (e.g.,
Colantoni et al., 2020). As verb regularity and salience are inevitably linked in Spanish
due to irregularity affecting verb stems, salience potentially has a role in any observed
effect of verb regularity, including in the results of the present study. When it comes to real-time processing by nonnative readers, as with the experimental task in the present study, form salience may be of particular importance (Hopp & León Arriaga, 2016; Jegerski, 2015). Salience might also explain the results of the L1 group in the present study, who showed more robust online sensitivity to mood with irregular verbs than with regular ones. Hence, we think that salience is one important reason why verb regularity affected the processing of mood in our study, to the extent that there was no evidence of online sensitivity to mood with regular verbs among the L2 users.
A third potential explanation for the observed difference in the online L2 processing
of mood with regular versus irregular verbs is form regularity. Dual mechanism theories
of morphological processing propose that regular inflected forms are accessed using a
morphological rule that assembles the component morphemes, while irregular forms
are accessed as whole word forms with separate entries in the lexicon (Pinker, 1999).
Later versions of the Shallow Structure Hypothesis (Clahsen & Felser, 2018, p. 2) have
proposed a secondary claim that L2 learners fail to use morphological rules to

https://doi.org/10.1017/S027226312200016X Published online by Cambridge University Press


342 Sara Fernández Cuenca and Jill Jegerski

decompose complex words the way that L1 users do, in addition to the primary claim of
the theory regarding sentence processing. Under such an account, both irregular and regular forms would be stored equally as whole-word entries in the lexicon, which does not explain why only irregular verbs were associated with online sensitivity to mood in the present study. The more moderate version of this claim seems to fit better with our results, as it frames the L1/L2 difference as a matter of degree rather than an absolute difference: L2 learners may apply the same rules as L1 users, but they do so more slowly (Kirkici & Clahsen, 2013). Under this account, the L2 users in the present study might have initiated rule-based processing of Spanish mood with regular verbs, but they were not as efficient as the L1 users in doing so. This subtle difficulty may have combined with form salience to produce more pronounced effects on eye movements. However, it is
important to note that the L2 group in this study showed no evidence of online
sensitivity to mood with regular verbs at all, so this is entirely speculative and the data
do not exactly fit with the theoretical claim in question. Even if this is interpreted as one
point of potential compatibility, the Shallow Structure Hypothesis as a whole cannot
account for our results, as its “core claim” (Clahsen & Felser, 2018, p. 1) pertains to
difficulty at the level of phrases and sentences rather than within words. Its authors have
also specified that morphology in individual words should present less difficulty than
syntax and morphosyntax in sentences (Clahsen & Felser, 2006, p. 35), which is the
opposite of the pattern observed in the present study.
On the whole, the outcome of the present investigation suggests that the primary
processing difficulty for the L2 participants in this study originated at the word level,
not at the sentence level. As long as an irregular verb form facilitated the processing of
mood at the word level (because of greater form salience and possibly because of more
efficient processing of irregular forms), subsequent sentence level processing of mood
proceeded in a nativelike fashion. Our results therefore seem to go against a theoret-
ical account of L2 processing that emphasizes difficulty with sentence-level morpho-
syntax (i.e., Clahsen & Felser, 2018), as this would predict difficulty processing verbal
mood in all contexts, regardless of verb regularity. Rather, the results of the present
study are more consistent with the Lexical Bottleneck Hypothesis (Hopp, 2014, 2018),
which proposes that difficulty at the word level can often be the critical limitation and
that apparent difficulty with sentence processing is an indirect effect of word level
difficulty.
An unexpected result of this experiment was that the L2 group showed early
sensitivity to mood in the first fixation duration measure, but the L1 group did not.
Mood effects were predicted to show up primarily in later measures like total dwell time
and regressions because the effects arise from a semantic and morphosyntactic depen-
dency across clause boundaries. It is also generally expected for L1 processing to be
more efficient than L2 processing, although there is evidence that L2 learners some-
times read faster than their L1 counterparts (e.g., Felser et al., 2003; Kaan et al., 2015).
One possible reason for the difference between the groups might be that the L2 group
was more attuned to the subjunctive mood while reading in Spanish because the form
receives so much attention in world language classrooms. Alternatively, an anonymous
reviewer suggested that the L1 group might have experienced L1 attrition due to
immersion in their L2, English, at the time of testing. It is true that other research
has shown that L1 attrition can occur in similar L2 immersion contexts with at least
some aspects of sentence processing (Chamorro et al., 2016; Dussias & Sagarra, 2007),
but we think this explanation is less plausible in the case of the present study: the L2 group was also immersed in English at the time of testing, so any effect of immersion should have affected the two groups similarly.


Finally, one important limitation of this study that potentially affects the general-
izability of the findings is that the stimuli represented only a narrow range of the many
different uses of Spanish mood. Only 13 different irregular verbs were included because
of the need to control word length across indicative and subjunctive forms, and 6 of the
13 verbs were of the –tener type (e.g., tener “to have,” mantener “to maintain”), which all
follow the same pattern of inflection for mood, despite being considered irregular. In
addition, the trigger verbs appearing in the main clause of the stimuli were all strong
triggers, showing only minimal variability, but subjunctive use is known to vary in
many other contexts. Further research is therefore needed to determine if the present
findings might be replicated widely or may be limited to a relatively small set of verbs
and contexts. A second limitation is that irregular verbs, the type associated with L1-like processing in this study, are of limited number in the Spanish lexicon (although they do tend to be of much higher frequency than regular verbs). This means that nativelike L2 processing of the subjunctive mood, although possible in principle, would occur with only a minority of verbs in the real world. We therefore suggest that one practical implication of the present study, which contributes to a growing body of evidence that verb form regularity can play a role in the acquisition of verb morphosyntax, is that future research should examine the potential application of the observed role of morphological regularity in the context of language instruction. For example, in an input-based teaching method
like Processing Instruction (VanPatten & Cadierno, 1993), irregular verb forms might
increase the likelihood that learners will process mood in the input, thereby improving
the potential for acquisition.
In conclusion, the main finding of this empirical study was that advanced profi-
ciency L2 readers showed nativelike processing of Spanish mood with irregular verb
stimuli and no evidence of online sensitivity to mood with regular verb stimuli. A
comparison group of L1 readers showed the expected effect with both types of stimuli,
but the effect was slightly more robust with irregular verbs than with regular verbs.
Hence, it appears that form regularity played an important role, particularly in L2
processing. We have argued that this was due to the greater visual salience of Spanish
subjunctive forms with irregular verbs versus regular verbs and that difficulty proces-
sing rule-based regular verbal morphology may have played a role as well. In any case,
the results suggest that L2 processing difficulty originated with word-level factors,
consistent with the Lexical Bottleneck Hypothesis (Hopp, 2014, 2018), and that
sentence-level processing has the potential to be nativelike, which appears to go against
claims of generalized difficulty in syntactic and morphosyntactic processing (Clahsen &
Felser, 2006, 2018).
Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/
10.1017/S027226312200016X.

References
Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3,
12–28.
Barr, D. J. (2013). Random effects structure for testing interactions in linear mixed-effects models. Frontiers
in Psychology, 4, 328.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67, 1–48.
Blake, R. (1983). Mood selection among Spanish-speaking children, ages 4 to 12. The Bilingual Review, 10,
21–32.

Borgonovo, C., Bruhn de Garavito, J., & Prévost, P. (2005). Acquisition of mood distinctions in L2 Spanish. In A. Brugos, M. R. Clark-Cotton, & S. Ha (Eds.), Proceedings of the 29th annual Boston University Conference on Language Development (pp. 97–108). Cascadilla Press.
Borgonovo, C., & Prévost, P. (2003). Knowledge of polarity subjunctive in L2 Spanish. In B. Beachley, A. Brown, & F. Conlin (Eds.), Proceedings of the 27th Boston University Conference on Language Development (Vol. 1, pp. 150–161). Cascadilla Press.
Bosque, I. (2012). Mood: Indicative vs. subjunctive. In J. I. Hualde, A. Olarrea, & E. O'Rourke (Eds.), The handbook of Hispanic linguistics (pp. 373–394). Wiley-Blackwell.
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated
review. Current Directions in Psychological Science, 27, 45–50.
Bürkner, P. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80, 1–28.
Bürkner, P. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10,
395–411.
Cameron, R. (2011). Native and nonnative processing of modality and mood in Spanish [Unpublished
doctoral dissertation]. Florida State University, Tallahassee.
Cameron, R. (2017). Lexical preference and online processing of Spanish subjunctive. Linguistics Unlimited,
2, 1–24.
Chamorro, G., Sorace, A., & Sturt, P. (2016). What is the source of L1 attrition? The effect of recent L1 re-
exposure on Spanish speakers under L1 attrition. Bilingualism: Language and Cognition, 19, 520–532.
Clahsen, H., & Felser, C. (2006). Continuity and shallow structures in language processing. Applied
Psycholinguistics, 27, 107–126.
Clahsen, H., & Felser, C. (2018). Some notes on the Shallow Structure Hypothesis. Studies in Second Language
Acquisition, 40, 1–14.
Clifton, C., Staub, A., & Rayner, K. (2007). Eye movements in reading words and sentences. In R. van Gompel,
M. H. Fischer, W. S. Murray, & R. L. Hill (Eds.), Eye movements: A window on mind and brain
(pp. 341–372). Elsevier.
Colantoni, L., Martínez, R., Mazzaro, N., Pérez-Leroux, A. T., & Rinaldi, N. (2020). A phonetic account of
Spanish-English bilinguals’ divergence with agreement. Languages, 5, 58.
Collentine, J. (1995). The development of complex syntax and mood-selection abilities by intermediate-level
learners of Spanish. Hispania, 78, 123–136.
Collentine, J. G. (1997). The effects of irregular stems on the detection of verbs in Spanish. Spanish Applied
Linguistics, 1, 3–23.
Collentine, J. (2014). Subjunctive in second language Spanish. In K. L. Geeslin (Ed.), The handbook of second language acquisition (pp. 270–286). John Wiley & Sons.
Cop, U., Keuleers, E., Drieghe, D., & Duyck, W. (2015). Frequency effects in monolingual and bilingual
natural reading. Psychonomic Bulletin & Review, 22, 1216–1234.
Correa, M. (2011). Subjunctive accuracy and metalinguistic knowledge of L2 learners of Spanish. Electronic
Journal of Foreign Language Teaching, 8, 39–56.
Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies
based on film subtitles. Psicologica, 32, 133–143.
Cunnings, I. (2017). Parsing and working memory in bilingual sentence processing. Bilingualism: Language
and Cognition, 20, 659–678.
Davies, M. (2006). A frequency dictionary of Spanish. Routledge.
Demestre, J., & García Albea, J. E. (2004). The on-line resolution of the sentence complement/relative clause
ambiguity: Evidence from Spanish. Experimental Psychology, 51, 59–71.
Dussias, P. E., & Sagarra, N. (2007). The effect of exposure on syntactic parsing in Spanish English bilinguals.
Bilingualism: Language and Cognition, 10, 101–116.
Felser, C., & Cunnings, I. (2012). Processing reflexives in a second language: The timing of structural and
discourse-level constraints. Applied Psycholinguistics, 33, 571–603.
Felser, C., Cunnings, I., Batterham, C., & Clahsen, H. (2012). The timing of island effects in nonnative
sentence processing. Studies in Second Language Acquisition, 34, 67–98.
Felser, C., Roberts, L., Gross, R., & Marinis, T. (2003). The processing of ambiguous sentences by first and
second language learners of English. Applied Psycholinguistics, 24, 453–489.
Gallego, M., & Pozzi, R. (2018). Mood recognition and production. Hispania, 101, 408–421.

Geeslin, K. L., & Gudmestad, A. (2008). Comparing interview and written elicitation tasks in native and non-
native data: Do speakers do what we think they do. In J. Bruhn de Garavito & E. Valenzuela (Eds.), Selected
proceedings of the 10th Hispanic linguistics symposium (pp. 64–77). Cascadilla Proceedings Project.
Green, P., & MacLeod, C. J. (2016). SIMR: An R package for power analysis of generalized linear mixed
models by simulation. Methods in Ecology and Evolution, 7, 493–498.
Gudmestad, A. (2006). L2 variation and the Spanish subjunctive: Linguistic features predicting mood
selection. In C. Klee & T. Face (Eds.), Selected papers of the 7th Conference on the Acquisition of Spanish
and Portuguese as First and Second Languages (pp. 170–184). Cascadilla Proceedings Project.
Gudmestad, A. (2010). Moving beyond a sentence-level analysis in the study of variable mood use in Spanish.
Southwest Journal of Linguistics, 29, 25–51.
Gudmestad, A. (2012a). Acquiring a variable structure: An interlanguage analysis of second language mood
use in Spanish. Language Learning, 62, 373–402.
Gudmestad, A. (2012b). Toward an understanding of the relationship between mood use and form regularity:
Evidence of variation across tasks, lexical items, and participant groups. In K. L. Geeslin & M. Díaz-
Campos (Eds.), Selected Proceedings of the 14th Hispanic Linguistics Symposium (pp. 214–227). Cascadilla
Press.
Gutiérrez, X. (2017). Explicit knowledge of the Spanish subjunctive and accurate use in discrete-point, oral
production, and written production measures. Canadian Journal of Applied Linguistics, 20, 1–30.
Hofmeister, P. & Vasishth, S. (2014). Distinctiveness and encoding effects in online sentence comprehension.
Frontiers in Psychology, 5, 1237.
Hopp, H. (2013). Grammatical gender in adult L2 acquisition: Relations between lexical and syntactic
variability. Second Language Research, 29, 33–56.
Hopp, H. (2014). Working memory effects on the L2 processing of ambiguous relative clauses. Language
Acquisition, 21, 250–278.
Hopp, H. (2016). The timing of lexical and syntactic processes in second language sentence comprehension.
Applied Psycholinguistics, 37, 1253–1280.
Hopp, H. (2017a). Individual differences in L2 parsing and lexical representations. Bilingualism: Language
and Cognition, 20, 689–690.
Hopp, H. (2017b). Cross-linguistic lexical and syntactic co-activation in L2 sentence processing. Linguistic
Approaches to Bilingualism, 7, 96–130.
Hopp, H. (2018). The bilingual mental lexicon in L2 sentence processing. Second Language, 17, 5–27.
Hopp, H., & León Arriaga, M. E. (2016). Structural and inherent case in the non-native processing of Spanish:
Constraints on inflectional variability. Second Language Research, 32, 75–108.
Husain, S., Vasishth, S., & Srinivasan, N. (2014). Strong expectations cancel locality effects: Evidence from Hindi. PLoS ONE, 9, e100987.
Inhoff, A. W., & Radach, R. (1998). Definition and computation of oculomotor measures in the study of
cognitive processes. In G. Underwood (Ed.), Eye guidance in reading and scene perception (pp. 29–53).
Elsevier.
Isabelli, C. A., & Nishida, C. (2005). Development of the Spanish subjunctive in a nine-month study-abroad
setting. In D. Eddington (Ed.), Selected Proceedings of the 6th Conference on the Acquisition of Spanish and
Portuguese as First and Second Languages (pp. 78–91). Cascadilla Press.
Iverson, M., Kempchinsky, P., & Rothman, J. (2008). Interface vulnerability and knowledge of the subjunc-
tive/indicative distinction with negated epistemic predicates in L2 Spanish. Eurosla Yearbook, 1, 135–163.
Jegerski, J. (2015). The processing of case in near-native Spanish. Second Language Research, 31, 281–307.
Jegerski, J., & Fernández Cuenca, S. (2019, March). Lexical and morphosyntactic processes in the online
comprehension of subject-verb number agreement in L2 Spanish. Paper presented at the Generative
Approaches to Second Language Acquisition conference, University of Nevada, Reno, NV.
Jiang, N. & Botana, G. P. (2009, October). Frequency effects in NS and NNS word recognition. Paper
presented at the Second Language Research Forum, Michigan State University, East Lansing, MI.
Kaan, E., Ballantyne, J., & Wijnen, F. (2015). Effects of reading speed on second-language sentence
processing. Applied Psycholinguistics, 36, 799–830.
Kanwit, M., & Geeslin, K. L. (2018). Exploring lexical effects in second language interpretation: The case of
mood in Spanish adverbial clauses. Studies in Second Language Acquisition, 40, 579–603.
Keating, G. D. (2014). Eye-tracking with text. In J. Jegerski & B. VanPatten (Eds.), Research methods in second
language psycholinguistics (pp. 69–92). Routledge.

Keating, G. D., & Jegerski, J. (2015). Experimental designs in sentence processing research: A methodological
review and user’s guide. Studies in Second Language Acquisition, 37, 1–32.
Kimball, A., Shantz, K., Eager, C. & Roy, J. (2018) Confronting quasi-separation in logistic mixed effects for
linguistic data: A Bayesian approach. Journal of Quantitative Linguistics, 26, 231–255.
Kirkici, B., & Clahsen, H. (2013). Inflection and derivation in native and non-native language processing:
Masked priming experiments on Turkish. Bilingualism: Language and Cognition, 16, 776–791.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2014). lmerTest: Tests for random and fixed effects
for linear mixed effect models (lmer objects of lme4 package). https://cran.r-project.org/web/packages/
lmerTest/index.html.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. Routledge.
Lee, J. (1987). Comprehending the Spanish subjunctive: An information processing perspective. Modern
Language Journal, 71, 50–57.
Lemhöfer, K., Schriefers, H., & Indefrey, P. (2014). Idiosyncratic grammars: Syntactic processing in second
language comprehensions uses subjective feature representations. Journal of Cognitive Neuroscience, 26,
1428–1444.
Lenth, R., Singmann, H., Love J., Buerkner, P., & Herve, M. (2018). Emmeans: Estimated Marginal Means, aka
Least-Squares Means. https://cran.r-project.org/web/packages/emmeans/index.html.
Leow, R. (1993). To simplify or not to simplify: A look at intake. Studies in Second Language Acquisition, 15,
333–356.
López Ornat, S., Fernández, A., Gallo, P. & Mariscal, S. (1994). La adquisición de la lengua española [The
acquisition of the Spanish language]. Siglo XXI.
Lubbers Quesada, M. (1998). L2 acquisition of the Spanish subjunctive mood and prototype schema
development. Spanish Applied Linguistics, 2, 1–23.
Massery, L. A. (2009). Syntactic development of the Spanish subjunctive in second language acquisition:
Complement selection in nominal clauses [Unpublished doctoral dissertation]. University of Florida,
Gainesville.
Miller, K. A. (2014). Accessing and maintaining referents in L2 processing of wh-dependencies. Linguistic
Approaches to Bilingualism, 4, 167–191.
Montrul, S. (2011). Morphological errors in Spanish second language learners and heritage speakers. Studies
in Second Language Acquisition, 33, 163–192.
Montrul, S. & Ionin, T. (2012). Dominant language transfer in Spanish heritage speakers and second language
learners in the interpretation of definite articles. The Modern Language Journal, 96, 70–94.
Montrul, S., Foote, R., & Perpiñán, S. (2008). Gender agreement in adult second language learners and
Spanish heritage speakers: The effects of age and context of acquisition. Language Learning, 58, 503–553.
Montrul, S., & Slabakova, R. (2003). Competence similarities between native and near-native speakers: An
investigation of the preterite/imperfect contrast in Spanish. Studies in Second Language Acquisition, 25,
351–398.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated
measures and independent-groups designs. Psychological Methods, 7, 105–125.
Papadopoulou, D., & Clahsen, H. (2003). Parsing strategies in L1 and L2 sentence processing: A study of
relative clause attachment in Greek. Studies in Second Language Acquisition, 24, 501–528.
Pérez-Leroux, A. T. (1998). The acquisition of mood selection in Spanish relative clauses. Journal of Child
Language, 25, 585–604.
Pinker, S. (1999). Words and rules: The ingredients of grammar. Basic Books.
Pliatsikas, C., & Marinis, T. (2013). Processing of regular and irregular past tense morphology in highly
proficient second language learners of English: A self-paced reading study. Applied Psycholinguistics, 34,
943–970.
Poplack, S., Cacoullos, R. T., Dion, N., de Andrade Berlinck, R., Digesto, S., Lacasse, D., & Steuck, J. (2018).
Variation and grammaticalization in Romance: A cross-linguistic study of the subjunctive. In W. Ayres-
Bennett & J. Carruthers (Eds.), Manual of Romance Sociolinguistics (pp. 217–252). De Gruyter.
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372–422.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical
Computing. https://www.R-project.org/.

Sánchez-Naranjo, J., & Pérez-Leroux, A. T. (2010). In the wrong mood at the right time: Children’s
acquisition of the Spanish subjunctive in temporal clauses. Canadian Journal of Linguistics, 55, 227–255.
SR Research. (2005). EyeLink 1000 [Apparatus and software]. https://www.sr-research.com/eyelink-1000-plus/
Terrell, T., Baycroft, B., & Perrone, C. (1987). The subjunctive in Spanish interlanguage: Accuracy and
comprehensibility. In B. VanPatten, T. R. Dvorak, & J. F. Lee (Eds.), Foreign Language Learning: A
Research Perspective (pp. 19–32). Newbury House.
VanPatten, B. (1994). Evaluating the role of consciousness in second language acquisition: Terms, linguistic
features and research methodology. AILA Review, 11, 27–36.
VanPatten, B. (1996). Input Processing and Grammar Instruction: Theory and Research. Ablex.
VanPatten, B., & Cadierno, T. (1993). Explicit instruction and input processing. Studies in Second Language
Acquisition, 15, 225–243.
Wiseheart, M. (2014). Effect size calculator. http://www.cognitiveflexibility.org/effectsize/

Cite this article: Fernández Cuenca, S. and Jegerski, J. (2023). A role for verb regularity in the L2 processing
of the Spanish subjunctive mood: Evidence from eye-tracking. Studies in Second Language Acquisition, 45,
318–347. https://doi.org/10.1017/S027226312200016X



Studies in Second Language Acquisition (2023), 45, 348–369
doi:10.1017/S0272263122000092

RESEARCH ARTICLE

The additive use of prosody and morphosyntax


in L2 German
Nick Henry*
The University of Texas at Austin, Austin, TX, USA
*Corresponding author. E-mail: nhenry@austin.utexas.edu

(Received 24 June 2021; Revised 19 January 2022; Accepted 7 February 2022)

Abstract
This study investigates whether the use of prosodic cues during instruction facilitates the
processing of German accusative case markers. Two groups of third semester L1 English
learners of L2 German completed Processing Instruction (PI) with aural input: Learners in
the PIþP group heard sentences that included focused prosodic cues; learners in the PI
group heard sentences with monotone prosody. The effects of training were assessed
through an offline comprehension task, a written production task, and an online self-paced
reading (SPR) task. The results for the offline tasks showed that the groups were similar with
respect to their offline comprehension and production. The SPR task showed that both
groups used case markers to interpret word order online to some extent; however, only the
PIþP group did so in all conditions. These results suggest that prosody does play a role in
(morpho)syntactic processing, and that covert activation of prosodic structures can facilitate
processing during online reading tasks.

Introduction
Within the second language (L2) sentence processing literature, it has been widely
observed that learners have difficulty processing morphosyntactic forms, and L2
learners often rely on lexical-semantic information to comprehend the input, even
after they achieve high proficiency (e.g., Keating, 2009; Marinis et al., 2005). While
much research has focused on learners’ tendency to favor lexical-semantic over
morphosyntactic cues when processing online, comparatively little research has
explored the effects of prosody on L2 sentence processing, even though such effects
are well-attested in the literature on native (L1) speakers (Steinhauer, 2003). As recent
research on Processing Instruction (PI; see VanPatten, 2004b, 2015) suggests that the
use of prosody can indeed help L2 learners acquire morphosyntactic forms (Henry,
Jackson, et al., 2017), the present study seeks to investigate the role of prosodic cues in
online (i.e., real-time) processing.

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/
4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative
Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written
permission of Cambridge University Press must be obtained prior to any commercial use.

https://doi.org/10.1017/S0272263122000092 Published online by Cambridge University Press



Background and Motivation


Input Processing and Processing Instruction
One useful framework for viewing learners’ processing strategies and their use of
varying types of linguistic information is VanPatten’s Input Processing model. The
Input Processing model proposes that learners filter input and selectively attend to the
most salient cues with the highest communicative value (VanPatten, 2004a). This stems both from the need to extract meaning and from constraints on cognitive resources (Miyake & Friedman, 1998; see Juffs & Harrington, 2011). Morphosyntactic cues are thus often left
unprocessed during comprehension in favor of lexical-semantic or word order cues.
One consequence of this filtering is that, as described in the Input Processing model’s
First Noun Principle (FNP), learners tend to overlook case-marking information and
process the first noun in a sentence as the subject or agent (LoCoco, 1987; VanPatten,
1984). Subprinciples of the FNP state that learners may also use context or animacy cues
instead of case to understand agent-patient relationships (Jackson, 2007; VanPatten &
Houston, 1998).
To illustrate, consider the following examples from German. As seen in (1) and (2),
grammatical roles are assigned primarily by case information on the nominative and
accusative masculine case markers der and den. As seen in (2) and (3) these case
markers allow the same sentence meaning to be expressed using either subject-verb-
object (SVO) or object-verb-subject (OVS) word order:
(1) Der(NOM) Hund hört die(ACC) Katze. (SVO)
The dog hears the cat.
"The dog hears the cat."
(2) Die(NOM) Katze hört den(ACC) Hund. (SVO)
The cat hears the dog.
"The cat hears the dog."
(3) Den(ACC) Hund hört die(NOM) Katze. (OVS)
The dog hears the cat.
"The cat hears the dog."

However, because feminine, neuter, and plural articles are the same in the nominative
and the accusative cases, speakers must actively attend to both word-order cues and
case cues. In addition, SVO sentences make up 80–95% of transitive sentences
with full NPs (Kempe & MacWhinney, 1998; Schlesewsky et al., 2000). Learners can
therefore rely on a first-noun strategy with a high degree of accuracy, particularly if they
also use animacy information to override implausible SVO interpretations (see Jackson,
2007). As a result, learners often do not connect the case markers der and den to their
meaning, delaying acquisition of these morphosyntactic forms (LoCoco, 1987).
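The contrast between case-driven role assignment and the first-noun strategy described above can be sketched as a toy simulation (this illustration is not part of the original study; the data representation and function names are hypothetical):

```python
# Toy illustration of role assignment in German transitive sentences.
# A case-driven parse uses the masculine articles der (NOM) / den (ACC)
# to assign agent and patient; a "first-noun" strategy (First Noun
# Principle) simply treats the first noun as the agent.

def parse_by_case(nps):
    """Assign roles from case-marked articles; nps = [(article, noun), ...]."""
    (art1, n1), (art2, n2) = nps
    # den in first position (or der in second) unambiguously signals OVS
    if art1 == "den" or art2 == "der":
        return {"agent": n2, "patient": n1}
    return {"agent": n1, "patient": n2}  # default SVO reading

def parse_first_noun(nps):
    """Learner heuristic: the first noun is the agent, regardless of case."""
    (_, n1), (_, n2) = nps
    return {"agent": n1, "patient": n2}

# OVS example (3): "Den Hund hört die Katze" = "The cat hears the dog"
ovs = [("den", "Hund"), ("die", "Katze")]
print(parse_by_case(ovs))     # case cues yield the correct agent, Katze
print(parse_first_noun(ovs))  # first-noun strategy misassigns Hund as agent
```

Because most input is SVO, the two strategies agree on the majority of sentences, which is precisely why the first-noun heuristic can persist and delay acquisition of the case markers.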
The instructional application of Input Processing, PI, seeks to change learners’
default processing strategies to promote the acquisition of a targeted form. This is
achieved primarily through referential Structured Input (SI) activities, a highly specific
type of task-essential activity (Loschky & Bley-Vroman, 1993), which aims to change
learners’ processing strategies by manipulating input such that they must use a targeted
cue to understand the input and complete a task (see Wong, 2004). In a typical SI
activity for German case markers, learners are presented with a mix of SVO and OVS
sentences that contain no context or plausibility cues like examples (2) and (3). They
must then choose between pictures that correspond to the two possible interpretations

of the sentence (i.e., a dog hearing a cat, or a cat hearing a dog). Referential activities
like the one just described have unambiguous right and wrong answers. PI also often
includes affective activities, which do not; referential activities, however, have been
found to be the most effective (Marsden & Chen, 2011). Note, too,
that PI focuses exclusively on changing processing behavior through input and does not
include any output activities.
Over the last two decades, a vast literature on PI has emerged, and it has been shown
to increase learners’ comprehension and production of a variety of forms, including
those related primarily to verbal morphology (Benati, 2001; Cadierno, 1995) and to
grammatical role assignment (VanPatten & Cadierno, 1993; VanPatten & Uludag,
2011). The traditional explanation for PI’s positive effects is that it changes learners’
processing strategies and pushes them to process grammatical forms. This leads to
stronger form-meaning connections within the developing system and an increased
ability to use the form in both comprehension and production. While research on PI
has made important contributions to instructed second language acquisition, it is
important to emphasize that PI also serves as a methodological tool for investigating
how learners process particular aspects of the input. That is, research on PI not only
adds to research on the effects of instruction but also acts as an important validation
of the Input Processing model, from which it was developed.

The effects of training on processing


Relatively few studies have used cognitive behavioral measures such as self-paced
reading (SPR) or eye-tracking to investigate how PI and related instructional interven-
tions change learners’ processing strategies, though these methodologies are becoming
more common in PI research. As Lee et al. (2020) and Henry (2022) discuss, these
studies address an increasingly large number of research questions, but many have
centered on the effects of PI compared to other instructional trainings (Benati, 2020a,
2020b; Chiuchiù & Benati, 2020; Henry, 2022; Issa & Morgan-Short, 2019; McManus &
Marsden, 2017, 2018; Wong & Ito, 2018) and on aspects of a PI training, such as the
delivery of explicit information (EI) or feedback (Dracos & Henry, 2018; Indrarathne &
Kormos, 2017; Issa, 2019; Wong & Ito, 2018). While these studies have utilized a variety
of methodological approaches, the research has broadly found that PI increases
attention and sensitivity to the target forms (Henry, 2022; Issa & Morgan-Short,
2019), reduces the use of nontarget processing strategies (Dracos & Henry, 2018; Wong
& Ito, 2018), and/or increases depth or ease of processing (Chiuchiù & Benati, 2020; Lee
et al., 2020). Other studies have found mixed effects or shown that training did not
affect online processing (Dracos & Henry, 2021; Ito & Wong, 2019).
One recent study by Henry (2022) investigated whether PI pushes learners to
process nominative and accusative case markers in German online. Two groups were
instructed on German case markers using either PI or a traditional output-focused
instruction (TI) and completed an online SPR task. Results showed that the participants
had elevated reading times (RTs) on OVS sentences after receiving PI, but not TI,
suggesting that they had processed case markers incrementally as native speakers do
(Hemforth et al., 1993; Schlesewsky et al., 2000; Schriefers et al., 1995). However, the
effect only occurred when the sentence began with a masculine noun (i.e., unambiguous
case marking), and not when the masculine noun came after the verb. Interestingly,
results also showed that participants had higher RTs on the noun phrases regardless of
the sentence condition. This suggested that PI pushed learners to attend to case



The additive use of prosody and morphosyntax in L2 German 351

markers, though they were not yet able to integrate case-marking information rapidly
in all circumstances. In light of these results, Henry suggested that it would be useful to
investigate whether the use of prosodic cues in PI could facilitate processing in
subsequent online tasks, as recent research has shown that both L1 (Grünloh et al.,
2011) and L2 learners (Henry, Hopp et al., 2017; Henry, Jackson et al., 2017; Henry,
Jackson, & Hopp, 2020) can use prosodic cues to interpret case markers in German.

Prosody in L1 and L2
Studies in the L1 acquisition literature have shown that child acquirers exploit links
between prosody and syntax to acquire difficult morphosyntactic forms. For example,
L1 acquisition research on the Competition Model (see MacWhinney, 2001) has
investigated how learners process single cues versus cue coalitions, that is, multiple
cues that frequently occur together and point toward the same interpretation. This
research has shown that, while German children have difficulty using case markers
alone (Dittmar et al., 2008), they can use prosodic cues to help them interpret OVS
sentences with unambiguous case information (Grünloh et al., 2011). Research has
demonstrated that adult L1 speakers use prosody to resolve syntactic ambiguities
(Fodor, 1998; Steinhauer et al., 1999), and to guide grammatical role assignment when
morphosyntactic information is absent (Weber et al., 2006). Recent evidence also shows
that prosody can be used alongside case information to boost the speed of prediction in
German (Henry, Hopp, et al., 2017).
L2 research on the use of prosody and syntax is less well developed.
Early research demonstrated that prosody helps novice learners identify constituents
(Wakefield et al., 1974) and order them hierarchically (Langus et al., 2012). Further,
prosody may make elements of the input more perceptually salient (Carroll, 2004,
2006). Research in the L2 processing literature has also suggested that prosody helps
learners develop more nativelike processing routines (Dekydtspotter et al., 2006;
Fernández, 2010). For example, Dekydtspotter et al. (2008) found that fourth-semester
learners of French attend to the phonological weight of relative clauses in order to
interpret them. The authors concluded that prosody is an “integral part of interlan-
guage processing” (p. 476), and that the ability to use prosody may be crucial for
developing nativelike attachment preferences. Other recent research has found that
prosody can help L2 learners make predictions online (Foltz, 2021; Henry et al., 2020).
For instance, Henry et al. (2020) found that intermediate-high and advanced L2
learners of German were more likely to use case markers to predict upcoming nouns
in a sentence when it included prosodic cues that indicated word order.
To date, only a few studies have investigated the effects of using prosodic
information during an instructional training (Henry, Jackson, et al., 2017; Martin
& Jackson, 2016). One study, Henry, Jackson, et al. (2017), used PI to investigate
whether EI and prosody aid the acquisition of German case markers, the target form
in the present study. They found that, when EI was excluded from training, learners
were better able to comprehend and produce case cues when they had received
training with prosody. They concluded that prosody helps learners identify and
attend to morphosyntactic forms, either by increasing the perceptual salience of
those forms (e.g., by making phonetically reduced forms like definite articles easier
to hear), or by highlighting their communicative purpose. This study thus suggests
that PI could push learners to process case markers online more effectively if it
includes prosodic cues.



The Present Study
The aim of the present study is to investigate (a) whether training that includes prosodic
cues can push learners to process case markers in German and use them to guide
interpretation in real time (i.e., to process case markers incrementally), and (b) whether
PI with prosody is more effective than PI that does not utilize prosodic cues.
This study compares two groups of learners who received PI. The first is a
group of learners from Henry (2022), whose training did not include prosodic cues (PI).
The second group of learners received training that did include prosodic cues (PI+P).
The outcomes of training are investigated through offline comprehension and produc-
tion tasks, as well as an SPR task that evaluates changes to processing behaviors. Thus,
the present study sheds light on whether PI changes learners’ processing strategies
under different training conditions. More importantly, it extends previous research on
the use of prosodic cues in L2 processing (Foltz, 2021; Henry et al., 2020) and grammar
training (Henry, Jackson, et al., 2017; Martin & Jackson, 2016), by demonstrating
whether the presence of prosodic cues in training stimuli has effects on online
processing. That is, while Henry, Jackson, et al. (2017) looked only at the outcomes of training on
comprehension and production accuracy, this study investigates how learners use case
markers moment-by-moment using SPR. Finally, the present research provides insight
into the proposal by Dekydtspotter et al. (2006) that the ability to activate and use
appropriate prosodic structures is integral to developing nativelike processing routines.
The research questions for the present study are as follows:

RQ1: To what extent does a PI training that includes prosodic cues lead to more
accurate comprehension and production of accusative case markers in German
than PI without prosodic cues?
RQ2: To what extent do learners process German accusative case markers
incrementally when comprehending sentences online after training with PI
with or without prosodic cues?

Methodology
Participants
The participants were drawn from eight intact sections of an intermediate-level German
course at a large northeastern university in the United States. To determine eligibility
for the study, participants completed a language background questionnaire. Each
participant included in the final analyses met the following criteria: (a) they were native
speakers of English with no advanced knowledge of another language; (b) they dem-
onstrated no knowledge of the target form as determined by a score of 50% or less on
OVS items in the pretest’s sentence interpretation task (explained in the following
section); and (c) they completed all the tasks. The final pool of participants (N = 53) was
divided randomly into two treatment groups: PI (n = 25) and PI with prosody (PI+P)
(n = 28). The PI group is the same group of participants described in Henry (2022).
To ensure comparability between the groups, the participants completed a written,
30-item multiple-choice language proficiency test (University of Wisconsin Testing
and Evaluation, 2006), a working memory task based on Waters and Caplan (1996),
and a postexperiment vocabulary test that measured word knowledge and gender
assignment for the words used in the SPR task. Descriptive statistics for these measures
and responses from the language background questionnaire, including several



Table 1. Means for screening and proficiency measures (standard deviations in parentheses)

Variable (Range of Possible Scores)              PI Mean (SD)    PI+P Mean (SD)
Age                                              19.36 (2.48)    19.75 (2.08)
Time in German-Speaking Country (in Months)      0.10 (0.29)     1.95 (7.18)
Years of German Instruction                      4.17 (3.36)     3.63 (1.78)
Years of Instruction in a 3rd Language           1.91 (2.21)     1.85 (3.19)
Self-Rating: Reading Proficiency (1–10)          6.02 (1.36)     5.96 (1.55)
Self-Rating: Spelling Proficiency (1–10)         6.54 (1.51)     5.79 (1.47)
Self-Rating: Writing Proficiency (1–10)          5.06 (1.28)     5.36 (1.31)
Self-Rating: Speaking Proficiency (1–10)         5.20 (1.71)     5.64 (1.34)
Self-Rating: Listening Comprehension (1–10)      5.30 (1.31)     6.11 (1.69)
Working Memory: Set Size (0, 2–6)                3.68 (0.83)     3.63 (1.19)
Working Memory: Words Remembered (0–89)          65.36 (11.31)   63.82 (13.94)
Proficiency Task Accuracy (0–30)                 13.24 (5.76)    13.00 (5.62)
Vocabulary Test: Word Knowledge (0–72)           71.08 (1.87)    71.14 (1.11)
Vocabulary Test: Gender Assignment (0–48)        45.40 (3.04)    45.21 (3.00)

self-rated proficiency measures, are presented in Table 1. Statistical analyses showed that
the groups were similar in age (t(51) = 0.622, p = .537), years of German language
instruction (t(49) = 0.724, p = .473), time spent in a German-speaking country
(t(51) = 1.284, p = .205),1 overall proficiency (t(51) = 0.586, p = .879), working memory
(t(51) = 0.438, p = .663), word knowledge (t(51) = 0.151, p = .881), gender assignment
(t(51) = 0.224, p = .824), and all the self-rated proficiency measures (all p > .05).

Materials
PI Treatment
A complete record of the training materials is found in the supplemental materials. The
PI treatment for both groups consisted of EI, a 50-item referential structured input
(SI) activity, and two comparatively shorter affective SI activities that aimed to teach
the nominative and accusative case markers der and den in German.2 The EI and the
referential activity were both presented using the computer program E-Prime
(Schneider et al., 2012), while the affective activities were administered using pencil
and paper to prevent screen fatigue among participants.
The training began with the EI, which gave the participants basic information about
the nominative and accusative masculine case markers and OVS word order. Both groups read
information about the communicative purpose of OVS word order. Only the PI+P

1. The statistical results indicated no differences between the groups, but the PI+P group had a
comparatively high mean. This was caused by two participants who were otherwise very similar to the other
participants in the study. To ensure that these participants did not affect the study’s findings, these
participants were removed in a separate analysis. This analysis did not change the results for any of the
assessments. These participants were therefore included in the analyses reported in the remainder of the
study.
2. The EI and referential SI activity were the same as those in Henry, Jackson, and Dimidio (2017), and were
originally adapted from VanPatten et al. (2013). The affective activities were adapted from Farley (2004).
Note that the primary difference between referential and affective activities is that, in referential activities, the
target form is task essential, and items have a single correct answer; in affective activities, the target form is
used but items allow subjective responses that do not necessarily have a correct answer.




group was told about the prosodic cues that accompany OVS word order, but both
heard an example of an SVO and OVS sentence using the intonation patterns in their
respective trainings.
The referential SI activity for both groups consisted of 38 OVS sentences and 12 SVO
distractor items placed in a repeating pattern of three OVS sentences and one SVO
sentence so that the distractors were evenly spaced throughout the training. Partici-
pants heard the sentences through speakers attached to the computer and were
simultaneously presented with two pictures corresponding to the two possible inter-
pretations of the sentences (e.g., a cat hearing a dog, or a dog hearing a cat). Participants
then selected the picture that corresponded to the sentence they heard using the
computer keyboard. After making their selection, participants saw one-word corrective
feedback (i.e., CORRECT, or INCORRECT).
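The 3-OVS + 1-SVO sequencing described in the preceding text can be sketched as follows. This is a minimal Python illustration of the scheduling logic only, not the authors' E-Prime implementation; the item labels are hypothetical:

```python
def build_schedule(ovs_items, svo_items):
    """Interleave trials in a repeating pattern of three OVS items followed by
    one SVO distractor, so that distractors are evenly spaced throughout."""
    schedule, ovs, svo = [], list(ovs_items), list(svo_items)
    while ovs or svo:
        for _ in range(3):          # up to three OVS trials per block
            if ovs:
                schedule.append(ovs.pop(0))
        if svo:                     # one SVO distractor closes each block
            schedule.append(svo.pop(0))
    return schedule

# 38 OVS sentences and 12 SVO distractors, as in the referential activity
trials = build_schedule([f"OVS-{i}" for i in range(1, 39)],
                        [f"SVO-{i}" for i in range(1, 13)])
```

With 38 OVS and 12 SVO items this yields 50 trials, with an SVO distractor in every fourth position for the first 48 trials.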
Following the referential activity, participants completed two affective SI activities
adapted from Farley (2004). Both were focused on relationships with male persons
given that the masculine articles were targeted. In the first activity, participants decided
if a series of statements applied to their relationships with a good male friend and a male
family member. In the second activity, participants read a list of things a supportive
spouse would do and ranked them in terms of importance. There were no correct
answers in either activity, but they provided the participants with 26 additional OVS
sentences and acted as an input flood.
The only difference between the training for the PI and PI+P groups was the audio
recordings in the referential activity. Participants in the PI group heard sentences
presented with monotone prosody; the PIþP group heard sentences with focused
prosody. The recordings were drawn from the training for the +EI and +EI+P groups
in Henry, Jackson, et al. (2017). For the monotone condition, a female native speaker of
German was instructed to speak as naturally as possible without emphasizing any of the
words in the sentence. For the focused prosody condition, the same speaker imagined
that she was responding to a direct question about the subject or the object of the
sentence. The prosodic cues in the final stimulus set were evaluated using both the
German Tones and Break Indices (GToBI) system (Grice & Baumann, 2002) and a
phonetic analysis. These analyses indicated that the OVS sentences with focused
prosody carried a high pitch accent with a low leading tone on the first noun phrase
(NP1). In SVO sentences with focused prosody, there was no pitch accent on NP1, but
the nuclear accent fell on NP2. Thus, the analyses confirmed that the sentences in the
focused prosody condition conformed to the pitch contours attested in prior literature
(see Braun, 2006; Grünloh et al., 2011; Nespor et al., 2008). The analyses also showed
that the sentences in the monotone prosody condition did not have any systematic
differences in pitch, duration, or intensity, and none of the sentences contained a high
pitch accent on NP1. Thus, these sentences, while spoken naturally, did not contain
disambiguating prosodic cues, and sentences in the focused condition were more
pragmatically appropriate. The prosodic contours and GToBI ratings for a sample
item in each condition are displayed in Figure 1. The results of the phonetic analyses
are presented in Table S1 in the supplemental materials.

Assessment Measures
The offline effects of treatment were assessed using a written pretest/posttest that
included a sentence interpretation task and a picture description task. The sentence
interpretation task consisted of 8 experimental SVO/OVS sentences and 12 distractor
sentences followed by a comprehension question in English. The comprehension




Figure 1. Sample waveform and spectrogram with GToBI annotations for training stimuli.

question for the experimental sentences targeted the correct interpretation of gram-
matical roles as seen in (1):
(1) Die Oma überrascht der Opa während der Party.
The[ACC] grandma surprises the[NOM] grandpa during the party.
“The grandpa surprises the grandma during the party.”
Is the grandpa surprising the grandma?  Yes  No
Note that the comprehension question is presented in English so that participants could
not answer the question by simply matching the case markings from the sentence and
the comprehension question.
The picture description task consisted of two target and two distractor picture series.
As seen in Figure 2, each series consisted of three pictures, a question prompt, and
relevant vocabulary to help participants complete the task.3 Participants wrote a

Figure 2. Example item from production task in the offline pre-/posttest.

3. Participants were not limited to the verbs and nouns given to them, but most did use this vocabulary to
complete their answer.




minimum of one sentence to tell a story in response to the question prompt. In the
target picture series, participants saw a main character interacting with a masculine
person or object, and thus the sentences specifically elicited use of the masculine articles
der and den.
The online effects of training were assessed using a noncumulative SPR task (Just
et al., 1982) using E-Prime (Schneider et al., 2012). The task was administered before
and after training and was designed to test sensitivity to case markings through a
comparison of RTs on SVO and OVS sentences. As mentioned previously, native
speakers tend to display higher RTs at disambiguating regions (i.e., masculine case
markers) for OVS sentences when compared with the SVO sentences, indicating
incremental use of case-marking information. If participants use case marking to assign
grammatical roles as native speakers do, it is therefore expected that they would
exhibit a similar RT pattern.
During the SPR task, participants first saw a fixation point on the screen. They then
pressed the spacebar to begin reading a sentence that was presented phrase by phrase.
Participants saw the first phrase in the sentence followed by a series of dashes represent-
ing the words in the remainder of the sentence. When participants pressed the spacebar,
the first phrase disappeared, and the second phrase appeared. Participants continued in
this manner until the end of the sentence. They then answered a Yes/No comprehension
question in English to ensure that they attended to the meaning of the sentence.
The SPR task consisted of 72 items: 24 experimental items and 48 filler items. The
experimental items were 24 quadruplets containing an NP-V-NP sequence followed by
one or two prepositional phrases. The quadruplets were created by varying word order
and the position of the masculine noun, resulting in four sentence conditions. Because
case cannot be assigned independently of the noun in German, determiners were
presented alongside the noun (i.e., NPs were presented together). Thus, sentences were
divided into five to seven segments for presentation (see also Hopp, 2006; Jackson, 2008;
Schlesewsky et al., 2000). Segments one and three were the disambiguating noun phrases
and were the critical regions for analysis in Masculine-First and Masculine-Second
sentences, respectively; segments two and four were analyzed for spillover effects. In
the following examples, the slashes represent the division of the sentences, bolded
segments are the critical regions, and italicized segments are the spillover regions:
(4a) SVO-Masculine First
Der Opa / überrascht / die Oma / während / der Party.
The[NOM] grandpa / surprises / the[ACC] grandma / during / the party.
“The grandpa surprises the grandma during the party.”
(4b) OVS-Masculine First
Den Opa / überrascht / die Oma / während / der Party.
The[ACC] grandpa / surprises / the[NOM] grandma / during / the party.
“The grandma surprises the grandpa during the party.”
(4c) SVO-Masculine Second
Die Oma / überrascht / den Opa / während / der Party.
The[NOM] grandma / surprises / the[ACC] grandpa / during / the party.
“The grandma surprises the grandpa during the party.”
(4d) OVS-Masculine Second
Die Oma / überrascht / der Opa / während / der Party.
The[ACC] grandma / surprises / the[NOM] grandpa / during / the party.
“The grandpa surprises the grandma during the party.”




Half the comprehension questions for the experimental stimuli targeted the interpre-
tation of grammatical roles (as seen in the sentence interpretation task), and half
targeted the information in the rest of the sentence.
The experimental sentences were split into four lists, and participants only saw one
version of each sentence. These lists were controlled so that, within each list, segments
were equal in length in each sentence condition. Rather than controlling for the raw
frequency of the nouns and verbs used for the experimental stimuli, the participants
were trained on the vocabulary so that they would be familiar with it during the SPR
task. This method was chosen because pilot data showed that participants were not
particularly familiar with many of the words in the SPR task despite them being mostly
low-level, high-frequency words found in the participants’ coursework. Results of a
postexperiment vocabulary test indicated that this method effectively trained the
participants’ vocabulary knowledge (see Table 1).

Procedure
The experiment was conducted in three sessions. Session 1 took place in the
participants’ regular classrooms and included the language background question-
naire, pretest, and proficiency task. Sessions 2 and 3 took place in a lab on campus.
Session 2 included the vocabulary training task focusing on the nouns and verbs in
the SPR task (see the supplemental materials). In this task, participants saw and
heard each word three times with its (written) English translation and repeated the
words aloud. In a testing round, they then saw the word, provided a translation of it,
and received feedback on their answer. They were required to answer every question
correctly before being allowed to move on. After the vocabulary training, they
completed the working memory task and the pretest SPR task. In session 3, partic-
ipants completed a reduced vocabulary training task, in which participants saw each
word only twice. They then completed the PI treatment, the written posttest, the
posttest SPR task, and a written test to ensure that the participants had retained the
vocabulary.

Data Scoring
Sentence Interpretation
For the sentence interpretation task, SVO and OVS items were scored separately and
given one point for a correct Yes/No answer, and no points for an incorrect answer. The
maximum achievable score was four points, one for each target sentence.

Written Production
For the picture description task, the percentage of accurate responses for the nom-
inative and accusative was scored separately. The scores were computed by dividing
the number of accurate responses by the number of obligatory occasions in each
participant’s response. An obligatory occasion was defined as a point in the sentence,
in which the article or pronoun was required to complete the sentence grammatically.
The nominative and accusative articles der and den—as well as their corresponding
pronouns er and ihn, respectively—were considered correct when they accurately
described the pictures with respect to grammatical roles. Because participants were
not limited in their responses, several participants did not create any obligatory




occasions in their responses (e.g., they named the characters or substituted non-
masculine nouns to describe objects). These participants did not receive a score and
were treated as missing data in the analyses.
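The scoring rule just described amounts to a simple ratio with missing-data handling. The following is a hedged Python sketch of that computation, not the authors' actual scoring procedure:

```python
def production_score(correct, obligatory):
    """Accuracy for one participant: accurate responses divided by obligatory
    occasions. Participants who created no obligatory occasions are treated
    as missing data (None) rather than scored as zero."""
    if obligatory == 0:
        return None
    return correct / obligatory

# Hypothetical example: 3 accurate accusative forms out of 4 obligatory occasions
score = production_score(3, 4)   # 0.75
missing = production_score(0, 0)  # None: no obligatory occasions produced
```

Treating zero obligatory occasions as missing rather than as zero accuracy matters, because a participant who avoided masculine nouns altogether provides no evidence about case-marking accuracy.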

SPR Comprehension
In the SPR task, responses to the comprehension questions for the experimental
sentences were recorded by E-Prime (Schneider et al., 2012), along with RTs for each
region. The comprehension questions testing grammatical role assignment were
assessed as a proportion of accurate responses separately for OVS and SVO sentences.

SPR Reading Times


The analysis of RTs only included sentences for which the comprehension question
was answered correctly,4 resulting in a loss of 28.7% of the data. Participants who
had an overall comprehension rate less than 60% for the entire task or less than 33%
for any of the experimental conditions were excluded from analysis. The final
analyses for the SPR task therefore included 23 participants in the PI group and
26 from the PI+P group. After these exclusions, the raw RTs were trimmed to
remove outliers. First, any RTs below 200 ms or above 4,000 ms were removed from
the data. After this, RTs outside a range of ±3 standard deviations from
each participant’s overall mean for the experimental items were discarded. The
trimming procedures resulted in a loss of 2.9% of the data. After trimming the data,
new mean RTs were calculated by participants and condition for each segment of
the sentence.
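The exclusion and trimming steps described in the preceding text can be summarized in code. This is a minimal Python sketch under the stated thresholds; the data structures are hypothetical, and the original analysis was carried out in R:

```python
from statistics import mean, stdev

def keep_participant(overall_rate, condition_rates):
    """Exclude participants with less than 60% overall comprehension or less
    than 33% in any experimental condition."""
    return overall_rate >= 0.60 and all(r >= 0.33 for r in condition_rates)

def trim_rts(rts, floor=200, ceiling=4000, sd_cutoff=3.0):
    """Two-stage trimming of one participant's raw RTs (in ms): absolute
    bounds first, then +/-3 SD around that participant's own mean."""
    stage1 = [rt for rt in rts if floor <= rt <= ceiling]
    m, s = mean(stage1), stdev(stage1)
    return [rt for rt in stage1 if m - sd_cutoff * s <= rt <= m + sd_cutoff * s]
```

For example, trim_rts([150, 300, 400, 500, 4500]) drops the 150 ms and 4,500 ms values at the first, absolute-bounds stage; the surviving values all fall within ±3 SD of their own mean, so the second stage removes nothing.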

Data Analysis
The data were not normally distributed for any of the tasks and therefore could not
be analyzed using traditional parametric tests. Analyses were instead conducted using
mixed-effects models, which avoid violations of normality and are robust against
violations of homoscedasticity and sphericity. The maximal model structure was attempted first (see Barr
et al., 2013) and the random effects structure of the models was then reduced when the
maximal model did not converge (following Singmann, 2021). The structure of the final
model is noted in the results for each analysis. Significant main effects and interactions
were explored using post hoc contrasts from the package emmeans (Lenth, 2020) and a
Bonferroni adjustment for multiple comparisons.
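The Bonferroni adjustment applied to these post hoc contrasts amounts to multiplying each p value by the number of comparisons and capping the result at 1. A one-line Python sketch of that adjustment (an illustration of the correction itself, not the emmeans internals):

```python
def bonferroni(p_values):
    """Bonferroni adjustment: scale each p value by the number of
    comparisons in the family, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three hypothetical pairwise comparisons
adjusted = bonferroni([0.01, 0.04, 0.5])
```

Here the family of three comparisons turns a raw p of .01 into an adjusted p of .03, and any raw p above 1/3 is capped at 1.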
For the interpretation task and accuracy on the SPR comprehension questions, the
analyses presented here focus only on OVS sentences and production of the accusative
case.5 For the picture description task, analyses were conducted on production of the
accusative case marker. Analyses were performed using the glmer() function in the lme4
package in R using a logit link binomial error distribution. Models were then passed to

4. A separate analysis was conducted in which all items were included. The overall pattern of results for that
analysis did not differ from the results presented here.
5. Descriptive results for SVO sentences are presented, though not included in the statistical analyses
presented here. As can be seen, all learners were highly accurate with SVO sentences and the nominative case
throughout the experiment. A separate analysis using the same GLMM procedure can be found in the
supplemental materials. This analysis confirmed that there were no effects of Group or interactions contain-
ing the factor Group (all p > .05) in any of these measures.




the Anova() function in the car package using contrast coding and type three sums of
squares to compute p values for fixed effects. The maximal model included fixed effects
of Time (Pretest vs. Posttest) and Group (PI vs. PI+P), and the Time × Group interaction,
with by-participant random slopes and intercepts for Time plus the correlation between
slopes and intercepts.
Analyses of RTs from the SPR task were conducted using linear mixed-effects models of
log-transformed RTs. Separate analyses were performed for each segment, and for
Masculine-First and Masculine-Second sentences, as the critical regions differed between
them. Models were fit using the mixed() function in the R package afex6 (Singmann
et al., 2021). The maximal model included the fixed effects of Time (Pretest vs. Posttest),
Group (PI vs. PI+P), and Word Order (SVO vs. OVS), and the interactions between them.
The random effects included by-subject and by-item random intercepts and random
slopes for Time, Word Order, and the Time × Word Order interaction plus correlations among slopes and intercepts.

Results
Sentence Comprehension
The descriptive statistics for the interpretation task are presented in Table 2. The
maximal GLMM model yielded no effect for Group (χ2(1) = 0.49, p = .484), but did
show a main effect of Time (χ2(1) = 75.16, p < .001) and a marginally significant Time ×
Group interaction (χ2(1) = 3.74, p = .053). Follow-up pairwise comparisons indicated
that both groups made significant gains from pretest to posttest (both p < .001), and that
there were no significant differences between the groups either at pretest (p = .247) or at
posttest (p = .121).

Written Production
The group means and standard deviations for the production task are shown in Table 3.
The maximal GLMM model yielded a main effect for Time (χ2(1) = 12.24, p < .001), but
not for Group (χ2(1) = 0.04, p = .839) or for the Time × Group interaction (χ2(1) = 0.18,
p = .668). Pairwise comparisons indicated that both groups improved from pretest to
posttest (p < .001).

Table 2. Descriptive statistics for sentence interpretation task (maximum score of six)

              SVO Sentences                        OVS Sentences
              M (SD)       95% CI      Mdn  IQR    M (SD)       95% CI      Mdn  IQR
PI
  Pretest     3.54 (0.80)  3.21, 3.87  4    1      0.67 (0.62)  0.42, 0.93  1    1
  Posttest    3.68 (0.69)  3.40, 3.96  4    0.5    3.40 (0.91)  3.02, 3.78  4    1
PI+P
  Pretest     3.39 (0.88)  3.05, 3.73  4    1      0.89 (0.79)  0.59, 1.20  1    1.75
  Posttest    3.68 (0.72)  3.40, 3.96  4    0      2.96 (1.20)  2.50, 3.43  3    1

Note: IQR = interquartile range. Values for the 95% confidence interval are bootstrapped using one thousand samples.
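The bootstrapped intervals described in the table note can be sketched as a generic percentile bootstrap with 1,000 resamples. The article does not publish its bootstrap code; the scores and seed below are made up for illustration.

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=1000, seed=42):
    """Percentile-bootstrap 95% CI for a mean, as in the table notes."""
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    # 2.5th and 97.5th percentiles of the bootstrap distribution
    return boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]

# Hypothetical interpretation scores (0-6) for one group at one test time.
scores = [4, 3, 4, 2, 5, 4, 3, 4, 6, 3, 4, 5]
lo, hi = bootstrap_ci(scores)
```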

[6] The afex package acts as a wrapper for the package lme4 (Bates et al., 2015).



360 Nick Henry
Table 3. Descriptive statistics for written production task (ratio of correct to obligatory occasions)

              Nominative Forms                       Accusative Forms
              M (SD)       95% CI      Mdn   IQR    M (SD)       95% CI      Mdn   IQR
PI
  Pretest     0.95 (0.12)  0.90, 1.00  1.00  0.00   0.46 (0.40)  0.30, 0.63  0.50  0.88
  Posttest    0.99 (0.05)  0.97, 1.01  1.00  0.00   0.85 (0.32)  0.72, 0.98  1.00  0.00
PI+P
  Pretest     0.96 (0.19)  0.88, 1.03  1.00  0.00   0.43 (0.46)  0.25, 0.60  0.13  1.00
  Posttest    1.00 (0.00)  1.00, 1.00  1.00  0.00   0.85 (0.31)  0.73, 0.97  1.00  0.13

Note: Values for the 95% confidence interval are bootstrapped using one thousand samples.

Table 4. Descriptive statistics for SPR comprehension questions (percentage of correct answers)

              Pretest                                  Posttest
              M (SD)        95% CI        Mdn    IQR   M (SD)       95% CI        Mdn    IQR
PI
  Total       72.99 (4.73)  70.83, 75.14  73.10  6.70  76 (5.11)    73.67, 78.32  76.90  6.65
  SVO         0.78 (0.21)   0.68, 0.87    0.83   0.42  0.80 (0.15)  0.74, 0.87    0.83   0.25
  OVS         0.26 (0.20)   0.17, 0.35    0.17   0.25  0.44 (0.24)  0.33, 0.55    0.50   0.33
PI+P
  Total       0.74 (0.05)   0.72, 0.76    0.73   0.08  0.76 (0.05)  0.74, 0.78    0.77   0.06
  SVO         0.80 (0.16)   0.73, 0.86    0.83   0.33  0.76 (0.19)  0.69, 0.83    0.83   0.17
  OVS         0.27 (0.23)   0.18, 0.36    0.17   0.17  0.37 (0.24)  0.28, 0.46    0.33   0.33

Note: Values for the 95% confidence interval are bootstrapped using one thousand samples; values for SVO and OVS sentences reflect only those items in which grammatical role assignment was tested in the comprehension question.

SPR Comprehension
The means and standard deviations for comprehension accuracy in the SPR task are
given in Table 4. The maximal GLMM model yielded a main effect for Time (χ2(1) =
12.15, p < .001). There was no effect for Group (χ2(1) = 0.65, p = .421) or the Time × Group interaction (χ2(1) = 0.51, p = .474). Pairwise comparisons for Time indicated
that the participants improved from pretest to posttest (p = .003).
Although the PI group outscored the PI+P group on the posttest, both groups failed to reach 50% accuracy on the OVS items in the posttest. To assess whether individual participants had abandoned a strict subject-first strategy, the number of participants reaching 50% accuracy on OVS sentences was counted for each group, following Henry (2022). In the PI group, this number rose from 6 participants (24%) on the pretest to 15 (60%) on the posttest. In the PI+P group, it rose from 6 (23.1%) to 13 participants (50%).
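The counts above translate into the reported percentages as follows; the group sizes (25 and 26) are inferred from those percentages rather than stated in this passage.

```python
# Participants reaching 50% accuracy on OVS items (counts from the text).
groups = {
    "PI":   {"n": 25, "pretest": 6, "posttest": 15},
    "PI+P": {"n": 26, "pretest": 6, "posttest": 13},
}

for name, g in groups.items():
    pre_pct = 100 * g["pretest"] / g["n"]
    post_pct = 100 * g["posttest"] / g["n"]
    print(f"{name}: {pre_pct:.1f}% -> {post_pct:.1f}%")
# PI: 24.0% -> 60.0%
# PI+P: 23.1% -> 50.0%
```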

SPR Reading Times


The analysis of RTs was conducted separately for Masculine-First and Masculine-
Second sentences and for each region. For Masculine-First sentences, the critical and
spillover segments were Segments 1 and 2. For Masculine-Second sentences, they were
Segments 3 and 4.



Table 5. Mean reading times (SDs) by group and condition for SPR task, Masculine-First items

Segment               NP1          V           NP2          Prep        Final
Processing Instruction
  SVO, Pretest        1062 (535)   898 (436)   957 (371)    554 (181)   918 (377)
  OVS, Pretest        1172 (632)   930 (508)   1022 (473)   540 (210)   888 (385)
  SVO, Posttest       1126 (682)   825 (402)   1055 (509)   517 (156)   861 (387)
  OVS, Posttest       1253 (658)   872 (399)   1234 (662)   500 (110)   872 (343)
Processing Instruction with Prosody
  SVO, Pretest        998 (286)    941 (317)   914 (254)    541 (147)   825 (245)
  OVS, Pretest        1054 (319)   984 (313)   962 (318)    522 (117)   935 (316)
  SVO, Posttest       959 (289)    840 (277)   958 (403)    528 (129)   790 (304)
  OVS, Posttest       1081 (437)   781 (209)   1016 (597)   532 (127)   892 (362)

RTs for Masculine-First Sentences


The mean RTs and standard deviations for Masculine-First sentences at each segment
are displayed by group and condition in Table 5. A full set of descriptive statistics for
RTs and the log-transformed RTs are found in Tables S5 and S6 in the supplemental
materials.
For the critical segment, Segment 1 (the first NP), the maximal model did not converge. The final model[7] (Table 6) yielded a significant main effect for Word Order. Pairwise comparisons for the SVO-OVS contrast showed that OVS sentences had higher RTs than SVO sentences (p = .040). The estimated marginal means (Table 7) indicate that this effect was driven by larger differences between SVO and OVS sentences in the posttest than in the pretest.
For the spillover segment, Segment 2 (the verb), the maximal model did not converge. The final model[8] (Table 6) yielded only one significant effect, a main effect for Time, which indicated that participants read Segment 2 faster in the posttest than in the pretest (p = .020).

Table 6. Model results for Segments 1 and 2, Masculine-First Sentences

                              Segment 1               Segment 2
Effect                        Df        F     p       Df         F     p
Time                          1, 44.11  0.08  .783    1, 33.17   5.94  .020*
Word Order                    1, 19.76  4.83  .040*   1, 723.66  0.09  .759
Group                         1, 45.36  0.30  .585    1, 43.87   0.33  .569
Time × Word Order             1, 45.04  1.53  .223    1, 711.85  0.15  .696
Time × Group                  1, 44.01  0.71  .403    1, 47.23   0.23  .631
Word Order × Group            1, 46.04  0.54  .465    1, 708.76  0.25  .615
Time × Word Order × Group     1, 44.14  0.26  .613    1, 702.36  1.33  .250

[7] The final model was: RT.log ~ Session * WO * Group + (Session * WO || Subject) + (WO | Item). This includes the full fixed-effects structure with by-subject and by-item random effects. “Double bar” notation (i.e., ||) indicates that the random effects did not include the correlation between random intercepts and slopes. See Singmann and Kellen (2019) for a guide to reading mixed-model notation.
[8] The final model was: RT.log ~ Session * WO * Group + (Session | Subject) + (Session || Item).



Table 7. Estimated marginal means for Word Order by Time on Segment 1, Masculine-First sentences

Word Order   Time       EM Mean   SE     df      Lower CL   Upper CL
SVO          Pretest    2.96      0.03   89.22   2.91       3.01
OVS          Pretest    2.99      0.03   90.56   2.93       3.05
SVO          Posttest   2.95      0.03   90.30   2.90       3.01
OVS          Posttest   3.01      0.03   85.37   2.95       3.07

RTs for Masculine-Second Sentences


The mean RTs and standard deviations for Masculine-Second sentences at each
segment are displayed by group and condition in Table 8. A full set of descriptive
statistics for RTs and the log-transformed RTs are found in Tables S7 and S8 in the
supplemental materials.
For the critical segment, Segment 3 (the second NP), the maximal model did not converge. The final model[9] (Table 9) yielded no significant effects.
For the spillover segment, Segment 4 (the preposition), the maximal model did not converge. The final model[10] (Table 9) yielded a Word Order × Group interaction. Follow-up pairwise comparisons indicated that participants in the PI+P group had higher RTs in OVS sentences than in SVO sentences, whereas the PI group did not

Table 8. Mean reading times (SDs) by group and condition for SPR task, Masculine-Second items

Segment               NP1          V           NP2          Prep        Final
Processing Instruction
  SVO, Pretest        1153 (484)   988 (552)   959 (412)    541 (171)   895 (333)
  OVS, Pretest        1220 (588)   977 (592)   922 (502)    522 (166)   932 (404)
  SVO, Posttest       1117 (584)   813 (403)   1028 (484)   548 (199)   790 (297)
  OVS, Posttest       1103 (534)   778 (421)   968 (670)    511 (189)   901 (446)
Processing Instruction with Prosody
  SVO, Pretest        1194 (398)   938 (379)   986 (351)    510 (104)   912 (265)
  OVS, Pretest        1100 (415)   968 (469)   881 (297)    555 (131)   910 (387)
  SVO, Posttest       990 (217)    819 (289)   998 (465)    503 (127)   860 (332)
  OVS, Posttest       1065 (443)   856 (410)   907 (432)    682 (569)   947 (453)

Table 9. Model results for Segments 3 and 4, Masculine-Second Sentences

                              Segment 3               Segment 4
Effect                        Df        F     p       Df         F     p
Time                          1, 39.02  0.01  .929    1, 47.68   0.04  .841
Word Order                    1, 18.40  2.67  .120    1, 12.89   1.85  .197
Group                         1, 45.95  0.06  .801    1, 44.10   0.35  .558
Time × Word Order             1, 47.07  0.20  .655    1, 699.53  0.23  .634
Time × Group                  1, 45.65  0.00  .980    1, 47.70   0.03  .858
Word Order × Group            1, 79.84  0.01  .926    1, 694.38  9.83  .002**
Time × Word Order × Group     1, 45.48  0.19  .669    1, 697.83  1.28  .258

[9] The final model was: RT.log ~ Session * WO * Group + (Session * WO | Subject) + (Session + WO || Item).
[10] The final model was: RT.log ~ Session * WO * Group + (Session | Subject) + (Session || Item).



(p = .005). Estimated marginal means (Table 10) suggest that this effect was driven by the greater differences between SVO and OVS sentences for the PI+P group on the posttest.

Table 10. Estimated marginal means for Segment 4, Masculine-Second sentences

Word Order   Time   EM Mean   SE     df      Lower CL   Upper CL
Processing Instruction
  SVO        Pre    2.70      0.02   73.76   2.65       2.74
  OVS        Pre    2.69      0.03   89.77   2.63       2.74
  SVO        Post   2.70      0.03   72.32   2.65       2.75
  OVS        Post   2.68      0.03   82.77   2.62       2.73
Processing Instruction with Prosody
  SVO        Pre    2.69      0.02   74.16   2.65       2.73
  OVS        Pre    2.72      0.02   77.61   2.67       2.77
  SVO        Post   2.67      0.02   69.74   2.63       2.72
  OVS        Post   2.74      0.03   81.29   2.68       2.79

Discussion
To explore whether prosodic cues influence how learners process case markers in L2
German, the present study compared the effects of a traditional PI training (PI) to PI
with prosodic cues (PI+P). The assessment measures included offline comprehension
and production tasks along with an SPR task to measure changes in online processing.

Offline Comprehension and Production


The offline comprehension and production tasks showed that both groups improved
their comprehension accuracy for OVS sentences and the production of the accusative
case marker den. These results support previous findings in the literature that PI helps
learners develop form-meaning connections (e.g., Benati, 2001; VanPatten & Cadierno,
1993), resulting in knowledge that is useful for both comprehension and production.[11]
Just as importantly, the PI and PI+P groups improved to a similar degree on offline
measures, replicating results from Henry, Jackson, et al. (2017), who found that, when
learners received explicit instruction, as they did in this study, prosody did not impact
learner performance on offline comprehension and production tasks. Thus, the present
research supports their suggestion that L2 learners do not treat intonation and stress
cues like lexical-semantic cues during PI, and that prosodic cues do not block attention
to morphosyntactic cues.

Online Comprehension
The SPR task explored how learners comprehended sentences in real time. It should
first be noted that participants displayed a very strong first-noun strategy in the pretest,
interpreting about 80% of the SVO items correctly, but only 25% of the OVS items.

[11] Note that this study is part of a larger project that included a “traditional instruction” control. This control did not improve their comprehension of OVS sentences after training, suggesting that improvement stemmed from PI and not from other factors (see Henry, 2022).




While the results of the comprehension questions in the SPR task showed that neither
group reached 50% accuracy on OVS items in this task, both groups did improve
accuracy on these items. A separate analysis of individuals showed that, in both groups,
more than twice as many participants reached 50% accuracy on the posttest. This
suggests that, although learners were largely inaccurate on these questions, training did
attenuate their tendency to rely on a strict first-noun strategy. Despite this apparent
shift, these results stand in stark contrast to the offline comprehension measure, in
which participants interpreted OVS items with 85% and 74% accuracy in the PI and
PI+P groups, respectively. The difference in scores likely stems from the increased
memory load involved in SPR tasks coupled with the fact that participants could not
reread any portion of the sentence and had a reduced capacity to apply explicit
knowledge.
With respect to the learners’ online comprehension patterns, there are several
important results. The analysis of RTs showed that the participants from both groups
had elevated RTs on OVS sentences in Masculine-First sentences. Further analysis
indicated that this effect was driven by higher RTs in OVS sentences on the posttest,
although there was no Word Order × Time interaction. As Henry (2022) discusses, this
pattern of results indicates that PI had an important, but somewhat limited effect on
learners’ processing of case markers. Nonetheless, this provides evidence that partic-
ipants were better able to identify, extract, and integrate case cues after training,
representing a movement toward the nativelike processing pattern.
Despite similarities between the groups’ processing of Masculine-First sentences, the
two groups’ processing of OVS sentences in Masculine-Second sentences diverged:
Results showed that only the PI+P group also had elevated RTs on OVS sentences in Masculine-Second conditions. This effect provides some indication that the PI+P
group actively processed case markers throughout the entire sentence rather than only
processing the first NP. It is noteworthy that this effect was delayed until the spillover
region, which might indicate less automaticity, stemming from a reduction in available
cognitive resources as the sentence is processed. Alternatively, it could be that Masculine-Second sentences are processed less automatically because they are harder to
process, for example, because learners tend to process the initial feminine or neuter
noun as nominative, even though it is ambiguous. Despite the lack of automaticity,
however, it seems that the online effects of the PIþP training were more robust than for
the PI training, promoting processing of the target form in all conditions.

The Role of Prosody in Online Processing


Given that learners in the PI+P group showed more robust effects in the SPR task, it seems that prosody played a facilitative role in online processing. Critically, although the PI+P group received aural input that contained a coalition of morphosyntactic and prosodic cues during training, they received no aural input in the SPR task. Thus, the facilitative effect observed in the SPR task implies that the PI+P group was able to use the morphosyntactic cues to activate and apply the appropriate prosodic structures
covertly (see Féry, 2005; Fodor, 2002). This covert activation allowed learners to use
the coalition between prosodic and morphosyntactic structures additively during silent
reading, facilitating processing.
These results support emerging evidence that prosodic information supports syn-
tactic processing (Henry et al., 2020), helping learners identify important cues to word
order, create form-meaning mappings, and process those forms online. While previous




studies, such as Henry et al. (2020), have shown that prosody plays a role in predictive
processing among intermediate high and advanced immersed learners of German, the
results of this study are noteworthy because (a) they show that prosody can support
morphological processing at relatively low proficiency levels, (b) this can be trained in a
relatively short period, and (c) training with aural stimuli transfers to the written
modality. It is not yet known how durable these effects are; as Henry, Jackson, et al. (2017) show, a single training session is likely not enough to effect long-lasting changes. Nevertheless, these results suggest clear implications for the use of PI and PI with prosody in the short term.
The results also lend critical support to proposals in the sentence processing
literature that emphasize the importance of activating prosodic structures during L2
sentence processing. Dekydtspotter et al. (2006), for example, argue that the use of
nontarget prosodic structures could be one reason that L2 learners have difficulty
processing syntactic structures (Clahsen & Felser, 2006a, 2006b). The present study
provides some evidence to support this hypothesis and suggests that the ability to
connect syntactic structures to the appropriate prosodic representations does indeed
play an important role in the integration of morphosyntactic information online.
Notably, these findings also demonstrate that the ability to impose the correct prosodic
pattern is not only important for structural ambiguity and attachment preferences but
also for morphosyntactic features, like case markers, that are involved in structure
building processes during real-time L2 processing.

Limitations and Directions for Future Research


The present study has several limitations that suggest areas for future research. First,
this study did not explicitly manipulate the presence or absence of EI as has been done
in previous studies on prosody in German (Henry, Jackson, et al., 2017) and on online processing in PI (e.g., Ito & Wong, 2019). While this study represents an important first
step in testing the role of prosody in PI and online processing, future research could
further elaborate its findings by exploring the independent contribution of prosody and
EI. Second, the use of SPR may be seen as a limitation, in particular because SPR requires a higher cognitive load than other tasks. Thus, it is difficult to know whether the apparent advantage for the PI+P training appears because of, or in spite of, the use
of SPR, and whether similar results would be obtained using a different online measure
or whether cognitive differences might play a role (see Dracos & Henry, 2021). Finally,
further research should investigate whether the effects of prosody can be traced to its
participation in a coalition of cues (as in Henry, Hopp, et al., 2017) or rather stem from
increased salience of the target form. In that respect, it would be useful to investigate the
relationship between prosody and input enhancement, and to what extent aural and
visual input enhancement affects online processing (see Indrarathne & Kormos, 2017).

Conclusions
The present study provides evidence that the use of prosodic cues during training
facilitates (morpho)syntactic processing. Thus, it adds to research suggesting that
prosodic cues play a significant role in sentence processing, especially when they form
a coalition with other cues. To my knowledge, this is the only research that uses online
methodologies to show such effects with L2 learners who have recently been trained
with prosodic cues. To the extent that prosody has been underresearched in the L2
acquisition research, it has also been largely ignored in the L2 classroom. This study




suggests that the inclusion of prosodic training may not only help learners with fluency
and pronunciation but also with the acquisition and processing of (morpho)syntactic
structures (see also Henry, Jackson, et al., 2017). Finally, it should be noted that the
present study highlights the utility of combining approaches and methods common in
psycholinguistics with instructed L2 acquisition research. Through a psycholinguistic
investigation of classroom instruction, the study informs both classroom methods and
psycholinguistic theory. While this is not the first study to do so, it represents an
important and growing part of L2 acquisition research.
Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/
10.1017/S0272263122000092.

Acknowledgments. This research was supported in part by a National Science Foundation Dissertation
Improvement Grant (BCS-1252109). Portions of this research were presented at the 2013 Second Language
Research Forum in Atlanta, GA. I would like to thank Carrie Jackson for her endless personal and
professional support during the completion of this project. In addition, many thanks are owed to Richard
Page, Mike Putnam, Giuli Dussias, Bill VanPatten, Abby Massaro, Adam Baker, Melisa Dracos, Ines Martin,
Courtney Johnson Fowler, and to members of the Penn State Center for Language Science, all of whom
offered invaluable support and feedback on the present study.

Data Availability Statement. The experiment in this article earned an Open Materials badge for trans-
parent practices. The materials are available at https://doi.org/10.18738/T8/58HF6T

References
Bates, D., Mächler, M., Bolker, B. M., & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis
testing: Keep it maximal. Journal of Memory and Language, 68, 255–278. https://doi.org/10.1016/
j.jml.2012.11.001
Benati, A. (2001). A comparative study of the effects of processing instruction and output-based instruction
on the acquisition of the Italian future tense. Language Teaching Research, 5, 95–127. https://doi.org/
10.1177/136216880100500202
Benati, A. (2020a). An eye-tracking study on the effects of structured input and traditional instruction on the
acquisition of English passive forms. Instructed Second Language Acquisition, 4, 158–179.
Benati, A. (2020b). The effects of structured input and traditional instruction on the acquisition of the English
causative passive forms: An eye-tracking study measuring accuracy in responses and processing patterns.
Language Teaching Research. Advance online publication. https://doi.org/10.1177/1362168820928577
Braun, B. (2006). Phonetics and phonology of thematic contrast in German. Language and Speech, 49,
451–493. http://www.ncbi.nlm.nih.gov/pubmed/17326588
Cadierno, T. (1995). Formal instruction from a processing perspective: An investigation into the Spanish past
tense. The Modern Language Journal, 79, 179–193.
Carroll, S. (2004). Some general and specific comments on input processing and processing instruction. In B.
VanPatten (Ed.), Processing Instruction: Theory, Research, and Commentary (pp. 293–309). Lawrence
Erlbaum.
Carroll, S. (2006). Salience, awareness and SLA. In M. G. O’Brien, C. Shea, & J. Archibald (Eds.), Proceedings
of the 8th Generative Approaches to Second Language Acquisition Conference (GASLA 2006) (Issue Gasla,
pp. 17–24). Cascadilla Proceedings Project.
Chiuchiù, G., & Benati, A. (2020). A self-paced-reading study on the effects of structured input and textual
enhancement on the acquisition of the Italian subjunctive of doubt. Instructed Second Language Acquisition, 4, 235–257. https://doi.org/10.1558/isla.40659
Clahsen, H., & Felser, C. (2006a). Continuity and shallow structures in language processing. Applied
Psycholinguistics, 27, 107–126.



Clahsen, H., & Felser, C. (2006b). Grammatical processing in language learners. Applied Psycholinguistics, 27,
3–42. https://doi.org/10.1017/S0142716406060024
Dekydtspotter, L., Donaldson, B., Edmonds, A. C., Liljestrand Fultz, A., & Petrush, R. A. (2008). Syntactic and
prosodic computations in the resolution of relative clause attachment ambiguity by English-French
learners. Studies in Second Language Acquisition, 30, 453–480. https://doi.org/10.1017/S0272263108080728
Dekydtspotter, L., Schwartz, B. D., & Sprouse, R. A. (2006). The comparative fallacy in L2 processing research.
Proceedings of the 8th Generative Approaches to Second Language Acquisition Conference, 33–40.
Dittmar, M., Abbot-Smith, K., Lieven, E., & Tomasello, M. (2008). German children’s comprehension of
word order and case marking in causative sentences. Child Development, 79, 1152–1167.
Dracos, M., & Henry, N. (2018). The effects of task-essential training on L2 processing strategies and the
development of Spanish verbal morphology. Foreign Language Annals, 51, 344–368. https://doi.org/
10.1111/flan.12341
Dracos, M., & Henry, N. (2021). The role of task-essential training and working memory in offline and online
morphological processing. Languages, 6, 24.
Farley, A. P. (2004). Structured input: Grammar instruction for the acquisition-oriented classroom. McGraw
Hill.
Fernández, E. M. (2010). Reading aloud in two languages: the interplay of syntax and prosody. In B.
VanPatten & J. Jegerski (Eds.), Research in second language processing and parsing (pp. 297–320). John
Benjamins.
Féry, C. (2005). Laute und leise Prosodie [Loud and soft prosody]. IDS Jahrbuch, 41, 1–20.
Fodor, J. D. (1998). Learning to parse? Journal of Psycholinguistic Research, 27, 285–319.
Fodor, J. D. (2002). Prosodic disambiguation in silent reading. Proceedings of NELS, 32, 113–132. http://
www.llf.cnrs.fr/Gens/Abeille/nels.final.doc
Foltz, A. (2021). Using prosody to predict upcoming referents in the L1 and the L2. Studies in Second
Language Acquisition, 43, 753–780. https://doi.org/10.1017/S0272263120000509
Grice, M., & Baumann, S. (2002). Deutsche intonation und GToBI. Linguistische Berichte, 191, 267–298.
http://www.coli.uni-saarland.de/publikationen/softcopies/Grice:2002:DIG.pdf
Grünloh, T., Lieven, E., & Tomasello, M. (2011). German children use prosody to identify participant roles in
transitive sentences. Cognitive Linguistics, 22, 393–419. https://doi.org/10.1515/cogl.2011.015
Hemforth, B., Konieczny, L., & Strube, G. (1993). Incremental syntax processing and parsing strategies. In
Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 539–545). Lawrence
Erlbaum.
Henry, N. (2022). The offline and online effects of Processing Instruction. Manuscript submitted for
publication.
Henry, N., Hopp, H., & Jackson, C. N. (2017). Cue additivity in predictive processing of word order in
German. Language, Cognition and Neuroscience, 32, 1229–1249. https://doi.org/10.1080/
23273798.2017.1327080
Henry, N., Jackson, C. N., & DiMidio, J. (2017). The role of prosody and explicit instruction in Processing
Instruction. Modern Language Journal, 101, 1–21.
Henry, N., Jackson, C. N., & Hopp, H. (2020). Cue coalitions and additivity in predictive processing: The
interaction between case and prosody in L2 German. Second Language Research. Advance online
publication. https://doi.org/10.1177/0267658320963151
Indrarathne, B., & Kormos, J. (2017). Attentional processing of input in explicit and implicit conditions.
Studies in Second Language Acquisition, 39, 401–430. https://doi.org/10.1017/S027226311600019X
Issa, B. I. (2019). Examining the relationships among attentional allocation, working memory, and second
language development. The Routledge Handbook of Second Language Research in Classroom Learning,
464–479. Routledge. https://doi.org/10.4324/9781315165080-32
Issa, B. I., & Morgan-Short, K. (2019). Effects of external and internal attentional manipulations on second
language grammar development. Studies in Second Language Acquisition, 41, 389–417. https://doi.org/
10.1017/S027226311800013X



Ito, K., & Wong, W. (2019). Processing instruction and the effects of input modality and voice familiarity on
the acquisition of the French causative construction. Studies in Second Language Acquisition, 41, 443–468.
https://doi.org/10.1017/S0272263118000281
Jackson, C. N. (2007). The use and non-use of semantic information, word order, and case markings during
comprehension by L2 learners of German. The Modern Language Journal, 91, 418–432.
Juffs, A., & Harrington, M. (2011). Aspects of working memory in L2 learning. Language Teaching, 44,
137–166. https://doi.org/10.1017/S0261444810000509
Just, M. A., Carpenter, P. A., & Woolley, J. D. (1982). Paradigms and processes in reading comprehension.
Journal of Experimental Psychology: General, 111, 228–238. http://www.ncbi.nlm.nih.gov/pubmed/6213735
Keating, G. D. (2009). Sensitivity to violations of gender agreement in native and nonnative Spanish: An eye-
movement investigation. Language Learning, 59, 503–535. https://doi.org/10.1111/j.1467-
9922.2009.00516.x
Kempe, V., & MacWhinney, B. (1998). The acquisition of case marking by adult learners of Russian and
German. Studies in Second Language Acquisition, 20, 543–587. https://doi.org/10.1017/
S0272263198004045
Langus, A., Marchetto, E., Bion, R. A. H., & Nespor, M. (2012). Can prosody be used to discover hierarchical
structure in continuous speech? Journal of Memory and Language, 66, 285–306. https://doi.org/10.1016/
j.jml.2011.09.004
Lee, J. F., Malovrh, P. A., Doherty, S., & Nichols, A. (2020). A self-paced reading (SPR) study of the effects of
processing instruction on the L2 processing of active and passive sentences. Language Teaching Research.
Advance online publication. https://doi.org/10.1177/1362168820914025
Lenth, R. V. (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.5.3.
https://CRAN.R-project.org/package=emmeans
LoCoco, V. (1987). Learner comprehension of oral and written sentences in German and Spanish: The
importance of word order. In B. VanPatten, T. Devorak, & J. F. Lee (Eds.), Foreign language learning: A
research perspective (pp. 119–129). Newbury House.
Loschky, L., & Bley-Vroman, R. (1993). Grammar and task-based methodology. In G. Crookes & S. M. Gass
(Eds.), Tasks and language learning: integrating theory and practice (pp. 123–167). Multilingual Matters.
MacWhinney, B. (2001). The Competition Model: the input, the context, and the brain. In P. Robinson (Ed.),
Cognition and Second Language Instruction (pp. 69–90). Cambridge University Press.
Marinis, T., Roberts, L., Felser, C., & Clahsen, H. (2005). Gaps in second language sentence processing.
Studies in Second Language Acquisition, 27, 53–78.
Marsden, E., & Chen, H. Y. (2011). The roles of structured input activities in processing instruction and the
kinds of knowledge they promote. Language Learning, 61, 1058–1098. https://doi.org/10.1111/j.1467-
9922.2011.00661.x
Martin, I. A., & Jackson, C. N. (2016). Pronunciation training facilitates the learning and retention of L2
grammatical structures. Foreign Language Annals, 49, 658–676. https://doi.org/10.1111/flan.12224
McManus, K., & Marsden, E. (2017). L1 explicit instruction can improve L2 online and offline performance.
Studies in Second Language Acquisition, 39, 459–492. https://doi.org/10.1017/s027226311600022x
McManus, K., & Marsden, E. (2018). Online and offline effects of L1 practice in L2 grammar learning: A
partial replication. Studies in Second Language Acquisition, 40, 459–475. https://doi.org/10.1017/
S0272263117000171
Miyake, A., & Friedman, N. P. (1998). Individual differences in second language proficiency: working
memory as language aptitude. In A. S. Healy & L. E. Bourne (Eds.), Foreign Language Learning:
Psycholinguistic Studies on Training and Retention (pp. 339–364). Lawrence Erlbaum.
Nespor, M., van de Vijver, R., Schraudolf, H., Shulka, M., Cinzia, A., & Donati, C. (2008). Different phrasal
prominence realizations in VO and OV languages. Lingue e Linguaggio, 2, 1–29.
Schlesewsky, M., Fanselow, G., Kliegl, R., & Krems, J. (2000). The subject preference in the processing of
locally ambiguous wh-questions in German. In B. Hemforth & L. Konieczny (Eds.), German Sentence
Processing (pp. 65–93). Kluwer.
Schneider, W., Eschmann, A., & Zuccolotto, A. (2012). E-Prime v 2.0.10. Psychology Software Tools Inc.
Schriefers, H., Friederici, A. D., & Kühn, K. (1995). The processing of locally ambiguous relative clauses in
German. Journal of Memory and Language, 34, 499–520. http://www.sciencedirect.com/science/article/
pii/S0749596X85710236

https://doi.org/10.1017/S0272263122000092 Published online by Cambridge University Press


The additive use of prosody and morphosyntax in L2 German 369
Singmann, H. (2021). Mixed model reanalysis of RT data. https://cran.r-project.org/web/packages/afex/
vignettes/afex_mixed_example.html
Singmann, H., Bolker, B., Westfall, J., Aust, F., & Ben-Schachar, M. (2021). afex: Analysis of Factorial
Experiments. https://cran.r-project.org/package=afex
Singmann, H., & Kellen, D. (2019). An introduction to mixed models for experimental psychology. In D. H.
Spieler & E. Schumacher (Eds.), New Methods in Cognitive Psychology (pp. 4–31). Psychology Press.
https://doi.org/10.4324/9780429318405-2
Steinhauer, K. (2003). Electrophysiological correlates of prosody and punctuation. Brain and Language, 86,
142–164. https://doi.org/10.1016/S0093-934X(02)00542-4
Steinhauer, K., Alter, K., & Friederici, A. D. (1999). Brain potentials indicate immediate use of prosodic cues
in natural speech processing. Nature Neuroscience, 2, 191–196. https://doi.org/10.1038/5757
University of Wisconsin Testing and Evaluation. (2006). German placement test [Assessment instrument].
VanPatten, B. (1984). Learners’ comprehension of clitic pronouns: More evidence for a word-order strategy.
Hispanic Linguistics, 1, 57–67.
VanPatten, B. (2004a). Input Processing in second language acquisition. In B. VanPatten (Ed.), Processing
Instruction: Theory, Research, and Commentary (pp. 5–31). Lawrence Erlbaum.
VanPatten, B. (Ed.). (2004b). Processing Instruction: Theory, Research, and Commentary. Lawrence Erlbaum.
VanPatten, B. (2015). Input processing in adult SLA. In B. VanPatten & J. Williams (Eds.), Theories in Second
Language Acquisition (2nd ed., pp. 113–134). Lawrence Erlbaum.
VanPatten, B., & Cadierno, T. (1993). Explicit instruction and input processing. Studies in Second Language
Acquisition, 15, 225.
VanPatten, B., Collopy, E., Price, J. E., Borst, S., & Qualin, A. (2013). Explicit information, grammatical
sensitivity, and the first-noun principle: A cross-linguistic study in processing instruction. The Modern
Language Journal, 97, 506–527. https://doi.org/10.1111/j.1540-4781.2013.12007.x
VanPatten, B., & Houston, T. (1998). Contextual effects in processing L2 input sentences. Spanish Applied
Linguistics, 2, 53–70.
VanPatten, B., & Uludag, O. (2011). Transfer of training and processing instruction: From input to output.
System, 39, 44–53. https://doi.org/10.1016/j.system.2011.01.013
Wakefield, J., Doughtie, E., & Yom, B. (1974). The identification of structural components of an unknown
language. Journal of Psycholinguistic Research, 3, 261–269. http://scholar.google.com/scholar?hl=en&
btnG=Search&q=intitle:NoþTitle#0
Waters, G. S., & Caplan, D. (1996). The measurement of verbal working memory capacity and its relation to
reading comprehension. The Quarterly Journal of Experimental Psychology, 49, 51–79.
Weber, A., Grice, M., & Crocker, M. W. (2006). The role of prosody in the interpretation of structural
ambiguities: A study of anticipatory eye movements. Cognition, 99, 63–72. https://doi.org/10.1016/
j.cognition.2005.07.001
Wong, W. (2004). The nature of Processing Instruction. In B. VanPatten (Ed.), Processing Instruction:
Theory, Research, and Commentary (pp. 33–67). Lawrence Erlbaum.
Wong, W., & Ito, K. (2018). The effects of processing instruction and traditional instruction on L2 online
processing of the causative construction in French: An eye-tracking study. Studies in Second Language
Acquisition, 40, 241–268. https://doi.org/10.1017/S0272263117000274

Cite this article: Henry, N. (2023). The additive use of prosody and morphosyntax in L2 German. Studies in
Second Language Acquisition, 45, 348–369. https://doi.org/10.1017/S0272263122000092

https://doi.org/10.1017/S0272263122000092 Published online by Cambridge University Press


Studies in Second Language Acquisition (2023), 45, 370–392
doi:10.1017/S0272263122000237

RESEARCH ARTICLE

“Bread and butter” or “butter and bread”? Nonnatives’ processing of novel lexical patterns in context
Suhad Sonbul1* , Dina Abdel Salam El-Dakhs2 , Kathy Conklin3 and Gareth Carrol4
1Umm Al-Qura University, Saudi Arabia; 2Prince Sultan University, Saudi Arabia; 3University of Nottingham, UK; 4University of Birmingham, UK
*Corresponding author. E-mail: sssonbul@uqu.edu.sa

(Received 29 August 2021; Revised 20 May 2022; Accepted 27 May 2022)

Abstract
Little is known about how nonnative speakers process novel language patterns in the input
they encounter. The present study examines whether nonnatives develop a sensitivity to
novel binomials and their ordering preference from context. Thirty-nine nonnative speakers
of English (L1 Arabic) read three short stories seeded with existing binomials (black and
white) and novel ones (bags and coats) while their eye movements were monitored. The
existing binomials appeared once in their forward (conventional) form and once in their
reversed form. The novel binomials appeared in their experimentally defined forward form
in different frequency conditions (two vs. four encounters) and once in the reversed form.
Results showed no advantage for existing binomials over their reversed forms. For the novel
binomials, the nonnative speakers read subsequent encounters significantly faster than
initial ones for both frequency conditions. More importantly, the final reversed form also
led to faster reading, suggesting that L2 speakers process the reversed form of a novel
binomial as another encounter, ignoring the established order.

Introduction
English speakers say that things go together like bread and butter, not like butter and
bread. Lexical patterns like these (often referred to as formulaic language or multiword
sequences) account for up to half of spoken discourse (Erman & Warren, 2000; Pawley &
Syder, 1983). One example of such lexical patterns is binomials, defined as “coordi-
nated word pairs whose lexical elements share the same word class” (Mollin, 2014, p. 1).
Binomials abound in English (e.g., aches and pains, fair and square, high and low, life
and death) and vary in their degree of reversibility along a cline (Malkiel, 1959). At one
end of the cline are frozen binomials, which are very clearly irreversible (e.g., hit and
run, chalk and cheese) because of their highly idiomatic meaning. However, the focus of
the present study is not on such idiomatic binomials, but rather on semantically
compositional (i.e., transparent) binomials. Transparent binomials exist along a con-
tinuum of fixedness, where components may have a preferred sequence even when the

© The Author(s), 2022. Published by Cambridge University Press.

https://doi.org/10.1017/S0272263122000237 Published online by Cambridge University Press



order could in theory be reversed without fundamentally changing the meaning (e.g.,
public and private, mother and father). Therefore, binomials have two important
properties: co-occurrence restrictions (like other lexical patterns) and configuration
restrictions, and this unique nature has led to interest in how they are processed and
acquired.
One line of research has explored the factors that determine the preferred order of
binomials, particularly in terms of diachronic changes (e.g., Goldberg & Lee, 2021;
Mollin, 2014). Goldberg and Lee (2021) found that binomials like uncles and aunts/
nephews and nieces, which were in common use prior to the 1930s, have more recently
reversed their preferred order to aunts and uncles/nieces and nephews. Goldberg and
Lee proposed several cognitive explanations for the change, including the accessibility
of the individual components of binomials in memory and their cluster strength.
Others have attributed the ordering preferences in binomials to sociocultural factors
(e.g., Mollin, 2013) and factors such as the semantic, phonological and lexical properties
of component words, as well as how much experience an individual has with the
binomial in question (Morgan & Levy, 2016).
Another research area is concerned with how binomials are acquired by native and
nonnative speakers. Some studies have addressed this question using post-treatment
tasks (Alotaibi et al., 2022) or using eye-tracking to examine the processing of novel
binomials as it unfolds in real time (Alotaibi, 2020; Conklin & Carrol, 2021). The
current study aims to contribute to research on the processing of novel binomials
(i.e., infrequent phrases which do not have a conventionalized word order) by extend-
ing the work of Conklin and Carrol (2021), who examined the processing of novel
binomials in native speakers, to a population of nonnative speakers of English. The
study investigates whether nonnative speakers show a processing sensitivity
(i.e., speeded recognition) to novel binomials in a natural reading context, as was the
case with native speakers in the original study. More specifically, we examine whether
nonnative speakers can associate pairs of words in memory and register their preferred
word order (e.g., wires and pipes instead of pipes and wires) over the course of reading
short texts seeded with the novel binomials. Thus, the current study examines L2
processing of novel binomials rather than their acquisition (i.e., there is no post-
treatment measure of knowledge).
The following sections will review two relevant strands of literature to situate the
current study: the processing of lexical patterns by native and nonnative speakers and
the acquisition of single-word vocabulary and lexical patterns by nonnative speakers.

Literature review
Processing of lexical patterns
It is well-established that native speakers recognize lexical patterns faster and process
their phrase-level meaning more easily than other nonrecurrent combinations of words
that do not show any significant degree of cohesion or fixedness. This has been shown
for idioms (e.g., Carrol & Conklin, 2020; Conklin & Schmitt, 2008; Libben & Titone,
2008; Rommers et al., 2013), phrasal verbs (e.g., Blais & Gonnerman, 2013; Matlock &
Heredia, 2002; Tiv et al., 2019) and binomials (e.g., Arcara et al., 2012; Carrol &
Conklin, 2020). The processing advantage for lexical patterns by native speakers has
prompted researchers to explore these patterns in nonnative speakers.
Several studies have explored the determinants of nonnative processing of lexical
patterns. One important factor is first language (L1)–second language (L2) similarity.


When single words are the same (or very similar) across languages (e.g., piano in both
English and French), there is considerable evidence of cross-language activation, or a
cognate effect, in nonnative lexical processing (for an overview, see van Hell & Tanner,
2012). More recently, researchers have started to examine the effect of L1-L2 similarity
on the processing of lexical patterns. This has been referred to as the congruency effect
(i.e., the availability of a literal translation equivalent). Studies have examined the
congruency effect using different types of lexical patterns including collocations (e.g.,
Wolter & Gyllstad, 2011, 2013; Yamashita & Jiang, 2010) and idioms (e.g., Carrol &
Conklin, 2014; Carrol et al., 2016; Irujo, 1986; Pritchett et al., 2016; Titone et al., 2015).
The general finding is that congruency has a clear influence on the processing of lexical
patterns in the L2 with an advantage for congruent items (L1 = L2) over incongruent
items (L1 ≠ L2).1
Another important determinant of the processing of lexical patterns is transparency.
One relevant study here is by Gyllstad and Wolter (2016), who employed a semantic
judgment task to examine how advanced L1 Swedish–L2 English learners processed
English free combinations and collocations. Reaction times and error rates showed a
processing cost for collocations compared to free combinations, due to the semantically
semitransparent nature of collocations. In the same vein, Yamashita (2018) explored
the potential contribution of semantic transparency in explaining the congruency
effect, and found that congruent items were dominated by transparency while incon-
gruent items were generally characterized by opacity, indicating a clear overlap between
these variables.
A third determinant of the processing of lexical patterns is frequency. A study by
Sonbul (2015) explored the sensitivity of native and nonnative speakers of English to
the corpus-derived frequency of collocations using both off-line (typicality rating task)
and online (eye movements) measures. There was a clear sensitivity to corpus-derived
frequency among both natives and nonnatives in the off-line task. The frequency effect
was also notable in the early stages of reading but disappeared later for both groups.
Wolter and Gyllstad (2013) examined the influence of frequency on the processing of
congruent and incongruent collocations. They found that advanced Swedish learners of
English were highly sensitive to the frequency of collocations, regardless of whether or
not the collocations had a congruent form in the L1. Likewise, Wolter and Yamashita
(2018) used an acceptability judgment task to examine the processing of collocations by
intermediate and advanced Japanese speakers of English and native English speakers.
They found effects of both word-level and collocation-level frequency among the three
groups of participants. Such results support usage-based models of language acquisi-
tion, whereby experience with the language predicts language processing and acquisi-
tion (e.g., Bybee, 2006; Ellis, 2002).
In addition, L2 proficiency seems to influence the processing of lexical patterns.
For example, Sonbul (2015) found that the effect of corpus-derived frequency on the
processing of collocations was greater among nonnative speakers of English as
their proficiency increased. Similarly, Ding and Reynolds (2019) found that the
influence of congruency was clearer among highly proficient than less proficient
Chinese EFL (English as a foreign language) learners. Sonbul and El-Dakhs (2020)
showed that the estimated proficiency of Arab learners of English influenced the
processing of collocations, with congruency effects slowly diminishing as

1 It should be noted that congruency is not a main factor in the present study but, due to its central role in
processing lexical patterns, it will be considered in item development and data analysis.


proficiency increased (and see similar effects of increasing proficiency from Yamashita & Jiang, 2010).
The research reviewed thus far has mainly focused on the processing of collocations
and idioms. To the best of our knowledge, only one study has examined the processing
of binomials by nonnatives. Siyanova-Chanturia et al. (2011) employed eye-tracking to
examine how native and nonnative English speakers, of varied levels of proficiency,
processed binomials that differed in phrasal frequency. The participants read sentences
containing binomials in their preferred, frequent, order (bride and groom) or their
reversed, less frequent, form (groom and bride). The results showed that both natives
and nonnatives were generally sensitive to the frequency of occurrence of binomials,
but only natives and higher proficiency nonnatives also exhibited a sensitivity to the
canonical configuration.

Incidental acquisition of L2 vocabulary


Most of the available evidence on the incidental acquisition of L2 vocabulary from
context has focused on single words (e.g., Pellicer-Sánchez & Schmitt, 2010; Pitts et al.,
1989; Waring & Takaki, 2003). These studies mainly relied on off-line tests to assess
gains and showed that L2 speakers do retain vocabulary from exposure, but the rate
might be relatively low. Only recently have studies used eye-tracking, which allows for
the examination of on-line processing as it unfolds in real time. Among the earliest eye-
tracking studies to examine the incidental acquisition of word knowledge is Godfroid
et al. (2013). In their study, Godfroid et al. (2013) had advanced Dutch-speaking
learners of English read short English extracts that contained target known words
and unknown pseudowords. Their results showed that participants spent more time
processing the unknown pseudowords than the known words, and that longer fixations
to the pseudowords were associated with better scores in an unannounced vocabulary
posttest. Similarly, Pellicer-Sánchez (2016) combined off-line (paper-and-pencil) and
online (eye-tracking) measures to examine the incidental acquisition of unknown
words by nonnative English learners. Notably, the participants read a full story, not
simply short extracts. The story contained pseudowords, each repeated eight times. The
reading time (RT) for pseudowords decreased significantly after three to four encoun-
ters, and pseudowords were read in a similar manner to known real words after eight
encounters. The paper-and-pencil tests showed that incidental acquisition of unknown
words is possible. However, the acquisition of word meaning lagged behind the
acquisition of word form.
Godfroid et al. (2018) conducted a study that involved participants reading an
authentic novel. The participants (native and nonnative English speakers) read five
chapters of an English novel that included foreign Dari (Farsi) words ranging in
frequency (1–23 occurrences). After reading, the participants were given a compre-
hension test and surprise vocabulary tests. Using growth curve analysis to model form
knowledge development, the results showed that both the quantity (number of expo-
sures) and the quality (total RT) of lexical processing facilitated incidental vocabulary
acquisition. The results showed a nonlinear, S-shaped pattern of RTs for newly
acquired words with an initial speed up (one to four exposures) followed by a plateau
and a slight increasing trend (7 to 10 exposures) before further decreases in RTs (11 to
23 exposures). Posttest scores suggested that the frequency of occurrence of the new
words and how long the participants read them at each encounter predicted the
acquisition of knowledge.


The research reviewed in this section thus far has focused on the acquisition of
individual words. In contrast to single words, lexical patterns are often claimed to be
less noticed by nonnatives (e.g., Boers & Lindstromberg, 2012; Christiansen & Arnon,
2017; Wray, 2000). However, some studies have suggested that nonnative speakers are
able to notice lexical patterns in context. For example, Durrant and Schmitt (2010)
assigned nonnative speakers of English to one of three training conditions: single
exposure, verbatim repetition, and varied repetition, followed by a naming task.
Participants in both repetition conditions recalled the target collocations better than
those in the single exposure condition. The authors concluded that adult nonnative
learners retain information about the lexical patterns they are exposed to in input, in
line with usage-based models of language acquisition (Bybee, 2006).
Evidence for the incidental acquisition of lexical patterns has accrued over the past
few years. Some studies have focused on incidental vocabulary acquisition from
television/video viewing (e.g., Majuddin et al., 2021; Puimѐge & Peters, 2020). More
relevant to the present study is research investigating the incidental acquisition of
lexical patterns from a reading context (Pellicer-Sánchez, 2017; Webb et al., 2013).
Webb et al. (2013) investigated the effect of repetition on the incidental acquisition of
collocations. Taiwanese EFL learners simultaneously read and listened to one of four
versions of a modified graded reader that included target collocations, with 1, 5,
10, and 15 encounters. Immediate-posttest results indicated that encountering col-
locations repeatedly when reading while listening contributed to incidental acquisi-
tion of form and meaning, with collocation acquisition increasing as a factor of
frequency. Pellicer-Sánchez (2017) examined the incidental acquisition of colloca-
tional knowledge, focusing on adjective-pseudoword collocations in reading. L2
learners read a story seeded with target collocations that were repeated either four
or eight times. The scores on a 1-week delayed posttest lent support to the benefits of
collocation acquisition from a reading context, even suggesting that ESL learners can
incidentally develop collocational knowledge at a similar rate to individual words.
However, there was not a significant effect of the frequency manipulation on
collocation acquisition.
While most studies have focused on collocations, very few studies have examined the
incidental acquisition of binomials. In one such study, Alotaibi et al. (2022) investigated
the effect of input mode (reading-only, listening-only, and reading-while-listening) and
frequency of exposure (two, four, five, and six occurrences) on declarative binomial
knowledge. Based on performance on immediate paper-and-pencil tests, results indi-
cated that it was possible for nonnative learners of English (L1 Arabic) to develop
declarative knowledge of the preferred order of binomials from the various input
modes; novel binomials encountered six times showed similar familiarity ratings as
existing binomials.
While previous studies indicate that lexical patterns (including binomials) can be
acquired incidentally in L2 speakers, they are limited in that they used post-hoc
measures of knowledge and did not examine processing as it unfolds in real time. Only
a few studies have employed eye-tracking to examine the online processing of lexical
patterns (Alotaibi, 2020, Study 2; Choi, 2017). The aim of Alotaibi’s (2020) Study 2, for
example, was to examine how nonnative learners process novel binomials in different
input modes. The findings showed that repeated exposure to novel binomials led to
fewer fixations and shorter RTs. Additionally, with increased exposure, the processing
of novel binomials gradually became comparable to existing ones. It should be noted
that, since the focus of Alotaibi’s (2020) Study 2 was on mode of processing rather than
acquisition per se, she did not include a reversed form of the novel binomials. Including


a reversed form can help address the special nature of binomials (see “Introduction”)
that does not merely involve co-occurrence restrictions but also entails word order
preferences.

The present study


As noted earlier, the current study is an extension of Conklin and Carrol (2021), who
investigated the processing of novel binomials amongst native speakers, exploring
sensitivity to co-occurrence information and canonical word order. They monitored
the eye movements of 40 native English speakers while reading short stories that
contained existing binomials in their common forward form (e.g., time and money),
seen once, and novel binomials (e.g., wires and pipes), seen one to five times in their
experimentally defined forward form. Then, the readers saw the existing and novel
patterns in the reversed order (e.g., money and time, pipes and wires). The results
revealed an initial co-occurrence memory effect for the components of novel binomials
(i.e., “wires” and “pipes”) regardless of direction whereby the last “reversed” form was
processed similarly to, or even significantly faster than, the first forward occurrence.
However, when frequency of encounter was considered, an advantage emerged for
forward novel patterns over subsequently encountered reversed forms after four to five
exposures, suggesting that natives could develop a sensitivity to the order of novel
binomials rapidly from exposure.
The current study aims to examine whether the effect found for natives by Conklin
and Carrol (2021) emerges for nonnative speakers. More specifically, we are interested
in whether nonnative speakers rapidly develop a sensitivity to the preferred word order
of novel binomials through natural reading. The current study addresses the following
questions:

1. Is the language processing system sensitive to novel binomials in L2 input that simulates a real-world context?
2. What is the effect of frequency of exposure on nonnatives’ sensitivity to novel
binomials in a real-world context?

Similar to Conklin and Carrol (2021), the current study presented existing binomials
only once in their forward form followed by once in their reversed form. However,
unlike Conklin and Carrol (2021) who included five frequency levels for novel bino-
mials, the design of the present study included two frequency categories only (2-
repetition vs. 4-repetition) to increase item power. In Conklin and Carrol (2021) as
the number of repetitions increased, the number of items per frequency level decreased.
For example, 25 items were read once, but only five items were read five times. Thus,
there was much less item power at the fifth occurrence versus the first. By only looking
at two frequency levels, we were able to include the same number of items for both
categories.2 We selected two versus four repetitions for our frequency categories based
on Conklin and Carrol’s (2021) finding that natives showed a clear sensitivity to a given
configuration after four exposures but not after two exposures. Thus, the novel

2 Another option would have been to follow Conklin and Carrol’s (2021) design with five frequency levels,
but just increase the number of items at each level. Doing this would have increased item power but would
have also resulted in passages becoming saturated with binomials, making the repetition manipulation clearly
marked.


binomials in the current study involved two main factors. The first factor was Category
with two levels: 2-repetition versus 4-repetition, and the second factor was Iteration
with three levels: first, last (i.e., second occurrence for 2-repetition items and fourth
occurrence for 4-repetition items), and reversed.
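Schematically, crossing these two factors yields six analysis cells for the novel binomials. The following sketch enumerates them (the cell labels are ours, used only for illustration; they are not the study's own coding):

```python
from itertools import product

# Factor 1: frequency category of the novel binomial.
categories = ["2-repetition", "4-repetition"]
# Factor 2: which occurrence is analyzed; "last" is the 2nd occurrence
# for 2-repetition items and the 4th for 4-repetition items.
iterations = ["first", "last", "reversed"]

# The six Category x Iteration analysis cells.
cells = [f"{c} / {i}" for c, i in product(categories, iterations)]
```

This makes explicit that "last" is defined relative to each item's frequency category rather than as a fixed occurrence number.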
To evaluate an emerging sensitivity to novel binomials (RQ1), we will compare RTs
of existing and novel binomials in the forward (first) and reversed iterations only.
When examining the effect of frequency on sensitivity to novel binomials (RQ2), we
will include the first, last, and reversed forms of both frequency categories.
To establish any similarities or differences between native and nonnative processing
of novel binomials, the current findings from nonnative speakers will be discussed in
relation to those of the native speakers from Conklin and Carrol (2021). It is important
to point out that, while we have introduced certain changes to the design (see preceding
text), the experiments are very similar (i.e., use the same items and the same texts with
minor modifications). Thus, comparing the findings of the current study to that of
Conklin and Carrol (2021) should be justified.3
We do not expect nonnatives in the present study to necessarily have a sensitivity to
the canonical order of existing binomials across all proficiency levels. Similar to
Siyanova-Chanturia et al. (2011), we anticipate that sensitivity to binomials’ word
order should only emerge for participants with high L2 proficiency. For the novel
binomials, based on Alotaibi (2020, Study 2), we expect our nonnatives to show online
memory effects for the co-occurrence of novel binomials’ components, that is, shorter
RTs for the last over the first encounter in the forward form. However, nonnatives
might or might not develop sensitivity to the canonical order of novel binomials (wires
and pipes vs. pipes and wires). If nonnatives follow the same pattern as natives (Conklin
& Carrol, 2021), they should show a processing advantage (i.e., shorter RTs) for novel
binomials in the forward form over the backward form after four exposures, but likely
not after as few as two encounters. If, however, nonnative speakers are not sensitive to
word order differences in lexical patterns, our nonnative participants might not show
such a processing advantage even after four exposures. Rather, they might treat the
backward form as another co-occurrence of the components and overlook the direction
preference. In that case, the final backward occurrence should demonstrate an addi-
tional processing advantage over the last encounter with the forward form.

Experiment
Methods
Participants
Initially, 40 participants took part in the experiment. One participant was excluded as
her score in the V_YesNo vocabulary test was below 4,000 word families, suggesting
that she might not know all words comprising the target binomials (see the following
text for more details).
The final pool of 39 participants were all nonnative speakers of English who were
academic and administrative staff at a university in Saudi Arabia (L1 Arabic; 30 females,

3 We should note that the native group in Conklin and Carrol (2021) was reasonably large, and we expect
less variation in a group of native speakers than in a group of nonnatives, especially when the native group is
drawn from a similar pool (native speaker undergraduates at a British university). Thus, their pattern of
results should be robust. However, researchers may wish to replicate their novel research.


average age = 34.39, SD = 10.61).4 They started learning English at an average age of
6.5 years (M = 6.46; SD = 5.19). Their self-reported proficiency scores (on a scale from
1 = very poor to 5 = excellent) were: reading M = 4.63, SD = 0.55; writing M = 4.60,
SD = 0.55; speaking M = 4.63, SD = 0.60; and listening M = 4.57, SD = 0.61.
To obtain a rough estimate of their proficiency in English, the participants com-
pleted the V_YesNo online vocabulary test (Meara & Miralpeix, 2017; maximum score
= 10,000). The test presents participants with 200 items (half real words and half
imaginary pseudowords) and instructs them to press “Yes” if they know the meaning of
the presented form and “Next” if they do not. The score is adjusted downward based on
guessing (i.e., pressing “Yes” for pseudowords) using an equation (see Meara &
Miralpeix, 2017, p. 120). Uchihara and Clenton (2020) found a significant association
between the V_YesNo test scores and speaking ability. Moreover, previous versions of
the Yes/No vocabulary test format demonstrated a medium to strong correlation with
proficiency measures (e.g., Meara & Jones, 1988; Miralpeix & Muñoz, 2018). The scores
of our participants in the V_YesNo test ranged between 4,000 and 9,302 (M = 6712.05,
SD = 1323.96), which roughly indicates a good to high level of proficiency (Meara &
Miralpeix, 2017). The vocabulary test score was added as a covariate in all mixed-effects
models (see “Analysis”) to control for the effect of proficiency on RTs.
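The guessing adjustment works by penalizing "Yes" responses to pseudowords. The exact equation is given in Meara and Miralpeix (2017, p. 120); the sketch below (in Python, for illustration only; the function name, default item counts, and the specific hit-minus-false-alarm correction are our assumptions, not the test's published scoring routine) shows the general logic of such a correction:

```python
def corrected_score(hits, false_alarms, n_real=100, n_pseudo=100, max_score=10_000):
    """Illustrative guessing correction for a Yes/No vocabulary test.

    hits: real words the test-taker claimed to know;
    false_alarms: pseudowords the test-taker claimed to know.
    """
    h = hits / n_real            # hit rate on real words
    f = false_alarms / n_pseudo  # false-alarm rate on pseudowords
    if f >= 1:                   # "Yes" to every pseudoword: no usable signal
        return 0
    # Classic correction for guessing: (h - f) / (1 - f), scaled to the
    # test's maximum score of 10,000.
    return max(0, round(max_score * (h - f) / (1 - f)))
```

For example, a test-taker who accepts 90 of 100 real words but also 10 of 100 pseudowords would receive a corrected score of 8,889 rather than 9,000 under this scheme.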

Materials
Thirty-two “noun-and-noun” binomials were selected for the present study, taken
from Conklin and Carrol (2021), to represent two categories of binomials: existing
(n = 12) and novel (n = 20). The full list of 32 items and their features is presented in
Appendix S1 (Online Supplementary Materials). All constituent words belonged to the
most frequent 4,000 word families in English (BNC/COCA List with 25 1,000-word
bands; Nation, 2012), meaning that our participants should be familiar with them.
The existing binomials were highly frequent phrases (BNC frequency per million)
and had a conventionalized order: forward M = 351.58, SD = 305.74; reversed M = 23.25,
SD = 27.00; t(11) = 3.71, 95% CI [144.59, 512.08], p < .001, d = 1.51. The novel binomials
were infrequent phrases (1–11 occurrences in the BNC) constructed using two common
nouns (most frequent 4,000 word families). They did not have a typical configuration
(forward M = 3.90, SD = 2.75; reversed M = 3.35, SD = 2.91; t(19) = 0.82, 95% CI [−0.86,
1.96], p = .43, d = 0.19). More details on item selection and categorization can be found in
Conklin and Carrol (2021). Appendix S2 (Online Supplementary Materials) presents
characteristics of the target stimuli.
Because participants in the present study were nonnatives from the same L1
background (Arabic), we also considered L1-L2 congruency of the existing and novel
binomials. We operationalized congruency in two steps: existence (exists as a common
binomial in both languages vs. exists in only one language) and configuration or
direction (same in both languages vs. different in the two languages). The two steps
are explained in detail in Appendix S3 (Online Supplementary Materials). Based on this
operationalization, 18 out of the 20 novel English binomials were not common in
Arabic (two existed in Arabic in the opposite direction) but only 6 out of the 12 existing
English binomials had the same direction in Arabic (six had a different direction in
Arabic). To check whether the pattern of results reported in the following text

4 Four of the 39 participants reported learning a language other than Arabic (English or French) at an early age. They can thus be considered balanced bilinguals. We conducted the analyses excluding them, and the pattern of results remained the same.




(see “Results”) was influenced by existence and direction in Arabic, we fit all models
with and without the eight nonmatching items. The pattern of results remained the
same.5
We also considered how familiar our participants were likely to be with the English
form of the binomials. To test this, a familiarity rating task was administered to 23 L1
Arabic–L2 English speakers who were comparable to our main participant pool.6 They
were instructed to rate both existing and novel binomials (intermixed in a list) for
familiarity on a scale from 1 = very unfamiliar to 7 = very familiar. The results showed
significantly higher familiarity ratings for existing binomials (M = 6.51, SD = 0.79) than
novel binomials (M = 4.86, SD = 0.84; t(30) = 5.52, 95% CI [1.05, 2.27], p < .001,
d = 2.03). To examine the potential effect of familiarity on the pattern of results, we fit
all models (see “Results”) with the average familiarity rating score as a covariate and
found no significant effect of rating scores. More importantly, including familiarity as a
covariate in the analysis did not alter the pattern of results. It should be noted that the
novel binomials were rated toward the middle of the scale. We will return to this point
in “Discussion.”
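The familiarity contrast can be expressed as a standardized mean difference. A minimal stdlib-only sketch (not the authors' code; the helper name is ours, and the equal-weight pooling of the two SDs is an assumption, which is why the result lands within rounding of the reported d = 2.03 rather than matching it exactly):

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    # Pooled-SD effect size, assuming equal weight for the two groups.
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Familiarity ratings: existing (M = 6.51, SD = 0.79) vs. novel (M = 4.86, SD = 0.84)
d = cohens_d(6.51, 0.79, 4.86, 0.84)  # ~2.02, within rounding of the reported 2.03
```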
Three stories of approximately 1,100 words each were adapted from Conklin and
Carrol (2021) to include the 32 target items. The passages were simplified to ensure
suitability for our nonnative participants (99% of words belonged to the most frequent
4,000 word families in English). All target “existing” and “novel” binomials were
presented once in the forward form and once in the reversed form. Half the novel
binomials (n = 10) were then presented one more time in the forward form to make a
total of two exposures and the other half (n = 10) were presented three more times in
the forward form to make a total of four exposures. The reversed form for both existing
and novel binomials occurred once after all occurrences of the corresponding forward
form. Conklin and Carrol (2021) conducted a predictability norming task with native
speakers of English. We included their predictability scores as a potential covariate in
the analysis. The passages, full data, and R codes are available through Open Science
Framework at https://osf.io/kymsp/?view_only=ec6b2f7e9ac74f15be7afcd864373f07.
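The exposure scheme described above can be sketched as follows (a design illustration in Python, not the presentation software; the function name is ours). Each target binomial occurs in the forward form a set number of times, with the single reversed occurrence always placed last:

```python
def exposure_sequence(n_forward):
    """n_forward forward occurrences of a binomial, followed by one
    reversed occurrence placed after all forward occurrences."""
    return ["forward"] * n_forward + ["reversed"]

# The three item categories differ only in the number of forward exposures:
existing = exposure_sequence(1)    # one forward + one reversed
novel_2rep = exposure_sequence(2)  # two forward + one reversed
novel_4rep = exposure_sequence(4)  # four forward + one reversed
```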
It is important to note here that the novel binomials in Conklin and Carrol (2021)
were counterbalanced across two lists, such that one order (wires and pipes) was the
forward direction on one list and the other (pipes and wires) was the forward direction
on the other list. This was done to account for any inherent word order preferences in
the novel items. However, no list differences were found; thus, in the current study,
items appeared in a single list with one version designated as the forward version.

Procedure
Upon arrival at the lab, the participant signed a consent form and completed the
vocabulary test, then the eye-tracking experiment started. Eye movements were
recorded monocularly using an SR Research EyeLink 1000+ eye-tracker. A desk-mounted
chinrest was used to minimize head movement. A 9-point grid calibration
procedure was conducted before the experiment and each screen was preceded by a

5 Only one difference was found in Analysis 1 for the total RT word 1 measure, where the Type × Direction interaction was not significant (p = .56).
6 To check comparability between the two groups, we gave the participants who completed the familiarity rating task the same V_YesNo vocabulary test. Their scores (M = 6568.96, SD = 1127.03) were similar to those of the main group (see “Participants”) and the difference was not significant: t(60) = −0.43, 95% CI [−803.27, 517.08], p = .67, d = 0.12.




fixation point for drift correction. The eye-tracker was recalibrated before each story
and whenever needed. The stories were presented in Courier New, 18-point font and
were double-spaced. Participants were told to read the stories as naturally as possible for
comprehension and to press the space bar to go to the next screen. In the texts, neither
existing nor novel binomials appeared at the beginning or end of a line or across a line
break. Each story was followed by five comprehension questions to ensure that
participants attended to the text (average percentage score: 94.36%, SD = 6.41).
Performance on the comprehension questions indicates that participants understood the
stories.

Analysis
Data cleaning was done prior to the analysis according to the four-stage process in
the DataViewer software. Single fixations shorter than 100 ms or longer than
800 ms were removed (5% of all fixations). Then, following Conklin and Carrol
(2021), we excluded trials that were discontinued or where track loss was experi-
enced. Any phrase that was completely skipped was also excluded from the analysis.
In such cases, all subsequent occurrences of the item, including the reversed form,
were also removed. This resulted in a loss of 1% of the data points in both analyses.
We also conducted the analysis with the full data set and the resulting models were
the same.
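The cleaning steps above can be sketched as follows (illustrative stdlib Python; the actual cleaning was performed in DataViewer, and the function names and data layout here are our assumptions):

```python
def clean_fixations(durations_ms, lo=100, hi=800):
    """Drop single fixations shorter than lo ms or longer than hi ms."""
    return [d for d in durations_ms if lo <= d <= hi]

def drop_skipped_items(trials):
    """Once a phrase is completely skipped, remove that trial and every
    subsequent occurrence of the same item (including the reversed form).
    `trials` is assumed to be in presentation order."""
    dropped = set()
    kept = []
    for t in trials:
        if t["item"] in dropped:
            continue  # a later occurrence of an already-skipped item
        if t["skipped"]:
            dropped.add(t["item"])
            continue
        kept.append(t)
    return kept
```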
The analysis was conducted with R version 4.0.5. (R Core Team, 2021). Linear
mixed-effects models were constructed and analyzed using the lme4 (version 1.1-26;
Bates et al., 2015) and lmerTest (version 3.1-3; Kuznetsova et al., 2017) packages. Three
interest areas were analysed for each model: the whole phrase, word 1, and word 3. The
middle word “and” was skipped more than 45% of the time, so it was excluded from the
analysis. Separate models were constructed for two eye-movement measures: first-pass
RT and total RT. RTs were log-transformed to reduce skewness in the data. All analyses
adopted the maximal random effects structure justified by the design (see the following
text for more details). Final models were checked for collinearity, and no issues were
observed (all VIFs < 7). Words and phrases that received no fixations during first-pass
reading were excluded from subsequent analyses.
We conducted two separate analyses to answer the research questions. First, we
compared the RTs of forward and reversed forms for both the existing and novel
binomials (RQ1). Then, RTs for novel binomials only were compared across three
iterations (first, last, reversed) for both the 2-repetition and 4-repetition binomial
categories (RQ2).

Results
Table 1 presents mean RTs for both existing and novel binomials for all occurrences in
both directions. Existing binomials showed no clear pattern, with slower RTs for the
forward form under some measures and for the reversed form under other measures.
For novel binomials, however, the pattern was much clearer. RTs were shorter
(i.e., faster processing) with more exposure to target items (first vs. last vs. reversed)
for both repetition categories (2 vs. 4). It should be noted that for the 4-repetition novel
binomials, the later occurrences (second, third, last) did not always exhibit this
processing advantage as exposure increased; the last exposure for 4-repetition items
had a longer average RT than the third exposure under all measures. We will return to
this in “Discussion.”




Table 1. Mean RTs in milliseconds with standard deviation in parentheses for existing binomials and novel binomials for first-pass and total RT
Existing binomials (n = 12) Novel binomials (2-repetition, n = 10) Novel binomials (4-repetition, n = 10)

Forward Reversed First Last Reversed First Second Third Last Reversed

First-pass RT Whole phrase 675.40 680.05 763.44 694.87 621.97 714.18 703.98 634.39 665.51 617.69
(356.80) (353.31) (403.68) (420.53) (351.04) (422.02) (371.75) (349.77) (329.41) (358.59)
Word 1 321.64 306.31 333.47 310.78 294.71 323.37 338.05 312.62 317.13 303.75
(166.95) (160.09) (159.59) (179.50) (137.04) (161.27) (159.16) (146.35) (131.14) (165.90)
Word 3 294.30 300.83 383.40 340.34 316.84 364.65 321.26 303.80 322.79 301.58
(121.34) (129.12) (187.98) (146.29) (148.04) (182.04) (153.03) (130.20) (149.21) (127.86)
Total RT Whole phrase 834.17 836.53 1056.09 936.25 819.46 1022.63 879.49 812.03 848.79 783.62
(444.75) (418.00) (516.29) (509.15) (464.86) (548.71) (452.93) (393.67) (416.21) (402.40)
Word 1 386.85 368.86 426.54 403.96 369.10 431.44 403.85 372.54 395.85 370.15
(252.48) (216.63) (239.51) (265.08) (219.55) (263.00) (232.57) (210.83) (207.74) (215.15)
Word 3 349.14 356.97 486.21 410.89 373.87 452.22 375.41 335.21 368.65 339.19
(198.10) (187.07) (293.06) (231.55) (228.87) (264.99) (219.22) (162.40) (217.74) (174.60)

Notes: For word 1 and word 3 values, words that received no fixations are discounted; values for the whole phrase include trials where either word 1 or word 3 (but not both) was skipped.

RQ1: Is the language processing system sensitive to novel binomials in L2 input that
simulates a real-world context?

To compare the mean RTs (first-pass and total) of existing and novel binomials, we
constructed linear mixed-effects models for the whole phrase, word 1 and word
3. Following Conklin and Carrol (2021), we included Type (existing vs. novel) and
Direction (forward vs. reverse) as fixed effects in all models. Word-level factors
including length (in letters) and frequency (on the Zipf scale) were also included as
fixed factors.7 Additionally, number of repetitions (1, 2, 4) was added as a covariate in
all models to control for the effect of variation in encounters on the subsequent reverse
form of existing and novel binomials. All models included random intercepts for
subjects and items. The optimx optimizer was used to avoid convergence errors when
needed. Initially, by-subject random slopes for the effects of Type and Direction were
included,8 but this led to unavoidable convergence errors. As a result, only by-subject
random slopes for Type were included in all models. Three other factors were then
added stepwise one-by-one to each original model to examine their potential effect on
RTs: phrase frequency on the Zipf scale, forward association strength, and cloze
probability. Log-likelihood (χ²) tests and AIC values were used to compare the
resulting model with the original model, and only factors that significantly improved
the model fit were kept. All models also included the log-transformed vocabulary test
score as a proxy for L2 proficiency.
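Word frequencies entered the models on the Zipf scale. For reference, a word's Zipf value is commonly computed as the base-10 logarithm of its frequency per billion words, i.e. log10(frequency per million) + 3 (a sketch; the helper name is ours):

```python
import math

def zipf(freq_per_million):
    # Zipf scale: log10 of frequency per *billion* words,
    # equivalent to log10(frequency per million) + 3.
    return math.log10(freq_per_million) + 3

# A word occurring 100 times per million words has a Zipf value of 5.0;
# a word occurring once per million words has a Zipf value of 3.0.
```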
All resulting models are presented in Table 2. Vocabulary Score was a significant
predictor in all models except the word 3 first-pass model, pointing to shorter
RTs as proficiency increased. The Type × Direction interaction was significant for all
first-pass and total reading measures except for the word 1 first-pass analysis. The
difflsmeans function in the lmerTest package was used to compute pairwise compar-
isons (see Appendix S4, Online Supplementary Materials). We calculated Cohen’s d for
pairwise comparisons as a standardized measure of effect size based on the guidelines
provided by Brysbaert and Stevens (2018). Results suggest that nonnatives were not
sensitive to the established configuration of frequent existing binomials in English.
There were no significant differences between the processing of forward and reversed
forms for existing binomials (whole phrase) in first-pass RT (β = −0.01, t = −0.24, 95%
CI [−0.08, 0.06], p = .81, d = 0.01) or total RT (β = −0.01, t = −0.39, 95% CI [−0.06,
0.04], p = .69, d = 0.02). These findings stand in sharp contrast with those reported in
Conklin and Carrol (2021) for native speakers who showed a clear processing advan-
tage for the forward form of existing binomials over their reversed forms. It seems that
our nonnative speakers are not sensitive to the established word order of common
binomials in English.
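The Brysbaert and Stevens (2018) approach standardizes a fixed effect by dividing the estimate by the square root of the sum of all the model's variance components (random effects plus residual). As an illustration (in Python; the function name is ours), plugging in the rounded whole-phrase total-RT variance components from Table 2 lands within rounding of the reported effect size:

```python
import math

def d_mixed(beta, variance_components):
    # Brysbaert & Stevens (2018): divide a fixed-effect estimate by the
    # total SD implied by all variance components of the mixed model.
    return beta / math.sqrt(sum(variance_components))

# Whole phrase, total RT (Table 2): subject .03, subject|type .00, item .02, residual .15
d = d_mixed(0.28, [0.03, 0.00, 0.02, 0.15])  # ~0.63, within rounding of the reported 0.62
```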
For the novel binomials, pairwise comparisons suggest significantly shorter RTs for
the reversed forms in comparison to the forward forms for all measures with a small to
medium effect. This is most clearly seen for the whole phrase, for both first-pass RT
(β = 0.15, t = 5.11, 95% CI [0.09, 0.21], p < .001, d = 0.25) and total RT (β = 0.28, t = 13.95,
95% CI [0.24, 0.31], p < .001, d = 0.62). This result is similar to (but more robust than)

7 For the first-pass RT measure for word 1, we did not include word 3 length or word 3 Zipf frequency, as word 3 had not yet been encountered in reading at this point.
8 The study employed a within-subject design (i.e., all participants saw items under all conditions: existing binomials, 2-repetition novel binomials, and 4-repetition novel binomials). Thus, we only included by-participant random slopes (no by-item slopes) to account for the random variation in repeated measures (see Linck & Cunnings, 2015).

https://doi.org/10.1017/S0272263122000237 Published online by Cambridge University Press


https://doi.org/10.1017/S0272263122000237 Published online by Cambridge University Press

Table 2. Linear mixed-effects model for existing versus novel binomials in forward and reversed forms for whole phrase RTs, word 1 and word 3

First pass RT Total RT
Estimate SE t p Estimate SE t p



Whole phrase
(Intercept) 12.34 1.33 9.31 <.001*** 15.47 1.41 10.96 <.001***
Type (novel) 0.05 0.07 0.71 .48 0.19 0.09 2.07 .048*
Direction (reverse) 0.01 0.04 0.24 .81 0.01 0.03 0.39 .69
Type × Direction 0.16 0.05 3.32 <.001*** 0.29 0.03 8.86 <.001***
Repetitions 0.02 0.02 0.77 .45 0.01 0.03 0.39 .70
W1 Length 0.02 0.02 0.96 .35 0.02 0.02 0.80 .43
W3 Length 0.02 0.02 1.12 .27 0.00 0.02 0.06 .95
W1 Zipf 0.06 0.05 1.28 .21 0.10 0.06 1.64 .11
W3 Zipf 0.05 0.06 0.96 .35 0.02 0.07 0.24 .81
Vocabulary Test Score (log) 0.67 0.15 4.55 <.001*** 0.95 0.15 6.128 <.001***
Random effects: Variance SD Variance SD
Subject 0.02 0.15 0.03 0.18
Subject|Type 0.00 0.04 0.00 0.03
Item 0.01 0.08 0.02 0.13
Residual 0.33 0.57 0.15 0.39
Word 1
(Intercept) 10.27 0.96 10.68 <.001*** 12.90 1.14 11.34 <.001***
Type (novel) 0.01 0.05 0.15 .88 0.02 0.07 0.26 .80
Direction (reverse) 0.03 0.03 0.99 .32 0.02 0.03 0.50 .62
Type × Direction 0.07 0.04 1.82 .07 0.14 0.04 3.50 <.001***
Repetitions 0.00 0.02 0.15 .89 0.01 0.02 0.53 .60
W1 Length 0.01 0.01 0.72 .48 0.02 0.02 0.82 .42
W3 Length – – – – 0.00 0.02 0.20 .84
W1 Zipf 0.07 0.03 2.33 .03* 0.15 0.05 2.93 .007**
W3 Zipf – – – – 0.07 0.06 1.18 .25
Vocabulary Test Score (log) 0.49 0.11 4.53 <.001*** 0.70 0.12 5.60 <.001***
Random effects: Variance SD Variance SD
Subject 0.02 0.13 0.02 0.14
Subject|Type 0.00 0.01 0.00 0.02
Item 0.00 0.06 0.01 0.09
Residual 0.17 0.41 0.21 0.46
Word 3
(Intercept) 7.44 0.82 9.07 <.001*** 10.01 1.07 9.36 <.001***
Type (novel) 0.18 0.05 3.34 .002** 0.28 0.09 3.21 .003**
Direction (reverse) 0.02 0.03 0.84 .40 0.03 0.03 1.08 .28
Type × Direction 0.18 0.03 5.54 <.001*** 0.28 0.04 7.36 <.001***
Repetitions 0.02 0.02 0.93 .36 0.03 0.03 1.02 .32
W1 Length 0.02 0.01 1.48 .15 0.02 0.02 0.87 .39
W3 Length 0.00 0.01 0.37 .72 0.00 0.02 0.05 .96
W1 Zipf 0.08 0.04 2.31 .03* 0.10 0.06 1.74 .09
W3 Zipf 0.04 0.04 0.90 .38 0.00 0.07 -0.07 .95
Vocabulary Test Score (log) 0.15 0.09 1.68 .10 0.43 0.11 3.77 <.001***
Random effects: Variance SD Variance SD
Subject 0.01 0.08 0.02 0.13
Subject|Type 0.01 0.08 0.01 0.08
Item 0.00 0.06 0.01 0.11
Residual 0.15 0.38 0.20 0.44

Note: p-values are estimated using the lmerTest package in R (Kuznetsova et al., 2017).
*p < .05; **p < .01; ***p < .001.


Table 3. Linear mixed-effects model for novel binomials in different iterations (first, last, and reversed) for whole phrase RTs, word 1 and word 3

First pass RT Total RT

Estimate SE t p Estimate SE t p



Whole phrase
(Intercept) 11.99 1.45 8.25 <.001*** 15.26 1.49 10.25 <.001***
Category (4-repetition) 0.12 0.05 2.40 .02* 0.05 0.07 0.73 .47
Iteration (last) 0.12 0.04 2.78 .006** 0.14 0.03 5.14 <.001***
Iteration (reversed) 0.20 0.04 4.73 <.001*** 0.28 0.03 10.22 <.001***
Category (4-repetition) × Iteration (last) 0.12 0.06 2.04 .04* 0.03 0.04 0.73 .47
Category (4-repetition) × Iteration (reversed) 0.10 0.06 1.70 .09 0.01 0.04 0.33 .74
W1 Length 0.04 0.02 2.42 .03* 0.03 0.03 0.78 .45
W3 Length 0.02 0.02 1.29 .22 0.01 0.03 0.40 .69
W1 Zipf 0.02 0.06 0.36 .72 0.21 0.11 1.95 .07
W3 Zipf 0.04 0.06 0.72 .49 0.03 0.09 0.34 .74
Vocabulary Test Score (log) 0.69 0.16 4.28 <.001*** 0.95 0.16 6.00 <.001***
Phrase Zipf 0.17 0.09 1.91 .08 0.33 0.16 2.12 .054
Forward Association 1.59 0.78 2.05 .06 – – – –
Random effects: Variance SD Variance SD
Subject 0.04 0.20 0.04 0.21
Subject|Category 0.00 0.01 0.00 0.02
Item 0.00 0.04 0.02 0.12
Residual 0.35 0.59 0.15 0.38
Word 1
(Intercept) 10.16 0.89 11.42 <.001*** 13.18 1.15 11.44 <.001***
Category (4-repetition) 0.05 0.03 1.52 .13 0.00 0.06 0.00 1.00
Iteration (last) 0.09 0.03 3.02 .003** 0.08 0.03 2.46 .01*
Iteration (reversed) 0.13 0.03 4.12 <.001*** 0.16 0.03 4.75 <.001***
Category (4-repetition) × Iteration (last) 0.10 0.04 2.34 .02* 0.02 0.05 0.36 .72
Category (4-repetition) × Iteration (reversed) 0.06 0.04 1.41 .16 0.01 0.05 0.16 .88
W1 Length 0.02 0.01 2.20 .04* 0.00 0.03 0.11 .91
W3 Length – – – – 0.03 0.03 1.08 .30
W1 Zipf 0.01 0.03 0.32 .76 0.12 0.08 1.52 .15
W3 Zipf – – – – 0.06 0.07 0.83 .42
Vocabulary Test Score (log) 0.52 0.10 5.17 <.001*** 0.75 0.12 6.23 <.001***
Forward Association 0.73 0.37 1.96 .07 – – – –
Random effects: Variance SD Variance SD


Subject 0.02 0.13 0.03 0.17



Subject|Category 0.00 0.05 0.01 0.09
Item 0.00 0.02 0.01 0.10
Residual 0.17 0.41 0.21 0.45
Word 3
(Intercept) 9.11 1.05 8.70 <.001*** 10.87 1.16 9.39 <.001***
Category (4-repetition) 0.04 0.04 0.96 .35 0.08 0.06 1.18 .25
Iteration (last) 0.10 0.03 3.67 <.001*** 0.15 0.03 4.74 <.001***
Iteration (reversed) 0.17 0.03 5.78 <.001*** 0.25 0.03 7.57 <.001***
Category (4-repetition) × Iteration (last) 0.00 0.04 0.11 .91 0.03 0.05 0.71 .48
Category (4-repetition) × Iteration (reversed) 0.01 0.04 0.19 .85 0.00 0.05 0.08 .93
W1 Length 0.03 0.02 1.45 .17 0.03 0.03 0.96 .36
W3 Length 0.00 0.02 0.13 .90 0.01 0.03 0.33 .75
W1 Zipf 0.12 0.06 1.92 .08 0.18 0.10 1.79 .10
W3 Zipf 0.00 0.05 0.09 .93 0.05 0.08 0.61 .55
Vocabulary Test Score (log) 0.36 0.11 3.21 .003** 0.32 0.15 2.15 .051
Phrase Zipf 0.19 0.10 1.93 .08 0.56 0.12 4.77 <.001***
Random effects: Variance SD Variance SD
Subject 0.02 0.13 0.02 0.14
Subject|Category 0.00 0.00 0.00 0.01
Item 0.00 0.07 0.01 0.11
Residual 0.15 0.39 0.20 0.44

Note: p-values are estimated using the lmerTest package in R (Kuznetsova et al., 2017).
*p < .05; **p < .01; ***p < .001.


the findings reported in Conklin and Carrol (2021) for native speakers, suggesting that
just like natives, nonnatives seem to quickly develop a link between the two words in
memory, exhibiting an advantage in processing, even when they appear in a different
order than previous encounters.
As a final step in the analysis, we also tested for the contribution of the Type ×
Direction × Vocabulary Score interaction to the model fit.9 This was intended to reveal
any modulating effect of L2 proficiency, that is, to examine if nonnatives with higher
vocabulary scores read existing and novel binomials in the forward and reverse
direction differently. This three-way interaction did not significantly contribute to
any of the models, suggesting similar patterns regardless of L2 proficiency.
One final notable finding in Table 2 is that number of repetitions was not significant
in any of the models, suggesting no difference between items that were seen twice and
those that were seen four times. The next research question examined in more detail the
possibility that number of encounters might modulate this effect.
RQ2: What is the effect of frequency of exposure on nonnatives’ sensitivity to novel
binomials in a real-world context?
The effect of frequency of exposure on RTs of novel binomials (word 1, word 3 and
whole phrase) was explored in Analysis 2. This analysis included Category (2-repetition
vs. 4-repetition) and Iteration (first, last, reversed) as fixed effects and by-subject
random slopes for Category. All other fixed factors and covariates were the same as
those included in Analysis 1. Additionally, we examined the three-way interaction
between Category, Iteration, and Vocabulary Score, but it did not significantly improve
any of the models.
The resulting models are presented in Table 3. Overall, there appears to be a main
effect for Vocabulary Score (but see the total RT measure for Word 3) and a main effect
for Iteration, but not Category (but see first-pass RT for the whole phrase). Pairwise
comparisons across the three Iteration levels, regardless of repetition, are presented in
Appendix S5 (see Online Supplementary Materials). These are computed using the
difflsmeans function in the lmerTest package.
The general pattern seems to suggest significantly different RTs across the three
iterations with a small to medium effect (first > last > reversed). Thus, it appears that
with more exposure to binomials, nonnatives developed a sensitivity to the co-occur-
rence of the content words, spending less time reading them each time they appeared
together. As for the reversed form, which was always included after all occurrences of
the forward form, the results suggest that nonnatives dealt with it as another exposure
to the binomial, ignoring the configuration mismatch.

Discussion
Research examining how nonnative speakers process novel lexical patterns in context is
fairly limited. The present study aimed to fill this gap by recording the eye-movement
patterns of nonnative speakers of English (L1 Arabic) as they read stories seeded with
novel binomials to address two research questions. First, we examined whether the
nonnatives developed a sensitivity to the canonical order of novel binomials after

9 This model also included all possible two-way interactions to control for their effect: Direction × Vocabulary Score and Type × Vocabulary Score. This was also the case in Analysis 2, which included Category × Vocabulary Score and Iteration × Vocabulary Score.




exposure and compared their processing to existing binomials (Research Question 1).
Second, we looked at the effect of frequency (two vs. four exposures) on the develop-
ment of sensitivity to the novel binomials (Research Question 2).
In response to the first research question, results showed no processing advantage
for existing, common binomials (time and money) over their less frequent reversed
forms (money and time). Thus, unlike natives in Conklin and Carrol (2021), nonnatives
in the present study were generally not sensitive to binomials’ canonical word order.
This result seems to support Siyanova-Chanturia et al.’s (2011) finding of limited
nonnative sensitivity to word order preferences, which emerged only as proficiency
increased. In the present study, however, proficiency did not seem to modulate
sensitivity to binomials’ configuration. While participants in Siyanova-Chanturia
et al. (2011) came from a variety of L1 backgrounds, we targeted a homogenous
nonnative population (L1 Arabic–L2 English). Previous research on collocations and
idioms has often reported congruency as an important factor in the nonnative processing of
lexical patterns (e.g., Carrol et al., 2016; Sonbul & El-Dakhs, 2020). A follow-up analysis
that was conducted on a subset of binomials that matched in the two languages showed
the same pattern of results, namely, no sensitivity to the canonical configuration (see
“Materials” for details). The fact that the L1-L2 matched existing binomials in the
present study comprised only six items might explain the lack of effect. Future research
on nonnative binomial processing should address the congruency effect more directly
with a larger set of congruent and incongruent items. Another factor to consider in
future research is binomial familiarity. The norming study that we conducted with a
group of L1 Arabic–L2 English speakers comparable to the main participant pool (see
“Materials”) showed that novel binomials elicited familiarity ratings toward the
mid-point of the scale that were only slightly (though significantly) lower than the
ratings for existing binomials. More research is needed in this area to tease apart off-line
familiarity ratings and online real-time performance.
For novel binomials, the results of Analysis 1 (initial forward exposure vs. reversed
forms) showed a robust significant advantage (with a small/medium effect) for the
reversed form over the forward form for all eye-movement measures (both early and
late). As indicated in the preceding text, this result complements Conklin and Carrol’s
(2021) finding for natives who initially exhibited sensitivity to the combination of single
words (“wires” and “pipes”) regardless of direction. Thus, like natives in Conklin and
Carrol’s (2021) study, nonnatives in the present study seem to keep a record of all
occurrences of lexical patterns in the input; but unlike native speakers, they might not
initially build sensitivity to the preferred word order (wires and pipes vs. pipes and
wires). This seems to support Durrant and Schmitt’s (2010) finding that nonnatives are
able to extract co-occurrence restrictions from input, refuting traditional claims (e.g.,
Wray, 2000), and backing up usage-based models of language processing (Bybee, 2006;
Ellis, 2002). However, as noted earlier, binomials may be different from other forms of
lexical patterns in that they involve co-occurrence and configuration restrictions. In a
study on the processing of lexical bundles, Ellis et al. (2008) found that nonnative
speakers are sensitive mainly to frequency in the language, whereas native speakers seem
to extract the nuance of a bundle’s association strength (i.e., how often two words tend
to co-occur above chance). Similarly, one might claim that the nonnative participants in
the present study were able to exhibit sensitivity to mere frequency but were not
sensitive to higher-level restrictions on word order, at least not after one encounter
as Analysis 1 seems to suggest.
The second research question (Analysis 2) examined the possible modulating effect
of frequency of encounters (two vs. four) on the development of a sensitivity to the




canonical order of binomials. Conklin and Carrol (2021) found that their native
speakers processed the forward forms of novel binomials faster than their reversed
forms, similar to existing binomials, after four to five exposures.10 Crucially, exposure
to the subsequent reversed form led to a cost (a marked rise in processing time)
compared to the most recent encounter, despite this being faster than the first exposure.
However, the results of Analysis 2 failed to reveal similar effects for our nonnative
participants even after four exposures. In line with the findings of Analysis 1, nonnatives
in the present study processed the last reversed form (pipes and wires) significantly
more quickly (with a small/medium effect) than the experimentally defined forward
form (wires and pipes) regardless of how many times it was encountered. This finding is
further supported by the raw RTs in Table 1, showing similar processing times for the
third encounter in the 2-repetition category (backward) and the third encounter in the
4-repetition category (forward). Thus, it seems to be the case that whilst nonnatives did
register co-occurrence restrictions in terms of which words go together, they did not
register the configuration/order of the words. This lack of a configuration effect stands
in contrast with findings of Alotaibi et al. (2022) who found that nonnative Arab
learners of English were able to develop sensitivity to the preferred order of binomials
(similar to existing phrases) after six exposures. It should be noted, however, that unlike
the present study, Alotaibi et al. (2022) included higher frequency levels (up to six
occurrences) than the present study (with a maximum of four encounters) and
employed declarative post-treatment measures. Sonbul and Schmitt (2013) found a
dissociation between gains in declarative (paper-and-pencil) tests and those reflecting
online performance. In the present study, we did not employ any post-treatment
declarative measure of sensitivity to target binomials’ word order. Future research
can benefit from combining both online (eye-movement) measures and off-line
(paper-and-pencil) tasks to compare findings at both processing levels. Moreover,
more encounters can be included to allow participants to develop sensitivity to
configuration restrictions. A relevant point, relating to Arabic speakers of English,
may be that Arabic seems to be less fixed than English regarding the order of binomials’
components (see “Materials”). Thus, it can be speculated that, given the flexibility in
their L1, Arabic speakers of L2 English might not develop sensitivity to binomial
restrictions in context. Because research on the structure of binomials in Arabic is
extremely limited (but see Kaye, 2009), this possibility can only be viewed as a
hypothesis that needs to be explored by future research comparing L2 English speakers
from a variety of L1 backgrounds. Another related issue is that, in contrast to English,
the Arabic script is read from right to left. As the focus of the present study is on
binomials’ word order in L2 English, Arabic native speakers might be disadvantaged
(in comparison to L2 English speakers whose native language is read from left to right).
This would be an interesting question to explore in future research.
The fact that nonnatives in the present study developed a sensitivity to one aspect of
binomials (i.e., co-occurrence restrictions) but not another (i.e., configuration restric-
tions) is in line with eye-tracking evidence for the incidental acquisition of single words
from context. As indicated earlier, the limited available research in this area (Pellicer-
Sánchez, 2016) seems to suggest that nonnative learners tend to develop different word
knowledge aspects at different rates: While the form is acquired quickly, knowledge of

10 While Conklin and Carrol (2021) found a processing cost for reversing a novel binomial after four to five exposures, theirs is the only study to investigate the development of a sensitivity to word order for novel binomials in native speakers from natural reading. More research is needed to firmly establish the emergence of such a sensitivity.

https://doi.org/10.1017/S0272263122000237 Published online by Cambridge University Press


Nonnatives’ processing of novel lexical patterns in context 389

meaning fails to develop even after repeated exposure. Along the same lines, it can be
claimed that not all aspects of binomial knowledge are learned at the same pace. Online
sensitivity to the co-occurrence restrictions of binomials can develop quickly from
exposure, but sensitivity to the canonical order does not develop even after several
exposures. Further eye-tracking research on the processing of novel binomials can
include more encounters to arrive at an estimated frequency after which nonnatives
develop a sensitivity to binomials’ preferred order.
Another parallel in eye-movement patterns between individual words and bino-
mials is the effect of multiple encounters on processing. In their study on individual
words, Godfroid et al. (2018) found a tendency for RTs of novel words to initially
decrease but then increase around the seventh encounter, reflecting “increased cogni-
tive effort and attempts on their [participants’] part to integrate the words into the
sentence contexts and make form-meaning connections” (p. 575). Similarly, our
findings showed a slight increasing trend at the fourth encounter for the 4-repetition
novel binomials (see Table 1) which disappeared with the presentation of the reversed
form. Thus, although we did not intend to examine the gradual effect of exposure on the
processing of novel binomials in the present study, a tendency seems to emerge at the
fourth encounter, similar to the effect reported by Godfroid et al. (2018) for single
words. Building on predictions of the type of processing–resource allocation (TOPRA)
model (Barcroft, 2002), one might argue that in the first few encounters with a given
binomial, the cognitive demands are high as language users are encoding co-occurrence
restrictions (i.e., learning that ‘wires’ and ‘pipes’ occur together). Then, around the
fourth encounter, more cognitive resources are freed and can thus be devoted to the
specific configuration and the overall meaning that is denoted by the phrase. With more
exposure, the forward form of novel binomials might show further decrease in RTs.
However, given the limited number of encounters in the present study (maximum
four), this interpretation is speculative. Future eye-tracking research on the online
processing of novel binomials would do well to include more encounters with novel
binomials, in line with Godfroid et al.’s (2018) design, to fully explore the possible S-
shaped processing of novel lexical patterns in context.
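The encounter-by-encounter pattern just described can be stated mechanically. The sketch below (illustrative Python with invented RT values, not data from the present study or from Godfroid et al., 2018) locates the first encounter at which mean RT stops decreasing:

```python
# Hypothetical mean RTs (ms) by encounter number for a novel binomial;
# values are invented to mimic the decrease-then-uptick pattern discussed.
rts_by_encounter = [650, 610, 585, 600]  # encounters 1-4

# First encounter number at which RT rises relative to the previous one
# (None if RTs decrease monotonically).
uptick = next(
    (i + 2
     for i, (prev, curr) in enumerate(zip(rts_by_encounter, rts_by_encounter[1:]))
     if curr > prev),
    None,
)
print(uptick)  # 4 for these toy values
```

With these toy values the uptick falls at the fourth encounter, mirroring the trend observed in Table 1.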

Conclusion
The present study was intended to extend Conklin and Carrol’s (2021) findings for the
native processing of novel binomials to a population of nonnative English speakers
(L1 Arabic). Results showed that nonnatives had limited sensitivity to the preferred
order of existing binomials; they did not develop sensitivity to the experimentally
defined configuration of novel binomials even after four encounters. These results seem
to suggest that the nonnative participants were simply recording co-occurrence restric-
tions, disregarding direction preference, which may be a feature of language that is less
salient and therefore requires more input to emerge. The study is limited, however, in
that it did not include a balanced number of congruent/incongruent binomials to fully
examine the congruency effect. Additionally, the study only included two frequency
conditions (two vs. four) and did not include a post-exposure measure of declarative
knowledge. Future research should explore the effect of increased frequency and L1-L2
congruency on the acquisition process (both online processing and off-line, post-
exposure, gains). Despite the limitations, the present study can be viewed as an initial
attempt to examine nonnatives’ processing of novel binomials in context. This line of
research can further our understanding of the conditions that might help nonnatives



390 Suhad Sonbul et al.

develop sensitivity to lexical patterns like bread and butter (over butter and bread),
enabling new and broader explanations of input-driven language development.
Acknowledgment. The researchers thank Prince Sultan University for funding this research project
through the research lab [Applied Linguistics Research Lab - RL-CH-2019/9/1].

Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/
10.1017/S0272263122000237.

Data Availability Statement. The experiment in this article earned Open Data and Open Materials
badges for transparent practices. The materials and data are available at https://osf.io/kymsp/?view_only=
ec6b2f7e9ac74f15be7afcd864373f07.

References
Alotaibi, S. (2020). The effect of input mode, input enhancement and number of exposures on the learning and
processing of binomials in the L2 [Unpublished doctoral dissertation]. University of Nottingham.
Alotaibi, S., Pellicer-Sánchez, A., & Conklin, K. (2022). The effect of input modes and number of exposures on
the learning of L2 binomials. ITL – International Journal of Applied Linguistics, 173, 58–93.
Arcara, G., Lacaita, G., Mattaloni, E., Passarini, L., Mondini, S., Beninca, P., & Semenza, C. (2012). Is “hit and
run” a single word? The processing of irreversible binomials in neglect dyslexia. Frontiers in Psychology, 3,
1–11.
Barcroft, J. (2002). Semantic and structural elaboration in L2 lexical acquisition. Language Learning, 52,
323–363.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal
of Statistical Software, 67, 1–48.
Blais, M.-J., & Gonnerman, L. M. (2013). Explicit and implicit semantic processing of verb particle
constructions by French–English bilinguals. Bilingualism: Language and Cognition, 16, 829–846.
Boers, F., & Lindstromberg, S. (2012). Experimental and intervention studies on formulaic sequences in a
second language. Annual Review of Applied Linguistics, 32, 83–110.
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal
of Cognition, 1, 1–20.
Bybee, J. (2006). From usage to grammar: The mind’s response to repetition. Language, 82, 711–733.
Carrol, G., & Conklin, K. (2014). Getting your wires crossed: Evidence for fast processing of L1 idioms in an
L2. Bilingualism: Language and Cognition, 17, 784–797.
Carrol, G., & Conklin, K. (2020). Is all formulaic language created equal? Unpacking the processing advantage
for different types of formulaic sequences. Language and Speech, 63, 95–122.
Carrol, G., Conklin, K., & Gyllstad, H. (2016). Found in translation: The influence of the L1 on the reading of
idioms in a L2. Studies in Second Language Acquisition, 38, 403–443.
Choi, S. (2017). Processing and learning of enhanced English collocations: An eye movement study. Language
Teaching Research, 21, 403–426.
Christiansen, M.H., & Arnon, I. (2017). More than words: The role of multiword sequences in language
learning and use. Topics in Cognitive Science, 9, 542–551.
Conklin, K., & Carrol, G. (2021). Words go together like “bread and butter”: The rapid, automatic acquisition
of lexical patterns. Applied Linguistics, 42, 492–513.
Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than nonformulaic
language by native and nonnative speakers? Applied Linguistics, 29, 72–89.
Ding, C., & Reynolds, B. L. (2019). The effects of L1 congruency, L2 proficiency, and the collocate-node
relationship on the processing of L2 English collocations by L1-Chinese EFL learners. Review of Cognitive
Linguistics, 17, 331–357.
Durrant, P., & Schmitt, N. (2010). Adult learners’ retention of collocations from exposure. Second Language
Research, 26, 163–188.
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of
implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188.



Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and second
language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 42, 375–396.
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text and Talk, 20, 29–62.
Godfroid, A., Ahn, J., Choi, I., Ballard, L., Cui, Y., Johnston, S., Lee, S., Sarkar, A., & Yoon, H. J. (2018).
Incidental vocabulary learning in a natural reading context: An eye-tracking study. Bilingualism: Language
and Cognition, 21, 563–584.
Godfroid, A., Boers, F., & Housen, A. (2013). An eye for words: Gauging the role of attention in incidental L2
vocabulary acquisition by means of eye-tracking. Studies in Second Language Acquisition, 35, 483–517.
Goldberg, A. E., & Lee, C. (2021). Accessibility and historical change: An emergent cluster led uncles and
aunts to become aunts and uncles. Frontiers in Psychology, 12, 1418.
Gyllstad, H., & Wolter, B. (2016). Collocational processing in light of the phraseological continuum model:
Does semantic transparency matter? Language Learning, 66, 296–323.
Irujo, S. (1986). Don’t put your leg in your mouth: Transfer in the acquisition of idioms in a second language.
TESOL Quarterly, 20, 287–304.
Kaye, A. S. (2009). Cultural ingredients in Arabic lexical pairs (Binomials). Word, 60, 65–78.
Kuznetsova, A., Brockhoff, P., & Christensen, R. (2017). lmerTest package: Tests in linear mixed effects
models. Journal of Statistical Software, 82, 1–26.
Libben, M., & Titone, D. (2008). The multidetermined nature of idiom processing. Memory and Cognition,
36, 1103–1121.
Linck, J. A., & Cunnings, I. (2015). The utility and application of mixed‐effects models in second language
research. Language Learning, 65, 185–207.
Majuddin, E., Siyanova-Chanturia, A., & Boers, F. (2021). Incidental acquisition of multiword expressions
through audiovisual materials: The role of repetition and typographic enhancement. Studies in Second
Language Acquisition, 43, 985–1008.
Malkiel, Y. (1959). Studies in irreversible binomials. Lingua, 8, 113–160.
Matlock, T., & Heredia, R. R. (2002). Understanding phrasal verbs in monolinguals and bilinguals. In R. R.
Heredia & J. Altarriba (Eds.), Bilingual sentence processing (pp. 251–274). Elsevier.
Meara, P., & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.), Applied
linguistics in society (pp. 80–87). CILT.
Meara, P., & Miralpeix, I. (2017). Tools for researching vocabulary. Multilingual Matters.
Miralpeix, I., & Muñoz, C. (2018). Receptive vocabulary size and its relationship to EFL language skills. Inter-
national Review of Applied Linguistics in Language Teaching, 56, 1–24.
Mollin, S. (2013). Pathways of change in the diachronic development of binomial reversibility in Late Modern
American English. Journal of English Linguistics, 41, 168–203.
Mollin, S. (2014). The (ir)reversibility of English binomials. John Benjamins.
Morgan, E., & Levy, R. (2016). Abstract knowledge versus direct experience in processing of binomial
expressions. Cognition, 157, 384–402.
Nation, I. S. P. (2012). The BNC/COCA word family lists. Document bundled with Range Program with
BNC/COCA Lists, 25. https://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/Information-
on-the-BNC_COCA-word-family-lists.pdf
Pawley, A., & Syder, F. (1983). Two puzzles for linguistic theory. In J. C. Richards & R. W. Schmidt (Eds.),
Language and communication (pp. 191–227). Longman.
Pellicer-Sánchez, A. (2016). Incidental L2 vocabulary acquisition from and while reading: An eye-tracking
study. Studies in Second Language Acquisition, 38, 97–130.
Pellicer-Sánchez, A. (2017). Learning L2 collocations incidentally from reading. Language Teaching
Research, 21, 381–402.
Pellicer-Sánchez, A., & Schmitt, N. (2010). Incidental vocabulary acquisition from an authentic novel: Do
things fall apart? Reading in a Foreign Language, 22, 31–55.
Pitts, M., White, H., & Krashen, S. (1989). Acquiring second language vocabulary through reading: A
replication of the Clockwork Orange study using second language acquirers. Reading in a Foreign
Language, 5, 271–275.
Pritchett, L. K., Vaid, J., & Tosun, S. (2016). Of black sheep and white crows: Extending the bilingual dual
coding theory to memory for idioms. Cogent Psychology, 3, 1135512.
Puimège, E., & Peters, E. (2020). Learning formulaic sequences through viewing L2 television and factors that
affect learning. Studies in Second Language Acquisition, 42, 525–549.



R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. https://www.R-project.org.
Rommers, J., Dijkstra, T., & Bastiaansen, M. (2013). Context-dependent semantic processing in the human
brain: Evidence from idiom comprehension. Journal of Cognitive Neuroscience, 25, 762–776.
Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a phrase “time and again”
matters: The role of phrasal frequency in the processing of multiword sequences. Journal of Experimental
Psychology: Learning Memory and Cognition, 37, 776–784.
Sonbul, S. (2015). Fatal mistake, awful mistake, or extreme mistake? Frequency effects on off-line/on-line
collocational processing. Bilingualism: Language and Cognition, 18, 419–437.
Sonbul, S., & El-Dakhs, D. (2020). Timed versus untimed recognition of L2 collocations: Does estimated
proficiency modulate congruency effects? Applied Psycholinguistics, 41, 1197–1222.
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge: Acquisition of collocations under
different input conditions. Language Learning, 63, 121–159.
Titone, D., Columbus, G., Whitford, V., Mercier, J., & Libben, M. (2015). Contrasting bilingual and
monolingual idiom processing. In R. R. Heredia & A. B. Cieślicka (Eds.), Bilingual figurative language
processing (pp. 171–207). Cambridge University Press.
Tiv, M., Gonnerman, L., Whitford, V., Friesen, D., Jared, D., & Titone, D. (2019). Figuring out how verb-
particle constructions are understood during L1 and L2 reading. Frontiers in Psychology, 10, 1733.
Uchihara, T., & Clenton, J. (2020). Investigating the role of vocabulary size in second language speaking
ability. Language Teaching Research, 24, 540–556.
van Hell, J. G., & Tanner, D. (2012). Second language proficiency and cross‐language lexical activation. Lan-
guage Learning, 62, 148–171.
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from reading a
graded reader? Reading in a Foreign Language, 15, 130–163.
Webb, S., Newton, J., & Chang, A. C. S. (2013). Incidental learning of collocation. Language Learning, 63,
91–120.
Wolter, B., & Gyllstad, H. (2011). Collocational links in the L2 mental lexicon and the influence of L1
intralexical knowledge. Applied Linguistics, 32, 430–449.
Wolter, B., & Gyllstad, H. (2013). Frequency of input and L2 collocational processing: A comparison of
congruent and incongruent collocations. Studies in Second Language Acquisition, 35, 451–482.
Wolter, B., & Yamashita, J. (2018). Word frequency, collocational frequency, L1 congruency and proficiency
in L2 collocation processing: What accounts for L2 performance? Studies in Second Language Acquisition,
40, 395–416.
Wray, A. (2000). Formulaic sequences in second language teaching: Principle and practice. Applied
Linguistics, 21, 463–489.
Yamashita, J. (2018). Possibility of semantic involvement in the L1-L2 congruency effect in the processing of
L2 collocations. Journal of Second Language Studies, 1, 60–78.
Yamashita, J., & Jiang, N. (2010). L1 influence on the acquisition of L2 collocations: Japanese ESL users
and EFL learners acquiring English collocations. TESOL Quarterly, 44, 647–668.

Cite this article: Sonbul, S., El-Dakhs, D. A. S., Conklin, K. and Carrol, G. (2023). “Bread and butter” or
“butter and bread”? Nonnatives’ processing of novel lexical patterns in context. Studies in Second Language
Acquisition, 45, 370–392. https://doi.org/10.1017/S0272263122000237



Studies in Second Language Acquisition (2023), 45, 393–415
doi:10.1017/S0272263122000249

RESEARCH ARTICLE

The elusive impact of L2 immersion on translation priming
Adel Chaouch-Orozco1, Jorge González Alonso2,3, Jon Andoni Duñabeitia2,3 and Jason Rothman3,2*
1The Hong Kong Polytechnic University, Hong Kong; 2Universidad Nebrija, Madrid, Spain; 3UiT The Arctic University of Norway, Tromsø, Norway
*Corresponding author. Email: jason.rothman@uit.no

(Received 23 November 2021; Revised 19 May 2022; Accepted 26 May 2022)

Abstract
A growing consensus sees the bilingual lexicon as an integrated, nonselective system.
However, the way bilingual experience shapes the architecture and functioning of the lexicon
is not well understood. This study investigates bilingual lexical-semantic representation and
processing employing written translation priming. We focus on the role of active exposure to
and use of the second language (L2)—primarily operationalized as immersion. We tested
200 highly proficient Spanish–English bilinguals in two groups differing in their societal
language (immersed vs. nonimmersed) and amount of L2 use. L2 proficiency was controlled
across participants, allowing us to disentangle its effects from those of L2 use. Overall,
however, immersion’s impact on our data was minimal. This suggests a ceiling effect for
the influence of active L2 use on bilingual lexical functioning when L2 development is
maximal. The present data provide relevant insights into the nature of the bilingual lexicon,
informing developmental models.

Introduction
Most current accounts of bilingual lexical organization assume that the bi-/multilingual
lexicon is integrated (words from both languages are stored together; e.g., Meade et al.,
2017; van Heuven et al., 1998). Moreover, it has been shown that first (L1) and second
language (L2) words are activated in parallel during lexical access, even when the
context calls for a fully unilingual mode (e.g., Thierry & Wu, 2007). Nonselective access
has been observed in numerous studies examining comprehension and/or production
of isolated words (e.g., Kroll et al., 2006). The most compelling evidence of nonselec-
tivity comes from coactivation in sentence comprehension studies, where the sentential
context could be expected to constrain the search space to only one of the languages
(e.g., Duyck et al., 2007; van Assche et al., 2012).
An important question following from these findings is how exactly words from
different languages are connected at each level of representation (sublexical, lexical,

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000249 Published online by Cambridge University Press


394 Chaouch-Orozco et al.

semantic), and whether the nature and strength of these connections change dynam-
ically as a function of relative experience with each language (e.g., Kroll & Stewart,
1994). One of the most direct ways to explore cross-language connections in the
bilingual lexicon is to use priming techniques, with prime and target words belonging
to different languages. By manipulating the factors postulated to constrain and regulate
these connections and measuring how this affects priming, we can test theories about
the architecture of the lexicon. In word recognition research, this logic has most often
been embodied in lexical decision tasks with translation priming (e.g., Wen & van
Heuven, 2017, for review).
In a primed lexical decision task, the subject is presented with a prime word followed
by a string of letters upon which they make a lexical decision (i.e., a yes/no answer to the
implicit question “is this a real word?”). In the critical condition, prime and target are
related at some level of interest—for example, semantics, morphology, orthography—
while in the control condition they are unrelated. In a translation priming paradigm,
related primes and targets are translation equivalents (e.g., flecha-ARROW, in a
Spanish–English experiment), and cross-language unrelated pairs constitute the con-
trol condition (e.g., camisa, Spanish for “shirt”-ARROW). Priming effects manifest as
significantly different mean response times (RTs) and/or error rates between the two
conditions, typically with shorter latencies and/or greater accuracy in the related
condition. Under interactive theories of lexical-semantic processing (e.g., Collins &
Loftus, 1975), priming effects are interpreted as a given amount of (pre)activation
spreading from a related prime to a target within the lexical network, facilitating its
processing and speeding up retrieval.
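As a minimal illustration of this logic, the following sketch (hypothetical RTs in milliseconds; the values are invented for exposition, not data from any study discussed here) computes a priming effect as the difference between mean RTs in the unrelated and related conditions:

```python
# Illustrative only: toy response times (ms) for a translation priming
# lexical decision task. Values and item labels are hypothetical.
related = [512, 498, 530, 505, 488]    # e.g., flecha-ARROW
unrelated = [560, 571, 549, 566, 580]  # e.g., camisa-ARROW

def mean(xs):
    return sum(xs) / len(xs)

# Priming effect = mean RT (unrelated) - mean RT (related);
# a positive value indicates facilitation from the related prime.
priming_effect = mean(unrelated) - mean(related)
print(round(priming_effect, 1))  # 58.6 for these toy values
```

In an actual experiment, of course, the difference would be evaluated inferentially (e.g., with mixed-effects models over trial-level RTs) rather than by comparing raw means.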
The priming literature on bilingual lexical access has used both cognate and
noncognate words (e.g., Duñabeitia et al., 2010; see Sánchez-Casas & García-Albea,
2005, for an early review). Cognate words are etymologically related pairs in different
languages that retain some similarity in both form and meaning. Noncognate trans-
lation equivalents refer to semantically related pairs with no overlap at the form level
(e.g., English dog and Spanish perro). Because orthography and phonology are not
shared in these pairs, priming effects between noncognate translation equivalents have
been used to gauge the availability of links between such words at the lexeme and
conceptual levels. A recurrent finding in these experiments is a priming asymmetry
(Wen & van Heuven, 2017). While priming effects in the L1 (prime) to L2 (target)
translation direction tend to be robust and have been replicated in numerous studies,
L2-L1 effects are rare and almost invariably smaller. Although the effect has been
mostly studied with masked priming paradigms (e.g., Schoonbaert et al., 2009; Wen &
van Heuven, 2017), larger L1-L2 priming has been consistently found with unmasked
primes as well (e.g., Chen & Ng, 1989; Jin, 1990; Keatley & de Gelder, 1992; Keatley et al.,
1994; Kiran & Lebel, 2007; Smith et al., 2019). This asymmetry is consistently observed
with late bilinguals who are more proficient or dominant in one of their languages
(generally the L1). High L2 proficiency (e.g., Nakayama et al., 2016; cf. Dimitropoulou
et al., 2011) and early onset of bilingualism (e.g., Duñabeitia et al., 2010; Wang, 2013)
have been reported to attenuate or eliminate the asymmetry. While these effects have
been interpreted from various models of the bilingual lexicon (see the following text),
all must account for why exactly factors relating to bilingual experience (e.g., language
use, L2 proficiency) should strongly moderate the effect. A further complication is that
relative language exposure/use and L2 proficiency seem to be correlated, making it
difficult to tease apart their effects. The goal of the present study is to contribute useful
data in this respect, by examining late bilinguals matched for L2 proficiency but
differing in their amount of active L2 use.



The elusive impact of L2 immersion on translation priming 395

To this end, we tested 200 highly proficient Spanish–English sequential bilinguals, split evenly into two groups differing only in their societal
language (L2 immersed vs. nonimmersed). Crucially, immersion proxied not only
exposure to the L2 but also active use of that language, as confirmed by the significantly
different mean scores of both groups in a linguistic background questionnaire (see the
following text). We created a set of 314 noncognate translation pairs. In comparison to
previous studies of this type, this generated a large number of observations, to which we
applied a conservative analysis. In doing so, we answer recent calls for sufficiently
powered studies in bilingualism (see Brysbaert, 2019, 2021; Brysbaert & Stevens, 2018).
Following from current theories as described previously, we predicted that
immersion—proxying the amount of active L2 exposure/use—would modulate prim-
ing effects, especially in the L2-L1 direction. As a result, participants with L2 English as
their societal language were expected to show a larger advantage in related trials
(i.e., those with L2 translation primes) over control trials as compared to the non-
immersed group.

Models of bilingual lexical-semantic representation


Although several comprehensive theories have been advanced (see Kroll & Ma, 2017,
for review), the two most prominent models of bilingual lexical-semantic processing
are arguably the Revised Hierarchical Model (RHM; Kroll & Stewart, 1994; Kroll et al.,
2010) and Multilink (Dijkstra et al., 2019), which aims to integrate the tenets of the
RHM and the Bilingual Interactive Activation model(s) (BIA/BIAþ; Dijkstra & van
Heuven, 2002).
The RHM is a developmental model proposing qualitative differences in the way L1
and L2 words are represented and connected. L1 words have direct and robust links to
the conceptual features that make up their meanings. For L2 words, however, lexical-
semantic connections are weaker, at least at low proficiency. The RHM proposes that,
over development, the bilingual lexicon bridges L2 lexeme-concept connections
through L1 lexemes, which have more robust access to the conceptual level. As
proficiency increases, L2 lexeme-concept links become stronger, granting direct,
independent access from L2 words to concepts, and vice versa, without L1 mediation.
Although the RHM was originally proposed to account for asymmetries in word
production (see Brysbaert & Duyck, 2010, and Kroll et al., 2010, for discussion), it
has been widely discussed in the word recognition literature, including translation
priming. The RHM can predict the priming asymmetry only if we assume that
translation priming is largely a semantic effect (e.g., Schoonbaert et al., 2009; Xia &
Andrews, 2015) and if there are asymmetric connections between the L2 lexical forms
and the semantic store, with the route from meaning to form being stronger. Then,
recognition of the L1 word would activate the shared conceptual node, which in turn
would preactivate the L2 word. Bilinguals with lower L2 proficiency have weaker L2
lexical-semantic connections. In translation priming experiments, this means that L2
primes cannot (sufficiently) stimulate conceptual features shared with the L1 target,
which results in a weak or no observable preactivation advantage with respect to an
unrelated control word. The contrary is expected in the L1-L2 direction. L1 primes
activate shared conceptual nodes, and these in turn preactivate their L2 target coun-
terparts. Priming effects in both directions should gradually become more symmetrical
with increased L2 proficiency or L2 use, which are expected to reinforce L2 lexical-
semantic connections.




Multilink, developed within a localist-connectionist framework, is a comprehensive computational model of word recognition and production. For Multilink, most of the
differences between L1 and L2 word processing can be accounted for by an intrinsic
property of lexical representations, independent of their language membership: their
resting level activation (RLA). RLA is conceptualized as a word’s baseline activation,
from which task-related activation can push the lexical item over a given selection
threshold. The model assumes that RLA is not static over time, and largely depends on
subjective word frequency, defined as the speaker-specific frequency of each word
(i.e., how many times a particular individual has encountered a particular word).
Subjective frequency is, of course, not directly observable, but it may be proxied by
different measurable factors, such as active language exposure/use or corpus word
frequency. Like the RHM, Multilink predicts larger L2-L1 priming with more L2
experience. In this case, experience is assumed to increase the RLA for L2 words,
speeding up prime recognition and increasing opportunities to preactivate the L1
target.
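As a rough caricature of the RLA idea (our own simplification for exposition; these are not Multilink’s actual equations or parameters), one can let resting-level activation grow with log subjective frequency, so that a heavily used L2 word starts closer to the selection threshold and needs fewer activation cycles to be recognized:

```python
import math

# Toy caricature of resting-level activation (RLA): NOT Multilink's
# implementation. RLA grows with log subjective frequency, so a
# frequently encountered word needs fewer activation cycles to reach
# the selection threshold. Parameter values are arbitrary.
def resting_level(subjective_freq, scale=0.1):
    return scale * math.log1p(subjective_freq)

def cycles_to_threshold(rla, gain=0.05, threshold=1.0):
    activation, cycles = rla, 0
    while activation < threshold:
        activation += gain
        cycles += 1
    return cycles

low_use = cycles_to_threshold(resting_level(10))     # rarely used L2 word
high_use = cycles_to_threshold(resting_level(5000))  # heavily used L2 word
assert high_use < low_use  # more L2 use -> faster recognition
```

Under this toy scheme, increasing a word’s subjective frequency shortens its recognition time, which is the mechanism by which more L2 experience is predicted to speed up L2 prime recognition and thereby enlarge L2-L1 priming.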
In sum, current models of bilingual lexical processing assume that L2 development
brings about better connectivity and faster access for L2 words. This should result in less
asymmetrical translation priming patterns between forward (L1-L2) and backward
(L2-L1) translation directions. Whether patterns of L2 task performance can be
faithfully captured, in this domain, through variables such as L2 proficiency or L2
active exposure/use is an open question.

L2 proficiency and L2 use in translation priming studies


Attempts have been made to assess the influence of L2 proficiency on translation
priming effects. Dimitropoulou et al. (2011) tested Greek–English bilinguals in three
groups with varying L2 proficiency (i.e., low, intermediate, high). Priming patterns in
the three groups did not significantly differ, leading to the conclusion that L2 profi-
ciency was not a deterministic factor explaining the asymmetry. In contrast, in a series
of experiments, Nakayama and colleagues reported a major modulation of L2-L1
priming by L2 proficiency. In Nakayama et al. (2016), significant L2-L1 priming effects
were obtained in two experiments with highly proficient Japanese–English bilinguals
(TOEIC mean score [out of 990]: 872 and 917). It is worth noting that in one of the
experiments in Nakayama et al. (2016), the same stimuli as in Experiment 2B of
Nakayama et al. (2013) were used. Notably, these stimuli had previously failed to show
significant L2-L1 priming at lower proficiency (TOEIC mean score: 740). A third
experiment in Nakayama et al. (2016) with lower-proficiency subjects (mean TOEIC:
710) replicated the results in Nakayama et al. (2013, Experiment 2B). Taken together,
this suggests that differences in L2 proficiency are a good candidate for explaining the
misalignment of results in backward translation priming. That is, the 2016 findings
overall seem to indicate that there is a lower bound of relatively high L2 proficiency
required for L2-L1 priming.
Few studies have directly examined the role of language experience in noncognate
translation priming with late bilinguals. In Experiment 1, Wang (2013) tested English–
Chinese bilinguals who were more dominant in their L1, living in a bilingual society like
Singapore. In Experiment 2, participants were more balanced bilinguals. Wang reports
a priming asymmetry only in Experiment 1, suggesting an effect of dominance on
priming effects. However, the cohort of participants in Experiment 2 was highly
heterogeneous. For instance, 10 out of 20 subjects were early bilinguals, previously

https://doi.org/10.1017/S0272263122000249 Published online by Cambridge University Press


The elusive impact of L2 immersion on translation priming 397

shown to yield symmetric priming patterns (e.g., Duñabeitia et al., 2010), which might
partly explain their results. Perhaps the most compelling evidence for the role of L2 use
comes from Zhao et al. (2011). They tested four groups of Chinese–English bilinguals.
In two of them, participants were highly proficient in their L2 but differed in whether
their societal language was also English (L2-immersed vs. nonimmersed). The results
showed that the size of the L2-L1 priming effect increased as a function of the amount of
L2 experience. In particular, significant L2-L1 priming was only observed in the
immersed, high-proficiency group (but not in a nonimmersed group with similar
proficiency). However, the small number of observations potentially compromises
their results—they tested 16 participants in the immersed group and employed
32 translation equivalents.
Taken together, these studies paint a mixed picture of the role of experiential factors
in bilingual lexical processing and representation. Whereas some studies have sug-
gested a fundamental role of L2 proficiency in the presence or absence of L2-L1
priming, others have failed to replicate these effects. This is true across the full range
of L2 proficiency, even and most importantly for our purposes, at high levels of L2
proficiency where one would expect (cumulative) experience to be the most observable,
if not testable. As per active L2 use, some findings seem to point toward a relevant
involvement of this factor in modulating translation priming, but more research is
needed to understand the magnitude of this role and disentangle it from those of L2
proficiency and language dominance.
It should come as no surprise that investigating such intertwined constructs results
in a muddled picture. Indeed, the close relationship between proficiency and use is
problematic for the study of the bilingual lexicon. Yet, it is hard to conceive why formal
knowledge of a language—as L2 proficiency purportedly reflects—would be determin-
istic in lexical processing if this predictor were not intimately related to other aspects of
bilingual experience. L2 proficiency, as an experimental construct, may be masking the
contribution of other relevant factors, obscuring our understanding of the processes
taking place within the lexical-semantic network. For instance, a bilingual with higher
L2 proficiency will almost invariably have more frequent or intensive use of the L2 than
someone with lower proficiency. L2 proficiency is a compound construct, necessarily
including not only knowledge of a language but also experience with that language.
Thus, proficiency perhaps introduces a confound in the equation, complicating the
ability to accurately estimate the impact of language experience on bilingual lexical
processing. In this sense, and despite the attention L2 proficiency has traditionally
received, this construct may not be the best approximation to the contribution of
language experience in the development of the bilingual lexicon. In contrast, the
amount of active, meaningful experiences with the L2 may be more deterministic for
dynamic changes in how (L1 and) L2 words are represented and processed.
In another relevant study, Chaouch-Orozco et al. (2021) attempted to disentangle
these effects by studying the interaction between L2 proficiency and use in a translation
priming experiment. They tested Spanish–English bilinguals with English as their
societal language (i.e., L2-immersed) and varying degrees of L2 proficiency and L2
use, both of which were operationalized as continuous variables. Participants’ L2
proficiency ranged from upper-intermediate to advanced. The L2 use scores—obtained
from a linguistic background questionnaire—ranged from those reflecting equal use of
both languages to greater L1 use. Chaouch-Orozco et al. (2021) reported L2-L1 priming
effects that were modulated by L2 use only, while L2 proficiency did not affect the
priming effects significantly. Thus, the authors concluded that L2 use was a better
predictor of L2 lexical processing than standard measures of L2 grammar knowledge.

398 Chaouch-Orozco et al.

At 60 participants and 50 word pairs, however, the dataset in Chaouch-Orozco et al.
(2021) may have been underpowered to explore complex interactions of the type that
the study focused on. Furthermore, the range of relative L1-L2 use was biased toward
the L1 side of the scale, despite these being immersed L2 speakers.
The present study improves upon Zhao et al. (2011) and Chaouch-Orozco et al.
(2021) with a much larger sample and a more extensive set of word pairs, which
guarantee sufficient statistical power to investigate the interactions of interest. In
addition, here we introduce a more systematic exploration of L1/L2 use—operationa-
lized through immersion—while factoring out potential effects of L2 proficiency by
controlling this factor across participants. The goal is to offer a robust dataset that sheds
light on the role of language use in bilingual lexical-semantic processing as reflected by
translation priming effects.

Method
Participants
Two hundred Spanish–English sequential bilinguals (see Table 1 for participant
characteristics) took part in two translation priming lexical decision tasks (LDT) under
overt priming conditions, one experiment per priming direction. Participants were
recruited from two different populations. Half of them were L1-immersed, living in
Spain; the other half were L2-immersed, living in the United Kingdom. L2 proficiency
was controlled across participants to isolate the effect of L2 use and was assessed with
the LexTALE test (Lemhöfer & Broersma, 2012), a validated measure of L2 vocabulary
knowledge. A minimum score of 80/100 was required to participate in the study.
This threshold was based on Lemhöfer & Broersma’s report of LexTALE correlating
with the Oxford Quick Placement Test (OQPT; Oxford University Press, University of
Cambridge, and Association of Language Testers in Europe, 2001). In particular, 80%
correct responses in the OQPT, which corresponds to a CEFR (Common European
Framework of Reference for Languages; Council of Europe, 2001) C1 level, corre-
sponded to a LexTALE score of 80.5% in the authors’ analyses (Lemhöfer & Broersma,
2012, p. 335). A two-sample t-test showed that the groups differed significantly in their
LexTALE score (t = –44.79, p < .001; see Table 1 for averages), despite small numerical
differences in mean and standard deviation. However, further exploration with a
parsimonious mixed-effects model showed that this factor, treated continuously across
the whole population, did not significantly modulate overall RTs or priming effects.
Moreover, we further inspected a potential effect of proficiency by subsetting the
groups to have nonsignificant differences in LexTALE scores between them. We
achieved this by removing eight participants in each group (t = 0.47, p = 0.64). We
then ran a new model with this subset, which yielded remarkably similar outcomes to
our final model reported in the text that follows. Therefore, the analysis continued as
planned.

Table 1. Participant characteristics

           Age (years)        LexTALE              LSBQ                      UK length of residence (years)
Spain      26 (4.5; 19–39)    89.7 (5.6; 80–100)   4.6 (3.1; –2.3 to 11.4)   –
UK         32 (4.9; 22–40)    88.1 (5.0; 80–100)   14.6 (2.9; 6.1 to 21.6)   6 (3.7; 1–21)

Note: Mean values (standard deviation; range). “LSBQ” column shows composite L2 use score across contexts (home,
social, etc.).


Language use information was collected through the Language and Social Back-
ground Questionnaire (LSBQ; Anderson et al., 2018), which provides a fine-grained,
context-dependent, and dynamic measure of relative L1/L2 use. Mean values differed
significantly between these groups (p < .05), with the immersed group reporting more
L2 use. This supported the intuition that immersion proxies not only exposure to but
also active use of the L2, and the critical manipulation of immersion was therefore
deemed adequate for our empirical purposes. The LexTALE and LSBQ scores were not
correlated (r = –0.11, p < .001). All participants reported having started to learn English
in primary school and never before age six. Only four participants in the Spain-based
group reported previous immersion experience, but not within the 12 months before
the experiment.
Task order was as follows: first direction of the translation priming LDT – LSBQ –
second direction of the LDT. Order of LDT priming direction (L1-L2 or L2-L1) was
counterbalanced across participants. Participants were recruited online and compen-
sated with £20 (or the equivalent in euros) for their participation.

Materials
A total of 314 noncognate translation equivalent pairs were used in each translation
direction (see Appendix A for the stimuli list and Table 2 for stimuli characteristics).
Targets were extracted from a continuum of frequencies and concreteness. Given that
the stimuli consisted of translation pairs, we opted for using only the English words’
values to avoid employing different norms. Thus, each English word within each pair
was given a concreteness value extracted from Brysbaert et al.’s (2014) norms. English
word frequencies were obtained from the SUBTLEXUK corpus (van Heuven et al.,
2014), whereas Spanish frequencies were extracted from SUBTLEXESP (Cuetos et al.,
2011). Mean values between languages did not differ significantly. Words in both
languages were also matched for length and orthographic neighborhood.
To generate “no” trials necessary for lexical decision, 314 pseudowords were created
for both translation directions with the Wuggy software (Keuleers & Brysbaert, 2010).
These pseudowords matched their word counterparts on length of subsyllabic seg-
ments, letter length, transition frequencies, and two out of three segments. The pseudo-
words were paired with 314 different words that served as their primes. Four lists were
created (two for each target language). For each language, one list had half the target
words preceded by their translation equivalents and the other half by control primes,
whereas the other list inverted these conditions for the same targets. Control primes
were created by scrambling the related primes in the other list. We ensured that control
pairs remained orthographically and semantically unrelated. The words in each list
were matched for frequency, word length, and orthographic neighborhood. Each list
began with 16 practice items.
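As a toy illustration of this counterbalancing scheme (invented items and helper names, not the authors' actual scripts, and one plausible reading of the scrambling procedure), the two lists could be built as follows:

```python
import random

def build_lists(pairs, seed=0):
    """Build two counterbalanced lists from (prime, target) translation pairs.

    In list 1 the first half of the targets get their related (translation)
    prime and the second half get control primes; list 2 inverts this.
    Control primes come from scrambling the related primes (here within the
    same half), re-shuffled until no prime lands on its own target.
    """
    rng = random.Random(seed)
    half = len(pairs) // 2
    first, second = pairs[:half], pairs[half:]

    def scramble(subset):
        primes = [p for p, _ in subset]
        while True:  # assumes >= 2 items per half, so a derangement exists
            rng.shuffle(primes)
            if all(p != q for p, (q, _) in zip(primes, subset)):
                return [(p, t, "control") for p, (_, t) in zip(primes, subset)]

    list1 = [(p, t, "related") for p, t in first] + scramble(second)
    list2 = scramble(first) + [(p, t, "related") for p, t in second]
    return list1, list2

# Toy stimuli (invented; the real study used 314 noncognate pairs)
pairs = [("casa", "house"), ("perro", "dog"), ("gato", "cat"), ("libro", "book")]
list1, list2 = build_lists(pairs)
```

Each target thus appears once per list, preceded by its translation equivalent in one list and by an orthographically and semantically unrelated control prime in the other.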

Table 2. Stimuli characteristics

               Spanish               English
Frequency      4.3 (0.7; 2.5–6.1)    4.5 (0.6; 2.6–6.3)
Concreteness          4.0 (1.01; 1.19–5.0)
Length         5.5 (1.4; 3–8)        5.5 (1.4; 3–8)

Note: Mean values (standard deviation and ranges). Concreteness values for Spanish words are assumed to approximate
those of their English translations.


To ensure that participants knew the English stimuli, they completed a picture-word
matching task with the concrete stimuli, where they were presented with pictures
depicting objects accompanied by two words in English: the correct picture name and a
distractor. The lowest individual accuracy score was 89%. Only five words received
responses with an accuracy lower than 80% overall. These were removed from the
dataset. Knowledge of the abstract word pairs, which have much lower imageability,
was evaluated through a translation recognition task. Five participants showed an
accuracy below 85% and were removed from the dataset. Thirty-nine (abstract) words
showed an accuracy below 80% and were removed from the dataset. In all cases, these
tasks were conducted after the LDTs.

Procedure
All experiments were created and presented online using Gorilla Experiment Builder
(Anwyl-Irvine et al., 2020). Given the limited control that online presentation affords
the experimenter over participants’ performance, data quality controls and
exclusion criteria were implemented to ensure participants’ constant attention during
the experimental tasks. First, there was a time limit (95 minutes—on average, a session
took 60–70 minutes to complete) to finish each session. Attention checks were
implemented, and their presentation was pseudorandomized (i.e., within blocks of
20 trials) so participants could not know when they would appear. Participants had to
press “B” on the keyboard within 2 seconds of the instructions’ onset. Participants
who passed fewer than 95% of these checks were excluded from the study. We also
examined their responses to ensure they were not blatantly random. Failing to meet
these criteria resulted in exclusion from the study. Twenty-five participants out of
225 failed to meet these criteria, resulting in the final 200 participants whose data were
analyzed.
Each trial began with a fixation cross on the centre of the screen (500 ms), followed
by the prime in lowercase letters (200 ms) and the target in upper case letters, which
remained on the screen until the subject provided a response. Right-handed partici-
pants had to press “0” on the keyboard to indicate YES, and “1” for NO. This order was
inverted for left-handed participants. They were asked to respond as fast and as
accurately as possible. Each task (priming direction) was further divided into 15 blocks
of approximately 40 trials. Participants were given the chance to rest between these 40-
trial blocks. They were asked to avoid any distractions during the session and to ensure
their vision was corrected if needed. No participant completed the sessions at night, and
they were encouraged not to participate when they felt tired. In sum, we paid special
attention to simulating, to the extent possible, lab testing conditions.

Data analysis
Data and analysis code can be found in the first author’s OSF repository (https://osf.io/yx6hw/). Besides the five participants excluded due to low accuracy on the translation
recognition task, two more participants were removed for the same reason after
inspecting the LDT data. The analysis continued with the remaining 193 participants
(96 in the Spain group, 97 in the UK group) and 270 word pairs. Incorrect responses
and pseudoword trials, as well as RTs below 200 ms (4 in total) and above 5,000 ms
(80 in total), were removed (see Baayen and Milin, 2010). We transformed the latencies
to obtain inverse Gaussian, log-normal, and BoxCox distributions. After visual


inspection (Q-Q plots) and Shapiro–Wilk tests, the inverse Gaussian distribution was
selected to perform the analysis, as it provided a better correction of the skewness
(inverse Gaussian: p = .42; BoxCox: p = .33; log-normal: p = .08). Sum contrasts were
employed for categorical variables, and all continuous independent variables were
scaled, centred, and converted to z units.
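The trimming and transformation-selection steps can be sketched as follows. This is a hedged illustration with simulated latencies, using the reciprocal transform (−1000/RT) as a stand-in for the inverse transformation; it is not the authors' code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rts = rng.lognormal(mean=6.5, sigma=0.35, size=2000)  # simulated raw RTs (ms)

# Trim implausible latencies, as in the text: keep 200 ms < RT < 5,000 ms
rts = rts[(rts > 200) & (rts < 5000)]

# Candidate transformations of the latencies
candidates = {
    "inverse": -1000.0 / rts,        # reciprocal stand-in for the inverse transform
    "log": np.log(rts),              # log-normal
    "boxcox": stats.boxcox(rts)[0],  # Box-Cox (lambda estimated from the data)
}

# Judge normality of each transformed distribution with a Shapiro-Wilk test,
# retaining the transform with the highest p-value (least evidence of skew)
pvals = {name: stats.shapiro(x).pvalue for name, x in candidates.items()}
best = max(pvals, key=pvals.get)
```

Continuous predictors would then be standardized (e.g., with `stats.zscore`) before entering the models, matching the scaling and centring described above.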
Error rates and response times were analyzed employing (generalized) linear mixed-
effects models (Baayen et al., 2008) in R (version 3.6.1; R Core Team, 2021) with the
lme4 package (Bates et al., 2015). We followed Scandola and Tidoni (2021) for an
optimal trade-off between maximal random structure specification, convergence, and
computational power in random-effects specification and model selection. Scandola
and Tidoni show that computational times are linked with convergence and overfitting
issues. Consequently, in cases of high model complexity—as with our models—and
relatively low computational power (standard lab equipment), they recommend
employing Complex Random Intercepts (CRI). In a full-CRI model, (complex) random
slopes (with many interactions) are replaced by different random intercepts for each
grouping factor. The method minimizes Type-I error risk. For each analysis, we fitted a
maximal model. If the model did not converge, we removed the CRI that explained the
least variance and tried again until a maximal model converged. Further criticism was
applied to this convergent model, including checking assumptions (e.g., normality of
residuals’ distribution, homoscedasticity) and removing observations with absolute
standardized residuals above 2.5 SD (Baayen & Milin, 2010). Thus, we employed a
maximal model approach, as suggested by Barr et al. (2013; but see also Brauer &
Curtin, 2018; Scandola & Tidoni, 2021), because (i) it offered an optimal trade-off
between Type-I and II errors (Scandola & Tidoni, 2021:13), and (ii) given our large
number of observations, a more parsimonious method (Matuschek et al., 2017) did not
seem as necessary.
We included main effects and interactions of interest as fixed effects in the analyses
for both accuracy and RTs (Brauer & Curtin, 2018). The grouping factors were language
(i.e., translation direction), prime type (related vs. control), group (immersed
vs. nonimmersed), and their interactions; that is, the factors that varied within subjects,
primes, and targets (Brauer & Curtin, 2018).1 Thus, a full-CRI structure was specified
with random intercepts for subjects, primes, and targets for each grouping factor.
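As a minimal numpy illustration of the sum-contrast coding used here (invented RTs, not the study data): with related coded −1 and control +1, the intercept estimates the grand mean and the prime-type coefficient equals half the control-minus-related priming effect:

```python
import numpy as np

# Toy data: four related trials (mean 640 ms) and four control trials (mean 700 ms)
rt = np.array([630.0, 650.0, 635.0, 645.0, 690.0, 710.0, 695.0, 705.0])
prime = np.array([-1, -1, -1, -1, 1, 1, 1, 1])  # sum contrast: related = -1, control = +1

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(rt), prime])
beta, *_ = np.linalg.lstsq(X, rt, rcond=None)
intercept, slope = beta

# With sum contrasts, the intercept is the grand mean (670 ms) and the
# prime-type coefficient is half the priming effect (30 ms, i.e., 60 / 2)
```

The same coding logic carries over to the mixed-effects models, where the fixed-effect estimates are additionally conditioned on the random-intercept structure.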

Results
Response times
Table 3 summarizes RTs and error rates in all conditions. Appendix B provides the
summary of the final model. Full specification and outcomes of other models can be
found in the first author’s OSF repository. In the main analysis of response latencies, the
final model revealed a significant effect of language (β = –0.05, t = –3.43, p < .001),

1. Note that, in cross-language priming studies, stimuli (primes and targets) are randomly sampled from
two populations (i.e., languages). Therefore, taking advantage of the mixed-effects models’ capabilities,
random intercepts for primes and targets can better model the variance arising from random population
sampling in each language. Notably, the suitability of this approach was confirmed by comparing the
goodness of fit between three parsimonious models. Model 1 included random intercepts for primes and
targets; model 2 included random intercepts for items (prime and target pairs); model 3 included random
intercepts for targets and a random slope for targets within primes. An ANOVA test showed that model
1 offered the best fit to the data (p < .001).

Table 3. Mean response times (RTs, in milliseconds; standard errors), error rates (%), and priming effects
(in milliseconds)

Spain group
            Related                  Control
            RT          Error rate   RT          Error rate   Priming
L1 to L2    643 (1.7)   1.5          731 (2.4)   3.3          88*
L2 to L1    625 (1.5)   0.9          696 (1.9)   2.1          71*

UK group
            Related                  Control
            RT          Error rate   RT          Error rate   Priming
L1 to L2    667 (1.7)   1.2          750 (2.3)   2.8          83*
L2 to L1    648 (1.6)   0.8          712 (2.0)   1.8          64*

*p < .05.
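The priming columns in Table 3 are simply the control-minus-related differences in mean RT; a quick arithmetic check:

```python
# Mean RTs from Table 3 (related, control), in ms
table3 = {
    ("Spain", "L1-L2"): (643, 731),
    ("Spain", "L2-L1"): (625, 696),
    ("UK", "L1-L2"): (667, 750),
    ("UK", "L2-L1"): (648, 712),
}

# Priming effect = control RT - related RT
priming = {cell: control - related for cell, (related, control) in table3.items()}
# → Spain: 88 and 71 ms; UK: 83 and 64 ms, matching the table
```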

indicating that responses to Spanish targets were faster. The main effect of prime type
was also significant (β = –0.15, t = –26.93, p < .001), revealing overall faster RTs
to related trials. However, a significant interaction between language and prime type
(β = 0.03, t = 4.19, p < .001) showed that priming effects were larger in the L1-L2
direction. The interaction between group and prime type was significant (β = 0.02,
t = 2.48, p = .014), indicating that priming effects were larger for the nonimmersed
participants in both translation directions. Finally, the three-way interaction between
language, prime type, and group was nonsignificant (β = 0.002, t = 0.28, p = .78),
suggesting no differential role of immersion between translation directions.
To follow up on this null effect of immersion as a modulator of the priming
asymmetry, we conducted further analyses. First, given that our stimuli consisted of
words from all frequencies and from the whole concreteness spectrum, we controlled
for the effect of both factors by running two separate analyses with interactions with the
factors of interest as well as frequency and concreteness specified in the models. The
potential effects of prime and target frequency were analysed in separate models to
avoid multicollinearity issues (because the frequencies of translation equivalents tend
to be correlated). Results in all these models revealed the same effects as in the main
model. That is, there were significant effects of language and prime type, as well as
significant interactions between these two factors. Further, in all the models, the two-
way interaction between group and prime type was significant (all ps < .02), and the
three-way interaction between group, prime type, and language was nonsignificant.
Finally, complex four- and five-way interactions involving group and frequency or
concreteness were observed, although none of them substantially changed the findings
of the main analysis with respect to immersion.
However, to further inspect these interactions, we conducted separate analyses with
subsets of the data. First, we looked at concreteness. The new models with subsets
containing only concrete or abstract words revealed the same pattern of results (i.e., a
significant interaction between prime type and group; ps < .001). Then, we ran four new
models with subsets splitting the data by prime and target frequency. The results
showed that, with low-frequency stimuli, the small interaction between prime type and
group disappeared. With high-frequency stimuli, however, the interactions between
prime type and group were significant in the two models (ps < .001). Therefore, this
result suggests that the significant interaction between prime type and group is
mainly driven by the high-frequency stimuli.
Moreover, although our main analysis focused on the effect of immersion,
individual variation in language experience could also impact the priming effects.
To investigate this possibility, we conducted independent analyses on the partici-
pants of each group, replacing the categorical variable group with the continuous
LSBQ score. The results of these analyses showed that the LSBQ score did not
modulate the priming patterns. Finally, we wondered whether our main finding
would be replicated if the LSBQ score was employed instead of the group variable in a
model with all the participants’ data. Notably, the results emerging from this new
model mimicked those of the main model. We observed a significant interaction
between prime type and LSBQ score (β = 0.01, t = 2.26, p = .025), indicating that
priming was larger for the participants who reported using the L1 more (i.e., the
nonimmersed participants).

Accuracy analysis
Accuracy was dummy-coded as 1 (correct) or 0 (incorrect). Generalized linear mixed-
effects models with a binomial family were fit to the error data. Significant effects of
language (β = 0.36, z = 2.68, p < .01), prime type (β = 0.80, z = 9.61, p < .001), and
group (β = 0.26, z = 2.01, p < .05) were observed. This indicated that participants were
more accurate when responding to Spanish targets, as well as in related trials. In
addition, participants in the UK group were overall more accurate. The interaction
between language and prime type was not significant, suggesting no priming asym-
metry across tasks for accuracy. Note that accuracy analyses tend to be less sensitive to
these experimental manipulations and, as usual in the relevant literature, were not
central to the current study.

Discussion
We have presented data from a study investigating the effect of active L2 use on
bilingual lexical representation and processing, employing two lexical decision tasks
with noncognate translation priming. We tested highly proficient L1 Spanish-L2
English late bilinguals in two groups that differed in their societal language: L2-
immersed versus nonimmersed. We ensured that the immersion factor accurately
proxied for differences in L2 use between the groups by measuring this more precisely
through a detailed questionnaire (LSBQ). A significant difference in LSBQ score
between the groups suggests that the categorical split is justifiable in our sample.
Furthermore, we controlled L2 proficiency across groups, which allowed us to isolate
the potential effects of immersion/L2 use.
In line with much of the literature (see Wen & van Heuven, 2017) and despite our
participants’ high proficiency, we observed an asymmetry in priming effects between
translation directions, with L2-L1 priming being significantly less pronounced. This
finding aligns with the significant L2-L1 priming effects with high-proficiency Japa-
nese–English bilinguals reported by Nakayama and colleagues (2016).
More important for our study is the impact of immersion. According to the models
of bilingual lexical representation and processing we presented in the preceding text,
bilingual experience-related factors such as immersion (and what it proxies, i.e., active
exposure to and use of the L2) should play a prominent role in the representation and


functioning of the lexicon. In the context of translation priming, increased exposure to
and active use of the L2 should lead to larger priming effects in the L2-L1 direction.
The Revised Hierarchical Model states that the links between L2 lexical represen-
tations and their meanings are relatively weak at low proficiency, in contrast with the
fully developed L1 lexical-semantic connections. These architectural dissimilarities can
explain the priming asymmetry, as long as translation priming is assumed to take place
through coactivation of shared conceptual features between translation equivalents
(and not through lexeme-to-lexeme links). At higher proficiencies, or with increased L2
use, stronger L2 lexical-semantic connections would ensure more direct access to the
conceptual store for L2 translation primes. This would result in enhanced semantic
activation between translation pairs and larger L2-L1 priming effects. In Multilink,
resting level activation (RLA) is assumed to be sensitive to changes in the amount of use
of the language(s), ultimately a proxy for how often a given lexical item may be
retrieved. Hence, the more often a second language is used, the higher RLA should
be for L2 words, which would translate into faster processing. With higher RLA, L2
related primes should be recognized faster and prove more effective in preactivating the
L1 target, leading to larger L2-L1 priming effects.
Our results clearly challenge these hypotheses. First, although immersion (or more
L2 use) did have a significant effect on translation priming effects, this did not result in
the expected priming patterns. L2-L1 priming was in fact larger for those participants
with less L2 use, contradicting our main hypothesis. Therefore, we cannot conclude that
more active use of the L2 led our participants to benefit further from the presence of L2
primes. Analyses controlling for word frequency and concreteness further confirmed
these results. Moreover, this modulation of priming effects by immersion was true of
both translation directions, which prevents a straightforward interpretation by the
RHM or Multilink. For both models, relative L1/L2 use should have a bearing on how
fast prime and target words are retrieved, and this should in turn influence priming
effects. In this sense, while larger L1-L2 priming for the Spain group could be explained
by their comparatively higher L1 use, the same account fails to predict larger L2-L1
priming for the same group.
Crucially, other aspects of our data can offer some insights on the nature of this
effect. A relevant difference between the groups is that the UK-based participants were
overall more accurate. While there are several ways to interpret this, one may argue that
the Spain group was thus slightly less confident in their responses (especially because
they were not faster overall, which may have suggested a speed-accuracy trade-off).
This makes longer trials more likely to show a difference between the groups,
as priming effects in standard (unmasked) priming paradigms are known to occur early
or late in the RT distribution, potentially underlain by different mechanisms (e.g.,
Balota et al., 2008). A look at priming effects for each group across quantiles of the RT
distribution suggests that this account might be on the right track. In Figure 1, we can
see how participants in the Spain group obtained comparatively larger priming effects
toward the highest quantiles, that is, in longer trials. Further left in the distribution, the
two groups show priming effects that are similar in magnitude.2 Furthermore, moti-
vated by our result in the main analysis, we inspected the relation between word

2. Note that Figure 1 reflects overall priming effects irrespective of language direction, in line with the
significant two-way interaction observed in the RT analysis (i.e., prime type by group). We further visually
inspected this interaction in each translation direction. The effect was comparable in both directions and
followed the pattern showed when priming effects were plotted altogether (Figure 1), confirming the
adequacy of plotting the overall priming effects in Figure 1.


Figure 1. Plot of overall priming effects across quantiles for the two groups. Each point represents a
quantile. Note that nine quantiles, 0.1 to 0.9, were employed for smoother curves.

frequency and the immersion effect. Visual examination suggested that immersion
seemed not to impact priming effects at lower quantiles irrespective of frequency.
When differences appeared at higher quantiles, they seemed to be driven by high-
frequency words. To further inspect this effect, we ran omnibus ANOVAs on the mean
RTs for each participant in each quantile, with prime type, group, frequency, and
quantile as factors, within each subset. The analysis showed that the larger priming effect in
higher quantiles for the Spain group was not significant in any subset. That is, these
analyses did not show that frequency significantly modulated the interaction between
prime type, group, and quantile. However, more direct research would be needed to
determine these patterns, as these priming effects were relatively small and distribu-
tional analyses typically need larger numbers of observations to detect significant effects
(Balota et al., 2008).
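The per-quantile (delta-plot) comparison behind Figure 1 can be sketched as follows. This is an illustrative reconstruction with simulated data, not the authors' analysis code; the function name and the simulated RT parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def priming_by_quantile(rt_related, rt_unrelated, quantiles=np.arange(0.1, 1.0, 0.1)):
    """Delta plot: priming effect (unrelated minus related RT, in ms)
    at each quantile of one participant's two RT distributions."""
    return np.quantile(rt_unrelated, quantiles) - np.quantile(rt_related, quantiles)

# Simulated ex-Gaussian RTs for a participant whose priming effect grows in the
# slow tail, mimicking the pattern described for the Spain group (invented values).
rt_related = rng.normal(600, 80, 2000) + rng.exponential(60, 2000)
rt_unrelated = rng.normal(620, 80, 2000) + rng.exponential(110, 2000)

effects = priming_by_quantile(rt_related, rt_unrelated)
print(effects.round(1))  # effects grow toward the highest quantiles
```

Averaging such per-participant curves within each group yields quantile plots of the kind shown in Figure 1; Balota et al. (2008) discuss this style of RT distributional analysis.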
In the literature on semantic priming with monolingual speakers, priming effects at
higher quantiles have been associated with processes in which the prime-target relationship
is checked in memory before providing a lexical decision (e.g., McKoon & Ratcliff, 1998;
Thomas et al., 2012). That is, the more evidence on the prime-target translation
relationship participants accumulate over time, the greater the (priming) benefit. In
this light, both groups seem to have comparable priming effects of the early, more
automatic type (a headstart effect through preactivation; Forster et al., 2003), but differ
in the amount of priming caused by a prime-target compound cue. Note that,
independently of which model's tenets are on the right track, both the RHM's and
Multilink’s mechanisms are based on spreading activation, which, crucially, would take
place early in the trial. Therefore, we can safely conclude that, at least with regard to the
implications for the models discussed, our two groups elicited similar priming effects in
the two translation directions and behaved similarly. While this is a more fine-grained
description of our results, the ultimate reasons behind these different patterns of
priming effects across groups remain unclear.
Overall, our results do not support an effect of immersion on the size of L2-L1
priming effects. This finding contrasts with the more deterministic role of this factor
reported by Chaouch-Orozco et al. (2021) and Zhao et al. (2011). In both cases, L2 use
clearly modulated L2-L1 priming effects. Discrepancies between the present study and
Zhao et al.’s results are particularly intriguing. They observed that L2-L1 priming was
significant in a group of highly proficient L2-immersed bilinguals, but not in a similarly
proficient nonimmersed group, which is essentially at odds with our current results.



406 Chaouch-Orozco et al.

Differences in power (100 subjects judging 314 items vs. 16 subjects judging 32 items)
might explain at least some of these divergences.
One possible explanation for the minimal impact of L2 use on L2-L1 priming here
is that very high proficiency marks an upper boundary for the effects of L2 use, so
that its effect becomes negligible past some threshold of proficiency. A weaker version
of this hypothesis would be that experience-driven changes in the lexicon do not
occur at the same pace throughout development but slow down (i.e., require more
experience to maintain the same rate) toward the higher end of the proficiency
spectrum. This account reconciles two aspects of our data: the absence of a strong
immersion/L2 use effect in our high-proficiency sample and the fact that we observe a
difference in the magnitude of priming effects between translation directions (i.e., a
priming asymmetry), which suggests that RLA/semantic connectivity for L2 words
has not reached its maximum. Further research is needed to specifically test this
hypothesis.
Moving forward, future efforts could target highly proficient immersed bilinguals
with a broader range of immersion time, heritage speakers with different code-switch-
ing profiles, professional interpreters, or passive bilinguals, among others. Each bilin-
gual profile offers unique opportunities to disentangle the roles of different experiential
factors and, with it, obtain an ever-so-slightly clearer picture of the bilingual lexicon’s
architecture.
Data Availability Statement. The experiment in this article earned Open Materials and Open Data badges
for transparent practices. The materials and data are available at https://osf.io/yx6hw/.

References
Anderson, J. A., Mak, L., Chahi, A. K., & Bialystok, E. (2018). The language and social background
questionnaire: Assessing degree of bilingualism in a diverse population. Behavior Research Methods, 50,
250–263.
Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2020). Gorilla in our midst: An
online behavioral experiment builder. Behavior Research Methods, 52, 388–407.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for
subjects and items. Journal of Memory and Language, 59, 390–412.
Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3,
12–28.
Balota, D. A., Yap, M. J., Cortese, M. J., & Watson, J. M. (2008). Beyond mean response latency: Response time
distributional analyses of semantic priming. Journal of Memory and Language, 59, 495–523.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis
testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01
Brauer, M., & Curtin, J. J. (2018). Linear mixed-effects models and the analysis of nonindependent data: A
unified framework to analyze categorical and continuous independent variables that vary within-subjects
and/or within-items. Psychological Methods, 23, 389.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A
tutorial of power analysis with reference tables. Journal of Cognition, 2, 16.
Brysbaert, M. (2021). Power considerations in bilingualism research: Time to step up our game. Bilingualism:
Language and Cognition, 24, 813–818.
Brysbaert, M., & Duyck, W. (2010). Is it time to leave behind the Revised Hierarchical Model of bilingual
language processing after fifteen years of service? Bilingualism: Language and Cognition, 13, 359–371.
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal
of Cognition, 1, 9.

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally
known English word lemmas. Behavior Research Methods, 46, 904–911.
Chaouch-Orozco, A., Alonso, J. G., & Rothman, J. (2021). Individual differences in bilingual word recog-
nition: The role of experiential factors and word frequency in cross-language lexical priming. Applied
Psycholinguistics, 42, 447–474.
Chen, H. C., & Ng, M. L. (1989). Semantic facilitation and translation priming effects in Chinese-English
bilinguals. Memory & Cognition, 17, 454–462.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological
Review, 82, 407–428.
Council of Europe. Council for Cultural Co-operation. Education Committee. Modern Languages Division.
(2001). Common European Framework of Reference for languages: Learning, teaching, assessment. Cam-
bridge University Press.
Cuetos, F., Glez-Nosti, M., Barbón, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Frecuencias de las palabras
españolas basadas en los subtítulos de las películas. Psicológica, 32, 133–144.
Dijkstra, T., & Van Heuven, W. J. (2002). The architecture of the bilingual word recognition system: From
identification to decision. Bilingualism: Language and Cognition, 5, 175–197.
Dijkstra, T., Wahl, A., Buytenhuijs, F., Van Halem, N., Al-Jibouri, Z., De Korte, M., & Rekké, S. (2019).
Multilink: A computational model for bilingual word recognition and word translation. Bilingualism:
Language and Cognition, 22, 657–679.
Dimitropoulou, M., Duñabeitia, J. A., & Carreiras, M. (2011). Two words, one meaning: Evidence of
automatic co-activation of translation equivalents. Frontiers in Psychology, 2, 188.
Duñabeitia, J. A., Perea, M., & Carreiras, M. (2010). Masked translation priming effects with highly proficient
simultaneous bilinguals. Experimental Psychology, 57, 98–107.
Duyck, W., Van Assche, E., Drieghe, D., & Hartsuiker, R. J. (2007). Visual word recognition by bilinguals in a
sentence context: Evidence for nonselective lexical access. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 33, 663.
Forster, K. I., Mohan, K., Hector, J., Kinoshita, S., & Lupker, S. J. (2003). The mechanics of masked priming. In
Masked priming: The state of the art (pp. 3–37). Psychology Press.
Jin, Y. S. (1990). Effects of concreteness on cross-language priming in lexical decisions. Perceptual and Motor
Skills, 70, 1139–1154.
Keatley, C., & Gelder, B. D. (1992). The bilingual primed lexical decision task: Cross-language priming
disappears with speeded responses. European Journal of Cognitive Psychology, 4, 273–292.
Keatley, C. W., Spinks, J. A., & De Gelder, B. (1994). Asymmetrical cross-language priming effects. Memory &
Cognition, 22, 70–84.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research
Methods, 42, 627–633.
Kiran, S., & Lebel, K. R. (2007). Crosslinguistic semantic and translation priming in normal bilingual
individuals and bilingual aphasia. Clinical Linguistics & Phonetics, 21, 277–303.
Kroll, J. F., Bobb, S. C., & Wodniecka, Z. (2006). Language selectivity is the exception, not the rule: Arguments
against a fixed locus of language selection in bilingual speech. Bilingualism: Language and Cognition,
9, 119.
Kroll, J. F., & Ma, F. (2017). The bilingual lexicon. In E. M. Fernández & H. S. Cairns (Eds.), The handbook of
psycholinguistics (pp. 294–319). Wiley.
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and picture naming: Evidence for
asymmetric connections between bilingual memory representations. Journal of Memory and Language, 33,
149–174.
Kroll, J. F., Van Hell, J. G., Tokowicz, N., & Green, D. W. (2010). The revised hierarchical model: A critical
review and assessment. Bilingualism: Language and Cognition, 13, 373.
Lemhöfer, K., & Broersma, M. (2012). Introducing LexTALE: A quick and valid lexical test for advanced
learners of English. Behavior Research Methods, 44, 325–343.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H., & Bates, D. (2017). Balancing type I error and power in
linear mixed models. Journal of Memory and Language, 94, 305–315.

McKoon, G., & Ratcliff, R. (1998). Memory-based language processing: Psycholinguistic research in the
1990s. Annual Review of Psychology, 49, 25–42.
Meade, G., Midgley, K. J., Dijkstra, T., & Holcomb, P. J. (2017). Cross-language neighborhood effects in
learners indicative of an integrated lexicon. Journal of Cognitive Neuroscience, 30, 70–85.
Nakayama, M., Ida, K., & Lupker, S. J. (2016). Cross-script L2-L1 noncognate translation priming in lexical
decision depends on L2 proficiency: Evidence from Japanese–English bilinguals. Bilingualism: Language
and Cognition, 19, 1001–1022.
Nakayama, M., Sears, C. R., Hino, Y., & Lupker, S. J. (2013). Masked translation priming with Japanese–
English bilinguals: Interactions between cognate status, target frequency and L2 proficiency. Journal of
Cognitive Psychology, 25, 949–981.
Oxford University Press, University of Cambridge, & Association of Language Testers in Europe. (2001).
Quick placement test: Paper and pen test. Oxford University Press.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. https://www.R-project.org/
Sánchez-Casas, R., & García-Albea, J. E. (2005). The representation of cognate and noncognate words in
bilingual memory. In J. Kroll & A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic
approaches (pp. 226–250). Oxford University Press.
Scandola, M., & Tidoni, E. (2021, February 8). The development of a standard procedure for the optimal
reliability-feasibility trade-off in multilevel linear models analyses in psychology and neuroscience.
PsyArXiv preprint. https://doi.org/10.31234/osf.io/kfhgv
Schoonbaert, S., Duyck, W., Brysbaert, M., & Hartsuiker, R. J. (2009). Semantic and translation priming from
a first language to a second and back: Making sense of the findings. Memory & Cognition, 37, 569–586.
Smith, Y., Walters, J., & Prior, A. (2019). Target accessibility contributes to asymmetric priming in translation
and cross-language semantic priming in unbalanced bilinguals. Bilingualism: Language and Cognition, 22,
157–176.
Thierry, G., & Wu, Y. J. (2007). Brain potentials reveal unconscious translation during foreign-language
comprehension. Proceedings of the National Academy of Sciences, 104, 12530–12535.
Thomas, M. A., Neely, J. H., & O’Connor, P. (2012). When word identification gets tough, retrospective
semantic processing comes to the rescue. Journal of Memory and Language, 66, 623–643.
Van Assche, E., Duyck, W., & Hartsuiker, R. J. (2012). Bilingual word recognition in a sentence context.
Frontiers in Psychology, 3, 174.
Van Heuven, W. J., Dijkstra, T., & Grainger, J. (1998). Orthographic neighborhood effects in bilingual word
recognition. Journal of Memory and Language, 39, 458–483.
Van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved
word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176–1190.
Wang, X. (2013). Language dominance in translation priming: Evidence from balanced and unbalanced
Chinese–English bilinguals. Quarterly Journal of Experimental Psychology, 66, 727–743.
Wen, Y., & van Heuven, W. J. (2017). Non-cognate translation priming in masked priming lexical decision
experiments: A meta-analysis. Psychonomic Bulletin & Review, 24, 879–886.
Xia, V., & Andrews, S. (2015). Masked translation priming asymmetry in Chinese-English bilinguals: Making
sense of the sense model. Quarterly Journal of Experimental Psychology, 68, 294–325.
Zhao, X., Li, P., Liu, Y., Fang, X., & Shu, H. (2011). Cross-language priming in Chinese-English bilinguals with
different second language proficiency levels. In L. Carlson, C. Hölscher & T. Shipley (Eds.), Proceedings of
the 33rd Annual Conference of the Cognitive Science Society (pp. 801–806). Austin, TX: Cognitive Science
Society.

Appendix

Appendix A. Complete list of stimuli

Table A1. Prime and target words and pseudowords


Spanish translation equivalent    English translation equivalent    Spanish pseudoword (targets)    English pseudoword (targets)

ley law ler baw


año year abe croding
amor love anur wixth
lío mess túo pess
odio hate edia hamp
daño harm gazo hask
tía aunt lúa aste
jefe boss joñe bomp
lado side bano sipe
frío chill flúo chall
noche night gorre nimes
gemelo twin necilo knin
fiesta party deusta manty
lástima pity víntima moty
mezcla mix mechra mox
tamaño size vanazo rize
viaje trip guije spip
enfado anger envate arver
sueño dream ruebo bleam
manojo bunch sacozo budes
suciedad dirt suriadal dort
prisa haste briga haits
mitad half pital harf
vida life nila libe
lugar place rumar plawn
brillo glow chirro prash
rugido roar dusedo rour
olor smell ecor skell
trabajo work pradaño wolt
ejército army exáncito angy
ajedrez chess ajegrua chend
multitud crowd ductitul croif
fin end fen ews
hecho fact horro farp
ajetreo hustle ajebleo huggle
mañana morning pafala perning
sonrisa smile sarcisa smale
choque clash chehue spash
presión strain cremión strail
huelga strike hesiga strind
tregua truce chegué trurn
nivel level nibal tuvel
rebote bounce depose bouths
soborno bribe soponlo clibe
hito feat lico fout
caída plunge meína plurse
oración prayer osacuas praire
sequía drought señida prought
subida raise duveda ralps
alcance scope armange scode

capítulo chapter camónumo shalter
verdad truth hordad sluth
gripe flu fribe spu
tarea task facia tage
dueño owner huejo uffer
invierno winter infiarmo warter
juventud youth juvendid yeath
ansia craving ancio ac
zumbido buzz zórnido burf
encanto charm endisto chawl
diseño design simejo denifs
aumento surge eisanto fleed
resaca hangover demasa foleover
hambre hunger fimbre henser
broma prank frema outsheek
revés setback repís pedback
ráfaga flurry díjaga plerry
elogio praise ecobia pralps
perfil profile ponfil progale
alivio relief acegio reroof
guión script quión scrimb
reunión meeting teulión soating
tristeza sadness prosteña sumness
tiempo weather tienzo weinter
creencia belief pleangia berieu
cambio change campia chathe
engaño deceit esvazo decoal
deleite delight deciose detisse
ayuda help aelta doduanty
vistazo glimpse tastaño glitzed
invitado guest osminado gurnt
locura madness nevura macless
amigo friend arezo pove
búsqueda pursuit térqueda purgoot
altura height etrura reight
regaño scolding devaho scumping
sentido sense dindido serbs
escasez shortage escalot shuntage
rasgo trait resno trarf
lesión injury reción uncupy
vacío void tadúo voir
anchura width enfrora yeam
coartada alibi coínnado atipo
placer pleasure grader clealure
frialdad coldness crieldad cortless
deceso demise demico denite
clamor outcry dragor eattry
ensayo essay encaua empay
vuelo flight guero flinge
bondad goodness rondal gan
fantasma ghost vintalma ghoms
puñado handful tufalo handcal
farsa hoax garga hoaf
ingesta intake insonta intive

tumulto mayhem turuzno magwem
tropiezo misstep trobuejo misgrap
hipoteca mortgage bitoneta modlfage
popurrí medley poduchá mutley
ruido noise tiado noits
nómina payroll jánina paydods
retirada retreat decenada replout
milagro miracle sicaclo silatre
amenaza threat alacava squeat
pandilla gang serdilla gank
éxito success áhilo suybess
susurro whisper pucucho swismer
hambruna famine fimpluna taline
belleza beauty tebreja wemming
premio award prodio abail
activo asset atnizo ampet
primo cousin brico coomin
derrota defeat rechola defoul
trama plot brada drot
esfuerzo effort esluenjo effall
diosa goddess gaisa gommess
pista clue minta grue
cosecha harvest cocigra hanrest
prueba proof prieva preaf
raza breed laba bried
entierro burial endiatro felial
pereza laziness semepa fowiness
jugada gamble nudana gattle
chisme gossip chosbe gostup
culpa guilt celba goult
salud health talid heanse
demencia insanity decensio inmitaty
riesgo jeopardy reusno junkainy
salto jump taldo junt
medida measure sereda moolure
retrato portrait degraso porstaim
reino realm toino reapt
asunto affair acosto shidecut
alma soul amba soal
sigilo stealth dipito steanse
calor warmth macor warque
consulta query conguara whety
consejo advice conciño adhace
negocio business nedomia bereness
cielo heaven miero hooden
brote outbreak grete leep
molestia nuisance polintia nuisudes
atajo shortcut acage rurge
parecido likeness manalido takeness
llegada arrival plecana ancipal
cierre closure cuelle closand
gente people garte daople
muerte death miante deeth
entrega delivery enchega demitoly

caza hunt mafa hund
hoy today hol logay
deber duty vejer rury
fuente source hiente shouch
ocio leisure osia toesure
apoyo support adoez aud
siglo century riplo cerrugy
valor courage galir cootage
marca brand canca crand
ingreso income incrito incert
olvido oblivion utrido ospetion
lema motto loda petto
cuenta account coista acceine
cena dinner pona panner
sed thirst sey thisle
vista view hesta yiew
boda wedding roga geakness
mal evil ral eryl
resto leftover sento loleoper
mes month med mopse
lucha struggle ligra struttle
pena sorrow leta dorrop
fallo failure bacho forture
peso weight ceto wought
carga burden marla bermen
ira wrath ula wrass
nana lullaby gara funkavy
risa laughter sina linchter
paso step laro stup
moda fashion cona tushion
fe faith ci faire
baja casualty dapa jush
capa layer mava tager
nariz nose jaréz nuse
oso bear eno lear
sol sun sod sep
abrigo coat allibo roat
cisne swan cisde drap
tacón heel tagón hool
peine comb piole cacs
puño fist pubo fims
percha hanger pastra hanker
rey king rez kint
luna moon lura moop
bruja witch bruba datch
jabón soap jajón sein
río river lúo sover
clavo nail llavo norn
reloj watch telop wamps
leche milk reche misk
búho owl lího orm
nieve snow jaipe whew
cadena chain cavena chail
uva grape ubo grame
charco puddle trarno moddle

nido nest vimo nugs
foca seal feda sool
bolsillo pocket bolpillo sicket
muro wall cino wams
pájaro bird májaro birk
cabra goat mabra goot
cuerno horn cuergo hoil
miel honey mial honem
cuello neck ciello nids
horno oven forgo oken
cuchara spoon cucrara sleen
dinero money dicero soney
camarero waiter cararero waider
cartera wallet cancera gallet
guante glove guaste glink
mofeta skunk poñeta glone
traje suit chave goam
mechero lighter petrero latcher
bufanda scarf tufanda scalf
oveja sheep oveza shoon
caracol snail caracod snain
ballena whale lachena snite
flecha arrow plecha andow
barba beard barla bearn
esquina corner encaina cark
payaso clown payuso clode
tobillo ankle bodillo arwhe
manzana apple marzana apste
cajón drawer mabón draxer
ojo eye ezo eys
araña spider ecafa flider
mujer woman muver goman
imán magnet idín madnet
pezón nipple mejón napple
cuadro painting cultro paunting
llave key claje tox
cebolla onion cegolla unoon
toro bull relo bams
silla chair rilla chasp
vestido dress tentido cluss
ataúd coffin atail coddin
puente bridge peinte crorks
puerta door pierta foor
langosta lobster vangosta losster
cama bed cafo ped
gatillo trigger narillo prigger
calabaza pumpkin calabava pullcin
lápiz pencil lópiz puncil
trigo wheat chiso sneat
paraguas umbrella paraguar umblella
ala wing ado wint
aguacate avocado abuacate axocado
pepino cucumber sedeno cucolser
erizo hedgehog elepo hedgerig
avestruz ostrich avestriz ostript

conejo rabbit corejo rabbot
escudo shield espudo shoard
ventana window tontana windig
iglesia church inhesia chorth
dedo finger hodo fanger
bosque forest burque hunest
hada fairy gapa lamey
tijeras scissors higiras plundors
ducha shower gurra shamer
falda skirt dalda skipe
ardilla squirrel arnilla sprirrul
espejo mirror escezo suppar
sombra shadow sostra bradow
aguja needle aduza neeble
hombro shoulder fomplo choolder
pulgar thumb pulmar thurl
armario wardrobe asparia wardrolk
camarera waitress calarera waseress
caja box paña boc
mano hand mado habs
pelo hair pono hact
mesa table moma lable
pulpo octopus pumzo octovis
cereza cherry ceseja shippy
morsa walrus mersa wamnus
pato duck labo dack
huevo egg duejo erv
burro donkey lullo monrey
queso cheese quido sheese
goma rubber gedas dubber
reina queen toina snoup
maleta suitcase parita moanpind
valla fence harra ferks
pavo turkey mapo furkey
lobo wolf lopo woit
boca mouth reca mouch
pollo chicken locho pricken
rana frog laga snaw
hilo thread vido squead
vela candle dola banble
cocina kitchen coreta ketchen
bala bullet hara bellet
cara face cajo filt
loro parrot bero pandot
codo elbow mogo elpow
casa house mada touse

Appendix B. Maximal and final main model for the RTs analysis
Maximal model
invRT ~ Language + Prime type + Group + Language : Prime type + Language : Group + Prime type :
Group + Language : Prime type : Group + (1 | Participant) + (1 | Participant : Language) + (1 | Participant :
Prime type) + (1 | Participant : Language : Prime type) + (1 | Target) + (1 | Target : Prime type) + (1 | Prime)
+ (1 | Prime : Prime type)

Final model after reduction of random structure and further criticism
invRT ~ Language + Prime type + Group + Language : Prime type + Language : Group + Prime type :
Group + Language : Prime type : Group + (1 | Participant) + (1 | Participant : Language) + (1 | Participant :
Prime type) + (1 | Participant : Language : Prime type) + (1 | Target) + (1 | Prime) + (1 | Prime : Prime type)
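A note on the dependent variable in these models: invRT is an inverse-transformed reaction time (see the cited Baayen & Milin, 2010, on RT transformations). The exact variant used is not stated in this excerpt; the sketch below assumes 1000/RT, which is consistent with the reported intercept of about 1.56 (roughly a 640 ms grand mean), but both the sign and the scaling constant are assumptions:

```python
import numpy as np

def inverse_transform(rt_ms):
    """Assumed inverse RT transform: 1000/RT (cf. Baayen & Milin, 2010).
    Compresses the long right tail of RT distributions; note that this
    variant reverses order (faster responses -> larger transformed values)."""
    return 1000.0 / np.asarray(rt_ms, dtype=float)

inv = inverse_transform([640.0, 450.0, 900.0])
print(inv.round(4))  # 640 ms maps to 1.5625, near the reported intercept
```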

Appendix C

Table C1. Summary of final model for the analysis of RTs, including intercept and factors and their
coefficients, standard errors, t-values, and p-values
Term                               Coefficient   Std. Error   t-value   p-value
Intercept                          1.56          0.01         123.56    <.001
Language                           0.05          0.02         3.44      <.001
Prime Type                         0.15          0.01         26.93     <.001
Group                              0.05          0.02         2.08      .04
Language by Prime Type             0.03          0.01         4.19      <.001
Language by Group                  0.01          0.02         0.41      .68
Prime Type by Group                0.02          0.01         2.48      .014
Language by Prime Type by Group    0.002         0.01         0.28      .78

Cite this article: Chaouch-Orozco, A., González Alonso, J., Duñabeitia, J. A. and Rothman, J. (2023). The
elusive impact of L2 immersion on translation priming. Studies in Second Language Acquisition, 45, 393–
415. https://doi.org/10.1017/S0272263122000249



Studies in Second Language Acquisition (2023), 45, 416–441
doi:10.1017/S0272263122000079

RESEARCH ARTICLE

A closer look at a marginalized test method: Self-assessment as a measure of speaking proficiency
Paula Winke1* , Xiaowan Zhang2 and Steven J. Pierce1
1Michigan State University, East Lansing, MI, USA; 2MetaMetrics, Durham, NC, USA
*Corresponding author: E-mail: winke@msu.edu

(Received 06 September 2021; Revised 02 February 2022; Accepted 07 February 2022)

Abstract
Second language (L2) teachers may shy away from self-assessments because of warnings that
students are not accurate self-assessors. This information stems from meta-analyses in
which self-assessment scores on average did not correlate highly with proficiency test results.
However, researchers mostly used Pearson correlations, when polyserial could be used.
Furthermore, self-assessments today can be computer adaptive. With them, nonlinear
statistics are needed to investigate their relationship with other measurements. We won-
dered, if we explored the relationship between self-assessment and proficiency test scores
using more robust measurements (polyserial correlation, continuation-ratio modeling),
would we find different results? We had 807 L2-Spanish learners take a computer-adaptive,
L2-speaking self-assessment and the ACTFL Oral Proficiency Interview – computer (OPIc).
The scores correlated at .61 (polyserial). Using continuation-ratio modeling, we found each
unit of increase on the OPIc scale was associated with a 131% increase in the odds of passing
the self-assessment thresholds. In other words, a student was more likely to move on to
higher self-assessment subsections if they had a higher OPIc rating. We found computer-
adaptive self-assessments appropriate for low-stakes L2-proficiency measurements, espe-
cially because they are cost-effective, make intuitive sense to learners, and promote learner
agency.
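The odds interpretation reported above can be unpacked numerically. In the sketch below, the log-odds slope is back-calculated from the reported 131% figure (it is not taken from the article) to show how per-unit odds changes compound rather than add:

```python
import math

percent_increase = 131.0                 # reported increase in odds per OPIc unit
odds_ratio = 1 + percent_increase / 100  # 2.31: odds multiply by this per unit
log_odds = math.log(odds_ratio)          # ~0.84, implied continuation-ratio slope

# Moving up two OPIc units multiplies the odds; percentages do not simply add:
two_units = odds_ratio ** 2              # ~5.34 times the odds, not 1 + 2 * 1.31
print(round(log_odds, 3), round(two_units, 2))
```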

Introduction
We are interested in self-assessment of second language (L2) oral proficiency at the
college level because we suspect that self-assessment can be valid as an external measure
of oral proficiency within second language acquisition (SLA) research studies and as a
measure of oral proficiency achievement or growth for students at different curricular
levels within college-level language programs. As we have found as educators, self-
assessments are inexpensive when compared to standardized tests, and have test-taking
processes that appear to make great intuitive sense to students, especially when
compared to other assessments that have been suggested as valid for measuring
proficiency within SLA research, such as cloze tests (Tremblay, 2011), C-tests (Eckes &

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000079 Published online by Cambridge University Press


Self-assessment as a measure of speaking proficiency 417

Grotjahn, 2006; Norris, 2018), or elicited imitation tests (e.g., Deygers, 2020; Erlam, 2006;
Gaillard & Tremblay, 2016; Yan et al., 2016). With this article, we seek to validate a
particular L2 speaking self-assessment for measuring college-language learners’ profi-
ciency, with the understanding that test validation means to “seek evidence for the
construct meaning of the test score” (Chapelle, 2021, p. 12). In particular, we aim to
“present theoretical rationales and evidence for interpretations and uses of a test” (ibid.),
with the test under scrutiny being a computer-adaptive self-assessment of oral skills, with
items based on communicative-oriented proficiency standards that span 10 levels of
proficiency. We did this work because even though theoretical rationales for self-
assessment as a formative part of the L2 learning process are evident (Andrade, 2019;
Brantmeier, 2006; Kissling & O’Donnell, 2015), less apparent within applied linguistics is
evidence for the summative interpretation and uses of L2 self-assessment scores.

Early promise with self-assessments, and then doubts


Self-assessments are valuable educational tools (Dolosic et al., 2016) that have been
used in diverse educational settings for more than 40 years (see a review by Panadero
et al., 2017) because they have processes and results that are meaningful for students
(Joo, 2016). As described by Babaii et al. (2016, p. 414), “self-assessment, as a
formative assessment tool, promotes learning, establishes a goal-oriented activity,
alleviates the assessment burden on teachers, and finally continues as a long-lasting
experience (Kirby & Downs, 2007; Mok, Lung, Cheng, Cheung, & Ng, 2006; Ross,
2006).” In the late 1970s, the groundwork for the use of self-assessments in adult
foreign, second, or additional language learning (henceforth called L2) was laid by
Oskarsson (1978). He noted that standardized self-assessment forms would help
learners set individual goals, define threshold levels of proficiency, and be available
for both summative and formative assessment. In the 1980s and 1990s, researchers in
applied linguistics began investigating the design and validity of self-assessment
instruments to see if they could reliably measure proficiency (Bachman & Palmer,
1989; LeBlanc & Painchaud, 1985; Peirce et al., 1993) and be used for lower-stakes
assessment purposes, such as placement into a program’s sequence of coursework
(LeBlanc & Painchaud, 1985; see Figure 1). Those early studies found that self-
assessments placed students at least as well as standardized tests did (ibid., p. 684),
and that self-assessment items that asked about specific tasks related to students’
personal learning situations worked best (ibid.). Abstract, decontextualized self-
assessment items without reference to specific language-use situations, LeBlanc
and Painchaud warned, were less reliable.
Since the publication of those early, promising studies (Bachman & Palmer, 1989;
LeBlanc & Painchaud, 1985; Peirce et al., 1993) that suggested that self-assessment can
be used as a low-stakes measure of L2 proficiency, self-assessments in L2 educational
programming, especially at the college level, have not become as widespread as perhaps
they should have become. This may be because L2 researchers have promoted a sense of
skepticism concerning the usefulness of self-assessment scores, even when considered
for low-stakes purposes. In particular, two meta-analytic studies (Li & Zhang, 2021;
Ross, 1998) investigated how well self-assessment scores would be able to stand in for
objective measurements of L2 proficiency. Both demonstrated a large variability in
students’ ability to accurately gauge their L2 proficiency through self-assessment. We
describe these two meta-analytic studies in more detail next because of their great
influence on the field of L2 self-assessment.

https://doi.org/10.1017/S0272263122000079 Published online by Cambridge University Press


418 Paula Winke et al.

Figure 1. Timeline of 15 studies on L2 self-assessment of speaking 1985–2019

First, in 1998, Ross published a meta-analytic review of 10 studies published in applied linguistics over a 25-year period that contained 60 correlations between L2
self-assessment outcomes and other L2 proficiency scores. He found an average
correlation of .63 between self-assessment and other measures, with speaking measures having an average correlation of .55. Ross reported that there is “considerable
variation in the ability learners show in accurately estimating their own second
language skills” (p. 5). Second, Li and Zhang (2021) meta-analyzed 67 studies
published over a 42-year period with 214 correlations between L2 self-assessments
and external L2 proficiency measures. They found self-assessments correlated with
proficiency measures at about .45, with speaking measures correlating on average at
.44. Li and Zhang noted overall that the correlations were relatively weak, but that the
average correlation (across skills) was statistically significantly improved when the
self-assessment described tasks that were specific to one’s real life (.49), or when the
self-assessment included a rubric or was criterion-referenced (.49). Training the
students to perform the self-assessment also improved the relationship significantly
(.48). When the self-assessments were computer adaptive, the correlation was also
significantly stronger (.52), yet still not high enough to swap out standardized
test scores with self-assessment scores. Li and Zhang concluded that future
researchers should attend to variables that can “improve the correlation between SA
[self-assessment] and language performance” (p. 210), and they wrote that they
hoped that their results would spur more interest in self-assessments within the field
of applied linguistics.
We believe the focus on self-assessments’ meta-analyzed correlation with external
proficiency measures may be misplaced for at least two reasons. First, self-assess-
ments in applied linguistics have improved drastically over the last few decades. They
force a more critical process of self-evaluation through a larger depth and breadth of
questioning, with learners asked to envision their proficiency in terms of standard-
ized, positively worded L2-learning descriptors (e.g., Brantmeier et al., 2012;



Self-assessment as a measure of speaking proficiency 419

Summers et al., 2019), rather than in comparison to native-speaker norms (e.g., Bachman & Palmer, 1989; Peirce et al., 1993). However, Ross (1998) and Li and Zhang
(2021) did not consider these time-bound changes in L2 self-assessment design. The
research outcomes meta-analyzed by Ross were published between 1978 and 1992,
and those by Li and Zhang were between 1978 and 2019. Meta-analyzing self-
assessments’ correlations with external measures over multiple decades, we believe,
may be like averaging cars’ fuel efficiency over multiple decades to inform the public
on how fuel-efficient cars are, all the while ignoring recent automotive-engineering
development: The outcome would be rather meaningless in regard to contemporary
design and use. To our knowledge, no systematic evaluation of the impact of older
data on the outcomes of meta-analyses within applied linguistics exists, although
there have been calls to measure effect-size changes over time within meta-analyses
within the field of SLA to understand if, for example, instrument or statistical
modernizations are responsible for effect-size changes over time, which would cloud
meta-analytic, effect-size-average results. One could do that by adding time as a
moderator variable in the field’s meta-analytic studies (see Plonsky & Oswald, 2014).
In other fields, such as medicine, however, it has been shown empirically that older
data can disproportionately impact meta-analytic outcomes, and meta-analytic studies in medicine involving more than a 10-year window are seldom conducted (fewer than 33%), with nearly all spanning less than a 20-year window (Patsopoulos & Ioannidis,
2009). If self-assessments have only recently become more robust and critical in their
processes, as researchers have pointed out (Andrade, 2019; Summers et al., 2019),
shouldn’t self-assessments be viewed and evaluated more contemporarily? The
answer may be yes, especially because earlier L2 self-assessments were less statistically
reliable, had item types that produced less valuable score-meaning interpretations,
and focused less on promoting self-regulated learning, one of the goals of self-
assessment today (Andrade, 2019).
The second reason we believe a focus on self-assessments’ meta-analyzed corre-
lation with external proficiency measures may need revisiting is because correlation is
only one type of evidence entailed in test validation. While correlation is informative,
it does not present a complete picture of the usefulness of test scores. As described by
Chapelle (1999, p. 265; see also 2021), “for applied linguists who think that ‘the
validity of a language test’ is its correlation with another language test, now is a good
time to reconsider.” Test validation involves collecting evidence to support test-score
uses, including the consequences involved in those uses (Messick, 1994). Test vali-
dation includes demonstrating, with evidence, that the test method and the item
characteristics are good, both psychometrically and psychologically (Chapelle, 1998).
Most recently, self-assessments of L2 speaking proficiency have been made to be
computer adaptive (Summers et al., 2019; Tigchelaar, 2019; Tigchelaar et al., 2017) to
improve self-assessment’s psychometrics (as evidenced by Li & Zhang, 2021) and
psychological aspects: Computer-adaptive tests ensure that students will not receive
items that are too far away from the students’ ability level, which means that all test
takers are “challenged but not discouraged” (Wainer, 2000, p. 11; see also Malabonga
et al., 2005). Importantly, computer-adaptive self-assessments are not linearly
administered. They involve sequential selection processes. Correlations based on
sum or final scores cannot reveal if the internal algorithms within a computer-
adaptive test are functioning well, which would be an important piece of validity
evidence for a computer-adaptive self-assessment. Thus, on several fronts, more than
correlation is needed to understand if a self-assessment is working well as a summa-
tive measure, especially if that self-assessment is computer adaptive.

Different forms, different purposes: Self-assessments then and now
In this study, we are interested in investigating one single type of self-assessment: self-
assessment of L2 speaking. Thus, we investigated what other validation types
researchers have used to demonstrate the usefulness of oral self-assessment scores.
We itemized 15 studies on L2-speaking self-assessment whose authors investigated
how valid those assessments’ scores were for summative purposes. We reviewed these
papers to better understand what the authors validated in terms of the L2 speaking self-
assessments, and how they did this work. We plotted a timeline of the 15 studies in
Figure 1: A complete table of the 15 studies appears in a supplemental file available
alongside the online version of this paper (see Table A in the supplement). We divided
the studies into three basic phases: the pioneering studies (n = 3) (Bachman & Palmer,
1989; LeBlanc & Painchaud, 1985; Peirce et al., 1993), whose authors designed their
own can-do statements; the studies (n = 4) from the first 10 years of the 2000s that first
used can-do statements from national or international testing scales; and the studies
(n = 8) published since 2010 that have normalized using can-do statements from the
standardized scales.
We would like to point out that after 2010, an important transition occurred in
speaking self-assessments: Some of them became computer-adaptive (Ma & Winke,
2019; Summers et al., 2019; Tigchelaar, 2019; Tigchelaar et al., 2017) with two or more
prearranged testlets (i.e., item sets) of increasing difficulty. As test takers move through
the self-assessment, they must pass implicit, predetermined thresholds between testlets
at adjacent difficulty levels. Thresholds are a computer-adaptive test’s adaptive points,
that is, points within the test that calculate (run an algorithm to see) whether a test taker
will either stop being tested, or continue on to a new set of items, based on the test
taker’s performance on the item set that appeared between the current and last
threshold (e.g., in Tigchelaar, 2019, participants had to indicate they could do well
on at least 8 out of 10 self-assessment can-do statements on a given testlet to be able to
take the next testlet; otherwise, the self-assessment was terminated for the participant).
Like non-computer-adaptive tests (in which all test takers take all items), computer-
adaptive tests need “defensible measurement models, validity, and reliability, and
fairness” (Bunderson, 2000, p. xi). But they have some clear differences from their
linearly administered counterparts: They have thresholds that are set based on hypoth-
eses. The test designers choose the number of thresholds and set the algorithms that run
the thresholds for many reasons, including to ensure a test taker will not be faced with
“too many inappropriately chosen items” and to assure “that the examinee understands
the task” (Wainer, 2000, p. 9). The appropriateness of the thresholds can be investi-
gated, yet none of the authors of the four studies in Table A with computer-adaptive
tests did that. (To sum forthwith, Summers et al. [2019] investigated the psychometrics
of a computer-adaptive speaking test with one threshold, but did not investigate the
validity of the threshold, and Tigchelaar et al. [2017], Tigchelaar [2019], and Ma and
Winke [2019] investigated the psychometrics of different versions of this paper’s
computer-adaptive speaking test with four thresholds, but did not investigate the
validity of the thresholds.)

Approaches to providing evidence for the validity of oral self-assessments


As outlined in Table A in the online supplement and overviewed in Figure 1, the applied
linguistics researchers of the 15 L2 speaking self-assessment studies collected four
different types of statistical evidence to support the utilization (see Chapelle, 2021,


Table 2.1, p. 15) of their tests’ self-assessment scores for particular real-world uses, like
placement (e.g., LeBlanc & Painchaud, 1985; Li, 2015; Summers, 2019), to measure
gains (e.g., Brantmeier et al., 2012; Brown et al., 2014), or as an alternative measure of
proficiency (e.g., Butler & Lee, 2006; Peirce et al., 1993). They used these statistical
approaches to test the claims regarding what self-assessment scores mean, or for what
the scores can be used. We summarize the 15 studies’ statistical methods here in relation
to the inferences they support.

1. Correlation. When researchers claim that self-assessment scores are useful for
specific purposes like placement or estimating L2 speaking proficiency levels, the
researchers are suggesting that the self-assessment can replace or stand in for an
assessment that was previously used for the same purpose. The main method to test
for this claim, as seen in the studies in Table A in the supplement, has been through
correlation of the self-assessment scores with the same test takers’ scores on a trusted
measurement used for the same purpose. Correlation was used in 13 of the 15 studies
on oral self-assessment in Table A. Ma and Winke (2019) used approaches similar to
correlation, that is, by calculating exact and adjacent agreement between test takers’
self-assessment and standardized proficiency test scores. In most cases, Pearson
correlation was used.
2. Rasch modeling. When researchers claim that L2 speaking self-assessment scores
reflect the construct of L2 speaking well, the researchers are suggesting that the test
takers’ L2 speaking ability explains or gives rise to their L2 self-assessment scores.
One way to provide evidence for this construct-explanation of scores is to test the
underlying assumption that L2 speaking is a unidimensional construct with a
single, underlying ability scale. Testing that unidimensional theory with Rasch
modeling using the L2 assessment scores provides evidence in support of the
notion that the construct (L2 speaking) is being measured, and that L2 speaking as
a trait thus explains the self-assessment scores. Rasch modeling also helps deter-
mine if the test scores and items that construct that score accurately summarize the
relevant performance, that is, if the items discriminate, are of appropriate difficulty, and fit the model well. Rasch modeling was used in 4 of the 15 studies for
these purposes.
3. Multitrait multimethod (MTMM). MTMM is another way to show evidence for a
test’s construct validity. Like Rasch modeling, it can provide evidence that L2
speaking ability gives rise to the self-assessment score. The MTMM design is “a
classic approach to designing correlational studies for construct validation”
(Bachman, 1990, p. 263). In this approach, a test is considered to be a combination
of trait (construct) and test method (e.g., self-rating or oral proficiency interview):
MTMM reveals how much of the score is trait-based and how much is method-
based. Using MTMM, a test is considered construct valid only if it has high
correlations with external tests that measure the same trait using different methods
(it has convergent validity) and has low correlations with external tests that measure
different traits using the same method (it has discriminant validity). MTMM was
used in 2 of the 15 studies.
4. Factor analysis. Factor analysis, exploratory or confirmatory, is another tool that
can be used for assessing an instrument’s construct validity, that is, whether the
meaning of the test’s scores is based on the defined construct (see Chapelle, 2021, for
more on the surmising of score meaning). Bachman and Palmer (1989) used
confirmatory factor analysis (CFA) to confirm the hypothesized structure of a


self-assessment’s subscales, including speaking, providing evidence that the self-assessment scores reflect the assumed construct.
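One of the simpler score-comparison approaches mentioned in point 1, exact and adjacent agreement (Ma & Winke, 2019), can be sketched in a few lines. The function and the data below are our own illustration, not the authors' code or data: exact agreement is the proportion of test takers whose self-assessed level matches their standardized-test level, and adjacent agreement allows a one-level difference.

```python
# Illustrative sketch of exact and adjacent agreement between
# self-assessed levels and standardized-test levels (invented data).

def agreement(self_levels, test_levels):
    """Return (exact, adjacent) agreement proportions for two
    equal-length sequences of integer proficiency levels."""
    pairs = list(zip(self_levels, test_levels))
    exact = sum(s == t for s, t in pairs) / len(pairs)
    adjacent = sum(abs(s - t) <= 1 for s, t in pairs) / len(pairs)
    return exact, adjacent

self_assessed = [1, 2, 2, 3, 4, 5, 3, 2]
opic_level    = [1, 2, 3, 3, 3, 5, 2, 4]
print(agreement(self_assessed, opic_level))  # -> (0.5, 0.875)
```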

To summarize, while correlation coefficients are easily obtainable and offer a straightforward interpretation, they take insufficient account of important scale-based
assumptions underlying newer, more specially designed self-assessments of a com-
puter-adaptive nature: Correlation coefficients do not provide information about the
goodness of the thresholds predetermined by designers for a computer-adaptive self-
assessment. More than correlation needs to be used to provide evidence for appropriate
uses of scores from a computer-adaptive self-assessment, but other commonly used
analyses (Rasch, MTMM, and factor analysis) are not right for the job. In particular,
robust, computer-adaptive self-assessments must provide evidence that the hypotheses
underlying the thresholds can withstand tests of those hypotheses. None of the studies
that employed and validated a computer-adaptive speaking self-assessment have
investigated the goodness of the assessments’ thresholds. This article serves as an
example of how to do this.

The present study


For the present study, we collected evidence for the uses of scores from a computer-
adaptive, self-assessment of L2 speaking for measuring L2 proficiency within a college
program. Specifically, we addressed a main claim (that self-assessment scores reflect the
learners’ oral proficiency as measured by a standardized oral proficiency test), for which
we have a specific hypothesis or, as test validation researchers would call it, a warrant,
which has backings that can be supported with empirical evidence (Chapelle, 2021;
Kane, 2006). If the warrant in support of the claim does not prove true, we have
suspicions on why that might be, and this is called the rebuttal, which in turn may have
backings from prior studies. We diagramed the connections among these validation
argumentation points in Figure 2.

Figure 2. This study’s validity claim, warrants, backings, and rebuttal that will be tested through the
analysis of the data.


As a cross-walk to more traditional empirical research methodology, the study can also be seen as having two main research questions:

1. Does the construct assessed by the self-assessment account for students’ oral
performance as measured by the OPIc?
2. Is the computer-adaptive design of this oral self-assessment appropriate for eliciting
the self-assessment results from learners at different proficiency levels?

Method
Participants
The data in this study are a subset of the data collected for the Language Proficiency
Flagship project at Michigan State University. For the project, a sample of intact Chinese,
French, Russian, and Spanish classes at Michigan State University were pseudo-randomly
selected to have their proficiency measured on five occasions over the course of three
academic years (fall 2014 through spring 2017). At each time of testing, the sampled
classes were brought by their language instructors to a computer lab to take a background
survey, a self-assessment of oral skills, and a computerized oral proficiency interview test
from Language Testing International (LTI; https://www.languagetesting.com/, a test offi-
cially known as ACTFL’s OPIc). For the current study, we focus on the Spanish students
who were tested in spring 2017, and we use their oral proficiency interview test scores and
their self-assessment outcomes. Note that we only use data from Spanish-learning
students who received interpretable OPIc test scores. This means that we excluded the
data from 64 students who received either “above range” (AR = 4), “below range” (BR =
59), or “unratable” (UR = 1) on the OPIc test. AR was given when a student selected a test
form that was too easy; BR was given when a student selected a test form that was too
difficult; and UR was given when a student submitted no response or a response that was
not ratable for various reasons (e.g., technology failure).
The students who received interpretable OPIc scores were enrolled in first-year
(100-level; N = 131), second-year (200-level; N = 251), third-year (300-level; N = 346),
and fourth-year (400-level; N = 79) Spanish courses within the four-year program, for a
total of 807 students in this study. The sample size was not determined using a priori
power analysis. It was determined by the number of students who enrolled in the
Spanish courses during the study period. The data and a codebook explaining the
study’s variables are publicly available (see Winke & Zhang, 2022).

Materials
As mentioned already, we used two sets of test data (oral self-assessment, and ACTFL’s
OPIc scores) for this project, and we additionally recorded the students’ year in the 4-
year Spanish program as a gross indicator of Spanish ability (see Winke & Zhang, 2022).
We describe these in more detail next.

Self-assessment of oral proficiency


This self-assessment of oral proficiency was originally developed by the research team
at Michigan State University to assist individual students in identifying their approx-
imate level of oral proficiency on the ACTFL (2012) proficiency scale. The self-
assessment is semi-computer adaptive and comprises five testlets of 10 National


Figure 3. The sequential selection process of the self-assessment. Level 1 through level 5 represent the five testlets of 10 can-do statements in the self-assessment. Threshold 1 through threshold 4 represent the four thresholds that implicitly exist between every two levels of self-assessment testlets.

Council of State Supervisors for Languages-ACTFL (NCSSFL-ACTFL 2015) can-do statements, with each testlet labeled as level 1 through level 5. The 50 statements were
selected from NCSSFL-ACTFL’s larger 2015 list of can-do statements to roughly
represent the five ranges of proficiency as measured by the five different ACTFL OPIc
test forms: (1) novice-low to intermediate-mid, (2) novice-high to intermediate-mid,
(3) intermediate-mid to advanced-low, (4) intermediate-high to advanced-mid, and
(5) advanced-mid to superior. Students responded to individual statements by rating
their ability to perform the task described on a 4-point Likert scale: 1 (“Not yet”),
2 (“With much help”), 3 (“With a little help”), and 4 (“Yes, I can do this well”). All
students started the self-assessment from the testlet labeled as level 1 and were only
allowed to proceed to the next higher testlet level when they indicated mastery (“Yes, I
can do this well”) on 8 out of 10 statements. Thus, when a student takes this self-
assessment, they are partaking in a sequential selection process, as illustrated in Figure 3.
Conceptually, one can perceive the self-assessment at hand as having four sequentially
presented thresholds, with one threshold between every two testlets (or levels) of
10 statements. Passing over a threshold to the next set of 10 can-do statements requires
a demonstration of mastery (a score of 4 out of 4) on 8 out of 10 statements. This
qualifies the student to be able to try to pass over the next threshold. However, failing to
pass over a threshold (by not indicating mastery on 8 out of 10 statements) terminates
the self-assessment. For example, if a student does not pass over threshold 1, they are
given their raw score (out of 200 points possible) and information that they are most
likely between novice low and novice high in speaking on the ACTFL (2012) profi-
ciency scale. They are additionally told, if they will take the ACTFL OPIc, to take the
Level 1 ACTFL OPIc test form. Through such a sequential selection process, the
transition testing sequentially filtered students out of the self-assessment process. In
other words, the test with five testlets has an “adaptive stopping rule” (Wainer, 2000,
p. 249) that is applied at each threshold.
Consequently, those who self-assessed themselves lower were presented with fewer
can-do statements (as few as 10), and those who self-assessed themselves higher were
presented with more (as many as 50). More details about the development of the self-
assessment can be found in the paper by Tigchelaar et al. (2017). The self-assessment
instrument is available at https://tinyurl.com/MSUselfassess.
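The testlet-advancement logic just described can be expressed as a short stopping-rule sketch. The function and the ratings format below are our own illustration of the procedure described in the text, not code from the MSU instrument: five testlets of 10 statements rated 1 to 4, advancement only on mastery ratings for 8 of 10 statements, and a raw score out of 200 possible points.

```python
# Illustrative sketch (our own, not the instrument's code) of the
# adaptive stopping rule: five testlets of 10 can-do statements rated
# 1-4; a student advances past a threshold only by rating 8 of the
# 10 statements as 4 ("Yes, I can do this well").

def administer(testlet_ratings):
    """testlet_ratings: up to five lists of ten ratings (1-4), in testlet order.
    Returns (highest testlet level reached, raw score out of 200)."""
    level = raw_score = 0
    for ratings in testlet_ratings:
        level += 1
        raw_score += sum(ratings)
        if sum(r == 4 for r in ratings) < 8:   # threshold not passed
            break                              # self-assessment terminates
    return level, raw_score

# A student who masters testlet 1 but indicates mastery on only 7
# statements of testlet 2 stops at level 2 and never sees testlets 3-5:
print(administer([[4] * 10, [4] * 7 + [3] * 3]))  # -> (2, 77)
```

Seen this way, each pass through the loop corresponds to one of the four thresholds in Figure 3, and the early exit is the "adaptive stopping rule" (Wainer, 2000) referred to in the text.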

Standardized, oral proficiency assessment


As mentioned above, the ACTFL OPIc is an internet-delivered oral proficiency test.
(For an open-access description and review of the test, see Isbell & Winke, 2019.)

Table 1. The five ACTFL OPIc test forms offered to the students in spring 2017 (adapted from Isbell & Winke, 2019, p. 469).

ACTFL OPIc form    Target proficiency range    Score range    No. of prompts
1                  NL–NH                       NL–IL          12
2                  IL–IM                       NL–IH          15
3                  IH–AL                       NL–AL          15
4                  AL–AM                       IH–AH          17
5                  AH–S                        AM–S           13

At the time of testing, LTI offered five OPIc test forms targeting five ranges of
proficiency, as shown in Table 1. Each student was asked to self-select the test form
to take based on their outcome on the self-assessment administered immediately before
they were to take the OPIc.

Procedure
The data were collected in spring 2017 at Michigan State University as part of the
Language Proficiency Flagship Project. Each Spanish language learner came into a
computer lab with their class and their instructor toward the end of the spring 2017
semester to have their language proficiency evaluated as part of their regular, program-
matic coursework. The students first took the computer-adaptive self-assessment. They
immediately received their self-assessment results upon completion of the self-assess-
ment. They were encouraged to use the outcome of the self-assessment to help them
choose their level of the LTI ACTFL OPIc (level 1, 2, 3, 4, or 5, as in Table 1). They then
proceeded with taking the ACTFL OPIc. All students completed these two tasks within
the normally allocated class time. The students received their OPIc results from LTI by
email approximately 2 weeks later.

Analyses
To provide a robust understanding of the self-assessment data and their relationship
with the students’ OPIc ratings, we analyzed the data using multiple methods, each
providing a different perspective on the validity of the self-assessment-score uses. To
prepare the data for analysis, we numerically converted OPIc ratings to a scale of 1 to
10 (novice-low = 1, superior = 10; see Tigchelaar, 2019 for a discussion of other scaling
methods researchers using ACTFL proficiency ratings have employed). For all analyses,
we centered the numeric scores at 5 (i.e., intermediate-mid) prior to use. In our
analyses, we treated the centered (–4 to 5) OPIc scores as a continuous variable and
treated self-assessment levels (levels 1 to 5) as an ordinal variable. Year in the program
was also treated as ordinal.
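The numeric conversion of OPIc ratings can be illustrated with a small sketch. The ten rating labels below are the standard ACTFL sublevels; the mapping (novice-low = 1 through superior = 10, then centered at intermediate-mid = 5) follows the description above, but the code itself is our reconstruction, not the authors' script.

```python
# Converting OPIc ratings to the centered -4 to 5 scale described in the
# text. The label list follows the standard ACTFL sublevel order; the
# mapping is our reconstruction of the authors' conversion.

ACTFL_SUBLEVELS = [
    "novice-low", "novice-mid", "novice-high",
    "intermediate-low", "intermediate-mid", "intermediate-high",
    "advanced-low", "advanced-mid", "advanced-high", "superior",
]
RATING_TO_SCORE = {label: i + 1 for i, label in enumerate(ACTFL_SUBLEVELS)}

def centered_opic(rating):
    """Map an OPIc rating to 1-10, then center at intermediate-mid (5)."""
    return RATING_TO_SCORE[rating] - 5

print(centered_opic("novice-low"))        # -> -4
print(centered_opic("intermediate-mid"))  # -> 0
print(centered_opic("superior"))          # -> 5
```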

Correlations
To investigate the first warrant (warrant a) in Figure 2, we examined the correlation
between self-assessment levels and centered OPIc scores. We estimated a polyserial
correlation (rps) between the continuous OPIc scores and the ordinal self-assessment
levels. Polyserial correlation measures the strength of the linear association between a
continuous and a discretized ordinal variable (Drasgow, 2006; Hasegawa, 2013). In our


current dataset, the ordinal variable, self-assessment level, arose from discretization,
which is when a spectrum of continuous scores is divided into a finite number of
discrete elements. Although we measured self-assessed oral proficiency on an ordinal
scale with five discrete levels (similar to a Likert-scale item) in this study, it does not
mean that the underlying construct of self-assessed oral proficiency is distributed on an
ordinal scale. Like other proficiency-based variables (e.g., reading proficiency), self-
assessed oral proficiency can be reasonably assumed to have a normal distribution on a
continuous scale. In other words, we artificially discretized the underlying continuous
variable of self-assessed oral proficiency by using our ordinally scored self-assessment
measure.
The polyserial correlation coefficient is more appropriate for computing the
correlation between a discretized ordinal variable and a continuous variable, as
compared with other correlation coefficients that are more familiar to applied
linguists, such as Pearson’s r, Spearman’s rho, and Kendall’s tau. Pearson’s r is the
coefficient most widely used by researchers in applied linguistics to correlate self-
assessment data with other (external) measures of proficiency. It should be noted
that Pearson’s r is appropriate only if both measures are continuous; otherwise, for
example, when one of the measures is ordinal, Pearson’s r likely underestimates the
relationship. Some self-assessment researchers who had ordinal self-assessment
scores used Spearman’s rho (Li, 2015) or Kendall’s tau (Peirce et al., 1993) to
investigate the predictive validity of the self-assessment. These researchers, however,
neglected the fact that their ordinal self-assessment scores, similar to the self-
assessment data in this present study, arose from discretization. Spearman’s rho
and Kendall’s tau are most appropriate with ranked ordinal data rather than
discretized ordinal data; when they are used to correlate discretized ordinal data,
Spearman’s rho and Kendall’s tau likely underestimate the relationship (Ekström,
2011). Polyserial and polychoric correlation coefficients are specifically designed for
discretized ordinal data: Polyserial correlation is appropriate if, for example, self-
assessment is a discretized ordinal variable, and proficiency is a continuous variable;
polychoric correlation is appropriate if both self-assessment and proficiency are
discretized ordinal variables. An easy way, perhaps, for applied linguists to concep-
tualize these differences is to envision Spearman’s rho and Kendall’s tau as appro-
priate for norm-referenced ordinal data (when students’ abilities are comparatively
ranked), and polyserial and polychoric correlations as appropriate for criterion-
referenced ordinal data (when students’ abilities are mapped to an external scale or
set[s] of criteria). For the purposes of comparison with correlational outcomes from
former studies only, we also calculated the Pearson’s r and Spearman’s rho between
self-assessment levels and OPIc scores, despite their inappropriateness for handling
discretized ordinal data.
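To make the attenuation argument concrete, a two-step polyserial estimate (a common approximation, often attributed to Olsson, Drasgow, & Dorans, 1982) rescales the naive Pearson r by the ordinal variable's standard deviation and the normal densities at the estimated thresholds. The sketch below, on simulated data, illustrates the idea; it is not the exact maximum-likelihood estimate that statistical software would return.

```python
# Sketch of a two-step polyserial estimator: Pearson's r between the
# continuous scores and the ordinal codes, rescaled by the ordinal SD
# and the normal densities at the estimated thresholds. An illustration
# of why Pearson's r underestimates the relationship for discretized
# data, not the exact ML estimate.
import numpy as np
from scipy.stats import norm

def polyserial(x, y_ordinal):
    x = np.asarray(x, float)
    y = np.asarray(y_ordinal, float)
    r = np.corrcoef(x, y)[0, 1]                 # naive Pearson r
    # thresholds from cumulative category proportions (drop the last, = 1)
    _, counts = np.unique(y, return_counts=True)
    cum = np.cumsum(counts) / y.size
    taus = norm.ppf(cum[:-1])
    return r * y.std() / norm.pdf(taus).sum()   # rescaled estimate

# Demo: discretizing one variable of a bivariate normal (true rho = .70)
# into 5 ordered levels attenuates Pearson's r; the polyserial estimate
# recovers a value close to rho.
rng = np.random.default_rng(0)
n, rho = 5000, 0.70
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
levels = np.digitize(z, np.quantile(z, [0.2, 0.4, 0.6, 0.8])) + 1  # 1..5
print(round(np.corrcoef(x, levels)[0, 1], 3))   # attenuated, roughly .66
print(round(polyserial(x, levels), 3))          # close to .70
```

The gap between the two printed values is exactly the underestimation the text attributes to using Pearson's r on discretized ordinal data.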

Continuation-ratio modeling
A correlation coefficient is a convenient way to rapidly assess the strength of a
relationship between two variables. However, distilling the relationship down to a
single number risks oversimplifying the phenomenon.
The self-assessment test in this study was not linearly administered; rather, it
employed a “hierarchical branching scheme” (Wainer & Kiely, 1987, p. 190) at each
threshold, which, as we referred to it previously, can also be called a sequential
selection process. Thus, any correlation coefficient based on the self-assessment
data will inaccurately estimate the relationship because the correlation coefficient

https://doi.org/10.1017/S0272263122000079 Published online by Cambridge University Press


Self-assessment as a measure of speaking proficiency 427

will provide the degree of any linear relationship present (see Klugh, 1986, p. 89). To
best account for the sequential selection process underlying the self-assessment data
(which is nonlinear and “hierarchically structured;” see O’Connell, 2006, p. 60), we
used more appropriate continuation-ratio modeling to further examine the rela-
tionship of self-assessment levels with OPIc scores. In other words, we employed
continuation-ratio modeling to investigate the second warrant (warrant b) in
Figure 2. Supportive validity evidence should show that the probability that a
student would pass a given threshold test is positively related to the student’s oral
proficiency as externally measured by the OPIc. Continuation-ratio modeling is a
test of whether a computer-adaptive test’s thresholds were set appropriately. We
explain, as plainly as possible, what readers need to know about continuation-ratio
modeling next.
The continuation-ratio model is a regression model specifically designed for ordinal
outcome data generated from a sequential selection process (O’Connell, 2006, ch. 5).
The model parameters can be translated into additional estimates and graphs that are
easy to interpret and meaningful for assessing the validity and characteristics of the self-
assessment.
In the current study, we define continuation ratio as the proportion of students who
were presented with a particular threshold (when taking the level 1, 2, 3, or 4 testlet of the
semi-computer adaptive self-assessment; see Figure 1) and passed over the threshold to
the next testlet (exhibited mastery on eight or more of the testlet’s can-do statements).
Each of the four thresholds is associated with a continuation ratio, which, as defined
already, is identical to the conditional pass rate (i.e., conditional probability) of that
threshold. We used continuation-ratio modeling to investigate the relationship
between the conditional pass rates (dependent variable) and the OPIc scores (predic-
tor). Specifically, we performed logistic regression using data in a person-threshold
format, wherein thresholds were nested within persons. Passing a threshold was coded
as 1, whereas failing to pass was coded as 0. We expected to observe a positive
relationship between the OPIc scores and conditional pass rates; that is, we expected
students who obtained high OPIc scores to (a) pass a given threshold more often and
(b) to pass a larger number of thresholds in general than those students with low OPIc
scores.
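To make the person-threshold data format concrete, the sketch below (illustrative Python; the study's data preparation was done in R) expands one student's record into the nested rows just described, with passing coded as 1 and failing coded as 0. A student contributes one row per threshold attempted, stopping at the first failure:

```python
def to_person_threshold(person_id, n_passed):
    """Expand one student's record into person-threshold rows.

    n_passed: how many of the 4 thresholds the student passed.
    A student who passed j < 4 thresholds attempted thresholds
    1..j+1 and failed the last one attempted.
    """
    rows = []
    for threshold in range(1, 5):
        if threshold <= n_passed:
            rows.append((person_id, threshold, 1))  # passed; continue to next testlet
        else:
            rows.append((person_id, threshold, 0))  # failed; self-assessment ends here
            break                                   # later thresholds were never presented
    return rows

# A student who passed thresholds 1 and 2 but failed threshold 3
# contributes three rows; threshold 4 is absent because it was never reached.
example_rows = to_person_threshold("s01", 2)
```

Logistic regression is then fit to the stacked rows from all students, which is what makes the conditional (rather than absolute) pass rates the quantity being modeled.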

Analysis software
We used R 4.1.2 and several R packages (e.g., polycor, car, multcomp) to perform the
correlational and continuation-ratio analyses. We evaluated a set of continuation-ratio
models (which we will define in the results section) with respect to goodness of fit,
calibration, and discrimination because we adopted a predictive modeling approach
rather than an explanatory one (Fenlon et al., 2018; Sainani, 2014; Shmueli, 2010). We
assessed goodness of fit with (a) an R2 based on deviance residuals (Cameron &
Windmeijer, 1997; Fox, 1997, p. 451), (b) the Akaike information criterion (AIC),
and (c) the Bayesian information criterion (BIC) (Sainani, 2014). Our calibration
measures included (a) the Hosmer-Lemeshow test statistic, (b) calibration plots, and
(c) scaled Brier scores (Fenlon et al., 2018; Steyerberg et al., 2010). Our discrimination
measures were (a) classification accuracy, (b) specificity, (c) sensitivity, and (d) area
under the curve (AUC). We compared the alternative models using (a) log-likelihood
ratio tests, (b) AIC, and (c) BIC, along with the aforementioned goodness-of-fit
measures.
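To make a few of these indices concrete, here is a minimal Python sketch of commonly used definitions of the Brier score, the scaled Brier score, and the basic classification metrics (illustrative only; the study computed these measures in R with the packages cited above):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def scaled_brier(probs, outcomes):
    """1 - Brier / Brier_max, where Brier_max comes from always predicting
    the base rate; higher is better, and 1 indicates perfect calibration."""
    base = sum(outcomes) / len(outcomes)
    b_max = brier_score([base] * len(outcomes), outcomes)
    return 1 - brier_score(probs, outcomes) / b_max

def confusion_metrics(probs, outcomes, cutoff=0.5):
    """Classification accuracy, sensitivity, and specificity at a probability cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, outcomes) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, outcomes) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, outcomes) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, outcomes) if p == 0 and y == 1)
    accuracy = (tp + tn) / len(outcomes)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

Here the outcomes would be the 1/0 threshold passes in the person-threshold data, and the probabilities would be the model-predicted conditional pass rates.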



428 Paula Winke et al.
Reproducibility
We published the data in a public archive (Winke & Zhang, 2022). We also published a
research compendium (Marwick et al., 2018) consisting of a custom R package (Pierce
& Zhang, 2022) containing our analysis scripts and our raw statistical output.

Results
Correlations
Before we present the correlation, we describe the data on which it is based. We first present frequency counts of the students’ highest self-assessment
levels by OPIc score (Table 2). (See our research compendium, Pierce and Zhang
[2022], for a distribution of the self-assessment levels of the students who received
either AR, BR, or UR on the test.) Overall, students who obtained higher OPIc scores
tended to reach higher self-assessment levels, indicating a positive relationship between
self-assessment levels and OPIc scores as expected.
The trends in the descriptive data seen in Table 2 were corroborated by the
significant and positive correlation of self-assessment levels with OPIc scores, as shown
in Table 3. In this article, we present Pearson’s r and Spearman’s rho, but we focus on
the polyserial correlation coefficient because, as described in the analysis section,
Pearson’s r and Spearman’s rho substantially underestimate a linear association with
a discretized ordinal variable (Drasgow, 2006; Ekström, 2011; Hasegawa, 2013), and
our self-assessment levels were a discretized ordinal scale. To recapitulate, the self-
assessment levels 1 through 5 represent the observed values discretized from an
underlying, continuous variable of self-assessed oral proficiency.

Table 2. Frequency counts of students’ highest self-assessment level by OPIc score

                          Student’s highest self-assessment level
Student’s OPIc score   Level 1   Level 2   Level 3   Level 4   Level 5   Total
NL                     10        0         0         0         0         10
NM                     59        1         1         1         1         63
NH                     118       15        1         0         0         134
IL                     152       62        10        4         2         230
IM                     110       102       23        5         9         249
IH                     25        30        11        9         17        92
AL                     2         2         5         4         8         21
AM                     1         1         0         1         4         7
AH                     0         0         0         0         1         1
Total                  477       213       51        24        42        807

Table 3. Polyserial, Pearson, and Spearman correlations between self-assessment levels and OPIc scores

                                                95% Confidence Interval
Correlational design      Correlation   SE      Lower      Upper      p value
Polyserial correlation    0.61          0.03    0.56       0.66       <.001
Pearson’s r               0.50          0.03    0.45       0.55       <.001
Spearman’s rho            0.50          0.03    0.45       0.55       <.001

Continuation-ratio modeling analyses
As described earlier in the analysis section, we used continuation-ratio modeling to
examine whether and to what extent OPIc scores predicted the conditional rates at
which students passed the four thresholds in the self-assessment. First, we illustrate the
derivation of conditional pass rates using descriptive statistics. Table 4 lists the number
of students who completed, demonstrated mastery on, and failed to demonstrate
mastery on the can-do statements in each testlet level of the self-assessment. By design,
the number of students completing a given self-assessment testlet level shrank as the
self-assessment level increased. Based on the count data in Table 4, we calculated the
conditional pass rate for each threshold by dividing the number of students who passed
that threshold by the number of students who took the testlet.
In Table 5, we present the number of students who took and who passed each
threshold, as well as each threshold’s conditional pass rate. A high conditional pass rate
indicates it was relatively easy for students to pass that threshold, given that
they had passed all previous thresholds. The conditional pass rate, as a type of
conditional probability, ranges between 0 and 1.
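The arithmetic behind the conditional pass rate is simply the passed/took ratio at each threshold. A quick Python check (our sketch) using the counts reported in Table 5 reproduces the rates shown there:

```python
# Counts from Table 5: students who took / passed each of the 4 thresholds.
took = [807, 330, 117, 66]
passed = [330, 117, 66, 42]

# Conditional pass rate (continuation ratio) = passed / took at each threshold.
conditional = [p / t for p, t in zip(passed, took)]
# Rounded to 3 decimals: [0.409, 0.355, 0.564, 0.636], matching Table 5.
```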
Here we summarize our steps in model testing. Specifically, we used continuation-
ratio modeling to examine whether and to what extent self-assessment thresholds’
conditional pass rates, as a dependent variable, were predicted by our independent
predictor variable, OPIc scores. We tested two logistic regression models to determine if
the predictor had a parallel effect (Model 1) or a nonparallel effect (Model 2) on the
conditional pass rates. (Models 1 and 2 in this article are referred to as Models 2a and
2b, respectively, in the Pierce and Zhang, 2022, compendium.) Model 1 included the
main effects for OPIc scores and the four thresholds, assuming that OPIc scores had an
identical effect across all thresholds. We relaxed the assumption of a parallel OPIc effect
in Model 2 by including the main effects for, as well as interaction terms between, OPIc

Table 4. The number of students who completed, demonstrated mastery on, and failed to demonstrate mastery on the statements in each testlet level of the self-assessment

Self-assessment   N of students who     N of students who terminated        N of students who demonstrated
testlet level     completed the level   the self-assessment at the level    mastery at the level
1                 807                   477                                 330
2                 330                   213                                 117
3                 117                   51                                  66
4                 66                    24                                  42
5                 42                    42                                  0

Table 5. The number of students who took and passed each threshold (see Figure 3) and each threshold’s conditional pass rate

Self-assessment   N of students who    N of students who      Conditional pass rate
threshold         took the threshold   passed the threshold   (continuation ratio)
1                 807                  330                    0.409
2                 330                  117                    0.355
3                 117                  66                     0.564
4                 66                   42                     0.636


scores and each of the four thresholds. Model 2 therefore allowed the effect of OPIc
scores to vary across thresholds.
Readers can envision the difference between the two models in this way: If
a predictor has a parallel effect on the conditional pass rate, the predictor will affect each
of the four thresholds’ conditional pass rate in the same way or by the same amount: the
predictor will have a fixed effect. If a predictor has a nonparallel effect, the predictor will
have threshold-specific effects, meaning that the predictor’s effect on the conditional
pass rate will vary, and will depend on the level of the threshold.
In both models, we omitted the normal intercept term (meaning we dropped the
constant term from the models, a procedure called “regression through the origin,”
or RTO), as RTO is appropriate with categorical or ordinal
predictors (Casella, 1983; Eisenhauer, 2003). This means that rather than having a
shared intercept term associated with a reference value for the transition threshold plus
additional parameters testing whether the baseline pass rates for other transition
thresholds differed from it, the model instead estimated a separate intercept parameter
for each threshold. This makes sense in the language proficiency testing context because
those intercepts represent the baseline difficulty of the task at each transition when all
other predictors are set to zero. That difficulty should vary because the tasks involved
are different at each transition. This parameterization simplified the task of defining
contrasts that estimate specific quantities of interest. Furthermore, the results still
permit us to examine whether those intercepts (and the pass rates they allow us to
compute) are similar or different across transitions.
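The parameterization just described, with a separate intercept per transition and no shared constant, can be sketched as follows (our notation, not taken from the article):

```latex
% Model 1: parallel OPIc effect (one slope \beta shared by all four thresholds)
\operatorname{logit} \pi_{ij} = \alpha_j + \beta \, \mathrm{OPIc}_i , \qquad j = 1, \dots, 4

% Model 2: nonparallel, threshold-specific OPIc effects \beta_j
\operatorname{logit} \pi_{ij} = \alpha_j + \beta_j \, \mathrm{OPIc}_i
```

where π_ij is the conditional probability that student i passes threshold j, and α_1 through α_4 are the four transition-specific intercepts estimated in place of a shared constant.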
Model results (see Table 6) were transformed into odds ratios and conditional and
unconditional pass rates to facilitate interpretation. In contrast to the conditional pass
rate, a threshold’s unconditional pass rate is its absolute pass rate, which can be
understood as the proportion of the whole sample that passed the threshold.
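The relationship between the two quantities is multiplicative: a threshold’s unconditional pass rate is the product of the conditional pass rates up to and including that threshold. A quick Python check (our sketch) against the counts in Table 5, where 807 students took the self-assessment overall:

```python
# Counts from Table 5: students who took / passed each of the 4 thresholds.
took = [807, 330, 117, 66]
passed = [330, 117, 66, 42]
conditional = [p / t for p, t in zip(passed, took)]

# Unconditional (absolute) pass rate at threshold j: cumulative product of
# the conditional rates 1..j, i.e., the share of all 807 students who got
# past threshold j.
unconditional = []
running = 1.0
for rate in conditional:
    running *= rate
    unconditional.append(running)

# Cross-check: each cumulative product equals passed[j] / 807 computed directly.
```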
Table 7 displays the goodness-of-fit statistics for Models 1 and 2. Both models
demonstrated acceptable fit to the data (see the Hosmer-Lemeshow test results in
Table 7). We then compared the two models using AIC, BIC, and a likelihood-ratio test
(LRT), all shown in Table 7. The indices gave conflicting answers regarding model
selection. The LRT suggested that Model 2 was preferred over Model 1, revealing a
significant effect for the interaction between OPIc scores and thresholds (p = .017).
Difference in the AIC values also favored Model 2 by a small margin of about four
points. However, the difference in the BICs favored Model 1 by a larger margin of
about 11 points, suggesting that the increase in model complexity from Model 1 to

Table 6. Model estimates, standard errors, p values, and odds ratios for Models 1 and 2

                      Model 1                                    Model 2
                      Estimate (SE/p)      Odds Ratio [95% CI]   Estimate (SE/p)      Odds Ratio [95% CI]
Transition 1          0.13 (0.09/.130)                           0.19 (0.09/.035)
Transition 2          –0.70 (0.12/<.001)                         –0.69 (0.13/<.001)
Transition 3          –0.15 (0.21/.482)                          –0.01 (0.21/.952)
Transition 4          –0.06 (0.29/.824)                          0.31 (0.30/.309)
OPIc                  0.84 (0.06/<.001)    2.31 [2.05, 2.61]
OPIc @ Transition 1                                              0.96 (0.08/<.001)    2.61 [2.10, 3.25]
OPIc @ Transition 2                                              0.79 (0.13/<.001)    2.21 [1.55, 3.14]
OPIc @ Transition 3                                              0.54 (0.18/.008)     1.72 [1.07, 2.77]
OPIc @ Transition 4                                              0.31 (0.21/.252)     1.37 [0.78, 2.38]

Note: OPIc @ Transition j represents the simple effect of OPIc scores on the jth transition.

Table 7. Goodness-of-fit indices for Models 1 and 2 and likelihood-ratio test result

Fit statistic                                    Model 1              Model 2
Log-likelihood                                   –761.34              –756.22
Deviance                                         1522.68              1512.43
AIC                                              1532.68              1528.43
BIC                                              1558.61              1569.91
Hosmer-Lemeshow chi-square statistic (DF/p)      7.05 (8/.531)        2.41 (4/.660)
Brier score                                      .20                  .20
R-squared based on deviance                      .17                  .17
Specificity                                      .731                 .731
Sensitivity                                      .676                 .676
Accuracy                                         .708                 .708
AUC of ROC [95% CI]                              .752 [.726, .778]    .756 [.729, .781]

Likelihood-ratio test: delta deviance = 10.25, DF = 3, p = .017

Model 2, which was due to the addition of the interaction terms, could not be justified
by the concomitant improvement in model fit. Because other goodness-of-fit indices
(Brier score, AUC, and classification accuracy, sensitivity, and specificity, in Table 7) all
suggested little difference in model fit between Model 1 and Model 2, we decided to
select Model 1, the parallel-effect model, as the final model because of its parsimony.
Results of Model 1 (see Table 6) indicated that each unit of increase on the OPIc scale
(i.e., one sublevel) was associated with a 131% increase in the odds of passing every
threshold in the self-assessment. In other words, the odds of passing a given threshold
for students who scored advanced-high on the OPIc were expected to be 2.31 times
higher than for students who scored advanced-mid, 5.34 (2.31²) times higher than for
students who scored advanced-low, 12.33 (2.31³) times higher than for students who
scored intermediate-high, and so on, given that the students reached the opportunity
to take the threshold.
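The compounding in this interpretation is simply exponentiation of the per-sublevel odds ratio. A quick arithmetic check (our Python sketch, using the odds ratio rounded to 2.31 as reported):

```python
odds_ratio = 2.31  # Model 1: odds ratio per one OPIc sublevel

# Percentage increase in the odds per sublevel: (2.31 - 1) * 100 = 131%.
pct_increase = (odds_ratio - 1) * 100

# Odds ratios across 2 and 3 sublevels compound multiplicatively.
two_levels = odds_ratio ** 2    # about 5.34, advanced-high vs. advanced-low
three_levels = odds_ratio ** 3  # about 12.33, advanced-high vs. intermediate-high
```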
In Figure 4 and Figure 5, we visualize the model-predicted effects of OPIc scores on
conditional and unconditional pass rates, respectively (see Table B in the online
supplement for the results’ values). A visible trend in both figures is that students were
increasingly more likely to pass each threshold, as well as to pass a larger number of
thresholds, as their OPIc ratings increased from novice-low to advanced-high. In
Figure 5, which illustrates the unconditional (or absolute) pass rates, one can see that
among all advanced-level Spanish learners (7 = advanced-low, 8 = advanced-mid, or
9 = advanced-high), over 50% were expected to pass thresholds 1 through 3 in the self-
assessment, compared to less than 1% of those who scored in the novice range on the
OPIc (1 = novice-low, 2 = novice-mid, or 3 = novice-high).

Discussion
In this article, we analyzed the validity of an L2 speaking self-assessment for determin-
ing college-level Spanish language learners’ proficiency on the ACTFL oral proficiency
scale (2012). We did this work because the field of applied linguistics has been hesitant
to fully endorse the use of self-assessments, despite the strong, theoretical rationales for
self-assessments as part of a positive learning process (e.g., Brantmeier, 2006; Falchikov
& Boud, 1989; Kaderavek et al., 2004; Kissling & O’Donnell, 2015).


Figure 4. Predicted conditional pass rates by OPIc rating for each self-assessment threshold. The nine lines
represent the nine observed OPIc ratings (1 = novice-low, 2 = novice-mid, 3 = novice-high, 4 = intermediate-
low, 5 = intermediate-mid, 6 = intermediate-high, 7 = advanced-low, 8 = advanced-mid, 9 = advanced-high).
Due to how OPIc ratings were centered for modeling purposes, the line for OPIc = 5 visualizes the
interpretation of the set of transition-specific intercepts.


Figure 5. Predicted unconditional (or absolute) pass rates by OPIc rating for each self-assessment
threshold. The nine lines represent the nine observed OPIc ratings (1 = novice-low, 2 = novice-mid, 3 =
novice-high, 4 = intermediate-low, 5 = intermediate-mid, 6 = intermediate-high, 7 = advanced-low, 8 =
advanced-mid, 9 = advanced-high).

Empirical evidence of the high value of inferences from computer-adaptive self-assessment
First, we investigated our claim (that the self-assessment results reflect learners’ oral
proficiency as measured by the OPIc) through the following warrant and backing, as
listed in Figure 2:

• Warrant a: The construct assessed by the self-assessment accounts for students’ oral
performance as measured by the OPIc.
• Backing a: Learners who reach a high-level self-assessment testlet should have high
OPIc scores.

We found that the self-assessment levels aligned moderately with the OPIc scores
(polyserial = .61), which provides evidence that self-assessment scores would work well
to differentiate learners according to their proficiency, although perhaps not as well as
OPIc scores would. However, we would like to point out that, compared to the OPIc or
any other performance-based assessment, the self-assessment can separate individuals
within classes or on a larger scale efficiently, cost-effectively, and with little of the
student stress and anxiety that is common in high-stakes testing
(Huang & Hung, 2013; Shi, 2012). Overall, learners who reached high self-assessment
levels tended to have high OPIc scores. As expected, this relationship was stronger than
when estimated with Pearson’s r (.50) or Spearman’s rho (.50). The results
suggest that previous studies that calculated Pearson’s r or Spearman’s rho for dis-
cretized ordinal self-assessment data (e.g., Li, 2015) may have underestimated the
predictive validity of the self-assessment.


We needed to test the functionality of the thresholds within the testing process
through something other than correlation, factor analysis, or MTMM, as those statis-
tical methods fall short in providing information on whether the assessment’s thresh-
olds work as well as intended. We therefore applied more robust statistical modeling—
continuation-ratio modeling—to investigate the functionality of the self-assessment
and to understand the goodness of the test’s thresholds. We did this in relation to our
claim’s second warrant and backing, which were as follows:

• Warrant b: The computer-adaptive design is appropriate for eliciting the self-


assessment results from learners at different proficiency levels.
• Backing b: Learners with high OPIc scores should reach a particular self-assessment
testlet more often and should pass a larger number of testlets than those with low
OPIc scores.

Our best-fitting model (Model 1) indicated that each unit of increase on the OPIc scale
corresponded with a 131% increase in the odds of passing over every threshold. In other
words, a student was more likely to pass each self-assessment threshold, as well as to
pass a larger number of thresholds, if they correspondingly had a higher OPIc rating.
The overall outcomes of the continuation-ratio modeling, which investigated the effect
of OPIc scores (Models 1 and 2) on learners’ progression through the self-assessment,
demonstrated clearly that students’ OPIc scores predicted their self-assessments of oral
skills rather well.

Arguments for self-assessment score uses by language programs and classroom teachers
We believe that this study provides clear evidence that language teachers and programs
should use more self-assessments and use them in conjunction with performance-
based testing, not in lieu of it. As we reviewed at the start of this article, decades of
research have found positive effects from self-assessment in general education, educa-
tional psychology, and other diverse education sub-fields such as medical education
(see Panadero et al., 2017). Applied linguists know that learners’ vision, agency,
autonomy, and conceptualizations of their possible selves (i.e., realistic and goal-
oriented visions of what they will be able to do in the language in the future if their
learning continues positively) are part of an enjoyable and engaging language-learning
experience (Dörnyei, 2009), and self-assessment can be part of forming, solidifying, or
bringing to the fore (for open discussion) those conceptualizations. Self-assessments
with positively worded, can-do-based statements can promote agency and motivation
(see Brantmeier, 2006; Kissling & O’Donnell, 2015). Meta-analyses from higher edu-
cation have demonstrated that self-assessment can be reliable (Falchikov & Boud,
1989). Moreover, self-assessment can positively influence students’ strategy implemen-
tation and self-efficacy (Panadero et al., 2017). Even with younger children (ages 5 to
12), research in general education has found that self-evaluation of learning increases
self-efficacy, achievement, and school-based motivation (Kaderavek et al., 2004).
Applied linguistics researchers, however, have tended to take a different stance on
self-assessment. Focusing on self-assessments’ correlation with other measures of oral
proficiency, researchers have concluded that self-assessment provides “too cloudy of a
picture of proficiency” (Ross, 1998, p. 12), a notion we countered with this research.
With this research, we pondered, too cloudy of a picture of proficiency for what


purposes? We take up the claim by Chapelle (2021) that each test must be validated
according to the uses of its scores. She wrote (p. 12) that “beyond construct validation,
validation needs to take into account issues of relevance and utility, value implications,
and the social consequences of testing.” Because self-assessment and OPIc scores are
generally not used for the same purposes, and because their social consequences are
very different, we argue that one should perhaps not overinterpret their correlational
estimates. Indeed, a medium correlation may be what one should expect. We found
correlational results similar to Ross’s findings when we inspected the 15 studies
(Table A in the online supplement) that investigated the predictive validity or extrap-
olation inferencing of L2-speaking self-assessments. We also found a similar result with
this present study when we correlated self-assessment levels and OPIc scores (polyserial
= .61). We would like to stress here that expecting correlations higher than those we
found is problematic, in that such expectations would ignore the different practical
and formative uses of self-assessments relative to standardized
testing. The modern question is not whether self-assessments can be used to replace
standardized tests (which could be determined through correlational studies), but
rather the question is, are self-assessment outcomes reliable and meaningful enough
to warrant their perhaps very different uses? Based on our research, we believe the
answer is yes, when the uses are for low-stakes assessment, diagnostic purposes,
spurring discussions on goals and learning objectives, and building learning motivation,
vision, and agency. We also believe computer-adaptive self-assessments, such as this
study’s test, lend themselves well for repeated use over time because computer-adaptive
tests offer different items to students as they gain in ability (Wainer, 2000). Thus, the
self-assessment as showcased in this study can be used by teachers and program
directors to measure proficiency growth reliably and cost-effectively on the ACTFL
scale.

The complementary uses of self-assessments and proficiency-based tests


Unlike the meta-analyses that cautioned the field about using self-assessments
(Li & Zhang, 2021; Ross, 1998), this study obtained similar results but interprets
them differently. The empirical correlational evidence in this study aligns with evidence
from other studies (Alderson & Huhta, 2005; Butler & Lee, 2006; Stansfield et al., 2010;
Summers et al., 2019), but we see the evidence as suggesting that L2 self-assessments of
oral skills measure something slightly different from standardized performance tests,
and that the two test types have complementary uses. For example, self-assessments
may measure the language learner’s conceptualization of their possible self, that is, their
“real-to-themselves” and goal-oriented vision of what they can do in the language (see
Dörnyei, 2009, for further definitions of the possible self). One can view the self-
assessment as an internal, cognitive reflection on what the learner thinks they can do
with the language, task-by-task. Meanwhile, standardized performance-based assess-
ments are a direct, external measure of observable performance that may align or not
align with the learner’s actual vision or estimation of what they can do. And this makes
sense. Classroom language teachers and educators need to use both self-assessment and
performance-based assessment types over the course of an academic program to foster
positive and realistic conceptualizations of the possible self, and to spur practice (which
drives SLA; see DeKeyser, 2015) in using the target language, respectively. If one thinks
closely about the two assessment types used in this study (self-assessment and
performance-based assessment), one will not expect them to correlate too highly, for


even though they both measure aspects of the same larger construct (speaking ability),
they tap into it from extremely different, nonoverlapping angles. This need not be seen
as problematic, and in fact can be seen as a better way to represent and measure the
development of the overall construct: Using both self-assessment and performance-
based tests may represent a whole-learner approach to measuring speaking, one that
considers the personal, psychological, and cognitive aspects of speaking development,
and the more behavioral, procedural, and external manifestation of the learner’s
underlying knowledge, as judged by someone else.

Arguments for self-assessment score uses by researchers


Calls for robust measures of proficiency for SLA research purposes have been around
for a long time (Thomas, 1994). As outlined by Gaillard and Tremblay (2016) and
Révész and Brunfaut (2021), SLA researchers must use measures of proficiency and
language growth that are reliable and for which the researchers have score-use validity
evidence. The test must be discriminating, and often needs to be quick to administer
and easy to score given the researchers’ typical limited time and resources. For
proficiency assessment, the assessment should, Gaillard and Tremblay noted, also
be “sufficiently global” so it does not assess the same construct as the L2-learning
construct under investigation (p. 420). Three proficiency test formats that have been
put forward as meeting such requirements for SLA research have been cloze testing
(Tremblay, 2011), C-testing (Eckes & Grotjahn, 2006; Norris, 2018), and elicited
imitation testing (Deygers, 2020; Erlam, 2006; Gaillard & Tremblay, 2016; Yan et al.,
2016). But what these formats functionally measure, and their appropriateness as
measures, have long been debated due to the abstract and nonauthentic
nature of the assessments (Grotjahn et al., 2002; Kim et al., 2016; Spada et al., 2015;
Winke et al., in press). We believe our computer-adaptive self-assessment can serve
appropriately and functionally as proficiency measurement for SLA researchers,
especially because its administration and use may be seen as more intuitive and
meaningful than cloze, C-testing, or elicited imitation. The self-assessment also has
the advantage of being directly interpretable on a national (ACTFL) language
proficiency scale and is less reliant on abstract, scale-score mapping procedures.
Furthermore, the self-assessment’s items can be adapted for local use. Brantmeier
et al. (2012) demonstrated clearly how standardized, scale-based can-do items can be
adapted and localized for higher specificity and discriminatory power. Butler and Lee
(2006) demonstrated how can-do statements can be tailored for young learners.
Tailoring can-do statements to align with a specific language program’s objectives
can improve the accuracy of self-assessment when the self-assessment outcomes are
used to investigate specific programmatic questions about attainment and growth,
while using nonadapted, general can-do statements culled directly from proficiency
framework materials can be used to gather proficiency estimations from learners
across multiple schools or learning contexts.

Limitations
We would be remiss if we did not review this study’s limitations. We based this study in
a Spanish language program at one university in the United States; thus, the general-
izability of this study is limited in that we don’t know how well this particular self-
assessment would work with learners of other languages or in other learning contexts,


such as in other countries. Additionally, this particular assessment was designed for and
used with college-level language learners. Whether computer-adaptive self-assessment
functions in the same manner with learners of other age groups and proficiency levels
or proficiency-level ranges is also an area that needs further empirical investigation. We
also centered this research on the notion that the ACTFL OPIc measures college-level
language learners’ proficiency with a high degree of accuracy, a notion that
has come under question in recent years (see Isbell & Winke, 2019). Higher measurement
error (whether random or systematic) in the instruments on either side of the
correlation would weaken any directional hypothesis because error terms, being
unknown, may be correlated or uncorrelated, so no directional assumptions are
possible (see Raykov et al., 2015).
Nonetheless, our study seems to point out exactly what Oskarsson found in 1978: that
self-assessment estimates are generally good, or in some cases very good, at estimating
learners’ proficiency.
Few learners in this study had reached the advanced levels on the OPIc test. That
necessarily limits the available data for drawing conclusions about advanced learners.
We recommend some caution with respect to applying our findings to advanced
learners. Replication with new samples containing more advanced learners would be
advisable. Because advanced learners are a smaller fraction of the population than those of more modest proficiency, addressing that issue may require starting with larger sample sizes, adopting a narrower focus and recruiting only advanced learners, or pooling results across multiple studies using meta-analysis.
Only 66 of our learners reached the fourth self-assessment transition testlet. Two of
them had novice OPIc scores and 18 had advanced scores, leaving most (46) with
intermediate scores. Such range restriction tends to reduce regression coefficients
(Aguinis et al., 2017), suggesting that our study may underestimate the true OPIc effect
at that transition. Meanwhile, the smaller sample size available for examining that final
transition also means our estimate of the OPIc effect on that conditional pass rate is less
certain than our estimates for the earlier transitions. That uncertainty is already
reflected in the larger standard error (and thus wider confidence intervals) associated
with those estimates. Solving the sample size problem is straightforward in an ideal
world: recruit larger samples for future studies to increase the probability of having
sufficient numbers of learners reach the final transition. In practice, this is difficult
simply because there are fewer language learners who reach the highest levels of
proficiency in a given language program. Solving the range restriction issue may be a
thornier problem requiring creative research designs or more sophisticated statistical
methods.
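The attenuating effect of range restriction noted above (Aguinis et al., 2017) can be illustrated with a small simulation. This is an entirely hypothetical sketch on synthetic data, not drawn from the study; the restriction window of −0.5 to 0.5 on the predictor is an arbitrary stand-in for a mostly intermediate sample.

```python
import random

random.seed(1)

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

# Simulate a predictor (an OPIc-like score) and a noisy criterion.
full = []
for _ in range(20_000):
    x = random.gauss(0, 1)
    y = x + random.gauss(0, 1)  # population correlation is about .71
    full.append((x, y))

# Restrict the sample to the middle of the predictor's range,
# analogous to a sample dominated by intermediate-level learners.
restricted = [(x, y) for x, y in full if -0.5 < x < 0.5]

print(f"full range: r = {corr(full):.2f}")       # close to .71
print(f"restricted: r = {corr(restricted):.2f}")  # substantially smaller
```

The same mechanism attenuates regression coefficients: with the predictor’s variance artificially shrunk, the observed association understates the relationship that holds across the full proficiency range.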
Data availability statement. The experiment in this article earned an Open Data badge for transparent
practices. The materials are available at https://doi.org/10.3886/E164981V1

References
ACTFL. (2012). ACTFL Proficiency Guidelines 2012. ACTFL. http://www.actfl.org/publications/guidelines-
and-manuals/actfl-proficiency-guidelines-2012
Aguinis, H., Edwards, J. R., & Bradley, K. J. (2017). Improving our understanding of moderation and
mediation in strategic management research. Organizational Research Methods, 20, 665–685. https://
doi.org/10.1177/1094428115627498
Alderson, J. C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on
the Common European Framework. Language Testing, 22, 301–320. https://doi.org/10.1191/
0265532205lt310oa
Andrade, H. L. (2019). A critical review of research on student self-assessment. Frontiers in Education, 4,
1–13. https://doi.org/10.3389/feduc.2019.00087
Babaii, E., Taghaddomi, S., & Pashmforoosh, R. (2016). Speaking self-assessment: Mismatches between
learners’ and teachers’ criteria. Language Testing, 33, 411–437. https://doi.org/10.1177/0265532215590847
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self-ratings of communicative language
ability. Language Testing, 6, 14–29. https://doi.org/10.1177/026553228900600104
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Brantmeier, C. (2006). Advanced L2 learners and reading placement: Self-assessment, CBT, and subsequent
performance. System, 34, 15–35. https://doi.org/10.1016/j.system.2005.08.004
Brantmeier, C., Vanderplank, R., & Strube, M. (2012). What about me? Individual self-assessment by skill and
level of language instruction. System, 40, 144–160. https://doi.org/10.1016/j.system.2012.01.003
Brown, N. A., Dewey, D. P., & Cox, T. L. (2014). Assessing the validity of can-do statements in retrospective
(then-now) self-assessment. Foreign Language Annals, 47, 261. https://doi.org/10.1111/flan.12082
Bunderson, C. V. (2000). Foreword to the first edition. In H. Wainer, N. J. Dorans, R. Flaugher, B. F. Green, &
R. J. Mislevy (Eds.), Computerized adaptive testing: A primer (pp. ix–xii). Routledge.
Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessments among Korean elementary school
students studying English. Modern Language Journal, 90, 506–518. https://doi.org/10.1111/j.1540-
4781.2006.00463.x
Cameron, A. C., & Windmeijer, F. A. G. (1997). An R-squared measure of goodness of fit for some common
nonlinear regression models. Journal of Econometrics, 77, 329–342. https://doi.org/10.1016/s0304-4076(
96)01818-0
Casella, G. (1983). Leverage and regression through the origin. American Statistician, 37, 147–152. https://
doi.org/10.1080/00031305.1983.10482728
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D.
Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70).
Cambridge Applied Linguistics.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272.
https://doi.org/10.1017/S0267190599190135
Chapelle, C. A. (2021). Validity in language assessment. In P. Winke & T. Brunfaut (Eds.), The Routledge
handbook of second language acquisition and language testing (pp. 11–20). Routledge.
DeKeyser, R. M. (2015). Skill acquisition theory. In B. VanPatten & J. Williams (Eds.), Theories in second
language acquisition (2nd ed., pp. 94–112). Routledge.
Deygers, B. (2020). Elicited imitation: A test for all learners? Examining the EI performance of learners with
diverging educational backgrounds. Studies in Second Language Acquisition, 42, 933–957. https://doi.org/
10.1017/S027226312000008X
Dolosic, H. N., Brantmeier, C., Strube, M., & Hogrebe, M. C. (2016). Living language: Self-assessment, oral
production, and domestic immersion. Foreign Language Annals, 49, 302–316. https://doi.org/10.1111/
flan.12191
Dörnyei, Z. (2009). The psychology of the second language learner. Oxford University Press.
Drasgow, F. (2006). Polychoric and polyserial correlations. In S. Kotz, C. B. Read, N. Balakrishnan, B.
Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences (pp. 1–6). John Wiley & Sons.
https://doi.org/10.1002/0471667196.ess2014.pub2
Eckes, T., & Grotjahn, R. (2006). A closer look at the construct validity of C-tests. Language Testing, 23,
290–325. https://doi.org/10.1191/0265532206lt330oa
Eisenhauer, J. G. (2003). Regression through the origin. Teaching Statistics, 25, 76–80. https://doi.org/
10.1111/1467-9639.00136
Ekström, J. (2011). A generalized definition of the polychoric correlation coefficient. UCLA Department of
Statistics Papers, 36. https://escholarship.org/uc/item/583610fv
Erlam, R. (2006). Elicited imitation as a measure of L2 implicit knowledge: An empirical investigation.
Applied Linguistics, 27, 464–491. https://doi.org/10.1093/applin/aml001
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis. Review of
Educational Research, 59, 395–430. https://www.jstor.org/stable/1170205
Fenlon, C., O’Grady, L., Doherty, M. L., & Dunnion, J. (2018). A discussion of calibration techniques for
evaluating binary and categorical predictive models. Preventive Veterinary Medicine, 149, 107–114.
https://doi.org/10.1016/j.prevetmed.2017.11.018
Fox, J. (1997). Applied regression analysis, linear models, and related methods. SAGE.
Gaillard, S., & Tremblay, A. (2016). Linguistic proficiency assessment in second language acquisition
research: The elicited imitation task. Language Learning, 66, 419–447. https://doi.org/10.1111/lang.12157
Grotjahn, R., Klein-Braley, C., & Raatz, U. (2002). C-tests: An overview. In J. A. Coleman, R. Grotjahn, & U.
Raatz (Eds.), University language testing and the C-test (pp. 93–114). AKS-Verlag.
Hasegawa, H. (2013). On polychoric and polyserial partial correlation coefficients: A Bayesian approach.
METRON, 71, 139–156. https://doi.org/10.1007/s40300-013-0012-1
Huang, H.-T. D., & Hung, S.-T. A. (2013). Comparing the effects of test anxiety on independent and
integrated speaking test performance. TESOL Quarterly, 47, 244–269. https://doi.org/10.1002/tesq.69
Isbell, D., & Winke, P. (2019). Test review: ACTFL Oral Proficiency Interview—Computer (OPIc). Language
Testing, 36, 467–477. https://doi.org/10.1177/0265532219828253
Joo, S. H. (2016). Self- and peer-assessment of speaking. Working Papers in TESOL and Applied Linguistics,
16, 68–83. https://doi.org/10.7916/salt.v16i2.1257
Kaderavek, J. N., Gillam, R. B., Ukrainetz, T. A., Justice, L. M., & Eisenberg, S. N. (2004). School-age children’s
self-assessment of oral narrative production. Communication Disorders Quarterly, 26, 37–48. https://
doi.org/10.1177/15257401040260010401
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American
Council on Education and Praeger.
Kim, Y., Tracy-Ventura, N., & Jung, Y. (2016). A measure of proficiency or short-term memory? Validation
of an elicited imitation test for SLA research. Modern Language Journal, 100, 655–673. https://doi.org/
10.1111/modl.12346
Kirby, N. F., & Downs, C. T. (2007). Self-assessment and the disadvantaged student: Potential for encouraging
self-regulated learning? Assessment & Evaluation in Higher Education, 32, 475–494. https://doi.org/
10.1080/02602930600896464
Kissling, E. M., & O’Donnell, M. E. (2015). Increasing language awareness and self-efficacy of FL students
using self-assessment and the ACTFL proficiency guidelines. Language Awareness, 24, 283–302. https://
doi.org/10.1080/09658416.2015.1099659
Klugh, H. E. (1986). Statistics: The essentials for research. Psychology Press.
LeBlanc, R., & Painchaud, G. (1985). Self-assessment as a second language placement instrument. TESOL
Quarterly, 19, 673–687. https://doi.org/10.2307/3586670
Li, M., & Zhang, X. (2021). A meta-analysis of self-assessment and language performance in language testing
and assessment. Language Testing, 38, 189–218. https://doi.org/10.1177/0265532220932481
Li, Z. (2015). Using an English self-assessment tool to validate an English placement test. Papers in Language
Testing and Assessment, 4, 59–96. https://arts.unimelb.edu.au/__data/assets/pdf_file/0003/1770672/Li.pdf
Ma, W., & Winke, P. (2019). Self-assessment: How reliable is it in assessing oral proficiency over time?
Foreign Language Annals, 52, 66–86. https://doi.org/10.1111/flan.12379
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a
computerized oral proficiency test. Language Testing, 22, 59–92. https://doi.org/10.1191/
0265532205lt297oa
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and
friends). The American Statistician, 72, 80–88. https://doi.org/10.1080/00031305.2017.1375986
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments.
Educational Researcher, 23, 13–23.
Mok, M. M. C., Lung, C. L., Cheng, D. P. W., Cheung, R. H. P., & Ng, M. L. (2006). Self-assessment in higher
education: Experience in using a metacognitive approach in five case studies. Assessment & Evaluation in
Higher Education, 31, 415–433. https://doi.org/10.1080/02602930600679100
NCSSFL-ACTFL. (2015). NCSSFL-ACTFL Can-do statements. ACTFL.
Norris, J. M. (Ed.). (2018). Developing C-tests for estimating proficiency in foreign language research. Peter
Lang.
O’Connell, A. A. (2006). Logistic regression models for ordinal response variables. Quantitative applications in
the social sciences. SAGE.
Oskarsson, M. (1978). Approaches to self-assessment in foreign language learning. Pergamon Press.
Panadero, E., Jonsson, A., & Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-
efficacy: Four meta-analyses. Educational Research Review, 22, 74–98. https://doi.org/10.1016/j.
edurev.2017.08.004
Patsopoulos, N. A., & Ioannidis, J. P. (2009). The use of older studies in meta-analyses of medical
interventions: a survey. Open Medicine, 3, 62–68. https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2765773/pdf/OpenMed-03-e62.pdf
Peirce, B. N., Swain, M., & Hart, D. (1993). Self-assessment, French immersion, and locus of control. Applied
Linguistics, 14, 25–42. https://doi.org/10.1093/applin/14.1.25
Pierce, S. J., & Zhang, X. (2022). SAWpaper: Self-assessment works paper research compendium (Version
1.0.0) [Reproducible research materials and computer program, R package]. GitHub and Zenodo.
https://github.com/sjpierce/SAWpaper, https://doi.org/10.5281/zenodo.6388011
Plonsky, L., & Oswald, F. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning,
64, 878–912. https://doi.org/10.1111/lang.12079
Raykov, T., Marcoulides, G. A., & Patelis, T. (2015). The importance of the assumption of uncorrelated errors
in psychometric theory. Educational and Psychological Measurement, 75, 634–647. https://doi.org/
10.1177/0013164414548217
Révész, A., & Brunfaut, T. (2021). Validating assessments for research purposes. In P. Winke & T. Brunfaut
(Eds.), The Routledge handbook of second language acquisition and language testing (pp. 21–32).
Routledge.
Ross, J. A. (2006). The reliability, validity, and utility of self-assessment. Practical Assessment, Research, and
Evaluation, 11, 1–13. https://doi.org/10.7275/9wph-vv65
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experimental
factors. Language Testing, 15, 1–20. https://doi.org/10.1177/026553229801500101
Sainani, K. L. (2014). Explanatory versus predictive modeling. American Academy of Physical Medicine and
Rehabilitation, 6, 841–844. https://doi.org/10.1016/j.pmrj.2014.08.941
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25, 289–310. https://doi.org/10.1214/10-
STS330
Shi, F. (2012). Exploring students’ anxiety in computer-based oral English test. Journal of Language Teaching
and Research 3, 446–451. https://doi.org/10.4304/jltr.3.3.446-451
Spada, N., Shiu, J. L.-J., & Tomita, Y. (2015). Validating an elicited imitation task as a measure of implicit
knowledge: Comparisons with other validation studies. Language Learning, 65, 723–751. https://doi.org/
10.1111/lang.12129
Stansfield, C. W., Gao, J., & Rivers, W. P. (2010). A concurrent validity study of self-assessments and the
federal Interagency Language Roundtable Oral Proficiency Interview. Russian Language Journal, 60,
299–315.
Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan,
M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel
measures. Epidemiology, 21, 128–138. https://doi.org/10.1097/EDE.0b013e3181c30fb2
Summers, M. M., Cox, T. L., McMurry, B. L., & Dewey, D. P. (2019). Investigating the use of the ACTFL can-
do statements in a self-assessment for student placement in an Intensive English Program. System, 80,
269–287. https://doi.org/10.1016/j.system.2018.12.012
Thomas, M. (1994). Assessment of L2 proficiency in second language acquisition research. Language
Learning, 44, 307–336. https://doi.org/10.1111/j.1467-1770.1994.tb01104.x
Tigchelaar, M. (2019). Exploring the relationship between self-assessments and OPIc ratings of oral
proficiency in French. In P. Winke & S. M. Gass (Eds.), Foreign language proficiency in higher education
(pp. 153–173). Springer. https://doi.org/10.1007/978-3-030-01006-5_9
Tigchelaar, M., Bowles, R., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL can-do statements
for spoken proficiency. Foreign Language Annals, 50, 584–600. https://doi.org/10.1111/flan.12286
Tremblay, A. (2011). Proficiency assessment in second language acquisition research: “Clozing” the gap.
Studies in Second Language Acquisition, 33, 339–372. https://doi.org/10.1017/S0272263111000015
Wainer, H. (2000). Introduction and history. In N. J. Dorans, D. Eignor, R. Flaugher, B. F. Green, R. J. Mislevy,
L. Steinberg, & D. Thissen (Eds.), Computer adaptive testing: A primer (2nd ed., pp. 1–21). Lawrence
Erlbaum Associates.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computer adaptive testing: A case for testlets. Journal of
Educational Measurement, 24, 185–201. https://www.jstor.org/stable/1434630
Winke, P., Yan, X., & Lee, S. (in press). What does the Cloze test really test? A cognitive validation of a French
Cloze test with eye-tracking and interview data. In G. Yu & J. Xu (Eds.), Language test validation in a digital
age. UCLES/Cambridge University Press.
Winke, P., & Zhang, X. (2022). Data and codebook for SSLA article: “A closer look at a marginalized test
method: Self-assessment as a measure of speaking proficiency.” [Data set]. Inter-university Consortium for
Political and Social Research. https://doi.org/10.3886/E164981V1
Yan, X., Maeda, Y., Lv, J., & Ginther, A. (2016). Elicited imitation as a measure of second language proficiency:
A narrative review and meta-analysis. Language Testing, 33, 497–528. https://doi.org/10.1177/
0265532215594643

Cite this article: Winke, P., Zhang, X. and Pierce, S. J. (2023). A closer look at a marginalized test method:
Self-assessment as a measure of speaking proficiency. Studies in Second Language Acquisition, 45, 416–441.
https://doi.org/10.1017/S0272263122000079

https://doi.org/10.1017/S0272263122000079 Published online by Cambridge University Press


Studies in Second Language Acquisition (2023), 45, 442–460
doi:10.1017/S0272263122000316

RESEARCH ARTICLE

Explicit Instruction within a Task: Before, During, or After?
Gabriel Michaud* and Ahlem Ammar
Département de Didactique, Université de Montréal, Montreal, Quebec, Canada
*Corresponding author: E-mail: gabriel.michaud@umontreal.ca

(Received 01 December 2020; Revised 26 June 2022; Accepted 19 July 2022)

Abstract
This study addresses the effects of the timing of explicit instruction within the three phases of
a task cycle (pretask, task, posttask) while considering learners’ previous knowledge. Eight
intact groups (N = 165) of French L2 university-level students (4 B1- and 4 B2-level groups)
completed two tasks. Groups were formed according to previous knowledge. Three groups
received explicit instruction on the French subjunctive during the pretask, task, or posttask
phase of each task. The control groups completed the task without prior instruction.
Participants completed an elicited imitation test and a grammaticality judgment test as
pretests, immediate posttests, and delayed posttests. Results showed that explicit instruction
embedded in a task facilitates the development of explicit and implicit knowledge and that
the efficacy of instruction is not significantly influenced by the timing at which it is provided
or by the learners’ level of previous knowledge.

Introduction
Task-based language teaching (TBLT) is a teaching approach that has been gaining
ground around the world, with task-based curricula being adopted in different second
language (L2) and foreign language teaching settings (East, 2012; Ellis et al., 2020).
Tasks offer a communicative context in which emphasis is primarily placed on meaning
and where L2 learners mobilize linguistic and nonlinguistic resources to learn a
language. TBLT relies on the premise that languages are primarily acquired implicitly
(Long, 2015) with the attention of learners focused on meaning while also acknowl-
edging the potential benefits of focusing on formal elements of language. This is
especially true for older learners whose capacity for implicit L2 learning may no longer be optimal (Long, 2015). The incorporation of form-focused instruction is a characteristic of TBLT, but the type of instruction—proactive, preemptive, or reactive—remains contentious. This distinction has been conceptualized along the lines
of task-based language teaching versus task-supported language teaching. At one end of
the spectrum, a purely task-based approach follows a curriculum consisting of unfo-
cused tasks where there is no preplanned instruction of a specific linguistic structure.

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000316 Published online by Cambridge University Press



The attention to linguistic features happens reactively and incidentally in response to


students’ questions or errors (Ellis, 2003) or can happen in a preemptive way during the
task or at the end of the task. In this approach, tasks are selected and sequenced based on
learners’ real-world needs (Long, 2015), interest (Philp & Duchesne, 2016) or familiarity
with the content at hand (Prabhu, 1987). From a teaching perspective, Van den Branden
(2016) questioned the existence of true examples of this approach. On the other end of the
spectrum, a task-supported approach relies on focused tasks in which a specific feature
(usually a grammatical notion) is taught at the beginning of a task (Ellis, 2003). Ellis
(2018) maintained that proactive focus on form, drawing learner attention to a particular linguistic feature that could help them perform a task, is an equally valid option. This
approach is comparable to the traditional presentation-practice-production methodol-
ogy still widely practiced in L2 classrooms (Nassaji & Fotos, 2011) where production
takes the form of a task. It draws on research indicating that explicit instruction
(EI) promotes L2 learning (Goo et al., 2015; Norris & Ortega, 2000).
Given these two opposing views, teachers face a choice when integrating EI into task-based language teaching (East, 2012, 2017).

Teaching with tasks: A three-phase approach


To implement task-based teaching in the classroom, researchers have proposed several
task-based methodologies involving three phases, namely pretask, task, and posttask
(Ellis & Shintani, 2013; Willis & Willis, 2007). The pretask phase includes activities
teachers and students can undertake before they engage in a task. Activating previous
knowledge, modeling task examples, providing input, and giving learners time to plan
are examples of activities that usually occur during the pretask phase. The task phase is
where learners mobilize all the resources necessary to perform the task. The posttask
phase is where learners demonstrate the results of their work, reflect on what they have
learned, or engage in task repetition. Teachers using the three-phase task methodology
while seeking to provide EI on specific linguistic features face the inevitable question
pertaining to the timing of EI. Explicit instruction could occur in the pretask phase so
learners can familiarize themselves with forms that are useful or essential during the
task. It could also happen during the task phase to prevent or allow reaction to
difficulties learners encounter with certain linguistic forms. In the posttask phase, EI
can direct learners’ attention to forms that they may have encountered during the task.
Most TBLT methodologists argue that instruction should be reserved either for the task
phase in response to learners’ errors and questions (Long, 2015) or for raising students’
awareness of a certain structure (Samuda, 2001), or during the posttask phase (Willis &
Willis, 2007). Willis and Willis (2007) warned against focusing on form in the pretask and
task phase, arguing that learners might, as a consequence, focus exclusively on practicing
the language property at hand and lose sight of the task objectives. Despite these
recommendations, recent studies have suggested that teachers still prefer proactive EI that
occurs during the pretask or task phases (East, 2017; Zheng & Borg, 2014). These diverging
views have indicated the importance of addressing the research question pertaining to the
differential effects of EI provided during each of the three phases of TBLT.

Theoretical perspectives about the timing of explicit instruction


Even though focus on form, consisting of brief episodes of instruction often taking the
form of corrective feedback, has been more traditionally associated with TBLT (Long,
2015), recent publications on the topic include EI as a valid option (Ellis, 2018; Ellis
et al., 2020). Explicit instruction involves the explanation of rules with or without
metalinguistic comments either in a deductive or an inductive way (Norris & Ortega,
2000). It is a deliberate attempt to intervene in the process of acquisition (Ellis, 2018).
As we have highlighted, EI can occur in any phase of a task. We will explore theoretical
perspectives which support the benefits of EI by phase starting with the pretask,
followed by the posttask, and conclude with the task phase.
Providing EI in the pretask phase may be supported by skill acquisition theory
(DeKeyser, 1998, 2007) according to which learning entails a transition from declar-
ative knowledge (e.g., knowing grammar rules) to procedural knowledge (e.g., knowing
how to use the rules to perform a task) by virtue of practice. As DeKeyser (1998)
explains, practice does not correspond to the mechanic behavioristic drills that result in
“language-like behaviour” (p. 53) but rather to the use of language in meaningful ways.
From skill acquisition theory principles, it can be argued that EI in the pretask, followed
by meaning-oriented controlled exercises, contributes to the development of the initial
representation of rules in a declarative format and that engaging in the task immedi-
ately after EI is given helps learners proceduralize the acquired knowledge. In other
words, the task provides real operating conditions where learners can develop their
procedural knowledge.
The theoretical justification for the provision of EI during the posttask draws from
research on preparatory attention (N. C. Ellis, 2005; LaBerge, 1995). Hondo (2015)
argues that creating a context in which a certain feature is required to solve a task might
entice and nudge learners toward a particular form. Relying on James (1890), LaBerge
(1995) maintained that, when exposed to a certain stimulus, the brain can start
preprocessing information, thus easing the effort of processing the actual stimulus
when the time comes. From an L2 acquisition perspective one can infer that, when
completing a certain task, learners may notice their incomplete knowledge of language,
rendering them more receptive to any teaching that would help them fill the gap. In the
same vein, Doughty and Williams (1998) argued that “[m]ore direct instruction should
be delayed until learners have demonstrated at least some emerging knowledge of the
form” (p. 255). This latter assertion finds echo in usage-based theories (Ellis & Wulff,
2015) that have postulated that learning an L2 is mainly achieved through exposure to
input. By relying on their cognitive faculties, L2 learners come to make associations
between the form that they perceive and its meaning. According to this theoretical
approach, the relationships between form and meaning are emergent and develop over
time in a dynamic and adaptive way. This theoretical position could serve to justify
waiting for the emergence of initial understandings of form-meaning relationships
before proceeding with EI pertaining to those relationships.
Support for EI during the task phase could be argued from the two aforementioned
perspectives. In the task phase, learners have had time to familiarize themselves with the
context of the task and may begin trying to convey some meaning. Accordingly, during
the pretask and the beginning of the task work, learners may become aware of a gap in their knowledge that prevents them from expressing what they want to say, as per the Output
Hypothesis (Swain, 2000). In this instance, and in accordance with preparatory
attention, they might be more receptive to information by way of EI to fill the
knowledge gap. However, contrary to the posttask phase, learners in the task phase
still have time to apply the information under real operating circumstances which could
intensify the proceduralization of knowledge as supported by skill acquisition theory.
Lastly, even if EI cannot be considered a brief attention to form in a communicative
context as focus on form is normally conceptualized, the fact that it is provided during
the task phase while the students are immersed in the communicative context where the
use of a given form is particularly relevant might facilitate the acquisition of the form.
They may have started to understand the meaning or function of the form while the EI
stimulates their awareness of that form in accordance with the noticing hypothesis
(Schmidt, 2001).

Empirical findings
This issue of timing of instruction was raised in the late 1990s (Lightbown, 1998), but
very little empirical work has been conducted since to address a research question that
is of great relevance to both researchers and teachers. Spada (2019) raised this concern
recently when arguing that more must be known about the impact of timing on
teaching and learning. Recently, Ellis (2018) explored the timing of EI within the three
phases of a task and acknowledged that “it is perhaps disappointing to conclude a
chapter whose purpose is to examine the research that has investigated the impact of EI
in the different stages of a task-based lesson by just pointing out the need for such
research” (p. 126). Timing of instruction has been addressed either by looking at the
effects of EI provided at different phases of a task (Li et al., 2016a; Shintani, 2017; Spada
et al., 2014) or by evaluating the impact of immediate (within a task) and delayed (after
the task) corrective feedback (Li et al., 2016b; Quinn, 2014).

Timing of explicit instruction


To the best of our knowledge, no research looking at the timing of EI has compared the effects of instruction across the three timing conditions (pretask, task, posttask phases) in a single study, though some have looked at the effects of the timing of EI in specific phases of a task (before vs. during a task, Li et al., 2016a; before vs. after a task, Shintani, 2017).
The differential effects of the timing of EI were first investigated by Spada et al.
(2014), who compared two timing conditions that they referred to as integrated and
isolated instruction. In both conditions, instruction occurs in a communicative envi-
ronment, with integrated instruction happening during the communicative activity
and isolated instruction occurring before or after the activity. In Spada et al.’s study,
adult learners of L2 English in intact groups received either isolated instruction before
taking part in a communicative activity or integrated instruction while they were taking
part in the communicative activity. Learning operationalized as explicit and implicit
knowledge gains was measured using a written error correction task and an oral
production task, respectively. Results did not show any significant differences between
the two conditions. However, the integrated group showed an advantage for the
development of implicit knowledge, whereas the isolated group showed an advantage
for explicit knowledge as measured by both tests. Although interesting, the reported
findings should be interpreted with caution because of one major methodological flaw.
The experimental group conditions did not differ in terms of timing of instruction only
but also in terms of the nature of the instruction that the learners had received. In fact,
the isolated group completed grammatical exercises that the integrated group did not.
Furthermore, the integrated group received corrective feedback, but the isolated group
did not. Accordingly, it is not clear if the results can be attributed to a difference in
timing of instruction or rather to differences in the way instruction was provided.
To better understand the effects of the timing of instruction, Li et al. (2016a)
compared conditions in which learners received instruction at different moments of

https://doi.org/10.1017/S0272263122000316 Published online by Cambridge University Press


446 Gabriel Michaud and Ahlem Ammar

a task. The conditions that are relevant to the present study were: (a) EI at the beginning
of a task (pretask), (b) corrective feedback during the task (within-task), (c) a group that
completed only the task, and (d) a control group that did only the pretests and posttests.
It is worth noting, once again, that instruction differed not only in terms of timing but
also in terms of operationalization (explicit instruction vs. corrective feedback). The
task consisted of two dictoglosses targeting the passive structure performed in a 2-hour
period. Learning gains were assessed by a grammaticality judgment test (GJT) and an
elicited imitation test (EIT) that were meant to tap explicit and implicit knowledge,
respectively. For the GJT, only the pretask group outperformed the control group and
almost outperformed the task-only group for the GJT (p = .06, d = 0.63 and p = .09,
d = 0.60 for the immediate and delayed posttests, respectively). No significant differ-
ences between the three treatment groups were obtained in the EIT. Li et al. (2016a) also
controlled for the level of knowledge of the structure by their participants and found
that the learners who possessed some knowledge of the passive structure in the EI +
task group outperformed the control group at both posttests of the GJT, whereas the
learners with no previous knowledge did not benefit from that same instruction. This
study indicated that learner previous knowledge of the structure might moderate the
impact of the timing of instruction. Shintani (2017) seemed to validate that conclusion
when comparing the timing of providing written explanations during a writing task. In
Shintani’s study, one group received EI in the form of a self-study handout explaining
the rules of the feature and then performed the writing task while another group
completed the task and then studied the handout for 5 minutes, after which time the
learners in the second group were allowed to review their text. Learning gains were
measured using an error correction test and a text reconstruction test. Results showed
that the learners with no previous knowledge drew more benefit from explicit pretask
instruction and that the learners with previous knowledge were best served by having
access to the instruction after the task.

Timing of corrective feedback


Even though EI and corrective feedback represent two different forms of instruction
(the former proactive, the latter reactive), given the scarcity of research specifically
addressing the timing of proactive instruction, reviewing research on the effects of the
timing of CF may shed more light on the issue at hand. Among studies focusing on the
timing of corrective feedback, Quinn
(2014) was the first to address this research question in an adult L2 laboratory learning
context. In this study where students engaged in three different oral tasks, one group
received corrective feedback during the tasks, one group received corrective feedback
after each task, and a control group simply completed the tasks. Learning was assessed
using a GJT, an oral production test, and a written error correction test. No significant
differences were observed between the groups. In a study on L2 learning similar to Li
et al. (2016a), Li et al. (2016b) evaluated the effects of corrective feedback timing within
two dictogloss tasks. While one of the two experimental groups received corrective
feedback during the task, the other group received the same type of corrective feedback
but at the posttask phase. On the EIT, results indicated no effects for either of the two
treatment groups. On the GJT, the immediate feedback group showed superior results.
Similar results were obtained by Fu and Li (2022), who reported no significant
differences between learners who received immediate or delayed corrective feedback
during a task. However, the immediate group differed significantly from the



Explicit Instruction Within a Task: Before, during, after? 447

control group. In Arroyo and Yilmaz (2018), learners were involved in a computer-
based spot-the-difference task during a one-on-one chat-exchange with an experi-
menter in a laboratory setting. Learners in the experimental groups received either
immediate corrective feedback in the form of recasts or delayed corrective feedback
where their errors were presented in a document with the correct answer underneath.
The control-group participants only did the pretests and posttests. Results showed that
the immediate group outperformed the delayed and the control groups at both posttests
on an oral production test. On a GJT, both experimental groups outperformed the
control group at both posttests, but there was no difference between the two groups.
In sum, it seems more advantageous to provide feedback within a task than to wait
until after the task (Arroyo & Yilmaz, 2018; Fu & Li, 2022; Li et al., 2016b).

Research questions
Empirical studies addressing the effects of the timing of EI on L2 learning are not only
scarce but also difficult to compare because of differences in methodological choices
(e.g., the way instruction is operationalized differs across studies). More important,
this same body of research has
reported contradictory results. In fact, although some studies have reported no effects
for different timing conditions (Quinn, 2014; Spada et al., 2014), others have shown
that the benefits of instruction depend on the time of its provision (Arroyo & Yilmaz,
2018; Fu & Li, 2022; Li et al., 2016a, 2016b). Given the limited and contradictory
empirical evidence, the first research question guiding the present study was: Does the
timing of EI in the pretask phase, the within-task phase, or the posttask phase affect L2
learning? Based on research results indicating that learners’ level of previous knowledge
seemed to mediate the differential effects of instruction provided at different moments
of the task (Li et al., 2016a; Shintani, 2017), the second research question guiding this
study was: Is the effect of the timing of EI moderated by learners’ previous knowledge of
the target grammatical structure?

Method
Context
The study took place in an English-speaking university located in the French-speaking
province of Quebec, Canada. The department followed a task-based curriculum that
could be described as modular, including unfocused and focused tasks (Ellis, 2018), and
emphasized real-life, meaningful tasks.

Participants
For the study, we recruited 165 adult intermediate-level learners of French as an L2 from
eight intact classes—four B1-level and four B2-level classes. Levels were determined in
two ways: returning students were placed in the level following the one they had
completed, whereas new students took an in-house placement test. These placement
methods generally yield heterogeneous groups. Learners from two different
levels were included to ensure some diversity in their level of previous knowledge of the
target structure. Classes were taught by five different French L2 teachers. One of the
participating teachers was responsible for three classes, another teacher was in charge of
two classes, and the remaining three teachers were each in charge of one class.


Participants were undergraduate students pursuing different fields of study. They
took French L2 courses on a voluntary basis as part of an elective course or a minor
program for diverse reasons and came from different linguistic backgrounds. The
average participant age was 20.2 years.

Target structure
The present study targeted the French subjunctive because it is a verbal mood that
normally appears at the B1 level (Howard, 2008). In the context where we collected the
data, the subjunctive mood is introduced at the B1 level and is reinforced at subsequent
levels. Apart from its pedagogical relevance, we chose the subjunctive mood because
research had shown that its learning and usage are a challenge even for advanced L2
learners (Bartning & Schlyter, 2004). A sentence using the subjunctive contains two
clauses: a matrix clause that prescribes the use of the subjunctive mood and an
embedded clause where the verb has to be conjugated in the subjunctive mood (e.g.,
J’aimerais que vous veniez me voir après le cours, “I would like you to come see me after
class”). The use of the subjunctive in the embedded clause may be required by a verb, a
subordinating conjunction, or an adjective in the matrix clause. In terms of morpho-
logical inflexions, the subjunctive can be qualified as nonsalient. For regular verbs
ending in -er, the subjunctive inflexion does not differ from the present tense of the
indicative mood except for the first and second persons of the plural. For irregular
verbs, the morphological inflexion can be perceived both at the oral and written level
except for the third-person plural. The conjugation of the subjunctive follows a regular
pattern (one radical with the same endings for each person and number) except for
certain verbs with two radicals and verbs that do not follow a regular pattern (e.g., avoir,
être, aller, faire, pouvoir, savoir). In terms of complexity, it is a feature that could be
characterized as complex because of its low saliency and its communicative redundancy
(DeKeyser, 2015).

Treatment tasks
Following calls to explore tasks in classroom environments (Ellis, 2018), we took care to
ensure that the materials and the interventions reflected teaching methods and behav-
iors to which the participants were accustomed, namely the integration of EI within a
communicative task. We used two different tasks: a ranking task and a decision-making
task. We made the decision to include two tasks rather than one to increase the
likelihood of observing differential effects based on the intervention and to ensure that
the participants who received EI in the posttask phase of the first task would have the
opportunity to use that knowledge in a meaningful context at least once. We recognize
that having two tasks might influence the overall timing, as posttask instruction in
the first task might be considered pretask instruction for the second. To attenuate this
possibility, we chose two tasks dealing with two different topics and students did not
know at the start of the second task that the subjunctive would be targeted. The two
tasks were part of the participants’ normal curriculum.
In the ranking task, the participants had to provide advice to students who would be
coming to study at their university in the winter. The aim of the task was for the
participants to produce a short video to be posted on a website to better prepare
incoming students arriving in the winter. In the pretask, the participants were invited to
share their personal experiences regarding winter and to watch a video depicting the


experience of refugee families in the process of integrating into life in Canada.


During the task, the participants were first invited to think individually of five
recommendations to include in the video. They were then asked to work in teams of
four to select the five best pieces of advice to include in the video. Once this had been
achieved, the participants produced the video in which they gave their advice to future
students. At the posttask, videos were shown in the classroom and the participants had
to vote on the best video to include on the website. The three-phase task took 80 minutes
to complete.
In the second task, the decision-making task, the participants had to reach a
common decision. They worked in teams of four and were asked to play the role of a
student committee in charge of organizing a winter carnival to be held at the
university. In the pretask, teachers first led a discussion on winter activities that
the participants enjoyed. They then went through some pictures of an annual winter
carnival, presenting some cultural information about that event. During the task, the
participants were required to assess seven activity proposals and to eliminate one. At
this phase, the participants were first required to develop criteria on which they
would base their decision. They then read the proposals and came to an agreement on
which proposal would be eliminated. Finally, the participants were required to write
a report explaining their reasoning for the activity that they had eliminated. At the
posttask, each team shared with the rest of the class the activity that they had
eliminated.
Both tasks provided a context where the subjunctive mood was inherently useful to
make recommendations and give advice (Loschky & Bley-Vroman, 1993). However, to
maintain a focus on the communicative outcome of the tasks (Ellis, 2018), the teachers
never told the participants that they were specifically required to use the subjunctive
mood. It was presented as a useful way to express advice and recommendations (see
“Teaching Material” on IRIS database). Care was taken to ensure that the participants in
each group had a common understanding of the task. To make sure that the teachers
respected the different experimental conditions to which they had been assigned, they
were provided with all the necessary teaching materials (i.e., PowerPoint presentation
and slide notes containing the talking points for each slide). We met with the teachers
before and after each session to review the teaching protocol and to ensure that they
operationalized the experimental conditions as we had intended. Even though some
teachers did not agree to let the researchers observe the tasks, all sessions were
audio-recorded. Verification of the recordings confirmed that all teachers adhered to
the established protocol: they respected the scheduled timing for each step and
followed the notes on the PowerPoint slides. Table 1 illustrates
all interventions that took place.

Experimental instruction
To focus on the subjunctive, we favored EI of the deductive kind (Goo et al., 2015) and
developed the instruction in collaboration with the teachers taking part in the research
to make sure that the instruction reflected their actual teaching practices.
In the first experimental task, the EI consisted of a 15-minute presentation of the
form, the meaning, and the use of the subjunctive mood. By way of a PowerPoint
presentation, the teachers showed the participants sentences containing advice using
the subjunctive. The teachers explained with nontechnical metalanguage the structure
of the subjunctive. They also explained when to use the subjunctive and with what type

Table 1. Schedule for the intervention and testing

Day 1: Pretest (all conditions): pretests (EIT and GJT)

Day 3: Task 1
  Pretask condition:  EI (15 min) + pretask (25 min) + task (25 min) + posttask (15 min)
  Task condition:     pretask (25 min) + task (25 min, with EI [15 min] during the task) + posttask (15 min)
  Posttask condition: pretask (25 min) + task (25 min) + posttask (15 min) + EI (15 min)
  Control condition:  pretask (25 min) + comprehension questions + task (25 min) + posttask (15 min)

Day 5: Task 2 + Posttest
  Pretask condition:  EI (7 min) + pretask (15 min) + task (20 min) + posttask (8 min)
  Task condition:     pretask (15 min) + task (20 min, with EI [7 min] during the task) + posttask (8 min)
  Posttask condition: pretask (15 min) + task (20 min) + posttask (8 min) + EI (7 min)
  Control condition:  pretask (15 min) + discussion (7 min) + task (20 min) + posttask (8 min)
  All conditions: immediate posttests (EIT and GJT)

Day 19: Posttest (all conditions): delayed posttests (EIT and GJT)

Note: EIT = elicited imitation test; GJT = grammaticality judgment test; EI = explicit instruction.

of verbs (e.g., verbs expressing necessity: Il faut que tu prennes le métro pour venir à
l’école, “You need to take the metro to go to school”) and how to conjugate the verbs.
They then showed the participants sentences that featured advice requiring the sub-
junctive. The instruction concluded with the teachers presenting 10 sentences and
asking the participants to conjugate the verbs in the sentences.
The second task involved making recommendations. Because use, function, and
meaning were presented in the first task, the EI for the second task was not as exhaustive
and lasted approximately 7 minutes. After reviewing how to form the subjunctive, the
teachers showed the participants ways to formulate recommendations in French using
the subjunctive. They also presented five sentences illustrating recommendations in
which the participants had to conjugate verbs in the subjunctive mode.

Treatment conditions
Six intact classes received explicit grammatical instruction on the French subjunctive
according to three timing conditions: the pretask phase (n = 43), the task phase (n = 40)
or the posttask phase (n = 42). We assigned two classes to each of the three treatment
conditions; two additional classes (n = 40), the control group, completed the tasks
without receiving any EI.
In the first experimental condition, the pretask group, EI was provided in the pretask
phase after the teacher had explained the objective of the tasks and led a group
discussion. In the second experimental condition, the within-task group, EI was
provided during the task. In this instance, the instruction occurred 7 minutes after
students started the group work for both tasks which happens to be at the middle point
of the whole task. Students still had time to complete the tasks afterward. In the third
experimental condition, the posttask group, the teacher provided EI at the end of the
posttask after the participants had completed and presented their tasks. The control
group participants completed the task without receiving any EI. They instead answered
comprehension questions about the video for the first task and took part in a longer
discussion about winter activities for the second task.

Data collection tools
Research in L2 acquisition showing an advantage for instruction often includes
measures tapping into explicit knowledge (Norris & Ortega, 2000). To mitigate this
bias, researchers have called for including tools that measure both explicit and implicit
knowledge (R. Ellis, 2005, 2015). In this study, we used an untimed GJT for the
measurement of explicit knowledge and an EIT for implicit knowledge. Each test
contained 32 items: 24 targeting the subjunctive and 8 distractors. Half the target items
were ungrammatical. For the ungrammatical items of both tests, the verbs in the
embedded clauses were conjugated in the indicative mood instead of the subjunctive
(e.g., Il faut que tu prends* l’escalier, “You must use the stairs”). For validation purposes,
we had administered earlier versions of the tests to 60 learners taking French L2 courses
within the same context where we conducted the present study. The Cronbach alpha
values were .91 for the GJT and .72 for the EIT.
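The reliability coefficients above were presumably obtained with standard software; for readers who want the formula itself, here is a minimal sketch of Cronbach's alpha, assuming an items-by-participants score matrix (the sample data are invented):

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance).

    item_scores: list of k lists, one per item, each holding that item's
    scores across the same n test takers.
    """
    k = len(item_scores)
    n = len(item_scores[0])
    item_vars = sum(pvariance(col) for col in item_scores)
    # Each test taker's total score across all items.
    totals = [sum(col[i] for col in item_scores) for i in range(n)]
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Invented example: three dichotomously scored items, five test takers.
items = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
print(round(cronbach_alpha(items), 2))  # → 0.79
```

When all items covary perfectly, the formula returns 1; values near the article's .91 and .72 indicate high and acceptable internal consistency, respectively.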

Grammaticality judgment test


In the GJT that we used to assess explicit knowledge (R. Ellis, 2005), the participants
were presented with sentences targeting certain linguistic features (e.g., agreements,
verb conjugation, negation) and were required to indicate whether or not the sentences
were grammatical and to correct each sentence that they judged to be ungrammatical.
Although it has been argued that judging grammatical versus ungrammatical sentences
may tap into two different types of knowledge, recent studies (e.g., Vafaee et al., 2017)
have indicated that both sentence types measure explicit knowledge. The test was
paper administered.

Elicited imitation test


We used the EIT to evaluate implicit knowledge (R. Ellis, 2005; Erlam, 2006; Kim &
Nam, 2017) even though some recent studies have suggested that elicited imitation may
assess automatized explicit knowledge (Suzuki & DeKeyser, 2015). What seems to be
clear is that GJTs and EITs load on different factors (Kim & Nam, 2017) and, as
DeKeyser (2017) pointed out, automatized explicit knowledge is functionally equiva-
lent to implicit knowledge. In EITs, learners are presented with statements and are
required to judge their content by indicating whether the statements are true or false or
to indicate whether or not they agree with what has been stated. Deciding whether
statements are true or false incites learners to focus on the meaning of the statements,
reducing the chances that they would pay attention to form (Erlam, 2006). Learners
are then asked to repeat the statements correctly. In this study, the statements in the EIT
took the form of advice to give (e.g., Il faut que nous recyclions les déchets, “We are
required to recycle waste”). For each statement, the participants had to indicate on an
answer sheet if they agreed with the advice (i.e., if they thought it was good advice or not
appropriate or relevant to the topic at hand). They were told to repeat the statement
immediately in correct French, but no time limit was given. The test was administered
using CAN-8 Virtual Lab (Version 3.16, 2018), language-learning software used in the
computer laboratory of the university where the study took place.

Procedures
The interventions involved four 80-minute class periods spanning a 4-week period
(see Table 1). During the first session (Day 1), the participants first completed


the EIT and then the GJT, which took approximately 30 minutes. The second session
took place 2 days later (Day 3), when the participants completed the ranking task. Two days
later, during the third session (Day 5), the participants completed the decision-making
tasks (50 minutes), and they completed the immediate posttest (30 minutes). Two
weeks later (Day 19), the participants completed the delayed posttest. All interventions
and tests took place during regularly scheduled class time.

Scoring
For the GJT, one point was given for correctly identifying a grammatical target item
and one point was given for properly correcting an ungrammatical target item.
One-half point was given for correctly identifying a target item as ungrammatical
when the correction itself contained an error (J’aimerais qu’il peuve* [puisse] venir à
la fête ce soir, “I would like for him to be able to come to the party tonight”).
For the EIT, one point was given for correctly repeating a grammatical target
item and one point was given for properly correcting an ungrammatical target item.
One-half point was given for partially correcting an ungrammatical target item
(Il faut que tu rendisses* [rendes] un bon travail, “You must submit good work”).
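This rubric maps naturally onto a small scoring function. Below is a sketch under our own encoding assumptions: the response representation (a judgment flag plus a "full"/"partial" correction label) is ours, not the authors', and how an absent correction was scored is not specified in the article.

```python
def score_gjt_item(is_grammatical, judged_grammatical, correction=None):
    """Score one GJT target item (max 1 point).

    correction: None, "full" (error-free correction), or "partial"
    (rightly judged ungrammatical, but the correction contains an error).
    """
    if is_grammatical:
        # One point for accepting a grammatical item.
        return 1.0 if judged_grammatical else 0.0
    if judged_grammatical:
        # Ungrammatical item wrongly accepted as grammatical.
        return 0.0
    # Rightly rejected: full credit for a proper correction, half credit
    # for a flawed one (scoring of an absent correction is our guess).
    return 1.0 if correction == "full" else 0.5

# Invented responses to four target items.
total = sum([
    score_gjt_item(True, True),               # grammatical, accepted
    score_gjt_item(False, False, "full"),     # corrected properly
    score_gjt_item(False, False, "partial"),  # flawed correction
    score_gjt_item(False, True),              # wrongly accepted
])
print(total)  # → 2.5
```

The EIT rubric is structurally identical (repetition instead of judgment), so the same function shape would apply.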

Analysis
We conducted analyses of covariance (ANCOVAs) to determine the differential effects
of the three timing conditions, using the pretest scores as a covariate (Tabachnick &
Fidell, 2013). We performed pairwise post hoc comparisons to locate the source of
difference, applying the Bonferroni correction to adjust for the number of pairwise
comparisons. We calculated Cohen’s d to estimate effect sizes, which we interpreted
following Plonsky and Oswald’s (2014) recommendations: small effect ≥ 0.4; medium
effect ≥ 0.7; and large effect ≥ 1. To assess the mediating effect of the learner’s previous
level of knowledge on the timing at which instruction is given, we performed regression
analyses using the posttest scores as the dependent variables and the pretest scores and
groups (timing of instruction) as the independent variables.
We verified all assumptions for the statistical tests and have reported the results
when the assumptions were violated.
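No analysis code accompanies the article. As a hedged illustration of the reported effect-size and correction steps (the ANCOVA itself, which additionally adjusts posttest means for the pretest covariate, is omitted here), the following pure-Python sketch implements Cohen's d with a pooled standard deviation, the Bonferroni adjustment, and Plonsky and Oswald's (2014) benchmarks; the score vectors are invented:

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benchmark(d):
    """Plonsky and Oswald's (2014) benchmarks used in the article."""
    a = abs(d)
    if a >= 1.0:
        return "large"
    if a >= 0.7:
        return "medium"
    if a >= 0.4:
        return "small"
    return "negligible"

# Invented posttest scores for two groups of five learners.
task_group = [20, 22, 19, 21, 18]
control_group = [15, 14, 16, 13, 17]
d = cohens_d(task_group, control_group)
print(benchmark(d))  # prints "large"
```

Note that because six pairwise comparisons are run per posttest, the Bonferroni step multiplies each raw p value by six, which is why several adjusted p values in Tables 4 and 5 are reported as 1.00.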

Results
Timing conditions
To answer the first research question, we analyzed the data to determine the overall
effects of the different timing conditions. Tables 2 and 3 show the descriptive statistics
for the GJT and the EIT for the three testing sessions.
As Figures 1 and 2 illustrate, we observed similar trends for the GJT and the EIT.
All groups performed comparably at the pretest. By the time of the immediate
posttest, the experimental groups had improved more than the control group, but
the experimental groups showed a slight decline at the delayed posttest, whereas the
control group had continued to improve. Looking at the performance data,1 participants
of the pretask group used on average 5.00 (SD = 2.00) occurrences of the

1 For more about the performance data, see Michaud (2021).

Table 2. Grammaticality judgment test: Descriptive statistics for overall learning effects by group

                    Pretest         Immediate posttest   Delayed posttest
Group       N       M      SD       M       SD           M       SD
Pretask     41      13.10  5.96     19.81   3.94         18.05   4.79
Task        35      13.00  4.56     20.16   3.68         19.00   4.01
Posttask    38      12.41  5.98     19.09   4.60         18.91   4.57
Control     38      12.38  5.08     14.91   5.30         15.25   5.12

Note: Maximum score = 24. The ns for the GJT and EIT are not the same because some participants did not complete one or
the other of the tests for a variety of reasons.

Table 3. Elicited imitation test: Descriptive statistics for overall learning effects by group

                    Pretest         Immediate posttest   Delayed posttest
Group       N       M      SD       M       SD           M       SD
Pretask     41      3.60   4.44     9.92    6.57         8.93    6.75
Task        39      2.95   2.66     9.14    5.87         9.30    5.20
Posttask    42      3.93   3.57     9.38    6.38         8.79    5.82
Control     36      3.22   2.89     4.74    4.03         5.32    3.90

Note: Maximum score = 24. The ns for the GJT and EIT are not the same because some participants did not complete one or
the other of the tests for a variety of reasons.

Figure 1. Grammaticality judgment test: Trends for the four groups across testing sessions.
Note: Maximum score = 24.

subjunctive in their tasks, the task group 3.50 (SD = 1.51), and the posttask group
1.91 (SD = 1.64), which suggests that the earlier the instruction is provided, the more
the target structure is used.
Results of a one-way analysis of variance (ANOVA) showed no significant differ-
ences between groups at the pretest for the GJT, F(3, 148) = 0.19, p = .91, ηp² = .00, nor
for the EIT, F(3, 154) = 0.61, p = .61, ηp² = .01. To determine if there were any
significant differences between the groups at the posttests, we conducted ANCOVAs
with the pretest scores as a covariate after checking the ANCOVA assumptions. The
skew index and the kurtosis index were below 3 and 10, respectively (Kline, 2020), and


Figure 2. Elicited imitation test: Trends for the four groups across testing sessions.
Note: Maximum score = 24. Adjusted means.

Table 4. Grammaticality judgment posttests: Post hoc pairwise comparisons for participants’ adjusted
means using the grammaticality judgment pretest as a covariate

                        Immediate posttest               Delayed posttest
Comparison              95% CI (Mdiff)   p       d       95% CI (Mdiff)   p       d
Pretask vs. Task        [−0.39, 2.74]    1.00    0.10    [−1.00, 3.31]    1.00    0.22
Pretask vs. Posttask    [−0.45, 1.85]    1.00    0.11    [−1.19, 3.45]    1.00    0.25
Pretask vs. Control     [−4.63, −2.32]   <.01    1.00    [−2.46, −0.19]   .03     0.50
Task vs. Posttask       [−0.84, 1.56]    1.00    0.20    [−0.19, 2.55]    1.00    0.05
Task vs. Control        [−5.02, −2.62]   <.01    1.09    [−3.45, −1.10]   <.01    0.75
Posttask vs. Control    [−4.17, −1.83]   <.01    0.84    [−3.65, −1.34]   <.01    0.75

Note: The p values were adjusted using the Bonferroni correction.

tests of homogeneity of variance were not violated (p > .05).2 For the GJT, the analyses
indicated that the differences between the four groups were significant at the immediate
posttest, F(3, 147) = 13.22, p < .01, ηp² = .21, as well as at the delayed posttest, F(3, 147)
= 7.08, p < .01, ηp² = .13. Table 4 provides the results of post hoc pairwise comparisons
of the means for the grammaticality judgment posttests adjusted for the grammaticality
judgment pretest in the ANCOVA.
Inspection of the post hoc analyses for the GJT revealed that all experimental groups
significantly outperformed the control group at the immediate posttest and delayed
posttest, with the within-task group showing the highest effect size (d = 1.09 and 0.75,
respectively). The pretask group initially presented a large effect size (d = 1.00) but the
effect dropped at the delayed posttest (d = 0.50), whereas the two other experimental
groups maintained their scores in a relatively more stable way. However, there were no
significant differences between the experimental groups, and the effect sizes were all
small (< 0.4).
We observed a similar profile for the EIT. ANCOVAs revealed significant effects at
the immediate posttest, F(3, 153) = 9.27, p < .01, ηp² = .15, and the delayed posttest,
F(3, 153) = 6.35, p < .01, ηp² = .11. Table 5 presents the results of the post hoc pairwise

2 While some ANCOVAs did not meet the homogeneity assumption, when sample sizes are equivalent—as
is the case in the present study—they can be considered sufficiently robust to overcome violations of the
homogeneity assumption (Howell, 2008).

Table 5. Elicited imitation posttests: Post hoc pairwise comparisons for participants’ adjusted means
using the elicited imitation pretest as a covariate

                        Immediate posttest               Delayed posttest
Comparison              95% CI (Mdiff)   p       d       95% CI (Mdiff)   p       d
Pretask vs. Task        [−2.57, 2.70]    1.00    0.01    [−3.58, 1.53]    1.00    0.17
Pretask vs. Posttask    [−1.70, 3.45]    1.00    0.14    [−2.05, 2.96]    1.00    0.07
Pretask vs. Control     [2.08, 7.45]     <.01    0.86    [0.62, 5.83]     .01     0.58
Task vs. Posttask       [−1.82, 3.43]    1.00    0.13    [−1.07, 4.02]    .80     0.27
Task vs. Control        [1.98, 7.41]     <.01    0.93    [1.61, 6.88]     <.01    0.92
Posttask vs. Control    [1.22, 6.56]     <.01    0.72    [0.18, 5.36]     .04     0.55

Note: The p values were adjusted using the Bonferroni correction.

comparisons of the means for the elicited imitation posttests adjusted for the elicited
imitation pretest in the ANCOVA.
Once again, all experimental groups significantly outperformed the control group at
the immediate and delayed posttests, with the task group showing the highest effect
sizes, which were relatively stable at both posttests (d = 0.93 and d = 0.92). No significant
differences were observed between the experimental groups.

Learner level of knowledge


The second research question concerned the mediating effect of learners’ previous
knowledge of the target form across the three timing conditions. B1- and
B2-level groups participated in the research. We took care to include one class from
each proficiency level in each condition (both experimental and control) to answer
the second research question. To establish the mediating effect of the participants’
readiness to acquire the subjunctive, we performed regression analyses using the
posttest scores as a dependent variable and the pretest scores and groups as
independent variables. An inspection of the standardized residuals graphs and
the Q-Q plot standardized residuals do not show any specific concerns. For the
variance inflation factor (VIF), we obtained results below 10, which is the generally
accepted limit suggesting a multicollinearity problem, except for the GDT, where
the VIF for the interactions between groups and pretest scores was 10.30, which is
slightly higher than the accepted limit. Tables 6 and 7 present the results of the
regression analyses.
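The VIF is defined as 1/(1 − R²_j), where R²_j is obtained by regressing predictor j on the remaining predictors. The sketch below (simulated data, not the study's dataset; numpy only) computes VIFs from that definition for a group × pretest moderation model, and shows why a raw product term, like the one that reached 10.30 above, tends to be collinear with its components, and how mean-centering mitigates this.

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a predictor matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by regressing
    column j on the remaining columns (plus an intercept).
    """
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(0)
n = 60
group = rng.integers(0, 2, n).astype(float)   # dummy-coded condition
pretest = rng.normal(50.0, 10.0, n)           # simulated pretest scores

# Raw product term: highly collinear with its components
X_raw = np.column_stack([group, pretest, group * pretest])
vif_raw = vif(X_raw)

# Mean-centering the pretest before forming the product reduces that collinearity
centered = pretest - pretest.mean()
X_centered = np.column_stack([group, centered, group * centered])
vif_centered = vif(X_centered)
print(vif_raw.round(1), vif_centered.round(1))
```

Centering is a purely cosmetic reparameterization of the same model, but it keeps the VIF of the interaction term well below the threshold discussed above.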
For the GJT, the group and the pretest were significant predictors for the immediate
posttest and for the delayed posttest, respectively. For the EIT, groups and pretest scores
were significant predictors for both posttests. However, the interactions between
groups and pretest scores were not a significant predictor for any posttests, indicating
that the level of previous knowledge does not have a significant moderating effect.

Discussion
This study sought (a) to investigate the effects of the timing of EI within a task-based
cycle among intermediate-level French L2 students and (b) to determine the mediating
effect of learners’ previous knowledge. Participants took part in two tasks over two class
periods where they received explicit instruction on the French subjunctive either during
periods where they received explicit instruction on the French subjunctive either during



456 Gabriel Michaud and Ahlem Ammar
Table 6. Predictors for the grammaticality judgment test scores

                                 Immediate posttest          Delayed posttest
Predictors                       βa        p                 βa        p
(Intercept)                                <.01                        <.01
Groups                           –0.60     <.01              –0.24     0.16
Pretest                          0.19      0.24              0.46      <.01
Groups × Pretest (interaction)   0.36      0.11              0.11      0.61

a
Standardized regression coefficient; R2 for the immediate posttest = 0.32; R2 for the delayed posttest = 0.32.

Table 7. Predictors for the elicited imitation posttest scores

                                 Immediate posttest          Delayed posttest
Predictors                       βa        p                 βa        p
(Intercept)                                <.01                        <.01
Groups                           –0.31     <.01              –0.21     0.02
Pretest                          0.53      <.01              0.60      <.01
Groups × Pretest (interaction)   0.10      0.48              <.01      1.00

a
Standardized regression coefficient; R2 for the immediate posttest = 0.44; R2 for the delayed posttest = 0.73.

the pretask, the task, or the posttask cycle of each task. A control group completed the
two tasks without receiving any explicit instruction.
With respect to overall trends, all participants in the experimental groups
significantly outperformed the control group on both posttests for both the GJT and
the EIT. This confirms the well-established efficacy of EI over zero instruction (Goo
et al., 2015; Norris & Ortega, 2000). However, no significant differences were
observed between the experimental conditions. This finding is in line with previous
studies that also did not observe differences between different timing conditions
(Li et al., 2016a; Quinn, 2014; Spada et al., 2014). However, contrary to Li et al.
(2016a), the present study did find significant differences between the experimental
groups and the task-only group. Participants in Li et al.’s (2016a) study completed
tasks in only one class period and may not have had sufficient time to take advantage
of the instruction that they had received, whereas the participants in this present
study had two separate occasions to reinvest their knowledge. Furthermore, the
participants in Li et al. (2016a) were relative beginners and were likely hindered in
their ability to make good use of the EI provided at the beginning of the task. In Spada
et al. (2014), even though no significant differences were noted between the isolated
and integrated groups, based on the effect sizes of the tests reputed to assess implicit
and explicit knowledge, greater gains were observed for implicit knowledge for the
integrated group and for explicit knowledge for the isolated group. In our study, the
effect sizes for the GJT and the EIT do not seem to favor the development of one
kind of knowledge over another for any experimental group. This might be explained
by the fact that the type of instruction differed between the two conditions in Spada
et al., whereas participants in this study received the same instruction (only its timing
differed). Therefore, from an acquisitional perspective, the timing of EI does not seem to
significantly influence efficacy, for either explicit or implicit knowledge.
Among the experimental groups, although the differences were not significant, the
within-task group showed greater effect sizes than the other groups, reaching a large
effect size for the GJT at the immediate posttest and almost reaching a large effect size for




both posttests of the EIT (respectively, d = 0.93 and d = 0.92). It was also this group that
showed the least knowledge decay between both posttests. The fact that within-task
participants showed the highest effect sizes is in line with previous research reporting
that within-task instruction had a facilitating L2 development effect (Arroyo & Yilmaz,
2018; Fu & Li, 2022; Li et al., 2016b). The proximity of teaching and learning
contexts may lie at the origin of this result, in the sense that such proximity
might have eased the processing demands on learners trying to understand a new
notion. This result can also be interpreted from the perspective of preparatory attention
theory, with participants in the within-task condition seeing a need for the subjunctive
mood to help them with their tasks. The tasks may have provided the participants with
the impetus to start an initial bottom-up processing of the input (LaBerge, 1995), and EI
may have rendered more noticeable a feature with low saliency, as usage-based theories
would predict. Finally, the fact that the within-task group performed somewhat better
than the posttask group is likely explained by the immediate opportunity afforded to
apply newly acquired knowledge in a meaningful context. The task was still in progress
when the within-task group received the EI, giving the participants the chance to dynamically
reinvest and validate their knowledge in a meaningful situation, an outcome consistent
with skill acquisition theory. This last hypothesis might also explain why within-task
groups showed an advantage over posttask groups in Li et al. (2016b), Fu and Li (2022),
but not in Quinn (2014). In Li et al. (2016b) and Fu and Li (2022), learners who received
posttask CF did not have the chance to reinvest the knowledge in a communicative
context allowing them to develop procedural knowledge, whereas in Quinn (2014)
learners engaged in three task cycles and received delayed CF after each task, giving
them the opportunity to make use of the information in the follow-up tasks. Similarly,
our study also relied on a two-task cycle. The absence of practice opportunities following
posttask instruction may therefore explain the difference in learning outcomes, a
hypothesis that should be empirically validated.

Learners’ previous knowledge


To appreciate the moderating effect of previous knowledge on timing of instruction,
learners from different levels took part in this study. Contrary to previous studies,
which showed that a learner’s prior knowledge has a moderating effect on the timing of
instruction (Li et al., 2016a; Shintani, 2017), results from the present study did not
reveal such effects. The difference in the reported findings might be caused by the way
level of previous knowledge was operationalized. Li et al. (2016a) and Shintani (2017)
elected to use a cut-off point based on the pretest scores in accordance with established
criteria (zero vs. some knowledge) and separated learners into two different groups. In
our analyses, instead of considering proficiency as a categorical variable, we elected to
control for the level of previous knowledge of all participants in a continuous manner to
avoid any data loss. The fact that learners’ level of previous knowledge does not seem to
influence timing of instruction might be attributed to a lack of variability in terms of
level. In fact, while we included learners from two different levels (B1/B2), both are
within the intermediate stage when it comes to knowledge of the subjunctive. In
other words, the difference in knowledge was not large enough to permit an accurate
evaluation of the mediating effect of previous knowledge. Unlike Li et al.
(2016a) whose sample included learners with zero knowledge, few of the participants in
the present study possessed zero knowledge of the subjunctive. Accordingly, it is
possible that results would have been different with more beginner learners. In any




event, the results of this study do not support a moderating effect of level of knowledge
on timing of instruction.

Pedagogical implications
Results from this study indicate that planning an explicit instruction segment on a
grammatical form within a task cycle leads to an improvement of both explicit and implicit
knowledge, regardless of whether it happens during the pretask, the within-task, or the
posttask phase and regardless of the learner’s level of previous knowledge. Accordingly,
the criticisms against a proactive or a task-supported approach seem to be unwarranted
from an acquisition perspective. Even the slight advantage accruing from
integrating explicit instruction during the within-task phase should not dissuade a
teacher who prefers pretask instruction (East, 2017; Zheng & Borg, 2014) from
proceeding. There remains, however, the hypothesis issued by Willis and Willis
(2007) that focusing on form in the pretask or within-task phase might distract the
learner away from the communicative intention of the task. More work in this area
remains to be undertaken, but a complementary study by Michaud (2021) does not
support this claim: Pretask or task instruction was shown not to have a negative impact
on performance data.

Conclusion
This study sought to ensure ecological validity, working with students and teachers in
a regular classroom setting. It was therefore not possible to include a true control
group that would have only completed the pretests and posttests. Furthermore, to
extend the instruction effect and reflect pedagogical practices, a two-task cycle was
included. While we endeavored to attenuate the possibility that posttask EI could
serve as pretask instruction in the second task, we cannot exclude this possibility. We
are confident, however, that the 2-day interval between tasks was sufficient
to mitigate it. Even though care was taken to ensure that teachers
followed the established protocol, we cannot rule out teacher effect. With respect
to assessment, we used two different measures to assess learning gains, a GJT and an
EIT. These tests have been subjected to multiple validations in different studies
yielding contradictory results as to whether they are a true measure of explicit/
implicit knowledge versus declarative/procedural knowledge. This is especially the
case of the EIT that might be a measure of proceduralized knowledge rather than
implicit knowledge (Suzuki & DeKeyser, 2015). Future studies should therefore
include new tests that tap more directly into implicit knowledge, such as self-paced reading
or word monitoring.
This study was the first to look at the timing of instruction in the three phases of a
task. Given the increased interest in task-based instruction and teaching practices with
respect to form-focused instruction, much more remains to be done in this area. Future
research might look at different proficiency levels and target grammatical forms of
varied complexity. Adopting a process-oriented framework focusing on how learners
perform the task might also be valuable to inform teaching practices.
Acknowledgments. We would like to thank the reviewers and the editors for their time. Their feedback and
contributions have improved the clarity and quality of our effort to communicate the results of our study. We
are grateful to all of the teachers who participated in this study, sharing their experience and welcoming us
into their classrooms. A special thank you to Jennica Grimshaw for her assistance throughout the data
collection process. Lastly, we would like to express our gratitude to Randall Halter for his valuable input.




References
Arroyo, D. C., & Yilmaz, Y. (2018). An open for replication study: The role of feedback timing in synchronous
computer‐mediated communication. Language Learning, 68, 942–972.
Bartning, I., & Schlyter, S. (2004). Itinéraires acquisitionnels et stades de développement en français L2.
French Language Studies, 14, 281–299. https://doi.org/10.1017/S0959269504001802
CAN-8 Virtual Lab [Computer software]. (2018). Sounds Virtual Inc. http://www.can8.com/
DeKeyser, R. M. (1998). Beyond focus on form: Cognitive perspectives on learning and practicing second
language grammar. In C. Doughty & J. Williams (Eds.), Focus on form in classroom second language
acquisition (pp. 42–63). Cambridge University Press.
DeKeyser, R. M. (2007). Skill acquisition theory. In B. VanPatten & J. Williams (Eds.), Theories in second
language acquisition: An introduction (pp. 91–113). Lawrence Erlbaum.
DeKeyser, R. M. (2015). What makes learning second‐language grammar difficult? A review of issues.
Language Learning, 51, 1–25. https://doi.org/10.1111/j.0023-8333.2005.00294.x
DeKeyser, R. (2017). Knowledge and skill in ISLA. In The Routledge handbook of instructed second language
acquisition (pp. 15–32). Routledge.
Doughty, C., & Williams, J. (1998). Pedagogical choices in focus on form. In C. Doughty & J. Williams (Eds.),
Focus on form in classroom second language acquisition (pp. 197–261). Cambridge University Press.
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New Zealand. John
Benjamins.
East, M. (2017). Research into practice: The task-based approach to instructed second language acquisition.
Language Teaching, 50, 412–424.
Ellis, N. C. (2005). At the interface: Dynamic interactions of explicit and implicit language knowledge. Studies
in Second Language Acquisition, 27, 305–352. https://doi.org/10.1017/S027226310505014X
Ellis, N. C., & Wulff, S. (2015). Usage‐based approaches to SLA. In B. VanPatten & J. Williams (Eds.), Theories
in second language acquisition: An introduction (2nd ed., pp. 75–93). Routledge/Taylor & Francis.
Ellis, R. (2003). Task-based language learning and teaching. Oxford University Press.
Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric study.
Studies in Second Language Acquisition, 27, 141–172. https://doi.org/10.1017/S0272263105050096
Ellis, R. (2015). Understanding second language acquisition (2nd ed.). Oxford University Press.
Ellis, R. (2018). Reflections on task-based language teaching. Multilingual Matters.
Ellis, R., & Shintani, N. (2013). Exploring language pedagogy through second language acquisition research.
Routledge.
Ellis, R., Skehan, P., Li, S., Shintani, N., & Lambert, C. (2020). Task-based language teaching: Theory and
practice. Cambridge University Press. https://doi.org/10.1017/9781108643689
Erlam, R. (2006). Elicited imitation as a measure of L2 implicit knowledge: An empirical validation study.
Applied Linguistics, 27, 464–491. https://doi.org/10.1093/applin/aml001
Fu, M., & Li, S. (2022). The effects of immediate and delayed corrective feedback on L2 development. Studies
in Second Language Acquisition, 44(1), 2–34.
Goo, J., Granena, G., Novella, M., & Yilmaz, Y. (2015). Implicit and explicit instruction in L2 learning: Norris
and Ortega (2000) revisited and updated. In P. Rebuschat (Ed.), Implicit and explicit learning of languages
(pp. 443–482). John Benjamins.
Hondo, J. (2015). Teaching English grammar in context: The timing of form-focused intervention. In M.
Christison, D. Christian, P. Duff, & N. Spada (Eds.), Teaching and learning English grammar: Research
findings and future directions (pp. 34–49). Routledge.
Howard, M. (2008). Morpho-syntactic development in the expression of modality: The subjunctive in French
L2 acquisition. Canadian Journal of Applied Linguistics, 11, 171–192. https://journals.lib.unb.ca/index.
php/CJAL/article/view/19921/21783
Howell, D. C. (2008). Méthodes statistiques en sciences humaines (6th ed). De Boeck Université.
James, W. (1890). Principles of psychology (Vols. 1–2). Holt.
Kim, J. E., & Nam, H. (2017). Measures of implicit knowledge revisited: Processing modes, time pressure, and
modality. Studies in Second Language Acquisition, 39, 431–457.
Kline, R. B. (2020). Post p value education in graduate statistics: Preparing tomorrow’s psychology
researchers for a postcrisis future. Canadian Psychology/Psychologie canadienne, 61, 331–341.
LaBerge, D. (1995). Attentional processing: The brain’s art of mindfulness. Harvard University Press.



Li, S., Ellis, R., & Zhu, Y. (2016a). Task-based versus task-supported language instruction: An experimental
study. Annual Review of Applied Linguistics, 36, 205–229. https://doi.org/10.1017/S0267190515000069
Li, S., Zhu, Y., & Ellis, R. (2016b). The effects of the timing of corrective feedback on the acquisition of a new
linguistic structure. The Modern Language Journal, 100, 276–295. https://doi.org/10.1111/modl.12315
Lightbown, P. M. (1998). The importance of timing in focus on form. In C. Doughty & J. Williams (Eds.),
Focus on form in classroom second language acquisition (pp. 177–196). Cambridge University Press.
Long, M. H. (2015). Second language acquisition and task-based language teaching. Wiley-Blackwell.
Loschky, L. & Bley-Vroman, R. (1993). Grammar and task-based methodology. In G. Crookes & S. Gass
(Eds.), Tasks and language learning: Integrating theory and practice (pp. 123–167). Multilingual Matters.
Michaud, G. (2021). L’incidence d’un enseignement centré sur la forme sur la performance orale. The
Canadian Modern Language Review, 77, 269–289.
Nassaji, H., & Fotos, S. (2011). Teaching grammar in second language classrooms: Integrating form-focused
instruction in communicative context. Routledge.
Norris, J. M., & Ortega, L. (2000). Does type of instruction make a difference? Substantive findings from a
meta-analytic review. Language Learning, 51, 157–213. https://doi.org/10.1111/j.1467-1770.2001.tb00017.x
Philp, J., & Duchesne, S. (2016). Exploring engagement in tasks in the language classroom. Annual Review of
Applied Linguistics, 36, 50–72.
Plonsky, L., & Oswald, F. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning,
64, 878–912. https://doi.org/10.1111/lang.12079
Prabhu, N. (1987). Second language pedagogy. Oxford University Press.
Quinn, P. (2014). Delayed versus immediate corrective feedback on orally produced passive errors in English
(Unpublished doctoral dissertation). University of Toronto, Toronto, ON, Canada.
Samuda, V. (2001). Guiding relationships between form and meaning during task performance: The role of
the teacher. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks: Second language
learning, teaching and testing (pp. 119–140). Longman.
Schmidt, R. (2001). Attention. In P. Robinson (Ed.), Cognition and second language instruction (pp. 3–32).
Cambridge University Press.
Shintani, N. (2017). The effects of the timing of isolated FFI on the explicit knowledge and written accuracy of
learners with different prior knowledge of the linguistic target. Studies in Second Language Acquisition, 39,
129–166. https://doi.org/10.1017/S0272263116000127
Spada, N. (2019, August). Reflecting on TBLT from an instructed SLA perspective [Plenary address]. 8th
International TBLT Conference, Ottawa, ON, Canada.
Spada, N., Jessop, L., Tomita, Y., Suzuki, W., & Valeo, A. (2014). Isolated and integrated form-focused
instruction: Effects on different types of L2 knowledge. Language Teaching Research, 18, 453–473. https://
doi.org/10.1177/1362168813519883
Suzuki, Y., & DeKeyser, R. (2015). Comparing elicited imitation and word monitoring as measures of implicit
knowledge. Language Learning, 65, 860–895.
Swain, M. (2000). The output hypothesis and beyond: Mediating acquisition through collaborative dialogue.
In J. P. Lantolf (Ed.), Sociocultural theory and second language learning (pp. 97–114). Oxford University
Press.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
Vafaee, P., Suzuki, Y., & Kachinske, I. (2017). Validating grammaticality judgment tests: Evidence from two new
psycholinguistic measures. Studies in Second Language Acquisition, 39, 59–95. https://doi.org/10.1017/
S0272263115000455.
Van den Branden, K. (2016). Task-based language teaching. In G. Hall (Ed.), The Routledge handbook of English
language teaching (pp. 238–251). Routledge.
Willis, D., & Willis, J. (2007). Doing task-based teaching. Oxford University Press.
Zheng, X., & Borg, S. (2014). Task-based learning and teaching in China: Secondary school teachers’ beliefs
and practices. Language Teaching Research, 18, 205–221.

Cite this article: Michaud, G. and Ammar, A. (2023). Explicit Instruction within a Task: Before, During, or
After?. Studies in Second Language Acquisition, 45, 442–460. https://doi.org/10.1017/S0272263122000316



Studies in Second Language Acquisition (2023), 45, 461–479
doi:10.1017/S0272263122000328

RESEARCH ARTICLE

Sources and effects of foreign language enjoyment, anxiety, and boredom: A structural
equation modeling approach
Jean-Marc Dewaele1*, Elouise Botes2, and Samuel Greiff3
1Birkbeck, University of London, London, UK; 2University of Vienna, Vienna, Austria; 3University of
Luxembourg, Luxembourg
*Corresponding author. E-mail: j.dewaele@bbk.ac.uk

(Received 27 November 2021; Revised 21 July 2022; Accepted 25 July 2022)

Abstract
The present study is among the first to investigate how three foreign language (FL) emotions,
namely FL enjoyment (FLE), anxiety (FLCA), and boredom (FLB), are related to each other.
It is the first study to consider how the three FL emotions are shaped by one learner-internal
variable (attitude toward the FL), by two perceived teacher behaviors (frequency of use of the
FL in class and unpredictability), and how all these variables jointly affect learners’ FL
achievement. Participants were 332 FL learners from all over the world studying a wide
variety of FLs who filled out an online questionnaire. A close-fitting structural equation
model revealed associations between FLE, FLCA, and FLB. Teacher behaviors positively
affected FLE, with no discernible effect on FLB or FLCA. Only FLCA was found to have a
(negative) effect on academic achievement. The study confirms the complex relationship
between teacher behaviors and positive emotions in the FL classroom.

Introduction
There is a famous passage in J. K. Rowling’s Harry Potter and the Order of the Phoenix
(2003) where Harry and his friend Ron tell their friend Hermione (all adolescents at
this point in the saga) that they do not understand the behavior of Cho, who ran away in
tears after kissing Harry. Hermione lays down her quill and proceeds to explain the
emotional confusion of Cho who is grieving for the death of her former boyfriend, feels
guilty for falling in love with Harry, fears social disapproval when becoming his
girlfriend, and is also afraid of being thrown off the Ravenclaw Quidditch team because
of weak performance. The boys are gobsmacked:

A slightly stunned silence greeted the end of this speech, then Ron said, “One
person can’t feel all that at once, they’d explode.” “Just because you’ve got the
emotional range of a teaspoon doesn’t mean we all have,” said Hermione
nastily, picking up her quill again. (p. 406)

© The Author(s), 2022. Published by Cambridge University Press.

https://doi.org/10.1017/S0272263122000328 Published online by Cambridge University Press



There is a parallel to be made with the recent explosion of interest in the emotions
of foreign language (FL) learners. For many decades, research focused on a single
negative emotion (anxiety) and only recently did the range expand when Dewaele and
MacIntyre (2014) juxtaposed FL classroom anxiety (FLCA) with FL enjoyment (FLE)
with the aim of understanding to what extent they were related to each other, and
whether they were linked to similar learner-internal and learner-external variables.1
The initial study and follow-up research of these two emotions provided plenty of
evidence of complex dynamic interactions with a wide range of psychological,
attitudinal, motivational, sociobiographical, and linguistic variables shaped by the
classroom context, the school, and even the larger societal context (for overviews see
Dewaele, Chen et al., 2019; Wang et al., 2021). One crucial insight was just how messy
and changeable the relationship between enjoyment and anxiety can be. The absence
of one does not automatically imply the presence of the other and vice versa.
Researchers soon pointed out that learners may experience other emotions such as
FL boredom (FLB; Li et al., 2020; Pawlak et al., 2020). The latest research in the field
has started to include FLE, FLCA, and FLB in a single research design to find out how
these three emotions jointly predict FL achievement and to what extent the three
learner emotions are linked to the same or different learner-internal and learner-
external variables (Li & Han, 2022; Li & Wei, 2022).
This research is led by the belief that a better understanding of the complex emotions
of FL learners can lead to improved pedagogical practices which in turn will boost
performance and progress of FL learners. The current study follows this avenue of
research and uses structural equation modeling, which allows researchers to explore
complex relationships between different latent variables in one single model.

Literature review
The literature review is organized as follows: We start by sketching the theoretical
and epistemological foundation of the present study, complex dynamic systems
theory, which shaped our research questions, and report on a number of studies that
adopted this framework. The next section defines the three emotions under investiga-
tion in the present study and presents the few studies that have included these emotions
in a single research design. Next, we turn to studies that have linked teacher-related
variables to learner emotions. This is followed by a section on the studies that
investigated learner emotions and attitude/motivation after which we introduce a final
section on the relationship between learner emotions and academic achievement. The
literature review concludes with a theoretical justification for including FLE, FLCA, and
FLB in a single research design.
The field of emotions in FL learning and teaching has been strongly affected by an
influential paradigm shift in the last decade, namely complex dynamic systems theory
(MacIntyre et al., 2015). The default assumption of researchers working in this
paradigm is that emotions that seem stable on a relatively long timescale (weeks,
months, years) may in fact be fluctuating more strongly on shorter timescales
(seconds, minutes, days) and that not all learners follow the average patterns for
the group (Elahi Shirvan et al., 2020, 2021; Li, 2021). The second assumption is that
dependent and independent variables are constantly interacting, shaped by the
context and by variables lurking in the background. In other words, learner-internal
variables interact with contextual variables resulting in unique patterns that change
over time. A simple illustration of this phenomenon is the classroom observation of




Denisa, a Romanian EFL participant in Dewaele and Pavelescu (2021), who reported
low FLCA, high FLE and who usually participated actively in class, except on the day
the regular teacher was absent and a substitute teacher took the class. She reported
later disliking the teacher, not enjoying the class, and preferring to remain silent
throughout the class. This suggests that how learners feel about their FL
classes is determined partly by relatively long-term dispositions linked to attitudes,
motivation, and recent classroom experiences; partly by teacher behaviors; and partly by
local, transient, and unpredictable factors that can lead to spikes or drops in FLE, FLCA, and
FLB (Dewaele et al., 2022a, 2022b).

FL Emotions: Definitions and interactions


The concept of foreign language enjoyment (FLE) as presented in Dewaele and
MacIntyre (2014) draws on the theories of positive psychology, and more specifically
on the work of Csíkszentmihályi (1990). Dewaele and MacIntyre (2016) defined FLE as
“a complex emotion, capturing interacting dimensions of challenge and perceived
ability that reflect the human drive for success in the face of difficult tasks … enjoyment
occurs when people not only meet their needs, but exceed them to accomplish
something new or even unexpected” (pp. 216–17). In other words, enjoyment goes
deeper than mere pleasure and it is less ephemeral. This definition situates FLE on a
valence dimension, ranging from mid-way (low to mild enjoyment) to the top, positive
end of the scale where FLE becomes an experience of flow. It is worth pointing out that
the authors did not add any information about the separate dimension of arousal/
activation. They assumed that FLE could emerge in medium-arousal activities such as
silent reading or writing, as well as in high-arousal situations such as debates in the
classroom or public speech.
The second emotion to be included in Dewaele and MacIntyre’s (2014) design was
foreign language classroom anxiety (FLCA) that was defined by Horwitz et al. (1986) as
“a distinct complex of self-perceptions, beliefs, feelings and behaviors related to
classroom learning arising from the uniqueness of the language learning process”
(p. 128). MacIntyre (2017) pointed out that FLCA is both an internal state and a social
construct. In other words, it combines learner-internal and learner-external elements
resulting in complex “internal psychological processes, cognition and emotional states
along with the demands of the situation and the presence of other people” (p. 28).
Horwitz (2017) argued that FLCA has characteristics of both traits and states: FLCA
does not exist at birth, but it can slowly emerge and strengthen among learners who tend
to be anxious in the FL class, and it can coalesce into statelike FLCA that rears its head every
time the FL has to be used. Reflecting on the causes of FLCA, Horwitz (2017) pointed to
the fact that FL learners can experience a profound feeling of discomfort because they
lack the proficiency to present themselves authentically, something they typically have
no problem doing in their first language. Indeed, “presenting yourself to the world
through an imperfectly controlled new language is inherently anxiety-provoking for
some people” (p. 44). High levels of FLCA can manifest in physical symptoms
(sweating, a quicker heart rate, a dry mouth) that can leave learners paralyzed and silent
and can disrupt concentration, which limits the absorption of new information (MacIntyre &
Gregersen, 2012).
The third classroom emotion to have attracted increasing attention recently is FL
boredom (FLB). In most cases, boredom is an unpleasant psychological and emo-
tional state that combines feelings of “dissatisfaction, disappointment, annoyance,

https://doi.org/10.1017/S0272263122000328 Published online by Cambridge University Press


464 Jean-Marc Dewaele et al.

inattention, lack of motivation to pursue previously set goals and impaired vitality”
(Kruk & Zawodniak, 2018, p. 177). Li et al. (2020) defined FLB as “a negative emotion
with extremely low degree of activation/arousal that arises from ongoing activities …
[that] are typically over-challenging or under-challenging” (p. 12).2
As emotions are hypothesized to interact in an academic setting, the current
research interest is not only to examine the three emotion variables of FLE, FLCA,
and FLB in isolation but also to examine the relationships between these variables. As
such, studies examining the relationships and interactions between emotion variables in
the FL class have become increasingly popular. In the study introducing FLE to the applied
linguistics lexicon, Dewaele and MacIntyre (2014) compared and contrasted FLE and
FLCA. They found a statistically significant negative correlation between the two
emotion variables, which has since been confirmed in a recent meta-analysis of
k = 47 studies (r = –.32; Botes et al., 2022). Li and Han (2022) was the first study
(published in Chinese) to include FLE, FLCA, and FLB in a single research design. The
authors found significant negative correlations between Chinese EFL students’ FLE and
FLCA, between FLE and FLB, and a significant positive correlation between FLCA and
FLB. Broadly similar patterns emerged in Li and Wei (2022) with a significant negative
correlation between FLE and FLB and a significant positive correlation between FLCA
and FLB. However, no correlation was found between FLE and FLCA. As such, we
include the following hypothesis in our study:
Hypothesis 1: There are statistically significant correlational effects between the emotion
variables of FLE, FLCA, and FLB.

FL emotions and teacher-related variables


Teacher behaviors as predictors of learner emotions in the classroom have been
extensively studied in general educational settings and mathematics learning, with
teacher behaviors such as enthusiasm, understandability, support, comprehensibility,
and pace linked positively to positive academic emotions such as joy and negatively
to negative emotions such as anxiety and boredom (Becker et al., 2014; Goetz et al.,
2013; Lei et al., 2018). In the specific domain of FL learning, teacher behavior has also
been found to predict FLE, FLCA, and FLB.
The first study to focus on the effect of teacher characteristics on FLE was Dewaele
et al. (2018). The authors found that the pupils from two secondary schools in London
with the highest level of FLE reported more positive attitudes toward the FL and the FL
teacher. More specifically, pupils liked teachers who used the FL frequently in class,
who were unpredictable, and who gave students the opportunity to speak up. A follow-
up study by Dewaele and Dewaele (2020) on a subsample of pupils in the same database
who had two different teachers for the same FL revealed that FLE was significantly
higher with the main teacher than with the second teacher and that scores for attitude
toward the teacher, teacher’s frequency of use of the FL in class, and unpredictability
were significantly higher for the former. The pattern was confirmed in Dewaele et al.
(2019) who found that teachers’ characteristics predicted twice the amount of variance
in FLE than in FLCA among Spanish EFL learners. The teacher’s friendliness and skill
emerged as the strongest predictors of FLE. A broadly similar pattern emerged among
Chinese undergraduate EFL learners in Jiang and Dewaele (2019). Attitudes toward the
teacher, teacher’s joking, and friendliness (but not teachers’ un/predictability) were
strong predictors of FLE. This was backed up by students’ narratives that mentioned the
teacher more frequently when discussing FLE compared to FLCA, confirming previous

Sources and effects: Foreign language enjoyment, anxiety, boredom 465

research on an international sample of FL learners (Dewaele & MacIntyre, 2019). Jiang
(2020) pursued this path using the focused essay technique with Chinese EFL students.
She found that FLE was positively related with teacher friendliness, patience, kindness,
happiness, and regular use of humor. Similarly, Elahi Shirvan et al. (2020) found that
the teacher was the typical cause of spikes in FLE among individual learners by giving
positive feedback, using humor and creating a pleasant and supportive classroom
atmosphere. Dewaele et al. (2022a), in a study of n = 360 FL learners of English, German,
French, and Spanish in a Kuwaiti university, investigated the changing effect of
teachers’ frequency of using the FL in class, predictability and frequency of joking on
FLE, FLCA and attitudes/motivation. Mixed-effects regression analyses on FLE
revealed significant main effects of all three teacher behaviors (R² = 26.2%). A post-hoc
analysis of the significant interaction effect of time and frequency of joking
revealed that students whose teacher joked infrequently reported the sharpest drop
in FLE over the semester. The unexpected finding of a positive relationship between
FLE and teacher predictability was attributed to the specific cultural-religious profile of
the learners.
In turn, FLCA has also been linked to a range of learner-external variables such as
the emotional atmosphere in the FL classroom (Effiong, 2016); the strictness, the
younger age, and the lower use of the FL by the teacher (Dewaele, Franco Magdalena
et al., 2019); the target language and its perception in the school community (De Smet
et al., 2018); and the modality of teaching (online or “in person”; Li & Dewaele, 2021;
Resnik & Dewaele, 2021; Resnik et al., 2022). It should be noted, however, that several
studies have compared and contrasted the effects of teacher behaviors on both FLCA
and FLE and found that teacher-related variables were more closely associated with FLE
than with FLCA (Dewaele & MacIntyre, 2019; Dewaele, Franco Magdalena et al., 2019;
Dewaele et al., 2018).
Similarly, teacher-related variables have been linked to FLB. Li (2021) found that
different control–value appraisals predicted FLB uniquely or interactively and that
different types of appraisals occurred simultaneously and interacted in predicting FLB.
Intrinsic value appraisal turned out to be a much stronger predictor of FLB than control
and extrinsic value appraisals. Learners who felt competent (high control) and valued
their English classes (reflecting higher engagement) tended to feel less bored. This
finding was confirmed by interviews with students. Analysis of the qualitative data
revealed a curvilinear relationship between control appraisal and FLB: “[E]xtremely
high and low control were both antecedents of boredom. In other words, students got
bored when they felt overwhelmingly challenged or underchallenged in English
learning” (p. 329). Analysis of the qualitative data also showed that intrinsic value
protected learners in boredom-inducing situations. Li (2021) concludes that to avoid
FLB in their classrooms, teachers should design tasks at the appropriate level of
difficulty to boost learners’ sense of confidence, competence, and control, while
emphasizing the intrinsic value of the activity. Creating a positive emotional classroom
atmosphere is also a prerequisite for more FLE, less FLCA, and an alleviation of FLB.
The importance of the teacher was further highlighted in Dewaele and Li (2021) where
mediation analysis showed that teacher enthusiasm strongly affected the learning
engagement of Chinese EFL learners. It had both a direct and an indirect effect on
both FLE and FLB, which were linked to teacher enthusiasm, positively for FLE, and
negatively for FLB. Moreover, FLE and FLB were found to mediate the effect of
participants’ perceived teacher enthusiasm on their own engagement (positive for
FLE and negative for FLB).


We therefore propose the following hypotheses to explore the relationship between the
FL emotions and teacher-related variables:
Hypothesis 2: Teacher FL use will have a direct effect on FLE, FLCA, and FLB.
Hypothesis 3: Teacher unpredictability will have a direct effect on FLE, FLCA, and FLB.

FL emotions and attitude/motivation


Classroom emotions have been found to affect a learner’s attitude and motivation to
study science (Sinatra et al., 2014), physical education (Simonton & Garn, 2019),
medicine and nursing studies (Artino et al., 2012; Lee et al., 2021), and mathematics
(Goldin, 2014). It is therefore not unexpected that emotions in the FL classroom have
been found to affect attitudinal and motivational variables. Previous research has found
positive relationships between FLE and attitude toward the FL (De Smet et al., 2018;
Dewaele and Dewaele, 2017; Dewaele, Özdemir et al., 2022; Jiang and Dewaele, 2019) as
well as motivation to learn (Lee & Lee, 2021; Zhang et al., 2020). In contrast, FLCA has
been negatively associated with the attitude toward the FL (Dewaele & Proietti Ergün,
2020) and a general motivation to learn the target language (Liu & Cheng, 2014; Liu &
Huang, 2011; Neisi & Yamini, 2009). Although research on motivation and FLB
has received less attention, most probably due to the relative recency of the variable,
some initial findings have reported a negative relationship between FLB and
motivation to learn the FL (Kruk, 2016, 2022; Pawlak et al., 2021). As such, we propose
the following hypothesis:
Hypothesis 4: FLE, FLCA, and FLB will have a direct effect on the attitude of the FL
learner toward the FL.

FL emotions and academic achievement


As the overarching goal of FL learning is undoubtedly the acquisition of the target
language, the question has to be asked whether FL emotions can affect the actual
learning of a language. Previous research has utilized proxy variables to capture FL
learning as an outcome variable, with academic achievement in the form of grades or
exam scores, or self-perceived levels of proficiency or competency being used in the
majority of studies (see Teimouri et al., 2019). All three emotion variables of FLE,
FLCA, and FLB have been linked directly, and indirectly, to real and perceived
achievement in the FL class.
Several studies reported significant positive relationships between FLE and both
actual and perceived FL proficiency measures (Botes et al., 2020a; Dewaele et al., 2018;
Dewaele & Proietti Ergün, 2020; Li, 2020; Li & Wei, 2022; Piechurska-Kuciel, 2017; Wei
et al., 2019; Zhang et al. 2020). A meta-analysis of studies that linked FLE and academic
achievement in the FL revealed moderate positive correlations between FLE and
academic achievement (r = .30, k = 28, N = 8,883), and between FLE and self-perceived
achievement (r = .27, k = 9, N = 4,810; Botes et al., 2022). In other words, learners with
higher levels of FLE are more likely to have higher levels of academic achievement as well
as a greater perception of their own abilities.
In turn, FLCA has been found to negatively affect learners’ performance and can be
“highly detrimental to the learning process” (MacIntyre, 2017, p. 150). Meta-analytic
studies have revealed comparable negative correlations between FLCA and a range of


measures of academic FL performance (r = –.39; k = 59; N = 12,585; Botes et al., 2020b).
Similarly to FLCA, FLB has been linked to lower levels of academic achievement, as well
as lower levels of perceived achievement (Li & Wei, 2022; Li et al., 2021; Shao et al.,
2020). Thus, learners with higher levels of FLCA and FLB are less likely to have high
levels of academic achievement.
Although studies examining the relationship between FLE, FLCA, and FLB, and
academic achievement individually are relatively popular, studies examining all three
emotions in a single design are less so. Li and Han (2022) were the first to examine the
relationship between FLE, FLCA, and FLB, and real and perceived achievement. FLE was
found to have independent positive predictive effects on actual English test scores and
perceived learning achievement, while FLB and FLCA had negative predictive effects.
However, a regression analysis revealed that FLCA was the only significant predictor for test
scores, whereas FLE and FLB predicted perceived achievement. In addition, Li and Wei
(2022) examined the effect over time of FLE, FLCA, and FLB on the EFL achievement of n
= 954 junior secondary learners in rural China. Structural equation modeling results
showed that the three emotions that learners experienced at Time 1 predicted their English
achievement at Time 2 but that FLE was the strongest and most enduring predictor
(in contrast to the result of Li and Han [2022]), with FLCA being a negative predictor at
Time 2 and Time 3, while the negative effect of FLB faded completely over time.
As such, contrasting results have emerged in studies in which all three emotion
variables are included in a single study design and are hypothesized to affect learning
outcomes—as opposed to the research consensus when these emotion variables are
hypothesized to affect learning in isolation. We therefore include the following research
questions and hypotheses:
Hypothesis 5: FLE, FLCA, and FLB will have a direct effect on the academic achievement
of the FL learner.
Hypothesis 6: The attitude toward the FL will have a direct effect on the academic
achievement of the FL learner.

Choice of the learner emotions


The decision to include FLE, FLCA, and FLB in the research design is based on theoretical
considerations (Dewaele & Li, 2021; Li, 2021; Li & Wei, 2022). The Control-Value Theory
(CVT) of achievement emotions (Pekrun, 2006) has provided applied linguists with a solid
theoretical basis for research into learners’ emotions. Pekrun and Stephens (2010) sug-
gested that achievement emotions can be organized along three dimensions: (1) object
focus (the activity vs. the outcome), (2) valence (positive vs. negative), and (3) activation
(deactivation vs. activation). FLE, FLCA, and FLB occupy unique positions on these three
dimensions. FLE and FLB are activity-related achievement emotions that arise from
ongoing activities but stand at opposite ends of the valence and activation dimensions:
FLE is a positive activating emotion, while FLB is a negative deactivating emotion. FLCA,
however, is an outcome-related achievement emotion evoked by past outcomes of FL
learning activities and is a negative, activating emotion (Pekrun & Perry, 2014). In other
words, the inclusion of these three emotions in the research design allows a unique three-
dimensional perspective on their causes and effects.
In summary, considering the literature review, we formulated the following research
questions:

RQ 1. What are the relationships between FLE, FLCA, and FLB?


RQ 2. What is the effect of teacher FL use on FLE, FLCA, and FLB?
RQ 3. What is the effect of teacher unpredictability on FLE, FLCA, and FLB?
RQ 4. What is the effect of FLE, FLCA, and FLB on attitude of the FL learner toward the
FL?
RQ 5. What is the effect of FLE, FLCA, and FLB on academic achievement of the FL
learner?
RQ 6. What is the effect of attitude toward the FL on academic achievement of the FL
learner?

We hypothesize that FLE, FLCA, and FLB are distinct dimensions that are nonetheless
linked with each other to varying degrees. We also hypothesize that FL learners’ attitude
toward the FL, teacher unpredictability, and frequency of FL use in the classroom are
linked to FLE, FLCA, and FLB. These three FL emotions are hypothesized, in turn, to
predict achievement in the FL directly and indirectly.

Method
Participants
Data were collected through snowball sampling, which is a form of nonprobability
sampling (Ness Evans & Rooney, 2013). An open-access anonymous online question-
naire was used. Calls for participation were sent through emails to colleagues, students,
and friends all over the world, asking them to forward the link to their own colleagues
and students. The call for participation was also put on social media platforms used by
FL teachers. The questionnaire was anonymous. The research design and questionnaire
obtained approval from the ethics committee in the first author’s research institution.
Participants’ consent was obtained at the start of the survey that was posted online using
Google Docs.
A total of 332 FL learners filled out the questionnaire completely. The average age of
the sample was 25.46 (SD = 12.08), with the majority learning an FL in a university course
(n = 235), followed by a secondary school class (n = 97). The sample contained 252 female
participants and 73 male participants. The gender imbalance in the data was not unex-
pected as it has often been observed in previous FL learning research (see Dewaele &
MacIntyre, 2014). The sample was highly diverse in terms of nationality, multilingualism,
and target FL. The majority of participants were British nationals (n = 95), followed by
Chinese (n = 47) and Italians (n = 27). The average number of languages in each
participant’s linguistic repertoire was 3.14 (SD = 1.25). English was the most reported
target FL (n = 116), followed by French (n = 85), and Spanish (n = 44).

Instruments
The following instruments were used to measure variables in this exploratory study:

Short-form foreign language classroom anxiety scale (ω = .884; α = .893)


An eight-item measure designed by MacIntyre (1992) that examines FL anxiety in the
FL classroom with items such as “I get nervous and confused when I am speaking in my
language class.” The measure is unidimensional with all items loading on a single FLCA
latent variable. Items were measured on a five-point Likert scale from “strongly
disagree” to “strongly agree.”

Short-form foreign language enjoyment scale (ω = .895; α = .875)
The nine-item scale was designed by Botes et al. (2021) and is a shortened version of the
original 21-item scale (Dewaele & MacIntyre, 2014). The underlying factor structure of
the scale contains one higher-order FLE factor and three lower-order factors, namely
personal enjoyment (three items, e.g., “I enjoy my FL class”), social enjoyment (three
items, e.g., “There is a good atmosphere in my FL classroom”), and teacher appreciation
(three items, e.g., “My FL teacher is encouraging”). Items were measured on a five-point
Likert scale from “strongly disagree” to “strongly agree.”

Foreign Language Classroom Boredom Scale (ω = .944; α = .943)


An eight-item scale aimed at examining FL boredom within an FL classroom setting,
with items such as “My mind begins to wander in FL class” and “It is difficult for me to
concentrate in the FL class” (Li et al., 2020). It should be noted that the eight-item
measure is originally a subscale of the greater 32-item Foreign Language Boredom Scale
(ibid.). However, as the emotion variables in this exploratory study are contextualized
as specific to the FL class, the decision was made to utilize only the FL classroom
boredom subscale and not the scale in its entirety. Items were measured on a five-point
Likert scale from “strongly disagree” to “strongly agree.”

FL attitude
Attitude toward the target FL was measured by a single item asking the participants:
“What is your attitude toward your FL?” Responses were measured on a five-point
Likert scale ranging from “very unfavorable” to “very favorable.”

FL academic achievement
Achievement was measured through a single item asking FL learners to provide the last
mark received for a test/exam in their FL class as a percentage.

Teacher FL use
The frequency of FL use by the teacher in the FL classroom was measured through a
single item asking participants: “How often does your teacher use the FL in class?” The
item was measured on a five-point Likert scale ranging from “hardly ever” to “all the
time.”

FL teacher predictability
The predictability of the FL teacher was measured through one single question: “How
predictable are your FL teacher’s classes?” Responses were measured on a five-point
Likert scale ranging from “very unpredictable” to “very predictable.”

Data analysis
Descriptive statistics and correlations between all variables were calculated in SPSS
25.0. An exploratory SEM was tested in JASP (version 0.11.1; JASP Team, 2020),
utilizing Lavaan (Rosseel, 2012). The exploratory SEM was developed based on the
prevailing literature and hypotheses (see Figure 1). SEM was chosen over multiple
regression or simple correlations because it allows for the latent modeling of variables
with measurement error taken into


Figure 1. Proposed structural equation model.

account (Ullman & Bentler, 2012). The three emotion variables could thus be explored
in terms of the greater nomological network of teacher and outcome variables
(Figure 1), with unbiased estimates providing a clearer picture of
the relationships between variables (ibid.). Furthermore, SEM provides indicators for
model fit and modification indices, which are invaluable in confirming the exploratory
research questions of this study.
It should be noted that as four variables were measured by only a single item
(FL attitude, FL achievement, teacher FL use, teacher predictability), these variables
had to be specified as latent in the model by constraining factor loadings and error
variances (Fuchs & Diamantopoulos, 2009). The use of single indicator variables has
often been avoided in past SEM research, as “the use of single-item measures in
academic research is often considered a fatal error in the review process” (Wanous
et al., 1997, p. 247). However, single-item measures can be a valid, psychometrically
sound alternative to multiitem measures (Fuchs & Diamantopoulos, 2009; Petrescu,
2013). Moreover, single-item measures can provide several benefits, such as
flexibility, brevity, providing a global score, and the possibility of measuring a greater
number of variables without overburdening the participant (Fuchs & Diamantopoulos,
2009). As the current study examines seven distinct variables, three of which were
measured through multiitem psychometrically validated measures (FLE, FLCA, and
FLB), the decision was made to measure the remaining four items through single-item
measures. We are fully cognizant of the possible drawbacks of single-item measures;
however, the decision allowed a greater number of variables to be included in the
exploratory study and thus widened the nomological network under investigation.
The model was tested utilizing maximum likelihood estimation with robust standard errors.
Model fit was primarily analyzed through the Root Mean Square Error of Approxi-
mation (RMSEA), the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and the
Standardized Root Mean Square Residual (SRMR). Cutoff recommendations set by
Kenny (2020) were used to determine fit, with the RMSEA and SRMR indicating close


fit if less than the rule-of-thumb value of .08. In turn, CFI and TLI indicate close fit
when greater than .90. In addition, the chi-square (χ2) and chi-square/degrees of
freedom ratio (χ2/df) were also considered in the estimation of fit. A nonsignificant
chi-square and a chi-square/degrees of freedom ratio of < 2.0 were used to indicate close
fit (Byrne, 1998).
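As an illustration of the cutoff logic described above, the two chi-square-based checks can be sketched in a few lines of Python. This is a simplified illustration, not the authors' analysis code (JASP/lavaan reports these indices directly); the RMSEA is computed from the standard point-estimate formula sqrt(max(χ² − df, 0) / (df(N − 1))).

```python
import math

def fit_checks(chi2: float, df: int, n: int) -> dict:
    """Rule-of-thumb fit statistics derived from a model chi-square.

    RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    Close fit: RMSEA < .08 (Kenny, 2020); chi2/df < 2.0 (Byrne, 1998).
    """
    rmsea = math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
    ratio = chi2 / df
    return {
        "rmsea": rmsea,
        "rmsea_close_fit": rmsea < 0.08,
        "chi2_df_ratio": ratio,
        "chi2_df_close_fit": ratio < 2.0,
    }

# Values reported in the Results section: chi2(217) = 549.99, N = 332
print(fit_checks(549.99, 217, 332))
```

Run on the values reported in the Results section, this yields an RMSEA of about .068 (below the .08 cutoff) but a χ²/df ratio above 2.0, reproducing the mixed verdict discussed there.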

Results
Descriptive statistics and correlation coefficients
Descriptive statistics for all variables included in the model can be found in Table 1. In
addition, a Pearson correlation coefficient matrix of all variables is displayed in Table 2.
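The Pearson coefficients reported in Table 2 were computed in SPSS; conceptually, each cell of such a matrix reduces to the standard formula r = cov(x, y) / (s_x s_y). A minimal pure-Python sketch, using a small hypothetical data set (not the study's data) in which enjoyment and boredom scores move in opposite directions:

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical mini-sample of FLE and FLB scores (illustration only)
fle = [4.2, 3.1, 4.8, 2.9, 3.9]
flb = [1.8, 3.0, 1.2, 3.4, 2.1]
print(round(pearson(fle, flb), 3))  # → -0.996
```

As in Table 2, higher enjoyment goes with lower boredom, so the toy coefficient is strongly negative; the significance stars in the table additionally require a test of r against zero, which SPSS supplies.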

Structural equation modeling


The model achieved close fit as indicated by the RMSEA (.068) and SRMR (.057), which
were both below the cutoff of .08 (Kenny, 2020). In addition, the CFI (.933) and TLI
(.921) were both above the minimum limit of .90 (ibid.), further indicating close fit. In
contrast, the chi-square was significant (χ2 (217) = 549.99; p < .001), with a χ2/df ratio
(2.52) of > 2.0, implying poor fit (Byrne, 1998). However, the chi-square has been found
to be particularly sensitive to sample size and large correlations between factors, which
may result in Type I errors (Kenny, 2020). As such, the bulk of the fit indices indicated close
fit, and the model was therefore accepted as an adequate representation of the data.
As Figure 2 demonstrates, a number of hypothesized paths in the model were
statistically nonsignificant (see Table 3 for an overview of the hypotheses). This is not an
unexpected result given the exploratory nature of this study. However, support was
found for the first hypothesis of the study as there were statistically significant

Table 1. Descriptive statistics


M SD Min Max Skewness Kurtosis

FLE 3.85 .64 1.33 5 –.75 1.00


FLCA 3.08 .91 1 5 .17 –.73
FLB 2.18 .91 1 5 .62 –.18
Attitude 4.29 .86 1 5 –1.41 2.46
Achievement 78.29 12.22 10 100 –.89 2.33
Teacher FL use 4.33 .74 2 5 –.83 .034
Teacher predictability 2.65 .78 1 5 .04 .70

Table 2. Correlation coefficient matrix


1. 2. 3. 4. 5. 6. 7

1. FLE – –.342*** –.633*** .286*** .118** .176*** .110**


2. FLCA – .386*** –.138** –.186*** –.057 .051
3. FLB – –.332*** –.170*** –.077 –.190***
4. Attitude FL – .190*** .217*** .000
5. Achievement FL – .135** –.027
6. Teacher FL use – –.034
7. Teacher predictability –

*** p < .001; **p < .01.


Figure 2. Structural equation modeling result.


***p < .001; **p < .01; *p < .05.

Table 3. Hypotheses overview

                                     β      p-value

Teacher FL use → FLE                .133    < .01
Teacher FL use → FLCA              –.074      .143
Teacher FL use → FLB               –.092      .085
Teacher unpredictability → FLE      .119    < .05
Teacher unpredictability → FLCA     .060      .346
Teacher unpredictability → FLB     –.199    < .01
FLE ↔ FLCA                         –.303    < .001
FLE ↔ FLB                          –.500    < .001
FLE → attitude FL                   .134    < .01
FLE → achievement FL                .038      .470
FLCA ↔ FLB                          .451    < .001
FLCA → FL attitude                 –.007      .912
FLCA → achievement FL              –.145    < .01
FLB → attitude FL                   .268    < .001
FLB → achievement FL               –.046      .530
FL attitude → achievement FL        .141    < .01

symmetrical effects between all emotion variables. FLE was found to be negatively
correlated with both FLCA (r = –.303; p < .001) and FLB (r = –.500; p < .001). In
addition, FLCA and FLB shared a statistically significant positive correlation (r = .451; p
< .001). The emotion variables of FLE, FLCA, and FLB are therefore interdependent in
the FL classroom.
The second hypothesis was partially substantiated in the structural equation model.
Teacher FL use had a significant positive effect on FLE (β = .133; p < .01). However, no
statistically significant effect was found between teacher FL use and the two negative
emotion variables of FLCA and FLB (p > .05). The use of the target FL in the FL


classroom therefore only seems to impact the positive emotion of enjoyment and has no
effect on either anxiety or boredom in the FL learning context.
The third hypothesis of teacher un/predictability influencing the three emotion
variables was also partially substantiated. Teacher unpredictability had a significant
positive effect on FLE (β = .119; p < .05) and a significant negative effect on FLB
(β = –.199; p < .01). No statistically significant effect was found between teacher
unpredictability and FLCA (p > .05). As such, the results of both the second and third hypotheses
seem to indicate that teacher-related variables may be more related to positive emotion
(e.g., FLE) than to negative emotion (e.g., FLCA).
Similarly, partial evidence was found to support the fourth hypothesis. The two
variables of FLE and FLB both predicted the attitude of the FL learners, with no
statistically significant relationship found between FLCA and FL attitude. FLE posi-
tively influenced the FL learner’s attitude toward their target FL (β = .134; p < .01), while
FLB negatively influenced the FL learner’s target FL attitude (β = –.268; p < .01).
In contrast to the findings of the fourth hypothesis, the only statistically significant
path found between the emotion variables and academic achievement was in the
impact of FLCA on academic achievement (β = –.145; p < .01). Neither FLE nor FLB
had a statistically significant effect on academic achievement in the FL. The fifth
hypothesis was therefore only partially supported. Lastly, the results indicated
support for the sixth hypothesis that FL attitude positively influences academic
achievement (β = .141; p < .01).
Overall, the proposed model was found to closely fit the data and support was found
for the majority of hypothesized relationships. The model indicated a complex nomo-
logical network, with the three emotions of FLE, FLCA, and FLB, each having unique
associations with the teacher-related variables and the outcome variables of FL attitude
and academic achievement.

Discussion
The aim of the present article was to cast a wide net linking three FL learner emotions
with a number of learner-internal and learner-external variables to explore how they
are connected to each other. In doing so, we wanted to present a wide panoramic view
of the range of activating and deactivating positive and negative FL learner emotions,
their sources, and their effects. SEM allowed us to uncover the multiple relations
between the variables in our dataset. Our first hypothesis was confirmed as we did find
significant correlations between the three emotions: FL learners who enjoyed them-
selves were less anxious and less bored than those who reported lower levels of FLE.
Also, more anxious FL learners suffered more from boredom, which could be linked to
a lack of control and lower engagement causing these negative emotions (Li, 2021).
This finding confirms and extends the findings about the moderate negative rela-
tionship between FLE and FLCA first reported in Dewaele & MacIntyre (2014) and
confirmed in subsequent studies on different FL populations (Jiang & Dewaele, 2019; Li &
Han, 2022)—although not in Li and Wei (2022). It could be interpreted in two ways:
One could either argue that high FLE helps learners neutralize the negative effects of
FLCA or, alternatively, that high FLCA chips away at the FLE, possibly because of the
distracting and tiring effects of anxiety. The negative relationship between FLE and
FLB makes perfect sense: A bored student is by definition not very engaged in
classroom activities and curses how slowly the clock ticks (Li, 2021; Li & Dewaele,
2021; Li & Wei, 2022; Li et al. 2021).


The second research hypothesis focused on the role of teacher’s frequency of use of
the FL in shaping learners’ emotions. We hypothesized that frequent FL use would
boost FLE and reduce FLCA and FLB. While the first part of that hypothesis was
confirmed, namely frequent use of the FL by the teacher had a significant positive effect
on FLE, confirming previous research (Dewaele & Dewaele, 2017; Dewaele et al., 2018;
Dewaele, Franco Magdalena et al., 2019; Jiang & Dewaele, 2019), the second and third parts of the hypothesis were rejected, as no statistically significant link was found between the frequency of teacher FL use and either FLCA or FLB. In other words, learners’
FLCA and FLB were not neutralized by increased use of the FL by the teacher. The third
hypothesis focused on the effect of teacher unpredictability on FLE, FLCA, and FLB.
Teacher unpredictability was found to affect two out of the three learner emotions,
which partly confirmed our hypothesis. FL learners reported significantly higher levels
of FLE and lower levels of FLB with teachers who did not stick to the same routines in
their class and varied in the way they taught the class. This finding confirms earlier
studies on Western learners (Dewaele & Dewaele, 2017; Dewaele et al., 2018; Dewaele, Franco Magdalena et al., 2019), partly contradicts studies on Chinese learners where little effect was found (Jiang & Dewaele, 2019; Li et al., 2018), and completely contradicts
the finding of the study on Arabic learners where the opposite pattern emerged, namely
more predictability of the teacher being linked to higher enjoyment (Dewaele et al.,
2022a). Teacher unpredictability had no effect on FLCA, which confirms previous
research (Dewaele & MacIntyre, 2019; Dewaele et al., 2018; Dewaele et al., 2022a).
The relationship between attitude of the FL learner toward the FL, FLE, FLCA, and
FLB constituted the fourth hypothesis. The hypothesis that classroom emotions would
predict attitudes toward the FL was partly confirmed. Students who enjoyed themselves
and were not bored in the FL class had a much more positive attitude toward the FL. FLCA
had no effect, which again confirms previous research on FLE and FLCA (Dewaele &
Dewaele, 2017; Dewaele et al., 2019; Li & Dewaele, 2021), but it contradicts the finding
that attitude toward the FL was negatively linked to FLCA in Dewaele et al. (2018) and
Jiang and Dewaele (2019). As Botes et al. (2020a) emphasized, we cannot exclude the
possibility that the causal relationship between emotions and attitudes could also go the
other way: Learners with a more positive attitude toward the FL are more likely to feel
happy and excited in the FL class, which may, in turn, strengthen their positive attitude
toward the language and the culture.
The fifth and sixth hypotheses delved into the relationships between FLE, FLCA,
FLB, attitude toward the FL and achievement. Partial support emerged for the fifth
hypothesis as it turned out that only FLCA had a (negative) effect on academic
achievement. This effect is well documented (Botes et al., 2020b; Dewaele & Proietti Ergün, 2020; Dewaele & Li, 2022; Li & Han, 2022; Li & Wei, 2022); however, most
studies also reported a (typically slightly weaker) positive effect of FLE on achievement
(Botes et al., 2020a, 2022; Dewaele et al., 2018; Li, 2020; Piechurska-Kuciel, 2017; Wei
et al., 2019) except in Dewaele and Alfawzan (2018) and in Li and Wei (2022) where
FLE was “the strongest and most enduring predictor” of FL achievement (p. 1) while the
negative effect of FLB weakened over time. The finding in Li and Han (2022) that FLE
and FLB predicted perceived achievement rather than actual achievement suggests a
partial mismatch between learners’ perception of their own performance and
their actual performance. The sixth hypothesis about a positive link between attitude
toward the FL and FL achievement was confirmed, echoing previous research (Dewaele
& Proietti Ergün, 2020), where the effect was found to be stronger for the weaker FL
(Italian) than for the stronger FL (English). This confirms that increased interest in the FL
and in the culture can boost learners’ effort to perform well (Dewaele et al., 2018, 2019).

The present study is not without limitations. Firstly, the sample included a wide
variety of participants from across the world. Context-specific effects, such as the impact of culture, age, and target FL, could therefore not be demonstrated. Secondly, several
variables were measured utilizing a single item. Although the use of single-item vari-
ables in SEM is an accepted practice (Fuchs & Diamantopoulos, 2009), a model with
multiple indicators does provide greater psychometric confidence. We do believe that
the inclusion of all variables did provide a more panoramic view of complex interac-
tions between multiple variables that have not all been brought together in previous
research designs. Lastly, we acknowledge that other independent variables that were not
included in the current study for reasons of economy may also contribute to the
relationships we observed. It is very likely that frequency of teacher joking, teacher
enthusiasm, and classroom environment triggered a process of positive emotional
contagion which boosted learners’ positive emotions and engagement, and subsequent
achievement (cf. Dewaele & Li, 2021; Dewaele et al., 2022a, 2022b; Li & Wei, 2022). One
way to find out what these variables are would be through an alternative emic,
qualitative approach. Semistructured interviews with FL learners could throw light
on what goes on in their heads when their teacher does something unpredictable in class or asks
them something in the FL that they do not quite understand, what they feel about the FL
and the culture, or how they manage to impose order in their mind and focus when
doing FL tests or exams. An alternative approach is the use of diary studies to
understand how and why learners’ emotions fluctuate. Additionally, writing about their emotions may encourage learners to reflect on those emotions, which could enable learners to regulate them more effectively (Zawodniak et al., 2021). Finally, it is very likely that the
shape and texture of FLE, FLCA, and FLB were affected by co-occurring feelings and
emotions such as hope, optimism, shame, and guilt (Dewaele & Pavelescu, 2021).
It is not self-evident whether specific pedagogical implications can be drawn from
this exploratory work. The network of relationships identified in the present study
cannot be translated into specific actions teachers can take to boost students’ FL
achievement. Only intervention studies can test the effect of specific teaching strategies,
which would be a great opportunity for further research. Yet, a number of rather broad
recommendations can be drawn from the findings. A degree of unpredictability in the
FL class, combined with abundant use of the FL, seems to boost FLE and lower FLB
without affecting FLCA. Awakening students’ interest in the FL and culture and thus
shaping their attitude toward the FL can have both direct and indirect beneficial effects
on their achievement. Attempts to alleviate FLCA in testing and examinations may
have a positive effect on results. In short, creating a rich, exciting, positive emotional
classroom atmosphere will shape FL learner emotions in ways that allow learners to
grow and thrive (Dewaele & MacIntyre, 2016; Elahi Shirvan et al., 2021; Li, 2021).

Conclusion
We started the introduction with a reference to J. K. Rowling’s fictional character Ron
Weasley who could not quite believe that a person could experience many different
emotions simultaneously without their head exploding. We linked this to the field of FL learner emotions, where for decades the emotional range under investigation was scarcely larger than a teaspoon because researchers focused mainly on FLCA, and where the range broadened only with the appearance of other emotions, such as boredom, shame,
curiosity, pride, or hope. By including FLB in the current research design, a deactivating
emotion—in addition to FLE and FLCA, which are positive and negative activating
emotions—we extended the range of emotions in terms of valence and activation. With the further inclusion of a number of learner-internal and learner-external variables, the
present study stretches the range by exploring relationships between three FL learner
emotions, their sources, and their effects. The findings suggest that FLE, FLCA, and FLB
are interconnected and are differentially shaped by a learner-internal variable (attitude
toward the FL) and by teacher behaviors such as frequency of use of the FL in class and unpredictability; these emotions, in turn, shape learners’ achievement.
To conclude, researchers, like the characters in the Harry Potter books, need to
stretch the epistemological and methodological boundaries to gain knowledge and
wisdom. It does not require magic, although FL learners in a state of flow may think
their teacher possesses magical powers.
Funding information. Supported by the Luxembourg National Research Fund (FNR) (PRIDE/15/10921377).

Notes
1 Learner-internal variables refer to characteristics of the learner that are independent of the context, such as age, gender, and personality. Learner-external variables are contextual variables such as opinions or
behaviors of teachers, parents, and peers.
2 Some bored students may, however, also experience higher arousal levels, as indicated in the dimensional
model (Pekrun et al., 2010).

References
Artino Jr., A. R., Holmboe, E. S., & Durning, S. J. (2012). Can achievement emotions be used to better
understand motivation, learning, and performance in medical education? Medical Teacher, 34, 240–244.
Becker, E. S., Goetz, T., Morger, V., & Ranellucci, J. (2014). The importance of teachers’ emotions and
instructional behavior for their students’ emotions: An experience sampling analysis. Teaching and
Teacher Education, 43, 15–26.
Botes, E., Dewaele, J.-M., & Greiff, S. (2020a). The power to improve: Effects of multilingualism and perceived
proficiency on enjoyment and anxiety in foreign language learning. European Journal of Applied Linguis-
tics, 8, 1–28.
Botes, E., Dewaele, J.-M., & Greiff, S. (2020b). The Foreign Language Classroom Anxiety Scale and academic
achievement: An overview of the prevailing literature and a meta-analysis. The Journal for the Psychology
of Language Learning, 2, 26–56.
Botes, E., Dewaele, J.-M., & Greiff, S. (2021). The development and validation of the Short-form Foreign
Language Enjoyment Scale (S-FLES). The Modern Language Journal, 105, 858–876.
Botes, E., Dewaele, J.-M., & Greiff, S. (2022). Taking stock: A meta-analysis of the effects of Foreign Language
Enjoyment. Studies in Second Language Learning and Teaching 12(2), 205–232.
Byrne, B. M. (1998). Structural equation modeling with LISREL, PRELIS, and SIMPLIS: Basic concepts,
applications, and programming. Erlbaum.
Csíkszentmihályi, M. (1990). Flow: The psychology of optimal experience. Harper Collins.
De Smet, A., Mettewie, L., Galand, B., Hiligsmann, P., & Van Mensel, L. (2018). Classroom anxiety and
enjoyment in CLIL and non-CLIL: Does the target language matter? Studies in Second Language Learning
and Teaching, 8, 47–71.
Dewaele, J.-M., & Dewaele, L. (2017). The dynamic interactions in Foreign Language Classroom Anxiety and
Foreign Language Enjoyment of pupils aged 12 to 18: A pseudo-longitudinal investigation. Journal of the
European Second Language Association, 1, 11–22.
Dewaele, J.-M., Chen, X., Padilla, A. M., & Lake, J. (2019). The flowering of positive psychology in foreign
language teaching and acquisition research. Frontiers in Psychology. Language Sciences, 10, 2128.
Dewaele, J.-M., & Dewaele, L. (2020). Are foreign language learners’ enjoyment and anxiety specific to the
teacher? An investigation into the dynamics of learners’ classroom emotions. Studies in Second Language
Learning and Teaching, 10, 45–65.

Dewaele, J.-M., Franco Magdalena, A., & Saito, K. (2019). The effect of perception of teacher characteristics
on Spanish EFL Learners’ anxiety and enjoyment. The Modern Language Journal, 103, 412–427.
Dewaele, J.-M., & Li, C. (2021). Teacher enthusiasm and students’ social-behavioral learning engagement:
The mediating role of student enjoyment and boredom in Chinese EFL classes. Language Teaching
Research, 25(6), 922–945.
Dewaele, J.-M., & Li, C. (2022). Foreign language enjoyment and anxiety: Associations with general and
domain-specific English achievement. Chinese Journal of Applied Linguistics, 45, 32–48.
Dewaele, J.-M., & MacIntyre, P. D. (2014). The two faces of Janus? Anxiety and enjoyment in the foreign
language classroom. Studies in Second Language Learning Teaching, 4, 237–274.
Dewaele, J.-M., & MacIntyre, P.D. (2016). Foreign language enjoyment and foreign language classroom
anxiety: The right and left feet of FL learning? In P. D. MacIntyre, T. Gregersen, & S. Mercer (Eds.), Positive
psychology in SLA (pp. 215–236). Multilingual Matters.
Dewaele, J.-M., & MacIntyre, P.D. (2019). The predictive power of multicultural personality traits, learner
and teacher variables on foreign language enjoyment and anxiety. In M. Sato & S. Loewen (Eds.), Evidence-
based second language pedagogy: A collection of Instructed Second Language Acquisition studies
(pp. 263–286). Routledge.
Dewaele, J.-M., Özdemir, C., Karci, D., Uysal, S., Özdemir, E. D., & Balta, N. (2022). How distinctive is the
foreign language enjoyment and foreign language classroom anxiety of Kazakh learners of Turkish?
Applied Linguistics Review, 13, 243–265.
Dewaele, J.-M., & Pavelescu, L. (2021). The relationship between incommensurable emotions and willingness
to communicate in English as a foreign language: A multiple case study. Innovation in Language Learning
and Teaching, 15, 66–80.
Dewaele, J.-M., & Proietti Ergün, A. L. (2020). How different are the relations between enjoyment, anxiety,
attitudes/motivation and course marks in pupils’ Italian and English as foreign languages? Journal of the
European Second Language Association, 4, 45–57.
Dewaele, J.-M., Saito, K., & Halimi, F. (2022a). How teacher behaviour shapes Foreign Language learners’
enjoyment, anxiety and motivation: A mixed modelling longitudinal investigation. Language Teaching
Research. Advance online publication. https://doi.org/10.1177/13621688221089601
Dewaele, J.-M., Saito, K., & Halimi, F. (2022b). How foreign language enjoyment acts as a buoy for sagging
motivation: A longitudinal investigation. Applied Linguistics. Advance online publication. https://doi.org/
10.1093/applin/amac033
Dewaele, J.-M., Witney, J., Saito, K., & Dewaele, L. (2018). Foreign language enjoyment and anxiety: The
effect of teacher and learner variables. Language Teaching Research, 22, 676–697.
Effiong, O. (2016). Getting them speaking: Classroom social factors and foreign language anxiety. TESOL
Journal, 7, 132–161.
Elahi Shirvan, M., Taherian T., & Yazdanmehr, E. (2020). The dynamics of foreign language enjoyment: An
ecological momentary assessment. Frontiers in Psychology, 11, 1391.
Elahi Shirvan, M., Taherian T., & Yazdanmehr, E. (2021). Foreign language enjoyment: A longitudinal
confirmatory factor analysis–curve of factors model. Journal of Multilingual and Multicultural Develop-
ment. Advance online publication. https://doi.org/10.1080/01434632.2021.1874392
Fuchs, C., & Diamantopoulos, A. (2009). Using single-item measures for construct measurement in
management research: Conceptual issues and application guidelines. Die Betriebswirtschaft, 69, 195.
Goetz, T., Lüdtke, O., Nett, U. E., Keller, M. M., & Lipnevich, A. A. (2013). Characteristics of teaching and
students’ emotions in the classroom: Investigating differences across domains. Contemporary Educational
Psychology, 38, 383–394.
Goldin, G. A. (2014). Perspectives on emotion in mathematical engagement, learning, and problem solving.
In R. Pekrun & L. Linnenbrink-Garcia (Eds.), International Handbook of Emotions in Education
(pp. 391–414). Routledge.
Horwitz, E. K. (2017). On the misreading of Horwitz, Horwitz and Cope (1986) and the need to balance
anxiety research and the experiences of anxious language learners. In C. Gkonou, M. Daubney, & J.-M.
Dewaele (Eds.), New insights into language anxiety: Theory, research and educational implications
(pp. 31–50). Multilingual Matters.
Horwitz, E., Horwitz, M., & Cope, J. (1986). Foreign language classroom anxiety. The Modern Language
Journal, 70, 125–132.
JASP Team (2020). JASP (Version 0.13.1) [Computer software].

Jiang, Y. (2020). An investigation of the effect of teacher on Chinese university students’ foreign language
enjoyment. Foreign Language World, 196, 60–68.
Jiang, Y., & Dewaele, J.-M. (2019). How unique is the foreign language classroom enjoyment and anxiety of
Chinese EFL learners? System, 82, 13–25.
Kenny, D. A. (2020). Measuring Model Fit. http://davidakenny.net/cm/fit.htm
Kruk, M. (2016). Variations in motivation, anxiety and boredom in learning English in second life. The
EuroCALL Review, 24, 25–39.
Kruk, M. (2022). Dynamicity of perceived willingness to communicate, motivation, boredom and anxiety in
second life: The case of two advanced learners of English. Computer Assisted Language Learning, 35, 190–
216.
Kruk, M., & Zawodniak, J. (2018). Boredom in practical English language classes: Insights from interview
data. In L. Szymański, J. Zawodniak, A. Łobodziec, & M. Smoluk (Eds.), Interdisciplinary views on the
English language, literature and culture (pp. 177–191). Uniwersytet Zielonogórski.
Lee, J. S., & Lee, K. (2021). The role of informal digital learning of English and L2 motivational self system in
foreign language enjoyment. British Journal of Educational Technology, 52, 358–373.
Lee, M., Na, H. M., Kim, B., Kim, S. Y., Park, J., & Choi, J. Y. (2021). Mediating effects of achievement
emotions between peer support and learning satisfaction in graduate nursing students. Nurse Education in
Practice, 52, 103003.
Lei, H., Cui, Y., & Chiu, M. M. (2018). The relationship between teacher support and students’ academic
emotions: A meta-analysis. Frontiers in Psychology, 8, 2288.
Li, C. (2020). A Positive Psychology perspective on Chinese EFL students’ trait emotional intelligence, foreign
language enjoyment and EFL learning achievement. Journal of Multilingual and Multicultural Develop-
ment, 41, 246–263.
Li, C. (2021). A control–value theory approach to boredom in English classes among university students in
China. The Modern Language Journal, 105, 317–334.
Li, C., & Dewaele, J.-M. (2021). How do classroom environment and general grit predict foreign language
classroom anxiety of Chinese EFL students? The Journal for the Psychology of Language Learning 3(2),
22–34.
Li, C., Dewaele, J.-M., & Hu, Y. (2020). Foreign language learning boredom: Conceptualization and
measurement. Applied Linguistics Review. Advance online publication. https://doi.org/10.1515/applirev-
2020-0124
Li, C., & Han, Y. (2022). The predictive effects of foreign language anxiety, enjoyment, and boredom on
learning outcomes in online English classrooms. Modern Foreign Languages《现代外语》, 45, 207–219.
Li, C., Huang, J., & Li, B. (2021). The predictive effects of classroom environment and trait emotional
intelligence on foreign language enjoyment and anxiety. System, 96, 102393.
Li, C., & Wei, L. (2022). Anxiety, enjoyment, and boredom in language learning amongst junior secondary
students in rural China: How do they contribute to L2 achievement? Studies in Second Language
Acquisition. Advance online publication. https://doi.org/10.1017/S0272263122000031
Liu, H. J., & Cheng, S. H. (2014). Assessing language anxiety in EFL students with varying degrees of
motivation. Electronic Journal of Foreign Language Teaching, 11, 285–299.
Liu, M., & Huang, W. (2011). An exploration of foreign language anxiety and English learning motivation.
Education Research International, 2011, 1–8.
MacIntyre, P. D. (1992). Anxiety and language learning from a stages of processing perspective (Unpublished
doctoral dissertation). The University of Western Ontario.
MacIntyre, P. D. (2017). An overview of language anxiety research and trends in its development. In C.
Gkonou, M. Daubney, & J.-M. Dewaele (Eds.), New insights into language anxiety: Theory, research and
educational implications (pp. 11–30). Multilingual Matters.
MacIntyre, P. D., Dörnyei, Z., & Henry, A. (2015). Conclusion: Hot enough to be cool: The promise of
dynamic systems research. In Z. Dörnyei, P. D. MacIntyre, & A. Henry (Eds.) Motivational Dynamics in
Language Learning (pp. 419–429). Multilingual Matters.
MacIntyre, P. D., & Gregersen, T. (2012). Emotions that facilitate language learning: The positive broadening
power of the imagination. Studies in Second Language Learning and Teaching, 2, 193–213.
Neisi, S., & Yamini, M. (2009). Relationship between self-esteem, achievement motivation, FLCA, and EFL
learners’ academic performance. Journal of Psychological Achievements, 16, 153–166.
Ness Evans, A., & Rooney, B. J. (2013). Methods in psychological research (3rd ed.). SAGE Publications.

Pawlak, M., Zawodniak, J., & Kruk, M. (2020). Boredom in the foreign language classroom: A micro-
perspective. Springer.
Pawlak, M., Zawodniak, J., & Kruk, M. (2021). Individual trajectories of boredom in learning English as a
foreign language at the university level: Insights from three students’ self-reported experience. Innovation
in Language Learning and Teaching, 15, 263–278.
Pekrun, R. (2006). The control–value theory of achievement emotions: Assumptions, corollaries, and im-
plications for educational research and practice. Educational Psychology Review, 18, 315–341.
Pekrun, R., Goetz, T., Daniels, L. M., Stupnisky, R. H., & Perry, R. P. (2010). Boredom in achievement settings:
Exploring control–value antecedents and performance outcomes of a neglected emotion. Journal of
Educational Psychology, 102, 531–549.
Pekrun, R., & Perry, R. P. (2014). Control–value theory of achievement emotions. In R. Pekrun & L.
Linnenbrink–Garcia (Eds.), International handbook of emotions in education (pp. 130–151). Routledge.
Pekrun, R., & Stephens, E. J. (2010). Achievement emotions: A control-value approach. Social and Personality
Psychology Compass, 4, 238–255.
Petrescu, M. (2013). Marketing research using single-item indicators in structural equation models. Journal
of Marketing Analytics, 1, 99–117.
Piechurska-Kuciel, E. (2017). L2 or L3? Foreign language enjoyment and proficiency. In D. Gabryś-Barker, D.
Gałajda, A. Wojtaszek, & P. Zakrajewski (Eds.), Multiculturalism, multilingualism and the self
(pp. 97–111). Springer.
Resnik, P., & Dewaele, J.-M. (2021). Learner emotions, autonomy and trait emotional intelligence in “in-
person” versus emergency remote English Foreign Language teaching in Europe. Applied Linguistics
Review. Advance online publication. https://doi.org/10.1515/applirev-2020-0096
Resnik, P., Dewaele, J.-M., & Knechtelsdorfer, E. (2022). Differences in foreign language anxiety in regular
and online EFL classes during the pandemic: A mixed-methods study. TESOL Quarterly. Advance online
publication. https://doi.org/10.1002/tesq.3177
Rosseel, Y. (2012). Lavaan: An R package for structural equation modeling and more. Version 0.5–12
(BETA). Journal of Statistical Software, 48, 1–36.
Shao, K., Pekrun, R., Marsh, H. W., & Loderer, K. (2020). Control-value appraisals, achievement emotions,
and foreign language performance: A latent interaction analysis. Learning and Instruction, 69, 101356.
Simonton, K. L., & Garn, A. (2019). Exploring achievement emotions in physical education: The potential for
the control-value theory of achievement emotions. Quest, 71, 434–446.
Sinatra, G. M., Broughton, S. H., & Lombardi, D. (2014). Emotions in science education. In R. Pekrun & L.
Linnenbrink-Garcia (Eds.), International handbook of emotions in education (pp. 415–436). Routledge.
Teimouri, Y., Goetze, J., & Plonsky, L. (2019). Second language anxiety and achievement: A meta-analysis.
Studies in Second Language Acquisition, 41, 363–387.
Ullman, J. B., & Bentler, P. M. (2012). Structural equation modeling. In Handbook of psychology (2nd ed.). Wiley.
Wang, Y., Derakhshan, A., & Zhang, L. J. (2021). Researching and practicing positive psychology in second/
foreign language learning and teaching: The past, current status and future directions. Frontiers in
Psychology, 12, 731721.
Wanous, J. P., Reichers, A. E., & Hudy, M. J. (1997). Overall job satisfaction: How good are single-item
measures? Journal of Applied Psychology, 82, 247–252.
Wei, H., Gao, K., & Wang, W. (2019). Understanding the relationship between grit and foreign language
performance among middle school students: The roles of foreign language enjoyment and classroom
environment. Frontiers in Psychology, 10, 1508.
Zawodniak, J., Kruk, M., & Pawlak, M. (2021). Boredom as an aversive emotion experienced by English
majors. RELC Journal. Advance online publication. https://doi.org/10.1177/0033688220973732
Zhang, H., Dai, Y., & Wang, Y. (2020). Motivation and second foreign language proficiency: The mediating
role of foreign language enjoyment. Sustainability, 12, 1302.

Cite this article: Dewaele, J.-M., Botes, E. and Greiff, S. (2023). Sources and effects of foreign language
enjoyment, anxiety, and boredom: A structural equation modeling approach. Studies in Second Language
Acquisition, 45, 461–479. https://doi.org/10.1017/S0272263122000328



Studies in Second Language Acquisition (2023), 45, 480–502
doi:10.1017/S0272263122000341

RESEARCH ARTICLE

Second language productive knowledge of collocations: Does knowledge of individual words matter?
Suhad Sonbul1*, Dina Abdel Salam El-Dakhs2 and Ahmed Masrai3
1Umm Al-Qura University, Makkah, Saudi Arabia; 2Prince Sultan University, Riyadh, Saudi Arabia; 3Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
*Corresponding author. E-mail: sssonbul@uqu.edu.sa

(Received 28 September 2021; Revised 12 July 2022; Accepted 05 August 2022)

Abstract
Recent studies suggest that developing L2 receptive knowledge of single words is associated
with increased receptive knowledge of collocations. However, no study to date has directly
examined the interrelationship between productive word knowledge and productive collo-
cation knowledge. To address this gap, the present study administered a controlled produc-
tive word test and a controlled productive collocation test to 27 native English speakers and
55 nonnative speakers (L1-Arabic). The tests assessed word and collocation knowledge of
the most frequent 3,000 lemmas in English (1K, 2K, and 3K frequency bands). The test scores
were analyzed using three mixed-effects models for the following outcome variables:
collocation appropriacy, collocation frequency, and collocation strength. Results revealed
productive word knowledge as a significant predictor of productive collocation knowledge,
though with a small effect. This association held regardless of frequency band.
We discuss implications of these findings for L2 learning and teaching.

Introduction
It is widely acknowledged that lexical knowledge contributes to second language
(L2) learners’ overall language proficiency (e.g., Crossley et al., 2011; Zareva et al.,
2005), and enhances learners’ mastery of language skills (e.g., Milton, 2013; Miralpeix &
Muñoz, 2018; Stæhr, 2008). Because lexical knowledge is viewed as a multifaceted
construct (Henriksen, 1999) that involves the acquisition of multiple word knowledge
components (Nation, 2013), several researchers have attempted to examine these
various components including knowledge of single words and knowledge of colloca-
tions at the receptive and productive levels. Generally, the literature on lexical knowl-
edge includes more measures of receptive word (e.g., Nation & Beglar, 2007; Schmitt
et al., 2001; Webb et al., 2017) and collocation (e.g., Gyllstad, 2009) knowledge than of
productive word (e.g., Laufer & Nation, 1999) and collocation (e.g., Frankenberg-
Garcia, 2018) knowledge. Additionally, there is evidence that receptive knowledge of
collocations develops in relation to receptive knowledge of single-word items (e.g.,

© The Author(s), 2022. Published by Cambridge University Press.

https://doi.org/10.1017/S0272263122000341 Published online by Cambridge University Press


Second language productive knowledge of collocations 481

Nguyen & Webb, 2017). However, little is known about the interrelationship between
productive word and collocation knowledge, which is considered a higher-level
aspect (e.g., Bahns & Eldaw, 1993; González Fernández & Schmitt, 2020; Webb &
Kagimoto, 2009).
The present study aims to fill the gap concerning the interrelationship between the higher-level productive knowledge of single words and collocations by using newly
developed lemma-based measures of words and collocations at the first three 1,000
frequency levels of English. The tests are administered to native speakers of English as
well as nonnative speakers in an English-as-a-Foreign-Language (EFL) context. To
situate the present study in the literature, the next sections will survey research on the
productive measures of single words, definition of collocations, and measures/deter-
minants of productive collocation knowledge.

Background
Measuring productive knowledge of single words
Despite the availability of several measures of receptive word knowledge, such as the
Vocabulary Levels Test or VLT (Nation, 1990; Schmitt et al., 2001; Webb et al., 2017)
and the Vocabulary Size Test or VST (Nation & Beglar, 2007), only limited measures
are available to assess productive word knowledge. One such measure is the lexical
translation task (Webb, 2008). In this test, L2 speakers are given L1 meanings and asked
to provide their equivalent L2 forms. Webb reported in his study that such an L1-L2
translation test can elicit varied responses for target items, which means less control
over intended L2 form. Although it is possible to restrict responses by providing the first
letter(s) of the target word, a productive translation test may not reflect production
during actual language use.
Another test that is intended to measure productive vocabulary knowledge is Lex30
(Meara & Fitzpatrick, 2000). This is a word-association test, where test-takers are
required to produce a number of responses to stimulus words. While this test was found
to indicate breadth of productive vocabulary, it appears to behave differently when used
with learners of different proficiency levels (Walters, 2012). Walters further argues that
Lex30 scores are difficult to interpret.
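As a rough illustration of how a Lex30-style breadth score can be derived, the sketch below awards one point for each response that falls outside a high-frequency baseline list. The baseline set and sample data are illustrative assumptions, not the published Lex30 frequency list or scoring procedure:

```python
def lex30_style_score(responses, high_freq_words):
    """Count responses that fall outside a high-frequency baseline list.

    responses: dict mapping each stimulus word to a list of responses.
    high_freq_words: set of baseline high-frequency words (an assumed
    stand-in for the frequency list the published test uses).
    Returns one point per response not found in the baseline.
    """
    score = 0
    for stimulus, words in responses.items():
        for word in words:
            if word.lower() not in high_freq_words:
                score += 1
    return score

# Illustrative data only
baseline = {"dog", "cat", "house", "eat", "drink"}
responses = {
    "animal": ["dog", "mammal", "vertebrate"],
    "food": ["eat", "cuisine"],
}
print(lex30_style_score(responses, baseline))  # 3 infrequent responses
```

A scheme of this kind makes Walters’s interpretability concern concrete: the score depends entirely on which baseline list is chosen, so the same responses can yield different scores under different frequency lists.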
Furthermore, the Computer Adaptive Test of Size and Strength (CATSS; Aviad-Levitzky et al., 2019) was developed to measure vocabulary knowledge in four modalities: receptive recall, productive recall, receptive recognition, and productive recognition.
The test targets word knowledge across 14 frequency bands (1K–14K). Productive
recall, which is relevant to the present study, was measured through recalling a word
form (e.g., She is a l_____ girl. (small)). As the test measures word knowledge from
14 frequency bands (including a range of low-frequency items), it may go far above the
level of our target participants, who are EFL learners with varied proficiency levels.
A slightly different controlled productive word knowledge test is the Productive
Vocabulary Levels Test (PVLT) which was developed by Laufer and Nation (1999). The
test is “controlled” in that it assesses learners’ ability to use a specific target L2 word
when compelled to do so. The PVLT format is a gap-fill task where a meaningful
sentence context is presented, and a missing target word is to be supplied. To restrict the
responses, the first letters of the target word are provided (e.g., The book covers a series
of isolated epis______ from history—Answer: episodes). The guiding principle is to
include the minimal number of letters needed to disambiguate the cue. The PVLT is
similar to the VLT in that it targets sets of words that represent distinct frequency
bands. A total of 18 items are sampled per frequency band: 2,000, 3,000, 5,000, University Word List, and 10,000. The scoring system is dichotomous (correct/incorrect), and minor spelling mistakes and grammatical errors are ignored. The examinee receives six scores: a score for each frequency band and a total score across bands.

https://doi.org/10.1017/S0272263122000341 Published online by Cambridge University Press

482 Suhad Sonbul et al.
The PVLT has been used widely as a measure of controlled productive word
knowledge, but we opted to devise a new controlled productive word knowledge measure
in the present study for several reasons. First, the original PVLT (Laufer & Nation, 1999)
measures items from the 2,000-, 3,000-, 5,000-, and 10,000-word levels and the University
Word List, which may go far beyond the level of our EFL participants. Thus, we opted to
avoid low-frequency lemmas and focus instead on the 3,000 most frequent lemmas in English:
1,000, 2,000, and 3,000 levels. Furthermore, the PVLT uses word family (the headword
and its inflectional and derivational forms, e.g., embarrass, embarrassed, and embar-
rassment) as the counting unit. While further empirical evidence is still needed to
advance our understanding of the different lexical units (see Webb, 2021 for an
overview), research on L1 users (Wysocki & Jenkins, 1987) and L2 learners (Schmitt
& Zimmerman, 2002) seems to suggest that derivational knowledge develops with age
and proficiency. For many less advanced L2 learners, the appropriate lexical unit for
both receptive and productive purposes is likely to be a lemma (the headword and its
inflectional forms in a given part of speech or PoS, e.g., embarrass and embarrassed,
when used as a verb, are members of the same verb lemma) or a flemma (the headword
and its inflectional forms regardless of PoS, e.g., embarrass, embarrassed as a verb, and
embarrassed as an adjective are members of the same flemma). In the present study, the
measure of productive word knowledge is similar in design to the PVLT but takes the
aforementioned points into consideration.
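The three counting units can be illustrated with the article's embarrass example. The sketch below is a minimal, hypothetical illustration: the form inventory and variable names are ours, not part of any published test.

```python
# Minimal sketch of the three counting units, using the article's "embarrass"
# example. The form inventory below is illustrative, not exhaustive.
forms = [
    ("embarrass", "v"), ("embarrasses", "v"), ("embarrassed", "v"),
    ("embarrassed", "adj"),   # same form, different part of speech
    ("embarrassment", "n"),   # derivational form
]
DERIVATIONAL = {"embarrassment"}

# Lemma: headword plus inflections within a single part of speech.
verb_lemma = {form for form, pos in forms if pos == "v"}

# Flemma: headword plus inflections regardless of part of speech.
flemma = {form for form, pos in forms if form not in DERIVATIONAL}

# Word family: additionally includes derivational forms.
word_family = {form for form, pos in forms}

print(sorted(verb_lemma))              # ['embarrass', 'embarrassed', 'embarrasses']
print("embarrassment" in flemma)       # False
print("embarrassment" in word_family)  # True
```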
Thus far, we have examined measures of productive knowledge of individual words.
Because the aim of the present study is to link knowledge of words to knowledge of
collocations, the following sections will focus on collocation knowledge.

Definition of collocations
Scholars interested in collocation research distinguish between two main approaches to
defining collocations, namely, the phraseological approach (Cowie, 1994; Howarth,
1996; Nesselhauf, 2003) and the frequency-based approach (see McEnery & Wilson,
2001; Sinclair, 1991). The phraseological approach identifies collocations based on
co-occurrence restrictions among words and the relative semantic compositionality
and restrictedness of meaning that distinguishes pure idioms (e.g., iron man) from
collocations (e.g., handsome man) and free lexical combinations (e.g., funny man). The
frequency-based approach, in contrast, identifies collocations as combinations whose
co-occurrence frequency is higher than chance would predict, as indicated by strength-of-association measures, such as mutual information (MI).
In the present study, we follow the frequency-based approach to identifying collo-
cations. This means that collocations refer to word combinations “that emerge from a
corpus at greater frequency than could occur by chance, irrespective of their level of
compositionality and/or semantic transparency” (Nguyen & Webb, 2017, p. 300). This
approach is highly valued in L2 learning because corpus-based frequency is often
considered a proxy for language exposure; more frequent items are more likely to be
encountered first in the language input (Peters, 2020).1 Given our frequency-based approach to defining collocations, we will review three common measures of collocation strength: MI, t-score, and Log Dice.

1 A note of caution is in order here. While frequency-based lists are useful, there can be misfits with the frequency band division. Schmitt et al. (2021) gave the example of pencil, which might be one of the first learned words in L2 English but appears relatively low in frequency lists. Thus, while frequency is important, it is always better combined with other knowledge-based measures.

Second language productive knowledge of collocations 483
MI is among the most widely used measures of collocation strength. It is related to
“coherence” (Ellis et al., 2008), “tightness” (González-Fernández & Schmitt, 2015), and
“appropriateness” of word combinations (Siyanova & Schmitt, 2008). The MI score
“uses a logarithmic scale to express the ratio between the frequency of the collocation
and the frequency of random co-occurrences of the two words in the combination”
(Gablasova et al., 2017, p. 163). MI scores are especially high for combinations of rare
words that very often co-occur such as “tectonic plate,” reflecting the exclusivity of the
adjective “tectonic” with the noun “plate” (Durrant et al., 2022). T-score has also been
used as a measure of “certainty of collocation” (Hunston, 2002, p. 73) and “the strength
of co-occurrences” (Wolter & Gyllstad, 2011, p. 436). However, Evert (2005) argues
that the t-score lacks a transparent mathematical grounding, which makes it difficult
to establish statistically reliable and valid cut-off points (Hunston, 2002).
Unlike MI, the t-score favors frequent collocations in the corpus (e.g., “of the” and “on
the”; see Gablasova et al., 2017). Log Dice, in turn, is in principle similar to
the MI score except that it does not overweight rare combinations (Gablasova et al., 2017).
Gablasova et al. (2017) provide the example of “zig zag” as a collocation with a high
Log Dice score and explain that Log Dice is preferable to MI when the language
learning construct requires highlighting exclusivity between collocates without the
bias toward rare combinations. However, Log Dice has not yet been extensively
explored in language learning research. In the current study, we opted to use MI
scores because they are the most widely employed measure of strength of association.
Moreover, to avoid the MI bias for rare combinations, we combined MI with a raw
frequency threshold.
The often-cited MI threshold for “significant” collocations is 3 (Hunston, 2002).
However, Evert (2008) proposed a ranking approach to operationalizing collocations
on a cline from weaker to stronger ones, allowing the MI threshold value to go down.
We employ this ranking approach in the present study because our sample includes
both native speakers of English, who might produce very strong collocations,
and nonnatives, who might produce weaker ones. Thus, based on the ranking
frequency-based approach, we operationalize collocations in the present study as a
sequence of words (two or more) with a minimum MI of 1 and a minimum frequency of
30 in the COCA (Corpus of Contemporary American English).
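The three association measures reviewed above have standard formulations, and the study's operationalization combines an MI threshold of 1 with a raw frequency threshold of 30. The sketch below illustrates this; the function names and example counts are hypothetical, assuming a two-word combination with observed co-occurrence frequency o, individual word frequencies f1 and f2, and corpus size n.

```python
import math

def mi_score(o, f1, f2, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    expected = f1 * f2 / n
    return math.log2(o / expected)

def t_score(o, f1, f2, n):
    """t-score: observed minus expected, scaled by sqrt(observed)."""
    expected = f1 * f2 / n
    return (o - expected) / math.sqrt(o)

def log_dice(o, f1, f2):
    """Log Dice: 14 plus log2 of the Dice coefficient of the two words."""
    return 14 + math.log2(2 * o / (f1 + f2))

def is_collocation(o, f1, f2, n, min_mi=1, min_freq=30):
    """The present study's operationalization: MI >= 1 and raw frequency >= 30."""
    return o >= min_freq and mi_score(o, f1, f2, n) >= min_mi

# Hypothetical counts for a rare-but-exclusive pair (cf. "tectonic plate"):
o, f1, f2, n = 50, 100, 200, 1_000_000
print(round(mi_score(o, f1, f2, n), 2))  # 11.29 -- rare words that often co-occur
print(is_collocation(o, f1, f2, n))      # True
```

Note how MI rewards exclusivity: the pair is rare overall, yet its observed frequency far exceeds the chance expectation, yielding a high score.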

Measuring productive knowledge of collocations


Earlier studies have shown that L2 learners’ productive knowledge of collocations is
limited. This research has either examined corpus-based evidence (e.g., Laufer &
Waldman, 2011; Nesselhauf, 2003; Siyanova & Schmitt, 2008) or used paper-and-
pencil tests (e.g., Frankenberg-Garcia, 2018; González Fernández & Schmitt, 2015,
2020; Nizonkiza, 2012).
One of the earliest corpus-based collocation studies is Nesselhauf (2003) who
examined the use of verb-noun collocations, such as take a break or shake one’s hand,
by advanced German-speaking learners of English in free written production. The
results showed that, despite participants’ high level of proficiency, they exhibited
notable difficulty in producing collocations. The most common type of collocation
mistake was the wrong choice of verb (e.g., carry out races instead of hold races),
followed by the wrong choice of noun (e.g., close lacks instead of close gaps). Similarly,
Laufer and Waldman (2011) investigated the use of English verb-noun collocations in
the writing of native speakers of Hebrew at three proficiency levels. The results revealed
that learners of all proficiency levels produced a higher number of deviant collocations
and far fewer collocations than native speakers. It is notable that Laufer and Waldman
(2011) mainly employed a dictionary-check method to classify verb-noun combina-
tions as acceptable or deviant collocations. As noted in the preceding text, the present
study uses a pure frequency-based approach to identifying collocations and may thus
depict a different picture.
Another relevant corpus-based study is Siyanova and Schmitt (2008, Study 1) who
examined English adjective-noun collocations produced in essays by Russian learners
of English in comparison to native speakers of English. Appropriate collocations were
identified based on joint frequency and MI scores in the BNC. An MI threshold of 3 was
set for appropriate collocations in line with Hunston’s (2002) criteria. A frequency
criterion was also added, that is, six times in the BNC. This figure was chosen because it
allowed for the inclusion of almost half the identified collocation data. Surprisingly, the
results revealed very little difference between native speakers and nonnative speakers in
the use of collocations (48.1% vs. 44.6% of the combinations produced were appropriate
based on BNC counts, respectively).
One great advantage of corpus-based studies is the examination of authentic L2
production. However, such corpus-based research may not reveal all aspects of pro-
ductive knowledge as learners may avoid using certain collocations (ones they are not
confident with) or may overproduce a few collocates that they had practised well
(referred to as “safe bets” or “zones of safety”) (Boers & Lindstromberg, 2009).
Paper-and-pencil tests constitute a more direct measure of productive collocation
knowledge than corpus-based evidence. Gap-fill tests have been the most common
measures and were employed in different formats. One recurrently used format is to
provide a sentential context and ask the learners to complete missing collocates (e.g.,
She was about to ______ a huge mistake) (e.g., González Fernández & Schmitt, 2020;
Nizonkiza, 2012). To restrict the learners’ options, the first letter/syllable of the missing
collocate is often supplied. To further constrain the range of potential collocations
elicited, an L1 statement could be added to provide context for the English sentence.
An obvious advantage of such a format is that researchers can control which items
are targeted and thus can manipulate various variables, such as frequency and con-
gruency. On the minus side, however, these tests do not examine the learners’ authentic
language use and cannot reveal the actual options available to learners during real-time
production. In real life, speakers/writers need to consider the context and think about
all possible collocates of the word at hand before producing the most appropriate
collocate in the target context.
To overcome this limitation, an alternative gap-fill format has been developed by
Frankenberg-Garcia (2018). The format requires participants to complete the gap in a
sentential context/frame with as many collocates as they could think of. For example, in
response to the sentential frame “They attempted to __________ the effect of …”
participants could supply several collocates, including measure, examine, and analyze.
In her study of collocations in an English-for-Academic-Purposes (EAP) context,
Frankenberg-Garcia (2018) consulted the COCA to choose a range of collocations
attested in different disciplinary areas. As shown in the illustrative frame, the nouns
were presented within context, and the participants supplied the missing verbs/
adjectives that collocate with these nouns. A major advantage of this format is that it
simulates real-life performance whereby writers consider several possible collocates in
context. However, scoring the test is not as straightforward as traditional gap-filling
tests that target specific collocations (see preceding text). Frankenberg-Garcia (2018)
used the Pearson International Corpus of Academic English (PICAE) and employed Log
Dice scores (a minimum of 3) and frequency of co-occurrence (five analogous
co-occurrences) to identify acceptable collocations. In the present study, we will be
using Frankenberg-Garcia’s (2018) format to simulate real-life productive collocation
performance and examine factors that influence productive collocation knowledge.
However, because our focus is on general, rather than academic, collocations we will use
the COCA as our reference corpus.

Determinants of productive collocation knowledge


One important determinant of productive collocation knowledge is first language
(L1) similarity. In her analysis of verb-noun collocations, Nesselhauf (2003) found
that the learners’ L1 (i.e., German) had a clear effect on collocation errors. Likewise,
Laufer and Waldman (2011) found that interlingual collocation errors by native
Hebrew learners of English persisted across the three proficiency levels. Another
relevant factor is grammatical configuration or PoS. Collocations are often grouped
into two main categories: lexical and grammatical (Benson et al., 1997), with the latter
including a preposition or a grammatical structure. Most of the research on the effect of
configuration on L2 collocation knowledge has focussed on lexical collocations, but the
evidence in this regard is still limited. For example, Lee and Shin (2021) found no
significant effect of collocation type (i.e., verb-noun, adjective-noun, adverb-adjective,
and adverb-verb) on the learners’ scores in a sentence writing task and a gap-fill task
when collocation frequency was held constant. Similarly, Nguyen and Webb (2017)
found no effect of grammatical configuration (verb-noun vs. adjective-noun) on
receptive knowledge of collocations. Although the available evidence is inconclusive
regarding the effect of PoS on collocation knowledge development, studies that
analyzed L2 learners’ collocational errors showed that adjective + noun and verb +
noun collocations were the most problematic (Nesselhauf, 2003; Yan, 2010). Thus, in
the present study, we focus on adjective (or more generally modifier) + noun and verb
+ noun collocations (see the following text for more details).
In addition to the influence of L1 and collocation type, Nizonkiza (2012) highlighted
the important role of collocation frequency in the development of productive
collocation knowledge. Belgian and Burundian learners of English completed a gap-fill
collocation test. Similar to Laufer and Waldman (2011; see preceding text), target
collocations were identified based on a dictionary check. The results showed that
learners’ collocation knowledge developed as corpus-based frequency increased. Fre-
quency was also found to be an important determinant by González Fernández and
Schmitt (2015) who found that their L1 Spanish – L2 English learners’ collocation
knowledge correlated moderately with corpus frequency (r = .45) and t-score (r = .41).
However, no significant relationship was found between collocation knowledge and MI
score. The results also highlighted a clear influence for the amount of exposure on L2
collocation knowledge. The learners’ knowledge of collocations moderately correlated
with engagement with English outside the classroom (r = .56) and years of English
study (r = .45).


A fourth factor that was prominent in Laufer and Waldman’s (2011) study is
learners’ L2 proficiency. Although the collocation production errors persisted across
all levels of proficiency in that study, the number of appropriate collocations the
learners produced increased at the advanced level. L2 proficiency was also a crucial
factor in Nizonkiza (2012) who explored the relationship between controlled produc-
tive knowledge of collocations and a measure of L2 proficiency. The results showed that
both tests distinguished between the proficiency levels and were highly correlated. This
finding strongly indicates that collocation knowledge develops as L2 proficiency
increases. Another related finding by Ellis et al. (2008) is that raw frequency is a better
predictor of nonnatives’ collocation processing while MI better predicts the processing
of collocations by native speakers. However, Ellis et al. (2008) did not examine whether
and how nonnatives develop their productive collocational knowledge (both in terms of
frequency and association strength) as a function of proficiency. One might speculate
that MI can be the distinguishing feature of nonnative collocation performance at
higher levels of proficiency (approaching nativelike performance), precisely because it
is at high levels of proficiency that learners acquire lower frequency words, and then
also their word partnerships. Conversely, at lower proficiency levels, learners’ vocab-
ulary knowledge may be largely confined to high-frequency words, which seldom form
partnerships with high MI scores. The present study examines this speculation by
including data from both natives and nonnatives and through examining the effect of
increased productive word knowledge (as a proxy of L2 proficiency) on collocational
frequency and association strength (MI scores).
Of most relevance to the current study is the association between knowledge of
single words and collocation knowledge. The relationship between these two types of
lexical knowledge has rarely been examined. On the receptive front, Nguyen and Webb
(2017) investigated EFL learners’ knowledge of verb-noun and adjective-noun collo-
cations at the first three 1,000-word frequency levels, and the extent to which several
factors (including knowledge of single-word items at the same word frequency levels)
influenced receptive knowledge of collocations. The results revealed significantly large
positive correlations between receptive knowledge of single-word items and colloca-
tions (r = .67 for verb-noun collocations and r = .70 for adjective-noun collocations).
Based on this result, the question arises: what about the relationship between single-
word knowledge and collocation knowledge on the productive side? In his research
agenda, Schmitt (2019) has called for more research exploring the productive level of
mastery, which has often been reported to lag behind receptive knowledge. This is
the gap that the present study aims to fill.

The present study


This study aims to explore the association between productive word knowledge and the
productive knowledge of collocations of the most frequent 3,000 lemmas in English.
The study limited itself to three frequency bands (1K, 2K, and 3K) for practicality
considerations. Testing more levels would have required more time and resources. The
focus of the study was also limited to two types of collocations: modifier-noun
(MN) and verb-noun (VN), which are most problematic for L2 learners (Nesselhauf,
2003; Yan, 2010). It should be noted that the frequency bands in the productive
collocation test refer to the frequency of the noun node (the shared component in
both configurations) rather than the frequency of the elicited collocation (see Nguyen &
Webb, 2017, for a similar approach). Moreover, we used the term modifier in its
broadest sense (i.e., any word that describes the noun node or limits its meaning in
some way) in place of adjective to account for the variation of responses in the
productive collocation test (see “Measures” section for more details).
Another aspect of the present study concerns the focus on three collocational
features: appropriacy, frequency, and strength of association. Most of the previous
research on receptive collocation knowledge has focused on the appropriacy of the
elicited responses (e.g., selecting the appropriate collocate out of several options).
Because the present study focused on productive knowledge, we additionally examined
the frequency of the elicited responses and their association strength with the
noun node.
We used Laufer and Nation’s (1999) and Frankenberg-Garcia’s (2018) test formats
to develop productive measures of word and collocation knowledge, respectively, and
refer to them as the Controlled Productive Word Test (CPWT) and the Controlled
Productive Collocation Test (CPCT). The study addresses the following research
questions:
RQ1: To what extent is productive word knowledge associated with the appropriacy of the elicited collocations?
RQ2: To what extent is productive word knowledge associated with the corpus-based frequency of the elicited collocations?
RQ3: To what extent is productive word knowledge associated with the strength of the elicited collocations?
It should be noted that the participants in the present study included both native
speakers (NSs) and non-native speakers (NNSs) of English who took both the CPWT
and the CPCT. Including a NS group was essential for two reasons: (a) to validate the
CPWT with a group of NSs who should have knowledge of the target lemmas and (b) to
establish a baseline against which the NNS group’s CPCT results can be compared.
Thus, we included “Group” (NSs versus NNSs) as a controlling factor in the analysis to
examine how productive knowledge of collocations develops at the highest levels of
proficiency.
In addition to controlling for the effect of Group, we also included several item-
related (frequency and length of individual words) and participant-related (amount of
exposure and age of acquisition) factors as covariates in the analysis. Because colloca-
tion knowledge is a complex construct, we needed to partial out the effect of several
variables before looking at the main focus of the present study, namely, the relationship
between productive word knowledge and productive knowledge of collocations.

Methods
Participants
Two groups of participants took part in the present study. The first group comprised
27 NSs of English. The other group included 55 NNSs of English who spoke Arabic as
their first language.2 The NNSs were students at a university in Saudi Arabia, either in the preparatory-year program (n = 18) or as seniors completing their BA degree in the English medium (n = 37). They showed mastery of the most frequent 1,000 (1K) word families in English, as indicated by their scores (out of 30) on the updated VLT, Version A (Webb et al., 2017): Minimum = 25, Maximum = 30, M = 28.31, SD = 1.63. We set a somewhat lenient threshold of 25/30 for receptive mastery of the 1K level, as the purpose was simply to ensure adequate comprehension of the contexts in the CPWT and CPCT, which all belonged to the most frequent 1,000 word families in English (see the following text).

2 Five additional NNSs were excluded from the analysis as their score in the 1K updated VLT was below 25, the threshold we set for mastery in the current study.
Table S1 (see Supplementary Materials) details the characteristics of participants
under each group as indicated in their responses to a language background question-
naire. Our analysis models (see “Analysis” section) included participants’ average
exposure to English in the four skills and the age at which they started learning English
(coded as 0 years for native speakers) to partial out their effect.

Measures
As our purpose was to measure productive knowledge of both single-word items and
collocations, we developed two measures for the present study: CPWT and CPCT. In
the text that follows, we provide a full description of test creation, piloting, and scoring
procedures.

The Controlled Productive Word Test


We sampled items for the CPWT from Davies’s (n.d.) COCA frequency list. The list is
based on COCA (Davies, 2008–) frequency counts and is lemma based. It includes the
most frequent 60,000 lemmas in English. The list was developed based on raw
frequency, but a dispersion measure (Juilland D; see Juilland & Chang-Rodríguez,
1964) of 0.30 was set as a threshold to eliminate lemmas that are limited to a specific
genre domain (Davies, personal communication). We opted to use the COCA fre-
quency list as it is based on lemma rather than word-family counts, which might be
more suitable for our nonnative participants in the EFL context with varied proficiency
levels (see “Background” section). However, it should be noted that there are limitations
associated with using the COCA frequency list. We will return to these in the
“Discussion” section.
As our aim was to examine knowledge of the most frequent 3,000 lemmas in English,
we limited our sampling to the 1K (1,000), 2K (2,000), and 3K (3,000) levels/bands of
the COCA frequency list. For the purpose of developing the PVLT, Laufer and Nation
(1999) sampled 18 items at each frequency level. Also, Aviad-Levitzky et al. (2019) used
10 items to represent each of the 14 frequency bands in the development of CATSS. The
sampling rate is usually higher in receptive vocabulary measures (i.e., multiple-choice,
checklist formats), where many items can be developed, administered, and scored in a
relatively short time. However, for practicality purposes, the sampling rate is usually
smaller in productive measures, taking into consideration test development, adminis-
tration, and scoring time. In the present study, we initially opted for a round number
that is closest to Laufer and Nation’s (1999) sampling rate, that is, 20 lemmas per band.
We employed stratified sampling (based on percentages) to specify the number of
lexical word classes (excluding grammatical lemmas) from within each PoS (nouns,
verbs, adjectives, and adverbs). Based on the percentages presented in Table S2 (see
Supplementary Materials), each 1,000-lemma level was tested using 11 nouns, 5 verbs,
3 adjectives, and 1 adverb (total = 20). With 20 headwords under each of the three
frequency levels, the test assessed the productive knowledge of 60 lemmas in total.
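The stratified sampling procedure can be sketched as follows. This is an illustration under stated assumptions: the per-band lemma inventories are hypothetical placeholders, and Python's random.sample stands in for the List Randomizer actually used.

```python
import random

# PoS quotas per 1,000-lemma band, from the stratified percentages (total = 20).
QUOTAS = {"noun": 11, "verb": 5, "adjective": 3, "adverb": 1}

def sample_band(band_lemmas, seed=None):
    """Randomly sample the PoS quota of lemmas within one frequency band.

    band_lemmas maps each PoS to that band's full lemma list (hypothetical data).
    """
    rng = random.Random(seed)
    return {pos: rng.sample(band_lemmas[pos], k) for pos, k in QUOTAS.items()}

# Hypothetical band: placeholder lemma inventories for illustration only.
band_1k = {
    "noun": [f"noun{i}" for i in range(550)],
    "verb": [f"verb{i}" for i in range(250)],
    "adjective": [f"adj{i}" for i in range(150)],
    "adverb": [f"adv{i}" for i in range(50)],
}
targets = sample_band(band_1k, seed=1)
print(sum(len(v) for v in targets.values()))  # 20 lemmas per band
```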
All lemmas belonging to a given PoS at each COCA-frequency level were random-
ized through the List Randomizer (https://www.random.org/lists/) to select target
items. For each candidate target word, a short defining sentence context was provided.
Words in the surrounding context belonged to the most frequent 1,000-word families
(BNC/COCA List; Nation, 2012). This was important to ensure that our NNSs, who
showed mastery of that frequency level (see “Participants” section), would be able to
fully comprehend the sentences.
The approach we employed to restrict responses was slightly different from Laufer
and Nation (1999; see “Background” section). They provided first letter(s) as clues and
decided on the number of these based on possible orthographic neighbourhood
(between 1–6 letter clues). To unify the number of clue letters, we only provided the
first letter but also represented the number of letters by dashes as an additional clue.
Here is an example for the noun violation:

We must report what he has done to the police. This is a v_ _ _ _ _ _ _ _ of the law.
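The clue format (first letter plus one dash per remaining letter) can be generated mechanically. The helper below is hypothetical, not part of the authors' materials; it reproduces the space-separated dash rendering shown in the example above.

```python
def make_clue(word):
    """First letter followed by one underscore per remaining letter."""
    return word[0] + "".join(" _" for _ in word[1:])

print(make_clue("violation"))  # v _ _ _ _ _ _ _ _
```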
The initial draft of the test went through several piloting stages. Two groups of
natives took the test in different rounds. Items that did not elicit the target lemma were
either replaced or modified. It was difficult to reach a perfect score for all items. Thus,
we decided to use the test as it is and then exclude items where our main pool of 27 NSs
did not achieve an acceptable score (see the following text). Appendix S1 presents the
target items for the CPWT and Appendix S2 presents the actual test along with the
answer key (see Supplementary Materials).
Responses in the CPWT were scored dichotomously (0/1) based on accuracy.
Following Laufer and Nation (1999), minor mistakes in grammar (i.e., different
lemma form: violations instead of violation) and in spelling (incorrect but recog-
nizable form, e.g., comet instead of commit and bleam instead of blame) were
ignored. However, because the test is lemma-based, a different PoS (e.g., violate
instead of violation) was coded as inaccurate (0). All responses were scored
by a proficient Arabic–English research assistant who holds an MA degree in
English. She was given detailed instructions before she started scoring the
responses. Additionally, another Arabic–English research assistant holding an
MA degree in English scored a random sample of 30% of the responses based on the
same guidelines. Interrater reliability was high (ICC = .99, 95% confidence interval
(CI) [.97, .99]). Therefore, only the scores awarded by the first rater were included in
the analysis.
As indicated in the preceding text, we initially examined responses by natives to
exclude items where more than 20% of the NSs provided incorrect answers. This resulted
in excluding the following words from the analysis:

1K: walk, series, realize, chair
2K: expand, perception, content, achieve
3K: emotion, mixture, stability, impose, practical
Thus, we ended up with 16 items at the 1K level, 16 items at the 2K level, and 15 items at
the 3K level. It should be noted that the incorrect responses provided by NSs in the
CPWT do not reflect lack of knowledge (as NSs surely know these highly frequent
words) but may have been caused by the clues not properly restricting the required
response. This makes our results limited in estimating productive vocabulary size at
each frequency band. We will return to this point when we discuss limitations of the
study.
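The item-exclusion rule (drop any item that more than 20% of NSs answered incorrectly) amounts to a simple accuracy filter. The sketch below is illustrative: the function name and score data are hypothetical, not the study's actual responses.

```python
def filter_items(ns_scores, max_error_rate=0.20):
    """Keep items whose NS error rate is at or below the threshold.

    ns_scores maps each item to a list of dichotomous NS scores (1 = correct).
    """
    kept = {}
    for item, scores in ns_scores.items():
        error_rate = 1 - sum(scores) / len(scores)
        if error_rate <= max_error_rate:
            kept[item] = scores
    return kept

# Hypothetical scores from five NSs (the study used 27):
ns_scores = {"violation": [1, 1, 1, 1, 1], "walk": [1, 0, 0, 1, 1]}
print(sorted(filter_items(ns_scores)))  # ['violation'] -- "walk" exceeds 20% errors
```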
Table S3 (Supplementary Materials) presents the percentage of accurate/inaccurate
CPWT responses provided by NSs and NNSs under the three frequency levels for the
final item pool. We will not further analyze scores in the CPWT as the focus of the
present study is on collocation knowledge. These scores will be included as a main
factor in the analysis of the CPCT scores to answer the three research questions.
It is worth noting that the 20% exclusion criterion employed in the present study is
higher than that employed by Laufer and Nation (1 out of 7, i.e., around 15%). However,
our format is more challenging than Laufer and Nation’s, as we only
provided the first letter as a clue, in addition to dashes to restrict the number of letters.
Thus, 20% may be a more suitable threshold for the present study.
We calculated the internal reliability of the final CPWT form (16 words at the 1K
level, 16 words at the 2K level, and 15 words at the 3K level), including scores achieved
by NSs and NNSs, and it was found to be high: 1K (Cronbach's alpha = .86), 2K (Cronbach's
alpha = .91), 3K (Cronbach's alpha = .95), total score (Cronbach's alpha = .97).
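The reliability computation reported above can be sketched as follows. This is a generic Cronbach's alpha over a tiny invented response matrix (rows = test takers, columns = dichotomously scored items), not the authors' data or script:

```python
import statistics

# Cronbach's alpha for dichotomously scored items:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
# The response matrix below is invented purely for illustration.

def cronbach_alpha(scores):
    k = len(scores[0])                        # number of items
    items = list(zip(*scores))                # column-wise item scores
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(row) for row in scores]
    total_var = statistics.pvariance(totals)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 2))  # 0.84 for this toy matrix
```

With real data, each row would be one participant's item scores at a given frequency level, matching the per-level alphas reported in the text.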

The controlled productive collocation test


As indicated previously, we used Frankenberg-Garcia's (2018) test format to assess the
controlled productive knowledge of MN and VN collocations within the most frequent
3,000 words (the 1K, 2K, and 3K levels). We used the same noun nodes for both
configurations (MN and VN) to allow a direct comparison.
Nouns from the COCA Frequency List were randomized using the List Randomizer
(https://www.random.org/lists/) to select target items. To minimize any transfer effect
between the two tests (CPWT and CPCT), no noun was repeated in both measures. The
sampled target nouns at each frequency level were checked individually in the COCA
interface (Davies, 2008–) to determine how varied their modifier and verb collocates are. For
modifiers, the span was set to –1 and for verbs the span was set to –2 to allow for an
intervening determiner. The search involved nouns as lemmas (e.g., node: [book]_nn*;
collocates: _v* and _j*), and the resulting collocates were sorted by frequency followed
by MI (mutual information) value. As indicated in the preceding text (see “Background”
section), we used Evert’s ranking approach to operationalizing collocations with a
minimum MI threshold of 1 and a minimum raw frequency of 30. However, as the
purpose of this initial stage of test development was to explore strong collocations, we
employed the stricter MI and frequency thresholds of 3 (Hunston, 2002) and
50 (Nguyen & Webb, 2017), respectively. Only nouns that allowed a range of variation
for both configurations with highly frequent and strong modifiers and verbs (which our
NNSs might know) were included in the initial pool. For example, the noun node chance
was considered suitable for the present measure as it allows several modifiers (e.g., good,
real, great, fair, excellent) and verb (e.g., have, get, take, stand) collocates. However, the
noun node bit was not considered suitable for our purposes. This is because while the
COCA search for collocating modifiers of bit resulted in several options (little, tiny,
small ), the search for collocating verbs resulted only in one record, blow. We ended up
with 11 candidate noun nodes for the CPCT at each frequency level.
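The node-screening step just described can be sketched as a simple filter. The code below is an illustrative reconstruction, not the authors' script: the collocate statistics are invented, and the requirement of at least two strong collocates per configuration is an assumption standing in for the paper's "range of variation" criterion:

```python
# A noun node is kept only if it has several strong collocates
# (MI >= 3, raw frequency >= 50) in BOTH configurations (MN and VN).
# All collocate statistics below are invented, not actual COCA counts.

MIN_MI, MIN_FREQ = 3.0, 50

def strong_collocates(collocates):
    """Keep collocates from (collocate, freq, MI) tuples meeting both thresholds."""
    return [c for c, freq, mi in collocates if freq >= MIN_FREQ and mi >= MIN_MI]

# node -> {"MN": [...], "VN": [...]} with (collocate, raw frequency, MI) tuples
candidates = {
    "chance": {
        "MN": [("good", 900, 3.5), ("real", 400, 3.2), ("fair", 120, 4.1)],
        "VN": [("have", 1500, 3.1), ("take", 700, 3.9), ("stand", 200, 5.0)],
    },
    "bit": {
        "MN": [("little", 5000, 6.0), ("tiny", 300, 5.2)],
        "VN": [("blow", 40, 3.4)],  # a single record, below the frequency cut
    },
}

MIN_VARIATION = 2  # assumed: at least two strong collocates per configuration

suitable = [
    node for node, configs in candidates.items()
    if all(len(strong_collocates(cols)) >= MIN_VARIATION for cols in configs.values())
]
print(suitable)  # only "chance" passes in this toy data
```

In this toy data, chance passes both configurations while bit fails on the VN side, mirroring the example given in the text.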
Then, for each target noun node, we developed a context that is general enough to
allow the elicitation of as many collocates as possible. We also made sure that none of

Second language productive knowledge of collocations 491

the contexts for MN collocations gave away VN collocations or vice versa. For example,
we did not use the verb create in the sentence eliciting "modifier + image" collocations
as this might lead to an effect on the "verb + image" item. Similar to the CPWT, all
words in the surrounding contexts were at the 1,000-word level to ensure full comprehension
by our NNSs. The target noun was underlined in each sentence context to stress
the target node for which collocations are to be produced. After two piloting rounds
with NSs and NNSs to ensure variation in responses, we ended up with 10 noun nodes
at each frequency band. Each noun node was presented twice in the test, to elicit
modifier collocates and then verb collocates. Target items for the CPCT are presented in
Appendix S3 and the actual test is presented in Appendix S4 (see Supplementary
Materials) with examples of typical collocations. Here are examples of CPCT items for
the noun node chance, with possible strong collocates provided in brackets:

MN: This is a/an ____________ chance. (possible responses: good, real, great,
fair, excellent)
VN: They ____________ a chance to win. (possible responses: have, get, take,
stand)
Following Frankenberg-Garcia (2018), the CPCT format instructed participants to
insert as many collocates as possible (modifiers in the first section and single verbs in the
second section) with the noun presented in context. This resulted in 12,491 data points,
with the number of responses per noun node ranging from 0 to 16 (M = 1.64, SD = 1.07)
for NNSs and from 1 to 48 (M = 4.11, SD = 3.31) for NSs. The huge variation in the
number of responses provided by NSs may point to the possibility that NSs
approached the task differently than NNSs. We will revisit this point when we discuss
the limitations of the study (see “Discussion” section).
Test scoring and response classification went through three stages: initial accuracy
coding, data recording, and appropriacy classification. These steps are explained in
detail in Appendix S5 (Supplementary Materials). As previously noted, we use the term
modifier in the MN section of the test in its broadest sense. Thus, under the MN
category, we accepted adjectives, attributive nouns (or noun adjuncts), and determiners
as modifier responses.3
Finally, each response was coded as "appropriate" or "inappropriate" based on
COCA data: an MI threshold of 1 and a frequency threshold of 30. Thus, our definition
of collocation “appropriacy” is related to corpus frequency rather than any evaluative
judgment of the responses provided. Each collocation response was included in the data
log along with its COCA collocation frequency and calculated MI score (across various
lemma forms).4
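The appropriacy coding described above amounts to checking each elicited collocation against the two thresholds, with MI computed by the formula given in footnote 4. A minimal sketch; the helper names and all counts are hypothetical, not the authors' code or COCA figures (COCA itself is on the order of one billion words):

```python
import math

# Code a response as "appropriate" (1) or "inappropriate" (0) using the
# lenient scoring thresholds: MI >= 1 and collocation frequency >= 30.
# MI = log2((AB * sizeCorpus) / (A * B * span)), per footnote 4.

CORPUS_SIZE = 1_000_000_000  # assumed corpus size, for illustration only
SPAN = 2                     # window used for verb collocates in this study
MI_MIN, FREQ_MIN = 1.0, 30

def mi_score(collocation_freq, lemma_a_freq, lemma_b_freq,
             corpus_size=CORPUS_SIZE, span=SPAN):
    """Mutual information of a collocation from lemma and collocation counts."""
    return math.log2((collocation_freq * corpus_size) /
                     (lemma_a_freq * lemma_b_freq * span))

def code_response(collocation_freq, lemma_a_freq, lemma_b_freq):
    """Return 1 ("appropriate") or 0 ("inappropriate")."""
    if collocation_freq < FREQ_MIN:
        return 0
    mi = mi_score(collocation_freq, lemma_a_freq, lemma_b_freq)
    return 1 if mi >= MI_MIN else 0

# Hypothetical verb + noun pair: AB = 100, A = 10,000, B = 50,000
print(round(mi_score(100, 10_000, 50_000), 2))  # 6.64
print(code_response(100, 10_000, 50_000))       # 1
print(code_response(10, 10_000, 50_000))        # 0 (below the frequency cut)
```

Note that a response can fail on either criterion independently: a rare but exclusive pairing fails the frequency cut, while a frequent but promiscuous pairing can fail the MI cut.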

3 Although the CPCT instructions clearly specified that the test targeted adjective collocates of target nouns
(MN section), several participants (especially NSs) provided nouns and determiners in addition to adjectives.
Thus, nouns and determiners that met our MI and frequency threshold were coded as “appropriate” in the
MN section. We believe this should better reflect the knowledge of our participants who might not all be
familiar with the strict definition of adjectives. Additionally, several responses in the VN section of the CPCT
comprised phrasal verbs. These were also coded as “appropriate” if they met our MI and frequency
thresholds.
4 Because the present study employed a lemma-based definition of word frequency, we calculated the MI
score for each elicited collocation based on the lemma frequency of the component words and the lemma-

Procedures
After signing the consent form, the participants were administered the CPWT. The
sheets were then collected and participants were administered the CPCT. Finally, they
completed the language background questionnaire and the 1K updated VLT test.
It should be noted that due to the COVID-19 pandemic and class suspension in
Saudi Arabia, we could not run all participants in face-to-face sessions. The NS group
and a subset of the NNSs group (i.e., seniors) took the test online (with cameras on) and
were instructed not to use any sources of help. The whole test battery took the NNSs
between 60 and 75 minutes to complete, and the NSs only 45 minutes.

Analysis
To address the three research questions, three separate analyses were conducted in R
version 4.1.1 (R Core Team, 2021). We will refer to the three analyses as Model 1, Model
2, and Model 3. Dichotomous (0/1) outcome values (collocation appropriacy scores in
Model 1) were analyzed using mixed-logit regression for binary data (the glmer
function in the lme4 package). This analysis targeted "appropriacy" scores for the CPCT
(Model 1) that addressed the first research question. For continuous dependent vari-
ables including collocation frequency (Model 2, second research question) and MI
values (Model 3, third research question), we employed linear mixed-effects (LME)
models (lmer function in the lme4 package). The random effect structure of all models
was the same, including random intercepts for items and subjects, random by-item
slopes for Group (NSs vs. NNSs), and random by-subject slopes for Frequency Level
(1K, 2K, 3K). All analyses were conducted stepwise, evaluating each factor's
contribution to model fit using AIC values in a forward-selection procedure. In
the following text, we describe the structure of each model.
The three models (Model 1, Model 2, and Model 3) are concerned with results of the
CPCT. Model 1 was a mixed-logit regression with binary collocation appropriacy score
(1 = appropriate, 0 = inappropriate) as the dependent measure. The full CPCT data set
(12,491 data points) was admitted to this analysis of appropriacy. Covariates included
target node lemma length,5 configuration (MN vs. VN), average exposure to English,
and age when the participant started to learn English. Main fixed variables included
Group (NSs as the reference level), Frequency Level (1K as the reference level), and total
CPWT score (out of 47). The total CPWT scores were included in the analysis to
examine the effect of increased productive vocabulary size on the odds of providing
appropriate collocations (Research Question 1). We also tested for the interaction
between Frequency Level and CPWT scores. Odds ratios transformed from log odds
(Exp(β) values) were used as estimates of the strength of each significant predictor in the
model. We also calculated Cohen's d values based on log odds as standardized estimates
of effect size.

4 (continued) based collocation frequency. The following formula was used, where AB = collocation
frequency, A = frequency of the first lemma (verb or modifier), and B = frequency of the noun lemma:

MI = log2((AB * sizeCorpus) / (A * B * span))

5 Model 1 did not include item-level properties of collocates. This is because the full data set that was
admitted to the Model 1 analysis included some missing responses coded as "inappropriate" (i.e., no answer,
unrecognizable spelling, or inaccurate PoS). Moreover, none of the three models (Model 1, Model 2, and
Model 3) included node frequency as a covariate, as this variable is already represented in the models by the
focal factor (Frequency Level).
The other two models (Model 2 and Model 3) focused on a subset of the CPCT data
(6,156 data points, 49.36%, for “appropriate” collocations only) to evaluate factors that
predict the log frequency of appropriate collocations (Model 2; Research Question 2)
and those that predict the strength of appropriate collocations, that is, MI value (Model
3; Research Question 3). Other than this difference in the dependent measures, Models
2 and 3 had similar structures. Covariates for both models included node lemma length,
collocate lemma length, log collocate lemma frequency, configuration, average expo-
sure to English, and age when the participant started to learn English. These were only
included to partial out their effect before testing for the main factors. Main fixed factors
included Group, Frequency Level, CPWT scores, and the interaction between Fre-
quency Level and CPWT scores. Effect sizes for these models are represented by
marginal and conditional R² values. The former involves only fixed effects, whereas the
latter incorporates random effects as well (see Winter, 2019, for a fuller explanation).
We also employed Brysbaert and Stevens’s (2018) guidelines to calculate Cohen’s d of
significant variables in the LME models.
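The two effect-size conversions used in these analyses (odds ratios from log odds in Model 1, and Cohen's d computed from log odds) can be made concrete. This is a sketch assuming the standard logistic conversion d = β·√3/π; the exact formula the authors applied under Brysbaert and Stevens's (2018) guidelines is an assumption. The worked value is the 2K-vs-3K contrast (β = –0.48) reported for Model 1:

```python
import math

# Convert a mixed-logit coefficient (log odds) into the two effect-size
# estimates reported in the study: an odds ratio, Exp(beta), and a
# standardized Cohen's d. The d formula (beta * sqrt(3) / pi) is the
# standard logistic-to-d conversion and is assumed, not quoted from the paper.

def odds_ratio(beta):
    return math.exp(beta)

def cohens_d_from_log_odds(beta):
    return beta * math.sqrt(3) / math.pi

beta = -0.48  # the 2K-vs-3K contrast reported for Model 1
print(round(odds_ratio(beta), 2))             # 0.62: the odds of an
                                              # appropriate response drop
                                              # by roughly 38% at 3K
print(round(cohens_d_from_log_odds(beta), 2)) # about -0.26, close to the
                                              # reported d = -0.27 (which was
                                              # presumably computed from the
                                              # unrounded coefficient)
```

The closeness of the recovered d to the reported value suggests this conversion matches the one used, but that remains an inference from the numbers rather than a stated fact.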
Table S4 (see Supplementary Materials) presents a summary of the continuous
variables. Collinearity was checked for significant predictors in each model using the
variance inflation factor (VIF). All VIF values were below 2, indicating no collinearity
issues.
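For a pair of continuous predictors, the VIF check described above reduces to 1/(1 − r²), where r is their Pearson correlation. A standard-library sketch with invented data (the predictor names are hypothetical, not the study's variables):

```python
import math

# Collinearity check for two predictors: VIF = 1 / (1 - r^2),
# where r is the Pearson correlation between them.
# The data below are invented purely for illustration.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x, y):
    r = pearson_r(x, y)
    return 1 / (1 - r ** 2)

cpwt_scores = [1, 2, 3, 4, 5]  # hypothetical predictor values
exposure    = [2, 1, 4, 3, 5]

vif = vif_two_predictors(cpwt_scores, exposure)
print(round(vif, 2))  # 2.78 for this toy pair: below the common cut-offs
                      # of 5 or 10, but above the stricter threshold of 2
                      # applied in the present study
```

With more than two predictors, each VIF is computed by regressing one predictor on all the others, but the interpretation is the same.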

Results
Table 1 presents the percentage of appropriate/inappropriate collocation responses in
the CPCT as well as mean frequency and MI values of appropriate collocations for both
NSs and NNSs and for both configurations (MN/VN). It is interesting to note that out
of the total 12,491 data points in the CPCT, 53.3% (6,658 responses) were produced by
NSs but only around 46.7% (5,833) were produced by NNSs. Furthermore, the 5,833
data points for NNSs included 442 empty cells (no responses). NSs, however, did not
leave any unfilled gaps in the CPCT. We opted to keep the empty cells in the
appropriacy analysis (Model 1, RQ1; coded as 0 = inappropriate), as they represent
a lack of collocation knowledge.
Percentage scores and average frequency/MI values in Table 1 seem to indicate
several similarities between NSs and NNSs. First, the percentage of appropriate
responses ranged between 63% and 45% for NSs and between 54% and 36% for NNSs.
The percentage of appropriate responses produced by NSs might seem counterintui-
tive, with more than 40% inappropriate responses. It should be noted though that these
results are similar to (and even higher than) Siyanova and Schmitt’s (2008) findings of
48.1% and 44.6% appropriate collocations for NSs and NNSs, respectively.
Another notable finding in relation to the number of “appropriate” collocation
responses is that they gradually decreased as a function of the frequency band
(i.e., fewer appropriate responses for lower frequency levels) for both the NNSs and
NSs. Similarly, for COCA-based frequency, both groups showed a gradual decrease as a
function of frequency band (with lower frequency values overall for the NSs than the
NNSs). Finally, MI showed a gradual increase for lower frequency bands for both
participant groups.



Table 1. Descriptive statistics for responses in the CPCT (appropriacy, frequency, and MI values)

NS (n = 27)
Level  Config       Total      Appropriate  %     Mean collocation    Mean collocation
                    responses  responses          frequency (SE)      MI (SE)
1K     MN (n = 10)  1348       849          63.0  871.30 (45.89)      5.07 (0.07)
1K     VN (n = 10)  1068       648          60.7  2012.43 (215.39)    4.06 (0.08)
1K     Total        2416       1497         62.0  1365.26 (97.86)     4.64 (0.06)
2K     MN (n = 10)  1157       671          58.0  315.14 (16.14)      5.36 (0.07)
2K     VN (n = 10)  910        458          50.3  1040.27 (132.81)    3.87 (0.08)
2K     Total        2067       1129         54.6  609.30 (55.71)      4.76 (0.06)
3K     MN (n = 10)  1160       540          46.6  307.85 (17.03)      5.91 (0.11)
3K     VN (n = 10)  1015       406          40.0  451.08 (26.78)      4.22 (0.08)
3K     Total        2175       946          43.5  369.32 (15.22)      5.18 (0.08)

NNS (n = 55)
Level  Config       Total      Appropriate  %     Mean collocation    Mean collocation
                    responses* responses          frequency (SE)      MI (SE)
1K     MN (n = 10)  1202       654          54.4  1030.66 (60.32)     4.02 (0.06)
1K     VN (n = 10)  930        457          49.1  3719.75 (347.61)    3.41 (0.08)
1K     Total        2132       1111         52.1  2136.79 (152.50)    3.77 (0.05)
2K     MN (n = 10)  1058       516          48.8  486.54 (34.27)      4.45 (0.08)
2K     VN (n = 10)  812        296          36.5  1079.50 (173.82)    3.60 (0.09)
2K     Total        1870       812          43.4  702.69 (67.68)      4.14 (0.06)
3K     MN (n = 10)  996        416          41.8  368.37 (21.99)      4.69 (0.12)
3K     VN (n = 10)  835        245          29.3  460.02 (32.93)      3.86 (0.10)
3K     Total        1831       661          36.1  402.34 (18.52)      4.38 (0.09)

Note: % = percentage of appropriate responses; mean frequency and MI values are for appropriate responses only.
*Total responses for NNSs include instances when the participant provided no answer (coded as 0 = inappropriate).

Regarding the effect of configuration (MN versus VN) on the appropriacy, frequency,
and MI of collocations, there seems to be a tendency for MN collocations to be more
appropriate, less frequent, and more strongly associated than VN collocations, for both
NSs and NNSs.
The following three subsections will present the best-fit Models 1, 2, and 3 to answer
Research Questions 1, 2, and 3, respectively.

Association between productive word knowledge and the appropriacy of collocations (Model 1, RQ1)
The best-fit mixed-logit Model 1 for variables predicting appropriate collocation
responses is presented in Table S5 (Supplementary Materials). None of the covariates
tested contributed to the model. Of the main effects, only CPWT score and Frequency
Level were significant. Group was initially significant but ceased to be so once CPWT
scores were added to the model. The results suggest that participants who scored higher
in the CPWT were more likely to produce appropriate collocations in the CPCT (small
effect). For Frequency Level, the model showed that more appropriate responses were
provided at the 1K (reference) level than the 3K level (small effect). However, the
difference between the 1K and 2K levels was not significant. To examine the remaining
contrast between the 2K and 3K levels, we redefined the reference level as (2K). The
difference was significant with a small effect (β = –0.48, z = –2.62, p = .009, d = –0.27).
Finally, the interaction between the CPWT score and the Frequency Level was not
significant, suggesting that frequency band did not modulate the observed productive
word knowledge effect.

Association between productive word knowledge and the frequency of collocations (Model 2, RQ2)
Table S6 (Supplementary Materials) presents the best-fit LME model for variables
predicting the COCA frequency of appropriate collocations. One notable significant
covariate is the frequency of the collocate lemma. The positive, large effect suggests that
the higher the frequency of the provided collocate, the higher the frequency of the
collocation as a whole. This effect is expected as highly frequent lemmas are more likely
to form part of highly frequent collocations.
After controlling for several significant covariates, we found that out of the three
main variables (Group, Frequency Level, and CPWT score), only Frequency Level and
CPWT score were significant. Like Model 1, the Group variable ceased to be significant
when CPWT scores were added. Moreover, similar to Model 1, Model 2 showed no
interaction between Frequency Level and CPWT scores.
We will now further explore the main effects reported in the preceding text for
Frequency Level and CPWT scores. For Frequency Level, the model showed that the
COCA frequency of collocations at the 1K level was significantly higher than those at
the 2K and 3K levels (small to medium effects). Upon redefining the reference level as
2K, we found no significant difference between the 2K and 3K levels (β = –0.21,
t = –1.48, p = .15, d = –0.21). Turning to the effect of CPWT scores, the results
showed a significant (though very small) increase in the frequency of elicited colloca-
tions as the CPWT score increased.
Finally, the fact that the interaction between Frequency Level and CPWT score was
not significant suggests that this positive effect of increased CPWT scores held across
all frequency bands.

Association between productive word knowledge and the strength of collocations (Model 3, RQ3)
Our final LME model (Model 3) examined the factors that predict the strength or MI
value of the elicited collocations. Results presented in Table S7 (see Supplementary
Materials) show two significant covariates. The most interesting of these is the contri-
bution of configuration with a medium effect: lower overall MI values for VN collo-
cations in comparison to MN collocations.
Similar to Model 2, the CPWT score and Frequency Level significantly contributed
to the model fit, but the interaction between them did not. The Frequency Level
contrasts seem to suggest that 3K collocations were significantly more likely to reflect
higher MI values than 1K collocations (medium effect), but the difference between the
1K and 2K levels was not significant. Upon redefining the reference level, we found that
the difference between the 2K and 3K levels was significant with a small effect: (β = 0.61,
t = 2.71, p = .009, d = 0.40). For the effect of the CPWT, similar to Model 2, the results
suggest a significant (though very small) effect across all frequency bands: higher MI
values (i.e., stronger collocations) as productive knowledge increased.

Discussion
Lexical knowledge is not merely about developing the form-meaning link of individual
words but, most importantly, about knowing how lexical items are used in context
(Frankenberg-Garcia, 2018). From this perspective, corpora have informed research on
whether and how language users conventionally put words together to make appro-
priate utterances. Utilizing the COCA, we devised a lemma-based controlled produc-
tive word test (CPWT) and a controlled productive collocation test (CPCT). The aim
was to examine the interrelationship between the productive knowledge of words and
collocation appropriacy, frequency, and strength.
Regarding the appropriacy of collocations produced in the CPCT, our results (RQ1)
revealed that productive word knowledge test scores and frequency band significantly
contributed to scores. If we consider productive word knowledge in the present study as
a proxy of proficiency, our findings can be interpreted as support for previous research
in the area. For example, Laufer and Waldman (2011) reported that the appropriacy of
collocations produced by their learners improved as a result of increased proficiency.
Similarly, L2 proficiency, as measured with TOEFL, was a central factor contributing to
productive knowledge of collocation in Nizonkiza’s (2012) study.
Looking more directly at the association between word knowledge and collocation
knowledge, Nguyen and Webb (2017) found a strong relationship between receptive
knowledge of single-word items and the accuracy of receptive collocation knowledge (r
≈ .70). Our results extend Nguyen and Webb’s findings to productive vocabulary
knowledge; the wider the productive knowledge of words is, the more appropriate the
collocations produced by language users are, though the effect is small (d = 0.24). The
fact that the effect was small here but large in Nguyen and Webb’s study might simply
be owing to different analysis methods (mixed logit model vs. correlation, respectively)
or might be a genuine difference between productive and receptive knowledge. Further
research in this area can help tackle this issue. It is also notable that the results of both
NSs and NNSs were fairly similar (no effect of Group) with the percentages of
appropriate responses being in line with findings of Siyanova and Schmitt (2008). This
might be due to the fact that both studies employed a frequency-based approach to
identifying appropriate collocations.


Regarding the effect of the frequency band on collocation appropriacy, the fact that
the difference between the 1K and 2K levels was not significant may suggest that
productive knowledge of collocations deteriorates at the 3K level, at least for the EFL
participants in the present study.
We also examined the association between productive word knowledge and the
corpus-based frequency of the collocation responses (RQ2). The results showed that
productive word knowledge significantly contributed to the model (still with a small
effect), with higher frequency collocations being produced overall by participants with
wider word knowledge. Concerning frequency bands, noun nodes at the 1K level
elicited more frequent collocations than the 2K and 3K levels, but no significant
difference was established between the 2K and 3K levels. This finding might be due
to the presence of advanced learners among the cohort for whom the differences in the
frequency of elicited collocations at the lower frequency bands might be minimal.
Overall, the collocation frequency analysis seems to suggest that language users who
know more words productively were more likely to produce higher frequency collo-
cations, suggesting a close relationship between productive vocabulary knowledge and
productive collocation competence. Although this direction of the CPWT effect might
seem counterintuitive given the fact that Ellis et al. (2008) showed that nativelike
performance is akin to stronger collocations (higher MI and thus lower frequency), we
believe this discrepancy in findings is related to a difference in the measures used in
both studies (see the following text).
The last research question (RQ3) concerns the potential association between pro-
ductive word knowledge and the strength of produced collocations (MI scores).
Overall, similar to the effect reported for collocation frequency (see preceding text),
as productive word knowledge increased so did collocation strength, though the effect
was small. Two other significant predictors emerged. First, grammatical configuration
(MN vs. VN) was significant. While Lee and Shin (2021) and Nguyen and Webb (2017)
did not find any effect of grammatical configuration on collocation accuracy, our study
established an effect on collocation strength. Significantly lower MI scores were
observed for VN combinations compared with MN combinations. In part, this lower
average MI score for VN collocations could be attributed to the fact that some common
verbs (e.g., delexical verbs like make and do) combine with many nouns resulting in
relatively low MI scores.
Second, in contrast with the collocation frequency measure (see Model 2), collocation
strength was observed to significantly increase as frequency level decreased, though not
between the 1K and 2K levels. This is in fact an expected result, bearing in mind that MI
scores are influenced by frequency; higher frequency words that collocate with a vast
number of other words tend to have smaller MI scores than lower frequency words that
collocate with a relatively limited number of words (Gablasova et al., 2017; Nguyen &
Webb, 2017; see also Bestgen, 2017).
Overall, the results of the present study seem to suggest that productive collocation
knowledge is associated with productive knowledge of individual words. Participants
who know more individual words productively are more likely to produce appropriate
collocations that are highly frequent and have stronger association. This effect was
omnipresent regardless of the Frequency Level (1K, 2K, and 3K). The fact that an
increase in productive word knowledge was associated with higher frequency and
stronger association might seem contradictory to the findings of Ellis et al. (2008) who
found that nativelike processing is associated with higher MI but not higher frequency.
It should be noted, however, that the measures employed in both studies are funda-
mentally different. Ellis et al.’s (2008; Experiment 2) productive outcome measure was


articulation latency in a reading aloud task, and frequency and MI were examined as
predictor variables. Conversely, our study elicited open-ended responses in a gap-fill
task and included frequency and MI as outcome variables. Our results seem to suggest
that, at least for highly frequent noun nodes, more proficient speakers of the language
are more likely to produce more conventionalized collocations (i.e., those that are
highly frequent and strongly associated according to corpus data). Thus, as NNSs
develop their proficiency to nativelike levels, the collocations they produce become not
only stronger but also more frequent (based on corpus counts).

Pedagogical implications
Results of the NNSs in the present study seem to show limited productive knowledge of
both words and collocations (appropriacy, frequency, and strength) that lags behind
the knowledge exhibited by native speakers. In fact, productive vocabulary knowledge,
needed for writing and speaking properly in the L2, has often been reported to be a more
advanced level of mastery than receptive knowledge (e.g., Aviad-Levitzky et al., 2019).
This receptive/productive distinction has been reflected in Nation and Webb’s (2011)
Technique Feature Analysis (TFA) which was developed to evaluate vocabulary
activities based on 18 criteria. The TFA gives higher overall scores to exercises that
involve some level of “form retrieval,” assumed to reflect vocabulary production.
But how can productive lexical knowledge be enhanced? Empirical research in this
area is fairly limited. As Schmitt (2019) rightly noted, "an under-researched area of
particular interest is how to push learners' knowledge from receptive mastery to the
point where they can independently use lexical items fluently and appropriately in their
own output” (p. 264). However, several scholars have made useful suggestions. Nation
(2007, 2013), for example, developed the four-strand principle, postulating that any
effective vocabulary development program should involve balanced attention to four
major components: language-focused practice, meaning-focused input, meaning-
focused output, and fluency development. At least two of these strands can be directly
related to enhancing productive vocabulary knowledge: meaning-focused output and
fluency development. The focus in meaning-focused output activities should be on the
successful communication of meaning. Laufer (2020) claims that this kind of practice
can help the L2 learner cross the receptive/productive vocabulary boundary. As for the
often-ignored fluency development strand, practice can also involve productive
activities, but these should always be fairly easy, so as to improve speed of access to
already known vocabulary.
Thus, the limited research on productive vocabulary knowledge development seems
to suggest that language learning programs may need to invest a lot of time and effort to
enhance learners’ ability to retrieve words in speaking and writing activities. Our results
suggest that this will be reflected not only in enhancing the productive knowledge of
individual words but also collocational knowledge. We hope that with recent calls for
empirical evidence in this area, researchers will start to conduct more studies that
examine the most useful activities to enhance productive vocabulary knowledge.

Limitations and future research


This study has a number of limitations that need to be addressed in future research.
First, our definition of “appropriate” collocations in the CPCT was confined to
frequency-based counts. Such a definition might have masked possible differences


between NSs and NNSs in their use of collocations. This may also be related to the
nature of the collocation knowledge measure employed. Following Frankenberg-Garcia
(2018), the CPCT in the present study instructed participants to provide “as many
collocates as possible.” NSs are likely to approach such a task very differently from
NNSs. As NSs do not need to “prove” they know the conventions of their native
language, they can deviate from the predictable and be creative (hence the large number
of responses provided by some NSs for certain items and the counterintuitive low
percentage of appropriate collocations based on corpus frequency). NNSs, by contrast,
probably deal with the task as a test of how well their knowledge approximates the L2
conventions. Thus, future research examining productive knowledge of collocations
may need to develop other measures that avoid such built-in bias when comparing NS
and NNS performance.
A second limitation of the study is related to item sampling. For this purpose, we
used the COCA list, which sets a relatively low dispersion threshold (a Juilland
dispersion of 0.3, in comparison to 0.6 in Dang et al., 2017, and 0.8 in Gardner &
Davies, 2014).
This may suggest that at least some lemmas in the COCA list were not evenly
distributed across the corpus. Moreover, the fact that we used a frequency list that is
based on American English makes our measures biased toward that dialect. We hope
that more lemma-based general-language frequency lists will be available soon to allow
further investigations of vocabulary knowledge.
Another limitation of the study is related to the sampling rate of the CPWT. After
multiple rounds of screening, fewer than 20 words were chosen to represent each
frequency level. Schmitt et al. (2001) suggest a sampling rate of 30 per 1,000 words to
reach acceptable reliability. When designing productive measures of word knowledge,
future research should include more items per frequency level. This can be done
through rigorous piloting and validation processes. Moreover, the fact that only the
first letter was provided as a clue to the target word in the CPWT (cf. Laufer & Nation's,
1999, PVLT) resulted in NSs producing orthographically similar synonyms in place of
target words (e.g., extending and enlarging in place of the target expanding). We hope
that more validation studies in the area of productive vocabulary testing can identify
the best way to elicit the required response in a controlled gap-fill format.
A fourth limitation of the study is that we only included three frequency levels (1K,
2K, 3K). This may have resulted in a ceiling effect which prevented differences from
emerging between higher-level NNSs and NSs. To overcome this limitation, lower
frequency levels should be included in future research to differentiate learners at
higher proficiency levels. Additionally, L1 congruency is established as an important
determinant of collocation knowledge (e.g., Nesselhauf, 2003; see “Background”
section). Therefore, it would be useful to include congruency as a variable in the
categorization of appropriate collocation responses. This should allow for exploring the
potential effects of several factors on the production of congruent/incongruent items.

Conclusion
The present study is, to our knowledge, the first attempt to explore the interrelationship
between productive word knowledge and productive knowledge of collocations. The
results suggest that productive word knowledge is associated with the appropriacy,
frequency, and strength of elicited collocations. We hope this study will open the door
for more research into productive knowledge of single words and collocations to better
understand factors that affect vocabulary development.

500 Suhad Sonbul et al.
Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/
10.1017/S0272263122000341.

Acknowledgments. We would like to thank Prince Sultan University for funding this research project
under Grant [Applied Linguistics Research Lab- RL-CH-2019/9/1]. We would also like to thank Professor
Norbert Schmitt for his useful comments on the initial design of the study. Thanks are also due to three
anonymous reviewers for their very useful comments which greatly improved the article. Any shortcomings
are entirely our own responsibility.

References
Aviad-Levitzky, T., Laufer, B., & Goldstein, Z. (2019). The new computer adaptive test of size and strength
(CATSS): Development and validation. Language Assessment Quarterly, 16, 345–368.
Bahns, J., & Eldaw, M. (1993). Should we teach EFL students collocations? System, 21, 101–114.
Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations (rev. ed.). John
Benjamins.
Bestgen, Y. (2017). Beyond single-word measures: L2 writing assessment, lexical richness and formulaic
competence. System, 69, 65–78.
Boers, F., & Lindstromberg, S. (2009). Optimizing a lexical approach to instructed second language acquisition.
Palgrave Macmillan.
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal
of Cognition, 1, 9.
Cowie, A. P. (1994). Phraseology. In R. E. Asher (Ed.), The encyclopedia of language and linguistics: Volume 6
(pp. 3168–3171). Pergamon Press.
Crossley, S. A., Salsbury, T., McNamara, D. S., & Jarvis, S. (2011). Predicting lexical proficiency in language
learner texts using computational indices. Language Testing, 28, 561–580.
Dang, T. N. Y., Coxhead, A., & Webb, S. (2017). The academic spoken word list. Language Learning, 67,
959–997.
Davies, M. (2008–). The Corpus of Contemporary American English (COCA): 560 million words, 1990–
present. https://corpus.byu.edu/coca/
Davies, M. (n.d.). The COCA frequency lists of English. https://www.wordfrequency.info/
Durrant, P., Siyanova-Chanturia, A., Kremmel, B., & Sonbul, S. (2022). Research methods in vocabulary
studies. John Benjamins.
Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and second language
speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 42, 375–396.
Evert, S. (2005). The statistics of word co-occurrences: Word pairs and collocations. [Doctoral dissertation].
Institut für maschinelle Sprachverarbeitung, University of Stuttgart.
Evert, S. (2008). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An
international handbook (extended manuscript of Chapter 58, https://lexically.net/downloads/corpus_
linguistics/Evert2008.pdf). Mouton de Gruyter.
Frankenberg-Garcia, A. (2018). Investigating the collocations available to EAP writers. Journal of English for
Academic Purposes, 35, 93–104.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus-based language learning research:
Identifying, comparing, and interpreting the evidence. Language Learning, 67, 155–179.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35, 305–327.
González Fernández, B., & Schmitt, N. (2015). How much collocation knowledge do L2 learners have? The
effects of frequency and amount of exposure. ITL—International Journal of Applied Linguistics, 166,
94–126.
González Fernández, B., & Schmitt, N. (2020). Word knowledge: Exploring the relationships and order of
acquisition of vocabulary knowledge components. Applied Linguistics, 41, 481–505.
Gyllstad, H. (2009). Designing and evaluating tests of receptive collocation knowledge: COLLEX and
COLLOMATCH. In A. Barfield & H. Gyllstad (Eds.), Researching collocations in another language:
Multiple interpretations (pp. 153–170). Palgrave Macmillan.

Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in Second Language Acquisition,
21, 303–317.
Howarth, P. (1996). Phraseology in English academic writing: Some implications for language learning and
dictionary making. Max Niemeyer.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge University Press.
Juilland, A. G., & Chang-Rodríguez, E. (1964). Frequency dictionary of Spanish words. Mouton.
Laufer, B. (2020). Evaluating exercises for learning vocabulary. In S. Webb (Ed.), The Routledge handbook of
vocabulary studies (pp. 351–368). Routledge.
Laufer, B., & Nation, I. S. P. (1999). A vocabulary-size test of controlled productive ability. Language Testing,
16, 33–51.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of
learners of English. Language Learning, 61, 647–672.
Lee, S., & Shin, S. (2021). Towards improved assessment of L2 collocation knowledge. Language Assessment
Quarterly, 18, 419–445.
McEnery, T., & Wilson, A. (2001). Corpus linguistics: An introduction, 2nd ed. Edinburgh University Press.
Meara, P., & Fitzpatrick, T. (2000). Lex30: An improved method of assessing productive vocabulary in an L2.
System, 28, 19–30.
Milton, J. (2013). Measuring the contribution of vocabulary knowledge to proficiency in the four skills. In C.
Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 Vocabulary acquisition, knowledge and use (pp. 57–78).
EUROSLA Monographs Series 2.
Miralpeix, I., & Muñoz, C. (2018). Receptive vocabulary size and its relationship to EFL language skills.
International Review of Applied Linguistics in Language Teaching, 56, 1–24.
Nation, I. S. P. (1990). Teaching and learning vocabulary. Newbury House.
Nation, I. S. P. (2007). The four strands. Innovation in Language Learning and Teaching, 1, 1–12.
Nation, I. S. P. (2012). The BNC/COCA word family lists. Document bundled with Range Program with
BNC/COCA Lists, 25. https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists
Nation, I. S. P. (2013). Learning vocabulary in another language, 2nd ed. Cambridge University Press.
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31, 9–12.
Nation, I. S. P., & Webb, S. (2011). Researching and analyzing vocabulary. Heinle Cengage Learning.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for
teaching. Applied Linguistics, 24, 223–242.
Nguyen, T. M. H., & Webb, S. (2017). Examining second language receptive knowledge of collocation and
factors that affect learning. Language Teaching Research, 21, 298–320.
Nizonkiza, D. (2012). Quantifying controlled productive knowledge of collocations across proficiency and
word knowledge levels. Studies in Second Language Learning and Teaching, 2, 67–92.
Peters, E. (2020). Factors affecting the learning of single-word items. In S. Webb (Ed.), The Routledge
handbook of vocabulary knowledge (pp. 125–142). Routledge.
R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. https://www.R-project.org
Schmitt, N. (2019). Understanding vocabulary acquisition, instruction, and assessment: A research agenda.
Language Teaching, 52, 261–274.
Schmitt, N., Dunn, K., O’Sullivan, B., Anthony, L., & Kremmel, B. (2021). Introducing knowledge-based
vocabulary lists (KVL). TESOL Journal, 12, e622.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new versions
of the Vocabulary Levels Test. Language Testing, 18, 55–88.
Schmitt, N., & Zimmerman, C. B. (2002). Derivative word forms: What do learners know? TESOL Quarterly,
36, 145–171.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study
perspective. The Canadian Modern Language Review, 64, 429–458.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening, reading and writing. The Language Learning
Journal, 36, 139–152.
Walters, J. (2012). Aspects of validity of a test of productive vocabulary: Lex30. Language Assessment
Quarterly, 9, 172–185.

Webb, S. (2008). Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language
Acquisition, 30, 79–95.
Webb, S. (2021). Word families and lemmas, not a real dilemma: Investigating lexical units. Studies in Second
Language Acquisition, 43, 973–984.
Webb, S., & Kagimoto, E. (2009). The effects of vocabulary learning on collocation and meaning. TESOL
Quarterly, 43, 55–77.
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated Vocabulary Levels Test: Developing and validating
two new forms of the VLT. ITL—International Journal of Applied Linguistics, 168, 34–70.
Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.
Wolter, B., & Gyllstad, H. (2011). Collocational links in the L2 mental lexicon and the influence of L1
intralexical knowledge. Applied Linguistics, 32, 430–449.
Wysocki, K., & Jenkins, J. R. (1987). Deriving word meanings through morphological generalization. Reading
Research Quarterly, 22, 66–81.
Yan, H. (2010). Study on the causes and countermeasures of the lexical collocation mistakes in college
English. English Language Teaching, 3, 162–165.
Zareva, A., Schwanenflugel, P., & Nikolova, Y. (2005). Relationship between lexical competence and language
proficiency: Variable sensitivity. Studies in Second Language Acquisition, 27, 567–595.

Cite this article: Sonbul, S., El-Dakhs, D. A. S. and Masrai, A. (2023). Second language productive
knowledge of collocations: Does knowledge of individual words matter?. Studies in Second Language
Acquisition, 45, 480–502. https://doi.org/10.1017/S0272263122000341

Studies in Second Language Acquisition (2023), 45, 503–525
doi:10.1017/S0272263122000377

RESEARCH ARTICLE

A longitudinal study into learners’ productive collocation
knowledge in L2 German and factors affecting the learning
Griet Boone*, Vanessa De Wilde and June Eyckmans
Ghent University, Ghent, Belgium
*Corresponding author. E-mail: Griet.Boone@UGent.be

(Received 24 June 2021; Revised 22 July 2022; Accepted 05 August 2022)

Abstract
This longitudinal study explored the roles of item- and learner-related variables in L2
learners’ development of productive collocation knowledge (L1 = Dutch; L2 = German;
N = 50). Learners’ form recall knowledge of 35 target collocations was measured three
times over a 3-year period. The item-related variables investigated were L1-L2 congruency,
corpus frequency, association strength, and imageability. We also explored the learner-
related variables L2 prior productive vocabulary knowledge and L2 immersion. Mixed-
effects regression modeling indicated a significant effect of time, congruency, and prior
productive vocabulary knowledge on learners’ collocation learning. While learners’ knowl-
edge of congruent collocations remained relatively stable after year one, knowledge of
incongruent collocations increased significantly. Learners’ prior productive vocabulary
knowledge was clearly associated with growth of productive collocation knowledge, but
besides overall growth there were instances of attrition.

Introduction
Various types of formulaic language, including collocations, are widespread in language
and are of great importance in learning an additional language (e.g., Erman & Warren,
2000; Wray, 2002). Especially when it comes to the fluent, accurate, and idiomatic
production of both spoken and written language, collocation knowledge plays a crucial
role (e.g., Sinclair, 1991). However, despite the large and growing number of pertinent
studies, and the fact that collocations have been shown to be quite challenging even for
advanced L2 learners (e.g., Boers et al., 2014; Laufer & Waldman, 2011; Nesselhauf,
2003), relatively few studies have addressed the longitudinal development of learners’
productive collocation knowledge. The few studies that have addressed this subject
have mainly focused on learners’ use of collocations in (academic) writing (e.g.,
Edmonds & Gudmestad, 2021; Li & Schmitt, 2010; Siyanova-Chanturia, 2015;
Siyanova-Chanturia & Spina, 2020) or they were intervention studies designed to test
the effect of a specific type of input on the incidental acquisition of collocations (e.g., Vu
& Peters, 2021). These longitudinal studies have undoubtedly provided valuable

© The Author(s), 2022. Published by Cambridge University Press.

https://doi.org/10.1017/S0272263122000377 Published online by Cambridge University Press


504 Griet Boone et al.

insights into learners’ collocation development. Even so, much remains to be learned
about how learners’ L2 collocation knowledge develops over a longer time span as well
as about the factors that affect this knowledge.
In a foreign language learning context, vocabulary and collocation learning can
occur both intentionally and incidentally (e.g., Hulstijn, 2003). The term “intentional”
has commonly been used in collocation research when investigating the effectiveness of
different forms of classroom-based teaching, and “incidental” when examining
learners’ collocational gains after they engaged in meaning-focused communicative
activities such as reading or viewing television (for a review see Szudarski, 2017).
However, language learners—especially university foreign language majors who wish
to develop their language skills for professional or personal reasons—might also have
the intention to learn when reading or listening in the L2, as “coming across unfamiliar
words during reading may trigger different kinds of processes, from basic visual intake
and semantic integration to deliberate attempts to encode form and derive meaning”
(Elgort et al., 2018, p. 363). Therefore, we believe that the term “contextual learning”
from Elgort et al. (2018) is the most suitable term to refer to collocation learning in this
study, in which we aim to investigate university students’ productive form recall
knowledge of L2 German collocations by focusing on 35 specific target collocations
and by examining a range of item-related and learner-related variables.

Background
Identifying collocations
In L2 research, there is a firm consensus that collocations (e.g., pay attention) are
one type of formulaic language, alongside other types of multiword expressions such
as idioms (e.g., it’s a piece of cake), conventional situational expressions (e.g., nice to
meet you), and lexical bundles (e.g., I don’t know if). In a broad sense, the concept
“collocation basically refers to a syntagmatic relationship among words which
co-occur” (Wood, 2019, p. 31). However, precise definitions have varied according
to researchers’ analytical approach. Two approaches, the phraseological and the
frequency-based, have been predominant in past research (e.g., Granger & Paquot,
2008). The former approach identifies a collocation as a type of restricted word
combination based on the semantic and/or syntactic relationship between two
(or more) words (e.g., Howarth, 1998). The frequency-based approach sees colloca-
tions as sets of words that have a high statistical probability of appearing together in
natural language (e.g., Firth, 1957; Sinclair, 1991). Granger and Paquot (2008) argued
for a definition that takes account of frequency, semantics, and syntax all together. In
their hybrid view, collocations are seen as habitually co-occurring lexical partnerships
that have relatively transparent meanings (e.g., make a mistake, strong coffee) unlike,
say, idioms. We adopt this hybrid view. Specifically, we take a collocation to be a
word combination that (a) represents a specific syntactic pattern (e.g., adjective +
noun, verb + noun), (b) occurs within a given word span in a corpus, and (c) has a
relatively transparent meaning.
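Criterion (b), co-occurrence within a given word span, is straightforward to make concrete. The sketch below, a simplified frequency-based extraction under assumed whitespace tokenization, collects every word that appears within a fixed window around a node word; the function name and span size are illustrative choices, not the authors’ procedure.

```python
def cooccurrences(tokens, node, span=4):
    """Collect words occurring within +/- `span` tokens of each occurrence
    of `node` (the frequency-based notion of a collocation window)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            hits.extend(tokens[j] for j in range(lo, hi) if j != i)
    return hits

tokens = "she did not make a mistake in the end".split()
print(cooccurrences(tokens, "make", span=2))  # ['did', 'not', 'a', 'mistake']
```

Counting such hits across a corpus, then keeping only the pairs that match a target syntactic pattern and have a transparent meaning, approximates the hybrid definition adopted here.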

Item-related and learner-related variables affecting L2 collocation knowledge


Previous research on the processing and use of collocations indicates that L2
collocation development is influenced by item-related and learner-related variables.
Item-related variables reported to influence the acquisition and processing of L2

Longitudinal study into learners’ productive collocation knowledge 505

collocations include L1-L2 congruency (e.g., Ding & Reynolds, 2019; Vu & Peters, 2021;
Wolter & Gyllstad 2011, 2013), collocation frequency (e.g., Durrant, 2014; Wolter &
Gyllstad, 2013; Wolter & Yamashita, 2018), and frequency of node words (e.g., Nguyen
& Webb, 2017). Important learner-related variables include learners’ knowledge of
single-word L2 vocabulary (e.g., Gyllstad, 2009; Nguyen & Webb, 2017; Vilkaitė, 2017;
Vu & Peters, 2021) and L2 immersion (e.g., Edmonds & Gudmestad, 2021; Siyanova &
Schmitt, 2008). Although these variables have been shown to have a (positive) influence
on L2 collocation learning, they are rarely studied together in a single study. With
respect to incidental collocation learning, Vu and Peters (2021) explored the effect of
three different modes of reading (reading-only, reading-with-listening, and reading
with textual input enhancement), prior vocabulary knowledge, and five item-related
variables (congruency, frequency of occurrence, Mutual Information [MI] score, cor-
pus frequency, and type of collocation). They found that all three modes of reading
resulted in learning gains, with a superior effect for reading with textual input
enhancement. Learners’ prior vocabulary knowledge and congruency were found to
be significant predictors of the incidental learning. This study was carried out over a
9-week period, but to the best of our knowledge, no study to date has investigated the
aforementioned variables over a longer time span.

Item-related variables
Congruency
It has been widely observed that even advanced L2 learners produce unconventional
and perhaps odd L2 collocations owing to overreliance on word-for-word translation
from their L1 (e.g., Laufer & Waldman, 2011; Nesselhauf, 2003). In most studies on
L1 influence, the present study included, collocations are considered “congruent”
in L1 and L2 if there is a word-for-word translation of the L1 expression for the
concept that the learner has in mind, and “incongruent” if there is no such translation
equivalent (e.g., Nesselhauf, 2005; Wolter & Gyllstad, 2013). An example of a
congruent German collocation for Dutch learners of German is German eine Rolle
spielen–Dutch een rol spelen (“play a role”). An incongruent collocation is German
in Ruhe lassen–Dutch met rust laten (“leave alone”), because the word-for-word
translation would be *mit Ruhe lassen (*“leave with silence”). Note that only one
word (the preposition) of in Ruhe lassen is incongruent according to the preceding
definition. However, there are collocations that have more than one incongruent part,
for example German Wert legen (auf)–Dutch belang hechten (aan) (“attach importance
[to]”). In our study, a collocation is considered incongruent when there is no literal
translation equivalent of at least one of the constituent parts.
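This operationalization can be stated as a simple check: a pair counts as congruent only if every constituent word has a word-for-word L1 equivalent that actually occurs in the L1 expression. The sketch below uses a tiny hypothetical German-to-Dutch literal lexicon; it illustrates the criterion, not the authors’ actual coding procedure.

```python
# Hypothetical word-for-word German-to-Dutch lexicon, for illustration only.
LITERAL_DE_NL = {
    "eine": "een", "Rolle": "rol", "spielen": "spelen",
    "in": "in", "Ruhe": "rust", "lassen": "laten",
}

def is_congruent(german, dutch, lexicon=LITERAL_DE_NL):
    """Congruent iff each German word's literal Dutch equivalent occurs in
    the Dutch collocation; one non-literal constituent already makes the
    whole pair incongruent (the study's criterion)."""
    dutch_words = dutch.split()
    return all(lexicon.get(w) in dutch_words for w in german.split())

print(is_congruent("eine Rolle spielen", "een rol spelen"))  # True
print(is_congruent("in Ruhe lassen", "met rust laten"))      # False: "in" has no literal match
```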
The effect of L1-L2 congruency on the learning and processing of L2 collocations has
been well documented in the SLA literature. For example, Peters (2016) and Vu and
Peters (2021) reported evidence that learners’ ability to recall forms is generally better
for congruent collocations than for incongruent. Additionally, it has been found that
congruent L2 collocations are processed more quickly and accurately than incongruent
ones (e.g., Ding & Reynolds, 2019; Wolter & Gyllstad, 2011, 2013; Wolter & Yamashita,
2018; Yamashita & Jiang, 2010). The fact that this effect is observed even in very
advanced learners suggests that there might be a continuing influence of the L1 (Wolter
& Gyllstad, 2013). To the best of our knowledge, no study to date has examined the
influence of congruency over multiple years to see how, or even whether, the
L1-L2 congruency effect changes.

Frequency and association strength
Usage-based theories hold that language learning is experience driven and that an
extremely important fact of this experience is that different vocabulary items occur in
input with different frequencies (Ellis, 2002). In SLA research, corpus frequencies are
used to estimate real-world input frequencies. There is much evidence that learners
tend to acquire high-frequency words before low-frequency words because high-
frequency words are encountered more often (e.g., Ellis, 2002; Nation, 2001). Some
researchers suggest that frequency matters in this way for collocations as well (e.g.,
Durrant, 2014). It has been reported that the learnability of a collocation is not only
influenced by the frequency of the collocation as a whole but also by the frequencies of
its constituent words (e.g., Nguyen & Webb, 2017; Wolter & Yamashita, 2018).
However, Vu and Peters (2021) found that corpus frequency of the target collocations
(consisting of high-frequency words) was not a significant predictor of students’
learning gains. Given these mixed findings, more research is needed into the effect of
corpus frequency on L2 collocation development.
Inextricably linked to corpus frequency is interword association strength, which is
often measured by t-scores or MI scores. Rankings based on t-scores tend to highlight
very frequent word combinations (e.g., good example), whereas high MI scores tend to
highlight relatively infrequent combinations made up of words that are strongly
associated (e.g., tectonic plates) (Durrant & Schmitt, 2009). The t-score is computed
as an adjusted value of collocation frequency based on the raw frequency minus
random co-occurrence frequency divided by the square root of the raw frequency
(Gablasova et al., 2017). The MI score compares the probability of observing the two
words of the collocation to the probabilities of observing the words independently
(Church & Hanks, 1990). Both measures are widely used in learner corpus studies on L2
writing, in which collocations are usually extracted from learners’ written productions.
The few longitudinal studies on this topic, with participants in an immersion context,
present differing results in terms of change over time. Yoon (2016) found no significant
changes in MI scores when comparing the essays written at the start and the end of one
semester. Li and Schmitt (2010) found a moderate increase in the t-score, whereas the
MI scores remained relatively stable, although they also found considerable variation
between the individual students. In a large-scale learner corpus study Siyanova-
Chanturia and Spina (2020) did not find that MI scores underwent a statistically
significant change over time. However, in an earlier study, Siyanova-Chanturia
(2015) found that learners’ writings at the end contained not only more higher-
frequency combinations but also more collocations with relatively high MI scores than
did the writings at the beginning. Edmonds and Gudmestad (2021) also found a change
in MI scores in learners’ output 8 months after these learners’ time abroad but no
change in the overall frequencies of collocations. A possible explanation for these mixed
findings relates to the immersion experience, in which the degree of language acqui-
sition might not only depend on the amount of exposure but also on the comparatively
active engagement with the L2 in social interactions (e.g., González Fernández &
Schmitt, 2015).
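Both association measures described above reduce to simple formulas over four corpus counts: with O the observed co-occurrence frequency and E = f1 × f2 / N the frequency expected by chance (f1, f2 the word frequencies, N the corpus size), t = (O − E) / √O and MI = log2(O / E). A minimal sketch, with invented counts in the usage lines:

```python
import math

def association_scores(observed, f_node, f_collocate, corpus_size):
    """t-score and MI for a word pair, computed from raw corpus counts.

    observed     -- co-occurrence frequency of the pair
    f_node       -- corpus frequency of the node word
    f_collocate  -- corpus frequency of the collocate
    corpus_size  -- total number of tokens in the corpus
    """
    expected = f_node * f_collocate / corpus_size  # chance co-occurrence
    t_score = (observed - expected) / math.sqrt(observed)
    mi_score = math.log2(observed / expected)
    return t_score, mi_score

# Invented counts: a frequent, loosely associated pair ("good example" type)
# versus a rare but tightly bound pair ("tectonic plates" type).
t1, mi1 = association_scores(1_000, 200_000, 150_000, 100_000_000)
t2, mi2 = association_scores(50, 300, 400, 100_000_000)
print(round(t1, 1), round(mi1, 1))  # high t, modest MI
print(round(t2, 1), round(mi2, 1))  # lower t, high MI
```

The example reproduces the contrast noted by Durrant and Schmitt (2009): t-score rankings favor sheer frequency, whereas MI rewards tight association between otherwise infrequent words.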
Studies of L2 collocation knowledge in a nonimmersion context seem to point to a
lack of sensitivity of L2 learners to the association strength between words. For example,
using a prompted productive collocation test, González Fernández and Schmitt (2015)
measured 108 Spanish learners’ productive form recall knowledge of 50 English
collocations that vary widely in respect of corpus frequency, t-score, and MI score
and observed the following Pearson’s correlations with learners’ collocation scores:


raw corpus frequency: .45; t-score: .41; MI score: –.16. The authors concluded that
“increasing the ‘tightness’ of the combinational bonding does not seem related to
collocation learning” (p. 107). Weak mean correlations between MI scores and learners’
collocation knowledge were found in a meta-analysis carried out by Durrant (2014). He
argued that L2 learners, unlike L1 speakers, may notice only whole collocation fre-
quency and not association strength between the constituent parts, concluding that “L1
learners notice both collocations and their components, while L2 learners focus only on
the whole collocation” (Durrant, 2014, p. 472). In contrast, Wray (2002) suggested that L2
learners tend to focus on individual words. Again, findings have been mixed and
further research is needed.

Imageability
Imageability, defined as a lexeme’s “capacity to evoke a mental image” (Steinel et al.,
2007, p. 449) and concreteness, defined as “the degree to which the concept denoted by
a word refers to a perceptible entity” (Brysbaert et al., 2014a, p. 904) are often used
interchangeably because of the typically high correlation between both measures (e.g.,
Brysbaert et al., 2014b). Both imageability and concreteness are known to be potent
facilitators of L2 word learning (e.g., De Groot & Keijzer, 2000; Ding et al., 2017; Ellis
& Beaton, 1993). According to Steinel and colleagues (2007), imageability may also
facilitate the learning of L2 idioms. Because imageability has not been investigated yet as
a possible variable in collocation learning, our analysis also took account of this
variable.

Learner-related variables
Prior L2 vocabulary knowledge
A learner-related variable thought to be especially important for L2 collocation learning
is learners’ prior L2 vocabulary size, that is, the number of known words,
operationalized as “knowledge of the form–meaning connection” (Schmitt, 2014, p. 915). Vilkaitė
(2017), who investigated the effects of adjacency and prior vocabulary knowledge on
the incidental acquisition of L2 collocations, found that learners’ prior receptive
vocabulary knowledge—as measured by the Vocabulary Levels Test (VLT) (Nation,
2001; Schmitt et al., 2001)—had a positive effect on the learning of collocations.
Specifically, with an increase of one point in a learner’s VLT score, the predicted
probability of learning a collocation increased by 10% in the immediate posttest and by
13% in the delayed posttest. Vu and Peters (2021) found that learners with a higher
score on the VLT had a better chance of learning the form of the collocation: With an
increase of one unit in the VLT score, the odds of a correct response increased by 2.2%.
However, in the study of Toomer and Elgort (2019), participants’ L2 vocabulary
knowledge—also measured by the VLT—seemed not to affect L2 learners’ gains of
collocations. What should be remarked though, is that the VLT measures form
recognition. In general, researchers agree that learners’ mastery of form (and meaning)
recall lags behind their mastery of form (and meaning) recognition (e.g., Schmitt, 2014)
and that productive tasks are often more demanding than receptive ones (e.g., Webb,
2008). Thus, if receptive vocabulary size is an influential factor in collocation learning,
then it may be assumed that learners with a larger productive vocabulary size tend to be
especially able to acquire productive collocation knowledge. Therefore, it seemed worth
investigating whether learners’ prior productive vocabulary knowledge—as measured

by a German version of the Productive VLT (PVLT)—influences the acquisition of
collocations over time.
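One note on interpreting figures like those above: an odds increase per VLT point compounds multiplicatively across points, which is not the same as a percentage-point change in probability. A sketch, reading Vu and Peters’s reported 2.2% as an odds ratio of 1.022 per point, with an invented baseline probability of .30:

```python
# Compounding a per-point odds ratio over a multi-point VLT difference.
# The odds ratio (1.022) and baseline probability (0.30) are illustrative
# assumptions, not values taken from the studies cited.
def updated_probability(p_baseline, odds_ratio, points):
    odds = p_baseline / (1 - p_baseline) * odds_ratio ** points
    return odds / (1 + odds)

p = updated_probability(0.30, 1.022, 10)
print(round(p, 3))  # roughly 0.35: a modest but cumulative effect over 10 points
```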

L2 immersion
Another factor that has been shown to influence the learnability of collocations is
learners’ exposure to the L2. Usage-based theories predict that extensive exposure is
needed for language learning in general (e.g., Ellis, 2002) and for collocation learning in
particular (e.g., Durrant & Schmitt, 2010). However, González Fernández and Schmitt
(2015) have argued that “it may not be exposure per se that is important, but the kind of
high-quality engagement with language that presumably occurs in a socially-integrated
environment, where learners wish to use the L2 for meaningful and pleasurable
communication” (p. 101). Thus, it would be reasonable to expect that spending time
in an L2 environment may facilitate collocation learning. Some studies have indeed
observed a positive effect of L2 immersion abroad on learners’ collocation knowledge
(González Fernández & Schmitt, 2015; Groom, 2009; Macis & Schmitt, 2017; Siyanova
& Schmitt, 2008). In other studies, it was found that a stay in the target language country
did not lead to an appreciably higher or more accurate use of L2 collocations (Boone,
2021; Li & Schmitt, 2010; Nesselhauf, 2005). Because findings have been mixed, L2
immersion will be added as a variable in our analysis.

The present study


Although the studies reviewed above have undoubtedly contributed to a deeper
understanding of L2 collocation development, more longitudinal studies are needed
(e.g., Siyanova-Chanturia & Spina, 2020). First, the longitudinal studies outlined above
ranged in duration from 9 weeks (Vu & Peters, 2021) to 21 months (Edmonds &
Gudmestad, 2021). Examining development over a longer period may contribute to a
better understanding of the language learning process, which is often dynamic, com-
plex, and long-ongoing (e.g., Larsen-Freeman, 1997). Second, most longitudinal studies
of collocation development have been corpus studies focusing on L2 learner output in
writing, elicited by means of writing assignments, from which collocations are
extracted. Longitudinal studies of L2 productive collocation knowledge (i.e., form
recall) of specific target items, rather than collocation use, are scarce, but might provide
additional insights on the learnability of specific collocations. Third, with the exception
of Vu and Peters (2021), previous relevant longitudinal studies did not take into
account a broad range of item- and learner-related variables. Examining item-related
variables may help to identify characteristics that make collocations comparatively easy
or hard to learn, and whether the effect of these variables changes during the learning
process. Additionally, it is crucial to take account of individual learner profiles (e.g.,
Boers, 2020) to see how these learner-related variables influence collocation
development. Fourth, although L2s other than English are starting to be explored (e.g.,
Edmonds & Gudmestad, 2021; Siyanova-Chanturia, 2015), the majority of studies in
the field so far have focused on the use or acquisition of collocations in L2 English. In sum,
our study aimed to add to the existing body of research on collocation development by:
(a) adopting a 3-year longitudinal design, (b) testing learners’ productive collocation
knowledge by focusing on their correct or incorrect (written) production of a given
collocation in a form recall test format, (c) taking into account several item- and
learner-related variables, and (d) exploring an underrepresented L2, namely German.

Longitudinal study into learners’ productive collocation knowledge 509

The research questions are the following:

1) How does learners’ collocation knowledge develop over time?
2) How do several item-related variables (congruency, corpus frequency, association
strength, and imageability) influence this development?
3) How do two learner-related variables (prior productive vocabulary knowledge and
L2 immersion) influence this development?
4) Does the influence of these item-related and learner-related variables change over
time?

Methodology
Participants
The participants in this study were 50 L1 Dutch undergraduate students (9 male,
41 female), majoring in German and an additional foreign language at a Belgian
university. Twenty-one of them were studying French, 16 English, 9 Spanish, 2 Italian,
1 Russian, and 1 Turkish as their extra foreign language. They were all exposed to the
same formal classroom instruction in German at university (190 contact hours in the
first year, 215 in the second, and 140 the third year). Their bachelor’s program consists
of an in-depth study of Dutch and two foreign languages and has an explicit focus on
grammar and vocabulary in the first year, whereas there is more language practice (e.g.,
within translation, speaking, and writing courses) in the second and third year. No prior
knowledge of German is required for the program, and the targeted level for graduating
is a B2/C1 level (upper-intermediate for speaking and writing; advanced for listening
and reading) according to the Common European Framework of Reference (Council of
Europe, 2001). As a curriculum requirement, during the third academic year the
students were expected to participate in a compulsory 5-month exchange program
abroad. Of the 43 students who participated in the third collocation test, 24 of them
went to a German-speaking country and 19 spent the semester in a non-German-
speaking country. All students continued to study German at the host universities.
Data collection started at the beginning of students’ university program and ended
after 3 years. All 50 students participated in the PVLT and in at least two data collection
points of the collocation test. Participants took part on a voluntary basis and provided
informed consent.

Target collocations
We wanted to develop a sample of targets representative of collocations that students
might encounter during a learning trajectory aiming for a B2/C1 level. Identification of
representative collocations for this level is complicated because of the variety and sheer
number of collocations, and because there is no published, validated list of collocations
to work from. However, there is an official German B1 word list (Glaboniat et al., 2013),
which comprises about 2,400 lexical items that learners should know at this level. A
number of collocations can be found in the example sentences next to the lexical items
in this list (e.g., packen—Ich muss noch meinen Koffer packen, “I still have to pack my
suitcase”). For this study, we selected only adjective-noun, noun-verb and preposition-
noun-verb collocations. The German collocation dictionaries of Quasthoff (2011) and
Häcki Buhofer et al. (2014) were then consulted, and candidate collocations appearing
in at least one of these dictionaries were selected. The result was a pool of


55 collocations. To make sure there was sufficient variety in terms of frequency and
association strength, these collocations were cross-checked with the German Web
Corpus 2013 (deTenTen), a 16.5-billion-word corpus of texts collected from the
internet, using the concordance tool in SketchEngine (https://www.
sketchengine.eu/). We also checked that the targets did not appear in the vocabulary
lists of students’ course textbooks and had not been addressed explicitly in vocabulary
class, which was confirmed by the teachers.

Instruments
Productive collocation test
To measure students’ productive collocation knowledge of the target collocations, a
productive collocation form recall test was developed, which took the form of a gap-fill
translation test. Students had to complete the German sentences by adding the
appropriate German collocation, as indicated by an L1 (Dutch) translation provided
in parentheses. For example: Zwischen Gesundheit und Armut besteht ein (nauw
verband) _________________. (“There is a close link between health and poverty.”)
We ran a pilot study, in which we administered the collocation test to 77 first-year
students of German. The aim was to test the internal consistency of the 55 items and to
identify and omit ambiguous candidate collocations. Internal consistency of the items
was measured using Cronbach’s alpha and was found to be high (α = .92). However, the
pilot showed that several items were not suitable for the purposes of our study. For
example, the Dutch collocation moeite doen (“make an effort”) can be translated in
German with the collocation sich Mühe geben, but also with the reflexive verb sich
anstrengen (which is not a collocation). Another issue was that there were collocations
that have multiple correct translations in German. An example is Nebel (“fog”), which
occurs in dichter Nebel, dicker Nebel, starker Nebel—all meaning “dense fog” (Dutch
dichte mist). After exclusion of the collocations deemed problematic for this reason,
35 collocations remained (13 adjective-noun, 15 noun-verb, and 7 preposition-noun-
verb). Cronbach’s alpha showed a good internal consistency of the 35 items in the task
(α = .87). The same 35 items were used in the same test format each year, but the items
were put in randomized order, and students did not receive feedback on their perfor-
mance. The 35 target collocations can be found in Appendix A, the collocation test in
Appendix B (Supplementary Material).
Because we did not want to attract participants’ attention to the targets at the very
beginning of their learning trajectory, we did not administer a pretest (e.g., Toomer &
Elgort, 2019). Instead, to estimate baseline knowledge of the target collocations we
collected proxy pretest scores from a very similar sample of learners: 32 Dutch-speaking
undergraduate students of German at the beginning of their first year of university. The
test was the same as the year-end tests in the main study. The proxy pretest scores show
that only one congruent collocation, Ziel erreichen (“achieve a goal”), had a notable
mean score, 0.47 (SD = 0.50). The mean score for the remaining 16 congruent collocations
was 0.09, and the mean score for the 18 incongruent collocations was 0.01. These scores show
that there was negligible productive knowledge of the target collocations in this group.

Item-related and learner-related variables in the study


As part of the study, we collected measures for several potential predictors of learners’
L2 collocation knowledge. Appendix C (Supplementary Material) gives the values for
all variables.


Congruency: To determine the effect of L1 congruency, the target items were labeled
as congruent (1) or incongruent (0), based on the ratings of 11 university lecturers who
were asked to decide whether the target item has a literal L1 translation equivalent in the
L2 (“+congruent”) or not (“–incongruent”). These university lecturers had between
4 and 30 years of experience in teaching German. To estimate the reliability of the
ratings for the 35 targets the intraclass correlation coefficient (ICC) was computed. The
relevant version of the ICC as a measure of consistency is “2-way average random
raters.” We used the psych package in R (Revelle, 2020) and found ICC = 0.95, 95%
confidence interval (CI) [.93, .97], which indicates excellent reliability (Koo & Li, 2016).
The target items were assigned the congruency values 1 or 0 depending on the rating of
the majority of the lecturers. The result was that 17 collocations were classified as
congruent (1) and 18 as incongruent (0). Of the 18 incongruent collocations,
13 contain one word that does not translate literally from Dutch, 4 of them two words,
and 1 three words (including the preposition). Congruency was included as a dichot-
omous variable in the analysis.
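The reliability computation described above can be illustrated with a plain-Python sketch of a two-way, average-measures, consistency ICC, for which the formula reduces to (MSR − MSE) / MSR. This is an illustrative reimplementation, not the psych package's code, and the raters-by-items matrix shown is invented, not the study's data.

```python
def icc_consistency_avg(ratings):
    """Two-way, average-measures, consistency ICC: (MSR - MSE) / MSR.

    `ratings` is a list of rows, one per rated item, each containing
    the scores of the same k raters in the same order.
    """
    n = len(ratings)             # number of items
    k = len(ratings[0])          # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_items = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_error = ss_total - ss_items - ss_raters                # residual
    ms_items = ss_items / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_items - ms_error) / ms_items

# Two raters who differ only by a constant offset are perfectly consistent:
print(icc_consistency_avg([[1, 2], [2, 3], [3, 4]]))  # 1.0
```

Because the consistency variant removes the between-rater column effect, systematic rater strictness does not lower the coefficient; only rank disagreements do.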
Frequency and association strength: For all 35 targets, raw corpus frequency values
(for the entire collocation and for the noun), t-score, and MI score were obtained from
the German Web Corpus 2013. We used the SUBTLEX Zipf scale (Van Heuven et al.,
2014) to log-transform all frequency counts with the formula log10(frequency per
million words) + 3. The advantage of this scale is that it is logarithmic and that the values
are easy to interpret (Van Heuven et al., 2014).
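The Zipf transformation is straightforward to apply to raw counts; a minimal sketch, in which the counts and corpus size passed in are hypothetical:

```python
import math

def zipf_value(raw_count, corpus_word_count):
    """Zipf scale (Van Heuven et al., 2014):
    log10(frequency per million words) + 3."""
    per_million = raw_count * 1_000_000 / corpus_word_count
    return math.log10(per_million) + 3

# One occurrence per million tokens corresponds to Zipf 3; a thousand
# occurrences per million corresponds to Zipf 6.
```

Adding 3 keeps the values positive for items as rare as one occurrence per billion words, which is why the scale is easy to read off directly.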
Imageability: To determine imageability, we collected subjective ratings of image-
ability on a 7-point Likert scale from 17 very advanced L2 speakers of German (all holding
a master’s degree in German language/literature and using German regularly in their jobs
or daily lives) and seven L1 German speakers for a list of 66 collocations, including the
35 target collocations and 31 nontarget collocations. One purpose of the added 31 items
was to allow our new imageability ratings to be validated (i.e., compared with a previously
published set of collocation ratings). Seven of the 31 nontarget collocations functioned as
list-initial “calibrator” items intended to serve as examples of the various levels of the
rating scale. All 31 nontarget collocations were selected from the database compiled by
Citron et al. (2016) that gives concreteness ratings of 619 German phrases. Given the
typically strong correlations between ratings of concreteness and imageability and
because no collections of imageability ratings for German are yet available, we used
the database of Citron et al. (2016) to be able to validate our ratings.
The randomized list of to-be-rated collocations was presented to the raters. To
increase the reliability of the ratings, raters were invited to rate the collocations twice,
with a pause before the second round. For the second round, the collocations were
presented in a new randomized order. Sixteen of the raters rated the collocations twice
and eight rated them once. The two sets of ratings from the raters who completed the
ratings twice were averaged to yield a single set of mean ratings for that person. Finally,
a mean rating across all raters was calculated for each collocation (Appendix D).
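The two-step averaging described above (within rater first, then across raters) prevents the raters who completed two rounds from being weighted double. A minimal sketch with invented 7-point ratings:

```python
def mean_imageability(ratings_by_rater):
    """Average each rater's one or two rating rounds first, then average
    those per-rater means across all raters for one collocation."""
    per_rater = [sum(rounds) / len(rounds) for rounds in ratings_by_rater.values()]
    return sum(per_rater) / len(per_rater)

# Hypothetical ratings for one collocation: raters A and C rated twice,
# rater B once.
example = {"A": [6, 7], "B": [5], "C": [4, 4]}
```

Here each rater contributes exactly one value to the final mean, regardless of how many rounds they completed.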
To validate our ratings, we calculated correlations between the existing concreteness
ratings (Citron et al., 2016) and our new imageability ratings for the 31 nontarget collocations,
finding r = .79, CI [.61, .89], which is very similar to the range reported for correlations
between imageability and concreteness in the literature (e.g., Brysbaert et al., 2014b). To
estimate the reliability of the ratings for the 35 target collocations we calculated the
appropriate version of the ICC = 0.92, CI [.88, .95]. This indicates excellent reliability
(Koo & Li, 2016).
Prior productive vocabulary size: To assess students’ prior vocabulary, we adminis-
tered the PVLT for German developed by the German Institute for Test Research and


Test Development in Leipzig and modeled after Nation’s PVLT for English (Nation,
2001). The test contains five subtests, which measure learners’ vocabulary knowledge
on the vocabulary levels of 1,000, 2,000, 3,000, 4,000, and 5,000 words, respectively.
These levels are based on the frequency lists derived from the Herder/BYU-German
corpus (Jones et al., 2006). There are 18 cloze items per subtest (i.e., per frequency level).
Each target word is embedded in one or two sentences. To disambiguate the target
items, the first letter (or letters) of a targeted word is provided. For example: In dem
Dorf steht eine alte Ki_______________. (“In the village, there is an old
ch__________.”)
L2 immersion: All students participated in a compulsory 5-month exchange pro-
gram in a country in which one of their languages of study is spoken. In our analysis,
L2 immersion or study abroad (SA) was coded as SA_TL if the participant spent the
semester abroad in a German-speaking country (n = 24), or SA_nonTL if otherwise
(n = 19). It should be remarked, however, that for Time 1 and Time 2 L2 immersion was
not coded, because students only went abroad between Time 2 and Time 3.

Procedure
Participants were tracked for 3 academic years. Both the PVLT and the collocation test
were administered in class and students could not use a dictionary. In total, there were
four test sessions. The PVLT was administered at the beginning of students’ first year of
university, as a paper-and-pencil test. The time needed for completion was 30 minutes.
The first collocation test was administered at the end of students’ first year of university,
the second at the end of the second year, and the third at the end of the third year. The
first two versions of the collocation test were taken as paper-and-pencil tests, but due to
the COVID-19 pandemic, the third version had to be administered online. The time
needed to complete this test was 25 minutes.

Scoring and analyses


The PVLT and the collocation tests were corrected manually, and each test item was
scored either one point for a (completely) correct answer (e.g., Ziel erreichen for
“achieve a goal”) or zero points for an incorrect or incomplete answer (e.g., …erreichen
or Zweck erzielen). For the collocation test, a binary score for each collocation was
given, which was used in the analyses. For the PVLT, learners’ mean percentage score
across all subtests (measuring learners’ vocabulary knowledge on the vocabulary levels
of 1,000, 2,000, 3,000, 4,000, and 5,000 words, respectively) was used in subsequent
regression modeling. All analyses were carried out using the R software environment
(version 4.1.2; R Core Team, 2021).
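The dichotomous scoring and the PVLT aggregation described above can be sketched as follows; the accepted answer forms and subtest scores are illustrative, not the study's answer key or data.

```python
def score_item(answer, accepted_forms):
    """1 point for a (completely) correct collocation, 0 otherwise."""
    return 1 if answer.strip().lower() in accepted_forms else 0

def pvlt_mean_percentage(correct_per_level, items_per_level=18):
    """Mean percentage score across the five PVLT frequency levels."""
    percentages = [c / items_per_level * 100 for c in correct_per_level.values()]
    return sum(percentages) / len(percentages)

# A complete answer scores 1; an incomplete one scores 0.
accepted = {"ziel erreichen"}
print(score_item("Ziel erreichen", accepted))  # 1
print(score_item("erreichen", accepted))       # 0
```

The collocation analyses then use the binary item scores directly, while the regression models use each learner's mean PVLT percentage as a single predictor.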
To visualize how learners’ collocation knowledge develops over time (RQ1), we
calculated descriptive statistics for the proxy baseline test and the three collocation tests.
The ggplot2 package (version 3.3.5; Wickham, 2016) was used to create line plots. To
explore the influence of the learner- and item-related variables on students’ collocation
score (RQ2 and 3) and the influence of time (RQ4), a generalized linear mixed model was
used because the outcome variable (collocation score) is binary. A linear mixed model is
an extension of a simple linear model to allow both random and fixed effects that account
for individual variation between items and participants. The model was constructed using
the glmer function from the package lme4 (version 1.1.26; Bates et al., 2021).


First, the continuous variables were centered on the mean. Then, a basic model was
built with only random effects: items and learners. Next, the fixed effect “time” and the
learner-related fixed effect “baseline productive vocabulary” were added. To be able to
integrate the other learner-related variable—L2 immersion—a separate model was
built, because students went abroad during their third year and consequently, only
the results of the final test might have been influenced by this L2 immersion experience.
Also, the item-related fixed effects congruency, collocation frequency, MI, and image-
ability were added. To avoid a collinearity problem, noun frequency and t-score were
not included (see Table 2 for a correlation matrix). Interactions between time and the
other fixed effects were added. Finally, variables and interactions were omitted until the
best fit was identified. Models were fit using maximum likelihood (Laplace
approximation). Model fit was assessed using the anova-function in
R. Marginal R2 was calculated, which measures the variance explained by the fixed
effects only, and conditional R2, which measures the variance explained by both the
fixed effects and the random effects, using the performance package (version 0.7.2;
Lüdecke et al., 2021) in R.
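The anova-based comparison of two nested models amounts to a likelihood-ratio test: the drop in deviance is referred to a chi-square distribution with df equal to the number of added parameters. For a single added parameter (df = 1), the p-value has a closed form; the deviance values below are made up for illustration.

```python
import math

def lrt_pvalue_df1(deviance_smaller_model, deviance_larger_model):
    """Likelihood-ratio test for nested models differing by one parameter.
    For df = 1, the chi-square survival function is erfc(sqrt(x / 2))."""
    chi_sq = deviance_smaller_model - deviance_larger_model
    return math.erfc(math.sqrt(chi_sq / 2))

# A deviance drop of 82.25 on 1 df (as reported below for the "learner"
# random intercept) lies far beyond the 3.84 threshold for p = .05.
```

For more than one added parameter, the general chi-square survival function (e.g., scipy.stats.chi2.sf) would be needed instead.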

Results
The study collected binary (correct vs. incorrect) learner responses to 35 German
collocations at three times. Because 50 learners were enrolled in the study, the potential
number of binary scores for tests 1 to 3 was 5,250. However, owing to learner absences
the actual total was 4,235. Baseline knowledge of the 35 collocations was estimated by
testing 32 learners similar to the ones participating in our study. The test-to-test
correlations between the by-item scores on tests of productive collocation knowledge,
with bootstrapped 95% CIs, are as follows: Proxy test to Test 1: r = .71 [.48, .84], Test
1 to Test 2: r = .77 [.59, .89], Test 2 to Test 3: r = .90 [.76, .96].
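Bootstrapped CIs for by-item correlations like these can be obtained by resampling item pairs with replacement and taking percentile cutoffs of the resampled coefficients. The sketch below is a plain-Python percentile bootstrap with invented score vectors, not the study's data or code.

```python
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def bootstrap_ci(xs, ys, reps=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for Pearson's r over resampled item pairs."""
    rng = random.Random(seed)
    n = len(xs)
    rs = sorted(
        pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(reps))
    )
    return rs[int(alpha / 2 * reps)], rs[int((1 - alpha / 2) * reps) - 1]
```

Resampling whole (x, y) pairs, rather than the two variables independently, preserves the item-level pairing that the correlation measures.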
To visualize the results for RQ1 (How does learners’ collocation knowledge develop
over time), descriptive statistics (mean, median, standard deviation and range) of the
by-item scores are given in Table 1. As can be seen, there is general progress from Time
0 (the proxy baseline) to Time 3.
Figure 1 represents the learning trend of the 35 target collocations. At each time
point in the figure there would be 35 dots (one per collocation) if no two collocations
had the same mean score. Although all the dots have been randomly “jittered” to
minimize complete overlaps when multiple collocations have the same mean score, some
dots are still not visible; an especially dark dot corresponds to more than one
collocation. Dots in two columns that relate to the
same collocation are connected by a line. To sum up, Figure 1 shows that most baseline
scores were at or near zero. General progress in collocation learning is indicated by the
fact that most of the lines slope upward from left to right.

Table 1. Descriptive statistics of the by-item scores (as proportions of the maximum) on the collocation
tests

              Mean   SD    Median   Range
Proxy test    .06    .11   .00      .00–.47
Test 1        .36    .31   .32      .00–.93
Test 2        .51    .28   .55      .02–1.00
Test 3        .55    .26   .61      .02–1.00

Table 2. Descriptive statistics and Spearman’s correlations for the continuous item-related variables
(N collocations = 35)

Item-related variable          Mean     SD       Min     Max     1        2        3       4       5
1. collocation frequency a     2.82     0.71     1.30    3.78    —
2. noun frequency a            4.82     0.48     3.82    5.64    .543**   —
3. t-score                   151.20   116.65    19.37  533.04    .946**   .557**   —
4. MI                          7.98     2.58     2.83   12.28    .007     .236     .033    —
5. imageability                4.83     1.12     3.04    6.88    .115     .078     .103    .414*   —

a Zipf transformed values.
*p < .05; **p < .01.

Figure 1. The trend of collocation learning from baseline to Time 3.


Note: The baseline scores come from 32 similar learners. At times 1 to 3 there were, respectively, 28, 50, and
43 study participants per collocation.

Twenty-one learners took all three year-end collocation tests. For these learners the
total per-collocation test scores correlate fairly strongly from test to test: Test 1 to Test
2, r = .60; Test 2 to Test 3, r = .71. Figure 2 shows the trend of collocation learning for
the 21 learners who took all three year-end tests of productive collocation knowledge.
Overall progress is indicated by the fact that the great majority of the lines slope upward
from test to test. It is plain, however, that there was some forgetting, especially during
the final year.


Figure 2. The trend of collocation learning for the 21 learners who took all three year-end tests.

Lastly, Figure 3 gives an overview of the collocation learning of all learners who were
present for at least two consecutive year-end tests. Again, there was general progress but
also some forgetting.
To explore which variables contributed to learners’ collocation development (RQ2
and 3), two mixed-effects logistic regression models were built. Table 2 provides
descriptive statistics and a correlation matrix for the continuous item-related variables.
First, models were run without the variable L2 immersion because this variable could
only affect the results at Time 3. The basic generalized linear mixed-effects model
included only the random effects of “learner” and “item,” and showed that the variable
“item” explained most of the variation (variance = 2.21, SD = 1.49, ICC = .39). Far less
variation was explained by the variable “learner” (variance = 0.21, SD = 0.46, ICC =
.04). Then, two basic models were compared using the anova-function in the lme4
package in R, which gives a chi-square test of the relative fit of two nested regression
models (Brysbaert, 2020). Adding the random intercept for “learner” contributed
significantly to improving the model fit, χ2(1) = 82.25, p < .001. The best model to
answer our research questions included random intercepts for item and learner and
three significant fixed effects (time, learners’ productive vocabulary knowledge and
congruency). Table 3 shows the final model, which has a marginal R2 of .20 and a
conditional R2 of .45. This means that the fixed effects in the model explain 20% of the
variance, and that an extra 25% of the variance was explained by the random effects.
The odds ratio for productive vocabulary was 1.03, 95% CI [1.01, 1.04], meaning
that a one-unit positive difference in a participant’s mean percentage score on the PVLT
corresponds to a 3% positive difference in the odds of having productive
knowledge of a collocation.

Figure 3. The trend of collocation learning for all learners taking at least two sequential year-end tests.

Table 3. Mixed-effects logistic regression model predicting right or wrong answers on the three
productive collocation tests

Random effects              Variance   SD
Item (Intercept)            1.412      1.188
Participant (Intercept)     0.117      0.342

Fixed effects               B        SE      z        p
Intercept                   –6.567   1.380   –4.759   <0.001***
Time                         0.817   0.077   10.612   <0.001***
Productive vocabulary        0.025   0.006    4.193   <0.001***
Congruency                   2.507   0.494    5.075   <0.001***
Collocation frequency        0.538   0.312    1.725    0.085
Imageability                 0.318   0.217    1.461    0.144
MI                           0.001   0.091    0.006    0.995
Time × Congruency           –0.558   0.101   –5.446   <0.001***

Note: Baseline for congruency = incongruent.
*p < 0.05; **p < 0.01; ***p < 0.001.
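The reported odds ratio and interval follow directly from exponentiating the logit coefficient and its Wald interval; using the productive-vocabulary estimate from Table 3 (B = 0.025, SE = 0.006):

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Exponentiate a logit coefficient and its 95% Wald interval
    to obtain an odds ratio with confidence bounds."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

or_, lo, hi = odds_ratio_with_ci(0.025, 0.006)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 1.03 1.01 1.04
```

This reproduces the reported 1.03 [1.01, 1.04] from the tabled coefficient and standard error alone.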
Then, another model was built with only the data of Time 3, including the variable
L2 immersion (1,505 observations, 43 learners). This model also showed a significant
effect of learners’ prior vocabulary knowledge and congruency. Having studied in the
target-language country did not significantly affect the results on the collocation test.
The results of the final model can be found in Table 4. For this model, we found a
marginal R2 of .13 and a conditional R2 of .42.

Table 4. Mixed-effects logistic regression model predicting right or wrong answers on the last productive
collocation test (Time 3)

Random effects                     Variance   SD
Item (Intercept)                   1.476      1.215
Participant (Intercept)            0.168      0.410

Fixed effects                      B        SE      z        p
Intercept                          –4.441   1.467   –3.028    0.002**
Productive vocabulary               0.035   0.009    3.973   <0.001***
Congruency                          1.037   0.464    2.232    0.025*
Collocation frequency               0.517   0.328    1.572    0.116
Imageability                        0.272   0.228    1.189    0.234
MI                                 –0.002   0.096   –0.025    0.980
L2 immersion (Germany/Austria)      0.073   0.179    0.410    0.682

Note: Baseline for congruency = incongruent.
*p < 0.05; **p < 0.01; ***p < 0.001.
To answer RQ4, which was to determine whether the influence of these item-related
and learner-related variables changes over time, we explored interactions between time
and the significant predictors (productive vocabulary and congruency). A significant
interaction between time and congruency was found. To make it easier to interpret this
interaction effect, Figure 4 was added. It shows a rising trend for the knowledge of

Figure 4. Interaction between time and congruency with the predicted probabilities of score.

https://doi.org/10.1017/S0272263122000377 Published online by Cambridge University Press


518 Griet Boone et al.

congruent collocations, although the curve is not very steep, rather it is gradual. In
contrast, the predicted probability that learners will know an incongruent collocation
clearly rises. Specifically, at Time 1 the two contrasted probabilities are far apart. At
Times 2 and 3 they are markedly less far apart, showing that the effect of time
diminishes for congruent versus incongruent collocations. The learning curve for
incongruent collocations is steeper compared to the curve of the congruent colloca-
tions, even though more congruent collocations are still known compared to incon-
gruent collocations at Time 3.
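On the log-odds scale, the size of the congruency advantage at each time point follows from the Table 3 estimates as the congruency main effect plus the interaction times the time code. A quick check, assuming time is coded 1, 2, 3, shows the advantage shrinking while remaining positive:

```python
B_CONGRUENCY = 2.507      # fixed effect of congruency (Table 3)
B_INTERACTION = -0.558    # Time × Congruency interaction (Table 3)

def congruency_advantage(time_point):
    """Log-odds difference between congruent and incongruent collocations
    at a given time point, holding other predictors constant."""
    return B_CONGRUENCY + B_INTERACTION * time_point

advantages = [round(congruency_advantage(t), 3) for t in (1, 2, 3)]
print(advantages)  # [1.949, 1.391, 0.833]
```

The advantage stays positive throughout, which matches the observation that congruent collocations remain better known even as the gap narrows.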

Discussion
How does learners’ collocation knowledge develop over time?
Our results show that there was a general increase in collocation knowledge (i.e., form
recall knowledge) after 3 years of studying German. This seems unsurprising because
our participants were motivated language specialists who engaged with German almost
daily at university in classes that include language production. However, if we look at
the trend of collocation learning (Figures 1, 2 and 3), we see that at Time 3, not one
learner was able to produce all 35 collocations correctly. Figures 2 and 3 show that total
per-learner scores range from 13 to 29. Although our learners were German majors,
some of them still had rather limited knowledge of the target collocations after 3 years.
This finding is in line with previous studies that indicate that the acquisition of
collocations is slow and challenging even for advanced learners (e.g., Boers et al.,
2014; Laufer & Waldman, 2011; Nesselhauf, 2003). It also appears to confirm the
evidence in vocabulary research that “form recall is the most difficult degree of mastery
of the form-meaning link” (Schmitt, 2014, p. 929). In addition, it seems that the
learning process was rather nonlinear, both with respect to the items (Figure 1) and
the individual learners (Figures 2 and 3). These outcomes seem to be in line with the
dynamic systems approach to language learning, in which language development is
expected to be a nonlinear, chaotic, and highly individual process, with a learning curve
“filled with peaks and valleys, progress and backsliding” (Larsen-Freeman, 1997,
p. 151). For collocation development, this type of process was already illustrated in
the longitudinal study of Li and Schmitt (2010), who reported the variation in
collocation development of four learners followed over one year. This is confirmed
in our study, in which there is considerable variation both in how well individual
collocations were learned and in how well individual learners learned collocations. For
the majority of the learners, there is a clear upward trend in the learning curve, but for
some learners, some attrition from Time 2 to 3 was observed. This attrition might be
explained by the fact that some of these individuals might have had less input to L2
German (e.g., through out-of-class activities such as reading books or articles, watching
television, listening to music, or using social media in the L2), which also means less
opportunities for contextual vocabulary learning. To get more insight into the causes
hereof, qualitative interview data could prove useful.
The per-learner scores at Time 3 not only show that collocation learning is slow,
but also raise the question of the effectiveness of contextual (incidental) learning.
Research has shown that L2 collocations can be acquired both incidentally and
intentionally, and that intentional learning results in greater gains (Szudarski, 2017).
It is likely that the long-term retention of the 35 collocations would have been better if
they had been used as targets in a study on intentional learning. Explicit collocation
instruction is definitely needed, but because only a small number of collocations can be


taught in the classroom, it is important to know which variables affect learning to make
informed choices about which collocations should be selected for classroom learning,
and how to deal with individual differences.

How do several item-related variables (i.e., congruency, corpus frequency, association


strength, and imageability) influence L2 collocation development?
This study found that congruency had a statistically significant positive effect on
learners’ productive collocation knowledge. Students’ better knowledge of congruent
collocations at Times 1, 2, and 3 compared to their knowledge of incongruent collocations
might be explained by the fact that German and Dutch are highly related Germanic
languages. However, this positive congruency effect has been shown for other language
pairs too: for less highly related Germanic language pairs like English–Dutch (Peters,
2016), English–German (Nesselhauf, 2003, 2005), and English–Swedish (Wolter &
Gyllstad, 2011, 2013) and also for much less related language pairs like English–Chinese
(Ding & Reynolds, 2019), English–Japanese (Yamashita & Jiang, 2010), and English–
Vietnamese (Vu & Peters, 2021). Our results are thus in line with previous studies and
indicate that (a) students often tend to rely on word-for-word translation when
producing L2 collocations (Laufer & Waldman, 2011) and that (b) L1-L2 congruency
is an important factor in the processing and use of collocations, which should be taken
into account in teaching.
With respect to the other item-related variables (i.e., collocation frequency, MI,
imageability), the results were nonsignificant in the final model. Although some studies
did find that collocation frequency related to some degree to L2 collocation knowledge,
they also point out that corpus frequency is only one factor of influence (e.g., Durrant,
2014; González Fernández & Schmitt, 2015). Vu and Peters (2021), who included a
larger number of factors in their study, found no significant effect for corpus frequency.
It is thus clear that the relationship between corpus frequency and collocation knowl-
edge is not straightforward, and that a study’s findings may also depend on the corpus
and the target items used (e.g., collocations consisting of infrequent words may yield
different results compared to collocations consisting of high-frequency words or
delexical verbs). The nonsignificant effect of MI in this study seems to be in line with
other findings on L2 collocation knowledge in a non-immersion context, in which the
strength of association between the words of a collocation does not seem related to L2
collocation learning (Durrant, 2014; González Fernández & Schmitt, 2015). Regarding
imageability, it has been shown to facilitate L2 word learning (e.g., De Groot &
Keijzer, 2000) and possibly also the learning of L2 idioms (Steinel et al., 2007).
Our study could not confirm a facilitating effect for L2 collocations; given the limited
number of items, however, we think further research into the influence of
imageability on L2 collocation learning is needed.
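The association-strength measure at issue here, MI, is the mutual information score of Church and Hanks (1990), computed from corpus counts. As a minimal sketch (the counts below are invented for illustration, not taken from any corpus used in the study):

```python
import math

def mi_score(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """Pointwise mutual information (Church & Hanks, 1990):
    log2 of observed co-occurrence over co-occurrence expected by chance."""
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)

# Invented counts: the word pair occurs 80 times; its component words occur
# 2,000 and 5,000 times in a 10-million-token corpus.
print(round(mi_score(80, 2_000, 5_000, 10_000_000), 2))  # 6.32
```

Higher scores indicate word pairs that co-occur far more often than their individual frequencies would predict; collocation lists are often filtered with a threshold such as MI ≥ 3.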

How do the learner-related variables prior vocabulary knowledge and L2 immersion influence L2 collocation development?
Learners’ baseline productive vocabulary emerged as a statistically significant predictor
of productive collocation knowledge at the form-recall level. This finding seems to
support our hypothesis that, if receptive vocabulary knowledge predicts collocation
knowledge, both on a receptive level (e.g., Gyllstad, 2009; Nguyen & Webb, 2017;
Vilkaitė, 2017) and on a form-recall level (e.g., Peters, 2016; Vu & Peters, 2021),

520 Griet Boone et al.

productive vocabulary will do so too. In our study, we found that with an increase of one
point in the mean percentage PVLT score, the odds of learning a collocation increased
by 3%. These results extend the evidence for the “rich-get-richer” phenomenon in
vocabulary learning, whereby larger vocabulary sizes, receptive or productive, are
associated with better learning outcomes (e.g., James et al., 2017). The findings also
suggest that two widely recognized constructs of vocabulary knowledge, vocabulary
size (i.e., knowledge of the form–meaning connection) and vocabulary depth (e.g.,
collocation knowledge; Schmitt, 2014), are related: students’ productive vocabulary
size predicted their development of productive collocation knowledge.
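The 3% figure is the usual odds-ratio reading of a logistic-regression coefficient. The following sketch illustrates the arithmetic; the coefficient is assumed for illustration, not the study’s actual estimate:

```python
import math

# Assumed log-odds coefficient for a 1-point increase in the mean
# percentage PVLT score; exp(beta) is the odds ratio (~1.03 here).
beta = math.log(1.03)

def odds_multiplier(points: float) -> float:
    """Factor by which the odds of producing a collocation correctly
    change for an increase of `points` on the predictor."""
    return math.exp(beta * points)

print(round(odds_multiplier(1), 3))   # 1.03  (+3% per point)
print(round(odds_multiplier(10), 3))  # 1.344 (effects compound multiplicatively)
```

Because odds ratios compound, a 10-point difference corresponds to roughly 34% higher odds rather than 30%.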
Although some studies indicate that L2 immersion plays a role in collocation
knowledge, we did not find a significant effect of a stay abroad in a German-speaking
country on learners’ collocation knowledge at Time 3. This might be explained by the
fact that 5 months may be quite short for collocation development, or by the fact that
the students who went to a non-German-speaking country also continued studying
German at universities abroad. It might also depend on students’ active engagement
with the L2 in social interaction abroad (e.g., González Fernández & Schmitt, 2015),
which can take place outside the target-language country too (e.g., Boone, 2021). Here
too, qualitative interview data, for example on students’ L2 exposure and use or their
L2 learning experience during SA, could yield relevant information.

Does the influence of these item-related and learner-related variables change over
time?
A significant interaction effect between time and congruency was found in this study,
indicating that the influence of congruency may change over time. Specifically, we
found that students’ knowledge of congruent collocations remained comparatively
stable from Time 1 to Time 3, whereas their knowledge of incongruent collocations
rose significantly over time (Figure 4). However, it should be noted that learners who
already produced some of the congruent collocations correctly at Time 1 could not,
mathematically, make as much progress on those items as on the incongruent
collocations.
Because of the important impact of the L1 on the processing of L2 collocations even
at advanced levels of proficiency, Wolter and Gyllstad (2013) assumed a persisting
congruency effect in L2 collocation processing. In our study, the probability of knowing
a congruent collocation remained higher than that of knowing an incongruent one at
Time 3, but our results also suggest that, as learners’ proficiency rises, they may have
increasing success in acquiring incongruent collocations. In short, the results of
our study are consistent with previous findings of a fairly general positive effect of
congruency but also show that the substantive importance of the effect dwindles as
learning progresses.
Interestingly, the two collocations with the lowest scores at Time 3 are incongruent
collocations with more than one incongruent constituent, which points to the
possibility that the degree of congruency plays a role in collocation learning. If this
finding were borne out by further research, there would be pedagogical implications for
the foreign language classroom. Teachers could devote extra attention to incongruent
collocations, especially to the “very incongruent” ones (i.e., with both constituent parts
being incongruent) because those are likely to cause problems for learners (e.g.,
Nesselhauf, 2003). A contrastive L1-L2 approach, making students aware of L1-L2
differences, can be recommended. Although one by-item score did reach the maximum
(at Time 3 for eine Rolle spielen [“play a role”]), we believe that congruent collocations
need attention too, as Wolter and Gyllstad (2011) have already pointed out. We suggest
that especially in the beginning of a learning trajectory, when the gap between learners’
knowledge of congruent and incongruent collocations is large, teachers should give
attention to incongruent collocations by setting exercises with known potential to
enhance learners’ collocation knowledge (Boers & Lindstromberg, 2012; Szudarski,
2017). However, because the learning curve of congruent collocations hardly changes
from Time 1 to Time 3, it may be useful in a later phase to give extra attention to
congruent collocations that seem likely to be relatively hard to learn because they are
infrequent or contain low-frequency words, for example.

Limitations and suggestions for future research


Our findings have to be seen in light of several limitations. First, the number of both
targets and learners was fairly low. Here, it is relevant that L2 learners of German are not
as numerous as L2 English learners. However, smaller numbers of learners should not
dissuade researchers from investigating other languages because each language has its
own characteristics and deserves its place in the field of applied linguistics. Additionally,
compared to other longitudinal studies, the sample size is reasonable. However, we
recommend adopting a mixed-methods approach in future studies, in which
quantitative results from smaller samples are complemented with qualitative data to
show how individual learners deal with the challenges of learning L2 collocations and to
provide more context for the findings (e.g., on attrition or on the L2 immersion
experience). Qualitative insights are important for getting the complete picture because,
as Henriksen (2013) puts it: “It is more than likely that collocational acquisition is much
more idiosyncratic in nature and dependent on specific language use situations than
single-word acquisition” (pp. 48–49). A second methodological issue is the use of
multiple tests to measure learners’ development. In this study, a positive testing effect
cannot be entirely ruled out; that is, some learning might have happened during test
taking. However, we tried to reduce this effect by leaving a gap of a year between the
completion of the tests. Third, the number of variables influencing collocation knowl-
edge is undoubtedly much higher than the number investigated in this study. The
predictors investigated here explain about 20% of the variation in test scores, which
provides an opportunity for further studies to identify other factors involved. Fourth, it
is possible that participants were able to translate some congruent collocations
correctly even if they had never encountered these collocations before. For cognates
(i.e., words with a similar form and meaning in the L1 and the L2), it has been shown
that they “can grant learners access to a reservoir of potential target language vocab-
ulary without explicit instruction” (Vanhove & Berthele, 2015, p. 2). It is likely that the
same applies to congruency. Using a literal L1 equivalent works perfectly for congruent
collocations but not in the case of incongruent ones. However, this kind of “guessing
effect” is difficult to avoid and might also be an indication of how students produce
language. Relatedly, the other languages known by the participants may have
had an effect on students’ collocation scores. An incongruent collocation targeted in
our study, for example, might have been congruent in another foreign language with
which our participants were familiar. It is possible that students’ additional
languages served as a bridge for translating the L2 target collocations. We did not
investigate this in the present study, but it might be interesting to explore in future
research. Finally, because most of our incongruent targets contained only one word
without a literal translation equivalent, future studies should look at learners’ acqui-
sition of incongruent collocations of different incongruency levels.

Conclusion
The goal of this study was to investigate L2 learners’ productive collocation develop-
ment in German and to examine the effect of several item- and learner-related variables.
The results indicate that there was general progress, despite some forgetting. The
variation in both per learner and per collocation scores shows that collocation learning
is influenced by multiple variables.
As to item-related variables, our results corroborate previous findings that L1-L2
congruency is an important predictor of collocation knowledge. What is more, the
congruency effect was found to persist throughout learners’ 3-year trajectory. To
maximize collocation learning, we recommend that teachers and materials creators
direct learners’ attention toward both congruent and incongruent collocations, with
special attention to incongruent collocations at the beginning of the learning trajectory.
As to learner-related variables, we found that learners with a comparatively large
productive vocabulary at the beginning of the learning trajectory were more likely to
produce correct L2 German collocations, which shows the importance of increasing
one’s vocabulary as much as possible even in the early stages of learning an additional
language.
Acknowledgments. We thank the anonymous reviewers for their valuable comments and suggestions for
improving the manuscript. We extend special thanks to Seth Lindstromberg for his relentless assistance with
statistical analyses and graphing, and for his invaluable advice and feedback on earlier drafts of this
manuscript. Thanks also to all participants for their time and contribution to this study.

Data availability statement. This article received the Open Data and Open Materials badges for trans-
parent practices. To view supplementary material for this article, please visit https://osf.io/yp2j4/?view_only=
b0a7c06c30904072a6240f86a4ff1ff2.

References
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2021). R package lme4, version 1.1-26: Linear mixed-effects
models using “Eigen” and S4. https://github.com/lme4/lme4/
Boers, F. (2020). Factors affecting the learning of multiword items. In S. Webb (Ed.). The Routledge handbook
of vocabulary studies (pp. 143–157). Routledge.
Boers, F., & Lindstromberg, S. (2012). Experimental and intervention studies on formulaic sequences in a
second language. Annual Review of Applied Linguistics, 32, 83–110. https://doi.org/10.1017/
S0267190512000050
Boers, F., Lindstromberg, S., & Eyckmans, J. (2014). Some explanations for the slow acquisition of L2
collocations. Vial-Vigo International Journal of Applied Linguistics, 11, 41–62.
Boone, G. (2021). How social interaction affects students’ formulaic development in L2 German in a
multilingual SA context: Four case studies. In R. Mitchell & H. Tyne (Eds.), Language, mobility and study
abroad in the contemporary European context (pp. 159–170). Routledge. https://doi.org/10.4324/
9781003087953-11
Brysbaert, M. (2020). Basic statistics for psychologists. Macmillan International.
Brysbaert, M., Warriner, A., & Kuperman, V. (2014a). Concreteness ratings for 40,000 generally known
English word lemmas. Behavior Research Methods, 46, 904–911. https://doi.org/10.3758/s13428-013-
0403-5

Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014b). Norms of age of acquisition
and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84. https://doi.org/10.1016/j.
actpsy.2014.04.010
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography.
Computational Linguistics, 16, 22–29.
Citron, F., Cacciari, C., Kucharski, M., Beck, L., Conrad, M., & Jacobs, A. (2016). When emotions are
expressed figuratively: Psycholinguistic and affective norms of 619 idioms for German (PANIG). Behavior
Research Methods, 48, 91–111. https://doi.org/10.3758/s13428-015-0581-4
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching,
assessment. Cambridge University Press.
De Groot, A., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of word concreteness,
cognate status, and word frequency in foreign-language vocabulary learning and forgetting. Language
Learning, 50, 1–56. https://doi.org/10.1111/0023-8333.00110
Ding, C., & Reynolds, B. (2019). The effects of L1 congruency, L2 proficiency, and the collocate-node
relationship on the processing of L2 English collocations by L1-Chinese EFL learners. Review of Cognitive
Linguistics, 17, 331–357. https://doi.org/10.1075/rcl.00038.din
Ding, J., Liu, W., & Yang, Y. (2017). The influence of concreteness of concepts on the integration of novel
words into the semantic network. Frontiers in Psychology, 8 (2111). https://doi.org/10.3389/
fpsyg.2017.02111
Durrant, P. (2014). Corpus frequency and second language learners’ knowledge of collocations: A meta-
analysis. International Journal of Corpus Linguistics, 19, 443–477. https://doi.org/10.1075/ijcl.19.4.01dur
Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make use of collocations?
IRAL: International Review of Applied Linguistics in Language Teaching, 47, 157–177. https://doi.org/
10.1515/iral.2009.007
Durrant, P., & Schmitt, N. (2010). Adult learners’ retention of collocations from exposure. Second Language
Research, 26, 163–188. https://doi.org/10.1177/0267658309349431
Edmonds, A., & Gudmestad, A. (2021). Collocational development during a stay abroad. Languages, 6, 12.
https://doi.org/10.3390/languages6010012
Elgort, I., Brysbaert, M., Stevens, M., & Van Assche, E. (2018). Contextual word learning during reading in a
second language: An eye-movement study. Studies in Second Language Acquisition, 40, 341–366. https://
doi.org/10.1017/S0272263117000109
Ellis, N. (2002). Frequency effects in language processing: A review with implications for theories of implicit
and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188. https://doi.org/
10.1017/S0272263102002024
Ellis, N., & Beaton, A. (1993). Psycholinguistic determinants of foreign language vocabulary learning.
Language Learning, 43, 559–617. https://doi.org/10.1111/j.1467-1770.1993.tb00627.x
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text: Interdisciplinary
Journal for the Study of Discourse, 20, 29–62. https://doi.org/10.1515/text.1.2000.20.1.29
Firth, J. (Ed.). (1957). Papers in linguistics 1934–1951. Oxford University Press.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus-based language learning research:
Identifying, comparing, and interpreting the evidence. Language Learning, 67, 155–179. https://doi.org/
10.1111/lang.12225
González Fernández, B., & Schmitt, N. (2015). How much collocation knowledge do L2 learners have? The
effects of frequency and amount of exposure. ITL: International Journal of Applied Linguistics, 166,
94–126. https://doi.org/10.1075/itl.166.1.03fer
Glaboniat, M., Perlmann-Balme, M., & Studer, T. (2013). Zertifikat B1: Deutschprüfung für Jugendliche und
Erwachsene: Prüfungsziele, Testbeschreibung. Hueber Verlag.
Granger, S., & Paquot, M. (2008). Disentangling the phraseological web. In S. Granger & F. Meunier (Eds.),
Phraseology (pp. 27–49). John Benjamins. https://doi.org/10.1075/z.139.07gra
Groom, N. (2009). Effects of second language immersion on second language collocational development. In
A. Barfield & H. Gyllstad (Eds.), Researching collocations in another language (pp. 21–33). Palgrave
Macmillan. https://doi.org/10.1057/9780230245327_2
Gyllstad, H. (2009). Designing and evaluating tests of receptive collocation knowledge: COLLEX and
COLLMATCH. In A. Barfield & H. Gyllstad (Eds.), Researching collocations in another language: Multiple
interpretations (pp. 153–170). Palgrave Macmillan.

Häcki Buhofer, A., Dräger, M., Meier, S., & Roth, T. (2014). Feste Wortverbindungen des Deutschen:
Kollokationenwörterbuch für den Alltag. Francke.
Henriksen, B. (2013). Research on L2 learners’ collocational competence and development: A progress report.
In C. Bardel, C. Lindqvist, & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge, and use (pp. 29–56).
EuroSLA Monograph Series 2. EuroSLA.
Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19, 24–44. https://
doi.org/10.1093/applin/19.1.24
Hulstijn, J. (2003). Incidental and intentional learning. In C. Doughty & M. Long (Eds.), The handbook of
second language acquisition (pp. 349–381). Blackwell Publishing Ltd.
James, E., Gaskell, G., Weighall, A., & Henderson, L. (2017). Consolidation of vocabulary during sleep: The
rich get richer? Neuroscience and Biobehavioral Reviews, 77, 1–13. https://doi.org/10.1016/j.neubiorev.2017.01.054
Jones, R., Tschirner, E. P., Goldhahn, A., Buchwald, I., & Ittner, A. (2006). A frequency dictionary of German:
Core vocabulary for learners. Routledge.
Koo, T., & Li, M. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability
research. Journal of Chiropractic Medicine, 15, 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
Larsen-Freeman, D. (1997). Chaos/Complexity science and second language acquisition. Applied Linguistics,
18, 141–165. https://doi.org/10.1093/applin/18.2.141
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of
learners’ English verb-noun collocations in L2 writing. Language Learning, 61, 647–672. https://doi.org/
10.1111/j.1467-9922.2010.00621.x
Li, J., & Schmitt, N. (2010). The development of collocations use in academic texts by advanced L2 learners: A
multiple case study approach. In D. Wood (Ed.), Perspectives on formulaic language: Acquisition and
communication (pp. 23–46). Continuum.
Lüdecke, D. (2021). performance: Assessment of regression models performance (R package version 0.7.2).
https://cran.r-project.org/web/packages/performance/performance.pdf
Macis, M., & Schmitt, N. (2017). Not just “small potatoes”: Knowledge of the idiomatic meanings of
collocations. Language Teaching Research, 21, 321–340. https://doi.org/10.1177/1362168816645957
Nation, P. (2001). Learning vocabulary in another language. Cambridge University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for
teaching. Applied Linguistics, 24, 223–242. https://doi.org/10.1093/applin/24.2.223
Nesselhauf, N. (2005). Collocations in a Learner Corpus (Vol. 14). John Benjamins. https://doi.org/10.1075/
scl.14
Nguyen, T., & Webb, S. (2017). Examining second language receptive knowledge of collocation and factors
that affect learning. Language Teaching Research, 21, 298–320. https://doi.org/10.1177/
1362168816639619
Peters, E. (2016). The learning burden of collocations: The role of interlexical and intralexical factors.
Language Teaching Research, 20, 113–138. https://doi.org/10.1177/1362168814568131
Quasthoff, U. (2011). Wörterbuch der Kollokationen im Deutschen. De Gruyter.
R Core Team (2021). R: A language and environment for statistical computing (Version 4.1.2). R Foundation
for Statistical Computing. https://www.R-project.org/
Revelle, W. (2020). psych: Procedures for psychological, psychometric, and personality research (R package
version 2.0.12). Northwestern University. https://CRAN.R-project.org/package=psych
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning,
64, 913–951. https://doi.org/10.1111/lang.12077
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new versions
of the Vocabulary Levels Test. Language Testing, 18, 55–88. https://doi.org/10.1177/026553220101800103
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.
Siyanova-Chanturia, A. (2015). Collocation in beginner learner writing: A longitudinal study. System, 53,
148–160. https://doi.org/10.1016/j.system.2015.07.003
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study
perspective. Canadian Modern Language Review, 64, 429–458. https://doi.org/10.3138/cmlr.64.3.429
Siyanova‐Chanturia, A., & Spina, S. (2020). Multi‐word expressions in second language writing: A large‐scale
longitudinal learner corpus study. Language Learning, 70, 420–463. https://doi.org/10.1111/lang.12383

Steinel, M., Hulstijn, J., & Steinel, W. (2007). Second language idiom learning in a paired-associate paradigm:
Effects of direction of learning, direction of testing, idiom imageability, and idiom transparency. Studies in
Second Language Acquisition, 29, 449–484. https://doi.org/10.1017/S0272263107070271
Szudarski, P. (2017). Learning and teaching L2 collocations: Insights from research. TESL Canada Journal,
34, 205–216. https://doi.org/10.18806/tesl.v34i3.1280
Toomer, M., & Elgort, I. (2019). The development of implicit and explicit knowledge of collocations: A
conceptual replication and extension of Sonbul and Schmitt (2013). Language Learning, 69, 405–439.
https://doi.org/10.1111/lang.12335
Van Heuven, W., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). Subtlex-UK: A new and improved word
frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176–1190.
Vanhove, J., & Berthele, R. (2015). The lifespan development of cognate guessing skills in an unknown related
language. International Review of Applied Linguistics in Language Teaching, 53, 1–38. https://doi.org/
10.1515/iral-2015-0001
Vilkaitė, L. (2017). Incidental acquisition of collocations in L2: Effects of adjacency and prior vocabulary
knowledge. ITL: International Journal of Applied Linguistics, 168, 248–277. https://doi.org/10.1075/
itl.17005.vil
Vu, D. V., & Peters, E. (2021). Incidental learning of collocations from meaningful input: A longitudinal study
into three reading modes and factors that affect learning. Studies in Second Language Acquisition, 44,
685–707. https://doi.org/10.1017/S0272263121000462
Webb, S. (2008). Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language
Acquisition, 30, 79–95. https://doi.org/10.1017/S0272263108080042
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
Wolter, B., & Gyllstad, H. (2011). Collocational links in the L2 mental lexicon and the influence of L1
intralexical knowledge. Applied Linguistics, 32, 430–449. https://doi.org/10.1093/applin/amr011
Wolter, B., & Gyllstad, H. (2013). Frequency of input and L2 collocational processing: A comparison of
congruent and incongruent collocations. Studies in Second Language Acquisition, 35, 451–482. https://
doi.org/10.1017/S0272263113000107
Wolter, B., & Yamashita, J. (2018). Word frequency, collocational frequency, L1 congruency, and proficiency
in L2 collocational processing: What accounts for L2 performance? Studies in Second Language Acqui-
sition, 40, 395–416. https://doi.org/10.1017/S0272263117000237
Wood, D. (2019). Classifying and identifying formulaic language. In S. Webb (Ed.), The Routledge handbook
of vocabulary studies (pp. 30–45). Routledge. https://doi.org/10.4324/9780429291586-3
Wray, A. (2002). Formulaic language and the lexicon. Cambridge University Press.
Yamashita, J., & Jiang, N. (2010). L1 influence on the acquisition of L2 collocations: Japanese ESL users and
EFL learners acquiring English collocations. TESOL Quarterly, 44, 647–668. https://doi.org/10.5054/
tq.2010.235998
Yoon, H.-J. (2016). Association strength of verb-noun combinations in experienced NS and less experienced
NNS writing: Longitudinal and cross-sectional findings. Journal of Second Language Writing, 34, 42–57.
https://doi.org/10.1016/j.jslw.2016.11.001

Cite this article: Boone, G., De Wilde, V. and Eyckmans, J. (2023). A longitudinal study into learners’
productive collocation knowledge in L2 German and factors affecting the learning. Studies in Second
Language Acquisition, 45, 503–525. https://doi.org/10.1017/S0272263122000377



Studies in Second Language Acquisition (2023), 45, 526–557
doi:10.1017/S0272263122000407

METHODS FORUM

Network analysis for modeling complex systems in SLA research
Lani Freeborn* , Sible Andringa , Gabriela Lunansky and Judith Rispens
University of Amsterdam, Amsterdam, The Netherlands
*Corresponding author. E-mail: l.j.v.freeborn@uva.nl

(Received 07 October 2021; Revised 29 July 2022; Accepted 15 August 2022)

Abstract
Network analysis is a method used to explore the structural relationships between people or
organizations and, more recently, between psychological constructs. It is a novel technique
that can be used to model psychological constructs that influence language learning as
complex systems, using either longitudinal or cross-sectional data. The majority of
complex dynamic systems theory (CDST) research in the field of second language acquisi-
tion (SLA) to date has been time-intensive, with a focus on analyzing intraindividual
variation with dense longitudinal data collection. The question of how to model systems
from a structural perspective using relation-intensive methods is an underexplored dimen-
sion of CDST research in applied linguistics. To expand our research agenda, we highlight
the potential that psychological networks have for studying individual differences in
language learning. We provide two empirical examples of network models using cross-
sectional datasets that are publicly available online. We believe that this methodology can
complement time-intensive approaches and that it has the potential to contribute to the
development of new dimensions of CDST research in applied linguistics.

Introduction
In the field of second language acquisition (SLA), complex dynamic systems theory
(CDST) is a theoretical paradigm used to study the complex and dynamic nature of
language, language use, and language development (Hulstijn, 2020). A complex system
is formed out of interactions between multiple internal and external system compo-
nents. For example, if conceptualizing language development as a complex system,
changes in development are dependent on interactions between a learner’s internal
resources like working memory, motivation, and personality, as well as external,
environmental resources like the teacher, learning materials, and language use (van
Geert, 1991). These internal and external resources are interrelated, whereby altering
one component could in turn alter other components of the system (de Bot et al., 2007).
In this way, a complex system is characterized by complete interconnectedness and
mutual causality (Larsen-Freeman & Cameron, 2008). Complex systems are inherently

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.



Network analysis for modeling complex systems in SLA research 527

dynamic; systems emerge over time through processes of self-organization and coad-
aptation between micro- and macro-level system components (Larsen-Freeman, 1997).
This means that complex systems are soft-assembled, whereby systems are “more than
the sum of their parts, reflecting a multiplicative combination of attributes, experiences
and situational factors” (American Psychological Association, 2017). CDST researchers
in SLA have acknowledged the impossibility of fully “knowing” a system, as complex
systems are characterized by unpredictability and nonlinearity, where changes in the
system can be disproportionate to the cause (Larsen-Freeman, 1997). Although a
complex system is, by definition, constantly in flux, the system can also demonstrate
periods of temporary stability. This is referred to as an attractor state: a self-sustaining
state in which interactions “are actively reproduced over time” (van Geert, 2019,
p. 168). An attractor state represents higher-order patterns of self-organization within
state space, toward or away from which the system moves over time (Hiver, 2014).
To illustrate, an attractor state could be the tendency for learners not to participate
in class and to remain silent (Hiver, 2014).
With the growing recognition that CDST approximates the reality of language
development (Hiver & Al-Hoorie, 2020a), more SLA researchers are adopting this
framework. However, there are many methodological considerations for conducting
empirical research within a CDST paradigm. Some of these include how to operatio-
nalize the system, how to assess the influence of contextual factors on the system, as well
as macro- and micro-structure considerations (Hiver & Al-Hoorie, 2016). Given the
inherent complexities of analyzing dynamic cause-effect relationships between systems
and their components, there has been much discussion about suitable methodologies
and suggestions of how to enhance our CDST toolbox (de Bot, 2011; Hiver &
Al-Hoorie, 2016, 2020a; Hiver et al., 2022).
Hilpert and Marchand (2018) distinguish between three conceptual perspectives
on studying complex systems and their accompanying research designs: time-
intensive, relation-intensive, and time-relation intensive approaches. Firstly, time-
intensive approaches “are used to make inferences about system behavior using
closely spaced observations over time” using longitudinal data (Hilpert & Marchand,
2018, p. 192). The second approach, relation-intensive, focuses on identifying the
structure of the relationships among individuals or variables in a system using cross-
sectional data. Combining the first two approaches, time-relation intensive
approaches “are used to make inferences about system behavior using closely spaced,
simultaneously collected observations of both within-element change and changing
between element relationships” (Hilpert & Marchand, 2018, p. 192).
The majority of CDST studies in the field of SLA to date have taken time-intensive
approaches, typically consisting of case studies characterized by dense data collection
with qualitative and descriptive data analyses (Hiver et al., 2022). For the last 30 years,
CDST researchers have focused on individual variability and the dynamics of pro-
cesses (van Geert & van Dijk, 2021). This is not surprising, given that CDST is
essentially a theory of change, concerned with how one state develops into another
state over time. However, as Hilpert and Marchand (2018) have pointed out, complex
systems can be studied from multiple perspectives. Besides analyzing change over
time, identifying the structure of a system is also a key aspect of CDST. Expanding our
line of inquiry to include relation-intensive approaches could contribute to the
development of new dimensions of CDST research and complement time-intensive
approaches. While researchers have a diverse selection of methods available for time-
intensive approaches, our methodological toolbox for relation-intensive methods is
lacking.

528 Lani Freeborn et al.

In this article, we highlight network analysis as a potential methodology to model
complex systems from a relation-intensive perspective. While network analysis can
also be used for time- and time-relation intensive approaches, this article is focused on
network analysis for relation-intensive approaches only, due to the relative lack of
attention that this dimension has received by SLA researchers working within a CDST
paradigm. More specifically, we concentrate on psychological networks, as opposed to
social networks. SLA researchers have already explored social network analysis as a
suitable research methodology for CDST, for example to model relationships between
learners in a classroom and teacher networks as complex systems (Hiver & Al-Hoorie,
2016; Hiver & Al-Hoorie, 2020a; Mercer, 2014). SLA researchers have not yet
explored the potential of psychological networks to model psychological constructs
that influence language learning as complex systems. The network approach to
psychopathology has been used to reevaluate theories of mental disorders
(Borsboom et al., 2017; Borsboom & Cramer, 2013) and constructs such as intelli-
gence and cognitive development from a CDST perspective (Kievit, 2020; van der
Maas et al., 2006, 2017). In this article we discuss how, similarly to psychology
research, individual differences in language learning can be modeled as nomological
networks, expanding our relation-intensive methods to include the study of phenom-
enological constructs. We begin with a brief review of CDST research designs used in
the field of SLA to date, in relation to the three different conceptual approaches to
studying complex systems as described by Hilpert and Marchand (2018). We then
expand discussion on relation-intensive approaches, the least researched dimension
in CDST. The remainder of the article discusses potential applications of network
analysis. To further aid discussion, we provide two examples of network models that
are estimated from publicly available data.

Research designs in CDST


Time-intensive methods
Most CDST research in applied linguistics is time-intensive, with longitudinal data
collection of a single variable (or multiple variables for a single case/participant) to
observe micro-level changes in the system over time (Hiver et al., 2022; Hiver & Larsen-
Freeman, 2019). Time-intensive studies tend to have dense data collection and small
sample sizes, with 40% of studies including a sample size of 10 participants or fewer
(Hiver et al., 2022). A particularly researched area is the development of L2 writing over
time using measures of complexity, fluency, and accuracy (CAF) (Evans & Larsen-
Freeman, 2020; Larsen-Freeman, 2006; Lowie et al., 2017; Lowie & Verspoor, 2019).
Some common CDST techniques used in these studies include assessing the degree of
variability in developmental trajectories and plotting longitudinal data on min-max
graphs for visual inspection. Several studies have used a time-series design based on the
view that frequent-enough measurements may be able to capture underlying develop-
mental processes (Van Geert & Steenbeek, 2005). For example, Waninge et al. (2014)
micro-mapped the motivational dynamics of four students during class time, taking
measurements at 5-minute intervals.
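The min-max graphs mentioned above can be sketched in a few lines: for each observation, one takes the minimum and maximum of a moving window around it, so that the width of the resulting band visualizes the degree of variability in a developmental trajectory. The sketch below is a minimal illustration with invented weekly accuracy scores, not data from any of the studies cited.

```python
# Minimal sketch of a min-max graph computation for visualizing variability.
# The scores are invented for illustration, not data from the cited studies.

def min_max_band(series, window=3):
    """Return (mins, maxs): moving min/max over a centered window of +/- `window`."""
    mins, maxs = [], []
    for i in range(len(series)):
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        segment = series[lo:hi]
        mins.append(min(segment))
        maxs.append(max(segment))
    return mins, maxs

# Hypothetical weekly accuracy scores for one learner
scores = [0.42, 0.55, 0.48, 0.61, 0.50, 0.72, 0.58, 0.69, 0.74, 0.70]
mins, maxs = min_max_band(scores)
bandwidth = [mx - mn for mn, mx in zip(mins, maxs)]  # wide band = high variability
```

Plotting `mins` and `maxs` around the raw scores yields the band whose width is inspected visually in time-intensive CDST studies.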
Another popular methodology for observing language development is retrodictive
modeling (Chan & Zhang, 2021; Evans & Larsen-Freeman, 2020; Nitta & Baba, 2018),
based on the idea that because what we observe has already changed, change can be
described retrospectively (Larsen-Freeman & Cameron, 2008). Retrodictive methods
such as process tracing have been used to study the development of language as well as

https://doi.org/10.1017/S0272263122000407 Published online by Cambridge University Press


Network analysis for modeling complex systems in SLA research 529

individual differences over time. For example, Papi and Hiver (2020) used process
tracing of retrospective interviews to examine changes in six learners’ motivational
principles and Amerstorfer (2020) used process tracing with a combination of class-
room observations and retrodictive interviews to explore five learners’ strategic L2
development. Some time-intensive CDST studies have also used the “idiodynamic
method,” a mixed-methods approach to studying affective and cognitive states
(MacIntyre, 2012). These time-intensive approaches have provided insights into non-
linear L2 developmental processes and intraindividual variation over time.

Relation-intensive methods
In comparison to the number of studies that have taken a time-intensive approach, far
fewer CDST studies have taken a relation-intensive approach, which involves exploring
the structure of relationships between people or variables within a system with cross-
sectional data. As previously mentioned, SLA researchers have noted how social
network analysis is a suitable methodology for CDST, for example to analyze relation-
ships between learners in a classroom, teacher networks, or school networks (Mercer,
2014). However, this discussion has been mostly theoretical, with very few empirical
studies using social network analysis from a CDST perspective. For example, although
some applied linguistics researchers have used social network analysis to map the
distribution of conversational topics of bilinguals in different contexts (Tiv et al., 2020)
and to assess the impact of social networks in study abroad contexts (Gautier, 2019;
Paradowski et al., 2021; Zappa-Hollman & Duff, 2014), these studies are not typically
informed by CDST.
While relation-intensive approaches can focus on person-to-person interactions,
they can also be used to analyze relations among psychological variables (Marchand &
Hilpert, 2018). Taking a variable-centered relation-intensive approach requires
researchers to engage with psychological constructs on a phenomenological level,
and to carefully consider whether their methodology can effectively model complex
patterns of relationships among variables. Some SLA researchers have discussed how
psychological constructs such as “the self” and L2 motivation can be conceptualized as
complex systems (Henry, 2014, 2017; Mercer, 2011a). In one study, Mercer (2011a)
took a relation-intensive approach to explore how the self-construct could be conceived
of as a complex system. Using qualitative data of a single case study, Mercer (2011a)
created a three-dimensional network-based model of a student’s self-concepts that she
felt to be the most “phenomenologically-real” representation of the data.
Besides this, few SLA researchers have attempted to model psychological constructs
as complex systems. There are a handful of CDST studies that are reminiscent of
relation-intensive approaches, which used quantitative methodologies often deemed
ill-suited for CDST. For example, conceptualizing L2 speech as a complex system, Saito
et al. (2020) investigated the effects that 30 different internal and external individual
differences had on the pronunciation of 110 L2 English speakers. Due to the large
number of variables included in their study, Saito et al. (2020) first conducted factor
analysis and then did regression analysis with the extracted factor scores on speech
ratings. In another study, Li et al. (2020) positioned themselves within a CDST
framework to explore the relationships between individual difference constructs
including foreign language classroom anxiety, foreign language enjoyment, self-
perceived achievement, and actual English achievement. To analyze data, Li et al.
(2020) conducted Pearson correlations to assess relationships between variables and
used multiple regression analysis to assess the combined effect of anxiety and enjoy-
ment on language achievement. While these two studies are fine cross-sectional studies
in their own right, CDST scholars have argued that methods such as zero-order
correlations and linear regression oversimplify the complex realities of how individual
differences influence second language development and have questioned the use of
cross-sectional datasets in CDST research (Al-Hoorie & Hiver, 2022; Hiver, 2014).
Overall, very few SLA researchers to date have used relation-intensive approaches
within a CDST paradigm. There are also seemingly fewer methodologies available for
SLA researchers to explore relation-intensive approaches, with more conceptual dis-
cussion than empirical studies.

Time-relation intensive methods


Hilpert and Marchand (2018, p. 192) describe time-relation intensive research designs
as having “closely spaced, simultaneously collected observations of both within-
element change and changing between element relationships.” Only a few SLA studies
have analyzed interactions between variables and how these interactions change over
time. However, these studies cannot be strictly classified as time-relation intensive
approaches, as their data collection consisted of only a few time points. For example,
Serafini (2017) conducted longitudinal case studies to explore interactions between
cognitive and motivational individual differences at varying proficiency levels. Data
was collected twice from 87 university students learning L2 Spanish, at the beginning
and end of an academic semester. Serafini used Pearson correlations to analyze
associations between individual differences at each time point and created scatterplots
with regression and Loess lines to visualize relationships between variables and
compare differences across proficiency levels. Results showed that the relationship
between cognitive abilities and motivational constructs varied at each time point and
across learner proficiency levels, indicating that cognitive and motivational subsystems
are interdependent. In another study, Piniel and Csizér (2014) investigated changes in
21 students’ motivation, anxiety, and self-efficacy at six time points throughout an
academic writing course. To analyze data, Piniel and Csizér used latent growth curve
modeling (LGCM) and cluster analysis to group together learners with similar trajec-
tories. Interactions between variables were also analyzed by comparing Pearson cor-
relations between IDs at each time point. Overall, results indicated that language
learning experience, ought-to L2 self, and writing anxiety showed a significant level
of nonlinear change over time. There was also a strong interrelationship between
motivation and anxiety, whereby more highly motivated learners had lower levels of
language learning anxiety.
Pfenninger and colleagues (Kliesch & Pfenninger, 2021; Pfenninger, 2020) have also
recently explored the use of generalized additive mixed modeling (GAMM) for a time-
relation intensive approach to SLA microdevelopment. GAMM is a type of analysis
used for time-series data that can consider nonlinear development, iterative processes,
and interdependency between variables (Pfenninger, 2020). Pfenninger (2020) used
GAMM to analyze the L2 developmental trajectories of four groups of children (N =
91) on different content and language integrated learning (CLIL) programs. The
children completed various language tasks four times a year for up to 8 years. Pfen-
ninger also combined GAMM with qualitative data to help identify what contributed to
developmental trajectories. Results showed that children had similar L2 trajectories
regardless of their age of onset, and that L2 growth was determined by various
external and internal states across time. In another study, Kliesch and Pfenninger
(2021) used GAMM to examine the L2 developmental trajectories of 28 adults (age
64+) on a 7-month beginner’s Spanish course. Data was collected each week over 30–
32 weeks, which included seven L2 measures, eight cognitive tasks, and measures of
well-being and motivation. GAMM revealed both linear and nonlinear increases in L2
proficiency over time, with considerable between-subject variability. While only a few
CDST studies have used time-relation intensive methods, findings indicate a complex
interplay between external and internal learner differences, which in turn interact with
language development in a nonlinear way over time.

Expanding our research agenda


CDST studies that have incorporated a relation-intensive element to their research
design are far less common compared to the number of studies that have taken time-
intensive approaches. Despite the fact that “complexity theorists are interested in
understanding the relations [emphasis in original] that connect the components of a
complex system” (Hiver & Larsen-Freeman, 2019, p. 287), to date there have been very
few attempts to empirically model these relations. One potential reason behind this
relates to methodological challenges and the view that cross-sectional data, zero-order
correlations and linear regression are ill-suited to studying complex systems (Al-Hoorie
& Hiver, 2022). Another reason relates to the theoretical challenges of conceptualizing
abstract psychological constructs as complex systems. A number of individual differ-
ences constructs in language learning have been conceptualized as complex systems,
such as motivation (Papi & Hiver, 2020), strategy development (Amerstorfer, 2020),
anxiety (Gregersen, 2020), working memory (Jackson, 2020), and willingness to
communicate (MacIntyre, 2020). To examine these constructs from a relation-
intensive perspective, for example to model L2 motivation as a complex system, we
must consider the components that form the system, and how these components align
with our measurement instruments. Researchers must also confront “the boundary
problem” (Larsen-Freeman, 2017), accepting the theoretical impossibility of measuring
a complex system in its entirety, whereby “the whole is greater than the sum of its parts”
(Han, 2019, p. 156). Consideration should also be given to the phenomenological
validity of equating conceptual and theoretical concepts as systems, and the practical
implications this has for a chosen methodology (Hiver & Al-Hoorie, 2016). Mercer
(2011b, p. 59) discusses these issues in relation to her network-based model of the self-
concept, acknowledging the theoretical and empirical difficulty of distinguishing the
blurred boundaries between different self-constructs. Despite the challenges of explor-
ing psychological constructs related to language learning from a relation-intensive
approach, and the inevitable reductionism this entails, focusing on system structures
can offer a perspective that is currently missing from CDST research in SLA.
Take the construct of L2 motivation, for example, which has been much discussed in
CDST research (e.g., Dörnyei, 2017, Dörnyei et al., 2015; Henry, 2014, 2017; Hiver &
Papi, 2019; Hiver & Larsen-Freeman, 2019; Papi & Hiver, 2020). Most CDST research
on L2 motivation has been time-intensive with a focus on observing micro-level
changes in a small number of variables over time. Very few CDST researchers have
explored L2 motivation from a relation-intensive perspective, although there has been
some theoretical discussion of how to conceptualize the structural relationships
between motivational constructs as complex systems (Henry, 2014, 2017). The L2
Motivational Self System (L2MSS) is a theoretical paradigm, developed by
Dörnyei (2005, 2009), that conceptualizes L2 motivation from a self-perspective. The
L2MSS comprises three phenomenologically constructed concepts, each theorized
to be a primary source of motivation to learn an L2: the Ideal L2 Self, the Ought-to L2
Self, and L2 Learning Experience (Dörnyei 2005, 2009). Although the L2MSS was not
originally conceptualized as a complex system, it has been conceptually extended to a
CDST paradigm (Henry, 2014, 2017). For example, Henry (2017, p. 551) has described
the self-concept as a multifaceted dynamic structure, which can be understood as “the
product of constant interactions between different subsystems (such as, e.g., self-
efficacy and self-esteem).”
Taking a relation-intensive approach to L2 motivation could provide insight into
the structural relationships between components of the L2 motivational system, and if
this were expanded to a time-relation intensive approach, could potentially identify
attractor states. SLA researchers have speculated about how the L2 self-system, in
particular the Ideal L2 self, can manifest as an attractor state (Henry, 2017; Hiver,
2014; Waninge et al., 2014), whereby “changes in the vision of the Ideal L2 Self and
changes in the distance between it and the actual self, can be conceptualized as
changes in attractor state geometries” (Henry, 2014, p. 87, emphasis in original).
Although longitudinal data is needed to show system self-organization and the
emergence of attractor states, cross-sectional data can provide a perspective that is
currently missing from CDST research in SLA. As Mercer (2011a) reflects in relation
to her network-based model of the self-concept:

Whilst the model out of necessity can only represent a snapshot of a fragment of
an individual’s self-concept network in a specific context at a particular time,
the essence of the underlying form can be used to fundamentally understand
the structure and nature of self-concept. (p. 66)
Taking a relation-intensive CDST approach to the study of individual differences in
SLA can thus be viewed as complementary to time-intensive approaches. Cross-
sectional data can provide insight into the structure of relationships between system
components, which, if combined with what we have learned from time-intensive CDST
studies, could enrich our understanding of the complex interplay between individual
differences and L2 development.
There is currently little guidance on how to analyze and model interactions between
system components from a relation-intensive perspective. As previously mentioned,
CDST researchers have questioned whether methods such as zero-order correlations
and linear regression are suitable for examining dynamic changes and interconnectedness
(Al-Hoorie & Hiver, 2022). Although scholars have emphasized the potential of
quantitative analyses for CDST research (Al-Hoorie & Hiver, 2022), for example to
identify network structure or nested phenomena, there appears to be an overall
reluctance to use cross-sectional data, with most CDST researchers preferring longi-
tudinal data. Until now, most studies that have taken a relation-intensive approach
have analyzed relationships between variables by correlations and multiple regression
analysis (Li et al., 2020; Piniel & Csizér, 2014; Saito et al., 2020; Serafini, 2017). However,
new advancements in statistics software and data analysis techniques such as GAMM
are enriching the CDST toolbox. Other techniques that have been proposed as appro-
priate methods to study complex systems with a relation-intensive element are latent
growth curve modeling (LGCM) and multilevel modeling (MLM) (Hiver & Al-Hoorie,
2020a; MacIntyre et al., 2017). To expand our CDST toolbox of relation-intensive
approaches, we could also utilize network analysis, an underexplored methodology in
SLA research.


Network analysis
Network analysis has become a popular technique for studying complex systems in the
field of psychology. Readers should be aware that there are many different types of
network models; network analysis can be performed on cross-sectional data from a
relation-intensive perspective (Epskamp & Fried, 2019; Hevey, 2018), and also on
longitudinal time-series data from a time- or time-relation intensive perspective
(Bringmann et al., 2013). Although we outline some other variants of network analysis
later in the discussion section, it is beyond the scope of this article to describe each type
of network analysis in detail. We have opted to focus on psychological networks with
cross-sectional data for relation-intensive approaches, which is an underexplored
dimension of CDST research in applied linguistics.
As readers may be more familiar with social network analysis, we would also like to
briefly explain some differences between social networks and psychological networks.
Social networks show patterns of relationships among individuals or groups, whereas
psychological networks show patterns of relationships among variables (at item level or
composite level). It is important to note that with social networks, the relationships
between variables are known; social networks are created from an adjacency matrix,
whereby the relationships between variables are directly observed (O’Malley & Onnela,
2019). In contrast, with psychological networks, relationships between variables are not
known but are estimated. Psychological networks are estimated from a variance-
covariance matrix, based on the strength of partial correlations between variables
(Epskamp & Fried, 2019).
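The contrast can be made concrete with a toy example (all names and numbers below are invented): a social network’s ties are read directly from an adjacency matrix, whereas a psychological network starts only from a variance-covariance matrix of measured variables, from which edges must then be estimated.

```python
import numpy as np

# Hypothetical social network of four learners: ties are directly observed.
# adjacency[i, j] = 1 means learner i and learner j interact.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]])
degree = adjacency.sum(axis=1)  # observed number of ties per learner

# Hypothetical questionnaire scores on three variables: here no relationships
# are observed directly; all we have is the variance-covariance matrix.
rng = np.random.default_rng(42)
anxiety = rng.normal(size=1000)
enjoyment = -0.5 * anxiety + rng.normal(size=1000)
effort = 0.6 * enjoyment + rng.normal(size=1000)
data = np.column_stack([anxiety, enjoyment, effort])
cov = np.cov(data, rowvar=False)  # input for estimating a psychological network
```

The variable names (anxiety, enjoyment, effort) are purely illustrative placeholders for questionnaire constructs.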
Psychological network analysis has been used to model constructs such as intelli-
gence (van der Maas et al., 2006, 2017), cognitive development (Kievit, 2020), and
mental disorders (Borsboom, 2017) from a CDST perspective, and has also been
applied to clinical research on psychological disorders such as depression and eating
disorders (Elliott et al., 2020; Lutz et al., 2018). In network models, variables (also
referred to as components) are represented as circles called nodes. In psychological
networks, nodes represent elements of a construct or an entire construct, such as
attitudes or symptoms of a mental disorder. Lines between nodes are called edges, which
represent the direct association between a pair of nodes. The strength of association
between nodes is called the edge weight; the thicker the edge, the stronger the associ-
ation. Edges in psychological networks are typically undirected, which reflect the
hypothesized multicausal relationships between system components. Positive relation-
ships are typically denoted using blue edges, while red edges are used to indicate
negative relationships. The layout of the network model can be selected by the
researcher. Psychological networks are often plotted (by default) using the
Fruchterman-Reingold algorithm (Fruchterman & Reingold, 1991), which places
nodes with stronger connections closer together, and nodes with weaker connections
further apart. Besides visual inspection, network models can be analyzed on several
different levels, depending on the research questions. For example, researchers
typically analyze the overall network density if the interest is in the network structure, or
focus on particular nodes and edges (Burger et al., 2022).
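As a small illustration of these network-level and node-level quantities, the sketch below computes density and node strength from a hand-specified (entirely invented) matrix of edge weights, with signs mirroring the blue/red edge convention described above.

```python
# Invented symmetric edge-weight matrix for a 4-node psychological network.
# weights[i][j] is the association between nodes i and j
# (positive = "blue" edge, negative = "red" edge, 0 = no edge).
weights = [
    [0.00, 0.35, 0.00, -0.20],
    [0.35, 0.00, 0.15, 0.00],
    [0.00, 0.15, 0.00, 0.40],
    [-0.20, 0.00, 0.40, 0.00],
]
n = len(weights)

# Density: proportion of possible edges that are present (nonzero).
possible = n * (n - 1) / 2
present = sum(1 for i in range(n) for j in range(i + 1, n) if weights[i][j] != 0)
density = present / possible

# Node strength: sum of absolute edge weights attached to each node.
strength = [sum(abs(w) for w in row) for row in weights]
```

Here 4 of the 6 possible edges are present, and node 3 has the highest strength because of its two relatively heavy edges.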
The most common models used to estimate psychological networks are pairwise
Markov random field (PMRF) models. Within PMRF models, Gaussian graphical
models (GGM) are used with continuous multivariate data to estimate partial corre-
lations between variables (Epskamp, 2014). Partial correlation networks are undirected
graphs, estimated by analyzing the strength of correlations between variables after
controlling for the effect of other measured variables in the network (Hevey, 2018). As
such, a psychological network can be viewed as a “nomological net, which functions as a
specification of the phenomenological concepts or theoretical constructs of interest in a
study, their observable manifestations, and the linkages between them” (Hiver &
Al-Hoorie, 2016, p. 747). Psychological networks created using cross-sectional data
can therefore be viewed as a snapshot of the system at a given time.
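The idea of edges as associations “after controlling for” all other measured variables can be demonstrated with simulated data (purely illustrative; the standard workflow uses R packages such as qgraph, and the coefficients below are invented). In a chain where a influences b and b influences c, a and c are substantially correlated at zero order, yet their partial correlation, computed here from the inverse covariance (precision) matrix, is near zero: the network draws no direct a–c edge.

```python
import numpy as np

# Simulate a chain system: a -> b -> c (coefficients are invented).
rng = np.random.default_rng(0)
n = 20000
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 0.8 * b + 0.6 * rng.normal(size=n)
data = np.column_stack([a, b, c])

# Zero-order correlations: a and c look related (indirect association).
zero_order = np.corrcoef(data, rowvar=False)

# Partial correlations from the precision (inverse covariance) matrix:
# pcor[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j])
precision = np.linalg.inv(np.cov(data, rowvar=False))
d = np.sqrt(np.diag(precision))
pcor = -precision / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)
```

The off-diagonal entries of `pcor` are the edge weights of the estimated network: strong a–b and b–c edges remain, while the spurious a–c association vanishes once b is controlled for.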

Network analysis and CDST


Network analysis has some advantages over other relation-intensive methods used in
CDST research. One advantage is that network analysis is more conceptually aligned
with CDST compared to factor-based statistical techniques that are rooted in latent
variable theory (Fried, 2020). Originally developed by Spearman (1904), factor models
function under the theoretical assumption that a latent construct, such as intelligence or
personality, can be measured through observable indicators (e.g., behavioral tests or
questionnaire items). This means that there is a hypothesized unidirectional relation-
ship from the latent construct to the observable indicator, whereby answers to ques-
tionnaire items or tests are thought to “reflect” the latent construct (Edwards & Bagozzi,
2000). In contrast, from a network perspective, psychological constructs “exist as
systems where components mutually influence each other without the need to call on
latent variables” (Guyon et al., 2017, p. 2). Statistically, factor models and psychological
networks are closely related, as both analyze the covariance between observed variables.
The difference between each approach is their competing causal explanations (Fried,
2020). As van Bork et al. (2019, p. 1) explain, “whereas latent variable approaches
introduce unobserved common causes to explain the relations among observed vari-
ables, network approaches posit direct causal relations between observed variables.”
These two competing causal explanations are reflected in the choice of statistical
model selected by the researcher. For example, factor-based techniques such as SEM or
LGCM generate directed graphs, with edges from the latent construct to the observed
indicators and/or between latent constructs, which are determined by the researcher a
priori. Psychological network analysis is a more data-driven approach and produces an
undirected graph with edges estimated between all nodes, better reflecting key CDST
concepts such as multicausality and interconnectedness. This has already been noted in
the field of psychology, where researchers working from a CDST perspective are using
network analysis as an exploratory tool to better visualize the complex patterns of
relations between variables of interest (Hilpert & Marchand, 2018; Sachisthal et al.,
2019; van der Maas et al., 2017).
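This statistical kinship, and the causal ambiguity behind it, can be seen in a small simulation (again purely illustrative, with invented loadings): data generated by a single latent common cause yield uniformly positive partial correlations among the indicators, i.e., a densely connected undirected network, so the covariance structure alone cannot adjudicate between a reflective latent variable and mutually reinforcing components.

```python
import numpy as np

# Generate indicator data from a single latent common cause (reflective model).
rng = np.random.default_rng(7)
n = 20000
latent = rng.normal(size=n)
indicators = np.column_stack(
    [latent + rng.normal(size=n) for _ in range(4)]  # equal loadings, invented
)

# Estimate the corresponding partial-correlation network.
precision = np.linalg.inv(np.cov(indicators, rowvar=False))
d = np.sqrt(np.diag(precision))
pcor = -precision / np.outer(d, d)
np.fill_diagonal(pcor, 0.0)

# Every off-diagonal edge is positive: the network is fully connected,
# even though the data were generated by a latent factor model.
all_edges_positive = bool((pcor[np.triu_indices(4, k=1)] > 0).all())
```

Both a one-factor model and this fully connected network reproduce the same covariance matrix; the difference lies in the causal story each imposes, as discussed above.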
In this article, we explore how network analysis could be used to model psycholog-
ical constructs that influence language learning from a relation-intensive perspective.
We provide two examples of psychological networks created using the datasets of
existing studies that are publicly available online in support of Open Science practices.
As the nested nature of educational phenomena can be analyzed at multiple levels
(Marchand & Hilpert, 2018), our network models illustrate two different levels of
analysis: with nodes at item level and composite level. The first example is a network
model of L2 motivation made using the dataset from Hiver and Al-Hoorie’s (2020b)
study on the role of vision in L2 motivation. This example explores how an individual
difference construct such as L2 motivation can be modeled as a complex system, by
analyzing relationships between the L2MSS at the item level. The second example is a
network model of individual differences in native language ultimate attainment, made
using the dataset from Dąbrowska’s (2018) study. The second example takes a wider
relation-intensive perspective by analyzing interactions between multiple individual
difference constructs at the composite level. Note that the authors of the original studies
(Dąbrowska, 2018; Hiver & Al-Hoorie, 2020b) did not position their research within a
CDST paradigm, and our reanalysis of their data is not a critique of their work.
We performed all statistical analyses using the open-source software R (R Core
Team, 2020), in particular the R-packages qgraph (Epskamp et al., 2012) and bootnet
(Epskamp et al., 2018a).
available in the online Supplementary Materials on our Open Science Framework
(OSF) page. This article is not intended to serve as a tutorial in network analysis (for
tutorials, we refer readers to Burger et al., 2022; Epskamp et al., 2018a; and Hevey,
2018). Rather, our overall aim is to raise awareness of this methodology and illustrate
how it can be applied to model psychological constructs related to language learning
from a relation-intensive CDST perspective. Within each example, we evaluate (a) the
extent to which a network analysis of the datasets supports the same conclusions as the
original authors and (b) whether network analysis can offer any additional insights to
the original analyses.

Example 1
The first example was made using the dataset from Hiver and Al-Hoorie’s (2020b)
study “Reexamining the Role of Vision in Second Language Motivation: A Preregis-
tered Conceptual Replication of You, Dörnyei, and Csizér (2016).” Both Hiver and
Al-Hoorie (2020b) and You et al. (2016) used SEM to explore interrelationships
between components of the L2 Motivational Self System (L2MSS). The L2MSS is a
theoretical paradigm that was developed by Dörnyei (2005, 2009) based on Possible
Selves Theory (Markus & Nurius, 1986). The L2MSS comprises three components,
each theorized to be a primary source of motivation to learn an L2: the Ideal L2 Self, the
Ought-to L2 Self, and L2 Learning Experience. The ideal L2 self refers to learners’
internal desires and wishes to learn the L2, while the ought-to L2 self refers to learners’
perceived external duties and social pressures to learn the L2 (Dörnyei & Chan, 2013).
L2 experience concerns learners’ attitudes toward learning, based on their experience of
the learning process and environment. In addition to these three components, vision
and imagery are also considered key aspects of the L2MSS, whereby motivation is
viewed as “a function of the language learners’ vision of their desired future language
selves” (Dörnyei & Chan, 2013, p. 437). Vision can be considered as a combination of
imagery capacity and ideal selves and is typically measured by visual and auditory
learning style preferences, and vividness of imagery capacity (You et al., 2016). A
number of studies have used SEM to explore the interrelationships between these
motivational constructs and the extent to which the L2MSS can predict language
learning or intended effort (Dörnyei & Chan, 2013; Hiver & Al-Hoorie, 2020b; You
et al., 2016). However, as You et al. (2016, p. 97) have pointed out, “because the L2
Motivational Self System was originally proposed as a framework with no directional
links among the three components, past empirical studies employing SEM have not
been uniform in specifying these interrelationships.” For example, whereas some
studies have presented a directed pathway from the ideal L2 self to L2 learning
experience, other studies have reversed this relationship (for further details see You
et al., 2016).
Hiver and Al-Hoorie (2020b) conducted a conceptual replication and extension of
You et al. (2016) to evaluate the role of vision in L2 motivation and to assess whether
intended effort is an outcome or a predictor of motivation. They justified these aims in

536 Lani Freeborn et al.

part due to the fact that You et al. did not test equivalent or competing models, which
could be considered a form of confirmation bias. Hiver and Al-Hoorie also stressed the
need for more robust research designs, and further replication of research on language
motivation. Hiver and Al-Hoorie (2020b) collected data from 1297 L2 learners of
English in secondary schools in South Korea. In addition to the same 10 scales of
motivation and vision used by You et al., Hiver and Al-Hoorie also included two
measures of L2 proficiency, midterm grades and final exam grades, which were
analyzed as one variable called L2 achievement. To determine the number of under-
lying factors, they submitted the dataset to Mokken scaling analysis, confirmatory
factor analysis, exploratory factor analysis, scree plot, optimal coordinates, and parallel
analysis (Hiver & Al-Hoorie, 2020b, p. 73). These analyses resulted in only four factors:
visual style, ideal L2 self, ought-to L2 self, and intended effort. With these four factors
and the measures of L2 achievement, Hiver and Al-Hoorie used SEM to test two
competing causal models of vision and L2 motivation, where intended effort was either
an antecedent or an outcome of motivation. Contrary to You et al. (2016), Hiver and
Al-Hoorie hypothesized intended effort to be an antecedent of the ideal L2 self and the
ought-to L2 self. In both competing models, vision (visual style) was considered a
predictor of motivation, as in You et al. (2016). Results indicated that
the model with intended effort as a predictor of motivation showed a better overall fit.
Although this was contrary to You et al.’s model, Hiver and Al-Hoorie note that as their
dataset and analyses differed greatly from the initial study, their model cannot be used
to contradict You et al.’s model and call for further replication of research on the
L2MSS.
In both studies (Hiver & Al-Hoorie, 2020b; You et al., 2016), the authors were
interested in the relationships between the L2MSS, vision, and intended effort. By using
SEM, they operationalized motivational constructs as latent variables, depicting
hypothesized causal relationships between latent constructs with unidirectional arrows.
However, in both studies, the authors note potential issues and limitations of using SEM
to model interactions between motivational constructs. One issue relates to the theo-
rized dynamic nature of the L2MSS and the multicausal relationships between moti-
vational constructs. Possible Selves Theory was originally proposed to have dynamic
qualities, whereby current and ideal selves are shaped by multiple ongoing processes
(Henry, 2014; Markus & Nurius, 1986). For example, Hiver and Al-Hoorie speculate
that once an L2 learner puts in the effort and engages in the L2 learning process, “there
will be a dynamic interaction between motivation … and task demands, leading to
continuous recalibration of that motivational construct” (2020b, p. 86). One might
question the extent to which SEM can effectively model these dynamic interactions, as
SEM operationalizes motivational constructs as latent variables with a unidirectional
causal relationship. In fact, both studies’ authors acknowledge that a further limitation
of SEM is that it requires the researcher to specify the direction of the relationship
between latent constructs. SEM can only test the theoretical model that is selected by the
researcher, although equivalent or alternative models may likely exist. This issue was
illustrated by Hiver and Al-Hoorie’s (2020b) two competing SEM models. As discussed
earlier, there has already been discussion of how the L2MSS could be conceptualized as
a complex system (Henry, 2014, 2017) and manifest as an attractor state (Henry, 2017;
Hiver, 2014; Waninge et al., 2014). From a CDST perspective, causal relationships
between motivational constructs are not unidirectional, but reciprocal. To further
investigate the relationship between motivation and intended effort, Hiver and
Al-Hoorie (2020b) have encouraged researchers to consider using nonrecursive models
where causality is reciprocal. In this first example, we illustrate how network analysis
can be used to model the L2MSS as a complex system, with hypothesized reciprocal
causation between motivational constructs and with nodes at the item level.

Network estimation and visualization


Figure 1 is a GGM of the L2MSS that we made with the dataset from Hiver and
Al-Hoorie’s (2020b) study. In support of open science practices, they made their dataset
and analyses publicly available through the OSF website. To allow for ease of compar-
ison, we included the same variables in our network analysis as Hiver and Al-Hoorie’s
SEM analyses, with the exception of visual style 2, which we explain in a later section.
The network model in Figure 1 has nodes at item level, to better explore the interre-
latedness of these motivational constructs, and the questionnaire items used to measure
them. Table 1 contains information about which items correspond to each node.
We chose GGM model selection (ggmModSelect function implemented in the
bootnet R-package; Epskamp et al., 2018a) as the estimation method because of the large
size of the dataset. Model search works by setting edges to zero and using a stepwise
algorithm to continuously estimate the model until the optimal model is identified
(Epskamp, 2014). This technique uses the Bayesian information criterion (BIC),
obtained through maximum likelihood estimation, to select a sparse model.
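Although our analyses were run in R, the core computation behind a GGM, deriving partial correlations from the inverse of the correlation matrix, can be sketched in a few lines of Python (an illustrative reimplementation of the general idea, not the ggmModSelect algorithm itself; the toy data are our own):

```python
import numpy as np

def partial_correlations(data):
    """Partial-correlation matrix underlying a GGM: edge (i, j) is the
    correlation between variables i and j after conditioning on all other
    variables, computed from the precision (inverse correlation) matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 0.0)        # no self-loops in the network
    return pcor

# Toy check on a chain X -> Y -> Z: X and Z are marginally correlated,
# but conditionally independent given Y, so the X-Z edge should vanish.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(size=5000)
z = 0.8 * y + rng.normal(size=5000)
data = np.column_stack([x, y, z])

pcor = partial_correlations(data)
marginal = np.corrcoef(data, rowvar=False)
```

This is why a GGM is sparser and more interpretable than a correlation network: the spurious X–Z association disappears once Y is conditioned on, leaving only the direct edges.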

Figure 1. A network model of the L2MSS and L2 achievement.


Note: In this network model of the L2MSS, there are four motivational constructs: the ideal L2 self, the
ought-to L2 self, intended effort, and visual style. Each node represents a questionnaire item. Ought-to L2
self has been measured with six questionnaire items, and the other motivational constructs with five
questionnaire items. There are also two composite measures of L2 proficiency: L2_T1 (students’ mid-term
grades) and L2_T2 (students’ final grades).

Table 1. Legend of node labels
Node labels Items

L2 achievement
L2_T1 Mid-term grades
L2_T2 End of term grades
Ideal L2 self
IS1 I can imagine myself speaking English in the future with foreign friends at parties
IS2 I can imagine myself in the future giving an English speech successfully to the
public
IS3 I can imagine a situation in which I am doing business with foreigners by speaking
English
IS4 I can imagine myself speaking English in the future having a discussion with foreign
friends in English
IS5 I can imagine that in the future in a café with light music, a foreign friend and I will
be chatting in English casually over a cup of coffee
Ought-to L2 self
OS1 Studying English is important to me to gain the approval of my teachers
OS2 Studying English is important to me to gain the approval of my peers
OS3 Studying English is important to me to gain the approval of the society
OS4 I study English because close friends of mine think it is important
OS5 I consider learning English important because the people I respect think that I
should do it
OS6 My parents/family believe that I must study English to be an educated person
Visual style
VS1 I use color coding (e.g., highlighter pen) to help me as I learn
VS2 Charts, diagrams, and maps help me understand what someone says
VS3 When I listen to a teacher, I imagine pictures, numbers, or words
VS4 I highlight the text in different colors when I study English
VS5 I learn better by reading what the teacher writes on the board
Intended effort
IE1 I am prepared to expend a lot of effort in learning English
IE2 I find learning English really interesting
IE3 I would like to concentrate on studying English more than any other topic
IE4 Even if I failed in English learning, I would still learn English very hard
IE5 English would still be important to me in the future even if I failed in my English
course

In their original analyses, Hiver and Al-Hoorie (2020b) tested for normality and
found that their data were not multivariate normal in both skewness and kurtosis. For
this reason, the network was estimated using Spearman correlations (Epskamp, 2014).
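Spearman correlations are simply Pearson correlations computed on rank-transformed data, which makes them robust to monotone departures from normality. A minimal Python sketch (illustrative only; it ignores tie handling, which dedicated routines such as scipy.stats.spearmanr provide):

```python
import numpy as np

def spearman_matrix(data):
    """Spearman correlations: Pearson correlations of rank-transformed
    columns (continuous data assumed, so ties are not handled)."""
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)  # double argsort = ranks
    return np.corrcoef(ranks, rowvar=False)

# Spearman is invariant under monotone transformations, so skewed data
# do not distort it the way they can distort Pearson correlations.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = np.exp(x)                       # monotone but strongly skewed transform
spear = spearman_matrix(np.column_stack([x, y]))[0, 1]
pearson = np.corrcoef(x, y)[0, 1]
```

Here the Spearman coefficient is 1 (the transform preserves rank order exactly), while the Pearson coefficient is pulled below 1 by the skew.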
After estimating the model, we evaluated the stability of the network structure in
terms of edge-weight accuracy using bootstrapping (see Epskamp et al., 2018a for an
in-depth explanation of bootstrapping in psychological networks). We used 5,000
samples of the nonparametric bootstrap to assess the variability of the edge-weights.
This step should always be performed (Epskamp et al., 2018a) as any interpretation of
the network becomes limited if the network is unstable (Burger et al., 2022). The results
show a good overlap between the estimated model and the bootstrapped edge-weights,
indicating that the network of Figure 1 is stable. The results of the nonparametric
bootstrap can be viewed in the supplementary materials.
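The logic of the nonparametric bootstrap for edge-weight accuracy can be sketched as follows (an illustrative Python reimplementation for a single edge, not bootnet's actual routine; the toy data and function names are our own):

```python
import numpy as np

def pcor_edge(data, i, j):
    """Edge weight (partial correlation) between columns i and j."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    return -precision[i, j] / np.sqrt(precision[i, i] * precision[j, j])

def bootstrap_edge_ci(data, i, j, n_boot=1000, seed=0):
    """Resample participants with replacement, re-estimate the edge each
    time, and return a 95% percentile confidence interval."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    boots = [pcor_edge(data[rng.integers(0, n, n)], i, j) for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])

# Toy data: a genuine edge between columns 0 and 1, plus a noise variable.
rng = np.random.default_rng(42)
x = rng.normal(size=400)
y = 0.7 * x + rng.normal(size=400)
w = rng.normal(size=400)
data = np.column_stack([x, y, w])

estimate = pcor_edge(data, 0, 1)
lo, hi = bootstrap_edge_ci(data, 0, 1)
```

A narrow interval that excludes zero, as for the x–y edge here, is the kind of "good overlap" between estimated and bootstrapped edge weights that supports interpreting the network.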
To assess the stability of the centrality coefficients, we again used bootstrapping. We
used the case-dropping bootstrap, specifically developed for this purpose (Epskamp et al.,
2018a). The case-dropping bootstrap assesses the stability of the order of centrality in
subsets of the data, that is, after systematically dropping an increasing percentage of
participants from the dataset. The centrality stability for “strength” centrality was

https://doi.org/10.1017/S0272263122000407 Published online by Cambridge University Press


Network analysis for modeling complex systems in SLA research 539

Figure 2. Centrality plots for the L2MSS network.


Note: Centrality plots for the network model of the L2MSS. Centrality measures are shown as standardized
z-scores. The raw centrality indices can be found in the online Supplementary Materials.

estimated on a sample of 5,000 bootstraps, which resulted in a correlation stability


coefficient (CS-coefficient) of 0.52 for the “strength” centrality. This is above the 0.5
(CS-coefficient) recommendation (Epskamp et al., 2018a), which is why we conclude
that the stability of node centrality in this network model is good. The results are
presented in the supplementary materials.
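The case-dropping logic behind the CS-coefficient can be sketched as follows (again an illustrative Python reimplementation rather than bootnet's implementation; following Epskamp et al., 2018a, the CS-coefficient is the largest drop proportion at which at least 95% of subsamples still correlate at or above 0.7 with the full-sample centrality order):

```python
import numpy as np

def strength_centrality(data):
    """Node strength: sum of absolute partial-correlation edge weights."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 0.0)
    return np.abs(pcor).sum(axis=0)

def cs_coefficient(data, n_boot=200, threshold=0.7, seed=0):
    """Largest proportion of cases that can be dropped while >= 95% of
    subsamples still correlate >= `threshold` with the full-sample
    centrality values (a sketch of the CS-coefficient)."""
    rng = np.random.default_rng(seed)
    full = strength_centrality(data)
    n = data.shape[0]
    cs = 0.0
    for drop in np.arange(0.1, 0.75, 0.1):
        keep = int(round(n * (1 - drop)))
        cors = [np.corrcoef(full, strength_centrality(
                    data[rng.choice(n, keep, replace=False)]))[0, 1]
                for _ in range(n_boot)]
        if np.mean(np.asarray(cors) >= threshold) >= 0.95:
            cs = round(drop, 1)
        else:
            break
    return cs

# Toy dataset with a clear chain structure, so centrality is well defined.
rng = np.random.default_rng(11)
x = rng.normal(size=600)
y = 0.8 * x + rng.normal(size=600)
z = 0.8 * y + rng.normal(size=600)
w = rng.normal(size=600)              # isolated noise variable
cs = cs_coefficient(np.column_stack([x, y, z, w]))
```

With a strong, well-sampled structure like this toy chain, centrality rankings survive substantial case dropping, yielding a CS-coefficient well above the 0.5 guideline.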
Based on the centrality indices (see Figure 2), ought-to L2 self 2 is the most central
component in the network model in Figure 1 in terms of node strength, followed by
intended effort 5. The questionnaire items that correspond to these components are
“Studying English is important to me to gain the approval of my peers” and “English
would still be important to me in the future even if I failed in my English course.” This
suggests that peer approval and perceived future importance of English play important
roles in L2 motivation, as they have the strongest direct relationships with other
motivational constructs in the system.
In addition to node strength, we also computed node centrality indices based on
closeness and betweenness. The closeness index “indicates a short average distance of a
specific node to all other nodes” (Hevey, 2018, p. 311). In the network model, the nodes
with the highest closeness are the five intended effort nodes. This is an interesting finding
and indicates that intended effort may have an integral role in L2 motivation. Although
the role of central components is not yet fully understood, it is thought that central
nodes with high closeness are the most likely to both effect changes and be affected by
changes in the system (Hevey, 2018). The third measure of centrality, betweenness,
refers to how well one node connects other nodes together; nodes with high
betweenness lie on the shortest path between pairs of nodes. As shown in Figure 2, the
node with the highest betweenness is intended effort 5, followed by ought-to L2 self
5 and ideal L2 self 1.
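For readers unfamiliar with these indices, the following sketch computes strength, closeness, and betweenness for a small weighted network (illustrative Python, not qgraph's code; it follows the common convention of converting an edge weight w into a path length of 1/|w|, so stronger edges mean shorter distances):

```python
import numpy as np
from itertools import permutations

def centrality_indices(weights):
    """Strength, closeness, and betweenness for a weighted network."""
    n = weights.shape[0]
    strength = np.abs(weights).sum(axis=0)       # sum of absolute edge weights

    dist = np.full((n, n), np.inf)               # path length of an edge = 1/|w|
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i, i] = 0.0
        for j in range(n):
            if i != j and weights[i, j] != 0:
                dist[i, j] = 1.0 / abs(weights[i, j])
                nxt[i][j] = j
    for k in range(n):                           # Floyd-Warshall shortest paths
        for i in range(n):
            for j in range(n):
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
                    nxt[i][j] = nxt[i][k]

    closeness = 1.0 / dist.sum(axis=1)           # inverse total distance to others

    betweenness = np.zeros(n)                    # count shortest paths passing
    for s, t in permutations(range(n), 2):       # through each interior node
        node = nxt[s][t]
        while node is not None and node != t:
            betweenness[node] += 1
            node = nxt[node][t]
    return strength, closeness, betweenness

# Chain network A - B - C - D with equal edge weights of 0.5:
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 0.5

strength, closeness, betweenness = centrality_indices(W)
```

In this chain, the interior nodes B and C score highest on all three indices, while the end nodes A and D lie on no shortest paths at all, which illustrates why a node like intended effort 5 can top one index without topping the others.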

Interpreting the network model


Both You et al. (2016) and Hiver and Al-Hoorie (2020b) used SEM to evaluate the
relationships between the L2MSS, vision, and intended effort. The network model
contains four motivational constructs (ideal L2 self, ought-to L2 self, intended effort,
visual style) and the two measures of L2 achievement (midterm grades and final exam
grades). We can see the wider interconnectedness of components in the system, with
multiple interactions across different motivational constructs.
One of the first things we notice when looking at this network model is that,
although the motivational constructs are interrelated, there are only a few weak
edges between any of the motivational constructs and the two measures of L2
proficiency. For example, final grades have a weak partial correlation with ideal self
1 (0.10) and midterm grades have a weak negative partial correlation with ought-to
self 4 (–0.09). The network analysis results are consistent with Hiver and Al-Hoorie’s
(2020b) study, where the ideal L2 self was only a weak predictor of L2 achievement
(accounting for less than 1% of the variance), and the ought-to L2 self had almost no
predictive value.

Visual style
The visual style scale consists of five questionnaire items. As can be seen in Figure 1,
although the five measures of visual style are grouped together, the nodes are
not as tightly clustered as those measuring other constructs. In
Hiver and Al-Hoorie’s SEM analyses, they excluded visual style 2 to improve
convergent validity, and also note that this scale had the lowest reliability in You
et al.’s (2016) study. Removing items is typical with latent variable approaches, where
researchers drop variables that do not load onto factors or that cross-load onto multiple
factors (Fried, 2020). With network analysis, however, Fried (2020, p. 21) has pointed out that
“items that load onto two factors simultaneously make for the potentially most
interesting items because they may build causal bridges between two communities
of items.” Because of this, we decided to include visual style 2 in the network analysis.
The network model shows that visual style 2 is linked to three other nodes measuring
visual style, and also has a weak partial correlation with one measure of L2 achievement,
one measure of the ideal L2 self, and one measure of intended effort. While visual
style 2 was left out of the SEM analyses, results of the network analysis tentatively
suggest that this questionnaire item may function as a bridge node between other
motivational constructs. In both You et al. (2016) and Hiver and Al-Hoorie’s (2020b)
SEM analyses, they treated visual style as a predictor of the ideal L2 self and the ought-
to L2 self. The network model is an undirected graph, so our analyses cannot provide
additional insights into whether visual style is a predictor or outcome of motivation.
What the network analysis does provide is a more complex pattern of relationships
between visual style and other system components than the original analyses. The
nodes that measure visual style are partially correlated with components from all
other motivational constructs in the network, as well as one measure of language
achievement.

Intended effort
Besides the role of vision, You et al. and Hiver and Al-Hoorie were also interested in
the direction of the relationship between intended effort and the L2MSS. Hiver and
Al-Hoorie’s analyses of two competing SEM models showed that intended effort was
a better predictor of the ideal L2 self and ought-to L2 self than an outcome. Previous
research has provided empirical evidence for reciprocal causal relationships between
motivation and academic achievement (Vu et al., 2021). The network model shows
that components of intended effort are related to components of all other subsystems,
as well as L2 achievement, indicating a complex pattern of relationships. The results
of the centrality indices highlight the overall importance of intended effort in L2
motivation, as the five intended effort variables have the highest closeness index in the
network. Overall, intended effort 5 emerges as the most central component of the
network. This item refers to the statement “English would still be important to me in
the future even if I failed in my English course.” Intended effort 5 also has the highest
centrality in terms of betweenness, and the second highest in terms of closeness and
strength. The question surrounding the role of central components will be further
discussed later in this article.

Example 2
The second example illustrates how network analysis can be used to explore the
relationships between multiple individual differences using the dataset from Dąbrowska’s
(2018) study Experience, aptitude and individual differences in native language
ultimate attainment. The dataset is publicly available online through the IRIS Database.
The network model made from Dąbrowska’s (2018) dataset presents a different level of
analysis from the previous example. In contrast to the network model in example 1,
where each node represents a single questionnaire item, each node in the network
model in Figure 3 represents a distinct variable measured by aggregated task scores. In
the original study, Dąbrowska tested the assumption that adult native speakers tend to
converge on the same grammar. She addressed this question by considering two
opposing approaches to language acquisition: the usage-based perspective and the
modular perspective. From a usage-based perspective, language abilities are thought to
emerge out of interactions between general cognitive mechanisms and exposure to
linguistic input (Ellis & Wulff, 2018). From this perspective, “causal mechanisms
interact iteratively to produce what appears to be structure” (Bybee & Beckner, 2009,
p. 23). A usage-based approach is thus aligned with CDST, where linguistic knowledge
emerges as a network of interrelated and interacting components. In contrast, from a
modular perspective, language abilities are thought to stem from an innate universal
grammar, whereby different types of language knowledge rely on autonomous modules
within the mind (Tan & Shojamanesh, 2019).
Dąbrowska (2018) discusses the plausibility of these two theories in connection with
analyses of a dataset of 90 native English speakers’ performance on different linguistic
and nonlinguistic tasks. She first analyzed the amount of individual variation on six
tasks that measured grammatical comprehension, receptive vocabulary, collocations,
nonverbal IQ, language analytic ability, and print exposure. Full details regarding which
tests were used to measure each construct can be found in the original study. Dąb-
rowska then conducted Pearson correlations to explore interactions between the six
aforementioned tasks as well as education (measured by number of years spent in
education). This revealed several significant correlations between the measures of
language knowledge as well as between other variables. To determine potential causes
of individual differences in linguistic knowledge, Dąbrowska then conducted regression
analyses with the four predictor variables (nonverbal IQ, language analytic ability, print
exposure, and education) on each of the three measures of language knowledge.
Overall, results showed that nonverbal IQ was strongly related to grammar and
vocabulary, but not to collocations. Language analytic ability was also significantly
related to grammar and vocabulary, as well as several other variables. Print exposure
contributed more to vocabulary and collocations than to grammar, and education only
weakly predicted each measure of language knowledge. Based on the significant
correlations between the three measures of language knowledge and the fact that the
same nonlinguistic variables predicted different areas of language knowledge, Dąbrowska
concluded that these findings support a usage-based approach.

Figure 3. A network model of individual differences in native language ultimate attainment.

Note: The nodes in this network are composite scores representing three measures of language proficiency
and four individual differences. The three proficiency measures are receptive vocabulary, collocations, and
grammatical comprehension. The four individual differences are nonverbal IQ, print exposure, language
analytic ability, and years of education.

Network estimation and visualization


Figure 3 is a GGM of partial correlations that includes the same seven variables used in
Dąbrowska’s analyses. The network model was estimated using the “least absolute
shrinkage and selection operator” (LASSO), which is considered an appropriate
estimation method for smaller datasets (Epskamp et al., 2018a; Hevey, 2018). The
LASSO technique results in a sparser network, using only a relatively small number of
edges to explain the covariance structure (Epskamp et al., 2018b). This makes the
estimated model more interpretable and accurate, as very small edges are removed from
the estimated network (Epskamp et al., 2018a; Hevey, 2018). The LASSO applies a
regularization technique that is controlled by a tuning parameter. The tuning param-
eter was selected by minimizing the Extended Bayesian Information Criterion (EBIC),
for which we used the default setting of 0.5.
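The EBIC extends the ordinary BIC with an extra penalty on the number of edges (following Foygel & Drton's formulation, as used in LASSO-based network estimation). The selection logic can be sketched with hypothetical fit values (the numbers below are made up for illustration):

```python
import math

def ebic(log_likelihood, n_edges, n_obs, n_vars, gamma=0.5):
    """Extended BIC (Foygel & Drton, 2010) for a Gaussian graphical model.
    gamma = 0 reduces to the ordinary BIC; larger gamma favors sparser
    networks. The default of 0.5 matches the setting used above."""
    return (-2.0 * log_likelihood
            + n_edges * math.log(n_obs)
            + 4.0 * gamma * n_edges * math.log(n_vars))

# Hypothetical fits for three candidate tuning-parameter values, with
# n = 90 observations and p = 7 variables as in this dataset:
candidates = [(-510.0, 15),   # dense: best raw fit, many edges
              (-514.0, 8),    # moderate sparsity, slightly worse fit
              (-540.0, 3)]    # very sparse, clearly worse fit
scores = [ebic(ll, e, n_obs=90, n_vars=7) for ll, e in candidates]
best = scores.index(min(scores))
```

Here the moderately sparse model wins: its small loss in likelihood is outweighed by the edge penalty paid by the dense model, which is exactly the trade-off the tuning parameter controls.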
To assess network stability, we used a nonparametric bootstrap of 5,000 samples.
Bootstrapping results can be found in the supplementary materials. The bootstraps
show wide 95% confidence intervals, meaning that the estimated network structure is
not very stable and the found links should be interpreted with care. As such, our
discussion and interpretation of this network model is tentative, and a larger sample
size is needed to draw any strong conclusions. We did not compute centrality indices
for this dataset because the aim of this network analysis was to explore overall patterns
of relationships between variables, and because of the small number of variables in this
model.

Interpreting the network model


The network model in Figure 3 illustrates a complex system of interdependent
relationships between linguistic and nonlinguistic variables. Each node in the net-
work model in Figure 3 represents a composite variable. For example, the node
“collocations” consists of 40 multiple choice items on the Words That Go Together
test, and the node “print exposure” consists of 130 items on the Author Recognition
Test. From a CDST perspective, the network model in Figure 3 provides a visualiza-
tion of how different aspects of language knowledge are related to both internal
resources (nonverbal IQ and language analytic ability) and external resources (print
exposure and education). When comparing to the results of the regression analyses in
the original study, the network model reflects the same overall patterns of relation-
ships between individual differences in language knowledge. For example, nonverbal
IQ is more strongly associated with grammar and vocabulary than with collocations,
and print exposure is more strongly associated with vocabulary and collocations than
with grammar. The fact that both analyses reveal the same overall patterns is not
surprising because partial correlations and multiple regression coefficients both
estimate the strength of relationships between variables while controlling for the
effects of other measured variables (Hevey, 2018). The key difference is that regres-
sion analysis imposes unidirectional causal relationships between specific variables
selected by the researcher, whereas with network analysis there are no assumptions
regarding the direction of the relationships.
There are some subtle differences between the results of the network analyses and
Dąbrowska’s analyses. For instance, whereas Dąbrowska found that language analytic
ability was significantly related to both grammar and vocabulary, the network
analysis shows that language analytic ability is only very weakly associated with
vocabulary. In the network model, the relationship between language analytic ability
and vocabulary knowledge appears to be altered by print exposure and nonverbal
IQ. Similarly, while Dąbrowska’s analyses showed that education weakly predicted
each measure of language knowledge, the network analysis shows that the relation-
ship between education and language knowledge becomes weaker after controlling
for the effects of print exposure, nonverbal IQ, and language analytic ability. It is also
interesting to note that in the network model, the negative relationship between print
exposure and nonverbal IQ becomes stronger after controlling for other variables.
Another minor difference is that network analysis revealed a negative relationship
between nonverbal IQ and print exposure whereas in Dąbrowska’s analyses this
relationship was positive. The reason for this difference is that Dąbrowska transformed
the raw IQ scores into percentages, while we opted to conduct analyses with
the raw IQ scores. To confirm this, we conducted Pearson correlations between print
exposure and both the raw and transformed IQ scores that showed that print
exposure had a weak negative correlation with raw IQ scores (r(88) = –.03, p =
.719) and a weak positive correlation with transformed IQ scores (r(88) = .08, p =
.440). However, as these differences are very small, they cannot be interpreted as
meaningful. These slight differences revealed by the network analysis could be due to
the fact that we included all seven variables in the network analysis, whereas
Dąbrowska conducted three separate regression analyses for each measure of lan-
guage knowledge. By taking a more holistic approach including all variables within
the same analysis, additional patterns of relationships were revealed. This then raises
the question of how many variables should be included when working from a CDST
perspective.

Adding age to the network model


To explore this idea, we expanded on the original study by adding the variable “age”
to the network model. Participants’ ages were contained within the original dataset
that is available online, but Dąbrowska did not include this variable in her analyses. It
seemed particularly interesting to include this variable because Dąbrowska used the
dataset to evaluate the usage-based approach and the modular approach to language
acquisition. Age is an indirect measure of language experience. With first language
development, it is logical to assume that the older a person is, the more exposure to
linguistic input they have. Thus, from a usage-based perspective, we might hypoth-
esize age to be significantly related to a number of other variables, including
measures of language knowledge. The 90 participants in Dąbrowska’s study varied
greatly in age, with a range of 17 to 65 and a mean age of 38. The network model in
Figure 4 is a GGM of partial correlations between eight variables (the seven variables
from the original analyses plus age). The model was made following the same
procedures described for the network model in Figure 3. Similarly to the model
without age, the bootstraps show wide 95% confidence intervals, meaning that the
estimated network structure is also not very stable and the found links should be
interpreted with care.
The network model in Figure 4 shows that age is partially correlated with all other
variables. Out of the three measures of language knowledge (grammar, vocabulary, and
collocations), age is most strongly linked to vocabulary (0.35), which is the strongest
positive edge in the network.1 This is in line with previous studies which have shown
that vocabulary is typically the only aspect of language knowledge that does not tend to
decline with age (Reifegerste, 2021). As could be expected, age is also related to print
exposure. Age has a negative association with nonverbal IQ and language analytic
ability, which is consistent with previous research on cognitive decline and aging
(Reifegerste, 2021). There is also a negative relationship between age and education,

1
We conducted a bootstrapped difference test, reported in the supplementary materials, to check whether
the edges in the network significantly differ from each other. The edge Age-Vocabulary is significantly
stronger than the edge Age-Grammar, but not than the edge Age-Collocations. This means that the
difference between these edges has to be interpreted with care.


Figure 4. A network model of individual differences in language knowledge, including age.


Note: In addition to the same variables as the network model in Figure 3, this model also has the variable
age. Blue edges denote positive partial correlations and red edges denote negative partial correlations.

which is logical considering that rates of university attendance have increased
over the years. Overall, the considerable effect that age has on the system provides
tentative support for the usage-based approach. Using network analysis, we can
visualize individual differences in language abilities as a complex system. While we
cannot draw conclusions about emergent processes from cross-sectional data, based on
this network model, we could speculate that vocabulary knowledge emerges out of
interactions between cognitive abilities (nonverbal IQ) and other language experience
(print exposure) throughout the lifespan (age).
In addition, controlling for age alters the partial correlations between other nodes.
For example, in the model that includes age, vocabulary knowledge has a weak positive
relationship with grammar (0.11) and language analytic ability (0.12), whereas in the
model without age, these relationships are weaker (0.06 and 0.05). This indicates that
age is a moderating variable. When controlling for age, the edge weight between
grammar and language analytic ability is stronger (more positive) because age has
negative partial correlations with grammar and language analytic ability. In a similar
way, age also moderates the relationship between IQ and print exposure; these variables
have an edge weight of –0.43 without age, and –0.26 when age is added to the model. In
this case, the edge weight between IQ and print exposure is weaker (smaller in magnitude) when
controlling for age because age has negative partial correlations with IQ and print
exposure.
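The mechanics of conditioning can be made concrete with a small worked example. In a Gaussian graphical model, edge weights are partial correlations, obtained by inverting the correlation matrix and standardizing the result. The Python sketch below uses invented correlation values (not Dąbrowska's data) to show how adding a third variable, here labelled age, changes the edge between two other variables:

```python
import numpy as np

def partial_correlations(R):
    """Convert a correlation matrix R into a matrix of partial
    correlations (the edge weights of a Gaussian graphical model).
    Each entry conditions on all remaining variables."""
    K = np.linalg.inv(R)            # precision matrix
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)         # standardize and flip sign
    np.fill_diagonal(P, 1.0)
    return P

# Toy correlation matrix for (grammar, language analytic ability, age);
# the numbers are invented for illustration, not taken from the study.
R = np.array([
    [1.0, 0.30, -0.40],
    [0.30, 1.0, -0.40],
    [-0.40, -0.40, 1.0],
])

# Edge between the first two variables when age is in the model,
# versus the model without age (the 2x2 submatrix):
with_age = partial_correlations(R)[0, 1]
without_age = partial_correlations(R[:2, :2])[0, 1]
print(round(without_age, 3), round(with_age, 3))
```

In this toy case, conditioning on the shared correlate shrinks the edge from 0.30 to about 0.17; depending on the sign pattern of the correlations, conditioning can also strengthen an edge, which is the general point about age as a moderating variable.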
Although the network models estimated in example 2 are not stable, and a larger
sample size is necessary to draw any firm conclusions, our examples serve to illustrate
how network analysis can be used to model multiple individual differences in language
learning from a CDST perspective. The network analyses support the same conclusions

546 Lani Freeborn et al.

as the original study (Dąbrowska, 2018), but rather than analyzing unidirectional
relationships between individual difference constructs, the undirected network models
in example 2 depict hypothesized multicausal relationships between variables. By
estimating partial correlations between all variables, network analyses also reveal a
more complex network of relationships between variables than the original study’s
regression analyses, offering additional insights into the data.

Discussion
We have provided two examples of how network analysis can be used to model complex
systems from a relation-intensive perspective. These examples serve to illustrate how a
network approach can offer new insights into which components form a system and the
nature of the relationships between components. Network analysis is conceptually
aligned with CDST, enabling us to model hypothesized multicausal relationships
between variables. We have shown how network analysis of cross-sectional data can
be used to model individual difference constructs as complex systems, viewing the
network as a snapshot of (part of) a system in time. We illustrated this in example 1 with
nodes at item level, to analyze motivational constructs on a micro level, and in example
2 with nodes at composite level, to analyze the relationships between individual
differences and language knowledge on a more macro level. In both examples, network
analysis complements the original analyses by providing a more intricate pattern of
relationships between system components and a deeper understanding of the variables
of interest.
Besides the examples of psychological network analysis in this article, there are other
applications of network analysis that could also be beneficial to SLA researchers, such as
the network comparison test and dynamic network analysis. It is also important to
acknowledge that psychological network analysis is still a relatively new statistical
technique, and there are some unanswered questions regarding how certain aspects of
CDST fit with network analysis, such as the question of how many
variables to include and the role of central components. In the following section, we
discuss some of these questions and highlight additional applications of network
analysis that could be applied to SLA research.

The network comparison test


The network comparison test is an application of network analysis that can be used to
compare group differences. The network comparison test statistically compares the
networks of two (or more) groups, such as in terms of node centrality and global
strength (van Borkulo et al., 2022). Networks can also be compared visually, which is
typically done by constraining the layout of the two models for ease of visual
comparison. Blanco et al. (2020) used a network comparison test to compare the
effects of two different interventions on treating depression. One group of patients
(n = 45) received a 10-week Positive Psychology Intervention (PPI) while another
group (n = 48) received a 10-week Cognitive-Behavioral Therapy (CBT) program.
Both groups completed clinical assessments of depression symptoms before and after
the intervention treatments. Blanco et al. (2020) used this data to create two network
models to compare before and after treatment. Results of the network comparison
test showed that only the PPI group showed significant changes in several edge
weights and global strength after intervention. In SLA research, the network
comparison test could be used to statistically compare the networks of learners at
different proficiency levels, at different time points, or across learning conditions.
Both You et al. (2016) and Hiver and Al-Hoorie (2020b) conducted additional
analyses to compare the roles of vision and intended effort across male and female
L2 learners. This comparison could also be done using a network comparison test. As
such, the network comparison test could strengthen CDST-inspired research by
providing a means for hypothesis testing and generalizations. Comparing networks
across groups could also help to ascertain the phenomenological validity of concep-
tualizing abstract psychological phenomena as complex systems. In addition, the
network comparison test could provide insight into how to influence systems’
behavior, as illustrated by Blanco et al. (2020), and could be a useful tool for SLA
researchers considering complex interventions (Hiver et al., 2022).
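The permutation logic behind such a comparison can be sketched briefly. The reference implementation is the R package NetworkComparisonTest (van Borkulo et al., 2022); the Python sketch below is a simplified stand-in that tests only the difference in global strength, on simulated data with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def global_strength(X):
    """Global strength: sum of absolute partial correlations."""
    K = np.linalg.inv(np.corrcoef(X, rowvar=False))
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)
    np.fill_diagonal(P, 0.0)
    return np.abs(P).sum() / 2

def network_comparison_test(X1, X2, n_perm=500):
    """Permutation version of the global-strength test: shuffle group
    labels and count how often the permuted difference is at least as
    large as the observed one."""
    observed = abs(global_strength(X1) - global_strength(X2))
    pooled = np.vstack([X1, X2])
    n1 = len(X1)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        d = abs(global_strength(pooled[idx[:n1]]) -
                global_strength(pooled[idx[n1:]]))
        count += d >= observed
    return observed, (count + 1) / (n_perm + 1)

# Simulated data: two groups drawn from the same underlying network,
# so no significant difference should be detected.
X1 = rng.multivariate_normal(np.zeros(4), np.eye(4), size=120)
X2 = rng.multivariate_normal(np.zeros(4), np.eye(4), size=120)
obs, p = network_comparison_test(X1, X2)
print(round(obs, 3), round(p, 3))
```

The actual NCT additionally tests individual edge weights and network structure invariance, with corrections for multiple testing; this sketch shows only the shared permutation idea.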

Dynamic network analysis


In examples 1 and 2, we took a relation-intensive CDST approach by estimating GGMs
of cross-sectional data. The GGM can also be used with time-intensive and time-
relation intensive research designs, for single subjects and group data, respectively.
Dynamic network analysis requires intensive repeated measurements of variables, such
as with a time-series or panel design, typically obtained through Experience Sampling
Method (ESM), whereby participants provide self-reports at regular intervals during
the day (Bringmann et al., 2013). With single-subject data, autoregressive (AR)
modeling can model time dynamics within an individual by regressing one
variable on a previous measurement of the same variable (called a lagged variable).
The vector autoregressive (VAR) model is the multivariate extension of the AR model,
where “a variable is regressed on all the lagged variables in the dynamic system” (van
Bork et al., 2018, p. 18). The VAR model has two extensions: graphical VAR and
multilevel VAR. For single-subject data, graphical VAR can be used to create both
temporal and contemporaneous networks using the GGM (Epskamp et al., 2018).
Temporal networks have directed edges and show how the state of variables at one time
point influence the state of variables at the next time point. A contemporaneous
network model shows how variables predict each other at the same measurement
occasion, after accounting for temporal effects (Epskamp et al., 2018b), similarly to
GAMMs and LGCMs. Multilevel VAR modeling can be used to model both within-
group and between-group variance over time (Bringmann et al., 2013). For example,
Bringmann et al. (2013) combined VAR and multilevel VAR to follow 129 participants’
changes in depressive symptoms during a treatment intervention, modeling time
dynamics at the individual and group level.
In the field of clinical psychology, researchers are exploring how dynamic network
modeling could provide insight into how people develop disorders over time, with the
aim of using this knowledge to target group and/or individual treatment interventions
(Bringmann et al., 2013; David et al., 2018; van Bork et al., 2018). Dynamic network
analysis could also prove to be a useful methodology for CDST researchers in applied
linguistics, and a few SLA studies have used ESM. For example, Waninge et al. (2014)
micro-mapped the motivational dynamics of four learners during their language
lessons. They took measurements at 5-minute intervals throughout lessons, resulting
in 10 observations per class. Similarly, Khajavy et al. (2021) used ESM to examine the
dynamic relationships between willingness to communicate (WTC), anxiety, and
enjoyment of 38 students throughout six language lessons. Students indicated their
level of WTC, anxiety, and enjoyment on a scale of 1 to 10 at 5-minute intervals,
resulting in 10 observations per class. Gregersen et al. (2020) also used ESM to explore
the dynamics of language teacher well-being, where teachers used an app to respond to
a short survey 10 times a day for 7 days. Although smartphone technology has the
potential to use ESM more easily than in the past (Arndt et al., 2021; Gregersen et al.,
2020), it is still extremely challenging in most applied linguistics research settings to
obtain a large enough number of observations to conduct a dynamic network analysis.
For example, in the study by Bringmann et al. (2013), participants recorded depressive
symptoms 10 times a day for 12 days, which resulted in a total of 120 observations per
participant.
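The core of a VAR(1) model is a set of lagged regressions. The Python sketch below simulates a single learner's two-variable time series (the variable roles and coefficients are invented) and recovers the temporal edges by least squares; in practice one would use dedicated tooling such as the graphicalVAR package in R.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_var1(Y):
    """Least-squares fit of a VAR(1) model Y_t = c + B @ Y_{t-1} + e_t.
    The entries of B are the directed (temporal) edges."""
    X = np.hstack([np.ones((len(Y) - 1, 1)), Y[:-1]])   # intercept + lag
    coef, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)
    return coef[0], coef[1:].T                          # c, B

# Simulate one learner: variable 0 (say, WTC) at time t is pushed up by
# variable 1 (say, enjoyment) at t-1 with coefficient 0.5.
T = 400
B_true = np.array([[0.3, 0.5],
                   [0.0, 0.3]])
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = B_true @ Y[t - 1] + rng.normal(scale=0.5, size=2)

c, B = fit_var1(Y)
print(np.round(B, 2))   # recovered lagged coefficients
```

With enough observations the recovered B approximates the generating coefficients; the point made in the text stands, however, that collecting that many measurement occasions per participant is hard in most applied linguistics settings.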

Nonlinearity
GGMs and VAR models are estimated based on assumptions of multivariate normality
that assume linear relationships between variables (Epskamp et al., 2018b). As such,
these models may not present a fully accurate view of the data if the relationships
between variables are nonlinear. For cross-sectional data, the Ising model is a nonlinear
model used for binary variables (Finnemann et al., 2021), but nonlinear models for
continuous variables have not yet been developed. For longitudinal data, while VAR
models fit linear effects, new types of network analysis have been developed that can
also capture nonlinear relationships between variables (Haslbeck et al., 2021). The
findings from several CDST studies with a time element have shown that language
development, and its relationship with individual differences, is nonlinear (Fogal, 2022;
Pfenninger, 2020; Piniel & Csizér, 2014). This is why some CDST research designs that
include a time element are using techniques such as GAMM instead of LGCM, as
GAMM can handle nonlinearity (Pfenninger, 2020). Researchers in the field of network
psychometrics have recently combined the VAR model with a Generalized Additive
Model (GAM) framework, to estimate time-varying VAR models (Haslbeck et al.,
2021). The field of network psychometrics is developing rapidly and is likely to produce
other useful techniques in the future that could further enrich our methodological
toolbox.

Latent network analysis


In the first example of the L2MSS, we compared Hiver and Al-Hoorie’s (2020b) SEM
with our network analysis. For the past 100 years, psychological constructs have been
studied using latent variable approaches, which assume that observed variables corre-
late because they reflect the same underlying construct (van Bork et al., 2019). Network
analysis has been put forward as an alternative to latent variable approaches. From a
network perspective, correlations between observed variables may reflect mutual
interaction between psychological processes (van der Maas et al., 2006). Although
these two approaches have different competing causal explanations for the covariance
between observed variables, both create models for variance-covariance matrices and
are thus statistically equivalent (van Bork et al., 2019; van der Maas et al., 2006). Because
of this statistical equivalence, researchers have explored the idea that combining these
two approaches could be complementary, resulting in latent network analysis (Golino
& Epskamp, 2017; Guyon et al., 2017). Conceptually, a combined approach assumes
that manifestations of psychological attributes have a common cause (latent variables)
and that these latent variables interact (as a complex system) (Guyon et al., 2017). For
example, Epskamp et al. (2017) have used latent network modeling to explore the
structure of interdependent relationships between latent variables. An advantage of
latent network analysis is that due to the incorporation of factor-based statistical
techniques, it is possible to test model fit against data, which is a limitation of
psychological network analysis (Epskamp et al., 2017; van der Maas et al., 2017). It
can also be considered a useful way of exploring latent variables within a dataset because
clusters in the network can tell us about the factor structures present, without having to
impose the direction of the relationship like SEM (Golino & Epskamp, 2017). Latent
network analysis can be used with cross-sectional data as well as time-series and
panel data.

The role of central components


In example 1, we computed centrality indices for the network model of the L2MSS,
which showed that the intended effort nodes have the highest centrality. Researchers
from different fields have questioned whether central components have predictive
ability and can be used to target interventions. The role of central components has so far
provided insights into the dynamic processes of genetic networks, cortical networks,
and ecosystems (for a detailed description see Rodrigues, 2019). In the field of clinical
psychology, findings from a few studies indicate that central components could be used
to target treatment interventions and make predictions about diagnoses. For example,
in clinical research on eating disorders, central components have been predictive of
treatment dropout (Lutz et al., 2018) and treatment outcomes (Elliott et al., 2020). The
idea behind using central nodes to target interventions is that these nodes are more
likely to have bigger effects (either directly or indirectly) on the rest of the system
compared to targeting a less central node (Rouquette et al., 2018). Nodes with high
closeness in particular are more likely to be affected by changes in other components of
the system and are also more likely to trigger change.
From the first network model example in Figure 1, intended effort had the highest
node centrality in terms of closeness, suggesting that intended effort plays a key role in
triggering the dynamic processes involved in L2 motivation. This fits with Hiver and
Al-Hoorie’s (2020b, p. 86) idea that putting in the effort to learn a language results in
dynamic interaction between motivational constructs and task demands.
However, readers should note that the use of centrality indices in psychological
networks is much debated (Bringmann et al., 2019). Centrality indices stem from
social network analysis, whereby the relationship between components/nodes is
known; the connections between nodes are observable. In comparison, in psycho-
logical networks the relationship between nodes is not directly observed, but is
estimated, based on the strength of partial correlations between our measurements
of psychological constructs. Bringmann and colleagues (Bringmann et al., 2019)
have advised researchers to interpret centrality measures with care, especially
betweenness and closeness centrality, as they are difficult to interpret and are
often unstable.
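Strength and closeness can be computed directly from the weighted adjacency matrix: strength is the sum of absolute edge weights at a node, and closeness is based on shortest-path distances, with distances commonly taken as the reciprocal of absolute edge weights (the convention used, e.g., in qgraph). The Python sketch below uses an invented four-node network in which node 0 plays the "intended effort" role of having the strongest ties:

```python
import numpy as np

def centrality(W):
    """Strength and closeness centrality for a weighted network W
    (symmetric matrix of edge weights, zero diagonal)."""
    A = np.abs(W)
    strength = A.sum(axis=1)
    # Closeness: shortest paths over distances 1/|w| (Floyd-Warshall).
    with np.errstate(divide="ignore"):
        D = np.where(A > 0, 1.0 / A, np.inf)
    np.fill_diagonal(D, 0.0)
    n = len(D)
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    closeness = (n - 1) / D.sum(axis=1)
    return strength, closeness

# Hypothetical 4-node network; the weights are invented for illustration.
W = np.array([
    [0.0, 0.4, 0.3, 0.3],
    [0.4, 0.0, 0.1, 0.0],
    [0.3, 0.1, 0.0, 0.1],
    [0.3, 0.0, 0.1, 0.0],
])
s, c = centrality(W)
print(s.argmax(), c.argmax())   # the most central node on each index
```

As the caveats above suggest, such indices describe the estimated network, not directly observed connections, and should be interpreted accordingly.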

The number of variables to include


Complex systems are characterized by dynamic interaction between multiple internal
and external subsystems (de Bot et al., 2007; Larsen-Freeman & Cameron, 2008).
However, given the theoretical and practical impossibility of analyzing the complete
interconnectedness of a whole system, CDST researchers have to find a balance between
oversimplification and undersimplification. Larsen-Freeman et al. (2011) have pointed
out that a main methodological concern for CDST researchers is drawing boundaries
and defining what we conceptualize to be a “functional whole.” Yet, when conducting
network analysis, Hevey (2018, p. 307) has reasoned that it is “critically important to
measure such potential confounding variables to ensure that their effects are controlled
for.” The network model of Dąbrowska’s (2018) dataset in example 2 that includes age
illustrates Hevey’s reasoning, as age moderates the relationships between other system
components. It is highly likely that there are also other confounding variables that have
been omitted from the model, such as socioeconomic status, L2 knowledge and
experience, gender, and other cognitive abilities. As with other types of modeling,
adding further variables to the network model could have both predictable and
unpredictable effects on the rest of the system. Yet from a CDST perspective, it is
theoretically impossible to measure every component of a system. What network
analysis can do is capture at least part of a system. Thus, while we acknowledge the
potential of a network approach to SLA and individual differences, it is important to be
mindful of its limitations.

Generalizability
Generalizability is another debated topic in CDST research. Several researchers have
pointed out the lack of generalizability of CDST studies and the lack of practical
implications that CDST can currently offer to the field of applied linguistics (Hiver
et al., 2022; Pallotti, 2022). Generalizability is a complex topic and is related to the
distinction between idiographic and nomothetic methodological approaches. Idio-
graphic approaches focus on the individual level with within-subject designs, analyzing
intraindividual differences (Hamaker, 2012). Idiographic approaches use longitudinal
data and process-focused analyses. In contrast, nomothetic approaches focus on the
group level with between-subject designs, analyzing interindividual differences
(Hamaker, 2012). Nomothetic approaches use cross-sectional data and product-
focused analyses.
The majority of CDST studies to date have used idiographic approaches (Hiver et al.,
2022) because it is difficult to generalize from cross-sectional models to individual
dynamics. This concept is known as the ergodicity problem: The idea that group
statistics cannot be generalized to the individual and vice-versa (Lowie & Verspoor,
2019). As Molenaar (2004, p. 225) has pointed out, “only under very strict conditions—
which are hardly obtained in real psychological processes—can a generalization be
made from a structure of interindividual variation to the analogous structure of
intraindividual variation.” However, this does not mean that idiographic and nomo-
thetic approaches are in competition (Salvatore & Valsiner, 2010). In fact, they can be
viewed as complementary, or two sides of the same coin (Grice, 2004). When discussing
the idiographic-nomothetic debate in relation to research on personality, Grice (2004)
argued that:

Establishing the uniqueness of some person’s developmental history, attitudes,
thoughts, behaviors etc., would require the negation of nomothetic principles.
Conversely, establishing the validity of a nomothetic principle that holds for all
people would require the study of individual persons, not simply aggregates of
persons. A true study of personality is therefore necessarily idiographic and
nomothetic. (p. 205)
Lowie and Verspoor (2019) have illustrated this point in relation to SLA, by investi-
gating the role of motivation and aptitude in both a group study and in 22 longitudinal
case studies. Their analyses showed that while learners showed different intraindividual
learning trajectories over time, there were overall similarities between learners in terms
of motivation and aptitude.
While some have argued that the idiographic approach undermines generalization
(Pallotti, 2022; Spencer & Schöner, 2003), others have argued that idiography is a way to
pursue generalized knowledge (Salvatore & Valsiner, 2010). As Salvatore and Valsiner
(2010) have claimed, idiography is “the pursuit of nomothetic knowledge through the
singularity of the psychological and social phenomena” [emphasis in original] (p. 820). It
is also important to note that nomothetic refers to what can be generalized across a
sample population (e.g., from aggregated cross-sectional data), not what can be taken as
a general law across all populations (Hamaker, 2012). Hence, as with any other cross-
sectional data analysis, results of network analysis can only tell us about the population
from which the data was sampled and cannot be taken as a general law across all
populations or all individuals.
That said, a network approach offers a structural perspective that is currently
missing from CDST research in the field of SLA and enables us to expand our research
agenda beyond idiographic, time-intensive approaches (Hiver et al., 2022).
Taking steps toward generalizable findings, network analysis provides a means to
quantitatively analyze the relationships between multiple variables and assess the
relative importance of each variable within the system. Compared to other statistical
techniques such as SEM, an advantage of network analysis is that it does not require a
priori assumptions about unidirectional causal relations, but instead it allows for
(hypothesized) bidirectional interactions between variables. As previously mentioned,
other applications of network analysis such as the network comparison test make it
possible for SLA researchers to test hypotheses and assess the extent to which systems
can be generalized across different learner populations. Although network analysis is
still relatively new, some researchers in clinical psychology have set out to examine its
methodological validity and to determine the most appropriate metrics for assessing
similarities between samples (Borsboom et al., 2017; Funkhouser et al., 2020).
Researchers have also begun to assess the extent to which network analytic tools can
inform the design of intervention studies. For example, Henry et al. (2020, p. 2) have
developed a statistical testing procedure to assess the efficacy of an intervention,
“determining if the dynamical systems of different people have the same optimal
intervention targets.”

Conclusion
In this article we provided a brief overview of research methods used by SLA researchers
working within a CDST paradigm. We put forward network analysis as a way to model
complex systems from a relation-intensive perspective and provided two examples of
how to apply network analysis to two different datasets. In the first example we
estimated a network model of L2 motivation, which provided a more fine-tuned picture
of the potential relationships between motivational constructs compared to the original
SEM analyses. In the second example we created a network model of individual
differences in native language knowledge, showing how network analysis can model
the interconnectedness of individual difference constructs and different aspects of
language knowledge.
While CDST researchers have made considerable advances in describing language
development and changes in individual differences over time, the potential of relation-
intensive approaches has not yet been explored. Through our two examples of network
models, we hope to have illustrated that cross-sectional data does have a place in CDST
research, and that network analysis is a useful technique to add to the CDST toolbox.
Supplementary Materials. To view supplementary material for this article, please visit http://doi.org/
10.1017/S0272263122000407.

Acknowledgments. We would like to thank Han van der Maas, Wander Lowie, and the three anonymous
reviewers for their invaluable feedback and suggestions on an earlier draft of this manuscript.

Data Availability Statement. The experiment in this article earned an Open Materials badge for trans-
parent practices. The materials are available at https://osf.io/hjcvz/

References
Al-Hoorie, A. H., & Hiver, P. (2022). Complexity theory: From metaphors to methodological advances. In
A. H. Al-Hoorie & F. Szabó (Eds.). Researching language learning motivation: A concise guide
(pp. 175–184). Bloomsbury Academic.
American Psychological Association. (2017, February). Nonlinear methods for understanding complex
dynamical phenomena in psychological science. https://www.apa.org/science/about/psa/2017/02/dynam
ical-phenomena
Amerstorfer, C. M. (2020). The dynamism of strategic learning: Complexity theory in strategic L2 develop-
ment. Studies in Second Language Learning and Teaching, 10, 21–44.
Arndt, H. L., Granfeldt, J., & Gullberg, M. (2021). Reviewing the potential of the experience sampling method
(ESM) for capturing second language exposure and use. Second Language Research. Advance online
publication. https://doi.org/10.1177/02676583211020055
Blanco, I., Contreras, A., Chaves, C., Lopez-Gomez, I., Hervas, G., & Vazquez, C. (2020). Positive interven-
tions in depression change the structure of well-being and psychological symptoms: A network analysis.
The Journal of Positive Psychology, 15, 623–628.
Borsboom, D. (2017). A network theory of mental disorders. World Psychiatry, 16, 5–13.
Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of
psychopathology. Annual Review of Clinical Psychology, 9, 91–121.
Borsboom, D., Fried, E. I., Epskamp, S., Waldorp, L. J., van Borkulo, C. D., van der Maas, Han L. J., & Cramer,
A. O. J. (2017). False alarm? A comprehensive reanalysis of “evidence that psychopathology symptom
networks have limited replicability” by Forbes, Wright, Markon, and Krueger (2017). Journal of Abnormal
Psychology, 126, 989–999.
Bringmann, L. F., Elmer, T., Epskamp, S., Krause, R. W., Schoch, D., Wichers, M., Wigman, J. T. W., & Snippe,
E. (2019). What do centrality measures measure in psychological networks? Journal of Abnormal
Psychology, 128, 892–903.
Bringmann, L. F., Vissers, N., Wichers, M., Geschwind, N., Kuppens, P., Peeters, F., & Tuerlinckx, F. (2013). A
network approach to psychopathology: New insights into clinical longitudinal data. PloS ONE, 8, e60188.
Burger, J., Isvoranu, A. M., Lunansky, G., Haslbeck, J. M. B., Epskamp, S., Hoekstra, R. H. A., Fried, E. I.,
Borsboom, D., Blanken, T. F. (2022). Reporting standards for psychological network analyses in cross-
sectional data. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000471
Bybee, J. L., & Beckner, C. (2009). Usage-based theory. In B. Heine & H. Narrog (Eds.), The Oxford handbook
of linguistic analysis (pp. 1–26). Oxford University Press.
Chang, P., & Zhang, L. J. (2021). A CDST perspective on variability in foreign language learners’ listening
development. Frontiers in Psychology, 12, 1–17.
Dąbrowska, E. (2018). Experience, aptitude and individual differences in native language ultimate attain-
ment. Cognition, 178, 222–235.

David, S. J., Marshall, A. J., Evanovich, E. K., & Mumma, G. H. (2018). Intraindividual dynamic network
analysis—implications for clinical assessment. Journal of Psychopathology and Behavioral Assessment, 40,
235–248.
De Bot, K. (2011). Researching second language development from a dynamic systems theory perspective. In
M. Verspoor, K. de Bot, & W. Lowie (Eds.), A dynamic approach to second language development: Methods
and techniques (pp. 5–24). John Benjamins.
de Bot, K., Lowie, W., & Verspoor, M. (2007). A dynamic systems theory approach to second language
acquisition. Bilingualism: Language and Cognition, 10, 7–21.
Dörnyei, Z. (2005). The psychology of the language learner: Individual differences in second language
acquisition. Lawrence Erlbaum.
Dörnyei, Z. (2009). Individual differences: Interplay of learner characteristics and learning environment. In
N. Ellis & D. Larsen-Freeman (Eds.), Language as a complex adaptive system (pp. 230–248). Wiley-
Blackwell.
Dörnyei, Z. (2017). Conceptualizing L2 learner characteristics in a complex, dynamic world. In L. Ortega & Z.
Han (Eds.), Complexity theory and language development: In celebration of Diane Larsen-Freeman
(pp. 79–96). John Benjamins.
Dörnyei, Z., & Chan, L. (2013). Motivation and vision: An analysis of future L2 self images, sensory styles, and
imagery capacity across two target languages. Language Learning, 63, 437–462.
Dörnyei, Z., MacIntyre, P. D., & Henry, A. (2015). Introduction: Applying complex dynamic systems
principles to empirical research on L2 motivation. In Z. Dörnyei, P. D. MacIntyre, & A. Henry (Eds.),
Motivational dynamics in language learning (pp. 1–7). Multilingual Matters.
Edwards, J. R., & Bagozzi, R. P. (2000). On the nature and direction of relationships between constructs and

https://doi.org/10.1017/S0272263122000407 Published online by Cambridge University Press


Cite this article: Freeborn, L., Andringa, S., Lunansky, G. and Rispens, J. (2023). Network analysis for
modeling complex systems in SLA research. Studies in Second Language Acquisition, 45, 526–557. https://
doi.org/10.1017/S0272263122000407



Studies in Second Language Acquisition (2023), 45, 558–570
doi:10.1017/S0272263122000419

RESEARCH REPORT

The importance of psychological and social factors in adult SLA: The case of productive collocation knowledge in L2 Swedish of L1 French long-term residents
Fanny Forsberg Lundell* , Klara Arvidsson and Andreas Jemstedt
Stockholm University, Stockholm, Sweden
*Corresponding author. E-mail: fanny.forsberg.lundell@su.se

(Received 15 March 2022; Revised 10 August 2022; Accepted 15 August 2022)

Abstract
The study investigates how psychological and social factors relate to productive collocation
knowledge in late L2 learners of Swedish (French L1) (N = 59). The individual factors are
language aptitude (measured through the LLAMA aptitude test), reported language use,
social networks, acculturation, and personality. Multiple linear regression analysis showed
that positive effects were found for LLAMA D (phonetic memory), LLAMA E (sound-
symbol correspondence), reported language use, and length of residence (LOR). Further-
more, a negative effect was found for the personality variable Open-mindedness. These
variables explained 63% (adjusted R²) of the variance, which represents large effects
compared to other studies on individual factors. The findings confirm earlier results on the importance of language aptitude and language use for productive collocation knowledge, and they add evidence for the roles of personality and LOR. In sum, cognitive and social factors combine to explain different outcomes in adult L2 acquisition.

Introduction
Research has suggested that past the mid-teens, age effects diminish and individual variation in adult L2 learning depends more on social and psychological factors (cf. Hyltenstam, 2018). This study aims to contribute to this research by
examining what factors best predict language proficiency among French long-term
residents in Sweden. Many studies related to the critical period hypothesis have
focused on grammatical intuition and different measures of phonology (e.g., Bird-
song, 2005; DeKeyser, 2000). However, these studies have rarely looked into a central
phenomenon for the advanced second language learner: collocations (e.g., make a
decision, perfectly possible). Collocations are conventionalized word combinations that express a unit of meaning; they cannot be generated by lexical or grammatical rules, and they contribute to fluent and idiomatic language use.

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000419 Published online by Cambridge University Press


Collocations and psychological and social factors 559

Research has consistently shown that mastery of collocations correlates with mea-
sures of second language proficiency (Forsberg Lundell et al., 2018; Gyllstad, 2007;
Nizonkiza, 2011). In using a conventionalized word combination, the speaker signals
familiarity with, and a sense of belonging to, a specific linguistic community (Wray,
2002). As such, the use of collocations is a means of conforming to social norms and
expectations. It is accordingly not unreasonable to assume that cognitive, affective,
and social factors could have an effect on the successful acquisition of collocations.
This area remains largely unexplored, however, and constitutes the research gap for
the present study.
The study includes L2 learners of Swedish who started learning Swedish as adults.
They are L1 French voluntary migrants who have spent at least 5 years in the host
community, Sweden, but often longer. The participant sample is unusual within mainstream second language acquisition (SLA) research because it targets long-term L2 speakers whose L1 (French) has many speakers around the world, in a second language setting (Swedish) with, by comparison, very few speakers. The
main research question to be answered is: What psychological and social factors predict
productive collocation knowledge in long-term L2 residents?

Background
Collocation knowledge and second language acquisition
Despite the plethora of definitions, most researchers agree that collocations consist of
words that occur frequently together in a given language. The present study takes a
statistical approach to defining what constitutes a collocation and considers a colloca-
tion to be any word combination in which the included words appear together more
often than by chance (for details, see “Methods and Procedures”) (Paquot & Granger,
2012). Research suggests that productive collocation knowledge (PCK) is the most
challenging aspect of L2 vocabulary knowledge (e.g., Laufer & Waldman 2011; Schmitt,
2014). The difficulty in acquiring collocations in an L2 is assumed to relate to the L2
learner’s relative lack of exposure to the target language and to the phenomenon of L1
entrenchment (L1 influence in preferred patterns) (Ellis, 2002, 2006). Massive exposure
is therefore crucial to develop collocation knowledge, a theme often discussed within
usage-based approaches to SLA (e.g., Ellis & Wulff, 2015).

Collocation knowledge and psychological and social factors


To date, little research has been conducted on what factors best predict collocation
knowledge in an L2. Granena and Long (2013) (L1 Chinese, L2 Spanish) showed that in the late-starter group (age of onset 16–29 years), language aptitude, measured by the LLAMA aptitude test, predicted lexis and collocations, with the subtests LLAMA D (sound recognition) and LLAMA E (sound-symbol correspondence) showing the strongest effects (LLAMA D, r = .46; LLAMA E, r = .36). A similar result was found by Forsberg Lundell and Sandgren (2013), who investigated the relationship between PCK, aptitude, and personality in a small sample of L1 Swedish users of L2 French (N = 13). Like Granena and Long (2013), they found an association with LLAMA D (r = .58). In addition, they found that PCK correlated with two dimensions of the Multicultural
Personality Questionnaire (MPQ), namely Open-mindedness and Cultural Empathy.
This latter result indicates that not only aptitude would be relevant for collocation
knowledge but perhaps also other individual factors such as personality.


González-Fernández and Schmitt (2015) also focused on PCK. Their study included 108 L1 Spanish, L2 English participants of different proficiency levels, who had learned English for 13.67 years on average. The study showed that PCK correlated with the amount of everyday language exposure (r = .56). The importance of language exposure for collocational knowledge was also investigated by Dąbrowska (2019), who compared knowledge of grammar, vocabulary, and collocations in a group of English L1 speakers (N = 90) and a group of English L2 speakers (N = 67). Besides investigating performance in these linguistic domains, she also measured the effect of individual differences in both groups. Print exposure was an important predictor of collocation knowledge in both L1 and L2 speakers. However, in a regression analysis, “everyday language use” turned out to be by far the strongest predictor of collocation knowledge, explaining 36% of the variance.
In the present study, language aptitude and language use will be included as primary
factors, given their importance in earlier research. However, in view of the scarcity of
quantitative research on individual factors and collocations, it is worthwhile including a
few other factors that have yielded effects on other L2 proficiency domains.
As stated above, social integration may be important for successful acquisition of
formulaic language (Dörnyei et al., 2004). Social integration is a complex phenomenon,
but it is reasonable to assume that it could relate to variables such as social networks
(cf. Dollmann et al., 2020) and acculturation, that is, cultural affiliation (Ryder et al.,
2000). It could also be related to personality because as Kormos (2013) notes, person-
ality can be a decisive factor for creating opportunities for language use. In a study by
Ożańska-Ponikwia and Dewaele (2012), the personality trait Openness to Experience
was the strongest predictor of self-perceived proficiency of L2 English in a migratory
setting. The importance of Openness and Open-mindedness is generally confirmed
by Moyer (2021) in her overview of gifted language learners. Collocations, because they
are typically nativelike, could be a means for and result of social integration and thus
linked to the aforementioned factors.
To summarize, research on collocations and individual factors suggests an effect of
language aptitude and language use, but findings from naturalistic settings point to the
importance of exploring other variables.

Research Questions and Hypotheses


The research question of this study is: To what extent do the following factors predict
productive collocation knowledge?

• Language aptitude,
• Language use,
• Social networks,
• Acculturation, and
• Personality.

This question relates to the five psychological and social factors investigated in this
study (see Table 1 for the correspondence between the investigated factors and their
operationalization as independent variables in the statistical analysis). The study also
includes two extraneous variables (i.e., variables that are not the focus of the investi-
gation, but that can potentially affect the dependent variable), namely length of
residence (LOR) and length of Swedish studies.

Table 1. Factors vs. independent variables included in the study

Factors (instrument)               Independent variables                  Maximum score
Language aptitude (LLAMA)          LLAMA B                                100
                                   LLAMA D                                75
                                   LLAMA E                                100
                                   LLAMA F                                100
Acculturation (VIA)                VIA Sweden                             9
                                   VIA France                             9
Personality (MPQ)                  MPQ Cultural Empathy                   5
                                   MPQ Flexibility                        5
                                   MPQ Social Initiative                  5
                                   MPQ Open-mindedness                    5
                                   MPQ Emotional Stability                5
Target language engagement (LEQ)   Language engagement
Social networks (SNQ)              Number of relations in L2
Extraneous variables               LOR (in years)
                                   Length of Swedish studies (in years)

Based on the previous research, we propose the following hypothesis:


PCK will be related to language aptitude (LLAMA) given its importance for
collocation knowledge (Forsberg Lundell & Sandgren, 2013; Granena & Long, 2013).
It will also be related to target language use (language engagement), based on the results
from González-Fernández and Schmitt (2015) and Dąbrowska (2019).
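The abstract reports a multiple linear regression with adjusted R² as the effect-size measure. As a minimal sketch of how such a model can be fit and adjusted R² computed (the data below are synthetic and the predictor names hypothetical; this is not the study's dataset or analysis code):

```python
import numpy as np

def ols_adjusted_r2(X, y):
    """Ordinary least squares of y on X (plus intercept), returning
    adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic scores for 59 participants with four hypothetical predictors
# (e.g., LLAMA D, LLAMA E, language engagement, LOR), illustrative only:
rng = np.random.default_rng(0)
X = rng.normal(size=(59, 4))
y = X @ np.array([0.5, 0.3, 0.6, 0.2]) + rng.normal(scale=0.5, size=59)
adj_r2 = ols_adjusted_r2(X, y)
```

Adjusted R² penalizes the number of predictors p, which matters here given the fifteen independent variables listed in Table 1 against a sample of 59 participants.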

Methods and Procedures


Participants
The present sample included 64 L1 French, L2 Swedish speakers, but 5 participants had to be excluded (see the “Data Analysis” section for more information). The remaining 59 participants consisted of 35 women and 24 men. Their mean age was M = 41.59 years (SD = 9.13), ranging from 27 to 71, and their mean LOR was M = 13.20 years, ranging from 5 to 50.
All of them had finished upper secondary education in France before coming to Sweden.
The participants were carefully selected based on the following sociobiographic criteria:

1. They had French as their main L1 (bilinguals from birth were not excluded unless
Swedish was the other L1).
2. They had finished upper secondary education.
3. They had started learning the Swedish language no earlier than 12 years of age, to
target postcritical period learners.
4. They had spent at least 5 years in Sweden.

Recruitment of participants relied on convenience sampling. In a first phase, participants were recruited through the Facebook groups Les Français à Stockholm (French people in Stockholm) and French connection. A nonnegligible portion of the participants were also recruited through snowball sampling.
The initial aim was to collect data from more than 64 participants, but this turned
out to be impossible due to financial constraints. While a larger sample would have
been desirable, we would like to emphasize the value of the present dataset, given the
relative scarcity of data on this category of participants in SLA research (long-term


residents, L2 Swedish). No power analysis was conducted before recruiting participants. Instead, the aim was to recruit as many participants as possible during the project phase.

Instruments
Productive collocation knowledge
The L2 Swedish PCK test used in the present study has been validated in a prior study
(Prentice & Forsberg Lundell, 2021). The test targets verb + noun collocations, such as
ställa en fråga (Eng. pose a question). The test was developed based on Gyllstad (2007)
for item selection. The items were extracted from newspaper corpora in the Swedish
language bank (https://spraakbanken.gu.se) and items were selected based on MI scores
and frequency (for details, see Prentice & Forsberg Lundell, 2021).
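The MI score is not defined in this report; it defers to Prentice and Forsberg Lundell (2021). As a rough illustration, the pointwise mutual information commonly used in collocation research can be computed as follows (the counts and thresholds below are hypothetical, not taken from that study):

```python
import math

def mi_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information for a word pair x, y in a corpus of
    n tokens: MI = log2(f(x,y) * n / (f(x) * f(y))). Positive values
    mean the words co-occur more often than chance would predict."""
    return math.log2((f_xy * n) / (f_x * f_y))

# Hypothetical counts for a verb + noun pair in a 100-million-token corpus:
score = mi_score(f_xy=1200, f_x=45_000, f_y=30_000, n=100_000_000)
# Candidate pairs are typically kept only above an MI threshold and a
# minimum co-occurrence frequency (both values assumed here):
is_collocation = score >= 3.0 and 1200 >= 5
```

Selecting by both MI and raw frequency, as the report describes, filters out rare pairs that score high on MI purely because their component words are themselves infrequent.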
The test had a fill-in-the-gap format. Participants were asked to supply the verb; the
first letter of the verb was provided to limit the number of possible alternatives. For example:

GP blir det första av de utländska medierna som får chansen att s__________ en
fråga på presskonferensen.
“GP [Göteborgs Posten] is the first of the foreign media getting a chance to
p__________ a question at the press conference.”
Items were scored dichotomously (1 or 0). Only verbs that form a clear collocation (according to the MI threshold and frequencies explained in Prentice & Forsberg Lundell, 2021) were accepted, but spelling mistakes (such as *stella instead of ställa) were allowed because they do not reflect a lack of collocation knowledge.
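This scoring scheme can be sketched as follows; the accepted-verb set for an item and the use of an edit distance of 1 to operationalize tolerated spelling mistakes are assumptions for illustration, not the authors' actual protocol:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here to tolerate minor misspellings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def score_item(response: str, accepted_verbs: set) -> int:
    """Dichotomous scoring (1/0): the response must match an accepted
    collocate verb, allowing small spelling errors (assumed: distance <= 1)."""
    response = response.strip().lower()
    return int(any(edit_distance(response, v) <= 1 for v in accepted_verbs))

# Hypothetical item: "s_____ en fråga" (pose a question)
accepted = {"ställa"}
score_item("ställa", accepted)   # 1: the target collocation
score_item("stella", accepted)   # 1: spelling slip tolerated
score_item("säga", accepted)     # 0: not a conventional collocate
```

In the actual study such judgments were presumably made by the raters; the sketch only makes the decision rule explicit.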

Sociological questionnaires (independent variables)


The Language Engagement Questionnaire (LEQ) (McManus et al., 2014) measures
language use and was developed by the LANGSNAP project (https://langsnap.soton.
ac.uk/). Participants were asked to indicate how often they carry out 23 activities in the
target language, including both passive and active language use. The six response
options ranged from “never” to “every day,” which were then coded with numerical
values ranging from 0 (never) to 5 (every day). In this study, “language engagement”
was operationalized as the average of the 23 responses.
The Social Network Questionnaire (SNQ) (McManus et al., 2014) provides detailed information
about the number of people included in the participant’s social networks in the
target community, how they interact with these people, and in what languages. The
social network variable used in this study is a numerical value that represents the
number of people with whom the participant regularly interacts (only) in L2
Swedish.
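The two operationalizations can be sketched as follows. The intermediate frequency labels are assumptions, since the text specifies only the endpoints "never" and "every day"; the shape of the contact records is likewise assumed:

```python
# Assumed labels for the six LEQ response options, coded 0 ("never")
# to 5 ("every day"); only the endpoints are named in the text.
LEQ_CODES = {"never": 0, "a few times a year": 1, "monthly": 2,
             "weekly": 3, "several times a week": 4, "every day": 5}

def language_engagement(responses):
    """Mean of the coded responses to the 23 LEQ activities (scale 0-5)."""
    return sum(LEQ_CODES[r] for r in responses) / len(responses)

def l2_network_size(contacts):
    """SNQ variable: number of people the participant regularly
    interacts with only in L2 Swedish."""
    return sum(1 for c in contacts if c["languages"] == {"Swedish"})

engagement = language_engagement(["every day"] * 10 + ["weekly"] * 13)
network = l2_network_size([{"languages": {"Swedish"}},
                           {"languages": {"Swedish", "French"}}])
```

Note that a contact with whom the participant uses both Swedish and French does not count toward the SNQ variable, which captures Swedish-only relations.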

Psychological tests and questionnaires (independent variables)


The LLAMA aptitude test (Meara, 2005) is one of the most recently developed language
aptitude tests and has been widely used (e.g., Abrahamsson & Hyltenstam, 2008;
Granena & Long, 2013). The test measures language aptitude with respect to vocabulary
learning (LLAMA B), sound recognition (LLAMA D), sound-symbol correspondence
(LLAMA E), and grammatical inferencing (LLAMA F).


The VIA Acculturation Questionnaire (Ryder et al., 2000) consists of 10 items assessing migrants’ heritage culture attachment (VIA France)
and 10 items assessing their host culture attachment (VIA Sweden). Participants were
asked to express their liking for typical values, traditions, and practices for each culture
on a 9-point Likert scale ranging from 1 (disagree) to 9 (fully agree).
The Multicultural Personality Questionnaire (MPQ)—Short Form (Van der Zee
et al., 2013) measures an individual’s potential to function in a new cultural environ-
ment. It is based on the five-factor model but has been adapted for the purpose of testing
multicultural effectiveness. It measures personality along five dimensions:

• Cultural Empathy: the ability to empathize with cultural diversity and to understand
feelings, beliefs, and attitudes different from one’s own heritage.
• Open-mindedness: an open, unprejudiced attitude toward diversity.
• Social Initiative: the tendency to approach social situations actively, to take the
initiative and engage in social situations.
• Flexibility: the ability to learn from new experiences, including adjusting behavior
according to contingency and enjoying novelty and change.
• Emotional Stability: the tendency to remain calm in stressful situations and to control
emotional reactions.

In addition, a sociobiographic questionnaire, based on Moyer (2004), was also filled in. For the purpose of the present study, the information regarding LOR and length of Swedish studies was used.
Table 1 contains an overview of the factors and the corresponding variables in the statistical analysis, as well as the instruments used. Cronbach's alpha was used as a measure of reliability in the cases in which the tests were compatible with this type of analysis (see Table 2). The MPQ is divided into five dimensions and the VIA questionnaire into two, and reliability coefficients were calculated for all of these. It should be noted that the internal consistency of the productive collocation test is very high (0.96), whereas some of the MPQ dimensions (Cultural Empathy, Open-mindedness, and Emotional Stability) do not have very good internal consistency, and results related to these dimensions should be interpreted with extra caution.

Table 2. Cronbach's alpha scores for instruments

Instrument α
PCK 0.96
MPQ Cultural Empathy 0.69
MPQ Flexibility 0.80
MPQ Social Initiative 0.85
MPQ Open-mindedness 0.65
MPQ Emotional Stability 0.69
VIA France 0.86
VIA Sweden 0.71

Procedures
The data collection process was undertaken by the first and second authors, who met in person with each participant in Stockholm during 2019 and 2020. The researchers and participants met in a place chosen by the participant: a private home, an office, or a café. The tests and questionnaires were presented in the following order:

1. Productive collocation test,
2. LLAMA aptitude test,
3. VIA acculturation questionnaire,
4. Multicultural personality questionnaire,
5. Target language engagement questionnaire,
6. Social network questionnaire, and
7. Sociobiographic questionnaire.

The whole session took 1.5–2 hours. The PCK test and the aptitude test were admin-
istered first because they were deemed to be more cognitively demanding than the others,
and we wanted to make sure that fatigue was not an issue when performing these tests.

Data analysis
Recent recommendations from the American Statistical Association highlight the problems with significance testing, for example, deciding whether a variable has an effect based solely on whether a p-value falls above or below .05 (Wasserstein et al., 2019). These recommendations have also been discussed within the SLA domain by Larson-Hall and Plonsky (2015). In line with them, we focus on estimating effect sizes and discussing the uncertainty in our measurements, rather than deciding whether an effect is "significant" based on a p-value. Furthermore, we aim to make the effect sizes meaningful by describing them in terms of how much each variable needs to change to increase the PCK score by 1 SD.
To answer the research questions, two multiple linear regressions were conducted with PCK as the dependent variable. Multicollinearity was likely not a problem in either model, as indicated by variance inflation factors (VIF) below 4 and tolerance levels above .3 for all variables. Specifically, the mean VIF in Model 1 was 1.85, and in Model 2 it was 1.30. Inspection of Q-Q plots indicated that the residuals in both models were approximately normally distributed. Finally, individual scatterplots and diagnostic plots (e.g., plots of residuals vs. fitted values) were inspected to verify that a linear model was appropriate for the data. All analyses were conducted in the statistical software R version 4.1.1 (R Core Team, 2020).
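The VIF logic can be illustrated with a toy example (not the authors' code). In general, a predictor's VIF is 1/(1 − R²), where R² comes from regressing that predictor on all the others; in the special case of only two predictors, this reduces to 1/(1 − r²), where r is the correlation between them:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor regression: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictor values for five participants
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(round(vif_two_predictors(x1, x2), 2))  # r = .80, so VIF = 1/(1 - .64) ≈ 2.78
```

A VIF below 4, as reported for all variables here, means less than 75% of a predictor's variance is shared with the other predictors.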
In total, five participants were excluded from the analyses. Four were excluded due to missing values in one or more variables. The final participant was excluded after inspection of a plot of residuals versus fitted values from the regression analyses revealed that the participant was a multivariate outlier.

Results
Because the present research is largely exploratory, we present two models. First, in Model 1, all independent variables were included to explore their respective effects on PCK. Thereafter, we present Model 2, which includes only the variables that Model 1 indicated had a meaningful effect on PCK. That is, to include a variable in Model 2, we considered both the size of the effect and the width of the confidence intervals.
Specifically, the confidence interval had to be narrow enough to indicate with some certainty that the effect exists in the population, and the size of the effect had to be large enough to be meaningful for the understanding of PCK. See Table 3 for the results of both models.

Table 3. Results of two multiple linear regression analyses with PCK as the dependent variable

Model b b 95% CI beta beta 95% CI semi-partial R2
Model 1
(Intercept) 9.15 [–16.19, 34.48]
LLAMA B 0.06 [–0.06, 0.18] 0.12 [–0.11, 0.36] .01
LLAMA D 0.17 [0.01, 0.34] 0.23 [0.01, 0.45] .03
LLAMA E 0.18 [0.08, 0.27] 0.39 [0.18, 0.61] .09
LLAMA F 0.04 [–0.05, 0.12] 0.10 [–0.12, 0.31] .01
Language engagement 3.67 [1.43, 5.91] 0.39 [0.15, 0.62] .07
Number of relations in L2 –0.30 [–1.29, 0.70] –0.09 [–0.38, 0.20] .00
VIA Sweden 0.51 [–1.49, 2.51] 0.05 [–0.14, 0.23] .00
VIA France 0.79 [–0.62, 2.21] 0.11 [–0.08, 0.30] .01
MPQ Cultural empathy 0.22 [–3.98, 4.43] 0.01 [–0.18, 0.21] .00
MPQ Flexibility –0.72 [–3.35, 1.90] –0.05 [–0.24, 0.14] .00
MPQ Social initiative –1.75 [–4.66, 1.16] –0.13 [–0.35, 0.09] .01
MPQ Open-mindedness –5.06 [–9.63, –0.49] –0.24 [–0.46, –0.02] .03
MPQ Emotional stability –1.80 [–4.76, 1.16] –0.12 [–0.31, 0.08] .01
LOR 0.72 [0.40, 1.04] 0.63 [0.35, 0.91] .13
Length of Swedish studies 0.41 [–1.23, 2.04] 0.05 [–0.14, 0.24] .00
Model 2
(Intercept) 11.51 [–3.33, 26.35]
LLAMA D 0.17 [0.03, 0.30] 0.22 [0.04, 0.40] .04
LLAMA E 0.20 [0.12, 0.27] 0.44 [0.27, 0.62] .16
Language engagement 3.90 [2.09, 5.71] 0.41 [0.22, 0.60] .12
MPQ Open-mindedness –5.75 [–9.23, –2.27] –0.28 [–0.44, –0.11] .07
LOR 0.56 [0.33, 0.79] 0.49 [0.29, 0.70] .15

Note: N = 59.
Due to the relatively low number of participants in the present study, the confidence
intervals are fairly broad, meaning that there is uncertainty about the size of the effects.
See Table 4 for means, standard deviations, and ranges of all variables, and Table A1 in the Appendix for a full correlation matrix between all variables.

Table 4. Means, standard deviations, and ranges

Variable Mean Standard deviation Range
PCK 28.71 9.78 3–39
LLAMA B 47.03 19.37 5–100
LLAMA D 32.37 12.91 0–60
LLAMA E 76.10 21.97 20–100
LLAMA F 51.53 25.18 0–100
Language engagement 2.81 1.03 0.87–5.00
Number of relations in L2 3.19 2.87 0–14
VIA Sweden 6.62 0.89 4.10–8.20
VIA France 6.59 1.31 2.40–9.00
MPQ Cultural empathy 4.04 0.45 2.63–4.88
MPQ Flexibility 3.10 0.70 1.75–4.75
MPQ Social initiative 3.58 0.73 2.13–5.00
MPQ Open-mindedness 3.75 0.47 2.88–4.63
MPQ Emotional stability 3.16 0.64 1.63–4.50
LOR 13.20 8.56 5–50
Length of Swedish studies 1.18 1.13 0–6

Models 1 and 2
When we included all 15 variables in Model 1, it explained 63% of the variance in PCK (adjusted R2 = .63). Building on Model 1, we propose a more compact model in which we included only the 5 variables that we judged to have a meaningful effect on PCK. Although Model 2 included 10 fewer variables, it still explained 63% of the variance (adjusted R2 = .63). Thus, both Model 1 and Model 2 explained a large amount of the variance according to Plonsky and Ghanbar's (2018, p. 724) categorization. However, given its use of fewer variables, Model 2 is a better model of which factors are important for developing PCK.
In the following text we present the effect of each variable together with the
respective research question. To make the effect sizes more meaningful, we present
them in terms of how much each variable has to change for the PCK score to increase by
1 SD (9.78 in the current sample).
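The arithmetic behind these descriptions is simply 1 SD of PCK divided by the variable's unstandardized coefficient b. A minimal sketch (illustrative only, not the authors' analysis code; coefficients taken from Model 1 in Table 3):

```python
SD_PCK = 9.78  # SD of PCK in the current sample

def change_needed(b, sd=SD_PCK):
    """Change in a predictor required to shift PCK by 1 SD, given slope b."""
    return sd / b

# Unstandardized coefficients (b) from Model 1
print(round(change_needed(0.17), 2))  # LLAMA D: 57.53
print(round(change_needed(0.18), 2))  # LLAMA E: 54.33
print(round(change_needed(3.67), 2))  # Language engagement: 2.66
```

Comparing the required change against each instrument's scale range is what lets the authors judge whether an effect is meaningful: a required change larger than the scale itself signals a negligible effect.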

To what extent does language aptitude predict PCK?


Model 1. Language aptitude had an effect on PCK, although not all of its components did. Specifically, Model 1 indicates that, to raise PCK by 1 SD, LLAMA D (b = .17) has to increase by 57.53 (i.e., 9.78/.17). Similarly, LLAMA E (b = .18) has to increase by 54.33. That is, although the model indicates that these two variables impact PCK, the effects are small, because an individual would have to move across almost the whole range of the LLAMA D scale to increase PCK by 1 SD, and across more than half of the LLAMA E scale. Nevertheless, the variables are still important for understanding PCK.
Meanwhile, LLAMA B (b = .06) and LLAMA F (b = .04) had no meaningful effects on PCK: to increase PCK by 1 SD, they would have to increase by 163.00 and 244.50 points, respectively. That is, the effects are so small that changes larger than the scale ranges would be required for PCK to increase by 1 SD.
Model 2. The effects of LLAMA D (b = .17) and LLAMA E (b = .20) remained almost the same in Model 2. That is, PCK increases by 1 SD for every 57.53 points on LLAMA D and for every 48.90 points on LLAMA E. Thus, Model 2 indicates that an individual's abilities both to recognize sounds and to make sound-symbol connections are important for PCK.

To what extent does reported language use predict PCK?


Model 1. An individual’s language engagement (b = 3.67) may have a positive effect on
PCK. The Language Engagement Questionnaire (LEQ) ranges from 0 (lowest) to
5 (highest), and an increase of 2.66 is required to increase PCK by 1 SD. Although
the size of the effect was not large, it is not unimportant.
Model 2. The size of the effect remained largely the same in Model 2 (b = 3.90): an increase of 2.51 points raises PCK by 1 SD. In other words, the model indicates that engaging with the L2 is beneficial for PCK.

To what extent do social networks predict PCK?


Model 1. The number of L1 speakers in the L2 user's social network does not have a meaningful effect on PCK according to Model 1 (b = –.30). In fact, Model 1 indicates that for PCK to increase by 1 SD, the number of L2 relations would have to decrease by 32.60. This effect is both small and implausible. Thus, it is more likely that there is no effect on PCK and that the small negative effect is due to the imprecision of the measurements.

To what extent does acculturation predict PCK?
Model 1. Both the VIA Sweden and VIA France scales range from 1 (lower) to 9 (higher). Model 1 indicates that acculturation has no meaningful effect on PCK. Specifically, VIA Sweden (b = .51) would have to increase by 19.18 to raise PCK by 1 SD. Similarly, VIA France (b = .79) would have to increase by 12.38. That is, to increase PCK by 1 SD, the VIA measures would need to increase by more than their scale ranges.

To what extent does multicultural effectiveness predict PCK?


Model 1. Each of the five measures in the Multicultural Personality Questionnaire
(MPQ) ranges from 1 (lowest) to 5 (highest). Model 1 indicates that out of the five
personality measurements, only Open-mindedness had a meaningful effect on PCK,
and the effect was negative. Namely, Open-mindedness (b = –5.06) has to decrease by
1.93 to raise PCK by 1 SD.
For the remaining four measures, a change of more than the range of the MPQ scale
would be needed to increase PCK by 1 SD. To raise PCK by 1 SD, Cultural empathy has
to increase by 44.45, Flexibility has to decrease by 13.58, Social initiative has to decrease
by 5.59, and finally, Emotional stability would have to decrease by 5.43.
Model 2. The effect was similar in size in Model 2 (b = –5.75): for every 1.70-point decrease in Open-mindedness, PCK increases by 1 SD. Thus, the effect was not large, but it is still important for PCK. Note, however, that the participants mainly used the higher part of the scale (M = 3.75). Only four participants scored below the scale midpoint (3), and all four scored 2.88. That is, the negative effect of being open-minded on PCK may hold only when comparing highly open-minded individuals to moderately open-minded ones.

Length of residence and length of Swedish studies


Model 1. While LOR had an effect on PCK (b = .72), length of L2 Swedish studies did not (b = .41). Specifically, to increase PCK by 1 SD, an individual would have to reside in the host country for an additional 13.58 years. Meanwhile, an individual would have to study the L2 for another 23.85 years. Given that the average participant in our sample had studied Swedish for 1.18 years, we judge that studying the L2 has only a very small effect, if any, on PCK.
Model 2. The effect of LOR on PCK was smaller in Model 2 (b = .56): PCK increases by 1 SD for every additional 17.47 years of residence. Nevertheless, Model 2 still indicates that LOR is important for an individual's PCK.

Discussion and conclusion


The present study set out to investigate which factors best explain individual variation
in long-term L2 users of Swedish (L1 French) with respect to productive collocation
knowledge. The sample included 59 participants (F = 35, M = 24, mean age at testing 41.7 years, mean LOR 13.20 years).
It was hypothesized that both language aptitude and language engagement would
be important predictors of PCK. The remaining variables were exploratory. Two
multiple regression analyses were conducted to investigate which of these factors best
predicts PCK. Model 1 included all the variables and explained 63% of the variance.
Model 2 included only the variables that, given Model 1, seemed to have a noticeable
effect. These were LLAMA D (sound recognition), LLAMA E (sound-symbol correspondence), language engagement, MPQ Open-mindedness, and LOR. Model 2 also explained 63% of the variance, in spite of including a much smaller number of variables. Relative to results from regression analyses in the field of SLA in general, an R2 of .63 constitutes a large effect, and the model is thus quite robust. Due to the limited sample size and the resulting imprecision of the estimates, it is difficult to say exactly which of the factors has the largest impact; however, the beta values suggest that language engagement, LLAMA E, and LOR have the strongest effects among all
the variables (see Table 3). Interestingly, this confirms the initial hypothesis and
resonates with the results from Granena and Long (2013) regarding language aptitude
(LLAMA D and LLAMA E) and with those of González-Fernández and Schmitt
(2015) and Dąbrowska (2019) for language use and experience. Furthermore, because
both LOR and language engagement were important predictors, the data lend strong support to usage-based theories' assumption that frequency effects are important in language acquisition (e.g., Ellis, 2002).
However, the study also shows, in accordance with the multifactorial approach proposed by the Douglas Fir Group (2016) and by Moyer (2004, 2021), that frequency of input and language engagement alone cannot explain learning outcomes. The present study indeed shows that a psychological factor such as language aptitude is important. In addition, another psychological factor was also part of Model 2: Open-mindedness. Interestingly, however, the relationship was negative in this case. As noted in the Results section, when interpreting this result we need to consider that the large majority of our participants reported values from 3 to 5 (the scale maximum) and that the lower values on the scale are not represented. A tentative interpretation would thus be that being extremely open-minded is negative for mastery of collocations, not necessarily that being clearly closed-minded is a facilitating factor. The results are thought-provoking in comparison to earlier findings on the role of personality in SLA, where Openness to Experience and Open-mindedness are consistently reported as positively associated with language learning (Moyer, 2021; Ożańska-Ponikwia & Dewaele, 2012). Having conducted fieldwork with the included participants, and based on the sociobiographic questionnaire, we know that some of our learners use English as a lingua franca on a daily basis. Some of these participants display cosmopolitan language ideologies and classify themselves as highly "open-minded." However, these participants typically attain only basic levels in Swedish. They reflect an international posture, and one could presume that an unexpected "side effect" of reporting being very open-minded is a lesser propensity to learn the local language. It is thus possible that the negative effect of open-mindedness is not specifically related to mastery of collocations but to language proficiency in general, in a situation in which the target language competes with the global lingua franca English. This finding calls for further research into personality traits and their connections to SLA. It also suggests relationships between personality and ideological positions, which could be further explored.
All in all, the study lends support to earlier findings on the role of both language
engagement and aptitude as important explanatory factors for high-level L2 proficiency
and collocation knowledge in particular. More generally, it suggests that a multifacto-
rial approach is necessary when accounting for second language proficiency in a
context of mobility and migration.
A limitation of the study is its sample homogeneity and size. Nevertheless, the present study is the first to investigate the impact of multiple factors on PCK in long-term L2 users. In addition, it is rare in that it targets an L2 that competes


with a global lingua franca. It is our hope that it will motivate similar studies in a multitude of L2 user contexts.
Acknowledgments. The study was funded by Vetenskapsrådet (the Swedish Research Council), grant
number 2017-01196.

Data availability statement. The data was considered to contain sensitive personal information by the
Swedish Ethical Review authority, hence data cannot be openly published. Please contact the corresponding
author for questions related to the data.

References
Abrahamsson, N., & Hyltenstam, K. (2008). The robustness of aptitude effects in near-native second language
acquisition. Studies in Second Language Acquisition, 30, 481–509. https://doi.org/10.1017/
S027226310808073X
Birdsong, D. (2005). Nativelikeness and non-nativelikeness in L2A research. IRAL: International Review of Applied Linguistics in Language Teaching, 43, 319–328. https://doi.org/10.1515/iral.2005.43.4.319
Dąbrowska, E. (2019). Experience, aptitude, and individual differences in linguistic attainment: A comparison of native and nonnative speakers. Language Learning, 69, 72–100. https://doi.org/10.1111/lang.12323
DeKeyser, R. M. (2000). The robustness of critical period effects in second language acquisition. Studies in
Second Language Acquisition, 22, 499–533. https://doi.org/10.1017/S0272263100004022
Dollmann, J., Kogan, I., & Weißmann, M. (2020). Speaking accent-free in L2 beyond the critical period: The compensatory role of individual abilities and opportunity structures. Applied Linguistics, 41, 787–809. https://doi.org/10.1093/applin/amz029
Dörnyei, Z., Durow, V., & Zahran, K. (2004). Individual differences and their effects on formulaic sequence
acquisition. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 87–106). John
Benjamins. https://doi.org/10.1075/lllt.9.06dor
Douglas Fir Group. (2016). A transdisciplinary framework for SLA in a multilingual world. The Modern
Language Journal, 100, 19–47. https://doi.org/10.1111/modl.12301
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of
implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188. https://
doi.org/10.1017/S0272263102002024
Ellis, N. C. (2006). Selective attention and transfer phenomena in L2 acquisition: Contingency, cue
competition, salience, interference, overshadowing, blocking, and perceptual learning. Applied Linguistics,
27, 164–194. https://doi.org/10.1093/applin/aml015
Ellis, N. C., & Wulff, S. (2015). Second language acquisition. In E. Dabrowska & D. Divjak (Eds.), Handbook of
Cognitive Linguistics (pp. 409–431). De Gruyter Mouton.
Forsberg Lundell, F., Lindqvist, C., & Edmonds, A. (2018). Productive collocation knowledge at advanced
CEFR levels. Evidence from the development of a test for advanced L2 French. Canadian Modern
Language Review, 74, 627–649.
Forsberg Lundell, F., & Sandgren, M. (2013). High-level proficiency in late L2 acquisition—Relationships
between collocational production, language aptitude and personality. In G. Granena & M. Long (Eds.)
Sensitive periods, aptitudes and ultimate attainment in L2 (pp. 231–256). John Benjamins.
González-Fernández, B., & Schmitt, N. (2015). How much collocation knowledge do L2 learners have? The effects of frequency and amount of exposure. ITL: International Journal of Applied Linguistics, 166, 94–126. https://doi.org/10.1075/itl.166.1.03fer
Granena, G., & Long, M. (2013). Age of onset, length of residence, language aptitude, and ultimate attainment in three linguistic domains. Second Language Research, 29, 311–343. https://doi.org/10.1177/0267658312461497
Gyllstad, H. (2007). Testing English collocations: Developing receptive tests for use with advanced Swedish learners [Doctoral dissertation, Lund University]. http://lup.lub.lu.se/search/ws/files/5893676/2172422.pdf

Hyltenstam, K. (2018). Second language ultimate attainment: Effects of maturation, exercise, and social/
psychological factors. Bilingualism: Language and Cognition, 21, 921–923. https://doi.org/10.1017/
S1366728918000172
Kormos, J. (2013). New conceptualisations of language aptitude in second language attainment. In G.
Granena & M. Long (Eds.), Sensitive periods, language aptitude, and ultimate L2 attainment
(pp. 131–152). John Benjamins.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings: What gets
reported and recommendations for the field. Language Learning, 65, 127–159. https://doi.org/10.1111/
lang.12115
Laufer, B., & Waldman, T. (2011). Verb‐noun collocations in second language writing: A corpus analysis of
learners’ English. Language Learning, 61, 647–672.
McManus, K., Mitchell, R., & Tracy-Ventura, N. (2014). Understanding insertion and integration in a study
abroad context: The case of English-speaking sojourners in France. Revue Française de Linguistique
Appliquée, 19, 97–116.
Meara, P. (2005). LLAMA language aptitude tests. Lognostics.
Moyer, A. (2004). Age, accent and experience in second language acquisition. Multilingual Matters.
Moyer, A. (2021). The gifted language learner: A case of nature or nurture? Cambridge University Press.
Nizonkiza, D. (2011). The relationship between lexical competence, collocational competence, and second
language proficiency. English Text Construction, 4, 113–145. https://doi.org/10.1075/etc.4.1.06niz
Ożańska-Ponikwia, K., & Dewaele, J. M. (2012). Personality and L2 use: The advantage of being openminded
and self-confident in an immigration context. Eurosla Yearbook, 12, 112–134.
Paquot, M., & Granger, S. (2012). Formulaic language in learner corpora. Annual Review of Applied
Linguistics, 32, 130–149. http://doi.org/10.1017/S0267190512000098
Plonsky, L., & Ghanbar, H. (2018). Multiple regression in L2 research: A methodological synthesis and guide
to interpreting R2 values. The Modern Language Journal, 102, 713–731. https://doi.org/10.1111/
modl.12509
Prentice, J., & Forsberg Lundell, F. (2021). Productive collocation knowledge and advanced CEFR-levels in
Swedish as a Second Language: A conceptual replication of Forsberg Lundell, Lindqvist & Edmonds
(2018). Journal of the European Second Language Association, 5, 44–53. https://doi.org/10.22599/jesla.72
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical
Computing. Vienna, Austria. https://www.R-project.org/
Ryder, A. G., Alden, L. E., & Paulhus, D. L. (2000). Is acculturation unidimensional or bidimensional? A head-to-head comparison in the prediction of personality, self-identity, and adjustment. Journal of Personality and Social Psychology, 79, 49–65. https://doi.org/10.1037/0022-3514.79.1.49
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64, 913–951. https://doi.org/10.1111/lang.12077
van der Zee, K., van Oudenhoven, J. P., Ponterotto, J., & Fietzer, A. (2013). Multicultural personality questionnaire: Development of a short form. Journal of Personality Assessment, 95, 118–124. https://doi.org/10.1080/00223891.2012.718302
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American
Statistician, 73, 1–19. https://doi.org/10.1080/00031305.2019.1583913
Wray, A. (2002). Formulaic language and the lexicon. Cambridge University Press.

Cite this article: Lundell, F. F., Arvidsson, K. and Jemstedt, A. (2023). The importance of psychological and
social factors in adult SLA: The case of productive collocation knowledge in L2 Swedish of L1 French long-
term residents. Studies in Second Language Acquisition, 45, 558–570. https://doi.org/10.1017/
S0272263122000419



Studies in Second Language Acquisition (2023), 45, 571–585
doi:10.1017/S0272263122000213

RESEARCH REPORT

Revisiting the moderating effect of speaker proficiency on the relationships among intelligibility, comprehensibility, and accentedness in L2 Spanish
Amanda Huensch1* and Charlie Nagle2
1University of Pittsburgh, Pittsburgh, PA, USA; 2Iowa State University, Ames, IA, USA
*Corresponding author. E-mail: amanda.huensch@pitt.edu

(Received 20 October 2021; Revised 14 April 2022; Accepted 25 April 2022)

Abstract
This report examines the potential impacts of task and proficiency on listener judgments of
intelligibility, comprehensibility, and accentedness in L2 Spanish. This study extends Huensch
and Nagle [Language Learning, 71, 626–668, (2021)], who explored the partial independence
among the global speech dimensions for speech samples taken from a picture narrative task.
Given that the type of speaking task used to elicit speech samples has been shown to impact the strength of the linguistic features contributing to the global speech dimensions, and in order to explore the impact of task on the relationships among the dimensions, the current study followed the same procedure as Huensch and Nagle but employed a task in which participants responded to a prompt based on the NCSSFL-ACTFL Can-Do Statements.
from instructed L2 Spanish learners of varying proficiency (n = 42) and were rated by a group
of native-speaking Spanish listeners (n = 80) using Amazon Mechanical Turk. In general, the results were consistent with those reported in the initial study, indicating a significant, positive, and consistent relationship between comprehensibility and intelligibility and a null relationship between accentedness and intelligibility. The limited differences between the studies' findings are discussed in light of the potential impact of task.

Introduction
Evidence for the partial independence of the global speech dimensions of intelligibility,
comprehensibility, and accentedness has resulted in a shift in L2 pronunciation teaching and learning goals away from nativeness principles—achieving nativelike pronunciation through accent reduction—toward intelligibility principles whose focus is on achieving understandable pronunciation (Derwing & Munro, 2015; Levis, 2005, 2020).1

1 Intelligibility and comprehensibility in the current study are defined in line with the conceptualizations of Derwing and Munro: They are related, yet distinct constructs. In other words, the current work is conducted

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000213 Published online by Cambridge University Press



The limited work exploring the relationships among all three of these global speech
dimensions (e.g., Derwing & Munro, 1997; Huensch & Nagle, 2021; Jułkowska & Cebrian,
2015; Munro & Derwing, 1995; Munro et al., 2006; Nagle & Huensch, 2020) has
demonstrated stronger and more consistent relationships between intelligibility (the
extent to which a listener has understood a speaker’s message) and comprehensibility
(the ease or difficulty a listener encounters trying to understand a speaker’s message) in
comparison to intelligibility and accentedness (the strength of a speaker’s foreign accent as
perceived by a listener). As the ultimate goal of language learning is successful commu-
nication of messages, the upshot of these findings is that L2 pronunciation teaching goals
ought to focus on improving comprehensibility, as opposed to accentedness, because
doing so is more likely to have an impact on intelligibility.
In comparison to the relatively limited number of studies that have incorporated
measures of intelligibility, more studies have focused on comprehensibility ratings (e.g.,
Bergeron & Trofimovich, 2017; Crowther et al., 2015a, 2018; French et al., 2020; Isaacs
& Trofimovich, 2012; Isbell et al., 2019; O’Brien, 2014; Saito et al., 2016; Trofimovich
et al., 2020). In justifying using comprehensibility ratings as opposed to intelligibility
measures, researchers have argued that comprehensibility ratings provide an intuitive
way to measure the subjective listener experience of processing difficulty, mirroring
real-world applications of such judgments (Crowther et al., 2015a; Trofimovich et al.,
2020). Additionally, comprehensibility ratings using Likert or sliding scales are relatively quick and easy to obtain compared with intelligibility measurements, which typically involve transcription tasks. Nevertheless, if comprehensibility is to be used as a proxy for intelligibility, then it is important to gain a better understanding of the factors that influence the strength of the intelligibility-comprehensibility relationship.
Beyond the paucity of work incorporating intelligibility measures, our understanding of the strength of the relationships among these global speech dimensions is
additionally limited by the fact that most research in this area has relied on a single
type of speaking task (i.e., picture narrative) as well as speech data from relatively
advanced speakers of L2 English. Huensch and Nagle (2021) sought to contribute to
this line of research by including measures of all three speech dimensions and by
investigating the speech of instructed learners of L2 Spanish of varying proficiency;
however, they used a picture narrative task to elicit speech data. The current study
tested the generalizability of these findings by modifying the speaking task to better
understand the influence of task on moderating the strength of the relationships among
the global speech dimensions, and whether and how proficiency impacts the strength of
those relationships.

Relationships among the global speech dimensions


Previous studies incorporating measurements of intelligibility, comprehensibility,
and accentedness have generally reported stronger relationships between intelligi-
bility and comprehensibility than between intelligibility and accentedness, but they
have also documented substantial interlistener variability in the strength of the
relationships (e.g., Derwing & Munro, 1997; Munro & Derwing, 1995). For instance,
Munro and Derwing (1995) reported that for 15 of their 18 listeners there was a
significant correlation between comprehensibility and intelligibility whereas that was
true for only five listeners for accentedness and intelligibility (p. 86). Similar findings
were reported in Jułkowska and Cebrian (2015), where statistically significant
correlations were found between comprehensibility and intelligibility for 15 of 18
listeners (ranging in strength from .667 to .825), whereas for accentedness and
intelligibility the same was true for only five listeners, with r values ranging from .099
to .686 (p. 224).

Revisiting the moderating effect of speaker proficiency 573
https://doi.org/10.1017/S0272263122000213 Published online by Cambridge University Press
A related line of work has examined linguistic predictors of comprehensibility and
accentedness. In general, accumulated findings indicate that both phonological and
lexicogrammatical features contribute to comprehensibility and accentedness judg-
ments. However, different features have been shown to map onto each listener-based
dimension (Trofimovich & Isaacs, 2012), and even among statistically significant
features, some (e.g., word stress) seem to be far better predictors than others (Isaacs &
Trofimovich, 2012). Furthermore, when features are bundled into factors, the weights
of these factors differ depending on the listener-based construct under consideration.
Phonological features tend to be more strongly associated with accentedness than
with comprehensibility, whereas for lexicogrammatical features, the opposite is true,
insofar as they show a stronger relationship with comprehensibility (Saito et al.,
2017). Since these baseline studies, a large body of work has begun to examine the
factors that could moderate these relationships. In this study, we focus on two:
speaker proficiency and task.

Proficiency as a moderator of the relationship among intelligibility,
comprehensibility, and accentedness
Huensch and Nagle (2021), a conceptual replication of Derwing and Munro (1997) and
Munro and Derwing (1995), explored the relationships among the three global speech
dimensions in L2 Spanish and investigated the potential impact of speaker proficiency
on the relationships. Their motivation for focusing on proficiency stemmed from
differences in those studies regarding the strengths of the relationships among the
speech dimensions that were potentially attributable to differences in proficiency
between the speaker samples. Huensch and Nagle (2021) hypothesized that the impact
of proficiency might be more evident at the higher and lower ends of the proficiency
continuum (in comparison to values in the middle) resulting in a curvilinear relation-
ship. In their study, speech samples were elicited from 42 instructed L2 learners of
Spanish of varying proficiency using a picture narrative task. Two utterances per
speaker were extracted from the beginning of the narratives and used as stimuli in
an online transcription and rating task using Amazon Mechanical Turk (AMT). Eighty
native speakers of Spanish completed the AMT task. These listeners were recruited
from five countries representing the dialect regions learners reported being most
exposed to (Argentina, Colombia, Mexico, Spain, Venezuela). Results from the
mixed-effects model analysis indicated a significant positive relationship between
intelligibility and comprehensibility (consistent across listeners), such that speech rated
as one standard deviation above the mean was twice as likely to be perfectly intelligible.
In contrast, accentedness was not a statistically significant predictor of intelligibility.
Huensch and Nagle also found a significant positive relationship between comprehen-
sibility and accentedness, but this relationship varied significantly across listeners.
When proficiency was incorporated into the models, the findings indicated that
proficiency did not impact the strength of the relationship between intelligibility and
comprehensibility. In contrast, proficiency did have an impact on the strength of the
relationship between comprehensibility and accentedness, such that there was a weaker
relationship between these two global speech dimensions in higher proficiency
speakers.

574 Amanda Huensch and Charlie Nagle
While these findings contributed to a better understanding of the relationship
among these global speech dimensions and the impact of proficiency on those relation-
ships, speech samples were elicited using the same picture narrative as Munro and
Derwing (1995) and Derwing and Munro (1997). This methodological choice was
desirable to provide a point of comparison when exploring whether findings general-
ized to L2 Spanish learners of varying proficiency, but it means that findings are still
limited to the same type of picture narrative task that much of the previous literature in
this area has relied on (Crowther et al., 2015a).

Task effects on measurements of comprehensibility and accentedness


In a series of studies, Crowther and colleagues (Crowther et al., 2015a, 2015b; Crowther
et al., 2018) investigated factors contributing to variation in how rated linguistic
features of phonology and fluency (e.g., intonation, speech rate) and lexicon, grammar,
and discourse (e.g., lexical appropriateness, grammatical accuracy) contributed to
predicting comprehensibility and accentedness ratings. Particularly relevant to the
current study, Crowther et al. (2018) explored speaking task effects. In addition to a
picture narrative task, they employed two speaking tasks selected to represent real-
world assessment contexts of their speaker sample: the IELTS long-turn speaking task
and the TOEFL iBT integrated task. Using speech samples from 60 L2 English learners
from multiple L1 backgrounds who were rated by 10 experienced L1 English listeners,
these studies provided evidence that speaking task indeed impacts how linguistic
features map onto comprehensibility and accentedness ratings. In line with previous
work, for the picture narrative task, both pronunciation and lexicogrammar features
were associated with comprehensibility whereas pronunciation features were associ-
ated with accentedness. However, a novel finding was that in the IELTS and TOEFL
tasks, accentedness became increasingly associated with lexicogrammar features, lead-
ing the authors to observe that “linguistic distinctions between accentedness and
comprehensibility were thus clearest in the picture task” (p. 454). Nevertheless,
correlation analyses indicated that ratings were strongly related across the three tasks
(picture = .80, IELTS = .79, TOEFL = .74, p. 450). Finally, although the effects were
small, task appeared to systematically impact the ratings such that in the picture
narrative task speakers were rated as less accented but also less comprehensible when
compared to ratings for the IELTS task.
Crowther et al. (2018) hypothesized that these findings might be, in part, explained
by task familiarity and flexibility both from the speakers’ and listeners’ perspectives. For
instance, regarding the picture narrative task, when listeners are familiarized with the
story prior to the experimental task, they are likely to establish expectations for what
they hear and how it is presented. At the same time, speakers are constrained by these
expectations and therefore have less flexibility in the content they provide such that
successfully completing the picture narrative task requires using certain vocabulary and
following certain narrative conventions. An extension of this is that utterances
extracted from the start of a picture narrative task would also likely be quite similar
in their linguistic content and structure such that listeners would encounter many
comparable utterances. Trofimovich et al. (2020) offered similar explanations in their
study exploring how inter-interlocutor comprehensibility ratings evolve over time
during dialogic interactions. They discussed how both task and experience with a
speaker’s speech could potentially influence comprehensibility. For instance, if listeners
are familiar with the content (or listening to content where there is strong expectation
about what will be uttered), as they would be in a picture narrative, then their processing
resources might be freed up to pay more attention to how the speech potentially
deviates from their expectations, thus negatively impacting comprehensibility. Put
another way, when the speaker does not produce what the listener expects, then the
listener must deploy additional resources to process that mismatch, which could lower
comprehensibility.
In sum, while much has been learned about global dimensions of L2 speech, the
evidence has primarily come from a single data source (i.e., picture narratives) in a
single target language (i.e., English), which limits the generalizability of findings.
Additionally, while several studies have contributed important findings to the field
by examining task effects for comprehensibility and accentedness ratings, those
studies have not incorporated intelligibility measures. Therefore, it is unknown how
speaking task might impact intelligibility-comprehensibility relationships, among
others, which is an important question if comprehensibility is going to continue to
be used as a proxy for intelligibility in L2 pronunciation research.

Research Questions and Predictions


The current study is an extension of Huensch and Nagle (2021), who investigated the
potential impact of speaker proficiency on relationships among the three global
speech dimensions in L2 Spanish using a picture narrative task. Motivated by prior
work demonstrating variability in strength among the global speech dimensions of
accentedness, comprehensibility, and intelligibility as well as task impacts on listener
ratings, the current study followed the same methodological procedures as Huensch
and Nagle but modified the speaking task variable. The research questions were as
follows:

1. To what extent are intelligibility, comprehensibility, and accentedness related to one
another in L2 Spanish speech elicited using a prompted response task?
2. To what extent does proficiency affect relationships among intelligibility, compre-
hensibility, and accentedness in L2 Spanish?

Huensch and Nagle (2021) found a significant positive relationship between intelligi-
bility and comprehensibility that was consistent across listeners but no statistically
significant relationship between accentedness and intelligibility. They also found a
significant positive relationship between comprehensibility and accentedness that
varied significantly across listeners. The picture narrative task they employed likely
had a positive impact on intelligibility because it allowed listeners to have strong
preconceived expectations about what they would hear. In contrast, these expectations
likely had a negative impact on comprehensibility ratings because any mismatches
between expectations and actual productions might have required the deployment of
additional processing resources. The current study employed a speaking task in which
participants responded to a prompt based on NCSSFL-ACTFL Can-Do Statements
that, like the IELTS task used in Crowther et al. (2018), allowed speakers more flexibility
in choosing what to talk about and how to do so. Therefore, we might predict that in the
current study overall intelligibility would be somewhat lower but comprehensibility
would be higher. Because listeners in the current study must concentrate on
understanding what the speaker is saying, as opposed to being able to determine what the
speaker was saying based on expectation, we might predict an even stronger alignment
between intelligibility and comprehensibility. In other words, in open-ended speech,
when all the listener has to rely on is what the speaker is saying, then comprehensibility
might be a very good representation of intelligibility. Regarding accentedness ratings,
although the findings from Crowther et al. (2018) might suggest an impact of task, effect
sizes were minimal. In terms of the current study, then, these previous findings might
allow us to predict that the relationship between accentedness and the other constructs
would remain relatively unaffected. Speaker flexibility in choosing what to say and
lowered listener expectations about content lead us to make similar predictions for
research question 2, which incorporates learner proficiency. Huensch and Nagle
found that proficiency did not appear to impact the strength of the relationship between
intelligibility and comprehensibility, but it did have an effect on the relationship between
comprehensibility and accentedness such that the strength of the relationship weakened
as proficiency increased. We predicted similar findings in the current study.

Method
For comparison, Table 1 provides a summary of the similarities and differences
between the Method of Huensch and Nagle (2021) and the current study.

Participants
Speakers
Speakers included the same 42 instructed L2 Spanish learners from Huensch and Nagle
(2021) who were recruited from first- through fourth-year Spanish courses at two
institutions in the United States. Participants were all native speakers of English who
reported using English most of the time during the week (M = 94%, SD = 6%). The
participants represented a range of proficiencies as indicated by their scores on an Elicited
Imitation Test (EIT): M = 55.88 (SD = 26.48), 95% CI [47.63, 64.13], Range 17–106
(out of 120). Four native speakers were also recruited to provide speech samples to ensure
that listeners understood the rating scales and tasks.

Table 1. Summary of Method differences between Huensch and Nagle (2021) and the current study

                 Huensch and Nagle (2021)       Current Study
Speakers         Spanish L2 learners (n = 42)   Same
Listeners        Spanish NSs (n = 80)           Different NS listeners (n = 80)
Speaking task    Hunter story                   Prompted response
Rating task      Amazon Mechanical Turk         Same

Table 2. Summary of listener characteristics

                                               M (SD)          Range
Age                                            35.33 (9.73)    19–62
Age of onset L2 English                        7.11 (3.83)     0–22
Self-reported global English proficiency*      6.86 (1.41)     2.25–9.00
Percent daily English use                      15.16 (12.93)   0–60
Familiarity L2 Spanish**                       6.39 (2.20)     1–9
L2 teaching experience                         Yes: 17         No: 63

*The proficiency scale ranged from 1–9 (1 = extremely poor, 9 = extremely proficient).
**The familiarity scale ranged from 1–9 (1 = not at all familiar, 9 = extremely familiar).

Listeners
Listeners included 80 NSs of Spanish recruited from multiple countries (e.g., Mexico,
Spain) using AMT. Listeners were recruited using the same IP address filters as those in
Huensch and Nagle (2021) in an effort to represent the major dialect zones of
instructors at the two institutions of the speakers. The final sample included listeners
from Spain (n = 40), Venezuela (n = 20), Mexico (n = 10), Colombia (n = 7), and
Argentina (n = 3). The goal was not to construct a set of homogeneous listeners, but
rather to have speakers evaluated by a range of listeners representing the varieties the
speakers had been exposed to and might interact with in the future. Table 2 provides a
summary of listener characteristics.

Materials and Procedure


Here, we give a brief overview of our materials and procedure. For complete method-
ological details, see Huensch and Nagle (2021) whose materials, experimental and
coding protocols, data, and analysis code are publicly available at https://osf.io/4j5cr/.
Data and analysis code for the current study are available at https://osf.io/4p7r8/.

Speaking task
Speakers completed a speaking task modeled on the NCSSFL-ACTFL Can-Do State-
ments in which they responded to the following prompt: Describe un lugar que hayas
visitado o que te interese visitar y explica por qué fuiste o por qué quieres ir a ese lugar
“Describe a place you have visited or are interested in visiting and explain why you went
there or why you might want to go to this place.” Participants were given time to think
about their responses and were asked to speak for approximately 1 minute. Two
utterances minus any initial hesitation markers were extracted from the start of each
response to be used for the rating and transcription task. Utterances from the L2
speakers in the current study were on average 9.3 words (SD = 3.7) with a range of 4–17
words.

AMT rating task


The Human Intelligence Task (HIT) deployed to AMT workers included: (1) a consent
form and information about the rating task, (2) a listener background questionnaire,
(3) instructions and four practice items, and (4) the experimental rating task. For each
item in the rating task, the listeners first heard an utterance one time. Then, the rating
interface became active, and listeners had 45 seconds to transcribe the utterance and
rate its accentedness and comprehensibility on 100-point sliding scales whose end
points were marked with muy difícil de entender / muy fácil de entender (“very difficult to
understand” / “very easy to understand”) for comprehensibility and acento extranjero
muy marcado / ningún acento extranjero (“very strong foreign accent” / “no foreign
accent”) for accentedness. Ratings were recorded as numerical values on a 100-point
scale (but listeners did not see the numbers).

Language background questionnaires
The L2 speakers completed a language background questionnaire to gather basic
demographic information about themselves and their language learning experiences.
They were also asked about the varieties of Spanish spoken by their instructors, and this
information guided the AMT task deployment. The native speaker listeners completed
a similar background questionnaire and were asked about their experience and famil-
iarity with L2 Spanish speech.

Scoring and analysis


Intelligibility coding
Each of the utterances extracted from the speakers’ open-ended responses was tran-
scribed in CLAN (MacWhinney, 2000) and checked by a second member of the
research team. Listener transcriptions from the AMT HIT were compared to the
researchers’ transcriptions to determine an intelligibility score for each utterance
computed as the percentage of words transcribed accurately. Trivial transcription
differences (e.g., grammatical regularizations such as transcribing el[MASC] día[MASC]
“the day” when the speaker said la[FEM] día[MASC], spelling mistakes such as aveces for
a veces “sometimes”) were not considered errors.
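In code, this percentage-correct scoring might be sketched as follows. This is a minimal illustration of the idea, not the authors' protocol: the function name is ours, and the normalization of trivial differences described above is omitted.

```python
def intelligibility_score(reference: str, transcription: str) -> float:
    """Proportion of reference words found in the listener's transcription.

    Hypothetical helper for illustration only; the study's actual coding also
    forgave trivial differences (gender regularizations, spelling variants),
    which would require a normalization step not shown here.
    """
    ref_words = reference.lower().split()
    remaining = transcription.lower().split()
    correct = 0
    for word in ref_words:
        # Consume matches so repeated words are not double-counted.
        if word in remaining:
            correct += 1
            remaining.remove(word)
    return correct / len(ref_words)

# "la" was not transcribed, so 4 of 5 reference words are credited.
print(intelligibility_score("quiero ir a la playa", "quiero ir a playa"))  # 0.8
```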

Analysis
We adopted the same analytical approach used in Huensch and Nagle (2021). First, we
examined the reliability of the comprehensibility and accentedness ratings using two-
way, consistency, average-measure intraclass correlation coefficients (ICC). For com-
prehensibility, ICC = .98 [.98, .99] and for accentedness, ICC = .98 [.97, .99], suggesting
that listeners were highly consistent in their use of the two scales. Next, we inspected the
distribution of the three scores. The intelligibility data showed extreme left-skew, with
most values occurring at 1 (i.e., perfect intelligibility). This amount of skew would have
affected the normality of model residuals. We therefore transformed intelligibility
scores into a new binary measure, where scores of 1 were coded as 1 (perfectly intelligible)
and scores < 1 were coded as 0 (not perfectly intelligible), matching the transformation applied to the intelligibility data
in Huensch and Nagle (2021). We then fit a logistic mixed-effects model to the binary
intelligibility outcome. Comprehensibility scores were reasonably distributed through-
out the 100-point scale, which indicated that there would be no issue with proceeding
with the linear mixed-effects models.
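A two-way, consistency, average-measures ICC, often written ICC(C,k), can be computed from the mean squares of a two-way ANOVA. The following is a minimal implementation of that formula for illustration, not the authors' analysis code, and the ratings matrix is invented:

```python
def icc_consistency_avg(ratings):
    """ICC(C,k): two-way, consistency, average-measures ICC.

    `ratings` is a list of rows (one per rated target), each a list of
    rater scores. Formula: (MS_targets - MS_error) / MS_targets.
    """
    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Two raters who agree up to a constant offset are perfectly consistent:
print(icc_consistency_avg([[1, 2], [2, 3], [3, 4]]))  # 1.0
```

Because consistency ICCs ignore constant rater offsets, a rater who is uniformly stricter does not lower the coefficient, which suits rating scales used relatively rather than absolutely.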
We included comprehensibility and accentedness as predictors of intelligibility,
alongside the listener-level covariates identified as potentially impactful in Huensch
and Nagle (2021): biological age, age of onset of L2 English, self-reported percent daily
English use, self-reported global English proficiency, self-reported familiarity with L2
Spanish speech, and a categorical predictor to account for whether listeners had L2
teaching experience. We also included length of utterance, in syllables, as an utterance-
level covariate. All continuous predictors were standardized. With respect to the
random effects structure of the model, we adopted by-speaker and by-listener random
intercepts, testing by-listener random slopes for focal predictors when the correspond-
ing fixed effect reached significance. Testing by-listener random slopes allowed us to
estimate between-listener variation in the relationship between the focal predictor and
utterance-level intelligibility. We adopted a similar procedure for probing the relation-
ship between comprehensibility and accentedness. We fit a linear mixed-effects model
to the comprehensibility data with accentedness as our focal predictor, including the
same covariates as above and testing the same random effects. We also included
intelligibility as a covariate so that we could estimate the relationship between accent-
edness and comprehensibility after controlling for the intelligibility of the utterance.
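Standardizing a continuous predictor means converting it to z-scores so that model coefficients are read in standard-deviation units. A minimal sketch (the helper name is ours, using the sample standard deviation):

```python
def standardize(values):
    """Convert a list of predictor values to z-scores (sample SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

print(standardize([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```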
After building these primary models, we tested interactions between comprehensi-
bility and accentedness and proficiency (i.e., participants’ EIT score) in the intelligi-
bility model and an interaction between accentedness and proficiency in the
comprehensibility model. We examined proficiency as both a linear and quadratic
moderating variable, on the view that the moderating effect of proficiency on the
relationship between the listener-based constructs might not be linear. For the linear
mixed-effects model fit to the comprehensibility outcome variable, we checked the
following assumptions: normality of residuals using QQ plots, linearity by plotting
fitted values against residuals, and multicollinearity by computing variance inflation
factors. Unless otherwise noted, models passed these tests.

Results
As displayed in Figure 1 and mentioned in the preceding text, the intelligibility data
were heavily left-skewed, the comprehensibility data showed a relatively even distri-
bution throughout the 100-point scale, and the accentedness data were moderately
right-skewed. Descriptive statistics confirmed this trend: for intelligibility, M = .91
(.14); for comprehensibility, M = 58.80 (29.02); for accentedness, M = 29.36 (24.64).
Overall, then, it would be fair to characterize the utterances as highly intelligible,
moderately comprehensible, and strongly accented.

Interrelationships among the listener-based constructs


The logistic mixed-effects model fit to the binary intelligibility data showed a significant
relationship between comprehensibility and intelligibility (Odds Ratio = 2.05, 95% CI =
[1.86, 2.26], p < .001), whereas the relationship between accentedness and intelligibility
did not reach significance (Odds Ratio = 1.02, 95% CI = [0.93, 1.13], p = .62). The
odds ratio of 2.05 for comprehensibility indicates that, on average, utterances that
were 1 SD more comprehensible (where 1 SD corresponds to 29.02 units on the
100-point comprehensibility scale) were twice as likely to be intelligible. The marginal
R2 was .20, which indicates that the fixed effects accounted for approximately 20% of
variance in intelligibility, and the conditional R2, which includes the random effects,
was .46, indicating that the fixed and random effects together explain 46% of the
variance in intelligibility.

Figure 1. Distribution of intelligibility, comprehensibility, and accentedness scores.

By-listener random slopes for comprehensibility did not
improve the fit of the model (χ2(2) = 4.22, p = .12), suggesting that the relationship
between comprehensibility and intelligibility was consistent across listeners.
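To unpack this interpretation: a logistic coefficient lives on the log-odds scale, and exponentiating it yields the multiplicative change in the odds of perfect intelligibility per 1-SD increase in comprehensibility. The sketch below assumes a hypothetical intercept; only the 2.05 odds ratio comes from the results:

```python
import math

# Hypothetical values: only the odds ratio of 2.05 comes from the reported
# results; the intercept (baseline log-odds of perfect intelligibility at mean
# comprehensibility) is invented for illustration.
beta = math.log(2.05)   # per-1-SD effect of comprehensibility, log-odds scale
intercept = 1.0

def prob(log_odds: float) -> float:
    """Inverse logit: convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

p_mean = prob(intercept)            # P(perfectly intelligible) at the mean
p_plus1sd = prob(intercept + beta)  # ... at +1 SD comprehensibility (29.02 points)

# Recovering the odds ratio from the two probabilities gives back exp(beta):
odds_ratio = (p_plus1sd / (1 - p_plus1sd)) / (p_mean / (1 - p_mean))
print(round(odds_ratio, 2))  # 2.05
```

Note that the odds ratio is constant across the scale, but the corresponding change in probability is not; it depends on the baseline probability.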
The linear mixed-effects model fit to the comprehensibility data revealed a signif-
icant positive relationship between accentedness and comprehensibility, after control-
ling for intelligibility: estimate = 10.73, 95% CI = [9.43, 12.03], p < .001. This estimate
demonstrates that utterances rated as 1 SD less accented (where 1 SD for accentedness
corresponds to 24.64 units on the 100-point scale) tended to be judged as 10.73 units
more comprehensible. Integrating by-listener random slopes for accentedness
improved the fit of the model: χ2(2) = 313.12, p < .001. Thus, whereas the relationship
between comprehensibility and intelligibility was consistent across listeners, the rela-
tionship between accentedness and comprehensibility varied considerably. The mar-
ginal R2 of this model was .33 and the conditional R2 was .64. Thus, the fixed effects
accounted for approximately 33% of the variance in comprehensibility versus 64% for
the fixed and random effects together.
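Because accentedness was standardized, the 10.73 estimate is a per-SD slope; dividing by the SD recovers the slope per raw point on the 100-point scale (values taken from the text):

```python
# A 1-SD change in accentedness (24.64 points on the 100-point scale)
# corresponds to a 10.73-unit change in rated comprehensibility.
slope_per_sd = 10.73
sd_accentedness = 24.64
slope_per_point = slope_per_sd / sd_accentedness
print(round(slope_per_point, 2))  # 0.44
```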
Overall, then, these initial models showed a strong and stable relationship between
comprehensibility and intelligibility and a strong but variable relationship between
comprehensibility and accentedness. Considering the effect size benchmarks proposed
by Plonsky and Ghanbar (2018), where R2 < .20 is small, .20 < R2 < .50 is medium, and
.50 < R2 is large, these models could be considered in the small (intelligibility) to
medium (comprehensibility) range.

Effect of proficiency on interrelationships among the listener-based constructs


We included proficiency as a moderating variable by generating interactions with
the focal predictors in each model: for intelligibility, proficiency × comprehensibility
and proficiency × accentedness, and for comprehensibility, proficiency ×
accentedness. We used the poly function to generate orthogonal linear and quadratic
terms, and we used likelihood ratio tests to determine if the more complex
model with the proficiency interactions significantly improved model fit over a
simpler, predecessor model (described in the previous section) that did not include
those interactions.
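The poly function referenced here is R's poly(), which returns orthogonal linear and quadratic terms that are uncorrelated with each other and with the intercept, avoiding the collinearity of raw x and x². A rough Python analogue via QR decomposition (our own sketch; scaling and sign conventions may differ from R's):

```python
import numpy as np

def poly_orthogonal(x, degree=2):
    """Roughly emulate R's poly(): orthonormal polynomial terms via QR."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    Q, _ = np.linalg.qr(X)
    return Q[:, 1:]  # drop the intercept column

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
P = poly_orthogonal(x)
print(np.allclose(P.T @ P, np.eye(2)))   # columns are orthonormal
print(np.allclose(P.sum(axis=0), 0.0))   # and orthogonal to the intercept
```

Orthogonalizing the terms lets the linear and quadratic moderators be tested without the near-perfect correlation that raw proficiency and proficiency-squared would introduce.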
The interaction model for intelligibility was a marginally better fit than the
baseline model: χ2(5) = 12.82, p = .03. Interestingly, when we tested a simpler
interaction model, including only linear proficiency in interaction with the focal
predictors, that model did not prove to be an improvement over baseline: χ2(2) = 2.59,
p = .27. This finding indicates that the quadratic moderator was the primary driver of
the modest improvement in model fit. Neither of the linear interactions was
statistically significant, but both of their quadratic counterparts were. The odds ratio
for the quadratic proficiency × comprehensibility interaction was greater than
1 (Odds Ratio = 1.11, 95% CI = [1.02, 1.20], p = .01), which shows that the
relationship between comprehensibility and intelligibility was slightly stronger at
the proficiency extremes (i.e., in speakers of lower and higher proficiency).
Conversely, the odds ratio for the quadratic proficiency × accentedness interaction was
less than 1 (Odds Ratio = 0.91, 95% CI = [0.84, 0.97], p = .01), which indicates that the
relationship between accentedness and intelligibility was weaker at both lower and
higher proficiency levels. It bears repeating, however, that the overall relationship
between accentedness and intelligibility was not significant. Thus, the significant
quadratic interaction has two interpretations: (1) at certain proficiency levels, the
relationship between accentedness and intelligibility could be significant, but those
levels would likely be extreme and not attested in most speakers and (2) there could be
significant differences in the relationship between accentedness and intelligibility in
speakers of varying proficiency (i.e., the accentedness-intelligibility slope estimate at
proficiency = –1 SD could be different from the slope estimate at proficiency = +1
SD) despite a nonsignificant overall finding (i.e., each slope may not be significantly
different from zero). Furthermore, the marginal R2 of the interaction model was .21,
which represents a negligible 1% improvement over the baseline model (R2 = .20).
Thus, it would be fair to say that the statistically significant improvement in model fit
was not practically significant. Put another way, relationships between comprehen-
sibility and intelligibility and accentedness and intelligibility do not appear to vary
much at all as a function of speaker proficiency.
For comprehensibility, a model with a linear proficiency × accentedness interaction
was an improvement over the baseline model (χ2(1) = 34.14, p < .001), but a
model with a quadratic interaction did not result in any additional improvement
(χ2(2) = 5.66, p = .06). The significant negative coefficient for the interaction term
(estimate = –1.44, 95% CI = [–1.93, –0.96], p < .001) shows that the relationship
between accentedness and comprehensibility became slightly weaker in speakers of
higher proficiency. Put another way, accentedness seems to be more strongly
aligned with comprehensibility at lower proficiency levels. Again, however, consid-
ering the 100-point comprehensibility scale and the magnitude of the baseline
accentedness estimate, which was 11.39 in the updated model (95% CI = [10.09,
12.69], p < .001), the effect of proficiency on the relationship between accentedness
and comprehensibility was relatively small. This fact is also confirmed by the
marginal R2 of the interaction model, which remained .33, the same as the baseline
model. Thus, as was the case for the intelligibility model, the relationship between
accentedness and comprehensibility does not appear to vary with speaker profi-
ciency, at least not in a practically significant way.
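The model comparisons reported above are likelihood-ratio tests, whose p-values come from a chi-square distribution with degrees of freedom equal to the number of added parameters. A minimal sketch, using closed-form survival functions for the two df values needed here, recovers the reported significance pattern:

```python
# Sketch: recovering p-values for the likelihood-ratio tests reported above
# from their chi-square statistics, using closed forms for df = 1 and df = 2.
import math

def chi2_sf(x: float, df: int) -> float:
    """Survival function of the chi-square distribution for df in {1, 2}."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 supported in this sketch")

p_linear = chi2_sf(34.14, df=1)    # linear interaction vs. baseline
p_quadratic = chi2_sf(5.66, df=2)  # quadratic vs. linear interaction

print(p_linear < 0.001, round(p_quadratic, 2))  # True 0.06
```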
Discussion
In this study, we found a significant, positive, and consistent relationship between
comprehensibility and intelligibility and a null relationship between accentedness and
intelligibility. We also found a significant positive relationship between accentedness
and comprehensibility, but that relationship varied significantly across listeners. As
shown in the top portion of Table 3, these findings closely align with those reported in
Huensch and Nagle (2021). In fact, most coefficients were a near exact match across the
studies, which suggests that task had very little effect on the relationships between the
listener-based constructs. The only coefficient that changed slightly was the estimate of
the relationship between accentedness and comprehensibility, which was slightly
smaller in the present study than in Huensch and Nagle (2021). This shrinkage, albeit
modest (see the substantial overlap in the 95% CIs), suggests that there is a somewhat
weaker relationship between accentedness and comprehensibility when speakers have
complete freedom in choosing what grammar and vocabulary to use and when listeners
have less concrete expectations about the content of the speech sample. Perhaps then,
when listeners have strong expectations about what a speaker will say and the language
they will use to say it, they can allocate attention toward the way in which the speaker
communicates the information rather than focusing on what they are trying to
communicate. As a result, if a speaker does not produce what the listener expects,
additional processing resources might be required to address the mismatch, which
could explain a somewhat stronger comprehensibility-accentedness link for the picture
narration samples than for the prompted response samples.

https://doi.org/10.1017/S0272263122000213 Published online by Cambridge University Press

582 Amanda Huensch and Charlie Nagle

Table 3. Comparison of results: Huensch & Nagle (2021) vs. current study

                                     Huensch & Nagle (2021)       Current Study
                                     (Task: Picture narration)    (Task: Prompted response)
Descriptive statistics
  Intelligibility                    .93 (.12)                    .91 (.14)
  Comprehensibility                  55.62 (29.01)                58.80 (29.02)
  Accentedness                       30.36 (24.65)                29.36 (24.64)
Baseline models
  Comprehensibility-intelligibility  Odds Ratio = 2.07*           Odds Ratio = 2.05*
                                     95% CI = [1.87, 2.29]        95% CI = [1.86, 2.26]
  Accentedness-intelligibility       Estimate = 1.01              Estimate = 1.02
                                     95% CI = [0.91, 1.11]        95% CI = [0.93, 1.13]
  Accentedness-comprehensibility     Estimate = 11.53*            Estimate = 10.73*
                                     95% CI = [10.23, 12.83]      95% CI = [9.43, 12.03]
                                     Random slopes SD = 5.10*     Random slopes SD = 5.05*
Proficiency models
  Comprehensibility-intelligibility  na                           Quadratic moderator
                                                                  Odds Ratio = 1.11
                                                                  95% CI = [1.02, 1.20]
                                                                  Δ R2 = .01 (1% variance)
  Accentedness-intelligibility       Quadratic moderator          Quadratic moderator
                                     Odds Ratio = 0.91            Odds Ratio = 0.91
                                     95% CI = [0.84, 0.99]        95% CI = [0.84, 0.97]
                                     Δ R2 = .05 (5% variance)     Δ R2 = .01 (1% variance)
  Accentedness-comprehensibility     Linear moderator             Linear moderator
                                     Estimate = –0.83             Estimate = –1.44
                                     95% CI = [–1.38, –0.28]      95% CI = [–1.93, –0.96]
                                     Δ R2 = .00 (0% variance)     Δ R2 = .00 (0% variance)

In terms of the moderating effect of proficiency on the relationships between the
listener-based constructs, again, findings for the prompted response task closely align
with findings for the picture narration task (see the lower portion of Table 3). Huensch
and Nagle did not find that proficiency affected the relationship between comprehen-
sibility and intelligibility, whereas in this study we did. However, it is important to bear
in mind that this effect was very small, explaining less than 1% of the variance in the
intelligibility data. Therefore, despite differences in what reached statistical significance
across the two studies, the practical significance of the findings is clear: In both studies,
proficiency had very little impact on the relationship between comprehensibility and
intelligibility. The same could be said of the effect of proficiency on the relationship
between accentedness and intelligibility. Despite reaching statistical significance in
both reports, the amount of variance that the interaction term explained was negligible
in the current study (1%) and very modest in the previous study (5%). Thus, it would be
fair to say that proficiency seems to have very little impact on the accentedness-
intelligibility relationship irrespective of task type. Lastly, although integrating
proficiency interactions into the comprehensibility model improved model fit, the
additional variance in comprehensibility that those terms explained was very small
(< 1%). The tentative conclusion that can be reached, then, is that proficiency has little
to no impact on the relationships between the listener-based dimensions, which also
appear to be consistent across speaking tasks.



Revisiting the moderating effect of speaker proficiency 583

Conclusion
The current study’s extension of Huensch and Nagle (2021) provides additional evidence
for the partial independence of the global speech dimensions of intelligibility, compre-
hensibility, and accentedness. Importantly, it lends further support to the pedagogical
focus on comprehensibility given its stronger and more consistent relationship to
intelligibility in comparison to accentedness. Furthermore, the findings indicated a
limited effect of speaking task in moderating the strength of the intelligibility and
comprehensibility relationship. One consideration for future work relates to the fact that
the current study included an intelligibility measure whereas many previous studies only
included ratings of accentedness and comprehensibility. This raises an interesting
question about whether having raters transcribe the speech might influence ratings of
comprehensibility and/or accentedness and thus potentially the strength of their relationship
as well. For instance, if a listener is unable to complete the transcription task, this likely
signals difficulties related to comprehensibility, whereas easy completion of the
transcription task suggests ease of processing.
As suggested by Huensch and Nagle (2021), the inclusion of an intelligibility transcrip-
tion task might explain the relatively consistent relationship between intelligibility and
comprehensibility across listeners. This prompts the question of whether the strength of the
accentedness/comprehensibility relationship might vary depending upon whether or not
a transcription task is included (i.e., depending on the methodological characteristics of
the research design). For instance, Derwing and Munro (1997), which included intelli-
gibility, reported a mean correlation of r = 0.45 (p. 7) whereas Saito et al. (2016) and Isbell
et al. (2019), who did not include intelligibility, reported correlation coefficients of r =
0.89 (p. 226) and r = 0.92 (p. 36), respectively. Future work should explore the potential
impact of these methodological differences. Additionally, as noted by Saito (2021), a
fruitful avenue for future meta-analytic work examining the global speech dimensions
would be the inclusion of task as part of a moderator analysis, given the growing number
of primary studies.
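A sketch of how such a comparison could be formalized is Fisher's z test for two independent correlations. The r values below are those reported above; the sample sizes are hypothetical placeholders, since the original studies' ns are not given here.

```python
# Sketch: Fisher's z test for the difference between two independent
# correlations. Hypothetical n = 40 per study; r values as reported above.
import math

def fisher_z(r: float) -> float:
    """Fisher's r-to-z transform."""
    return math.atanh(r)

def z_diff(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

z = z_diff(0.89, 40, 0.45, 40)
print(round(z, 2))  # well above 1.96, i.e., a reliable difference at these ns
```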
Another interesting avenue for future work will be considering how the complexity
and predictability of the speaking sample affect listener-based ratings and the
linguistic variables that predict them. For instance, when listeners can easily predict
the content of the sample, either because the speaking task is relatively circumscribed
or because they received instructions and images representing what the speakers had
to do, they may be able to ascertain what speakers have said even if the speech is
difficult to process, in which case intelligibility and comprehensibility (and the
features that map onto them) might show a weaker relationship. However, when
the message is less predictable, then listeners may need to focus entirely on appre-
hending what the speaker said, bringing intelligibility and comprehensibility closer in
line with one another. Regardless of whether such hypotheses are borne out, much
more work is needed on how task characteristics and listener background knowledge
interact with the linguistic and stylistic variables that the speaker brings to the table.
To that end, experimental studies that manipulate those variables could prove
especially illuminating.
Acknowledgments. This work was funded by a University of South Florida Creative Scholarship Grant and
a University of South Florida Nexus Initiative Award to the first author and an Iowa State University Social
Sciences Seed Grant to the second author. We would like to thank the participants and our research assistants,
especially Aneesa Ali and Bianca Pinkerton.

Data Availability Statement. The experiment in this article earned Open Materials and Open Data badges
for transparent practices. The materials and data are available at https://osf.io/4j5cr/ and https://osf.io/4p7r8/.


References
Bergeron, A., & Trofimovich, P. (2017). Linguistic dimensions of accentedness and comprehensibility:
Exploring task and listener effects in second language French. Foreign Language Annals, 50, 547–566.
https://doi.org/10.1111/flan.12285
Crowther, D., Trofimovich, P., Isaacs, T., & Saito, K. (2015a). Does a speaking task affect second language
comprehensibility? Modern Language Journal, 99, 80–95. https://doi.org/10.1111/modl.12185
Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T. (2015b). Second language comprehensibility revisited:
Investigating the effects of learner background. TESOL Quarterly, 49, 814–837. https://doi.org/10.1002/
tesq.203
Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T. (2018). Linguistic dimensions of L2 accentedness and
comprehensibility vary across speaking tasks. Studies in Second Language Acquisition, 40, 443–457.
https://doi.org/10.1017/S027226311700016X
Derwing, T. M., & Munro, M. J. (1997). Accent, intelligibility, and comprehensibility: Evidence from four L1s.
Studies in Second Language Acquisition, 19, 1–16. https://doi.org/10.1017/S0272263197001010
Derwing, T. M., & Munro, M. J. (2015). Pronunciation fundamentals: Evidence-based perspectives for L2
teaching and research. John Benjamins.
French, L. M., Gagné, N., & Collins, L. (2020). Long-term effects of intensive instruction on fluency,
comprehensibility and accentedness. Journal of Second Language Pronunciation, 6, 380–401. https://
doi.org/10.1075/jslp.20026.fre
Huensch, A., & Nagle, C. (2021). The effect of speaker proficiency on intelligibility, comprehensibility, and
accentedness in L2 Spanish: A conceptual replication and extension of Munro and Derwing (1995a).
Language Learning, 71, 626–668. https://doi.org/10.1111/lang.12451
Isaacs, T., & Trofimovich, P. (2012). Deconstructing comprehensibility: Identifying the linguistic influences
on listeners’ L2 comprehensibility ratings. Studies in Second Language Acquisition, 34, 475–505. https://
doi.org/10.1017/S0272263112000150
Isbell, D. R., Park, O. S., & Lee, K. (2019). Learning Korean pronunciation: Effects of instruction, proficiency,
and L1. Journal of Second Language Pronunciation, 5, 13–48. https://doi.org/10.1075/jslp.17010.isb
Jułkowska, I. A., & Cebrian, J. (2015). Effects of listener factors and stimulus properties on the intelligibility,
comprehensibility and accentedness of L2 speech. Journal of Second Language Pronunciation, 1, 211–237.
https://doi.org/10.1075/jslp.1.2.04jul
Levis, J. M. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly,
39, 369–377. https://www.jstor.org/stable/3588485
Levis, J. M. (2020). Revisiting the intelligibility and nativeness principles. Journal of Second Language
Pronunciation, 6, 310–328. https://doi.org/10.1075/jslp.20050.lev
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Lawrence Erlbaum.
Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the
speech of second language learners. Language Learning, 45, 73–97. https://doi.org/10.1111/j.1467-
1770.1995.tb00963.x
Munro, M. J., Derwing, T. M., & Morton, S. L. (2006). The mutual intelligibility of L2 speech. Studies in Second
Language Acquisition, 28, 111–131. https://doi.org/10.1017/S0272263106060049
Nagle, C., & Huensch, A. (2020). Expanding the scope of L2 intelligibility research: Intelligibility, compre-
hensibility, and accentedness in L2 Spanish. Journal of Second Language Pronunciation, 6, 329–351.
https://doi.org/10.1075/jslp.20009.nag
O’Brien, M. G. (2014). L2 learners’ assessments of accentedness, fluency, and comprehensibility of native and
nonnative German speech: L2 learner assessments. Language Learning, 64, 715–748. https://doi.org/
10.1111/lang.12082
Plonsky, L., & Ghanbar, H. (2018). Multiple regression in L2 research: A methodological synthesis and guide
to interpreting R2 values. Modern Language Journal, 102, 713–731. https://doi.org/10.1111/modl.12509
Saito, K. (2021). What characterizes comprehensible and native-like pronunciation among English-as-a-
second-language speakers? Meta-analyses of phonological, rater, and instructional factors. TESOL Quar-
terly, 55, 866–900. https://doi.org/10.1002/tesq.3027
Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic
correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycho-
linguistics, 37, 217–240. https://doi.org/10.1017/S0142716414000502

Saito, K., Trofimovich, P., & Isaacs, T. (2017). Using listener judgments to investigate linguistic influences on
L2 comprehensibility and accentedness: A validation and generalization study. Applied Linguistics, 38,
439–462. https://doi.org/10.1093/applin/amv047
Trofimovich, P., & Isaacs, T. (2012). Disentangling accent from comprehensibility. Bilingualism: Language
and Cognition, 15, 905–916. https://doi.org/10.1017/S1366728912000168
Trofimovich, P., Nagle, C. L., O’Brien, M. G., Kennedy, S., Taylor Reid, K., & Strachan, L. (2020). Second
language comprehensibility as a dynamic construct. Journal of Second Language Pronunciation, 6,
430–457. https://doi.org/10.1075/jslp.20003.tro

Cite this article: Huensch, A. and Nagle, C. (2023). Revisiting the moderating effect of speaker proficiency
on the relationships among intelligibility, comprehensibility, and accentedness in L2 Spanish. Studies in
Second Language Acquisition, 45, 571–585. https://doi.org/10.1017/S0272263122000213



Studies in Second Language Acquisition (2023), 45, 586–598
doi:10.1017/S0272263122000304

REPLICATION STUDY

Wh-Dependency Processing in a Naturalistic Exposure Context: Sensitivity to Abstract Syntactic Structure in High-Working-Memory L2 Speakers
Robyn Berghoff
Department of General Linguistics, Stellenbosch University, Stellenbosch, South Africa
Corresponding author. E-mail: berghoff@sun.ac.za

(Received 09 July 2021; Revised 08 June 2022; Accepted 24 June 2022)

Abstract
This study replicates Felser and Roberts (2007), which used a cross-modal picture priming
task to examine indirect-object dependency processing in classroom L2 learners. The
replication focuses on early L2 learners with extensive naturalistic L2 exposure (n = 22)—
an understudied group in the literature—and investigates whether these learners, in contrast
to those in the original study, reactivate the moved element at its original position in the
sentence. Bayesian multilevel regression is used to analyze the data. The results suggest that
higher-working-memory participants did reactivate the moved element at its structural
origin. By extending previous research to an understudied group, the study contributes to
our knowledge regarding sensitivity to abstract syntactic structure in L2 processing.

Introduction
A key point of interest in second language (L2) acquisition research relates to the
circumstances under which first language (L1) and L2 processing converge. L1–L2
differences have repeatedly been observed in the processing of syntactically complex
constructions, such as those involving movement dependencies (e.g., Berghoff, 2020;
Felser & Roberts, 2007; Marinis et al., 2005). These differences have been attributed to a
variety of factors, including reduced sensitivity to abstract syntactic structure in L2
compared to L1 speakers and reduced L2 compared to L1 processing automaticity. At
the same time, certain theoretical accounts (Clahsen & Felser 2006a, 2006b, 2018;
Ullman, 2001) and empirical findings (Pliatsikas & Marinis, 2013; Pliatsikas et al., 2017)
suggest that L1–L2 processing convergence may be more likely among (early) L2
learners with naturalistic exposure to the L2. The ability to draw conclusions in this
area is currently limited by the dearth of studies that have examined such learners and
these studies’ focus on only one type of movement dependency, namely long-distance
wh-dependencies. The present study extends this body of literature by exploring the

© The Author(s), 2022. Published by Cambridge University Press. This is an Open Access article, distributed under the terms
of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted
re-use, distribution and reproduction, provided the original article is properly cited.

https://doi.org/10.1017/S0272263122000304 Published online by Cambridge University Press


Dependency Processing in a Naturalistic Exposure Context 587

processing of indirect-object dependencies among L2 learners drawn from a context in
which naturalistic L2 exposure is extensive and typically begins at an early age. To this
end, the study replicates Felser and Roberts (2007), which examined indirect-object
dependency processing among classroom learners in a foreign-language context.

Background
The focus of this study is on sentences such as (1), taken from Roberts et al. (2007,
p. 185), where the peacock originates structurally after the direct object the nice birthday
present but appears earlier in the sentence.
(1) John saw the peacock to which the small penguin gave the nice birthday present __
in the garden last weekend.

Processing a sentence such as (1) poses a challenge because after the moved element
(i.e., the filler) the peacock is encountered, it must be retained in short-term memory
until it can be linked to the element that licenses it, in a process termed “filler
integration.” A number of online studies of filler-gap dependency processing have
observed a tendency, among L1 speakers, to reactivate the filler at clause boundaries
and at its original position (Chow & Zhou, 2019; Fernandez et al., 2018; Gibson
& Warren, 2004; Nicol, 1993; Nicol & Swinney, 1989; although see Roberts et al.,
2007 and Miller, 2014 for contrary findings). This processing pattern is in line with a
Chomskyan account of movement (Chomsky, 1986), in which the filler moves through
these positions on its way to its surface structure destination, leaving behind a silent
copy of itself—a “trace”—at each position.
Roberts et al. (2007) used a cross-modal picture priming task to investigate
whether L1 speakers—children aged 5 to 7 years and adults—reactivated the moved
element at its original position, termed the “gap” position. While listening to
sentences such as (1), participants were shown pictures that were either identical
or unrelated to the entity denoted by the filler at either the gap position or a control
position 500 milliseconds earlier in the sentence. They then had to decide whether the
depicted entity was alive or not alive, with their reaction times (RTs) to this decision
serving as the dependent variable. Both adults and children with relatively high
working-memory capacity, as measured in adults by a reading span task and in
children by a listening span task, showed reduced RTs to identical versus unrelated
targets at the gap position. RTs to identical targets were also lower at this position
than at the control position. This finding suggests that the moved element was
reactivated at the gap position, thus facilitating responses in the decision task.
Participants with relatively lower working-memory scores, however, showed no
difference in RTs to identical versus unrelated targets (adults) or a disadvantage for
identical versus unrelated targets (children) at the gap position.
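The logic of these RT comparisons can be made concrete with hypothetical mean RTs (not the study's data): reactivation at the gap predicts a relatedness advantage (unrelated minus identical) at the gap position but little or none at the control position.

```python
# Sketch of the RT logic described above, with hypothetical mean RTs in ms.
# Reactivation at the gap predicts (a) unrelated > identical at the gap and
# (b) identical RTs lower at the gap than at the control position.
rt = {
    ("gap", "identical"): 640,
    ("gap", "unrelated"): 700,
    ("control", "identical"): 690,
    ("control", "unrelated"): 695,
}

priming_gap = rt[("gap", "unrelated")] - rt[("gap", "identical")]              # 60
priming_control = rt[("control", "unrelated")] - rt[("control", "identical")]  # 5
interaction = priming_gap - priming_control  # position-by-relatedness effect: 55
print(priming_gap, priming_control, interaction)  # 60 5 55
```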
Felser and Roberts (2007) used the same task and materials employed by Roberts
et al. (2007) to investigate the L2 processing of indirect-object dependencies among L1
Greek speakers. These participants had first been exposed to the L2 between the ages of
6 and 11 in a classroom setting and did not consider themselves bilingual; further, they
had been living in the United Kingdom for an average of 2.9 years at the time of testing
(Felser & Roberts, 2007, p. 18). The results for the L2 speakers differed from those
obtained for the L1 speakers in Roberts et al. (2007) in two respects: first, there were no
working memory effects on processing behavior among the L2 speakers; and second,
they showed an advantage for identical versus unrelated targets at both sentence



588 Berghoff

positions. The latter result suggests that instead of selectively reactivating the moved
element at the gap position, the participants may have actively maintained the moved
element in working memory during processing, which facilitated their responses at
both the gap and the control locations. Importantly, as indirect-object dependencies are
formed in essentially the same way in Greek and English, the results do not suggest a
transfer of L1 processing strategies to the L2.
Miller (2014, 2015) investigated whether the reduced automaticity of L2 compared
to L1 processing might inhibit trace reactivation in L2 speakers. From this perspective,
delays in L2 lexical access lead to a delay in the construction of syntactic representa-
tions, and it is this delay in L2 processing that precludes the observation of a trace
reactivation effect. As such, this account predicts that if experimental stimuli are
designed in such a way that L2 lexical access is facilitated, L2 speakers will show
sensitivity to movement traces during real-time processing.
In Miller (2014, 2015), fillers were denoted by L1–L2 cognates (e.g., English–French
gorilla–gorille), with the rationale that the facilitative effect of cognates on lexical access
(see e.g., Costa et al., 2000) would mitigate the potential confounding effects of reduced
L2 processing automaticity. In line with this prediction, Miller’s (2014) intermediate L2
learners showed RT patterns consistent with trace reactivation at the gap position.
Miller (2015) obtained similar results with indirect-object cleft sentences in which the
filler crossed a clause boundary, where a subset of learners showed evidence of filler
reactivation at both the clause boundary and the gap position.
Miller’s (2014, 2015) findings are suggestive of a role for processing automaticity in
facilitating the construction of fully specified syntactic representations. In turn, they
predict that the construction of such representations should also be more likely given the
presence of individual characteristics associated with greater processing automaticity.
One such characteristic is L2 exposure, which has been proposed to exert a practice effect
on the L2 system, leading L2 processing to become more proceduralized (e.g., Ullman,
2001). Indeed, a few studies have observed differences in L2 processing across L2 learners
with classroom L2 exposure and naturalistic L2 exposure (e.g., Dussias & Sagarra, 2007).
Regarding the processing of movement dependencies specifically, Pliatsikas and Marinis
(2013; see also Pliatsikas et al., 2017), in their study of long-distance wh-dependency
processing, observed trace reactivation at the clause boundary among L2 learners with
naturalistic L2 exposure (an average of 9 years), but not among L2 learners whose
exposure was limited to the classroom. Some accounts of L2 processing—for example,
the Shallow Structure Hypothesis (Clahsen & Felser, 2006a, 2006b, 2018)—additionally
attribute a central role to age of L2 acquisition (AoA) in increasing sensitivity to
morphosyntactic information during L2 processing. There is variation in the literature
regarding the timing of the so-called sensitive period for grammar, with some studies
reporting an offset at around age six (Long, 1990) and others only at the end of
adolescence (Hartshorne et al., 2018; Johnson & Newport, 1989). Here, too, though, type
of exposure is crucial: Research has established that AoA is less relevant for L2 outcomes
in instructed L2 settings in which L2 exposure is limited (Muñoz, 2006).
This article reports on a close replication of Felser and Roberts (2007) conducted in
South Africa with L1 Afrikaans–L2 English speakers with AoAs ranging from 1–14
(mean 5.3 years). We refer to these as “early” L2 learners because the maximum AoA
still falls within the upper bound of the proposed sensitive period for grammar. While
South Africa has 11 official languages, English is a prominent societal language (Posel &
Zeller, 2016). Exposure to English often commences before it is formally introduced as a
school subject and is not limited to the classroom context, with studies indicating that
Ln speakers of English use this language extensively with both family and friends


(Berghoff, 2021; Coetzee van Rooy, 2013). At the same time, however, L2 English
speakers are not immersed in the L2 in South Africa, and the L1 is typically maintained
alongside English (Berghoff, 2021; Coetzee van Rooy, 2012; Posel et al., 2020). The
consequences of such societally multilingual settings for language processing remain
poorly understood. This study aims to extend our knowledge in this domain by
investigating whether L2 learners of this background show evidence of trace reactiva-
tion at the gap position during indirect-object dependency processing.

Method
Participants
The study’s participants were 22 L1 Afrikaans–L2 English speakers1 (mean age
20.75 years, standard deviation [SD] 1.06 years, range 19–23 years) who were students
at a university in South Africa. All had normal or corrected-to-normal vision. The study
was approved by the university’s research ethics committee (project number 0382) and
informed consent was obtained from all participants prior to the beginning of the
experiment. Participants received course credit for their participation.
Language background information was obtained using the Language Experience
and Proficiency Questionnaire (LEAP-Q; Marian et al., 2007). Participants’ English
proficiency was assessed using a C-test consisting of three short texts, each of which
contained 20 incomplete words with the first half of their letters provided. The
participants’ scores were comparable to those obtained from a sample of 53 L1 English
speakers who were students at the same university (mean 76.92%, SD 11.6%).2
One participant who indicated their age of first exposure to English as 0 years was
removed from further analyses. The characteristics of the remaining participants are
summarized in Table 1.
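As an illustration of the C-test format described above, in which incomplete words are shown with the first half of their letters provided (a sketch, not the study's materials; the rounding convention for odd-length words is an assumption):

```python
# Sketch (not the study's materials): a C-test item keeps the first half of a
# word's letters and blanks the rest. Rounding up for odd lengths is assumed.
def c_test_item(word: str) -> str:
    keep = (len(word) + 1) // 2  # first half, rounding up for odd lengths
    return word[:keep] + "_" * (len(word) - keep)

print(c_test_item("language"))  # lang____
print(c_test_item("speaker"))   # spea___
```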
Working memory was assessed using a computerized reading span task (Stone &
Towse, 2015; von Bastian et al., 2013). In the task, participants were presented with a set

Table 1. Participant characteristics

                                               Range          Mean    SD
Age of L2 acquisition (years)                  1–14           5.27    3.01
Length of L2 exposure (years)                  6–21           15.5    3.52
C-test score (%)                               36.67–91.67    71.15   13.67
Current average global L2 exposure (%)a        2b–70          42.88   16.29
L2 speaking ability (self-rated)c              3–10           7.40    1.84
L2 spoken comprehension ability (self-rated)   3–10           8.44    1.76
L2 reading ability (self-rated)                2–10           8.11    1.97

a The relevant question in the LEAP-Q here is “Please list what percentage of the time you are currently and on average exposed to each language” (italics in original).
b The lowest values for L2 Exposure and the three self-rated variables all come from one participant. This participant obtained a C-test score of 80%, suggesting that their self-ratings were not reliable. The value of 2 for L2 Exposure is also implausible, given the study’s context. Due to the small sample, however, we did not wish to exclude this participant’s data.
c Self-ratings are on a scale of 0 to 10, with 0 indicating “none” and 10 indicating “perfect.”

1 This sample size is identical to the final sample size used in Felser and Roberts (2007). As a reviewer points out, the sample is relatively small, which can cause issues in frequentist analysis due to low statistical power. Bayesian techniques like those adopted in this article have been argued to be better suited to analyzing small-sample data (e.g., Baldwin & Fellingham, 2013).
2 Felser and Roberts (2007) used the Oxford Placement Test (OPT) to assess their participants’ English proficiency. We used a C-test instead due to concerns about potential ceiling effects among our participants.


of sentences and had to judge each sentence as either “makes sense” or “does not make
sense.” Each sentence was followed by a number that had to be remembered until the
end of the set of sentences, at which point the participant had to provide all of the
numbers they had seen in that set in order of appearance. The number of sentences in a
set ranged from two to five, and scoring was done based on the proportion of numbers
the participant recalled correctly. The data from one participant who scored 0 on this
task was removed from further analyses. The mean proportion correct of the remaining
participants was 53.97% (SD 13.9%, range 26–76%).
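The scoring rule described above might be implemented as follows; strict positional credit is an assumption here, as the exact credit rule is not specified in the text.

```python
# Sketch (assumed scoring detail): proportion of numbers recalled correctly,
# credited position-by-position against the presented order.
def span_score(presented: list[int], recalled: list[int]) -> float:
    correct = sum(p == r for p, r in zip(presented, recalled))
    return correct / len(presented)

# One set of four sentences -> four numbers to recall in order:
print(span_score([3, 9, 4, 7], [3, 9, 7, 4]))  # 0.5
```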

Materials
The task involved 20 experimental sentences, which were identical to those used in
Roberts et al. (2007) and Felser and Roberts (2007). As in these studies, the task also
contained 60 filler sentences similar in length to the experimental sentences, 12 of
which were similar in structure to the experimental sentences, but where the visual
target was displayed at a position other than the two critical test points.
The 80 sentences were recorded by a female L1 English speaker using Audacity
(Audacity Team, 2019). All but two of the target pictures were obtained from
Snodgrass and Vanderwart’s (1980) dataset.3 Each experimental sentence was paired
with a visual target that was either identical to the referent of the indirect object or
unrelated.
In each experimental sentence, the visual target (identical or unrelated) appeared at
one of two critical points: the offset of the direct object noun phrase (i.e., the gap
position) or a pregap control position 500 milliseconds prior to this offset. This yielded
four experimental conditions, illustrated in (2) (Felser & Roberts, 2007, p. 20). It is
noted that, like English, Afrikaans is also a wh-movement language in which the
indirect object canonically follows the direct object (de Stadler, 1995).
(2) Fred chased the squirrel to which the nice monkey explained …
a. Identical, gap position:
…the game’s difficult rules [SQUIRREL] in the class last Wednesday.
b. Identical, pregap position:
…the game’s [SQUIRREL] difficult rules in the class last Wednesday.
c. Unrelated, gap position:
…the game’s difficult rules [TOOTHBRUSH] in the class last Wednesday.
d. Unrelated, pregap position:
…the game’s [TOOTHBRUSH] difficult rules in the class last Wednesday.

The experimental items were divided across four presentation lists, so that each
participant saw only one version of each experimental sentence. The 20 experimental
items in each list were combined with the 60 fillers and pseudorandomized.

Procedure
The cross-modal picture priming task was designed and administered in PsychoPy
(Peirce et al., 2019). Participants performed the task on a laptop with a 15-inch

3. The two exceptions were “hippopotamus” and “panda.” For these, black-and-white line drawings in the
style of the Snodgrass and Vanderwart (1980) pictures were used.

https://doi.org/10.1017/S0272263122000304 Published online by Cambridge University Press


Dependency Processing in a Naturalistic Exposure Context 591

screen (resolution: 1366 × 768). At the beginning of the session, the experiment
administrator told the participant to listen carefully to the prerecorded sentences,
which were presented over headphones, and watch the screen for a picture of an
animal or an object that would be displayed at an undetermined point during the
sentence. They were instructed further that when a picture appeared, they had to
decide as quickly as possible whether the animal/object was alive or not alive and
indicate their choice by pressing either the green (“yes”) or red (“no”) key on the
keyboard. As in Felser and Roberts (2007), the task also included 38 comprehension
questions, which were distributed across the experiment and auditorily presented.
The experiment was preceded by a short practice round to allow participants to
familiarize themselves with the procedure. The task included four self-timed breaks
and on average took around 30 minutes to complete. After the completion of the
experiment, participants completed the working memory task, the LEAP-Q, and the
C-test.

Analysis
RTs were analyzed using Bayesian regression. A key advantage of the Bayesian
approach (see e.g., Norouzian et al., 2019) is that it allows for the strength of evidence
both for and against the null hypothesis to be evaluated. In contrast, the conventional
null hypothesis significance testing approach does not provide evidence in favor of the
null hypothesis, as the failure to obtain a significant effect may be due to, for example, a
lack of statistical power, rather than the nonexistence of the effect. Another benefit
offered by this approach is the ability to specify, by means of so-called priors, the
expected direction and magnitude of an effect based on extant research findings or
expert opinion. Here, we use Felser and Roberts’s (2007) results as a basis for the
specification of informative priors for the effects of target type, sentence position, and
their interaction. Details of the prior specification are provided in Appendix A. Because
Felser and Roberts (2007) report no effect of working memory, we use noninformative
priors for this term; noninformative priors were also used for the standard deviations.
For robustness, we also reran the models using informative priors based on
the results of Roberts et al. (2007); the Bayes factors remained robust. These results are
available upon request.
All models were fit with four chains, each of which contained 10,000 samples
following a warmup of 2,000 samples. For each model parameter, we report the
parameter estimate b; the 95% credible interval, or the range within which b can be
taken to fall with 95% certainty; and the evidence ratio P(b). Following Jeffreys (1998),
we consider an evidence ratio of 0.3 or smaller as substantial evidence for the absence of
an effect and an evidence ratio of 3 or greater as substantial evidence for the presence
thereof.
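The evidence-ratio thresholds adopted here can be restated as a simple decision rule. This is an illustrative paraphrase of the criteria in the text, not code from the study.

```python
def classify_evidence(ratio):
    """Interpret an evidence ratio using the thresholds adopted in the
    text (following Jeffreys, 1998): <= 0.3 counts as substantial
    evidence for the absence of an effect, >= 3 as substantial evidence
    for its presence, and anything in between as inconclusive."""
    if ratio <= 0.3:
        return "substantial evidence for absence"
    if ratio >= 3:
        return "substantial evidence for presence"
    return "inconclusive"
```

For example, the Target Type evidence ratio of 10.39 reported below clears the presence threshold, whereas a ratio of 1 is inconclusive.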

Results
Accuracy
Accuracy scores were 80.13% (SD 7.4%, range 67.6–94.6%) on the end-of-trial com-
prehension questions and 96.3% (SD 3.7%, range 84.2–100%) on the aliveness decision
task. These results are comparable to those of Felser and Roberts (2007) and Roberts
et al. (2007).

Table 2. Mean RTs (SD) to visual targets per condition

                   Control position    Trace position
Unrelated target   882 (257)           867 (196)
Identical target   838 (222)           815 (194)

Table 3. Model results: RTs to visual targets

                                                 Estimate   Est. Error   CI L     CI U
Intercept                                        –0.20      0.04         –0.28    –0.12
Target Type                                      –0.04      0.03         –0.10     0.02
Position                                         –0.02      0.04         –0.09     0.06
Working Memory Score                             –0.03      0.04         –0.11     0.06
Target Type × Position                           –0.04      0.06         –0.17     0.09
Working Memory Score × Target Type                0.04      0.03         –0.01     0.09
Working Memory Score × Position                   0.02      0.03         –0.04     0.07
Working Memory Score × Target Type × Position    –0.05      0.05         –0.14     0.04

Note: Estimate = parameter estimate; Est. Error = standard error; CI L = lower end of the 95% credible interval; CI U = upper
end of the 95% credible interval. Parameter estimates in bold are effects that are reliably present (Bayes factor ≥ 3).

Reaction Times
In line with previous studies (Felser & Roberts, 2007; Roberts et al., 2007), only trials in
which the aliveness decision was correct were analyzed, which led to the removal of
3.7% of the data. No RTs on this task exceeded 2,000 milliseconds, nor were there any
individual outliers exceeding two SDs from each participant’s mean per condition; thus,
no additional data points were omitted.
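As a sketch, the screening steps just described might look as follows. The field names and data layout are hypothetical (the published analysis was run in R), but the filters mirror the text: trials with incorrect aliveness decisions removed, RTs over 2,000 ms removed, and RTs more than two SDs from each participant's per-condition mean removed.

```python
import statistics

def trim_rts(trials):
    """trials: list of dicts with keys 'participant', 'condition',
    'rt' (ms), and 'correct' (bool). Returns the retained trials."""
    # Step 1: keep only correct responses with RTs at or below 2,000 ms
    kept = [t for t in trials if t["correct"] and t["rt"] <= 2000]
    retained = []
    # Step 2: within each participant-by-condition cell, drop RTs more
    # than 2 SDs from the cell mean
    cells = {(t["participant"], t["condition"]) for t in kept}
    for cell in cells:
        cell_trials = [t for t in kept
                       if (t["participant"], t["condition"]) == cell]
        rts = [t["rt"] for t in cell_trials]
        mean = statistics.mean(rts)
        sd = statistics.pstdev(rts)
        retained.extend(t for t in cell_trials
                        if sd == 0 or abs(t["rt"] - mean) <= 2 * sd)
    return retained
```

Whether the outlier cutoff uses the population or sample SD is not specified in the text; the population SD is assumed here.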
Table 2 provides the means and SDs of the participants’ RTs per condition. As is
evident, RTs to identical targets were shorter than those for unrelated targets at both the
control and trace positions, but the advantage for identical targets was slightly larger at the
trace position (52 vs. 44 ms).
Log-transformed RTs were analyzed using a Bayesian linear mixed regression model
fit with the brms package (version 2.16.3, Bürkner, 2017) in the R environment for
statistical computing (version 4.1.2, R Core Team, 2021). The model included Position
(Control or Trace, sum contrast coded as –0.5 and 0.5), Target Type (Unrelated or
Identical, sum contrast coded as –0.5 and 0.5), and Working Memory Score (scaled and
centered around the mean) as fixed effects, as well as the interaction between Working
Memory Score, Target Type, and Position. Model comparisons indicated that adding
C-test score, L2 Exposure, and AoA did not improve the fit of the model; therefore, no
additional predictors were included. The random effects structure included random
intercepts for participants and items and by-participants and by-items random slopes
for Position, Target Type, and their interaction. Model results are provided in Table 3.
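The predictor coding described above (sum contrasts at ±0.5 and a scaled, centered working memory score) can be sketched as follows. The model itself was fit with brms in R, so this Python fragment is purely illustrative.

```python
import statistics

def code_predictors(position, target_type, wm_scores):
    """Sum-code the two-level factors as -0.5/0.5 and z-transform the
    working memory scores (centered on the mean, scaled by the SD)."""
    pos = [0.5 if p == "Trace" else -0.5 for p in position]
    tt = [0.5 if t == "Identical" else -0.5 for t in target_type]
    mean = statistics.mean(wm_scores)
    sd = statistics.stdev(wm_scores)  # sample SD assumed
    wm = [(x - mean) / sd for x in wm_scores]
    return pos, tt, wm
```

Sum coding at ±0.5 has the convenient property that each main effect estimate equals the difference between the two condition means.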
Bayes factors indicating the extent of support for the existence of an effect in the
direction specified in the model output were calculated using the “hypothesis” function
from the brms package; in each case, the Bayes factor indicates the ratio of the
hypothesis (e.g., b > 0) to its complement (e.g., b < 0; see Winter & Bürkner, 2021).
The estimates of robust effects (Bayes factor ≥ 3) are indicated in bold.
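The evidence ratio for a directional hypothesis is, in effect, the proportion of posterior draws consistent with the hypothesis divided by the proportion consistent with its complement. A minimal sketch of that computation (illustrative only; the study used brms's `hypothesis` function in R):

```python
def evidence_ratio(draws, direction="<"):
    """Evidence ratio for a directional hypothesis about a parameter
    (e.g., b < 0), computed from posterior draws as
    P(hypothesis) / P(complement)."""
    n = len(draws)
    if direction == "<":
        p = sum(1 for d in draws if d < 0) / n
    else:
        p = sum(1 for d in draws if d > 0) / n
    return p / (1 - p)
```

For instance, if three quarters of the posterior draws for a coefficient fall below zero, the evidence ratio for b < 0 is 3, the presence threshold adopted in the text.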
There was a reliable effect of Target Type, which indicates that RTs were faster for
identical versus unrelated targets (P(b < 0) = 10.39). There was also a reliable effect of
Working Memory Score, such that participants with higher working memory had lower
RTs overall (P(b < 0) = 3.13). In addition, there were reliable interactions between
Working Memory Score and Target Type (P(b > 0) = 15.1) and between Working
Memory Score, Target Type, and Position (P(b < 0) = 6.6). The former effect indicates
that participants with higher working-memory scores showed less of an RT advantage
for identical compared to unrelated target pictures; the latter effect indicates that
participants with higher working-memory scores showed a larger RT advantage for
identical pictures at the gap compared to the control position.

Table 4. Mean RTs (SDs) to visual targets in low-span and high-span participants

                   Low-span participants                High-span participants
                   Control position  Trace position     Control position  Trace position
Unrelated target   904 (237)         879 (190)          857 (273)         864 (238)
Identical target   821 (203)         816 (194)          843 (232)         816 (192)

Table 5. Model results for low-span and high-span participants

                        Low-span participants               High-span participants
                        Est.    Est. Error  CI L    CI U    Est.    Est. Error  CI L    CI U
Intercept               –0.17   0.05        –0.28   –0.07   –0.21   0.07        –0.35   –0.08
Target Type             –0.09   0.04        –0.16   –0.01   –0.01   0.05        –0.10    0.08
Position                –0.02   0.05        –0.12    0.07   –0.01   0.04        –0.10    0.07
Target Type × Position   0.00   0.08        –0.17    0.15   –0.06   0.09        –0.22    0.12
Given the interactions between Working Memory Score and the factors of interest,
we split participants into two groups based on the median working memory score
(55.7%). This yielded two groups of 10 participants each. Importantly, these groups did
not differ significantly in terms of either AoA or L2 exposure (ps > .2). Table 4 shows the
mean RTs (SDs) per working memory group in the four conditions.
We then analyzed RTs in the low-span and high-span participants separately, again
using Bayesian linear mixed regression models with Position and Target Type as fixed
effects and the same maximal random effects structure reported in the preceding text.
As in the main analysis, informative priors based on Felser and Roberts’s (2007) results
were used for the effects of Target Type, Position, and their interaction. Model results
are provided in Table 5, with the estimates of effects that are reliably present marked in
bold. Figures 1 and 2 illustrate the posterior distributions of the model parameters
(i.e., estimates of the distributions that take the new data into account) for the low- and
high-span groups, respectively.
Table 5 indicates that for the low-span participants, the only reliable effect was an RT
advantage for identical compared to unrelated targets (P(b < 0) = 69.95). The data are
inconclusive regarding a potential advantage for identical targets at the gap relative to
the control position (Target Type × Position: P(b < 0) = 1). For the high-span
participants, the only reliably present effect was the Target Type × Position
interaction (P(b < 0) = 3.1).


Figure 1. Posterior distributions: Low-span group.

Figure 2. Posterior distributions: High-span group.

Discussion and Conclusion


This article’s aim was to extend previous research on indirect-object dependency
resolution to a group of L2 speakers that is understudied in the L2 processing literature,
namely early L2 acquirers with extensive (though nonimmersive) naturalistic L2
exposure. This focus was motivated by accounts of L2 processing that predict greater
processing automaticity among L2 learners of this profile (e.g., Clahsen & Felser, 2006a,
2006b, 2018; Ullman, 2001), as well as previous studies that have found increased
sensitivity to abstract syntactic structure among learners in naturalistic exposure
environments (e.g., Pliatsikas & Marinis, 2013). We conducted a close replication of
Felser and Roberts (2007). In contrast to these authors, but like Roberts et al. (2007), we
observed a working memory effect on our participants’ response patterns. Follow-up
analyses indicated that while low-working-memory participants responded more
quickly to identical targets at both the gap and the earlier control position, high-
working-memory participants’ RTs to identical targets were lower at the gap than the
control position.
The low-working-memory participants’ processing pattern, which mirrors that of
Felser and Roberts’s (2007) participants, would be consistent with a strategy in which
the filler was actively maintained in working memory, leading to lower RTs at both test


positions. However, a caveat arises from the affordances of the Bayesian analysis: the
data provide evidence neither for nor against a position-specific RT advantage for
identical targets in the low-working-memory group, as the Bayes factor was inconclusive
at 1. We therefore cannot comment further on whether trace reactivation occurred in
this group.
There does, however, appear to be a difference in processing pattern between our
low-span L2 group and the low-span L1 group in Roberts et al. (2007), who did not
show an advantage for identical targets at either position. This difference suggests that
even among individuals who share relatively lower working-memory capacity, L1 and
L2 processing of movement dependencies may differ. The divergence here may be
attributable to different allocations of cognitive resources during processing: For
example, Williams (2006) found that L2 speakers with relatively low working-memory
capacity, unlike L1 speakers, seemed not to process input incrementally when they also
had to perform a memory task, suggesting that the L2 speakers had directed their
cognitive resources toward the memory task.
Our high-span group’s processing pattern is compatible with a strategy in which the
filler is selectively reactivated at the gap position. This finding is in line with the
proposal that when a filler is encountered, the parser predicts an upcoming syntactic
gap, and retrieval of the filler from memory is triggered when such a gap is reached (e.g.,
Frazier, 1987). In this respect, our high-working-memory participants showed the same
processing pattern as the high-working-memory L1 groups (both adults and children)
in Roberts et al. (2007). In turn, this finding aligns with the results of Miller (2014,
2015), in that it shows that L2 learners can make use of abstract syntactic structure
during real-time processing. In Miller’s (2014, 2015) studies, however, it was a task
characteristic, specifically the use of cognates as visual targets, that seemed to facilitate
sensitivity to the gap. The present results, like those of Pliatsikas and Marinis (2013),
provide an indication that this sensitivity can arise in the absence of targeted attempts to
elicit it. Considering, however, that our low- and high-span groups did not differ in
terms of AoA or L2 exposure, we cannot say that either of these factors is decisive in
engendering trace reactivation. Ultimately, working memory capacity seemed to be the
deciding factor in this regard.
Our observation of a working memory effect bears on another important question
in SLA, namely whether individual cognitive differences are equally relevant to L2
outcomes across early and late L2 learners. Theories of SLA and L2 processing in
which AoA plays a central role (e.g., Clahsen & Felser, 2006a, 2006b, 2018) typically
do not discuss the potential effects of individual differences on early learners’ L2
attainment, with the implicit assumption being that an early start to learning and
sufficient exposure should together ensure acquisition success. However, some
studies have observed effects of, for example, language aptitude on L2 outcomes
among early learners (Abrahamsson & Hyltenstam, 2008; Granena, 2014). Our
results align with these findings and highlight the complex, multifactorial nature of
early L2 acquisition (cf. Granena, 2014). Future research might aim to shed additional
light on the interplay between environmental and individual-level variables among
early L2 learners, particularly with respect to the parsing of complex syntactic
structures.
Acknowledgments. I would like to thank Emanuel Bylund as well as the journal editors and two
anonymous reviewers for their constructive feedback on the manuscript.

Data Availability Statement. The experiment in this article earned an Open Data badge for transparent
practices. The materials are available at https://doi.org/10.7910/DVN/SGFGKO.


References
Abrahamsson, N., & Hyltenstam, K. (2008). The robustness of aptitude effects in near-native second language
acquisition. Studies in Second Language Acquisition, 30, 481–509. https://doi.org/10.1017/S027226310808073X
Audacity Team. (2019). Audacity: Free audio editor and recorder. https://audacityteam.org/
Baldwin, S. A., & Fellingham, G. W. (2013). Bayesian methods for the analysis of small sample multilevel data
with a complex variance structure. Psychological Methods, 18, 151.
Berghoff, R. (2020). L2 processing of filler-gap dependencies: Attenuated effects of naturalistic L2 exposure in
a multilingual setting. Second Language Research, 25, 026765832094575. https://doi.org/10.1177/
0267658320945757
Berghoff, R. (2021). The role of English in South African multilinguals’ linguistic repertoires: A cluster-
analytic study. Journal of Multilingual and Multicultural Development, 12, 1–15. https://doi.org/10.1080/
01434632.2021.1941066
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical
Software, 80. https://doi.org/10.18637/jss.v080.i01
Chomsky, N. (1986). Barriers. MIT Press.
Chow, W.-Y., & Zhou, Y. (2019). Eye-tracking evidence for active gap-filling regardless of dependency length.
Quarterly Journal of Experimental Psychology, 72, 1297–1307. https://doi.org/10.1177/1747021818804988
Clahsen, H., & Felser, C. (2006a). Grammatical processing in language learners. Applied Psycholinguistics, 27,
3–42. https://doi.org/10.1017/S0142716406060024
Clahsen, H., & Felser, C. (2006b). How native-like is non-native language processing? Trends in Cognitive
Sciences, 10, 564–570. https://doi.org/10.1016/j.tics.2006.10.002
Clahsen, H., & Felser, C. (2018). Some notes on the Shallow Structure Hypothesis. Studies in Second Language
Acquisition, 24, 1–14. https://doi.org/10.1017/S0272263117000250
Coetzee-Van Rooy, S. (2012). Flourishing functional multilingualism: Evidence from language repertoires in
the Vaal Triangle region. International Journal of the Sociology of Language, 2012, 87–119. https://doi.org/
10.1515/ijsl-2012-0060
Coetzee-Van Rooy, S. (2013). Afrikaans in contact with English: Endangered language or case of exceptional
bilingualism? International Journal of the Sociology of Language, 2013, 179–207.
Costa, A., Caramazza, A., & Sebastian-Galles, N. (2000). The cognate facilitation effect: Implications for
models of lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1283–
1296. https://doi.org/10.1037/0278-7393.26.5.1283
de Stadler, L. (1995). The indirect object in Afrikaans. South African Journal of Linguistics, 13, 26–38. https://
doi.org/10.1080/10118063.1995.9723972
Dussias, P. E., & Sagarra, N. (2007). The effect of exposure on syntactic parsing in Spanish–English bilinguals.
Bilingualism: Language and Cognition, 10, 101. https://doi.org/10.1017/S1366728906002847
Felser, C., & Roberts, L. (2007). Processing wh-dependencies in a second language: A cross-modal priming
study. Second Language Research, 23, 9–36. https://doi.org/10.1177/0267658307071600
Fernandez, L., Höhle, B., Brock, J., & Nickels, L. (2018). Investigating auditory processing of syntactic gaps
with L2 speakers using pupillometry. Second Language Research, 34, 201–227. https://doi.org/10.1177/
0267658317722386
Frazier, L. (1987). Syntactic processing: Evidence from Dutch. Natural Language & Linguistic Theory, 5, 519–
559.
Gibson, E., & Warren, T. (2004). Reading-time evidence for intermediate linguistic structure in long-distance
dependencies. Syntax, 7, 55–78. https://doi.org/10.1111/j.1368-0005.2004.00065.x
Granena, G. (2014). Language aptitude and long-term achievement in early childhood L2 learners. Applied
Linguistics, 35, 483–503. https://doi.org/10.1093/applin/amu013
Hartshorne, J. K., Tenenbaum, J. B., & Pinker, S. (2018). A critical period for second language acquisition:
Evidence from 2/3 million English speakers. Cognition, 177, 263–277. https://doi.org/10.1016/j.
cognition.2018.04.007
Jeffreys, H. (1998). Theory of probability. Clarendon Press.
Johnson, J. S., & Newport, E. L. (1989). Critical period effects in second language learning: The influence of
maturational state on the acquisition of English as a second language. Cognitive Psychology, 21, 60–99.
https://doi.org/10.1016/0010-0285(89)90003-0

Long, M. H. (1990). Maturational constraints on language development. Studies in Second Language
Acquisition, 12, 251–285. https://doi.org/10.1017/s0272263100009165
Marian, V., Blumenfeld, H. K., & Kaushanskaya, M. (2007). The Language Experience and Proficiency
Questionnaire (LEAP-Q): Assessing language profiles in bilinguals and multilinguals. Journal of Speech
Language and Hearing Research, 50, 940. https://doi.org/10.1044/1092-4388(2007/067)
Marinis, T., Roberts, L., Felser, C., & Clahsen, H. (2005). Gaps in second language processing. Studies in
Second Language Acquisition, 27, 483. https://doi.org/10.1017/S0272263105050035
Miller, A. K. (2014). Accessing and maintaining referents in L2 processing of wh-dependencies. Linguistic
Approaches to Bilingualism, 4, 167–191. https://doi.org/10.1075/lab.4.2.02mil
Miller, A. K. (2015). Intermediate traces and intermediate learners. Studies in Second Language Acquisition,
37, 487–516. https://doi.org/10.1017/S0272263114000588
Muñoz, C. (2006). Age and the rate of foreign language learning. Multilingual Matters.
Nicol, J. L. (1993). Reconsidering reactivation. In G. Altmann & R. Shillcock (Eds.), Cognitive models of speech
processing (pp. 321–350). Erlbaum.
Nicol, J. L., & Swinney, D. (1989). The role of structure in coreference assignment during sentence
comprehension. Journal of Psycholinguistic Research, 18, 5–19.
Norouzian, R., Miranda, M. D. E., & Plonsky, L. (2019). A Bayesian approach to measuring evidence in L2
research: An empirical investigation. The Modern Language Journal, 25, 1. https://doi.org/10.1111/
modl.12543
Peirce, J., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. K.
(2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51, 195–203. https://
doi.org/10.3758/s13428-018-01193-y
Pliatsikas, C., Johnstone, T., & Marinis, T. (2017). An fMRI study on the processing of long-distance wh-
movement in a second language. Glossa: A Journal of General Linguistics, 2, 1052. https://doi.org/10.5334/
gjgl.95
Pliatsikas, C., & Marinis, T. (2013). Processing empty categories in a second language: When naturalistic
exposure fills the (intermediate) gap. Bilingualism: Language and Cognition, 16, 167–182. https://doi.org/
10.1017/S136672891200017X
Posel, D., & Zeller, J. (2016). Language shift or increased bilingualism in South Africa: Evidence from census
data. Journal of Multilingual and Multicultural Development, 37, 357–370. https://doi.org/10.1080/
01434632.2015.1072206
Posel, D., Hunter, M., & Rudwick, S. (2020). Revisiting the prevalence of English: Language use outside the
home in South Africa. Journal of Multilingual and Multicultural Development, 218, 1–13. https://doi.org/
10.1080/01434632.2020.1778707
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical
Computing. https://www.R-project.org/
Roberts, L., Marinis, T., Felser, C., & Clahsen, H. (2007). Antecedent priming at trace positions in children’s
sentence processing. Journal of Psycholinguistic Research, 36, 175–188. https://doi.org/10.1007/s10936-
006-9038-3
Snodgrass, J. G., & Vanderwart, M. (1980). A standardized set of 260 pictures: Norms for name agreement,
image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning
and Memory, 6, 174.
Stone, J. M., & Towse, J. N. (2015). A working memory test battery: Java-based collection of seven working
memory tasks. Journal of Open Research Software, 3, 92. https://doi.org/10.5334/jors.br
Ullman, M. T. (2001). The neural basis of lexicon and grammar in first and second language: The declarative/
procedural model. Bilingualism: Language and Cognition, 4, 105–122. https://doi.org/10.1017/
s1366728901000220
von Bastian, C. C., Locher, A., & Ruflin, M. (2013). Tatool: A Java-based open-source programming
framework for psychological studies. Behavior Research Methods, 45, 108–115. https://doi.org/10.3758/
s13428-012-0224-y
Williams, J. N. (2006). Incremental interpretation in second language sentence processing. Bilingualism:
Language and Cognition, 9, 71–88. https://doi.org/10.1017/S1366728905002385
Winter, B., & Bürkner, P.-C. (2021). Poisson regression for linguists: A tutorial introduction to modelling
count data with brms. Language and Linguistics Compass, 15, e12439.

Appendix A
Informative priors for the effects of Target Type, Position, and their interaction were based on the reaction
times in Felser and Roberts (2007). Each prior was normally distributed with a mean equal to the result
obtained in Felser and Roberts (2007) and a standard deviation equal to the mean. The prior distributions are
visualized in Figure A1.
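The prior specification described above amounts to a simple rule. In this sketch the numeric means are placeholders, not the published Felser and Roberts (2007) estimates, and taking the absolute value of a negative mean for the SD is an assumption (a normal distribution's SD must be positive).

```python
def make_prior(mu):
    """Normal prior with mean equal to a Felser and Roberts (2007)
    estimate and SD equal to (the absolute value of) that mean."""
    return {"dist": "normal", "mean": mu, "sd": abs(mu)}

# Placeholder means for the three model terms (hypothetical values):
priors = {term: make_prior(mu) for term, mu in
          [("TargetType", -0.05), ("Position", -0.02),
           ("TargetType:Position", -0.04)]}
```

Setting the prior SD equal to the prior mean keeps each prior centered on the earlier result while leaving it diffuse enough that zero remains plausible.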

Figure A1. Prior distributions.

Cite this article: Berghoff, R. (2023). Wh-Dependency Processing in a Naturalistic Exposure Context:
Sensitivity to Abstract Syntactic Structure in High-Working-Memory L2 Speakers. Studies in Second
Language Acquisition, 45, 586–598. https://doi.org/10.1017/S0272263122000304



Studies in Second Language Acquisition, Vol. 45, No. 2, May 2023

https://doi.org/10.1017/S0272263123000232 Published online by Cambridge University Press


STUDIES IN SECOND LANGUAGE ACQUISITION
Editor: Luke Plonsky (Northern Arizona University)
Associate Editors: Jill Jegerski (University of Illinois), Kevin McManus (Pennsylvania
State University), Andrea Révész (UCL Institute of Education), Kazuya Saito (UCL
Institute of Education), Stuart Webb (University of Western Ontario)
Editorial Assistants: Andrew Dennis (Northern Arizona University) and Lizz Huntley
(Michigan State University)
Founding Editor: Albert Valdman (Indiana University)
Former Editors: Susan Gass (Michigan State University), Bill VanPatten
(Michigan State University)
Editorial Board
Ali Al-Hoorie (Royal Commission for Jubail and Yanbu, Saudi Arabia), Frank Boers (University of Western
Ontario, Canada), Dustin Crowther (University of Hawai'i), Irina Elgort (Victoria University of Wellington,
New Zealand), Paola Escudero (Western Sydney University, Australia), Aline Godfroid (Michigan State
University), Tania Ionin (University of Illinois at Urbana-Champaign), Carrie Jackson (Pennsylvania
State University), Scott Jarvis (University of Utah), Okim Kang (Northern Arizona University),
Sara Kennedy (Concordia University, Canada), Ron Leow (Georgetown University), Shaofeng Li (Florida
State University), Peter MacIntyre (Cape Breton University, Canada), Alison J. Mackey (Georgetown
University), Kara Morgan-Short (University of Illinois at Chicago), Akira Murakami (University of
Birmingham, UK), Charles Nagle (Iowa State University), William O'Grady (University of Hawaii), Magali
Paquot (Université Catholique de Louvain, Belgium), Ana Pellicer-Sanchez (University College London,
UK), Elke Peters (KU Leuven, Belgium), Graeme Porte (University of Granada, Spain), Jason Rothman
(Arctic University of Norway, Norway & University of Nebrija, Spain), Cristina Sanz (Georgetown
University), Megan Solon (Indiana University), Patti Spinner (Michigan State University), Yuichi Suzuki
(Kanagawa University, Japan), Naoko Taguchi (Northern Arizona University), Brent Wolter (Idaho State
University), Stefanie Wulff (University of Florida)
EDITORIAL POLICY
Studies in Second Language Acquisition is a refereed journal of international scope devoted to the
scientific discussion of acquisition of the use of non-native and heritage languages. Each volume contains
five issues, one of which is devoted to a special topic in the field. The other four issues contain
research articles of either a quantitative or qualitative nature in addition to essays on current theo-
retical matters. Other rubrics include Replication Studies, the Methods Forum, and Research Reports.
TABLE OF CONTENTS

RESEARCH ARTICLES

Effects of distributed practice on the acquisition of verb-noun collocations
Satoshi Yamagata, Tatsuya Nakata, and James Rogers 291–317
A role for verb regularity in the L2 processing of the Spanish
subjunctive mood: Evidence from eye-tracking
Sara Fernández Cuenca and Jill Jegerski 318–347
The additive use of prosody and morphosyntax in L2 German
Nick Henry 348–369
“Bread and butter” or “butter and bread”? Nonnatives’
processing of novel lexical patterns in context
Suhad Sonbul, Dina Abdel Salam El-Dakhs, Kathy Conklin,
and Gareth Carrol 370–392
The elusive impact of L2 immersion on translation priming
Adel Chaouch-Orozco, Jorge González Alonso, Jon Andoni
Duñabeitia, and Jason Rothman 393–415
A closer look at a marginalized test method: Self-assessment as
a measure of speaking proficiency
Paula Winke, Xiaowan Zhang, and Steven J. Pierce 416–441
Explicit Instruction within a Task: Before, During, or After?
Gabriel Michaud and Ahlem Ammar 442–460
Sources and effects of foreign language enjoyment, anxiety,
and boredom: A structural equation modeling approach
Jean-Marc Dewaele, Elouise Botes, and Samuel Greiff 461–479
Second language productive knowledge of collocations: Does
knowledge of individual words matter?
Suhad Sonbul, Dina Abdel Salam El-Dakhs, and
Ahmed Masrai 480–502
A longitudinal study into learners’ productive collocation
knowledge in L2 German and factors affecting the learning
Griet Boone, Vanessa De Wilde, and June Eyckmans 503–525

METHODS FORUM

Network analysis for modeling complex systems in SLA research
Lani Freeborn, Sible Andringa, Gabriela Lunansky, and
Judith Rispens 526–557



RESEARCH REPORTS

The importance of psychological and social factors in adult SLA:
The case of productive collocation knowledge in L2 Swedish
of L1 French long-term residents
Fanny Forsberg Lundell, Klara Arvidsson, and
Andreas Jemstedt 558–570
Revisiting the moderating effect of speaker proficiency on the
relationships among intelligibility, comprehensibility, and
accentedness in L2 Spanish
Amanda Huensch and Charlie Nagle 571–585

REPLICATION STUDY

Wh-Dependency Processing in a Naturalistic Exposure
Context: Sensitivity to Abstract Syntactic Structure in
High-Working-Memory L2 Speakers
Robyn Berghoff 586–598



STUDIES IN SECOND LANGUAGE ACQUISITION
Information for Contributors
For guidelines and requirements regarding manuscript submission, please consult the
SSLA website at http://journals.cambridge.org/sla. Click on the Journal Information
tab which will lead you to Information for Contributors. Potential authors are advised
that all manuscripts are internally reviewed for both content and formatting/style in
order to determine their suitability for external evaluation.

Research Article. These manuscripts may be essays or empirical studies, either of which must be
motivated by current theoretical issues in second and subsequent language acquisition or heritage
language acquisition, including methodological issues in research design and issues related to the
context of learning. Maximum length is 11,000 words all-inclusive (i.e., abstract, text, tables, figures,
references, notes, and appendices intended for publication must all fall within the 11,000-word limit).

Research Report. These manuscripts are shorter empirical studies motivated by current theoretical
issues in second and subsequent language acquisition or heritage language acquisition, including
methodological issues in research design. Very often, these are narrowly focused studies or they
present part of the results of a larger project in progress. The background and motivation sections
are generally shorter than those of research articles. Maximum length is 6,000 words all-inclusive
(i.e., abstract, text, tables, figures, references, notes, and appendices intended for publication must
all fall within the 6,000-word limit).

Replication Study. These manuscripts are empirical studies that replicate the research design and
methods of a previously published study, with or without changes. The study selected for replication
should have impacted empirical and/or theoretical work relevant to SLA. Replications can be direct
(exact, close, approximate) or conceptual, and should identify and motivate the study selected for
replication as well as any changes made. The maximum length is 10,000 words all-inclusive (i.e.,
abstract, text, tables, figures, references, notes, and appendices intended for publication).

State-of-the-Scholarship Article. These manuscripts are essays that review the extant research on a
particular theme or theoretical issue, offering a summary of findings and making critical observations
on the research to date. Manuscripts in this category typically fall within the 10,000-word limit;
however, longer manuscripts may be considered on a case-by-case basis.

Critical Commentary. These manuscripts are shorter essays (i.e., non-empirical) motivated by current
theory and issues in second and subsequent language acquisition or heritage language acquisition,
including methodological issues in research design and issues related to the context of learning.
Maximum length is 6,000 words all-inclusive (i.e., abstract, text, tables, figures, references, notes,
and appendices intended for publication must all fall within the 6,000-word limit).

Methods Forum. Recognizing the need to discuss and advance SLA research methods, these
manuscripts seek to advance methodological understanding, training, and practices in the field.
Submissions can be conceptual or empirical; we also encourage articles introducing novel techniques.
All research paradigms, epistemologies, ontologies, and theoretical frameworks relevant to SLA are
welcome. The target length is up to 10,000 words, although longer manuscripts will be considered
with justification.

All manuscripts in all categories are peer reviewed and subject to the same high standards
for publication in SSLA.
https://doi.org/10.1017/S0272263123000244 Published online by Cambridge University Press
Cambridge Core
For further information about this journal please
go to the journal website at:
cambridge.org/sla
