System 121 (2024) 103224

Contents lists available at ScienceDirect

System
journal homepage: www.elsevier.com/locate/system

A comparative analysis of multiword units in the reading and listening input of English textbooks
Hien Hoang, Peter Crosthwaite *
School of Languages and Cultures, University of Queensland, Australia

Keywords: Multiword units; Textbook corpus; Written input; Oral input; Vietnamese EFL

Abstract

This study examines the occurrence and function of multiword units (MWUs) found in reading
and listening input within EFL textbooks commonly used at a Vietnamese university. Lists of
MWUs were automatically extracted from this input based on frequency, range, and mutual
information score criteria, followed by manual evaluation via a classification framework developed
by Simpson-Vlach and Ellis (2010). A final merged list of English textbook MWUs was used
to compare the frequency and functions of MWUs across reading and listening input. The findings
revealed significant differences in MWU occurrence and function between reading and listening
input, with MWUs occurring more frequently in oral input compared with written input. High
frequency spoken MWUs had higher frequency counts than their high frequency written
counterparts. A wider range of MWUs were produced in written input, with fewer MWUs being
repeatedly used. Reading input exhibited a notably lower presence of stance expressions when
contrasted with listening input, while special conversational functions featured more prominently
in listening input. The findings suggest EFL teachers and material developers should carefully
consider the presence and functions of MWUs found in textbooks, and whether textbook input
should be supplemented with other suitable input forms to help learners improve their
knowledge and use of MWUs.

1. Introduction

Similar to other East Asian nations, Vietnamese students typically commence English education early in school. The Vietnamese
Ministry of Education and Training (MOET) mandated English as a compulsory subject from Grade 3 in primary schools in selected
cities and provinces starting in 2008 where infrastructure allowed for implementation. Consequently, for nearly two decades English
has been the primary foreign language taught from third to twelfth grade in Vietnam. However, the English language teaching context
in Vietnam has been described as substandard (Nguyen & Nguyen, 2019), as demonstrated by Vu and Peters (2021), who found that in
the national high-school English exams of 2019, almost 70% of Vietnamese students scored below average. In recognition of this, MOET
initiated the National Foreign Language 2020 Project (MOET, 2008) aiming to reform the teaching and learning of English and other
foreign languages for Vietnamese graduates as they enter the workforce. Despite MOET’s ambitions, the project outcomes regarding
the English speaking and writing proficiencies of students have fallen short of expectations (Xuan Mai & Thanh Thao, 2022).

* Corresponding author.
E-mail address: p.cros@uq.edu.au (P. Crosthwaite).

https://doi.org/10.1016/j.system.2024.103224
Received 26 September 2023; Received in revised form 2 December 2023; Accepted 4 January 2024
Available online 9 January 2024
0346-251X/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
H. Hoang and P. Crosthwaite System 121 (2024) 103224

Research indicates a strong connection between the depth of formulaic language and overall proficiency of L2 learners. Keshavarz
and Salimi (2007) found a positive correlation between learners’ knowledge of collocations and their general proficiency, while
proficiency in MWUs is linked to higher competence in both spoken (Boers et al., 2006) and written tasks (Ohlrogge, 2009). A recent
study by Appel (2022) established a connection between proficiency and the use of 3-to-7-word lexical bundles in essays produced by
first-year English learners, with more proficient writers using these lexical bundles more frequently. Studies by Laufer and Waldman
(2011), Lewis (1993), and Siyanova-Chanturia and Pellicer-Sanchez (2019) also show higher rates of formulaic language use typify
higher proficiency levels. The evidence therefore suggests that language learners require a substantial MWU vocabulary to attain
advanced proficiency levels (Bui, 2016).
Within Vietnam, Nguyen and Webb (2017) found first-year English-major students exhibited insufficient knowledge of
adjective-noun and verb-noun collocations despite seven years of English learning, mastering less than half of these collocational
patterns overall. Conversely, their proficiency in single-word vocabulary was comparatively higher, reaching approximately 94% at
the 1000-word level, 75% at 2000 words, and 65% at 3000 words. The authors attributed this to insufficient emphasis on teaching
MWUs, inadequate recognition of the significance of MWUs for proficiency, and an underestimation of the challenge in acquiring
MWUs. Tran (2012) also highlighted a neglect of idioms in Vietnamese English teaching contexts and advocated increased emphasis
on collocations and idioms in teaching practices. This corroborates Bui (2016), who suggested the challenges Vietnamese EFL learners
encounter arise from insufficient exposure to MWUs and a lack of awareness regarding their importance given Vietnamese teachers and
learners tend to prioritize learning single words. Tran (2017) investigated form recall, meaning recognition, and meaning recall for 50
commonly used figurative idioms among 70 Vietnamese students at pre-intermediate and intermediate English proficiency levels in a
Vietnamese university. Experimental findings showed that most common idiomatic expressions were unfamiliar to students, while
focus group discussions revealed that a lack of emphasis on idioms in English classes, prioritization of single words over MWUs, and
inadequate repetition and practice hindered their idiom usage. Clearly, there is a need for improvement.
Another important factor is that in EFL settings like Vietnam, textbooks often serve as the primary source of language input,
establishing the cornerstone for language practice both in and out of the classroom (Nguyen, 2007). However, Vu and Peters (2021)
maintain most EFL textbooks used in this context provide insufficient focus on formulaic sequences (e.g., collocations and idioms),
focusing instead on single words. Given most Vietnamese learners’ exposure to English is distributed within a few hours of practice
across widely spaced sessions using said textbooks, such learners have great difficulty recalling previously encountered MWUs and
establishing associations between single words that lead to acquisition of MWUs (Serrano et al., 2015). Learners using such textbooks
also struggle to realise the register-appropriateness of MWUs, using MWUs found in spoken English in their writing, or vice-versa, and
lack knowledge of the pragmatic functions of MWUs (e.g. referential, special conversational function) when using them (Granger,
2017, 2019; Huang, 2014).
In the highly teacher-dependent and exam-oriented teaching and learning culture typical of Vietnam, if learners are made aware
that knowledge and use of MWUs can enhance their scores in English proficiency tests, practitioners may find more motivation to focus
on MWUs in their pedagogical practice. It is therefore valuable to examine the use and function of MWUs in English textbooks and
compare their use across reading and listening input. This study then explores three-, four- and five-word linguistic expressions used in
several popular English as a foreign language (EFL) textbook series commonly used by tertiary English major students in Vietnam,
revealing their presence and function across both listening and reading input so as to better inform teachers, learners and materials
developers of their role and importance in L2 learning.

2. Multiword units

In the literature, various terms have been used to describe expressions at and above the two-word level, e.g., ‘multiword units’,
‘formulaic language’ or ‘prefabricated chunks’ (Boers et al., 2006), with this study opting for the first (henceforth MWUs). Specifically,
an MWU is defined as a conventionalized fixed or semi-fixed string of two or more co-occurring words, with ‘conventionality’
assessable through corpus-derived criteria including frequency, range and mutual information as well as through expert speaker
judgment (Boers & Lindstromberg, 2012).
The study of MWUs has generated a significant amount of interest in recent years, with numerous research works highlighting their
pervasiveness and pedagogical significance (Boers, 2021; Nguyen & Coxhead, 2022; Siyanova-Chanturia & Pellicer-Sanchez, 2019;
Webb, 2019). According to Huang (2014), growing interest in MWU research can be attributed to two widely held beliefs: firstly, that
natural language makes extensive use of recurrent MWUs; and secondly that MWUs can play a crucial role in English language learning
and teaching. Indeed, MWUs are ubiquitous in both spoken and written discourse, as demonstrated via corpus-based studies showing
MWUs comprise approximately 30% of conversational corpus data and around 21% of academic prose corpus data (Biber & Conrad,
1999), or as much as 59% of spoken discourse and approximately 52% of written discourse (Erman & Warren, 2000). Although these
percentages vary depending on the corpora used within individual studies and how MWUs are specifically defined and identified,
overall findings suggest a prevalence of MWUs in both spoken and written language.
Regarding the second belief, the significance of MWU acquisition in second language (L2) learning is supported by research
highlighting their role in the development of learners’ interlanguage grammars, fluency and proficiency, and assisting learners in
sounding ‘native-like’. Learning formulaic language also plays a complementary role to more traditional, deductive, rule-based
learning processes (Wulff, 2019), in that incorporating MWUs in spoken language has been shown to benefit learners’ spoken
fluency by reducing hesitation and lessening processing loads (Boers et al., 2006). An extensive repertoire of relatively frequent L2
MWUs is also crucial for fluent L2 comprehension and production (Boers et al., 2012). It is now therefore widely suggested
that second language teaching as an enterprise should focus on building learners’ knowledge and use of MWUs (e.g., Meunier, 2012,
pp. 111–129).


3. Corpus linguistics research into MWUs

One method of understanding MWU use is through corpus linguistics. Corpus linguistics is concerned with using computer-based
empirical analyses to investigate spontaneous L1 and L2 language use across large text collections of naturally occurring spoken and
written production (i.e., corpora) (McEnery et al., 2019). Corpus linguistics is gaining importance as a research method in phraseological
studies (AUTHOR2 et al., 2022) given the significant role that (learner) corpus linguistics plays in establishing pre-patterned
word combinations through modern computational methods (Granger, 2019).
However, although textbooks serve as a primary source of L2 input for EFL learners (Jordan & Gray, 2019), few corpus-based studies have
scrutinized MWUs in English language teaching textbooks. Several studies have investigated the presence and use of MWUs in secondary
school textbooks. Koprowski (2005) found minimal representation of MWUs across three contemporary EFL textbooks, suggesting
that MWU selection seemed intuitive rather than determined empirically. Koya (2004) discovered a scarcity of collocations
from dictionaries in textbooks, with most being infrequent occurrences. McAleese (2013) highlighted the discrepancy between textbook
MWU usage and real language data, indicating limited representativeness. Tsai (2015) observed a small number of prescribed
collocations in textbooks, similarly lacking multiple occurrences. Lastly, Northbrook and Conklin (2018) noted a disparity between the
most frequent four-word MWUs in textbooks and those in spoken English, suggesting incongruity in their usage.
Most research efforts (e.g., Biber et al., 2004; Chen, 2010; Hsu, 2014, 2015; Koprowski, 2005; Wood, 2010) have concentrated on
academic and university settings. Biber et al. (2004) investigated four-word lexical bundles across various English registers, noting a
shortage of such MWUs in academic textbooks compared to other contexts. Biber and Barbieri (2007) observed distinctive four-word
MWU usage patterns between spoken and written university registers, highlighting a prevalence of referential expressions in written
registers versus stance-related MWUs in spoken contexts and suggesting mode has a significant influence on the communicative
purposes of MWU use. Examining MWUs in electrical engineering textbooks, Chen (2010) found a limited presence of MWUs from the
discipline in preparatory ESP textbooks, inadequately representing the breadth of functional MWUs used in that discipline. Coxhead
et al. (2017) analyzed four-word MWU vocabulary in university settings aiming to inform pedagogy and materials design, and
discovered a significant disparity between MWUs present in textbooks and those used in laboratory and tutorial contexts.
As a positive sign, corpus-based investigations on vocabulary in textbooks have been recently carried out in several diverse EFL
contexts, such as Nguyen (2021) in Vietnam; Sun and Dang (2020) in China; Rahmat and Coxhead (2021) and Yang and Coxhead
(2020) in Indonesia. However, in Vietnam at least, the findings of such studies have received little attention from language teachers
and textbook material developers, where single words remain the centre of attention. In addition, numerous studies focus on two-word
collocations and phrasal verbs, while those examining three to five word MWUs remain limited. Also, most investigations have
centered on MWU usage in written form, with few delving into oral corpus data. Therefore, additional studies that cast light on the use
of L2 English MWUs in textbooks in the Vietnamese context are much needed, raising teachers’ and learners’ awareness while also
providing suggestions for more effective teaching and learning. Thus, in the remainder of this paper, we seek to address the following
research questions:

RQ1: What English MWUs are present in the reading and listening input of English textbooks for Vietnamese EFL university students,
and how often do they occur?
RQ2: What are the functions of MWUs found in this input?
RQ3: Does MWU use vary significantly across reading and listening input?

4. Methodology

Three prominent stages were involved in this study’s methodology, namely the construction of textbook corpora, MWU extraction,
then the coverage calculation and functional analysis of the extracted MWUs.

4.1. Textbook corpora

Four coursebook series from various British publishers currently in use for tertiary English-major students at a regional Vietnamese
University were examined for MWUs. Following the institution’s four-year program designed for undergraduate English
majors, students were anticipated to achieve an English language proficiency equal to A2 on the Common European Framework of
Reference for Languages (CEFR) by the end of their initial year and a level akin to B1 by the end of their second year. They aimed to reach a CEFR C1
level of proficiency in English or secure an English certificate that matched the CEFR C1 level upon graduating. To achieve these goals,
the chosen textbook series (which constitute the basis for a profile of MWUs in this study) are Face2Face (Cunningham et al., 2009;
Redston & Cunningham, 2005, 2006, 2007), New Headway (Soars & Soars, 2003a, 2003b, 2005, 2007), New Cutting Edge (Cunningham
& Moor, 2003, 2005, 2006, 2007) and Solutions (Falla & Davies, 2012a; 2012b; 2013a; 2013b). Four volumes of Solutions from
Pre-Intermediate to Advanced levels were selected as compulsory class materials at this institution, while the remaining three series
are available as reference out-of-class study materials based on institutional recommendations.
These materials are designed for young tertiary-age adults, with subject and cultural content covering topics including ‘money and
finance’, ‘employment’, etc., rendering them appropriate for the target audience. This also constrains the MWUs present to be more
appropriate for said audience, unlike books designed for younger learners, which, while more in line with the target proficiency level
(A2-B1), will cover less learner-appropriate subject content (e.g., ‘animals’, ‘playtime’) potentially rendering the final MWU list less
pedagogically useful. Each textbook series is designed for integrated language skills, with each claiming to be aligned with the
Common European Framework of Reference proficiency levels. However, other than Face2Face, which specifies that its sample of vocabulary
was selected from the Cambridge International Corpus and Cambridge Learner Corpus, the other two series do not specify
their vocabulary data bank at all. Given the textbook-heavy Vietnamese EFL context described earlier, these textbooks represent a
significant proportion of the MWUs that learners of English at this institution are exposed to during their studies in terms of reading and
listening input used in and out-of-class.1
In preparation for analysis, the digital form of each textbook was converted to a text file using Optical Character Recognition
software. This study follows O’Loughlin (2012) in dividing the textbook corpus into reading input and listening input, using the
following criteria in classifying texts as reading input: (1) if a text resembles a typical text that learners would encounter in the real
world; and (2) if there is an expectation for text comprehension on the part of the learners. Unlike O’Loughlin (2012), if some of the
reading texts require the learners to supply the words or provide the correct forms of the verbs (these texts thus being left incomplete),
the study will keep the original texts without any insertions or corrections made, because it is not necessarily the case that the students
would have access to the answers to these incomplete texts. Regarding the listening input, the present study made use of the transcripts
included in the students’ books or supplied in the teachers’ books. As explained above, it is important to note that these transcripts do
not cover all the listening input the students will have been exposed to, but are representative of the input contained in the textbooks.
Any reading or listening input embedded in optional activities at the end of the student’s book was not included. The final materials
were collected and converted into text files for analysis with Collocate (Barlow, 2004), AntConc (Anthony, 2018), and LancsBox
(Brezina et al., 2015).
A summary of the textbook corpora used in this study is provided in Fig. 1.
The reading input corpus and listening input corpus (henceforth RIC and LIC) with their text numbers and token sizes are summarized
in Table 1.

4.2. MWU extraction

The development of the list of MWUs involved two stages: (1) computational analysis; and (2) manual refinement. First, a
quantitative computational analysis was conducted to extract MWU candidates based on frequency, distributional and statistical score
data. The employment of the frequency and range measures followed previous studies dealing with MWUs as ‘lexical bundles’ (e.g.,
Biber et al., 2004; Simpson-Vlach & Ellis, 2010). The study also adopted a statistical score measure (Mutual Information, MI). We now
focus on the MWU extraction process in the following section.

4.2.1. Computational analysis


The first consideration in identifying MWUs is the length of MWUs to be investigated. The length of an MWU is the number of
words it contains (Wray, 2002). It is worth noting that the most recurrent MWUs are generally made up of two-to-four words,
although previous research has extensively studied three-to-five word MWUs (Biber et al., 1999). As with this study, three to five word
MWUs have been chosen to be the object of investigation in many other previous corpus studies (e.g., Coxhead et al., 2017; Hyland,
2012; Simpson-Vlach & Ellis, 2010) because they are considered to be more phrasal, less common than two- or three-word MWUs
(which can provide an overwhelming list), and more frequent than five- or six-word MWUs (which may often fail to appear in some
small corpora). Cortes (2004) notes many three-word strings are parts of four-word MWUs. Simpson-Vlach and Ellis (2010) also
include strings of five words in their data set, although they admit these are relatively rare. Simpson-Vlach and Ellis (2010) argue
two-word units are very frequently included in three-word and four-word units, thus excluding bi-grams from their study to keep the
data set size manageable.
Regarding frequency cut-off level, the normalized frequency of each MWU per million is worked out by dividing its raw frequency
by the size of the corpus and then multiplying by 1,000,000. For example, “when it comes to” occurred four times in the reading corpus
(177,067 running words), for a relative frequency per million of 22.47. A cut-off of 40 times per million words is suggested if using
large corpora including millions of tokens (Biber et al., 2004), while other research uses cut-off ranges between 10 and 40 occurrences
per million words (Simpson-Vlach & Ellis, 2010) depending on corpus size. This study opted for an average frequency cut-off point of
20 occurrences per million words so as to create a less restricted dataset that would be later subject to manual criterion-based
refinement.
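As an illustration only, the normalization described above, and the raw count that the 20-per-million cut-off implies for each sub-corpus, can be sketched in Python as follows. The function names are hypothetical and do not belong to any tool used in this study; the token counts are those reported for the RIC and LIC in Table 1.

```python
import math

def per_million(raw_freq, corpus_tokens):
    """Normalized frequency: occurrences per million running words."""
    return raw_freq / corpus_tokens * 1_000_000

def min_raw_count(corpus_tokens, cutoff_per_million=20):
    """Smallest raw frequency that reaches the per-million cut-off."""
    return math.ceil(cutoff_per_million * corpus_tokens / 1_000_000)

RIC_TOKENS = 177_067  # reading input corpus size (Table 1)
LIC_TOKENS = 286_732  # listening input corpus size (Table 1)

ric_min = min_raw_count(RIC_TOKENS)
lic_min = min_raw_count(LIC_TOKENS)
```

Under these assumptions, a candidate needs a raw frequency of at least 4 in the RIC and at least 6 in the LIC to reach 20 occurrences per million words.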
Regarding the distribution of MWUs across a range of texts, percentages (Hyland, 2008) and minimum occurrences across different
texts (Biber & Barbieri, 2007) appear to be the most frequently used approaches. Unlike previous studies with a generally higher
number of shorter texts, the present study has a fairly small number of longer texts drawn from ELT textbooks. Consequently, the
present study uses a minimum range of at least two textbooks for extraction, in line with the minimum occurrence in at least 10 per cent of
corpus texts suggested by Hyland (2008).
Hsu (2015) argued that MWUs identified exclusively by frequency and range criteria tend not to be meaningful for EFL learners.
Therefore, alongside frequency-based measures, this study used mutual information (MI), a statistical measure that determines the above-chance
degree of words co-occurring in a phrase (Oakes, 1998). Its value indicates the degree of the mutual dependence of two or more
words (Hsu, 2015). Choosing a statistical measure for association is not a simple task, and depends on the nature of the research
questions, the collocation patterns to be determined, the dataset, and (in this case) pedagogical purpose.

1 Of course, this will not be the only source of MWU input for these students, but will likely represent the coverage they would receive from the
institution in question.

Fig. 1. Textbook corpora.

Table 1
Description of the two sub-corpora of English textbooks.

| | Sub-corpus | Number of texts | Size in tokens | English level |
|---|---|---|---|---|
| Textbook corpus | Sub-corpus 1 (RIC) | 16 textbooks | 177,067 | Pre-intermediate to Advanced |
| | Sub-corpus 2 (LIC) | 16 textbooks | 286,732 | Pre-intermediate to Advanced |
| | Total | 16 textbooks | 463,799 | Pre-intermediate to Advanced |

Simpson-Vlach and Ellis (2010) adopted MI to rank the MWUs for pedagogical use, showing that experienced EAP and ESL instructor participants rate MWUs to
be more formulaic, have more cohesive meaning or function, and deserve more teaching time if they rank higher on frequency and MI,
with MI being the primary determining factor. Furthermore, Hsu (2015) maintains that MWUs with high MI scores are more likely to
have distinctive functions and meanings and be suitable for pedagogical purposes. However, we are aware that the use of measures
including MI is not without controversy. For example, previous research claims MI tends to be strongly affected by frequency,
awarding high scores to MWUs having less frequent components (Brezina, 2018). That said, MI is considered a “real” (Gries, 2022,
p.18) measure of ‘exclusive’ association (e.g., “okey dokey”), rendering it potentially more pedagogically valuable than highly
frequent but non-exclusive collocations (e.g. “new idea”, Brezina, 2018). Moreover, Gries’ (2022) investigation of various association
measures downplayed the impact of frequency on MI, noting it “hardly reflects frequency at all and is pretty much identical to the log
odds ratio” (Gries, 2022, p.17). Rather, by combining information on frequency, range and MI together, our study follows published
precedent in studies such as Ellis et al. (2008), Siyanova-Chanturia (2015), and Gries (2019), where a separation of such dimensions of
information “leads to interpretable and interesting results” (Gries, 2022, p.19).
It is worth noting that there is no minimum cut-off for MI score, although some phrase extraction programs have the default value
set at three. According to Hunston (2002), collocations with an MI score above three are considered strong. Hsu (2015) set the default
MI score value of three as a cut-off point to extract the list of the most frequent MWUs in college engineering textbooks. In this study,
MWUs with an MI score greater than the default value of three were considered for screening the candidates for subsequent manual
refinement.
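To make the MI criterion concrete, the sketch below computes one common generalization of pointwise mutual information to n-grams: the base-2 logarithm of the observed n-gram probability divided by the product of the component word probabilities. This is an assumption for illustration; the exact formula implemented in Collocate is not specified here, and the function names and toy token list are hypothetical.

```python
import math
from collections import Counter

def ngram_mi(tokens, n):
    """MI score per n-gram: log2(P(w1..wn) / (P(w1) * ... * P(wn))).

    Simplification: the corpus size is used as the denominator for both
    word and n-gram probabilities.
    """
    total = len(tokens)
    unigram = Counter(tokens)
    ngram = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
    scores = {}
    for gram, freq in ngram.items():
        observed = freq / total
        expected = math.prod(unigram[w] / total for w in gram)
        scores[gram] = math.log2(observed / expected)
    return scores

def passes_mi_cutoff(score, cutoff=3.0):
    """Keep only candidates whose MI score is greater than the cut-off."""
    return score > cutoff

# Toy data, invented purely to exercise the function.
toy_tokens = ["x", "y", "x", "y"]
toy_scores = ngram_mi(toy_tokens, 2)
```

Because the product of component probabilities shrinks with every additional word, longer n-grams tend to receive much higher scores under this formulation than two-word collocations, so a threshold of three is a lenient filter for three- to five-word candidates.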
It is not recommended to rely on one corpus tool to identify MWUs as meeting all pre-determined criteria to be included in the final
lists (Huang, 2014). We therefore selected two popular corpus tools, namely Collocate (Barlow, 2004) and AntConc (Anthony, 2018), to
extract all the possible MWUs. The software Collocate was chosen mainly because it can generate all the n-grams by frequency
threshold and MI score and can output the MI scores of all n-grams. However, Collocate did not provide information on collocation
range in its output. AntConc was therefore used to output the range of each MWU, as well as to double-check the accuracy of MWU
extraction results by Collocate.

Table 2
Sample Collocate output from the computational analysis of MET.

| Freq. | MI score | n-grams |
|---|---|---|
| 9 | 27.6214 | really looking forward to it |
| 7 | 27.4812 | as a matter of fact |
| 26 | 26.0414 | how long have you been |
| 10 | 25.7894 | from all over the world |
| 7 | 25.4792 | can i take a message |
| 7 | 25.2305 | about half an hour |
| 10 | 25.1721 | how are you i’m fine |
| 7 | 24.9843 | i wonder if you could |
| 6 | 24.116 | i’m not sure about that |
| 6 | 24.1084 | you can say that again |

To summarise, all three-, four-, and five-grams that occur at least 20 times per million, have an MI score greater than three, and occur
in at least two out of 16 textbooks under investigation were extracted from the listening input and reading input texts for analysis.
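A minimal sketch of the frequency and range part of this filter is given below, assuming each textbook is available as a list of lower-cased tokens. The function name, the input format and the toy texts are hypothetical, and the MI criterion is deliberately left out so that the example stays short; it does not reproduce the behaviour of Collocate or AntConc.

```python
from collections import Counter

def extract_candidates(texts, n, min_per_million=20.0, min_range=2):
    """Return {n-gram: (raw frequency, range)} for n-grams meeting both cut-offs.

    texts: dict mapping a textbook name to its list of tokens.
    Range is the number of textbooks in which the n-gram occurs.
    """
    total_tokens = sum(len(tokens) for tokens in texts.values())
    freq = Counter()
    text_range = Counter()
    for tokens in texts.values():
        grams = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        freq.update(grams)               # add raw counts from this textbook
        text_range.update(grams.keys())  # one vote per textbook
    kept = {}
    for gram, f in freq.items():
        per_million = f / total_tokens * 1_000_000
        if per_million >= min_per_million and text_range[gram] >= min_range:
            kept[gram] = (f, text_range[gram])
    return kept

# Toy input, invented for illustration only.
toy_texts = {
    "book_a": "as a matter of fact".split(),
    "book_b": "as a matter of course".split(),
}
toy_candidates = extract_candidates(toy_texts, n=3)
```

In the toy input, only the trigrams shared by both books survive the range criterion; those that appear in a single book are dropped even though their normalized frequency is high.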

4.2.2. Manual refinement


Computational analysis based on length, frequency, range and statistical score measures generated long lists of recurrent word
sequences that required manual checking to refine and validate the extracted MWUs across textbook corpora. Table 2 provides a
sample output derived from the LIC from Collocate highlighting why further manual refinement was required, as entries may cross
sentence boundaries (e.g., how are you I’m fine), contain proper names (e.g., in the United States), may not have a cohesive meaning or
function as a conventionalized phrase (e.g., how are you I’m fine) or may not be suitable for teaching purposes (e.g., how long have you
been).
The three selection criteria for intuition-based manual checking were: (1) formulaicity or conventionality; (2) cohesiveness (to
determine the extent to which word components in MWUs occur together more frequently than would be expected by chance); and (3)
potential pedagogic value, following Simpson-Vlach and Ellis’ (2010) criteria for including MWUs into their lists. Each MWU was rated
on a three-point scale corresponding to the responses yes, not sure and no for each of these three criteria, recoded as 1, 0.5
and 0 respectively. Items receiving the maximum total score of three across the criteria were added to the final list immediately, while items where the
two raters decided ‘yes’ or ‘not sure’ on two of the three criteria went through additional qualitative judgement until both raters agreed
that a given item met all three criteria, with any that did not being rejected. This procedure left a total of 239 MWUs for the LIC and
140 MWUs for the RIC.
There were many instances of overlap in these two data sets (e.g., a bit of, a bit of a, exactly the same, the same as, exactly the same as).
The next step involved collapsing the overlapping data to avoid duplicate values when measuring text coverage across corpora. Parentheses
denote optional words, e.g., ‘and things like that’ and ‘things like that’ were collapsed into ‘(and) things like that’. Frequency counts were
collected for each chunk appearing in either of the two corpora. In some cases, a subtractive method was adopted to represent a more
accurate frequency figure, e.g., a matter of was subsumed in as a matter of fact. In order to focus on just a matter of in LIC, it was possible
to simply subtract the number of times the idiomatic expression as a matter of fact appeared in the corpus (7) from the number of
occurrences of the string a matter of (10), which rendered a difference of 3. In other words, the true frequency of just a matter of in LIC is 3.
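The subtractive step can be sketched as follows, using the two LIC counts quoted above. The helper names are hypothetical, and the sketch assumes that every occurrence of a shorter string belongs to at most one longer listed MWU; with chains of nested items (e.g., a bit of, a bit of a) a more careful procedure would be needed to avoid subtracting the same occurrence twice.

```python
def is_contiguous_part(shorter, longer):
    """True if the words of `shorter` appear as a contiguous run inside `longer`."""
    s, l = shorter.split(), longer.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def standalone_frequencies(freqs):
    """Subtract from each MWU the counts of longer listed MWUs that contain it."""
    adjusted = {}
    for mwu, f in freqs.items():
        contained = sum(
            g for other, g in freqs.items() if is_contiguous_part(mwu, other)
        )
        adjusted[mwu] = f - contained
    return adjusted

# Counts reported for the LIC in the text above.
lic_counts = {"a matter of": 10, "as a matter of fact": 7}
adjusted = standalone_frequencies(lic_counts)
```

Here the adjusted count for a matter of is 10 minus 7, i.e., 3, while the count for as a matter of fact is unchanged.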

4.3. Analyses of the target MWUs

4.3.1. Measuring the MWU occurrence rate


In order to measure lexical coverage, this study follows Wingrove (2022) in first measuring the raw frequencies of every MWU type
in each textbook file (using LancsBox, Brezina et al., 2015) before converting these raw frequencies into percentage coverage by
dividing them by the total number of tokens of each text. This approach allows an analysis of the overall coverage of the lists and
coverage across sub-corpora in terms of their frequency of occurrence. In the case of MWUs with optional components, all likelihoods
were calculated without duplication, e.g., ‘and things like that’ and ‘things like that’ were collapsed into ‘(and) things like that’ as
described above.
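The calculation can be illustrated with the short sketch below, which counts a collapsed item such as ‘(and) things like that’ once per occurrence, whether or not the optional word is present, and then divides by the number of tokens in the text. It is not the LancsBox procedure; the function names, the whitespace tokenization and the sample sentence are assumptions made for illustration, and optional words are assumed never to be the final word of an item.

```python
import re

def mwu_regex(mwu):
    """Compile a pattern for an MWU in which '(word)' marks an optional word."""
    pieces, prefix = [], ""
    for word in mwu.split():
        if word.startswith("(") and word.endswith(")"):
            # Optional word: attach it, with its trailing space, to the next word.
            prefix += "(?:" + re.escape(word[1:-1]) + r"\s+)?"
        else:
            pieces.append(prefix + re.escape(word))
            prefix = ""
    return re.compile(r"\b" + r"\s+".join(pieces) + r"\b", re.IGNORECASE)

def percentage_coverage(text, mwus):
    """MWU occurrences divided by the token count of the text, as a percentage."""
    n_tokens = len(text.split())
    hits = sum(len(mwu_regex(m).findall(text)) for m in mwus)
    return hits / n_tokens * 100

# Invented sample sentence (11 tokens, 2 occurrences of the collapsed item).
sample = "we sell books and things like that or things like that"
sample_hits = len(mwu_regex("(and) things like that").findall(sample))
sample_coverage = percentage_coverage(sample, ["(and) things like that"])
```

Matching the collapsed form directly means that the longer variant is never counted a second time as an instance of the shorter one.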

4.3.2. Normalized MWU types and type/token ratios


The next step is to examine the quantity and variety of MWUs used in different sub-corpora. Following Huang’s (2014) study, this
study calculated and then normalized the number (types) of different MWUs identified from each of the four sub-corpora and the total
instances (tokens) of each MWU so that comparisons can be made. The normalized MWU types and tokens stand respectively for the
number of different MWUs and the total instances of all MWUs per 100 words in each sub-corpus. According to Huang (2014), these
type/token ratios indicate the extent to which the same MWUs were repeatedly used in a corpus - higher type/token ratios represent
more varied/less repetitive MWU use.
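As a rough sketch of these measures, the function below takes the frequency of each listed MWU in a sub-corpus together with the sub-corpus size, and returns the normalized types, normalized tokens and type/token ratio. The function name, the dictionary keys and the toy counts are hypothetical and are not taken from the study’s data.

```python
def mwu_profile(mwu_counts, corpus_tokens):
    """Normalized MWU types and tokens (per 100 words) and their type/token ratio.

    mwu_counts: dict mapping each listed MWU to its frequency in the sub-corpus.
    corpus_tokens: total number of running words in the sub-corpus.
    """
    types = sum(1 for count in mwu_counts.values() if count > 0)
    tokens = sum(mwu_counts.values())
    return {
        "types_per_100": types / corpus_tokens * 100,
        "tokens_per_100": tokens / corpus_tokens * 100,
        "type_token_ratio": types / tokens if tokens else 0.0,
    }

# Invented counts for a 400-word toy sub-corpus.
toy_counts = {"a lot of": 6, "first of all": 2, "in terms of": 0}
toy_profile = mwu_profile(toy_counts, corpus_tokens=400)
```

In the toy data, two MWU types account for eight MWU tokens, giving a type/token ratio of 0.25; a sub-corpus that spread the same eight tokens over more types would score higher, reflecting less repetitive use.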

4.3.3. Determining the functional MWU classification


To investigate differences in MWU use across reading and listening input, we follow Biber et al.’s (2004) functional classification as
it applies to both written and spoken registers, unlike Hyland’s (2008) functional scheme which emphasises written registers. Biber
et al.’s taxonomy divides MWUs into four primary categories: (1) referential expressions, (2) stance expressions, (3) discourse organizers,
and (4) special conversational functions. Referential expressions, as described by Biber et al. (2004, p. 384), directly “denote
physical or abstract entities, the textual context, or highlight specific attributes of entities for identification or emphasising particular
characteristics”. Stance expressions refer to “the presentation of attitudes or evaluations of certainty that shape another proposition”
(p. 384). Discourse organizers signal relationships between prior and upcoming talk. Special conversational functions
include expressions of politeness, simple inquiry and reporting. These functions, subcategories and examples from the textbook corpus
are tabulated in Table 3.
The functional categorization of MWUs is not straightforward: the function of an MWU might vary depending on the context,
MWUs may perform more than one function, and a certain level of subjectivity may be involved. One example is ‘the fact
that’, which, as shown in the concordance lines in Fig. 2, can function either to specify intangible framing attributes or to strengthen
the author’s stance.

Table 3
Sample of MWUs in English textbooks by function.
Major category             Sub-category                      Examples

Referential expressions    Intangible framing attributes     in terms of
                           Tangible framing attributes       the amount of, days per week
                           Quantity specification            a lot of, a number of
                           Identification and focus          referred to as, the only way
                           Contrast and comparison           all the same, the equivalent of
                           Deictics and locatives            all over the world, at a time
                           Vagueness markers                 (or) something like that

Stance expressions         Hedges                            the fact that, more or less
                           Epistemic value                   as I said, of the opinion that
                           Obligation and directive          make sure that
                           Ability and possibility           be able to, be possible to
                           Evaluation                        good for you, great to see
                           Intention/volition, prediction    let me see, were going to

Discourse organizers       Topic introduction                first of all
                           Linking                           on the other hand

Conversational functions   Politeness                        Thank you very much
                           Simple inquiry                    What’s going on?

Fig. 2. Sample concordance lines for the fact that.
The researchers therefore independently assigned a function to each MWU and then discussed discrepancies, drawing on a manual
concordance examination of all lines from the textbook corpus for each MWU in LancsBox (Brezina et al., 2015).2 The Cohen’s
Kappa statistic was then used to test the inter-rater reliability of the independent ratings for the MET list across all 150 identified MWU
types. A κ value of 0.75 (p < 0.0001) was recorded, indicating substantial agreement between the two raters. Items on which the two
raters did not agree were reserved for further discussion prior to eventual inclusion in or removal from the list.
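For readers unfamiliar with the statistic, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is illustrative only: the function name and the eight example labels are hypothetical, and it does not reproduce the authors' ratings or the reported value of 0.75.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each label the same items once."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label counts.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical functional labels for eight MWUs (not the study's data).
rater_1 = ["ref", "ref", "stance", "stance", "disc", "conv", "ref", "stance"]
rater_2 = ["ref", "ref", "stance", "ref", "disc", "conv", "ref", "stance"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.82
```

Here the raters agree on 7 of 8 items (0.875) against a chance agreement of 0.3125, giving a kappa of about 0.82.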

4.3.4. Statistical analysis


Statistical analysis was then performed on the MET list and its functional subsections across the two corpora. Three two-tailed
paired-samples t-tests were run, as the data for the MET list and its subsections of referential expressions and discourse organizing
functions were found to meet the following assumptions: #1 the dependent variable (occurrence rate) is measured on a continuous
scale; #2 the independent variable is categorical with two comparison groups (RIC or LIC); #3 there is no significant outlier in the
differences between the occurrence rates of RIC and LIC; and #4 the distribution of the differences between the occurrence rates of RIC and
LIC is approximately normal. Two Wilcoxon signed rank tests were also performed for the subsections of stance expressions and special
conversational functions, as they were found to meet the following assumptions: #1 the dependent variable (occurrence rate) is
measured at the continuous level; #2 the independent variable is categorical with two groups (RIC and LIC); and #3 the distribution of
the differences between the occurrence rates of RIC and LIC is symmetrical in shape. Since the total number of tests performed is five,
the alpha level was set at 0.01 using the Bonferroni correction.
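The Bonferroni adjustment is simple arithmetic, sketched below. The family-wise alpha of 0.05 is an assumption (the text reports only the five tests and the resulting threshold of 0.01), and the p-values in the usage lines are hypothetical.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha after a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests

def is_significant(p_value, alpha):
    """A test is significant when its p-value falls below the corrected alpha."""
    return p_value < alpha

# Assumed family-wise alpha of 0.05 spread over the five tests.
alpha = bonferroni_alpha(0.05, 5)
print(round(alpha, 2))               # 0.01
print(is_significant(0.004, alpha))  # True  (hypothetical p-value)
print(is_significant(0.07, alpha))   # False (hypothetical p-value)
```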

5. Results

5.1. Results: the list of MWUs in English textbooks

In total, the data-driven n-gram lists for LIC (286,732 tokens) and RIC (177,067 tokens) contain 3365 and 1653 entries respectively.

2 Although MWUs can take more than one function, the most salient function of each MWU was selected following close examination of the
specific concordance lines.


The 3365 and 1653 entries underwent a qualitative review in which each entry was assessed independently by the two researchers to
determine whether it should be included, discussed or excluded from further analysis. The Cohen’s Kappa statistic was
used to test the inter-rater reliability of the independent ratings on a subset of 250 items. κ values of 0.73, 0.75, and 0.73 (p <
0.0001) were recorded for the three criteria respectively, indicating substantial agreement between the two raters. The total number
of MWUs appearing in the final master list was 263 (see Appendix 1), with the RIC and LIC containing 204 and 254 MWUs respectively.3
These 263 MWU types across RIC/LIC represented 5173 instances (tokens) of total MWU usage in our textbook corpora. Of
these 263 MWUs, 154 (59%) were shared by all four textbook series (see Appendix 2) and so constitute a common group
among these textbooks. Interestingly, there were no noticeable differences in terms of the occurrence and
distribution of MWUs across the textbook series, with each textbook series accounting for approximately a quarter of total MWU usage. In
detail, MWU instances in Solutions totaled 1417, accounting for 27.39% of the total number of MWU instances, followed by Face2Face with
25.58%, New Headway with 25.15% and New Cutting Edge with 21.88% of total MWU usage. Regarding register, 57 reading-specific
MWUs were common across all four coursebook series, accounting for approximately one-fifth of the recorded MWUs (21.67%),
while 136 listening-specific MWUs, accounting for 51.71% of recorded MWUs, were common across the listening input. This contrast implies
the listening input exhibits a notably larger proportion of commonly shared MWUs in comparison to the reading input across our
textbook corpora.

5.2. Results: overview of occurrence rate

A two-tailed t-test for independent means concerning the occurrence rate of the MET list in RIC and LIC produced the following: t(30)
= −8.97, p < 0.01. The rate of occurrence of the MET list was significantly greater in the LIC (Mean = 1.30%, SD = 0.19%) than in
the RIC (Mean = 0.83%, SD = 0.09%), p < 0.01, 95% CI [−0.58, −0.36]. The null hypothesis that LIC and RIC are the same in terms of
the MET occurrence rate was therefore rejected. In terms of percentage increase required, the average RIC text would require
approximately 57% more MWU items to reach the LIC rate. The effect size for this comparison was Cohen’s d = 3.16, which can be
considered a very large effect by Cohen’s (1988) guidelines for effect sizes.
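The reported effect size and percentage increase can be recovered from the means and standard deviations above. The sketch assumes Cohen's d was computed with an equal-weight pooled standard deviation; the text does not state which formula was used, although this one reproduces the reported value. Function names are illustrative.

```python
from math import sqrt

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the equal-weight pooled SD (an assumed formula)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd

def percent_increase(base, target):
    """Percentage by which `base` must grow to reach `target`."""
    return 100 * (target - base) / base

# Means and SDs for the full MET list as reported above (RIC first, then LIC).
d = cohens_d(0.83, 0.09, 1.30, 0.19)
increase = percent_increase(0.83, 1.30)
print(round(d, 2))      # 3.16
print(round(increase))  # 57
```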
Table 4 also shows the MWU occurrence rate in terms of the functional subsections of the list. Interestingly, reading input and
listening input texts tend to orchestrate MWU-containing discourse rather differently, particularly in terms of stance expressions (e.g.,
as a matter of fact, I must admit, I wouldn’t say) and special conversational functions (e.g., how about you, go for it, have a word). The MET
stance expression rate of occurrence was significantly greater in the LIC (Mean = 0.43%, SD = 0.08%) than in the RIC (Mean = 0.17%,
SD = 0.06%), p < 0.01, 95% CI [−0.31, −0.21], as was the MET special conversational function rate of occurrence (LIC Mean = 0.20%,
SD = 0.11%; RIC Mean = 0.02%, SD = 0.02%), p < 0.01, 95% CI [−0.24, −0.13]. The effect sizes were very large (Cohen’s d = 3.68 for
stance expressions and d = 2.28 for special conversational functions). The data indicate that reading input texts
contained significantly fewer MWU stance expressions than listening input texts; such expressions include a bit of (a), a little bit,
as I said, and as you know. Special conversational function MWUs also appear to be a major distinguishing feature
of listening input texts, as seen in MWUs including call you back, do you fancy, and do you mind (if). RIC texts had a slightly higher
mean rate of discourse organizing functions, but the difference was not statistically significant (t(30) = 0.76, p = 0.45, 95% CI
[−0.19, 0.04]), as the 95% CI contains zero.
The distribution of the MWU occurrence rates across the RIC and LIC can be illustrated in the histograms in Fig. 3. The y-axis
displays the percentage occurrence rate the list(s) achieve across texts. The x-axis displays the number of texts that reach that level of
coverage. As shown, there is little or no overlap across the two corpora, and most LIC texts have a higher MWU occurrence rate than
their RIC counterparts.

5.3. Normalized MWU types and type/token ratios

Table 5 shows that 204 MWU types represent 1432 instances (tokens) of total MWU usage in RIC, whereas 254 types and 3741 instances
of MWUs were identified in LIC. Looking at the normalized MWU types (0.12 vs. 0.09 per 100 words), RIC contained almost 1.3 times as
many MWU types as LIC. The type/token ratio for RIC (1/7) is twice that for LIC (1/14), which means a much wider
variety of MWUs was used in RIC than in LIC, with fewer MWUs being repeatedly used.
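The normalized figures can be reproduced from the raw counts reported for the two sub-corpora. The sketch below uses only those counts; the function names are illustrative. Note that the LIC ratio works out to roughly one type per 14.7 tokens, which is reported as 1/14.

```python
def per_100_words(count, corpus_size):
    """Normalize a raw count to a rate per 100 words of the sub-corpus."""
    return 100 * count / corpus_size

def tokens_per_type(n_types, n_tokens):
    """Average MWU instances per MWU type (a type/token ratio of 1/x)."""
    return n_tokens / n_types

# (size in tokens, raw MWU types, raw MWU tokens) as reported for each sub-corpus.
sub_corpora = {"RIC": (177_067, 204, 1432), "LIC": (286_732, 254, 3741)}

for name, (size, n_types, n_tokens) in sub_corpora.items():
    print(name,
          round(per_100_words(n_types, size), 2),
          round(per_100_words(n_tokens, size), 2),
          round(tokens_per_type(n_types, n_tokens), 1))
# RIC 0.12 0.81 7.0
# LIC 0.09 1.3 14.7
```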
The repetitions of the 263 MWUs in the reading and listening input are shown in Table 6. It is worth noting that in the
reading input, more than half of the MWUs (nearly 58%) had five or fewer occurrences, while only a small number of MWUs in the
listening input (14.17%) occurred five times or fewer. Table 6 shows nearly 86% of the MWUs in the listening input were recycled 6 times or
more. In other words, MWUs were repeated to a far greater extent in the listening input than in the reading
input.
The top 50 most frequent MWUs in the RIC and LIC corpora were also investigated. The top 50 items in RIC occurred 4862.57 times
per million, while the top 50 MWUs in LIC occurred 6738.00 times per million. In other words, the top 50 MWUs in LIC occurred
roughly 1.4 times as frequently as the top 50 MWUs in RIC. For most items at the same rank in the two lists, LIC MWU items are found to
be 50% or 100% more frequent than the MWU items at the same rank in the RIC texts. For example, the most frequent MWU in LIC, a lot
of, occurs 788.19 times per million running words, while the most frequent MWU in RIC, in the world, occurs 372.74 times per million
running words.

3 Several MWUs are shared among RIC and LIC as is shown later.

Table 4
Mean occurrence rate (%) of the list of MET across the RIC and LIC.
List                               RIC           LIC           Percent    t         95% Confidence interval   p              Cohen’s d
                                   Mean   SD     Mean   SD     increase             of the difference         (two-tailed)

The MET list                       0.83   0.09   1.30   0.19   57%        −8.97     [−0.58, −0.36]            0.00*          3.16
Referential expressions            0.53   0.09   0.59   0.08   11%        −1.9      [−0.12, 0.00]             0.07
Stance expressions                 0.17   0.06   0.43   0.08   153%       −11.03    [−0.31, −0.21]            0.00*          3.68
Discourse organizing functions     0.09   0.05   0.08   0.03   −11%       0.76      [−0.19, 0.04]             0.45
Special conversational functions   0.02   0.02   0.20   0.11   900%       −6.55     [−0.24, −0.13]            0.00*          2.28

*Significant at p = 0.01.

Fig. 3. The MWU occurrence rate across the RIC and LIC.

Table 5
Normalized MWU types and tokens from two textbook sub-corpora.
Sub-corpora   Size in tokens   Raw types   MWU types per 100 words   Raw tokens   MWU tokens per 100 words   Type/token ratio

RIC           177,067          204         0.12                      1432         0.81                       1/7
LIC           286,732          254         0.09                      3741         1.30                       1/14

Table 6
The number of repetitions of the 263 MWUs in the reading and listening input.
Repetitions   Reading input                                    Listening input
              Frequency   Percentage   Cumulative percentage   Frequency   Percentage   Cumulative percentage

Once          33          16.18%       16.18%                  4           1.57%        1.57%
2-5 times     86          42.16%       58.34%                  32          12.60%       14.17%
6-10 times    52          25.49%       83.83%                  101         39.76%       53.93%
>10 times     33          16.17%       100%                    117         46.07%       100%

5.4. Overview of functional classification

Table 7 summarizes the distribution of MWUs across the four primary functional classifications. The MET list includes 121
referential expressions, 88 stance expressions, 36 discourse organizers and 18 conversational expressions. A very high proportion of
MWUs in the textbook corpus are referential expressions, accounting for 46% of the total MET, while special conversational
functions are considerably less common (6.85%). Stance expressions are the second most common functional type (33.46%), followed by
discourse organizers (13.69%).

Table 7
Distribution of the list of MET across functions.
Functional categories              Number   Percentages   Example instances

Referential expressions            121      46.00%        (and) things like that, (exactly) the same (as), (nothing) to do (with)
Stance expressions                 88       33.46%        a bit of (a), a little bit, again and again, are able to, are going to
Discourse organizers               36       13.69%        (i) see what you mean, as a child, as a matter of fact, as a result
Special conversational functions   18       6.85%         how about you, how are things, how are you, how’s it going
Fig. 4 provides an overview of the proportions of the four functional categories in the RIC and LIC. Referential MWU expressions are
the most common functional category used in both RIC and LIC, whereas the least common type for both RIC and LIC appears to be
special conversational function MWUs. Discourse organizing MWUs are employed less frequently compared to both referential and
stance expressions in both RIC and LIC.

6. Discussion

The present study has investigated the MWUs present in the reading and listening input in popular EFL textbook series commonly
used in Vietnamese tertiary English instruction. This was done with a view to understanding the functions of the MWUs found in these
textbooks, how they differ in listening and reading input texts, and whether textbook input may need to be supplemented with other
suitable input to help learners improve their knowledge and use of MWUs. This was addressed by investigating the list of MWUs in
English textbooks, their overall occurrence rate, a comparison of the quantity and variety of MWUs between the reading input and
listening input, and comparison of functional MWU classifications across written and oral input.
Regarding the list of MWUs in English textbooks, nearly three-fifths of the MWUs (154 out of the recorded 263 MWUs) were found to
be shared across the four textbook series. This result is encouraging in comparison with previous research by Hsu (2006) who found
less than one percent (0.84%) of MWUs were commonly shared among their three selected ELT textbook series (Communication
Strategies, Touchstone, and Totally True). In Hsu’s study, one textbook (Totally True) was designed specifically to enhance reading and
vocabulary skills only, unlike Communication Strategies and Touchstone that are more focused on integrating the four skills of listening,
reading, speaking, and writing. In the present study, the elevated occurrence of shared MWUs among the four textbook series may be
attributed to their collective emphasis on integrated skills and their tailored design targeting English learners spanning A2 to C1
proficiency levels. This suggests future investigations of textbook corpora should aim for balance in terms of their selected textbooks’
aims, scope and integration of varying registers.
As regards overall MWU occurrence rate, the occurrence rate of MWUs in the analyzed textbooks (means of 0.83% in the RIC and
1.30% in the LIC) appears lower than in prior studies of MWUs in other registers. For example, previous research suggests a higher
prevalence of MWUs in native-speaker communication, constituting approximately 59% of spoken and 52% of written discourse
(Erman & Warren, 2000), while Foster (2001) found approximately 32% of spoken discourse comprised MWUs. This suggests the range
of MWUs in these textbooks may not represent that of authentic native speaker-like production, a critical finding for materials
developers and teachers/students intending to use these textbooks.

Fig. 4. Distribution of the four functional categories across RIC and LIC.
Comparing the MWU occurrence rate across reading and listening inputs revealed statistically significant differences, with a very
large effect size. On average, reading input would require 57% more MWUs to match the frequency found in listening texts. This finding
is only partly in line with prior research (e.g., Biber & Barbieri, 2007; Biber & Conrad, 1999). While contradicting Biber and Barbieri’s (2007)
observation that MWUs are more common in written university registers, our finding supports Biber and Conrad’s (1999) conclusion
that MWUs are more frequent in conversation than in prose. Regarding MWU recycling, the majority of MWUs in listening (about 86%) recurred
6 times or more, whereas roughly 58% in reading occurred 5 times or fewer, which is relevant to research suggesting a learner needs to
encounter a collocation between 5 and 15 times before s/he acquires it (Webb et al., 2013). Thus, in our findings regarding normalized
MWU types and type/token ratios, the English textbooks’ reading input exhibits more MWU types per 100 words and a broader variety of
MWUs, with a lower MWU recurrence rate, than the listening input. Moreover, the top 50 most frequent MWUs in listening were
50% or 100% more frequent than those in reading. For instance, the most frequent MWU ‘a lot of’ occurred 788.19 times per million
words in spoken text compared to 372.74 times in written text for the most frequent MWU ‘in the world’. This mirrors earlier findings
(e.g., Shin & Nation, 2007) indicating higher frequency spoken MWUs compared to written ones.
Finally, the findings regarding functional classification display an uneven distribution of MWU categories in both reading and
listening inputs. Significant differences were noted in stance expressions (e.g., ‘as a matter of fact’, ‘I must admit’, ‘I wouldn’t say’) and
special conversational functions (e.g., ‘how about you’, ‘go for it’, ‘have a word’) between reading and listening inputs. While special
conversational functions would naturally occur in oral discourse, the fact that reading materials contained notably fewer stance expressions
than listening materials highlights substantial functional differences in MWU use between reading and listening resources.
This aligns with prior research (e.g., Biber, 2006) contrasting stance expression usage in university spoken and written
registers, revealing textbooks generally exhibit lower instances of all stance expressions compared to their use in spoken university
registers. Discourse-organizing MWUs were less utilised than referential and stance expressions in both written and spoken inputs.
Although achieving an evenly distributed functional coverage may not be a feasible goal for materials developers,
the importance of discourse organizing MWUs should not be overlooked given their crucial role in signaling discourse
relationships, as noted by scholars including Huang (2014, p. 185), Biber et al. (2004), and Hyland (2012, p. 160).

7. Pedagogical implications

The current study aims to raise awareness among teachers and learners about MWUs in Vietnam and similar contexts, emphasising
the significance of MWUs in language learning. In this research, it was noted that among the textbooks studied, only Face2Face
explicitly mentioned using the Cambridge International Corpus, which houses a vast collection of text and spoken data, along with the
Cambridge Learner Corpus, holding 45 million words of written data, as sources for its MWUs. This finding aligns with Hsu’s (2015)
suggestion for the development of more MWU-focused textbooks that leverage extensive corpora to create English Language Teaching
(ELT) materials that are theoretically, pedagogically, and commercially robust. Therefore, it is imperative for Vietnamese institutional
leaders to pay close attention to this issue when selecting optimal coursebooks as a means to cultivate knowledge of MWUs.
While recognizing the influence of implicit learning on phraseology, research (e.g., Bui, 2016) suggests a greater emphasis on
explicit training in MWUs for Vietnamese EFL majors. This aligns with Schmidt’s (1990) noticing hypothesis in that awareness is
fundamental for uptake. The need for more exposure to English MWUs is particularly obvious in Vietnamese classroom environments
given the limited chances for genuine, communicative language interactions outside of class. Thus, directing students’ attention to
MWU input in English textbooks could effectively enhance learners’ awareness and, consequently, encourage their self-directed
learning of MWUs out of class.
Additionally, textbooks may expose learners to diverse MWUs but lack repetition in their use, differing from natural speech where
MWUs are frequently repeated. As asserted by Tsai (2015), conventional course materials are incapable of offering adequate exposure
and must be complemented by substantial supplementary listening and reading resources. Therefore, educators should emphasize the
value of MWU repetition in learning and adapt the curriculum to ensure exposure to MWUs in both reading and listening contexts.
Also, as observed by Bui (2016), in textbooks currently used in Vietnam, e.g., Life, New Cutting Edge, English Files, or Solutions,
vocabulary items are not always highlighted or underlined, and very few of the typographically enhanced items are MWUs. Future
materials should therefore use textual or typographic enhancement to make MWUs more salient.
Spoken MWUs hold greater importance in oral communication than written MWUs do in written language, suggesting a need for
increased attention during teaching (Shin, Lee, & Choi, 2023). Our study underscores the impact of spoken MWUs from listening input,
urging more teaching time be devoted to them. In a Vietnamese EFL context where vocabulary learning heavily relies on instructed
input, and classroom instruction often centers on grammar and reading comprehension, it is crucial for both educators and learners to
recognize the prevalence of high-frequency MWUs provided in textbooks and acknowledge the substantial role of spoken MWUs in
listening input. Furthermore, learners should be mindful of the different register variations associated with MWU use, as they might
inadvertently use spoken MWUs in writing and vice versa due to a lack of awareness. This is supported by Granger (2019) who
observed that L2 learners are often found to use spoken-like bundles in academic writing due to their lack of awareness of register
differences. Educators should caution learners about these differences to enhance their grasp of appropriate MWU usage.
Finally, our findings regarding MWU function suggest EFL teachers should be trained in recognizing the functions of MWUs in
textbooks, noting differences between listening and reading inputs. This understanding aids in structuring materials for balanced MWU
exposure, especially for conversational functions, which may require supplemental input for learners’ improvement.


8. Conclusion

This paper aimed to show which MWUs Vietnamese tertiary-level EFL students have frequent exposure to in the reading and
listening input presented in the English textbooks. Given the prominence of MWUs in real-life spoken and written communication,
understanding MWU use and the variation inherent within such use across registers is not just an academic exercise, but a pedagogical
imperative. The insights gleaned from our research not only contribute to the broader academic discourse on second language
acquisition but also have the potential to reshape the design and delivery of language instruction where MWUs are concerned.
This study acknowledges several limitations. First, with textbook use often involving repeated listening and rereading, the assumed
frequency of MWUs based on a single corpus pass might underrepresent actual exposures. The study does not consider the teacher’s
influence on ELT textbook use, a known factor in vocabulary teaching success (Folse, 2004). It also assumes the primary source of
vocabulary learning exposure is the textbook corpus, not factoring in other materials or resources teachers might use. Furthermore,
this research presumes each learner completes all textbooks and their activities, but direct and repeated MWU exposure to these
learners remains speculative. We have not conducted an experimental design to test MWU exposure effects, so findings should be
viewed as correlational, not causative.
Another limitation is the normalized frequency cut-off; only MWUs occurring above 20 times per million were analyzed. This
potentially excludes other MWUs present in the textbooks. The identified MWUs must be considered cautiously due to the
sub-corpora’s small size and the exclusion of bigrams. The extraction method does not capture all MWU variation or identify discontinuous
n-grams. Furthermore, the study primarily assessed MWU occurrence rate rather than total tokens. For example, a four-token-long MWU
is counted once, not accounting for individual tokens. Bigrams were excluded for manageability, and the method missed some
formulaic MWU sequences. The textbook corpus, at under half a million tokens, is modest in size and might not represent real-life
discourse due to editorial choices to avoid repetition. Lastly, the study’s focus was on MWU occurrence rate in the textbook corpora
rather than total tokens. While frequency and coverage overlap for single-word items, they differ for MWUs. Given MWU
variability and overlap (e.g., “exactly the same as”), the study opted to investigate MWU occurrence rate for items deemed
pedagogically significant.
Notwithstanding these limitations, the study has its own value in focusing on the specific context of Vietnam where the learners’
awareness of the pervasiveness of MWUs in L1 language use and the importance of MWUs in L2 teaching and learning is currently
limited. This research also highlights areas warranting further exploration. Future studies inspired by this work could focus on
replication to deepen our understanding of English MWUs in textbook input. Replication studies, whether involving similar or different
learner profiles, are crucial. It would be valuable to examine learners from diverse L1 backgrounds, such as Indonesian, or from
different contexts like ESL. Furthermore, comparing the structural types of MWUs can provide more insight into the disparities
between reading and listening input.

CRediT authorship contribution statement

Hien Hoang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original
draft, Writing – review & editing. Peter Crosthwaite: Conceptualization, Data curation, Funding acquisition, Project administration,
Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.system.2024.103224.

References

Anthony, L. (2018). AntConc (Version 3.5.7) [Computer software]. Waseda University.
Appel, R. (2022). Lexical bundles in L2 English academic texts: Relationships with holistic assessments of writing quality. System.
https://doi.org/10.1016/j.system.2022.102899
Barlow, M. (2004). Collocate 1.0: Locating collocations and terminology. Athelstan.
Biber, D. (2006). Stance in spoken and written university registers. Journal of English for Academic Purposes, 5(2), 97–116.
Biber, D., & Barbieri, F. (2007). Lexical bundles in university spoken and written registers. English for Specific Purposes, 26(3), 263–286.
https://doi.org/10.1016/j.esp.2006.08.003
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at …: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405.
https://doi.org/10.1093/applin/25.3.371
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman grammar of spoken and written English. Longman.
Biber, D., & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H. Hasselgard, & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig
Johansson (pp. 181–190). Rodopi, Amsterdam.
Boers, F. (2021). Evaluating second language vocabulary and grammar instruction. Routledge.
Boers, F., Eyckmans, J., Kappel, J., Stengers, H., & Demecheleer, M. (2006). Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test.
Language Teaching Research, 10(3), 245–261. https://doi.org/10.1191/1362168806lr195oa
Boers, F., & Lindstromberg, S. (2012). Experimental and intervention studies on formulaic sequences in a second language. Annual Review of Applied Linguistics, 32,
83–110. https://doi.org/10.1017/S0267190512000050
Boers, F., Lindstromberg, S., & Eyckmans, J. (2012). Are alliterative word combinations comparatively easy to remember for adult learners? RELC Journal, 43,
127–135. https://doi.org/10.1177/0033688212439997


Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge University Press.
Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2),
139–173.
Bui, T. B. T. (2016). A review of recent instructional interventions to raise learners’ awareness of multiword units in English and some reflections on Vietnamese
context. Journal of Science of HNUE, 61(11), 136–142.
Chen, L. (2010). An investigation of lexical bundles in ESP textbooks and electrical engineering introductory textbooks. Perspectives on formulaic language: Acquisition
and communication, 107–125.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Erlbaum Associates.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes, 23(4), 397–423.
Coxhead, A., Yen Dang, T. N., & Mukai, S. (2017). Single and multi-word unit vocabulary in university tutorials and laboratories: Evidence from corpora and
textbooks. Journal of English for Academic Purposes, 30, 66–78. https://doi.org/10.1016/j.jeap.2017.11.001
Cunningham, G., Bell, J., & Redston, C. (2009). Face2Face advanced student’s book. Cambridge University Press.
Cunningham, S., & Moor, P. (2003). New Cutting Edge advanced student’s book. Pearson.
Cunningham, S., & Moor, P. (2005). New Cutting Edge intermediate, student’s book. Pearson.
Cunningham, S., & Moor, P. (2006). New Cutting Edge pre-intermediate student’s book. Pearson.
Cunningham, S., & Moor, P. (2007). New Cutting Edge upper-intermediate, student’s book. Pearson.
Ellis, N. C., Simpson-Vlach, R. I. T. A., & Maynard, C. (2008). Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and
TESOL. Tesol Quarterly, 42(3), 375–396.
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text-Interdisciplinary Journal for the Study of Discourse, 20(1), 29–62.
Falla, T., & Davies, P. A. (2012a). Solutions intermediate student’s book. Oxford University Press.
Falla, T., & Davies, P. A. (2012b). Solutions pre-intermediate student’s book. Oxford University Press.
Falla, T., & Davies, P. A. (2013a). Solutions advanced student’s book. Oxford University Press.
Falla, T., & Davies, P. A. (2013b). Solutions upper-intermediate student’s book. Oxford University Press.
Folse, K. S. (2004). Myths about teaching and learning second language vocabulary: What recent research says. TESL Reporter, 37(2), 1–13.
Foster, P. (2001). Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In M. Bygate, P. Skehan, &
M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching, and testing (pp. 75–93). Longman.
Granger, S. (2017). Learner corpora in foreign language education. In Language, education, and technology (pp. 427–440). Springer International Publishing.
https://doi.org/10.1007/978-3-319-02237-6_33
Granger, S. (2019). Formulaic sequences in learner corpora. In Understanding formulaic language (1st ed., pp. 228–247). Routledge.
https://doi.org/10.4324/9781315206615-13
Gries, S. T. (2019). 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal
of Corpus Linguistics, 24(3), 385–412.
Gries, S. T. (2022). What do (some of) our association measures measure (most)? Association? Journal of Second Language Studies, 5(1), 1–33.
Hsu, J. Y. (2006). An analysis of the multiword lexical units in contemporary ELT textbooks. Online Submission.
Hsu, W. (2014). The most frequent opaque formulaic sequences in English-medium college textbooks. System, 47, 146–161.
Hsu, W. (2015). The most frequent formulaic sequences in college engineering textbooks. Corpus Linguistics Research, 1, 109–132.
Huang, K. (2014). A corpus study of Chinese EFL majors’ phraseological performance (Doctoral dissertation, University of Hong Kong). HKU Theses Online (HKUTO).
http://hub.hku.hk/handle/10722/219900.
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27(1), 4–21. https://doi.org/10.1016/j.esp.2007.06.001
Hyland, K. (2012). Bundles in academic discourse. Annual Review of Applied Linguistics, 32, 150–169. https://doi.org/10.1017/S0267190512000037
Jordan, G., & Gray, H. (2019). We need to talk about coursebooks. ELT Journal, 73(4), 438–446.
Keshavarz, M. H., & Salimi, H. (2007). Collocational competence and cloze test performance: A study of Iranian EFL learners. International Journal of Applied
Linguistics, 17(1), 81–92. https://doi.org/10.1111/j.1473-4192.2007.00134.x
Koprowski, M. (2005). Investigating the usefulness of lexical phrases in contemporary coursebooks. ELT Journal, 59(4), 322–332.
Koya, T. (2004). Collocation research based on corpora collected from secondary school textbooks in Japan and in the UK. Dialogue, 3, 7–18.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of learners’ English. Language Learning, 61(2), 647–672.
Lewis, M. (1993). The lexical approach: The state of ELT and a way forward. Language Teaching Publications.
McAleese, P. (2013, October). Investigating multi-word items in a contemporary ELT course book. In JALT2012 conference proceedings. Tokyo, Japan.
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual
Review of Applied Linguistics, 39, 74–92.
Meunier, F. (2012). Formulaic language and language teaching. Annual Review of Applied Linguistics, 32.
MOET. (2008). Decision No. 1400/QĐ-TTg: ‘Teaching and learning foreign languages in the national education system, period 2008 to 2020’.
https://www.jpf.org.vn/iwtcore/uploads/2012/08/1-3Decision_1400_QD-TTg-Eng.pdf
Nguyen, T. T. M. (2007). Textbook evaluation: The case of English textbooks currently in use in Vietnam’s upper-secondary schools. SEAMEO Regional Language Centre.
Nguyen, C. D. (2021). Lexical features of reading passages in English-language textbooks for Vietnamese high-school students: Do they foster both content and
vocabulary gain? RELC Journal, 52(3), 509–522.
Nguyen, T. M. H., & Coxhead, A. (2022). Evaluating multiword unit word lists for academic purposes. ITL - International Journal of Applied Linguistics, 115, 1–14.
https://doi.org/10.1075/itl.21041.ng
Nguyen, X. N. C. M., & Nguyen, V. H. (2019). Language education policy in Vietnam. In The Routledge international handbook of language education policy in Asia (pp.
185–201). Routledge.
Nguyen, T. M. H., & Webb, S. (2017). Examining second language receptive knowledge of collocation and factors that affect learning. Language Teaching Research,
21(3), 298–320.
Northbrook, J., & Conklin, K. (2018). “What are you talking about?”: An analysis of lexical bundles in Japanese junior high school textbooks. International Journal of
Corpus Linguistics, 23(3), 311–334.
O’Loughlin, R. (2012). Tuning in to vocabulary frequency in coursebooks. RELC Journal: Journal of Language Teaching and Research, 43(2), 255–269.
Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh University Press.
Ohlrogge, A. (2009). Formulaic expressions in intermediate EFL writing assessment. Formulaic Language, 2, 375–386.
Rahmat, Y. N., & Coxhead, A. (2021). Investigating vocabulary coverage and load in an Indonesian EFL textbook series. Indonesian Journal of Applied Linguistics, 10(3),
804–814. https://doi.org/10.17509/ijal.v10i3.31768
Redston, C., & Cunningham, G. (2005). Face2Face pre-intermediate student’s book. Cambridge University Press.
Redston, C., & Cunningham, G. (2006). Face2Face intermediate student’s book. Cambridge University Press.
Redston, C., & Cunningham, G. (2007). Face2Face upper intermediate student’s book. Cambridge University Press.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 129–158.
Serrano, R., Stengers, H., & Housen, A. (2015). Acquisition of formulaic sequences in intensive and regular EFL programmes. Language Teaching Research, 19(1),
89–106.
Shin, D., Lee, J. H., & Choi, W. (2023). An exploratory study of young EFL learners’ aural and written receptive multi-word unit knowledge. System, 114, Article
103029.
Shin, D., & Nation, P. (2007). Beyond single words: The most frequent collocations in spoken English. ELT Journal, 62(4), 339–348.
https://doi.org/10.1093/elt/ccm091
Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list: New methods in phraseology research. Applied Linguistics, 31(4), 487–512.
https://doi.org/10.1093/applin/amp058
Siyanova-Chanturia, A. (2015). Collocation in beginner learner writing: A longitudinal study. System, 53, 148–160.
Siyanova-Chanturia, A., & Pellicer-Sanchez, A. (2019). Understanding formulaic language: A second language acquisition perspective. Routledge.
Soars, L., & Soars, J. (2003a). New Headway advanced student’s book. Oxford University Press.
Soars, L., & Soars, J. (2003b). New Headway intermediate student’s book. Oxford University Press.
Soars, L., & Soars, J. (2005). New Headway upper-intermediate student’s book. Oxford University Press.
Soars, J., & Soars, L. (2007). New Headway pre-intermediate student’s book. Oxford University Press.
Sun, Y., & Dang, T. N. Y. (2020). Vocabulary in high-school EFL textbooks: Texts and learner knowledge. System, 93, Article 102279.
https://doi.org/10.1016/j.system.2020.102279
Tran, H. Q. (2012). An explorative study of idiom teaching for pre-service teachers of English. English Language Teaching, 5(12), 76–86.
Tran, H. Q. (2017). Figurative idiomatic competence: An analysis of EFL learners in Vietnam. Asian-focused ELT research and practice: Voices from the far edge (vol. 66).
Tsai, K. J. (2015). Profiling the collocation use in ELT textbooks and learner writing. Language Teaching Research, 19(6), 723–740.
Vu, D. V., & Peters, E. (2021). Vocabulary in English language learning, teaching, and testing in Vietnam: A review. Education Sciences, 11(9), 563.
Webb, S. (2019). The Routledge handbook of vocabulary studies (1st ed.). Routledge. https://doi.org/10.4324/9780429291586
Webb, S., Newton, J., & Chang, A. (2013). Incidental learning of collocation. Language Learning, 63(1), 91–120.
Wingrove, P. (2022). Academic lexical coverage in TED talks and academic lectures. English for Specific Purposes, 65, 79–94.
Wood, D. (2010). Formulaic language and second language speech fluency: Background, evidence, and classroom applications. Continuum.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge University Press.
Wulff, S. (2019). Acquisition of formulaic language from a usage-based perspective. In A. Siyanova-Chanturia & A. Pellicer-Sanchez (Eds.), Understanding formulaic
language: A second language acquisition perspective (pp. 19–37). Routledge.
Xuan Mai, L., & Thanh Thao, L. (2022). English language teaching pedagogical reforms in Vietnam: External factors in light of teachers’ backgrounds. Cogent
Education, 9(1), Article 2087457.
Yang, L., & Coxhead, A. (2020). A corpus-based study of vocabulary in the New Concept English textbook series. RELC Journal. Advance online publication.
https://doi.org/10.1177/0033688220964162

Thi Thu Hien Hoang is a research assistant at the University of Queensland and a former lecturer at Quy Nhon University, Vietnam. Her primary areas of interest are corpus
linguistics and instructed L2 acquisition, with a special focus on phraseology. She holds a PhD in applied linguistics from the University of Queensland, Australia.

Dr. Peter Crosthwaite is Associate Professor of Applied Linguistics at the University of Queensland. He is an expert in corpus linguistics, data-driven learning, and
computer-assisted language learning. He has published 50+ articles in top journals and is the editor-in-chief of the Australian Review of Applied Linguistics (from 2024).
