Ackermannand Chen 2013 Developing Academic Collocation List Authorsmanuscript
net/publication/259161085
Yu-Hua Chen
Coventry University
All content following this page was uploaded by Yu-Hua Chen on 16 October 2017.
approach
Highlights
Abstract
This article describes the development and evaluation of the Academic Collocation List (ACL),
which was compiled from the written curricular component of the Pearson International Corpus
of Academic English (PICAE) comprising over 25 million words. The development involved four
stages: (1) computational analysis; (2) refinement of the data-driven list based on quantitative
and qualitative parameters; (3) expert review; and (4) systematization. While taking
advantage of statistical information to help identify and prioritize the corpus-derived
collocational items that traditional manual examination is unable to manage, we argue that
only with human intervention can a data-driven collocation listing be of much pedagogical use.
Focusing on lexical collocations only, we present a new academic collocation list compiled using
a mixed-method approach of corpus statistics and expert judgement, consisting of the 2,468
most frequent and pedagogically relevant entries we believe can be immediately
operationalized by EAP teachers and students. By highlighting the most important
cross-disciplinary collocations, the ACL can help learners increase their collocational
competence and thus their proficiency in academic English. The ACL can also support EAP
teachers in their lesson planning and provide a research tool for investigating academic
language development.
1 Introduction
The Academic Word List (Coxhead, 2000) is arguably the most widely used EAP word
list today that takes corpus frequency into account. As research interest in corpus
linguistics has gradually shifted towards word co-occurrence rather than single words (see
Granger and Meunier, 2008; Schmitt, 2004; Stubbs, 2001; Wray, 2008), there have been
more investigations of recurrent word combinations in academic prose using frequency and
dispersion parameters (e.g. Biber, Conrad & Cortes, 2004; Chen & Baker, 2010; Hyland, 2008).
To the best of our knowledge, however, corpus analysis was not applied to creating lists of
multi-word units for EAP pedagogy until recently when Durrant (2009) looked into the viability
of an academic collocation list and Simpson-Vlach and Ellis (2010) an academic formulas list
(AFL). Both lists present corpus-derived lexis across academic disciplines. The former covers
the most frequent two-word collocations retrieved using statistical information to determine
the strength of word co-occurrence while the latter combines automated extraction of recurring
word sequences and expert judgement to identify pedagogically useful formulaic sequences (3-,
the strength of word combinations described above, the traditional approach in collocation
research often relies on expert intuition to identify phraseological units and is hence also
known as the corpus-based approach if corpus data are used (see Granger & Paquot, 2008;
Howarth, 1998), collocation is generally seen as a continuum with varying degrees of arbitrary
restriction ranging from free combinations (e.g. write an essay), through restrictive
collocations (e.g. conduct/do research instead of make research), to frozen expressions (e.g.
generally speaking). The arbitrary restriction could be associated with semantic opaqueness,
can genuinely co-occur with any other word without restriction. Take write an essay for
example. In terms of syntax, if functioning as a transitive verb, write can only be followed by a
noun as an object. In terms of semantics, the literal meaning of write can only collocate with
what the action of write can produce such as essay, song or article. In other words, the
traditionally defined ‘free collocations’ are still very much restricted by their semantic and/or
syntactic environment. Hence, the boundaries between free combinations and restrictive
collocations are sometimes blurred and thus difficult to distinguish as both of them entail
arbitrary restriction to various extents. This also corresponds to the theory of ‘Lexical Priming’
proposed by Hoey (2005). Hoey argued that collocations offer a clue as to how language is
structured and that words are ‘primed’ for use through our repeated encounters with them so
that our knowledge of a word, including the contexts and co-texts in which it occurs, is the
Although collocations can be instantly recognized by native speakers, they often remain
difficult for learners to acquire and use properly. According to Nation (2001, p. 324)
is this feature that makes collocations challenging for L2 learners. Laufer (2011, pp. 30-31)
summarizes the research findings from error analyses, elicitation and corpus analyses as
follows: ‘the use of collocations is problematic for L2 learners, regardless of years of instruction
they received in L2, their native language, or type of task they are asked to perform.’
The productive use of collocations in particular poses great challenges to L2 learners. Biber and
Conrad (1999) found that words with similar meaning are often distinguished by their
preferred collocations, which reinforces the need for a high level of collocational competence if
speakers want to express themselves clearly and unambiguously. Bahns and Eldaw’s study
(1993), on the other hand, revealed that in a translation task the number of ‘collocation errors’
made by L2 speakers is twice as high as that of errors in single lexical items. Biskup (1992)
showed that learners use inappropriate synonyms when producing collocations, and Nesselhauf
(2005) points out that 50% of collocation errors are due to mother tongue interference. Finally,
Cobb (2003) found that even when learners use collocations correctly they over-rely on a small
number of collocations. Yet by using a less appropriate collocate, a non-native speaker will
sound unnatural or may even become unintelligible among speakers of the target language.
Hence if learners aim for advanced proficiency, achieving a high level of collocational
competence is essential.
The aforementioned research reveals the importance of collocations as well as the
challenges they pose to L2 learners, thus indicating the central role they should play in
language teaching and learning. Nation (2001, pp. 189-191) highlights the fact that academic
implicitly nor part of the technical lexicon which is likely to be explicitly taught as part of
subject courses’, which reinforces the need for such a listing. None of the existing EAP
vocabulary lists, however, has met this need. As Durrant himself points out (2009, p. 163), the
majority of the items in his list are grammatical collocations, i.e. one closed-class word (a.k.a.
function word) such as prepositions or determiners plus one open-class word (a.k.a. content
word) such as verbs or nouns (Benson, 1985). For example, the top five collocations in
Durrant’s listing (ibid, p. 166) - this study, associated with, based on, and respectively, due to
- do not appear to have attracted much research interest in conventional collocation studies,
Instead, in the traditional approach, lexical collocations (the combination of two open-class
components, e.g. perform an experiment) are usually the target phraseological units under
investigation (e.g. Granger, 1998; Laufer and Waldman, 2011; Nesselhauf, 2003). While it is
true that some of the grammatical collocations in Durrant’s listing could contribute to revealing
the patterns that might be overlooked otherwise, his listing, based on statistics alone, does not
In the current study, we therefore argue that only with human intervention can a data-
driven collocation listing be of much pedagogical use while still taking advantage of statistical
information to help identify and prioritize the corpus-derived collocational items that traditional
adopted by Simpson-Vlach and Ellis (2010), who also combined statistical information and
human judgement from EAP instructors when compiling the Academic Formulas List. The
difference is that in our study expert judgment is used not only for the selection of lexical
items for pedagogical purposes but also for the refinement for the final listing. It should be
noted that manual intervention is perhaps much more challenging when tackling collocations
than it is when listing formulas because the latter are rather fixed expressions (e.g. in terms of,
at the same time, from the point of) with little variation of individual components. Collocations,
on the other hand, often contain inflectional or positional variations (e.g. results obtained,
broader contexts, achieving objectives), which poses the great challenge of how to collate
these relevant forms and present them in a uniform and consistent way. This challenge can
only be met with human intervention as there is currently no automation that can simplify this
process.
In terms of disciplinary variability, Hyland and Tse (2007) cast some doubt on the
generalizability of the AWL and questioned the assumption of a universal single core
vocabulary list that can be applied to all fields of study. Although we also take a general
approach following the tradition of the AWL, a specified frequency and dispersion threshold at
least ensures that students, regardless of field of study, would be more likely to encounter
the lexical items on the list than those outside it. In addition, EAP lessons usually
follow a general syllabus to cater for students of all subjects; thus there is a need for such
academic collocation list (ACL), which consists of 2,468 most frequent (determined by corpus
analysis) and pedagogically relevant (determined by expert judgement) entries we believe can
2 Methodology
2.1 Corpus
The Academic Collocation List is derived from the written curricular component of the
Pearson International Corpus of Academic English (PICAE). The corpus comprises over 37
million words of academic written and spoken texts from five major English-speaking countries,
i.e. Australia, Canada, New Zealand, UK and USA. The corpus includes curricular English as
found in lectures, seminars, textbooks and journal papers. It also samples extracurricular
The written curricular component of the corpus, from which the ACL was compiled,
comprises 25.6 million words from journal articles and textbook chapters covering 28 academic
disciplines as listed in Table 1. Each of the four fields of study contains materials from seven
academic disciplines to ensure that the corpus is representative of the academic register. The
number of tokens per academic discipline as well as the total number of tokens per field of
study and its percentage are provided below. Whereas the main objective was to compile
subcorpora for each field of study of similar size, less emphasis was placed on having similar-
Table 1 Fields of study and academic disciplines represented in the written curricular component of the
corpus
The ACL was developed in four stages. First, a computational analysis of the written
curricular component was conducted. Second, manual refinement of the data-driven list based
on quantitative parameters and target part-of-speech combinations was carried out. This was
followed by an expert review to judge whether each collocation is pedagogically relevant and a
systematization process of the list. Each stage will be addressed in turn in the following
sections.
At this stage ‘collocation’ was defined as a single word that tends to co-occur in the
span of ±3 words from the reference word, co-occurring at least five times in total across at
least five different texts with a Mutual Information (MI) score of at least 3 and a t-score of at
least 2. The MI score indicates the strength of association between the components of the
collocation. The t-score, on the other hand, is a measure of certainty of a collocation, also
taking frequency into account. The former is more likely to give high scores to fixed phrases
whereas the latter will yield significant collocates that occur relatively frequently. According to
Hunston (2002, p. 75), a collocation with an MI score of at least 3 and a t-score of at least 2 is
The first step of the computational analysis was to obtain a list of content words in the
corpus using MonoConc Pro 2.2. Secondly, a list of node words, i.e. the words that occurred at
least five times per million words and in at least five different texts, was compiled. Function
words, proper nouns, personal names and non-words were removed manually from this list if
they occurred in high frequency. Words from the General Service List (West, 1953) were also
removed from the node word list but could appear as pre- or post-collocate.
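The node word criteria described above (at least five occurrences per million words, in at least five different texts, with listed general-service words removed) can be illustrated with a minimal sketch. It is not the procedure actually run with MonoConc Pro; the function name `select_node_words`, its parameters and the pre-tokenized input format are assumptions made for illustration only.

```python
from collections import Counter


def select_node_words(texts, min_per_million=5.0, min_texts=5, excluded=frozenset()):
    """Return candidate node words that meet frequency and dispersion thresholds.

    texts    -- list of token lists, one list per text (tokenization is assumed done)
    excluded -- words to drop from the node list (e.g. a general service list)
    All names here are hypothetical; they do not come from the study's software.
    """
    total_tokens = sum(len(tokens) for tokens in texts)
    if total_tokens == 0:
        return []

    freq = Counter()        # total occurrences of each word in the corpus
    text_count = Counter()  # number of different texts each word occurs in
    for tokens in texts:
        freq.update(tokens)
        text_count.update(set(tokens))

    nodes = []
    for word, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if (per_million >= min_per_million
                and text_count[word] >= min_texts
                and word not in excluded):
            nodes.append(word)
    return sorted(nodes)


# Toy corpus: five short texts plus one text containing a word seen only once.
toy_texts = [["empirical", "research", "the"]] * 5 + [["rare"]]
print(select_node_words(toy_texts, excluded={"the"}))
```

Words on the exclusion list are only removed as node words in this sketch; they remain available as collocates, in line with the description above.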
Next, a stop list which contained frequent function words that express little lexical
meaning was created, i.e. articles, pronouns, conjunctions and the preposition of¹. The stop list was
used by the collocation program, specifically written for this project, to exclude sequences
The list of node words was then used to extract potential collocations from the corpus.
In total, this data-driven list contains over 130,000 entries. Each entry includes the node word,
its collocate, the general position of the collocate (pre- or post-), the precise position the
collocate most often occurs in, the normed frequency per million words, MI score, t-score, the
number of texts the collocation occurs in, and normed rates of occurrence for each of the four
fields of study defined in the corpus, i.e. applied science & professions (AS), humanities (HM),
social sciences (SS), and natural/formal sciences (NS). Table 2 provides a sample output from
Collocation                 Position  Texts  Freq  Normed freq  Normed AS  Normed HM  Normed SS  Normed NS    MI  t-score
contribute xxx development        -3     17    25         1.15       1.28       0.42       2.35       0.20  4.80     4.82
empirical research                 1     51   108         4.85       6.14       2.72       8.67       0.80  6.81    10.30
well established                  -1    131   319        14.32      12.13      16.98      14.09      15.08  6.43    17.65
experimental study                 1     21    30         1.35       1.86       1.26       1.08       1.01  4.44     5.23
holistic approach                  1     26    45         2.02       2.28       0.63       3.43       1.41  8.65     6.69
profound implications             -1     16    22         0.99       0.71       0.21       1.99       1.01  8.58     4.68
provide information               -1    109   371        16.65      20.41      14.88      13.01      17.10  5.91    18.94
seem paradoxical                  -1      5     5         0.22       0.29       0.21       0.36       0.00  7.34     2.22
subordinate position               1     15    27         1.21       0.14       1.68       3.25       0.00  7.26     5.16
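The MI score and t-score used in the definition of 'collocation' can be illustrated with a short sketch. The formulas below are the widely used textbook formulations (MI as the base-2 logarithm of observed over expected co-occurrence frequency; t-score as observed minus expected, divided by the square root of observed). The text does not state which exact formulas the project's collocation program implemented, so they are an assumption here, as are all function names. Under that assumption, and taking the raw frequency and t-score of empirical research from the sample output, the implied MI comes out close to the tabulated value of 6.81.

```python
import math


def mi_score(observed, expected):
    """Mutual Information: log2 of observed over expected co-occurrence frequency."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)


def expected_from_t(observed, t):
    """Invert the t-score formula to recover the implied expected frequency."""
    return observed - t * math.sqrt(observed)


def is_candidate(observed, expected, n_texts,
                 min_freq=5, min_texts=5, min_mi=3.0, min_t=2.0):
    """Apply the initial thresholds: >=5 occurrences, >=5 texts, MI >= 3, t >= 2."""
    return (observed >= min_freq
            and n_texts >= min_texts
            and mi_score(observed, expected) >= min_mi
            and t_score(observed, expected) >= min_t)


# 'empirical research' in the sample output: 108 occurrences, 51 texts, t-score 10.30
# (reading the third numeric column as raw frequency is itself an assumption).
expected = expected_from_t(108, 10.30)
print(round(mi_score(108, expected), 2))
print(is_candidate(108, expected, n_texts=51))
```

Because the t-score grows roughly with the square root of the observed frequency, it favours frequent pairs, whereas MI rewards pairs whose components rarely occur apart; this is why the two measures are combined.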
¹ As the focus of the ACL is lexical collocation, the preposition ‘of’ was excluded because it tends to occur
in grammatical phrases only (e.g. of the).
Table 2 also highlights why further refinement was required as the entries may have a
very low normed frequency, e.g. seem paradoxical; may have a low distribution in certain
fields of study, e.g. subordinate position; or may be part of an extended phrase, e.g.
2.3 Refinement
This list of 130,000+ entries required further scrutiny in order to select and present
academic collocations in a more systematic and user-friendly way. This section will explain the
2.3.1 Filtering
As one of the main objectives of this project was to identify the most frequent
collocations across academic disciplines, quantitative values were first taken into consideration.
An explorative pilot investigation was conducted in search of the optimal combination of
cut-off points for MI score, t-score, frequency and distribution, with which the collocations could be
identified while unsuitable combinations could be filtered out. As a result the two principal
researchers agreed that only entries which met the following quantitative parameters would
undergo further analysis: (1) normed frequency ≥1 per million; (2) normed frequency ≥0.2
per million in each field of study; (3) MI score ≥3; and (4) t-score ≥4. The t-score threshold
was raised because it was found that entries with a t-score of less than 4 were mainly noun-
preposition combinations and fragments of extended phrases, which were not target
combinations. Once the entries were filtered using the quantitative parameters (1) to (4), the
resulting list was reduced to 16,174 entries. Despite a much more manageable data set², this
² Here the manageability refers to the data management which would be subject to human judgement at
a later stage as well as the learning load for students. Cf. The AWL consists of 570 word families and
approximately 3,000 words altogether while the AFL presents 200 formulas for the spoken and the
written registers respectively. Other comparable pedagogical vocabulary listings include the General
Service List with the most frequent 2,000 English word families (West, 1953), the Phrasal Expression List
(Martinez and Schmitt, 2012) with 505 entries, and only the top 100 key academic collocations – mostly
grammatical ones – reported in Durrant’s study (2009).
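The four quantitative parameters of the filtering stage can be expressed as a simple predicate. The sketch below is illustrative only: the `Entry` class and `passes_filter` function are hypothetical names, and the three test rows are taken from the sample output in Table 2 on the assumption that its four field columns are ordered AS, HM, SS, NS.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    collocation: str
    freq_pm: float        # overall normed frequency per million words
    field_freq_pm: dict   # normed frequency per field of study (AS, HM, SS, NS)
    mi: float             # Mutual Information score
    t: float              # t-score


def passes_filter(entry, min_freq=1.0, min_field_freq=0.2, min_mi=3.0, min_t=4.0):
    """Parameters (1) to (4): overall frequency, per-field frequency, MI and t-score."""
    return (entry.freq_pm >= min_freq
            and all(v >= min_field_freq for v in entry.field_freq_pm.values())
            and entry.mi >= min_mi
            and entry.t >= min_t)


sample = [
    Entry("empirical research", 4.85,
          {"AS": 6.14, "HM": 2.72, "SS": 8.67, "NS": 0.80}, 6.81, 10.30),
    Entry("seem paradoxical", 0.22,
          {"AS": 0.29, "HM": 0.21, "SS": 0.36, "NS": 0.00}, 7.34, 2.22),
    Entry("subordinate position", 1.21,
          {"AS": 0.14, "HM": 1.68, "SS": 3.25, "NS": 0.00}, 7.26, 5.16),
]

kept = [e.collocation for e in sample if passes_filter(e)]
print(kept)
```

In this sample, seem paradoxical fails on overall frequency and t-score, and subordinate position fails on the per-field minimum, which mirrors the reasons given for further refinement.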
2.3.2 POS-tagging
At this stage, it was decided to apply part-of-speech tagging to each entry to facilitate
the extraction of collocations with specific word-class combinations. Lexical collocations that
fall into the following four types of part-of-speech (POS) combinations are the major targets of
our subsequent investigation: verb+noun (e.g. gather data), adjective+noun (e.g. systematic
affect). This conforms to the literature of conventional corpus-based collocation research, e.g.
verb+noun combinations investigated by Altenberg and Granger (2001), Laufer and Waldman
target POS combinations may meet all the quantitative criteria but are excluded from the list
because they are of little pedagogical value. For example, standard deviation is highly frequent
with high MI and t-scores, but this word combination is often considered a compound noun
without any room for commutability and listed as an independent entry in many dictionaries
we consulted. By the same token, frozen expressions such as generally speaking are excluded
other words, only free and restricted combinations in the traditional phraseological sense as
discussed in the collocation continuum above will be subject to expert judgement as these
combinations are most challenging for learners due to their varying degrees of arbitrary
restriction or substitutability.
In order to only subject collocations with the target POS combinations to further
analysis, the list was tagged using Apache OpenNLP v1.5.0 applying a simplified set of POS
tags, i.e. noun, verb, modal verb, adverb, adjective, preposition, determiner, pronoun,
conjunction. Although the entries that were tagged lacked context, the tagging was rather
accurate, and only about 10 per cent of POS tags had to be corrected manually. Entries with
most non-target POS combinations, for example, determiner+noun (e.g. some historians),
excluded, whereas noun+noun combinations (e.g. information retrieval, problem area) and
noun+adjective (e.g. evidence available) were kept for manual review as they appeared
valuable from a pedagogical point of view although they do not fall into the four major target
categories of POS combinations. The filtered list now contained 6,808 entries. These entries
were first manually vetted by the principal researchers before being reviewed by experts as
described below.
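The routing of tagged entries by POS combination can be sketched as follows. This is a simplified illustration rather than the project's tooling: the tags are assigned by hand instead of by Apache OpenNLP, the names `TARGET`, `REVIEW` and `route` are hypothetical, and only the two target combinations that are named explicitly in the text (verb+noun and adjective+noun) are listed, so the target set is deliberately incomplete.

```python
# Target combinations named explicitly in the text; the remaining target
# types are not filled in here and would need to be added.
TARGET = {("verb", "noun"), ("adjective", "noun")}

# Non-target combinations that were nevertheless kept for manual review.
REVIEW = {("noun", "noun"), ("noun", "adjective")}


def route(tags):
    """Classify a two-word entry by its (first, second) simplified POS tags."""
    if tags in TARGET:
        return "target"
    if tags in REVIEW:
        return "review"
    return "exclude"


# Hand-tagged examples taken from the text (tags are illustrative assumptions).
examples = {
    "gather data": ("verb", "noun"),
    "information retrieval": ("noun", "noun"),
    "evidence available": ("noun", "adjective"),
    "some historians": ("determiner", "noun"),
}

for phrase, tags in examples.items():
    print(phrase, "->", route(tags))
```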
The 6,808 entries underwent a qualitative review in which each entry was assessed
independently by the two researchers to determine whether a specific entry should be included,
discussed or excluded from further analysis. The objective of this stage was to further refine
3. Combinations with adverbs referring to time or frequency (i.e. already, now, often)
The independent judgements were then compared and entries where there was no
agreement or that were marked as ‘discuss’ were reassessed. The discussion was mainly
related to entries where there was ambiguity in relation to the degree of fixedness, technical
specificity or semantic transparency of an entry. At this stage it was decided to opt for
After excluding all combinations tagged as ‘exclude’ by both researchers, the remaining
The purpose of the expert review was to judge whether all 4,558 entries, which met the
aforementioned quantitative and qualitative criteria, should be included in the final list from a
³ This combination takes one noun before ‘based’, e.g. ‘task based approach’.
⁴ The degree of fixedness was determined by consulting several popular online dictionaries including the
Longman Dictionary of Contemporary English (http://www.ldoceonline.com/) or the Cambridge Dictionary
(http://dictionary.cambridge.org/) to see whether the word combinations under investigation are listed
as independent entries.
pedagogical point of view, i.e. appropriateness and relevance of each entry to the field of EAP.
The panel consisted of six experts from different professional backgrounds as below, and the
rationale of having theoretically oriented as well as practically oriented experts on the panel
A detailed written brief that outlined the scope and objectives of the project in general
and the aims of the ACL in particular was reviewed and approved by a university EAP lecturer
and a publisher from the panel before being sent to all the panel experts. The questions and
rating scale given to the experts were deliberately formulated in broad terms to allow for the
incorporation of individual viewpoints deriving from the varied backgrounds. However, experts
were instructed to contact the two principal researchers for any clarification before
commencing judgement and were encouraged to comment on their own ratings. Each expert
was asked to make an independent judgement based on the following questions: (1) Is it
appropriate to regard the entry as a collocation for teaching and/or learning purposes? (2) Is
the collocation pedagogically relevant? The following four-point Likert scale was used for the
judgement:
1 = definitely exclude
4 = definitely include
Each entry contained the following statistical information: overall normed frequency,
normed frequency in each field of study, MI score, and t-score. Experts could use the statistical
The inter-rater reliability was overall moderate (Intraclass Correlation Coefficient 0.524).
Experts⁵ agreed to definitely include 1,215 collocations (27%). The moderate agreement is
probably not surprising as the reviewers were intentionally chosen from heterogeneous
backgrounds so that they could provide feedback from different perspectives. If the sum of the
expert judgements was less than or equal to 9, the entry was excluded from the final list. In total
It has to be noted that the comments from the panel also significantly contributed to
the follow-up systematization process, which was undertaken by the principal researchers. This
2.5 Systematization
As more than one panel member suggested systematizing the entries to provide a
listing that would be more readily accessible for users, the researchers decided to take the
appropriate
ii. Changing nouns in plural to singular unless the noun is a plurale tantum or
II. Harmonizing entries that appear in British and American English, with British English
III. Adding definite or indefinite articles to verb+noun collocations in line with dictionary
IV. Adding optional copula be to adverb+verb past participle combinations if they can be
⁵ The ratings of one expert had to be disregarded as they were incomplete.
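The exclusion rule applied after the expert review (an entry is dropped if the sum of the expert judgements is less than or equal to 9) can be sketched as below. The sketch assumes five usable ratings per entry on the four-point scale, since one of the six experts' ratings was disregarded; the function name `keep_entry` is hypothetical.

```python
def keep_entry(ratings, cutoff=9):
    """Return True if the summed expert ratings exceed the cutoff.

    ratings -- one rating per expert on the four-point scale
               (1 = definitely exclude ... 4 = definitely include)
    """
    ratings = list(ratings)
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    return sum(ratings) > cutoff


print(keep_entry([4, 4, 4, 4, 4]))  # all five experts chose 'definitely include'
print(keep_entry([2, 2, 2, 2, 1]))  # sum of 9, so the entry is excluded
```

With five raters the summed score ranges from 5 to 20, so a cutoff of 9 removes entries whose average rating falls below 2.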
In order to make informed decisions, the researchers consulted concordance lines from
the corpus and referred back to the data-driven list. The pros and cons regarding points III to
3 Results
3.1 Composition
above, the academic collocation list with 2,468 entries was completed. The number of entries
in various target POS combinations and examples in each category are presented in Table 3.
As can be seen, noun combinations form the largest category, comprising
nearly three quarters of the total entries (74.3%, n=1,835). The second largest category are
Adverb+adjective combinations are comparatively fewer but still cover 5.0% (n=124) of the list.
A representative selection of adjective+noun and noun+noun combinations in the ACL
is listed in Table 4. These are the entries that received the highest expert agreement (see
Section 2.4) indicating almost unanimous agreement among experts that these collocations
Table 4 Adjective+noun and noun+noun collocations with the highest combined score
No.  Adjective  Noun  Normed freq per million  No of texts  Normed AS  Normed HM  Normed SS  Normed NS  MI  StDev*  t-score  StDev
1 academic writing 16.74 23 5.28 68.12 1.81 0.20 8.32 19.25
2 brief overview 2.20 38 2.43 1.68 2.71 1.81 9.85 6.99
3 causal link 1.53 16 3.14 0.63 1.45 0.20 8.09 5.81
4 conflicting interests 1.21 13 1.00 0.21 3.07 0.40 8.47 5.18
5 conventional wisdom 3.72 38 2.86 1.26 9.39 1.01 10.99 9.11
6 crucial factor 1.80 27 2.00 1.89 2.17 1.01 7.32 6.29
7 crucial importance 2.06 35 1.71 2.10 2.71 1.81 6.99 6.73
8 crucial role 4.89 71 3.43 7.55 5.42 3.82 7.11 10.36
9 cultural heritage 2.51 30 2.00 1.05 3.97 3.02 8.09 7.46
10 (a) deep understanding (of) 2.74 43 2.00 3.14 2.71 3.42 8.38 7.79
11 disposable income 1.62 16 3.14 0.21 1.81 0.60 12.66 6.00
12 dividing line 1.30 22 1.28 1.26 2.35 0.20 8.34 5.37
13 domestic violence 13.24 22 22.70 0.63 23.84 0.20 9.72 17.16
14 due process 5.39 26 9.56 1.89 6.14 2.01 5.55 10.72
15 economic conditions 4.76 50 6.00 1.47 8.49 2.01 5.35 10.04
16 economic growth 15.21 60 27.70 2.93 11.92 13.07 7.37 18.30
17 economic power 5.43 49 4.43 3.14 10.84 3.02 4.51 10.52
18 educational institution 3.90 50 4.71 3.14 5.60 1.61 7.15 1.20 5.88 4.13
19 environmental factors 8.26 46 10.14 0.84 12.10 8.45 5.80 2.66 7.86 7.61
20 environmental protection 4.62 30 5.00 1.47 1.99 10.06 7.51 10.09
21 equal opportunity 8.39 30 14.85 0.84 13.91 0.40 9.01 13.65
22 ethnic minority 18.62 55 23.70 5.66 38.29 2.01 10.40 0.46 14.38 0.86
23 federal government 12.25 47 24.13 3.56 14.09 1.81 9.21 16.49
24 final stage 5.39 72 5.85 3.56 5.06 6.84 7.11 0.09 7.51 2.32
25 financial resources 4.22 48 7.57 2.31 3.97 1.61 7.30 9.63
26 financial support 5.97 59 8.71 2.72 7.59 3.42 7.09 11.45
27 first generation 5.56 51 6.42 5.66 4.34 5.63 5.77 10.93
28 foreign investment 4.40 23 8.28 1.68 2.71 3.42 9.00 9.88
29 foreign policy 14.05 59 12.85 11.11 28.54 2.41 7.10 1.37 10.92 8.47
30 full range 6.37 68 6.85 3.14 10.30 4.42 6.18 11.75
31 further information 9.92 52 15.13 6.92 7.77 7.84 4.96 14.39
32 further research 6.10 67 6.42 9.64 4.52 4.02 4.39 11.11
33 high profile 4.67 45 5.00 3.77 7.77 1.61 7.34 10.14
34 higher education 26.52 72 33.55 18.86 31.07 18.91 8.44 24.24
35 individual differences 7.05 41 7.28 2.10 15.53 2.01 5.97 12.33
36 infinite number 3.90 39 3.43 3.77 0.54 8.45 7.45 9.27
37 intellectual property 6.87 25 15.13 2.31 0.72 6.44 9.49 12.35
38 interpersonal skills 5.25 14 14.28 1.47 1.63 0.20 9.99 10.81
39 key element 8.30 86 13.28 2.93 11.56 2.82 6.12 0.66 9.47 0.04
40 key factor 7.58 86 10.56 1.47 9.21 7.44 6.00 0.60 9.04 0.05
41 living conditions 4.49 35 7.00 1.05 6.86 1.61 6.72 9.90
42 local government 23.78 88 35.40 10.48 39.02 3.22 7.39 0.64 15.79 4.86
43 low income 7.63 44 11.14 1.47 10.84 5.03 8.13 12.99
44 medical treatment 6.51 27 16.70 0.42 3.25 1.61 7.55 11.98
45 mental health 59.64 50 84.37 2.10 130.24 1.41 9.26 36.40
46 mental illness 8.71 38 10.85 3.77 17.70 0.40 9.15 13.90
47 minimum standard 3.41 28 7.14 0.84 3.25 0.80 6.70 2.39 5.63
48 national identity 10.99 49 5.14 6.71 27.28 5.23 6.78 15.51
49 native speaker 5.56 46 3.00 18.65 2.17 0.40 9.97 0.94 7.74 2.01
50 natural resources 11.62 63 17.56 3.77 6.32 16.69 7.33 0.12 10.83 4.62
51 natural world 6.51 46 1.86 11.32 11.92 2.41 5.29 11.73
52 next generation 5.03 55 6.28 3.35 3.97 6.03 7.74 10.53
53 nuclear power 6.01 29 1.71 4.61 4.34 15.29 7.47 11.51
54 nuclear weapon 4.13 30 4.85 1.68 6.50 2.82 11.84 9.59
55 ongoing debate 1.80 21 1.71 1.05 3.97 0.20 8.94 6.31
56 online database 1.21 6 0.29 0.21 4.15 0.20 8.28 5.18
57 paid employment 4.53 22 2.43 1.47 13.01 1.01 8.70 10.03
58 physical activity 5.92 33 9.28 2.31 9.57 0.60 5.04 1.51 7.34 4.31
59 physical properties 11.94 38 2.14 4.19 0.90 45.45 8.43 16.26
60 political economy 18.58 47 22.27 7.34 33.96 7.04 7.85 20.26
61 political institution 6.33 35 2.28 4.40 18.42 0.40 6.06 11.70
62 political party 9.83 50 5.00 3.77 28.00 2.21 7.19 14.70
63 popular culture 52.59 51 13.42 21.17 175.22 1.41 9.09 34.17
64 positive feedback 4.58 27 1.57 0.84 1.26 16.09 7.85 10.06
65 prior knowledge 3.32 33 3.57 2.72 1.99 5.03 6.59 8.51
66 private sector 26.93 83 50.11 3.35 36.13 6.64 9.47 0.68 15.90 9.63
67 public sector 12.97 46 26.70 1.47 14.99 2.41 7.73 16.92
68 public transport 8.26 33 17.13 1.68 7.59 2.82 6.47 13.41
69 qualitative analysis 4.35 45 3.00 6.08 4.15 4.83 7.42 0.06 6.26 4.18
70 racial discrimination 12.21 26 24.13 0.84 17.70 0.20 9.98 16.48
71 random sample 2.33 22 3.71 1.47 3.25 0.20 8.25 7.19
72 raw data 6.37 46 4.57 9.43 11.20 0.60 9.26 11.90
73 religious belief 9.24 71 6.00 15.51 15.72 0.60 8.56 0.99 9.93 2.78
74 renewable energy 2.78 14 3.57 0.21 0.36 6.84 10.30 7.87
75 ruling class 6.24 25 3.28 5.03 16.08 0.60 9.27 11.77
76 scientific research 5.79 54 5.85 4.61 7.23 5.23 5.75 11.15
77 significant impact 4.67 57 8.71 1.89 4.70 1.61 6.24 10.06
78 small proportion 3.63 53 3.71 1.68 3.61 5.43 7.29 8.94
79 social factors 12.12 57 12.99 16.56 15.17 3.22 5.01 15.92
80 social mobility 4.13 28 0.71 2.10 13.19 0.80 6.59 9.49
81 social status 8.39 58 3.14 17.82 13.91 0.60 4.82 13.19
82 solar energy 3.19 16 1.14 1.05 0.72 10.86 7.97 8.39
83 stark contrast 1.26 25 1.71 1.26 1.63 0.20 10.17 5.29
84 varying degree 6.51 85 6.57 5.87 8.85 4.42 8.68 4.76 7.00 6.82
85 whole range 8.08 64 8.85 6.29 12.46 3.82 6.52 13.27
86 wide range 48.29 258 49.39 32.91 54.37 54.71 8.78 1.56 21.02 13.74
Noun Noun
87 background knowledge 1.93 24 1.28 5.24 0.36 1.41 5.87 6.45
88 class consciousness 3.46 19 0.29 3.98 9.94 6.84 8.70
89 conflict resolution 2.15 20 2.86 0.21 3.79 1.21 7.65 6.89
90 data set 8.48 49 5.00 5.03 8.31 16.89 4.75 13.24
91 source material 3.63 35 0.71 11.32 2.17 2.01 5.38 8.78
* The value of standard deviation is added where different forms of collocations are combined, e.g. educational institution/institutions. It reflects the difference
between the two MI scores and t-scores respectively for combined entries.
In terms of verb combinations, again only those which received the greatest expert
agreement are listed here. Verb+noun and verb+adjective collocations are presented in Table
nominal collocations, verb collocations appear to have received much more research interest in
the past, particularly verb+noun collocations in second language learning. The assumption
appears to be that learners tend to encounter more difficulties in choosing verb collocates
correctly than any other type of collocation. Very little research has addressed other POS
combinations, particularly noun combinations although they dominate the academic register.
This discrepancy, therefore, may require further research into learners’ collocational use to find
out if other types of collocations, in comparison with verb+noun collocations, also pose
challenges to learners.
The final category contains adverb+adjective combinations. Those with the highest
expert agreement are presented in Table 7. It is interesting to note that from Tables 4 to 7,
there appears to be a range of recurring words, many of which originate from the same word
adverb in the ACL highly for example. It collocates with 20 different adjectives (e.g.
sophisticated, complex, critical) and six verb past participles (e.g. educated, charged,
developed). As for its adjective form high, it collocates with 22 nouns (e.g. level, profile) while
its comparative form higher collocates with two nouns (education, degree).
The implication for EAP pedagogy at the lexico-grammatical level is that despite a
seemingly large number of collocations in the ACL, the actual teaching and learning load
should be manageable as many of the words in the ACL are often part of a ‘recurrent frame’,
and understanding the frame will give learners a sense of familiarity when encountering new
collocations within the same or a similar frame. Therefore, one possible way of presenting the
entries in the classroom would be to show the frame, if any, of a node word, e.g. (be) highly +
vpp, and introduce the six verb past participles from the ACL together.
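The idea of presenting entries by recurrent frame can be sketched as a simple grouping of two-word entries by their node word. This is only an illustration: the function name is hypothetical, the sample entries are the ones quoted above, and entries with optional words such as (be) would need extra handling.

```python
from collections import defaultdict


def group_by_node(entries: list[str]) -> dict[str, list[str]]:
    """Group two-word entries by their first word (treated here as the node)."""
    frames: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        node, _, collocate = entry.partition(" ")
        frames[node].append(collocate)
    return dict(frames)


# Sample entries taken from the examples in the text above.
sample = [
    "highly sophisticated", "highly complex", "highly educated",
    "high level", "high profile",
    "higher education", "higher degree",
]
frames = group_by_node(sample)
# frames["highly"] lists the collocates that share the frame "highly + X"
```

A teacher could then introduce all collocates of one node together, which is the presentation suggested above.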
Table 5 Verb+noun and verb+adjective collocations with the highest combined score
                            Normed
                            freq per  No of  Normed  Normed  Normed  Normed
No  Collocation             million   texts  AS      HM      SS      NS      MI     StDev  t-score  StDev
Verb Noun
1   achieve (a) goal        6.96      85     9.85    5.66    8.31    2.61    8.98   0.49   6.68     3.21
2   achieve (an) objective  6.19      73     13.56   3.56    3.79    1.01    7.92   0.94   6.64     1.54
3   cast doubt              1.75      32     0.71    3.35    2.35    1.01    10.07         6.24
4   make (a) living         5.07      40     1.71    1.89    2.71    15.49   4.44   1.33   5.44     2.88
5   make (a) prediction     3.99      61     2.71    2.31    5.42    5.83    5.35   1.61   18.31    2.17
6   meet criteria           3.28      53     5.00    2.72    2.35    2.41    6.95   1.12   4.01     1.62
7   meet (a) requirement    6.60      93     12.13   3.98    3.61    4.63    7.30   1.47   4.32     2.59
8   obtain (a) result       9.51      115    10.14   6.92    3.25    18.10   5.53   1.21   6.41     3.72
9   pose (a) question       7.22      121    5.85    7.76    10.84   4.63    7.65   0.38   5.03     1.25
10  provide (a) clue        1.80      28     1.71    2.31    1.81    1.41    8.02   0.43   4.18     2.19
11  take precedence         3.19      58     3.71    2.72    3.61    2.41    9.54   0.43   4.72     1.42
12  take responsibility     9.24      105    12.99   7.13    13.37   1.41    5.81   0.66   5.63     3.24
Verb Adjective
13  make explicit           11.71     174    7.85    21.17   14.99   4.42    6.67   0.30   7.80     2.03
3.2 Validation
comparison with non-academic registers, a validation study was conducted with the aim of
investigating the list’s overall coverage of its source corpus as well as of a comparably sized
general corpus. The general corpus of 25 million tokens was compiled from the BNC including
imaginative writings (i.e. literary and creative works) and informative writings (i.e. leisure
component).
For the purpose of the analysis, each ACL entry underwent inflection expansion. In
other words, the entry was classed as the root form with all possible variants added. For
If the first collocate contained optional words, e.g. test (a) | v | theory | n , then this
was expanded into two separate rows as shown below. The POS for the second row was
classed as ngram, since multiword items are not tagged with a single POS. When searching for such
multiword items, the POS tags of the constituent tokens were not considered.
test | v | theory | n
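The expansion of optional words described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the validation: the function name is hypothetical, the output is a plain list of strings rather than the tagged rows shown in the text, and inflection expansion (which would need an inflection lexicon) is left out.

```python
import re
from itertools import product


def expand_optional(entry: str) -> list[str]:
    """Return every variant of an entry with each parenthesised word dropped or kept."""
    choices = []
    for token in entry.split():
        match = re.fullmatch(r"\((.+)\)", token)
        # An optional word yields two choices: absent ("") or present.
        choices.append(["", match.group(1)] if match else [token])
    return [" ".join(word for word in combo if word) for combo in product(*choices)]


variants = expand_optional("test (a) theory")
# one variant without the optional article and one with it
```

An entry with two optional words, such as (be) greatly influenced (by), expands into four variants under this scheme.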
This extended list, however, neither included collocations with flexible positions of their
significantly, nor collocations with variable gaps, e.g. consider these/such/specific issues. It
would have been beyond the scope of the validation to attempt to include all possible
The results show that the overall coverage of the ACL in the source corpus amounts to
1.4% and for the general corpus to 0.1%. This suggests that the ACL has a 14-times higher
note that Coxhead (2000) found a similar ratio of the AWL coverage in her source corpus
(10%) and a corpus of fiction (1.4%). These findings underline the importance that should be
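How coverage was computed is not spelled out in this section. One plausible reading, sketched below under that assumption, is the share of corpus tokens that fall inside a contiguous match of any expanded list entry. The function name and the sample sentence are hypothetical, and the sketch ignores POS tags and inflections.

```python
def coverage(tokens: list[str], variants: list[str]) -> float:
    """Share of tokens lying inside at least one contiguous match of a variant."""
    covered = [False] * len(tokens)
    patterns = {tuple(variant.split()) for variant in variants}
    for pattern in patterns:
        size = len(pattern)
        for start in range(len(tokens) - size + 1):
            if tuple(tokens[start:start + size]) == pattern:
                # Mark every token of the match; overlaps are counted once.
                for position in range(start, start + size):
                    covered[position] = True
    return sum(covered) / len(tokens) if tokens else 0.0


# Hypothetical nine-token sample, not taken from either corpus.
sample_tokens = "we test a theory and then test theory again".split()
sample_coverage = coverage(sample_tokens, ["test theory", "test a theory"])
```

Applying the same function to an academic and a general corpus and dividing the two results gives the kind of ratio reported above (1.4% against 0.1%, i.e. a factor of 14).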
4 Discussion
As described in the methodology section, the ACL was compiled using quantitative
analysis followed by qualitative refinement. During the process of manual vetting and
systematization the researchers had to constantly check the selection criteria and adapt them.
For example, at the outset the intention was to include two-word collocations only. Yet after
reviewing the data from the computational analysis, it was decided to keep those with
extended combinations such as deal xx issue, lead xx conclusion and largely based, as they are of
pedagogical value. Consulting concordance lines from the source corpus allowed the
researchers to identify the most frequent pattern, which was chosen for the final format. The
above examples were then completed as deal with an issue, lead to the conclusion, and (be)
largely based (on). The words added were typically: articles in verb+noun
combinations, e.g. raise (an) issue; the copula be in adverb+vpp combinations that can
be used as a predicate, e.g. (be) widely recognized; prepositions in verb or noun
combinations, e.g. rely heavily (on) and (a) high proportion (of); or a combination of two of
the above, e.g. (be) greatly influenced (by). Providing the most frequent pattern
increases the list’s usefulness and applicability for teachers and learners alike.
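Identifying the most frequent pattern for a gapped combination such as deal xx issue can be approximated by counting the words that fill the gap in concordance lines. The sketch below assumes plain whitespace-tokenized lines and an arbitrary maximum gap of three words; the function name, the gap limit and the three sample lines are invented for illustration and are not taken from the source corpus.

```python
from collections import Counter
from typing import Optional


def most_frequent_filler(lines: list[str], first: str, second: str,
                         max_gap: int = 3) -> Optional[str]:
    """Return the most common word sequence found between `first` and `second`."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i, token in enumerate(tokens):
            if token != first:
                continue
            # Look for `second` within max_gap intervening tokens.
            for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
                if tokens[j] == second:
                    counts[" ".join(tokens[i + 1:j])] += 1
                    break
    return counts.most_common(1)[0][0] if counts else None


# Invented concordance lines, for illustration only.
concordance = [
    "we must deal with an issue of trust",
    "they deal with an issue daily",
    "courts deal with the issue slowly",
]
filler = most_frequent_filler(concordance, "deal", "issue")
```

The sketch does not handle punctuation or inflected forms (deals, issues), which would have to be normalized before counting in any realistic use.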
One major challenge of compiling this list lay in the fact that there seems to be no
absolute definition of collocation and this is particularly true for some word combinations. The
moderate inter-rater reliability among the expert panel reported in the methodology section
suggests that which entry qualifies as an academic collocation for pedagogical purposes may
continuum of formulaicity and semantic opaqueness, and at the earlier stage, it was decided to
remove combinations at the two extreme ends, which are either fixed expressions (e.g. global
warming) or semantically transparent combinations (e.g. first issue). Yet there is no absolute
dividing line to determine what ‘fixedness’ or ‘transparency’ is. Take the entry academic life for
example. At the stage of expert judgement, it received a wide range of scores from 1
semantically transparent may be debatable, yet it was accepted as an entry in the list because
in addition to meeting all quantitative criteria, this entry still received a moderate total score of
15 from the experts. Comparing the appropriateness of the following examples that received a
total score below 10 (Table 8) and the ones with the highest expert scores (Tables 4 to 7), it
becomes clear why we believe that adopting expert judgement in selecting the entries and
utilizing the feedback from the panel such as systematizing the entries have contributed
significantly to the quality of this listing, which could not have been achieved by relying solely
on computational analysis.
Nonetheless, three issues which required further manual scrutiny became apparent after
the expert review. First of all, as the words processed by the computer were not lemmatized,
nouns with singular and plural forms or inflected verbs were processed separately despite
having identical collocates and hence some entries were duplicated. To rectify this issue the
original computational output was checked again manually, and duplicates such as
where both singular and plural forms were present, only the more frequent form was retained,
e.g. the singular form fundamental assumption, but the plural form common characteristics.
For verb combinations, inflected forms were converted to the infinitive, e.g. deny access
instead of denied access or denying access. For combinations with verb past participles (vpp),
it proved more complicated as they often occur in a variety of contexts – modifying nouns as
adjectives (see Example 1) or collocating with copula be or linking verbs as complement (see
Examples 2-3). It is also possible that a main verb shares the same form with its verb past
participle, in which case the current automated corpus-driven approach is unable to distinguish
them (compare Examples 3 and 4). The variable position of the adverb collocate adds even more
complexity to this matter (see Example 5). To solve this issue the following policy was
adopted: if the vpp forms generally function as complement in the concordance lines as in
Examples 2-3, then an optional copula be would be added. In addition, only the most frequent
pattern was kept; therefore, directly involved is included as opposed to involved directly, which
(2) Firstly, it is now well established that investigations into human rights abuses …
(3) … that the two leaders will have to become directly involved in talks rather than
(4) Only those negative outcomes that directly involved the pupil were included.
(5) When the nurse has not been involved directly in the administration of the
medication…
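The policy of retaining only the more frequent of several competing forms (singular against plural, or directly involved against involved directly) can be sketched as below. In the study this step was carried out manually; the code is only an illustration, the function name and the form-to-key mapping are hypothetical, and the frequencies are invented.

```python
def keep_more_frequent(freqs: dict[str, int], key_of: dict[str, str]) -> dict[str, str]:
    """Map each grouping key to the most frequent surface form that shares it."""
    best: dict[str, str] = {}
    for form, count in freqs.items():
        key = key_of.get(form, form)  # forms without a mapping act as their own key
        if key not in best or count > freqs[best[key]]:
            best[key] = form
    return best


# Invented frequencies; the mapping stands in for a manual or lemmatizer-based check.
freqs = {
    "fundamental assumption": 40, "fundamental assumptions": 25,
    "common characteristic": 10, "common characteristics": 30,
}
key_of = {
    "fundamental assumptions": "fundamental assumption",
    "common characteristics": "common characteristic",
}
retained = keep_more_frequent(freqs, key_of)
```

With these invented counts the singular form is kept for fundamental assumption and the plural for common characteristics, mirroring the two examples given above.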
The second issue was that some combinations were excluded because the individual
inflected forms with their collocates failed to meet the frequency threshold, even though their combined
frequency was actually higher than the cut-off frequency. The expert panel suggested a number
of such collocations that were not covered in the reviewed list. The researchers then
scrutinized the original data-driven list of 130,000+ entries again in search of any collocations
that had been missed as a result of inflections. A further 9.7% (n=239) of
combinations (e.g. differ considerably or exercise authority) were added to the list at this
stage. The above tasks were very time-consuming, yet such manual scrutiny was required in
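The second issue, where no single inflected form reaches the cut-off but the forms together do, can be illustrated with a small aggregation. The cut-off value, the frequencies, the function name and the form-to-lemma mapping below are all hypothetical; the study relied on manual scrutiny rather than on code like this.

```python
from collections import defaultdict


def recover_by_combined_frequency(freqs: dict[str, int], lemma_of: dict[str, str],
                                  cutoff: int) -> list[str]:
    """Return lemma keys whose forms each miss the cut-off but together reach it."""
    totals: dict[str, int] = defaultdict(int)
    highest: dict[str, int] = defaultdict(int)
    for form, count in freqs.items():
        key = lemma_of.get(form, form)
        totals[key] += count
        highest[key] = max(highest[key], count)
    return sorted(key for key in totals
                  if totals[key] >= cutoff and highest[key] < cutoff)


# Invented frequencies and an arbitrary cut-off of 25, for illustration only.
freqs = {
    "differ considerably": 12, "differs considerably": 9, "differed considerably": 7,
    "exercise authority": 30,
}
lemma_of = {
    "differs considerably": "differ considerably",
    "differed considerably": "differ considerably",
}
recovered = recover_by_combined_frequency(freqs, lemma_of, cutoff=25)
```

Here differ considerably is recovered because its three forms sum to 28, whereas exercise authority is not flagged since one form already passes the cut-off on its own.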
The final issue involves whether highly sensitive entries should be included in the ACL.
The principal researchers decided to exclude entries such as mental retardation as some
One interesting finding is the dominance of nominal combinations in the list. This might
indicate that nouns have a greater tendency to collocate than other word classes, which would
require further analysis. It might also be reflective of the written register, particularly in
scientific writing, which is characterized by a large proportion of nominalization in academic
texts indicating high information content. For example, Biber and Gray (2010: 2) found that
in noun phrases’. Fang, Schleppegrell & Cox (2006) reported that the extensive use of nouns
and nominal expressions in academic text can pose great challenges for comprehension (for
the discussion of nominalization in the written register, see also Halliday, 1985; Quirk,
Greenbaum, Leech and Svartvik 1985). Yet very little research has been conducted on
adjectival or nominal collocations in academic English. One of the few exceptions is a study
Lorenz’ study, however, focuses on the contrastive use of adjective intensification between
native and non-native writing rather than the use of academic collocations per se.
One may object that the listing has undergone various stages of filtering and that many
potentially useful and valuable combinations might have been removed from the final ACL
during this process. However, we argue that providing too much information - in this case too
many collocational entries - can be overwhelming for learners. Research has shown that even
with collocations dictionaries at hand, students still struggle to find correct verb collocates in a
given task (Dziemianko, 2010; Laufer, 2011). We believe that the ACL, composed of carefully
selected collocations, can serve as part of a lexical syllabus to raise learners’ awareness of
word co-occurrence and help them prioritize the learning of lexical items.
5 Conclusion
The Academic Collocation List comprises 2,468 entries. Our definition of collocations
refers to word combinations which co-occur more frequently than by chance across academic
disciplines (hence corpus-driven) and are pedagogically relevant in an EAP context (hence
expert-judged). Within the scope of this definition, we primarily focused on lexical collocations
as they contain certain variability and are thus more dynamic while grammatical collocations or
idioms consist of comparatively fixed patterns and are consequently more predictable. The
former is more challenging for learners to master whereas the latter can generally be treated
as holistic units and learners can more easily internalize the usage into their lexicon.
The ACL was compiled using a mixed-method approach of combining computational
analysis of the source corpus with expert judgement and systematization. As pointed out
above, existing corpus-driven multi-word lists often fail to provide immediately usable resources
for language learning, and it is only with expert intervention that raw data can be filtered and
refined in order to extract the most informative and meaningful entries. In the case of the ACL,
the statistical information served as an important reference, but in addition each collocation
included in the final list was subjected to expert judgement and manual refinement. The
advantage of relying not only on computational analysis but also on human intervention is that
the data-driven approach ensures that important combinations are not missed while expert
judgement ensures that the final entries are appropriate and relevant for EAP.
This approach led to an academic collocation list that we expect to be of immediate practical use to
language learners and EAP teachers alike. In addition to the Academic Word List and the
Academic Formulas List, the ACL provides a further tool for EAP teachers to construct
appropriate teaching materials and help students focus on frequent lexical items beyond
individual words.
As research has shown, collocations are difficult to learn and retain even with the
assistance of dictionaries. The list can therefore be used to support learning by drawing
attention to collocation per se. By subdividing the ACL, students and teachers can focus
opportunity to encounter collocations when dealing with academic texts. Combining both
explicit and implicit teaching will enhance learners’ receptive and productive collocational
knowledge and thus improve their academic English proficiency. EAP material writers may also
In terms of directions for future research, as Hyland and Tse (2007) suggested, further
research into an extended ACL may be required to highlight specific collocations that are
exclusively frequent in individual fields of study. We also propose that future research should
aim at categorizing the entries of the ACL on the basis, for example, of semantics or discourse
functions. For instance, the hedging function of some of the entries, e.g. virtually impossible,
relatively stable, is one feature that came to light during the compilation process. Types of
errors or underuse/overuse in learner language may be identified to help students improve
References
Altenberg, B. & Granger, S. (2001). The grammatical and lexical patterning of MAKE in native
Bahns, J. & Eldaw, M. (1993). Should we teach EFL students collocations? System, 21(1), 101-
114.
Benson, M. (1985). Collocation and idioms. In R. Ilson (Ed.) Dictionaries, lexicography and
Biber, D. & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H.
Hasselgard & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp.
Biber, D., Conrad, S. & Cortes, V. (2004). If you look at ...: Lexical bundles in university
Biber, D. & Gray, B. (2010). Challenging stereotypes about academic writing: Complexity,
German empirical study. In P.J.L. Arnaud & H. Béjoint (Eds.), Vocabulary and Applied
Chen, Y. H. & Baker, P. (2010). Lexical bundles in native and non-native academic writing.
Cobb, T. (2003). Analyzing late interlanguage with learner corpora: Quebec replications of
Cowie, A. P. (1981). The treatment of collocations and idioms in learners’ dictionaries. Applied
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Durrant, P. (2009). Investigating the viability of a collocation list for students of English for
Dziemianko, A. (2010). Paper or electronic? The role of dictionary form in language perception,
Fang, Z., Schleppegrell, M. & Cox, B. (2006). Understanding the language demands of
Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae.
In A. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145-160). Oxford:
Granger, S. & Paquot, M. (2008). Disentangling the phraseological web. In S. Granger &
John Benjamins.
Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.
Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1),
24-44.
Hyland, K. (2008). Academic clusters: Text patterning in published and postgraduate writing.
Hyland, K. & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2),
235-253.
Laufer, B. (2011). The contribution of dictionary use to the production and retention of
Laufer, B. & Waldman, T. (2011). Verb-Noun collocations in second language writing: A corpus
Lorenz, G. (1998). Overstatement in advanced learners' writing: stylistic aspects of adjective
intensification. In S. Granger (Ed.), Learner English on computer (pp. 53-66). London and
Martinez, R. & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299-
320.
University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some
Quirk, R., Greenbaum, S., Leech, G. & Svartvik, J. (1985). A comprehensive grammar of the
Schmitt, N. (2004). Formulaic sequences: Acquisition, processing, and use. Amsterdam: John
Benjamins.
Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.
Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford University Press.
Acknowledgements
We thank Douglas Biber, Regents' Professor in the Applied Linguistics Program at Northern
Arizona University, and Bethany Gray, Assistant Professor in the Department of English and
the TESL/Applied Linguistics Program at Iowa State University, for conducting the
We would like to express our gratitude to Andrew Roberts, Computational Linguist, for tagging
the initial collocation list and conducting the validation study of the Academic Collocation List.
We are also grateful to the members of the expert panel: David Crystal, Honorary Professor of
University; Della Summers, Dictionary Consultant; Professor Lord Randolph Quirk, FBA; and
We would also like to thank Mike Mayor, Editorial Director, Dictionaries & Reference,
Pearson for contributing valuable advice and John H.A.L. De Jong, Senior Vice President,
Standards and Quality Office, Pearson for his support throughout the project.
Last but not least, we thank the anonymous reviewers for their helpful and insightful
The complete Academic Collocation List can be accessed via the following link:
http://www.pearsonpte.com/research/Pages/CollocationList.aspx