


Developing the Academic Collocation List (ACL) – A corpus-driven and expert-judged approach

Article in Journal of English for Academic Purposes · December 2013

DOI: 10.1016/j.jeap.2013.08.002



Developing the Academic Collocation List (ACL) - A corpus-driven and expert-judged approach

Kirsten Ackermann & Yu-Hua Chen*


Pearson, 80 Strand London WC2R 0RL, UK
Emails: kirstensutton11@gmail.com, yu-hua.chen@nottingham.edu.cn (*corresponding author)

Highlights

• We focused on cross-disciplinary lexical collocations in academic writing.
• A corpus-driven and expert-judged approach was used.
• 2,468 highly frequent and pedagogically relevant collocations were identified.
• The ACL serves as a new tool in EAP for teaching and learning collocations.

Abstract

This article describes the development and evaluation of the Academic Collocation List (ACL),
which was compiled from the written curricular component of the Pearson International Corpus
of Academic English (PICAE) comprising over 25 million words. The development involved four
stages: (1) computational analysis; (2) refinement of the data-driven list based on quantitative
and qualitative parameters; (3) expert review; and (4) systematization. While taking
advantage of statistical information to help identify and prioritize the corpus-derived
collocational items that traditional manual examination is unable to manage, we argue that
only with human intervention can a data-driven collocation listing be of much pedagogical use.
Focusing on lexical collocations only, we present a new academic collocation list compiled using
a mixed-method approach of corpus statistics and expert judgement, consisting of the 2,468
most frequent and pedagogically relevant entries we believe can be immediately
operationalized by EAP teachers and students. By highlighting the most important
cross-disciplinary collocations, the ACL can help learners increase their collocational
competence and thus their proficiency in academic English. The ACL can also support EAP
teachers in their lesson planning and provide a research tool for investigating academic
language development.

1 Introduction

The Academic Word List (Coxhead, 2000) is arguably the most widely used EAP word

list today that takes corpus frequency into account. As the research interest in corpus

linguistics has gradually shifted towards word co-occurrence rather than single words (see

Granger and Meunier, 2008; Schmitt, 2004; Stubbs, 2001; Wray, 2008), there have been

more investigations of recurrent word combinations in academic prose using frequency and
dispersion parameters (e.g. Biber, Conrad & Cortes, 2004; Chen & Baker, 2010; Hyland, 2008).

To the best of our knowledge, however, corpus analysis was not applied to creating lists of

multi-word units for EAP pedagogy until recently when Durrant (2009) looked into the viability

of an academic collocation list and Simpson-Vlach and Ellis (2010) an academic formulas list

(AFL). Both lists present corpus-derived lexis across academic disciplines. The former covers

the most frequent two-word collocations retrieved using statistical information to determine

the strength of word co-occurrence while the latter combines automated extraction of recurring

word sequences and expert judgement to identify pedagogically useful formulaic sequences (3-,

4-, and 5-grams) for EAP.

In contrast to the corpus-driven approach using quantitative information to determine

the strength of word combinations described above, the traditional approach in collocation

research often relies on expert intuition to identify phraseological units and is hence also

known as the corpus-based approach if corpus data are used (see Granger & Paquot, 2008;

Tognini-Bonelli, 2001). In conventional phraseology research (Cowie, 1981, 1994;

Howarth, 1998), collocation is generally seen as a continuum with varying degrees of arbitrary

restriction ranging from free combinations (e.g. write an essay), through restrictive

collocations (e.g. conduct/do research instead of make research), to frozen expressions (e.g.

generally speaking). The arbitrary restriction could be associated with semantic opaqueness,

degree of formulaicity or substitutability. However, even in free collocations, no single word

can genuinely co-occur with any other word without restriction. Take write an essay for

example. In terms of syntax, if functioning as a transitive verb, write can only be followed by a

noun as an object. In terms of semantics, the literal meaning of write can only collocate with

what the action of write can produce such as essay, song or article. In other words, the

traditionally defined ‘free collocations’ are still very much restricted by their semantic and/or

syntactic environment. Hence, the boundaries between free combinations and restrictive

collocations are sometimes blurred and thus difficult to distinguish as both of them entail

arbitrary restriction to various extents. This also corresponds to the theory of ‘Lexical Priming’

proposed by Hoey (2005). Hoey argued that collocations offer a clue of how language is

structured and that words are ‘primed’ for use through our repeated encounters with them so

that our knowledge of a word, including the contexts and co-texts in which it occurs, is the

product of such encounters.

Although collocations can be instantly recognized by native speakers, they often remain

difficult for learners to acquire and use properly. According to Nation (2001, p. 324)

collocations contain ‘some element of grammatical or lexical unpredictability or inflexibility’. It

is this feature that makes collocations challenging for L2 learners. Laufer (2011, pp. 30-31)

summarizes the research findings from error analyses, elicitation and corpus analyses as

follows: ‘the use of collocations is problematic for L2 learners, regardless of years of instruction

they received in L2, their native language, or type of task they are asked to perform.’

The productive use of collocations in particular poses great challenges to L2 learners. Biber and

Conrad (1999) found that words with similar meaning are often distinguished by their

preferred collocations, which reinforces the need for a high level of collocational competence if

speakers want to express themselves clearly and unambiguously. Bahns and Eldaw’s study

(1993), on the other hand, revealed that in a translation task the number of ‘collocation errors’

made by L2 speakers is twice as high as the errors in single lexical items. Biskup (1992)

showed that learners use inappropriate synonyms when producing collocations, and Nesselhauf

(2005) points out that 50% of collocation errors are due to mother tongue interference. Finally,

Cobb (2003) found that even when learners use collocations correctly they over-rely on a small

number of collocations. Yet by using a less appropriate collocate, a non-native speaker will

sound unnatural or may even become unintelligible among speakers of the target language.

Hence if learners aim for advanced proficiency, achieving a high level of collocational

competence is essential.

The aforementioned research reveals the importance collocations play as well as the

challenges they pose to L2 learners, thus indicating the central role they should play in

language teaching and learning. Nation (2001, pp. 189-191) highlights the fact that academic

collocations may ‘neither be sufficiently frequent in the language as a whole to be learnt

implicitly nor part of the technical lexicon which is likely to be explicitly taught as part of

subject courses’, which reinforces the need for such a listing. None of the existing EAP

vocabulary lists, however, has met this need. As Durrant himself points out (2009, p. 163), the

majority of the items in his list are grammatical collocations, i.e. one closed-class word (a.k.a.

function word) such as prepositions or determiners plus one open-class word (a.k.a. content

word) such as verbs or nouns (Benson, 1985). For example, the top five collocations in

Durrant’s listing (ibid, p. 166) - this study, associated with, based on, and respectively, due to

- do not appear to have attracted much research interest in conventional collocation studies,

where collocations are manually retrieved as opposed to relying on statistical measures.

Instead, in the traditional approach, lexical collocations (the combination of two open-class

components, e.g. perform an experiment) are usually the target phraseological units under

investigation (e.g. Granger, 1998; Laufer and Waldman, 2011; Nesselhauf, 2003). While it is

true that some of the grammatical collocations in Durrant’s listing could contribute to revealing

the patterns that might be overlooked otherwise, his listing, based on statistics alone, does not

provide readily usable materials for EAP teaching and learning.

In the current study, we therefore argue that only with human intervention can a data-

driven collocation listing be of much pedagogical use while still taking advantage of statistical

information to help identify and prioritize the corpus-derived collocational items that traditional

manual examination is unable to manage. This is similar to the mixed-method approach

adopted by Simpson-Vlach and Ellis (2010), who also combined statistical information and

human judgement from EAP instructors when compiling the Academic Formulas List. The

difference is that in our study expert judgement is used not only for the selection of lexical

items for pedagogical purposes but also for the refinement of the final listing. It should be

noted that manual intervention is perhaps much more challenging when tackling collocations

than it is when listing formulas because the latter are rather fixed expressions (e.g. in terms of,

at the same time, from the point of) with little variation of individual components. Collocations,

on the other hand, often contain inflectional or positional variations (e.g. results obtained,

broader contexts, achieving objectives), which poses the great challenge of how to collate

these relevant forms and present them in a uniform and consistent way. This challenge can

only be met with human intervention as there is currently no automation that can simplify this

process.

In terms of disciplinary variability, Hyland and Tse (2007) cast some doubt on the

generalizability of the AWL and questioned the assumption of a universal single core

vocabulary list that can be applied to all fields of study. Although we also take a general

approach following the tradition of the AWL, a specified frequency and dispersion threshold at

least warrants that students, regardless of field of study, would be more likely to encounter

the lexical items on the lists than those outside of the lists. In addition, EAP lessons usually

follow a general syllabus to cater for students of all subjects; thus there is a need for such

cross-disciplinary lexical resources. Focusing on lexical collocations only, we present an

academic collocation list (ACL), which consists of the 2,468 most frequent (determined by corpus

analysis) and pedagogically relevant (determined by expert judgement) entries we believe can

be immediately operationalized by EAP teachers and students.

2 Methodology

2.1 Corpus

The Academic Collocation List is derived from the written curricular component of the

Pearson International Corpus of Academic English (PICAE). The corpus comprises over 37

million words of academic written and spoken texts from five major English-speaking countries,

i.e. Australia, Canada, New Zealand, UK and USA. The corpus includes curricular English as

found in lectures, seminars, textbooks and journal papers. It also samples extracurricular

English that students encounter from university administration to transcripts of broadcasts.

The written curricular component of the corpus, from which the ACL was compiled,

comprises 25.6 million words from journal articles and textbook chapters covering 28 academic

disciplines as listed in Table 1. Each of the four fields of study contains materials from seven

academic disciplines to ensure that the corpus is representative of the academic register. The

number of tokens per academic discipline as well as the total number of tokens per field of

study and its percentage are provided below. Whereas the main objective was to compile

subcorpora for each field of study of similar size, less emphasis was placed on having similar-

sized subcorpora for each academic discipline.

Table 1 Fields of study and academic disciplines represented in the written curricular component of the corpus

Applied Sciences and           Humanities (HM)                Social Sciences (SS)           Natural / Formal
Professions (AS)                                                                             Sciences (NS)
Discipline            Tokens   Discipline            Tokens   Discipline            Tokens   Discipline            Tokens

Architecture         167,074   History              946,707   Anthropology         413,237   Earth sciences     1,343,723
Business           1,644,180   Linguistics          855,128   Archaeology          184,089   Chemistry          1,502,277
Education            405,202   Literature         1,562,046   Cultural studies     861,656   Physics              662,054
Engineering        1,134,950   Arts                 728,532   Gender studies       520,395   Computer sciences  1,124,097
Health sciences    1,429,679   General humanities   627,951   Politics           1,090,800   Mathematics          295,565
Media studies      1,500,485   Philosophy           602,233   Psychology         1,560,745   Biology              858,597
Law                1,962,002   Religion             198,165   Sociology          1,832,588   Ecology              239,787
Total              8,243,572   Total              5,520,762   Total              6,463,510   Total              6,026,100
                       (31%)                          (21%)                          (25%)                          (23%)

The ACL was developed in four stages. First, a computational analysis of the written

curricular component was conducted. Second, manual refinement of the data-driven list based

on quantitative parameters and target part-of-speech combinations was carried out. This was

followed by an expert review to judge whether each collocation is pedagogically relevant and a

systematization process of the list. Each stage will be addressed in turn in the following

sections.

2.2 Computational analysis

At this stage a ‘collocation’ was defined as the pairing of a reference word with a single word

that tends to occur within a span of ±3 words of it, co-occurring at least five times in total across at

least five different texts with a Mutual Information (MI) score of at least 3 and a t-score of at

least 2. The MI score indicates the strength of association between the components of the

collocation. The t-score, on the other hand, is a measure of certainty of a collocation, also

taking frequency into account. The former is more likely to give high scores to fixed phrases

whereas the latter will yield significant collocates that occur relatively frequently. According to

Hunston (2002, p. 75), a collocation with an MI score of at least 3 and a t-score of at least 2 is

considered ‘a strong collocate, and a certain one’.
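
The two association measures can be illustrated with a minimal sketch. It assumes the textbook formulations MI = log2(O/E) and t = (O - E)/sqrt(O), where O is the observed co-occurrence frequency and E the frequency expected by chance. The exact formulas and any span adjustment used by the project's collocation program are not stated in the text, so the `span` scaling of E and all function names below are assumptions made for illustration only.

```python
import math


def expected_frequency(f_node, f_collocate, corpus_size, span=1):
    """Chance co-occurrence frequency; the span scaling is an assumption."""
    return f_node * f_collocate * span / corpus_size


def mi_score(observed, f_node, f_collocate, corpus_size, span=1):
    """Mutual Information: log2 of observed over expected frequency."""
    expected = expected_frequency(f_node, f_collocate, corpus_size, span)
    return math.log2(observed / expected)


def t_score(observed, f_node, f_collocate, corpus_size, span=1):
    """t-score: (observed - expected) divided by sqrt(observed)."""
    expected = expected_frequency(f_node, f_collocate, corpus_size, span)
    return (observed - expected) / math.sqrt(observed)


def is_strong_and_certain(mi, t, min_mi=3.0, min_t=2.0):
    """Thresholds cited from Hunston (2002): MI >= 3 and t-score >= 2."""
    return mi >= min_mi and t >= min_t


# Hypothetical counts: both words occur 1,000 times in a 1,000,000-word
# corpus and co-occur 16 times, so E = 1, MI = 4.0 and t = 3.75.
mi = mi_score(16, 1000, 1000, 1_000_000)
t = t_score(16, 1000, 1000, 1_000_000)
print(round(mi, 2), round(t, 2), is_strong_and_certain(mi, t))
```

Because the t-score grows with raw frequency while MI does not, a rare but exclusive pairing can score high on MI and low on t, which is why the two thresholds are applied together.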

The first step of the computational analysis was to obtain a list of content words in the

corpus using MonoConc Pro 2.2. Secondly, a list of node words, i.e. the words that occurred at

least five times per million words and in at least five different texts, was compiled. Function

words, proper nouns, personal names and non-words were removed manually from this list if

they occurred with high frequency. Words from the General Service List (West, 1953) were also

removed from the node word list but could still appear as pre- or post-collocates.
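
The node word selection just described can be sketched roughly as follows. This is not the project's actual procedure (which used MonoConc Pro and manual removal); the function name, the input format (one token list per text) and the `excluded` set standing in for General Service List items and manually removed words are all hypothetical.

```python
from collections import Counter


def select_node_words(texts, excluded, min_per_million=5.0, min_texts=5):
    """Return words meeting the frequency and dispersion thresholds.

    texts    : list of token lists, one per corpus text (assumed format)
    excluded : words removed from the node list, e.g. GSL items (assumed)
    """
    total_tokens = sum(len(tokens) for tokens in texts)
    freq = Counter()      # overall frequency of each word
    doc_freq = Counter()  # number of texts each word occurs in
    for tokens in texts:
        freq.update(tokens)
        doc_freq.update(set(tokens))

    nodes = []
    for word, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if (per_million >= min_per_million
                and doc_freq[word] >= min_texts
                and word not in excluded):
            nodes.append(word)
    return sorted(nodes)


# Toy corpus: 'empirical' occurs in five texts, 'rare' in only one.
toy_texts = [["empirical", "the", "data"]] * 5 + [["rare", "the"]]
print(select_node_words(toy_texts, excluded={"the", "data"}))
```

Excluded words are dropped from the node list only; they remain in the texts and so can still be picked up as collocates at the extraction stage.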

Next, a stop list which contained frequent function words that express little lexical

meaning was created, i.e. articles, pronouns, conjunctions and the preposition of¹. The stop list was

used by the collocation program, specifically written for this project, to exclude sequences

composed only of grammatical function words from subsequent analysis.

The list of node words was then used to extract potential collocations from the corpus.

In total, this data-driven list contains over 130,000 entries. Each entry includes the node word,

its collocate, the general position of the collocate (pre- or post-), the precise position the

collocate most often occurs in, the normed frequency per million words, MI score, t-score, the

number of texts the collocation occurs in, and normed rates of occurrence for each of the four

fields of study defined in the corpus, i.e. applied science & professions (AS), humanities (HM),

social sciences (SS), and natural/formal sciences (NS). Table 2 provides a sample output from

the computational analysis.
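
The extraction step can be approximated with a short sketch. The project's own collocation program is not available, so everything here is an assumption: the function name, the tiny stop list, the input format, and the reading of the stop list as "skip any collocate that is on it". The sketch records, for each node and collocate pair within a span of ±3 words, the raw frequency, the number of texts, and the position in which the collocate most often occurs (negative for pre-collocates, positive for post-collocates).

```python
from collections import Counter, defaultdict

# Illustrative stop list only; the project's actual list is not reproduced.
STOP_WORDS = {"the", "a", "an", "of", "and", "it", "this"}


def extract_candidates(texts, node_words, stop_words=STOP_WORDS, span=3):
    """Count node/collocate pairs within +/-span words of each node word."""
    pair_freq = Counter()
    pair_texts = defaultdict(set)
    positions = defaultdict(Counter)
    for text_id, tokens in enumerate(texts):
        for i, node in enumerate(tokens):
            if node not in node_words:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(tokens):
                    continue
                collocate = tokens[j]
                if collocate in stop_words:
                    continue
                key = (node, collocate)
                pair_freq[key] += 1
                pair_texts[key].add(text_id)
                positions[key][offset] += 1
    return {
        key: {
            "freq": count,
            "texts": len(pair_texts[key]),
            "position": positions[key].most_common(1)[0][0],
        }
        for key, count in pair_freq.items()
    }


toy_texts = [["more", "complex", "systems"], ["a", "more", "complex", "view"]]
candidates = extract_candidates(toy_texts, node_words={"complex"})
print(candidates[("complex", "more")])
```

Normed frequencies, MI scores and t-scores would then be computed per pair from these raw counts.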

Table 2 Sample output from the computational analysis

                                                                              Normed freq  Normed frequency in field of study
Pre-collocate   Academic word Post-collocate  Position No of texts Raw freq      per mill       AS       HM       SS       NS     MI  t-score

becomes         apparent                            -1          46       66          2.96     3.28     2.72     3.25     2.41   7.72     8.09
more            complex                             -1         208      870         39.04    38.69    37.10    39.74    40.63   5.76    28.95
social          context                             -1          73      350         15.71    18.13    15.72    25.65     1.21   5.03    18.13
contribute xxx  development                         -3          17       25          1.15     1.28     0.42     2.35     0.20   4.80     4.82
                domain        specific               1           9       57          2.56     0.57     9.22     0.54     1.21   6.68     7.48
                empirical     research               1          51      108          4.85     6.14     2.72     8.67     0.80   6.81    10.30
well            established                         -1         131      319         14.32    12.13    16.98    14.09    15.08   6.43    17.65
                experimental  study                  1          21       30          1.35     1.86     1.26     1.08     1.01   4.44     5.23
                holistic      approach               1          26       45          2.02     2.28     0.63     3.43     1.41   8.65     6.69
profound        implications                        -1          16       22          0.99     0.71     0.21     1.99     1.01   8.58     4.68
provide         information                         -1         109      371         16.65    20.41    14.88    13.01    17.10   5.91    18.94
seem            paradoxical                         -1           5        5          0.22     0.29     0.21     0.36     0.00   7.34     2.22
                subordinate   position               1          15       27          1.21     0.14     1.68     3.25     0.00   7.26     5.16

¹ As the focus of the ACL is lexical collocation, the preposition ‘of’ was excluded because it tends to occur
in grammatical phrases only (e.g. of the).
Table 2 also highlights why further refinement was required as the entries may have a

very low normed frequency, e.g. seem paradoxical; may have a low distribution in certain

fields of study, e.g. subordinate position; or may be part of an extended phrase, e.g.

contribute xxx development.

2.3 Refinement

This list of 130,000+ entries required further scrutiny in order to select and present

academic collocations in a more systematic and user-friendly way. This section will explain the

refinement process using quantitative and qualitative parameters.

2.3.1 Filtering

As one of the main objectives of this project was to identify the most frequent

collocations across academic disciplines, quantitative values were first taken into consideration.

An explorative pilot investigation was conducted in search of the optimal combination of cut-

off points for MI score, t-score, frequency and distribution, with which the collocations could be

identified while unsuitable combinations could be filtered out. As a result, the two principal

researchers agreed that only entries which met the following quantitative parameters would

undergo further analysis: (1) normed frequency ≥1 per million; (2) normed frequency ≥0.2

per million in each field of study; (3) MI score ≥3; and (4) t-score ≥4. The t-score threshold

was raised because it was found that entries with a t-score of less than 4 were mainly noun-

preposition combinations and fragments of extended phrases, which were not target

combinations. Once the entries were filtered using the quantitative parameters (1) to (4), the

resulting list was reduced to 16,174 entries. Despite a much more manageable data set², this

list still required further refinement.
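
The four quantitative parameters can be expressed as a simple filter. The sketch below applies them to three rows taken from Table 2; the `Entry` class and the function name are hypothetical and only illustrate the thresholds stated above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    collocation: str
    per_million: float  # overall normed frequency
    by_field: dict      # normed frequency in AS, HM, SS and NS
    mi: float
    t_score: float


def passes_filter(entry, min_freq=1.0, min_field_freq=0.2,
                  min_mi=3.0, min_t=4.0):
    """Apply parameters (1) to (4) from the filtering stage."""
    return (entry.per_million >= min_freq
            and all(v >= min_field_freq for v in entry.by_field.values())
            and entry.mi >= min_mi
            and entry.t_score >= min_t)


# Values copied from Table 2.
SAMPLE = [
    Entry("holistic approach", 2.02,
          {"AS": 2.28, "HM": 0.63, "SS": 3.43, "NS": 1.41}, 8.65, 6.69),
    Entry("seem paradoxical", 0.22,
          {"AS": 0.29, "HM": 0.21, "SS": 0.36, "NS": 0.00}, 7.34, 2.22),
    Entry("subordinate position", 1.21,
          {"AS": 0.14, "HM": 1.68, "SS": 3.25, "NS": 0.00}, 7.26, 5.16),
]

kept = [e.collocation for e in SAMPLE if passes_filter(e)]
print(kept)  # ['holistic approach']
```

Here seem paradoxical fails on overall frequency and t-score, and subordinate position fails on the per-field minimum, matching the reasons given for further refinement in the discussion of Table 2.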

² Manageability here refers both to the data that would be subject to human judgement at a later stage
and to the learning load for students. For comparison, the AWL consists of 570 word families and
approximately 3,000 words altogether, while the AFL presents 200 formulas each for the spoken and the
written registers. Other comparable pedagogical vocabulary listings include the General
Service List with the most frequent 2,000 English word families (West, 1953), the Phrasal Expression List
(Martinez and Schmitt, 2012) with 505 entries, and the top 100 key academic collocations, mostly
grammatical ones, reported in Durrant’s study (2009).
2.3.2 POS-tagging

At this stage, it was decided to apply part-of-speech tagging to each entry to facilitate

the extraction of collocations with specific word-class combinations. Lexical collocations that

fall into the following four types of part-of-speech (POS) combinations are the major targets of

our subsequent investigation: verb+noun (e.g. gather data), adjective+noun (e.g. systematic

approach), adverb+adjective (e.g. increasingly complex), and adverb+verb (e.g. significantly

affect). This is consistent with conventional corpus-based collocation research, e.g.

verb+noun combinations investigated by Altenberg and Granger (2001), Laufer and Waldman

(2011), Nesselhauf (2005), or intensifying adjectival combinations by Lorenz (1998). Some

target POS combinations may meet all the quantitative criteria but are excluded from the list

because they are of little pedagogical value. For example, standard deviation is highly frequent

with high MI and t-scores, but this word combination is often considered a compound noun

without any room for commutability and listed as an independent entry in many dictionaries

we consulted. By the same token, frozen expressions such as generally speaking are excluded

as they are generally regarded as formulae as opposed to collocations by phraseologists. In

other words, only free and restricted combinations in the traditional phraseological sense as

discussed in the collocation continuum above will be subject to expert judgement as these

combinations are most challenging for learners due to their varying degree of arbitrary

restriction or substitutability.

In order to only subject collocations with the target POS combinations to further

analysis, the list was tagged using Apache OpenNLP v1.5.0 applying a simplified set of POS

tags, i.e. noun, verb, modal verb, adverb, adjective, preposition, determiner, pronoun,

conjunction. Although the entries that were tagged lacked context, the tagging was rather

accurate, and only about 10 per cent of POS tags had to be corrected manually. Entries with

most non-target POS combinations, for example, determiner+noun (e.g. some historians),

preposition+noun (e.g. within cultures) or preposition+adverb (e.g. without actually) were

excluded, whereas noun+noun combinations (e.g. information retrieval, problem area) and

noun+adjective (e.g. evidence available) were kept for manual review as they appeared

valuable from a pedagogical point of view although they do not fall into the four major target

categories of POS combinations. The filtered list now contained 6,808 entries. These entries

were first manually vetted by the principal researchers before being reviewed by experts as

described below.
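
Once each entry carries a pair of simplified POS tags, the selection described above amounts to a lookup. The sketch below is a simplification that covers only the combinations named in the text: the four target types, plus noun+noun and noun+adjective, which were kept for manual review. The text says that entries with most, not all, other combinations were excluded, so the blanket `exclude` default is an assumption, as are the tag labels and function name.

```python
TARGET_COMBINATIONS = {
    ("verb", "noun"),
    ("adjective", "noun"),
    ("adverb", "adjective"),
    ("adverb", "verb"),
}
REVIEW_COMBINATIONS = {("noun", "noun"), ("noun", "adjective")}


def route_entry(first_pos, second_pos):
    """Classify a tagged two-word entry by its POS combination."""
    combination = (first_pos, second_pos)
    if combination in TARGET_COMBINATIONS:
        return "target"
    if combination in REVIEW_COMBINATIONS:
        return "manual review"
    return "exclude"  # simplification: the source excluded 'most' others


# Examples taken from the text, with their simplified tags.
tagged_entries = [
    ("gather", "verb", "data", "noun"),
    ("increasingly", "adverb", "complex", "adjective"),
    ("information", "noun", "retrieval", "noun"),
    ("evidence", "noun", "available", "adjective"),
    ("some", "determiner", "historians", "noun"),
]
for w1, p1, w2, p2 in tagged_entries:
    print(f"{w1} {w2}: {route_entry(p1, p2)}")
```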

2.3.3 Manual vetting

The 6,808 entries underwent a qualitative review in which each entry was assessed

independently by the two researchers to determine whether a specific entry should be included,

discussed or excluded from further analysis. The objective of this stage was to further refine

the list by excluding the following types of entries:

1. Linguistically incomplete units (e.g. based approach³)

2. Combinations with a high degree of fixedness⁴ (e.g. collective bargaining)

3. Combinations with adverbs referring to time or frequency (e.g. already, now, often)

4. Combinations with common transparent adjectives (e.g. good evidence)

5. Combinations with concrete geographical references (e.g. European community)

6. Combinations that are often hyphenated (e.g. ill-defined)

The independent judgements were then compared and entries where there was no

agreement or that were marked as ‘discuss’ were reassessed. The discussion was mainly

related to entries where there was ambiguity in relation to the degree of fixedness, technical

specificity or semantic transparency of an entry. At this stage it was decided to opt for

inclusion if no clear-cut decision could be made.

After excluding all combinations tagged as ‘exclude’ by both researchers, the remaining

4,558 entries were subjected to the expert review.
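
The reconciliation of the two researchers' independent judgements can be summarized in a few lines. The labels follow the three options named above (include, discuss, exclude); the function name and return values are hypothetical.

```python
def reconcile(judgement_a, judgement_b):
    """Combine two independent judgements on one entry.

    An entry is dropped only when both researchers marked it 'exclude'.
    Disagreements and 'discuss' labels are reassessed; per the text, the
    reassessment opts for inclusion when no clear-cut decision is possible.
    """
    if judgement_a == judgement_b == "exclude":
        return "exclude"
    if judgement_a == judgement_b == "include":
        return "include"
    return "reassess"


print(reconcile("exclude", "exclude"))  # exclude
print(reconcile("include", "exclude"))  # reassess
print(reconcile("discuss", "include"))  # reassess
```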

³ This combination takes one noun before ‘based’, e.g. ‘task based approach’.
⁴ The degree of fixedness was determined by consulting several popular online dictionaries including the
Longman Dictionary of Contemporary English (http://www.ldoceonline.com/) or the Cambridge Dictionary
(http://dictionary.cambridge.org/) to see whether the word combinations under investigation are listed
as independent entries.

2.4 Expert review

The purpose of the expert review was to judge whether all 4,558 entries, which met the

aforementioned quantitative and qualitative criteria, should be included in the final list from a

pedagogical point of view, i.e. appropriateness and relevance of each entry to the field of EAP.

The panel consisted of six experts from different professional backgrounds as below, and the

rationale of having theoretically oriented as well as practically oriented experts on the panel

was to ensure a thorough assessment of the list entries.

Expert 1: Professor of Linguistics

Expert 2: Professor of English Linguistics

Expert 3: Senior lecturer in EFL/TESOL

Expert 4: Dictionary consultant

Expert 5: Professor of English Language and Literature

Expert 6: Lexicographer and publisher

A detailed written brief that outlined the scope and objectives of the project in general

and the aims of the ACL in particular was reviewed and approved by a university EAP lecturer

and a publisher from the panel before being sent to all the panel experts. The questions and

rating scale given to the experts were deliberately vaguely formulated to allow for the

incorporation of individual viewpoints deriving from the varied backgrounds. However, experts

were instructed to contact the two principal researchers for any clarification before

commencing judgement and were encouraged to comment on their own ratings. Each expert

was asked to make an independent judgement based on the following questions: (1) Is it

appropriate to regard the entry as a collocation for teaching and/or learning purposes? (2) Is

the collocation pedagogically relevant? The following four-point Likert scale was used for the

judgement:

1 = definitely exclude

2 = not sure, but tendency to exclude

3 = not sure, but tendency to include

4 = definitely include

Each entry contained the following statistical information: overall normed frequency,

normed frequency in each field of study, MI score, and t-score. Experts could use the statistical

information to make an informed decision when assessing the collocation.

The inter-rater reliability was overall moderate (Intraclass Correlation Coefficient 0.524).

Experts⁵ agreed to definitely include 1,215 collocations (27%). The moderate agreement is

probably not surprising as the reviewers were intentionally chosen from heterogeneous

backgrounds so that they could provide feedback from different perspectives. If the sum of the

expert judgements was less than or equal to 9, the entry was excluded from the final list. In total

385 entries (8%) were dismissed at this stage.
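
The exclusion rule can be stated compactly. With five usable sets of ratings on the four-point scale, the sum for an entry ranges from 5 to 20, and entries with a sum of 9 or less were dropped. The sketch below assumes that "definitely include" agreement means every expert gave a rating of 4; that reading, like the function name, is an assumption rather than something the text spells out.

```python
def review_outcome(ratings, cutoff=9):
    """Classify an entry from its expert ratings (1-4 each)."""
    total = sum(ratings)
    if total <= cutoff:
        return "exclude"
    if all(r == 4 for r in ratings):  # assumed meaning of full agreement
        return "definitely include"
    return "keep"


print(review_outcome([4, 4, 4, 4, 4]))  # definitely include
print(review_outcome([2, 2, 2, 2, 1]))  # exclude (sum = 9)
print(review_outcome([2, 2, 2, 2, 2]))  # keep (sum = 10)
```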

It has to be noted that the comments from the panel also significantly contributed to

the follow-up systematization process, which was undertaken by the principal researchers. This

process is explained in the next section.

2.5 Systematization

As more than one panel member suggested systematizing the entries to provide a

listing that would be more readily accessible for users, the researchers decided to take the

following steps as part of the systematization process:

I. Listing collocations in their base form

i. Changing adjectives in the comparatives/superlatives to the base form if

appropriate

ii. Changing nouns in plural to singular unless the noun is a plurale tantum or

more likely to appear in its plural form as part of the collocation

iii. Changing inflected verbs to infinitive

II. Harmonizing entries that appear in British and American English, with British English

as the preferred form

III. Adding definite or indefinite articles to verb+noun collocations in line with dictionary

conventions (e.g. resolve (a) dispute; apply (the) theory)

IV. Adding optional copula be to adverb+verb past participle combinations if they can be

used as predicate (e.g. (be) universally accepted)

V. Adding dominant prepositions to collocations (e.g. bear resemblance (to))

⁵ The ratings of one expert had to be disregarded as they were incomplete.
In order to make informed decisions, the researchers consulted concordance lines from

the corpus and referred back to the data-driven list. The pros and cons regarding points III to

V are discussed later.
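
The presentation conventions in steps III to V can be illustrated with a small formatting helper. It reproduces only the bracketing conventions shown in the examples above; deciding which article, copula or preposition applies to a given entry was done manually with reference to concordance lines, so the function and its parameters are purely hypothetical.

```python
def format_entry(words, article=None, copula=False, preposition=None):
    """Render an entry with optional elements in parentheses.

    words       : the two collocating words in base form
    article     : optional article inserted after the first word (step III)
    copula      : prefix an optional '(be)' (step IV)
    preposition : optional dominant preposition appended (step V)
    """
    first, *rest = words
    parts = []
    if copula:
        parts.append("(be)")
    parts.append(first)
    if article:
        parts.append(f"({article})")
    parts.extend(rest)
    if preposition:
        parts.append(f"({preposition})")
    return " ".join(parts)


print(format_entry(["resolve", "dispute"], article="a"))
print(format_entry(["universally", "accepted"], copula=True))
print(format_entry(["bear", "resemblance"], preposition="to"))
```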

3 Results

In this section, an overview of the ACL in terms of part-of-speech combinations is

provided and the results of the validation study are presented.

3.1 Composition

After several stages of computational analyses and manual refinement as described

above, the academic collocation list with 2,468 entries was completed. The number of entries

in various target POS combinations and examples in each category are presented in Table 3.

As can be seen, noun combinations form the largest category, comprising

nearly three quarters of the total entries (74.3%, n=1,835). The second largest category is

verb combinations with nouns or adjectives as complements (13.8%, n=340), followed by

adverb+verb combinations (6.9%, n=170), which include positionally variable adverb+verb

and verb+adverb entries as well as adverb+vpp (verb past participle) combinations.

Adverb+adjective combinations are comparatively fewer but still cover 5.0% (n=124) of the list.

Table 3 Overview of final academic collocations in part-of-speech combinations

Combinations                   POS I  POS II  No of entries  Percentage   Examples

1. noun combinations           adj    n               1,773       71.8%   anecdotal evidence, classic example
                               n      n                  62        2.5%   assessment process, target audience
                               Total                  1,835       74.3%
2. verb+noun/adj combinations  v      n                 310       12.6%   gather information, undertake research
                               v      adj                30        1.2%   consider appropriate, seem plausible
                               Total                    340       13.8%
3. verb+adv combinations       adv    v                  17        0.7%   explicitly state, strongly agree
                               v      adv                29        1.2%   grow rapidly, vary considerably
                               adv    vpp               124        5.0%   previously mentioned, (be) widely dispersed
                               Total                    170        6.9%
4. adv+adj combinations        adv    adj               124        5.0%   highly controversial, (be) markedly different
                               Total                    124        5.0%
Grand total                                           2,469        100%

A representative selection of adjective+noun and noun+noun combinations in the ACL

is listed in Table 4. These are the entries that received the highest expert agreement (see

Section 2.4) indicating almost unanimous agreement among experts that these collocations

are of pedagogical value.

Table 4 Adjective+noun and noun+noun collocations with the highest combined score
Normed
No of Normed Normed Normed Normed t-
Adjective Noun freq per MI StDev* StDev
texts AS HM SS NS score
million
1 academic writing 16.74 23 5.28 68.12 1.81 0.20 8.32 19.25
2 brief overview 2.20 38 2.43 1.68 2.71 1.81 9.85 6.99
3 causal link 1.53 16 3.14 0.63 1.45 0.20 8.09 5.81
4 conflicting interests 1.21 13 1.00 0.21 3.07 0.40 8.47 5.18
5 conventional wisdom 3.72 38 2.86 1.26 9.39 1.01 10.99 9.11
6 crucial factor 1.80 27 2.00 1.89 2.17 1.01 7.32 6.29
7 crucial importance 2.06 35 1.71 2.10 2.71 1.81 6.99 6.73
8 crucial role 4.89 71 3.43 7.55 5.42 3.82 7.11 10.36
9 cultural heritage 2.51 30 2.00 1.05 3.97 3.02 8.09 7.46
10 (a) deep understanding (of) 2.74 43 2.00 3.14 2.71 3.42 8.38 7.79
11 disposable income 1.62 16 3.14 0.21 1.81 0.60 12.66 6.00
12 dividing line 1.30 22 1.28 1.26 2.35 0.20 8.34 5.37
13 domestic violence 13.24 22 22.70 0.63 23.84 0.20 9.72 17.16
14 due process 5.39 26 9.56 1.89 6.14 2.01 5.55 10.72
15 economic conditions 4.76 50 6.00 1.47 8.49 2.01 5.35 10.04
16 economic growth 15.21 60 27.70 2.93 11.92 13.07 7.37 18.30
17 economic power 5.43 49 4.43 3.14 10.84 3.02 4.51 10.52
18 educational institution 3.90 50 4.71 3.14 5.60 1.61 7.15 1.20 5.88 4.13
19 environmental factors 8.26 46 10.14 0.84 12.10 8.45 5.80 2.66 7.86 7.61
20 environmental protection 4.62 30 5.00 1.47 1.99 10.06 7.51 10.09
21 equal opportunity 8.39 30 14.85 0.84 13.91 0.40 9.01 13.65
22 ethnic minority 18.62 55 23.70 5.66 38.29 2.01 10.40 0.46 14.38 0.86
23 federal government 12.25 47 24.13 3.56 14.09 1.81 9.21 16.49
24 final stage 5.39 72 5.85 3.56 5.06 6.84 7.11 0.09 7.51 2.32
25 financial resources 4.22 48 7.57 2.31 3.97 1.61 7.30 9.63
26 financial support 5.97 59 8.71 2.72 7.59 3.42 7.09 11.45
27 first generation 5.56 51 6.42 5.66 4.34 5.63 5.77 10.93
28 foreign investment 4.40 23 8.28 1.68 2.71 3.42 9.00 9.88
29 foreign policy 14.05 59 12.85 11.11 28.54 2.41 7.10 1.37 10.92 8.47
30 full range 6.37 68 6.85 3.14 10.30 4.42 6.18 11.75
31 further information 9.92 52 15.13 6.92 7.77 7.84 4.96 14.39
32 further research 6.10 67 6.42 9.64 4.52 4.02 4.39 11.11
33 high profile 4.67 45 5.00 3.77 7.77 1.61 7.34 10.14
34 higher education 26.52 72 33.55 18.86 31.07 18.91 8.44 24.24
35 individual differences 7.05 41 7.28 2.10 15.53 2.01 5.97 12.33
36 infinite number 3.90 39 3.43 3.77 0.54 8.45 7.45 9.27
37 intellectual property 6.87 25 15.13 2.31 0.72 6.44 9.49 12.35
38 interpersonal skills 5.25 14 14.28 1.47 1.63 0.20 9.99 10.81
39 key element 8.30 86 13.28 2.93 11.56 2.82 6.12 0.66 9.47 0.04
40 key factor 7.58 86 10.56 1.47 9.21 7.44 6.00 0.60 9.04 0.05
41 living conditions 4.49 35 7.00 1.05 6.86 1.61 6.72 9.90
42 local government 23.78 88 35.40 10.48 39.02 3.22 7.39 0.64 15.79 4.86
43 low income 7.63 44 11.14 1.47 10.84 5.03 8.13 12.99
44 medical treatment 6.51 27 16.70 0.42 3.25 1.61 7.55 11.98
45 mental health 59.64 50 84.37 2.10 130.24 1.41 9.26 36.40
46 mental illness 8.71 38 10.85 3.77 17.70 0.40 9.15 13.90
47 minimum standard 3.41 28 7.14 0.84 3.25 0.80 6.70 2.39 5.63
48 national identity 10.99 49 5.14 6.71 27.28 5.23 6.78 15.51
49 native speaker 5.56 46 3.00 18.65 2.17 0.40 9.97 0.94 7.74 2.01
50 natural resources 11.62 63 17.56 3.77 6.32 16.69 7.33 0.12 10.83 4.62
51 natural world 6.51 46 1.86 11.32 11.92 2.41 5.29 11.73
52 next generation 5.03 55 6.28 3.35 3.97 6.03 7.74 10.53
53 nuclear power 6.01 29 1.71 4.61 4.34 15.29 7.47 11.51
54 nuclear weapon 4.13 30 4.85 1.68 6.50 2.82 11.84 9.59
55 ongoing debate 1.80 21 1.71 1.05 3.97 0.20 8.94 6.31
56 online database 1.21 6 0.29 0.21 4.15 0.20 8.28 5.18
57 paid employment 4.53 22 2.43 1.47 13.01 1.01 8.70 10.03
58 physical activity 5.92 33 9.28 2.31 9.57 0.60 5.04 1.51 7.34 4.31
59 physical properties 11.94 38 2.14 4.19 0.90 45.45 8.43 16.26
60 political economy 18.58 47 22.27 7.34 33.96 7.04 7.85 20.26
61 political institution 6.33 35 2.28 4.40 18.42 0.40 6.06 11.70
62 political party 9.83 50 5.00 3.77 28.00 2.21 7.19 14.70
63 popular culture 52.59 51 13.42 21.17 175.22 1.41 9.09 34.17
64 positive feedback 4.58 27 1.57 0.84 1.26 16.09 7.85 10.06
65 prior knowledge 3.32 33 3.57 2.72 1.99 5.03 6.59 8.51
66 private sector 26.93 83 50.11 3.35 36.13 6.64 9.47 0.68 15.90 9.63
67 public sector 12.97 46 26.70 1.47 14.99 2.41 7.73 16.92
68 public transport 8.26 33 17.13 1.68 7.59 2.82 6.47 13.41
69 qualitative analysis 4.35 45 3.00 6.08 4.15 4.83 7.42 0.06 6.26 4.18
70 racial discrimination 12.21 26 24.13 0.84 17.70 0.20 9.98 16.48
71 random sample 2.33 22 3.71 1.47 3.25 0.20 8.25 7.19
72 raw data 6.37 46 4.57 9.43 11.20 0.60 9.26 11.90
73 religious belief 9.24 71 6.00 15.51 15.72 0.60 8.56 0.99 9.93 2.78
74 renewable energy 2.78 14 3.57 0.21 0.36 6.84 10.30 7.87
75 ruling class 6.24 25 3.28 5.03 16.08 0.60 9.27 11.77
76 scientific research 5.79 54 5.85 4.61 7.23 5.23 5.75 11.15
77 significant impact 4.67 57 8.71 1.89 4.70 1.61 6.24 10.06
78 small proportion 3.63 53 3.71 1.68 3.61 5.43 7.29 8.94
79 social factors 12.12 57 12.99 16.56 15.17 3.22 5.01 15.92
80 social mobility 4.13 28 0.71 2.10 13.19 0.80 6.59 9.49
81 social status 8.39 58 3.14 17.82 13.91 0.60 4.82 13.19
82 solar energy 3.19 16 1.14 1.05 0.72 10.86 7.97 8.39
83 stark contrast 1.26 25 1.71 1.26 1.63 0.20 10.17 5.29
84 varying degree 6.51 85 6.57 5.87 8.85 4.42 8.68 4.76 7.00 6.82
85 whole range 8.08 64 8.85 6.29 12.46 3.82 6.52 13.27
86 wide range 48.29 258 49.39 32.91 54.37 54.71 8.78 1.56 21.02 13.74
Noun Noun
87 background knowledge 1.93 24 1.28 5.24 0.36 1.41 5.87 6.45
88 class consciousness 3.46 19 0.29 3.98 9.94 6.84 8.70
89 conflict resolution 2.15 20 2.86 0.21 3.79 1.21 7.65 6.89
90 data set 8.48 49 5.00 5.03 8.31 16.89 4.75 13.24
91 source material 3.63 35 0.71 11.32 2.17 2.01 5.38 8.78
* The value of standard deviation is added where different forms of collocations are combined, e.g. educational institution/institutions. It reflects the difference
between the two MI scores and t-scores respectively for combined entries.
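
For readers unfamiliar with the association measures reported in Table 4, the sketch below shows how MI, t-score and normed frequency are commonly computed from raw counts. It uses the standard textbook definitions (expected count from the two word frequencies, MI as a base-2 log ratio, t-score as the difference scaled by the square root of the observed count). The exact settings used for the ACL, such as the collocation window, are not stated in this section, so the function names and the example counts are hypothetical.

```python
import math


def expected_count(f_x, f_y, n_tokens):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_x * f_y / n_tokens


def mutual_information(observed, f_x, f_y, n_tokens):
    """MI: base-2 log of observed over expected co-occurrence."""
    return math.log2(observed / expected_count(f_x, f_y, n_tokens))


def t_score(observed, f_x, f_y, n_tokens):
    """t-score: (observed - expected) divided by sqrt(observed)."""
    return (observed - expected_count(f_x, f_y, n_tokens)) / math.sqrt(observed)


def per_million(observed, n_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return observed / n_tokens * 1_000_000


# Hypothetical counts, NOT taken from PICAE: a pair seen 100 times,
# with component frequencies 1,000 and 2,000, in a 25-million-token corpus.
O, FX, FY, N = 100, 1_000, 2_000, 25_000_000
mi = mutual_information(O, FX, FY, N)
t = t_score(O, FX, FY, N)
normed = per_million(O, N)
print(round(mi, 2), round(t, 2), round(normed, 2))
```

MI rewards exclusivity of the pairing and tends to favour rare pairs, whereas the t-score grows with raw frequency, which is one reason lists of this kind usually report both.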

In terms of verb combinations, again only those which received the greatest expert

agreement are listed here. Verb+noun and verb+adjective collocations are presented in Table

5, and verb+adverb collocations with variable sequences in Table 6. In comparison with

nominal collocations, verb collocations appear to have received much more research interest in

the past, particularly verb+noun collocations in second language learning. The assumption

appears to be that learners tend to encounter more difficulties in choosing verb collocates

correctly than any other type of collocation. Very little research has addressed other POS

combinations, particularly noun combinations, although they dominate the academic register.

This discrepancy, therefore, may require further research into learners’ collocational use to find

out if other types of collocations, in comparison with verb+noun collocations, also pose

challenges to learners.

The final category contains adverb+adjective combinations. Those with the highest

expert agreement are presented in Table 7. It is interesting to note that across Tables 4 to 7 there appears to be a range of recurring words, many of which belong to the same word families, e.g. increase/increasing/increasingly, significant/significantly. Take, for example, highly, the most frequent adverb in the ACL. It collocates with 20 different adjectives (e.g.

sophisticated, complex, critical) and six verb past participles (e.g. educated, charged,

developed). As for its adjective form high, it collocates with 22 nouns (e.g. level, profile) while

its comparative form higher collocates with two nouns (education, degree).

The implication for EAP pedagogy at the lexico-grammatical level is that despite a

seemingly large number of collocations in the ACL, the actual teaching and learning load

should be manageable as many of the words in the ACL are often part of a ‘recurrent frame’,

and understanding the frame will give learners a sense of familiarity when encountering new

collocations within the same or a similar frame. Therefore, one possible way of presenting the

entries in the classroom would be to show the frame, if any, of a node word, e.g. (be) highly +

vpp, and introduce the six verb past participles from the ACL together.
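
The idea of presenting entries by frame can be sketched as a simple grouping step. The example below groups entries by node word and the part of speech of the collocate. The tuple layout and the function name `group_by_frame` are assumptions made for illustration, and only the collocates named in the text above are included, so the groups are shorter than the full sets in the ACL.

```python
from collections import defaultdict

# (node word, node POS, collocate, collocate POS); examples named in the text.
entries = [
    ("highly", "adv", "sophisticated", "adj"),
    ("highly", "adv", "complex", "adj"),
    ("highly", "adv", "critical", "adj"),
    ("highly", "adv", "educated", "vpp"),
    ("highly", "adv", "charged", "vpp"),
    ("highly", "adv", "developed", "vpp"),
    ("high", "adj", "level", "n"),
    ("high", "adj", "profile", "n"),
    ("higher", "adj", "education", "n"),
    ("higher", "adj", "degree", "n"),
]


def group_by_frame(entries):
    """Group collocates under a frame keyed by (node word, collocate POS)."""
    frames = defaultdict(list)
    for node, _node_pos, collocate, collocate_pos in entries:
        frames[(node, collocate_pos)].append(collocate)
    return dict(frames)


frames = group_by_frame(entries)
for (node, pos), collocates in frames.items():
    print(f"{node} + {pos}: {', '.join(collocates)}")
```

A teacher could then present each printed line as one frame, for instance all past participles that follow highly, rather than as separate list entries.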

Table 5 Verb+noun and verb+adjective collocations with the highest combined score
No | Verb | Noun | Normed freq per million | No of texts | Normed AS | Normed HM | Normed SS | Normed NS | MI | StDev | t-score | StDev
1 achieve (a) goal 6.96 85 9.85 5.66 8.31 2.61 8.98 0.49 6.68 3.21
2 achieve (an) objective 6.19 73 13.56 3.56 3.79 1.01 7.92 0.94 6.64 1.54
3 cast doubt 1.75 32 0.71 3.35 2.35 1.01 10.07 6.24
4 make (a) living 5.07 40 1.71 1.89 2.71 15.49 4.44 1.33 5.44 2.88
5 make (a) prediction 3.99 61 2.71 2.31 5.42 5.83 5.35 1.61 18.31 2.17
6 meet criteria 3.28 53 5.00 2.72 2.35 2.41 6.95 1.12 4.01 1.62
7 meet (a) requirement 6.60 93 12.13 3.98 3.61 4.63 7.30 1.47 4.32 2.59
8 obtain (a) result 9.51 115 10.14 6.92 3.25 18.10 5.53 1.21 6.41 3.72
9 pose (a) question 7.22 121 5.85 7.76 10.84 4.63 7.65 0.38 5.03 1.25
10 provide (a) clue 1.80 28 1.71 2.31 1.81 1.41 8.02 0.43 4.18 2.19
11 take precedence 3.19 58 3.71 2.72 3.61 2.41 9.54 0.43 4.72 1.42
12 take responsibility 9.24 105 12.99 7.13 13.37 1.41 5.81 0.66 5.63 3.24
Verb Adjective
13 make explicit 11.71 174 7.85 21.17 14.99 4.42 6.67 0.30 7.80 2.03

Table 6 Verb+adverb collocations with the highest combined score


No | Verb | Adverb | Normed freq per million | No of texts | Normed AS | Normed HM | Normed SS | Normed NS | MI | StDev | t-score | StDev
1 differ significantly 3.68 58 3.85 2.72 2.53 5.63 8.63 0.69 4.80 1.84
2 expand rapidly 2.24 40 2.57 1.47 2.35 2.41 8.32 1.41 3.78 2.49
3 explore further 1.88 30 1.71 2.31 1.99 1.61 5.59 1.14 4.28 1.84
4 increase dramatically 2.92 49 3.00 1.89 2.35 4.42 7.63 0.72 3.89 1.97
5 vary widely 3.10 56 2.00 1.68 3.97 5.03 7.63 1.05 4.53 1.11
Adverb Verb
6 adversely affect 4.35 58 7.00 0.63 3.97 4.63 11.86 0.42 6.95 0.66
7 closely resemble 1.53 26 1.43 1.05 1.26 2.41 10.55 5.83
8 fully understand 9.38 127 11.99 8.17 8.49 7.84 7.18 0.88 8.24 1.09
9 generally agree 2.20 41 1.14 2.52 3.43 2.01 6.00 1.51 4.59 2.39
Adverb Verb (past participle)
10 (be) closely allied 1.39 15 2.43 1.68 0.54 0.60 10.17 5.56
11 (be) deeply rooted 1.66 30 0.86 2.93 2.53 0.60 11.32 6.08
12 (be) generally accepted 5.30 78 5.71 6.08 4.88 4.42 8.14 10.82
13 (be) generally considered 3.77 43 4.71 2.52 3.43 4.02 6.03 9.02
14 (be) inextricably linked 2.47 38 2.71 1.47 3.07 2.41 12.05 7.41
15 (be) well established 14.32 131 12.13 16.98 14.09 15.08 6.43 17.65
16 (be) widely accepted 5.25 77 3.43 6.71 7.41 4.02 9.34 10.80
17 (be) widely used 18.49 104 18.27 10.48 11.20 34.59 7.76 20.20

Table 7 Adverb+adjective collocations with the highest combined score


No | Adverb | Adjective | Normed freq per million | No of texts | Normed AS | Normed HM | Normed SS | Normed NS | MI | t-score
1 ever increasing 3.37 52 3.14 3.56 4.52 2.21 7.71 8.62
2 hardly surprising 3.05 43 2.43 3.98 3.61 2.41 10.79 8.24
3 increasingly important 5.07 60 6.71 1.68 4.88 6.23 5.86 10.45
4 mutually exclusive 5.56 70 4.00 8.80 7.23 2.82 13.24 11.13
6 radically different 4.67 71 1.71 9.01 6.14 3.02 7.76 10.15
7 relatively few 6.15 76 5.71 2.72 9.75 6.03 7.10 11.62
8 relatively high 6.33 65 5.85 4.40 6.32 8.85 6.12 11.70
9 relatively little 7.18 88 4.14 6.71 13.19 5.23 7.16 12.56
10 relatively simple 4.53 60 4.28 5.87 2.17 6.23 7.02 9.97
11 slightly different 9.60 118 9.42 8.80 7.95 12.47 7.35 14.54

Table 8 Examples rejected due to low combined score


No | Component I | Component II | Normed freq per million | No of texts | Normed AS | Normed HM | Normed SS | Normed NS | MI | t-score
1 use detection 1.17 7 2.43 0.21 0.54 1.01 5.18 4.96
2 access control 1.62 13 3.14 0.21 0.72 1.81 3.99 5.62
3 imagined community 1.12 14 0.29 1.05 2.35 1.01 7.02 4.96
4 individual criminal 1.08 5 2.71 0.21 0.54 0.20 4.59 4.69
5 biological basis 1.26 17 0.43 0.21 3.97 0.40 6.18 5.22

3.2 Validation

In order to determine whether the ACL is representative of the academic register in comparison with non-academic registers, a validation study was conducted to investigate the list’s overall coverage of its source corpus as well as of a comparably sized general corpus. The general corpus of 25 million tokens was compiled from the BNC and includes imaginative writing (i.e. literary and creative works) and informative writing (i.e. the leisure component).

For the purpose of the analysis, each ACL entry underwent inflection expansion. In

other words, the entry was classed as the root form with all possible variants added. For

example, ‘close | adj | relationship | n’ expands to the following format:

close | adj | relationship | n

closer | adj | relationship | n

closest | adj | relationship | n

close | adj | relationships | n

closer | adj | relationships | n

closest | adj | relationships | n
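
The inflection expansion shown above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the procedure used in the validation study: the `INFLECTIONS` table is hand-written and hypothetical (a full run would need a morphological lexicon covering every word in the list), and the function name `expand_entry` is invented for this example.

```python
from itertools import product

# Hypothetical, hand-written inflection table keyed by (root form, POS).
INFLECTIONS = {
    ("close", "adj"): ["close", "closer", "closest"],
    ("relationship", "n"): ["relationship", "relationships"],
}


def expand_entry(entry):
    """Expand 'w1 | pos1 | w2 | pos2' into all combinations of listed variants."""
    w1, pos1, w2, pos2 = [part.strip() for part in entry.split("|")]
    forms1 = INFLECTIONS.get((w1, pos1), [w1])  # fall back to the root form
    forms2 = INFLECTIONS.get((w2, pos2), [w2])
    # Iterate over the second word's forms in the outer position so that
    # the rows come out in the same order as the listing above.
    return [f"{a} | {pos1} | {b} | {pos2}" for b, a in product(forms2, forms1)]


rows = expand_entry("close | adj | relationship | n")
for row in rows:
    print(row)
```

With three adjective forms and two noun forms, the entry expands to six rows, matching the listing above.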

If the first collocate contained optional words, e.g. test (a) | v | theory | n , then this

was expanded into two separate rows as shown below. The POS for the second row was

classed ngram as multiword items are not tagged with a single POS. When searching for such

multiword items, the POS for the constituent tokens were not considered.

test | v | theory | n

test a | ngram | theory | n
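
The treatment of optional words can be sketched in the same way. The example below handles only the case described above, an optional word in parentheses inside the first collocate. The regular expression and the function name `expand_optional` are assumptions for illustration.

```python
import re

# Matches a parenthesized optional word such as "(a)".
OPTIONAL = re.compile(r"\(([^)]+)\)")


def expand_optional(entry):
    """Split an entry whose first collocate has an optional word into two rows."""
    w1, pos1, w2, pos2 = [part.strip() for part in entry.split("|")]
    if OPTIONAL.search(w1) is None:
        return [f"{w1} | {pos1} | {w2} | {pos2}"]
    # Row 1: drop the optional word and keep the original POS tag.
    without = " ".join(OPTIONAL.sub("", w1).split())
    # Row 2: keep the optional word; the multiword item is tagged 'ngram'.
    with_opt = " ".join(w1.replace("(", "").replace(")", "").split())
    return [
        f"{without} | {pos1} | {w2} | {pos2}",
        f"{with_opt} | ngram | {w2} | {pos2}",
    ]


rows = expand_optional("test (a) | v | theory | n")
for row in rows:
    print(row)
```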

This extended list, however, included neither collocations with flexible positions of their components, e.g. conduct research | research conducted; significantly correlated | correlate significantly, nor collocations with variable gaps, e.g. consider these/such/specific issues. It would have been beyond the scope of the validation to attempt to include all possible variations for the collocations in question.

The results show that the overall coverage of the ACL in the source corpus amounts to

1.4% and for the general corpus to 0.1%. This suggests that the ACL has a 14-times higher

coverage in an academic corpus than in a general corpus of similar size. It is interesting to

note that Coxhead (2000) found a similar ratio of the AWL coverage in her source corpus

(10%) and a corpus of fiction (1.4%). These findings underline the importance that should be

placed on teaching and learning these collocations in EAP.
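
One way to picture the coverage figures is as the share of corpus tokens that fall inside a matched collocation. The sketch below makes that assumption explicit. How coverage was actually operationalized in the validation study is not described in this section, so the function `coverage`, the adjacent-pair matching and the toy sentences are all hypothetical; only the two reported percentages come from the text.

```python
def coverage(tokens, collocations):
    """Share of tokens that sit inside a matched adjacent two-word collocation."""
    covered = set()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in collocations:
            covered.update((i, i + 1))  # mark both token positions
    return len(covered) / len(tokens) if tokens else 0.0


# Toy data for illustration only.
acl_sample = {("further", "research"), ("economic", "growth")}
academic = "further research is needed on economic growth".split()
general = "she walked slowly along the empty beach at dawn".split()

academic_cov = coverage(academic, acl_sample)
general_cov = coverage(general, acl_sample)
print(f"academic: {academic_cov:.1%}, general: {general_cov:.1%}")

# Ratio of the two coverage figures reported in the text (1.4% vs 0.1%).
reported_ratio = 1.4 / 0.1
print("reported ratio:", round(reported_ratio))
```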

4 Discussion

As described in the methodology section, the ACL was compiled using quantitative

analysis followed by qualitative refinement. During the process of manual vetting and

systematization the researchers had to constantly check the selection criteria and adapt them.

For example, at the outset the intention was to include two-word collocations only. Yet after

reviewing the data from the computational analysis, it was decided to keep those with

extended combinations such as deal xx issue, lead xx conclusion, largely based as they are of

pedagogical value. Consulting concordance lines from the source corpus allowed the

researchers to identify the most frequent pattern, which was chosen for the final format. The

above examples were then completed as deal with an issue, lead to the conclusion, and (be)

largely based (on). Very often the addition of words included: articles added to verb+noun

combinations, e.g. raise (an) issue, copula be added to adverb+vpp combinations if they can

be used as predicate, e.g. (be) widely recognized, prepositions added to verb or noun

combinations, e.g. rely heavily (on) and (a) high proportion (of), or a combination of two of

the above conditions, e.g. (be) greatly influenced (by). Providing the most frequent pattern

increases the list’s usefulness and applicability for teachers and learners alike.

One major challenge of compiling this list lay in the fact that there seems to be no

absolute definition of collocation and this is particularly true for some word combinations. The

moderate inter-rater reliability among the expert panel reported in the methodology section

suggests that which entry qualifies as an academic collocation for pedagogical purposes may

be interpreted rather differently. As discussed in the introduction, collocations represent a

continuum of formulaicity and semantic opaqueness, and at the earlier stage, it was decided to

remove combinations at the two extreme ends, which are either fixed expressions (e.g. global

warming) or semantically transparent combinations (e.g. first issue). Yet there is no absolute

dividing line to determine what ‘fixedness’ or ‘transparency’ is. Take the entry academic life for

example. At the stage of expert judgement, it received a wide range of scores from 1

(exclude) to 4 (include), which indicates the disagreement among experts. Whether it is

semantically transparent may be debatable, yet it was accepted as an entry in the list because

in addition to meeting all quantitative criteria, this entry still received a moderate total score of

15 from the experts. A comparison of the examples that received a total score below 10 (Table 8) with those that received the highest expert scores (Tables 4 to 7) shows why we believe that adopting expert judgement in selecting the entries, and utilizing the panel’s feedback in tasks such as systematizing the entries, has contributed significantly to the quality of this listing, which could not have been achieved by relying solely on computational analysis.

Nonetheless, three issues which required further manual scrutiny became apparent after

the expert review. First of all, as the words processed by the computer were not lemmatized,

nouns with singular and plural forms or inflected verbs were processed separately despite

having identical collocates and hence some entries were duplicated. To rectify this issue the

original computational output was checked again manually, and duplicates such as

fundamental assumption/fundamental assumptions were combined. For noun combinations

where both singular and plural forms were present, only the more frequent form was retained,

e.g. the singular form fundamental assumption, but the plural form common characteristics.

For verb combinations, inflected forms were converted to the infinitive, e.g. deny access

instead of denied access or denying access. For combinations with verb past participles (vpp),

it proved more complicated as they often occur in a variety of contexts – modifying nouns as

adjectives (see Example 1) or collocating with copula be or linking verbs as complement (see

Examples 2-3). It is also possible that a main verb shares the same form with its verb past

participle, in which case the current automated corpus-driven approach is unable to distinguish

them (compare Examples 3 and 4). The variable position of the adverb collocate adds even more

complexity to this matter (see Example 5). To solve this issue the following policy was

adopted: if the vpp forms generally function as complement in the concordance lines as in

Examples 2-3, then an optional copula be would be added. In addition, only the most frequent

pattern was kept; therefore, directly involved is included as opposed to involved directly, which

occurs less frequently.

Examples (as retrieved from the corpus)

(1) We use a well established procedure to identify the regions…

(2) Firstly, it is now well established that investigations into human rights abuses …

(3) … that the two leaders will have to become directly involved in talks rather than

acting through negotiators…

(4) Only those negative outcomes that directly involved the pupil were included.

(5) When the nurse has not been involved directly in the administration of the

medication…
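
The merging policy for singular and plural duplicates described above (combine the variants, keep the more frequent surface form) can be sketched as follows. The frequencies and the `lemma_of` mapping are hypothetical, and summing the variant frequencies is an assumption of this sketch. In the study itself this step was carried out manually.

```python
from collections import defaultdict


def merge_number_variants(freqs, lemma_of):
    """Merge entries that differ only in noun number; keep the commoner form."""
    groups = defaultdict(list)
    for (w1, w2), freq in freqs.items():
        # Entries sharing the first word and the noun lemma belong together.
        groups[(w1, lemma_of.get(w2, w2))].append(((w1, w2), freq))
    merged = {}
    for variants in groups.values():
        best_form, _ = max(variants, key=lambda item: item[1])
        # Assumption: the merged entry carries the summed frequency.
        merged[best_form] = sum(freq for _, freq in variants)
    return merged


# Hypothetical raw frequencies, for illustration only.
freqs = {
    ("fundamental", "assumption"): 30,
    ("fundamental", "assumptions"): 20,
    ("common", "characteristic"): 10,
    ("common", "characteristics"): 25,
}
lemma_of = {"assumptions": "assumption", "characteristics": "characteristic"}

merged = merge_number_variants(freqs, lemma_of)
print(merged)
```

With these invented counts the singular form is retained for fundamental assumption and the plural form for common characteristics, mirroring the two examples in the text.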

The second issue was that some combinations were excluded because the individual

inflective forms with collocates failed to meet the frequency threshold, yet their combined

frequency is actually higher than the cut-off frequency. The expert panel suggested a number

of such collocations that were not covered in the reviewed list. The researchers then

scrutinized the original data-driven list of 130,000+ entries again in search of any collocations that had been missed as a result of inflections. A further 239 combinations (9.7%; e.g. differ considerably or exercise authority) were added to the list at this

stage. The above tasks were very time-consuming, yet such manual scrutiny was required in

order to further improve the quality of the listing.
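
The second issue, where each inflected form falls below the cut-off but the forms together exceed it, can be illustrated with a small check. This is a hedged sketch only: the counts, the threshold value, the `lemma_of` mapping and the function name are hypothetical, and the researchers performed this search manually.

```python
def recover_by_combined_frequency(form_freqs, lemma_of, threshold):
    """Find lemma pairs missed per form but above the threshold when combined."""
    combined = {}
    highest_single = {}
    for (w1, w2), freq in form_freqs.items():
        key = (lemma_of.get(w1, w1), lemma_of.get(w2, w2))
        combined[key] = combined.get(key, 0) + freq
        highest_single[key] = max(highest_single.get(key, 0), freq)
    # Keep pairs where no single form reached the cut-off but the sum does.
    return {
        key: total
        for key, total in combined.items()
        if highest_single[key] < threshold <= total
    }


# Hypothetical raw counts and cut-off, for illustration only.
form_freqs = {
    ("differ", "considerably"): 15,
    ("differs", "considerably"): 8,
    ("differed", "considerably"): 7,
    ("deny", "access"): 26,   # already above the cut-off on its own
    ("denied", "entry"): 3,   # below the cut-off even when combined
}
lemma_of = {"differs": "differ", "differed": "differ", "denied": "deny"}

recovered = recover_by_combined_frequency(form_freqs, lemma_of, threshold=25)
print(recovered)
```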

The final issue involves whether highly sensitive entries should be included in the ACL.

The principal researchers decided to exclude entries such as mental retardation as some

experts advised that these entries could be potentially offensive.

One interesting finding is the dominance of nominal combinations in the list. This might

indicate that nouns have a greater tendency to collocate than other word classes, a possibility that would require further analysis. It might also reflect the written register, particularly scientific writing, which is characterized by heavy nominalization indicating high information content. For example, Biber and Gray (2010: 2) found that

academic writing is ‘structurally “compressed”, with phrasal (non-clausal) modifiers embedded

in noun phrases’. Fang, Schleppegrell & Cox (2006) reported that the extensive use of nouns

and nominal expressions in academic text can pose great challenges for comprehension (for

the discussion of nominalization in the written register, see also Halliday, 1985; Quirk,

Greenbaum, Leech and Svartvik 1985). Yet very little research has been conducted on

adjectival or nominal collocations in academic English. One of the few exceptions is a study

looking at intensifying adjectives such as important or different conducted by Lorenz (1998).

Lorenz’ study, however, focuses on the contrastive use of adjective intensification between

native and non-native writing rather than the use of academic collocations per se.

One might object that the listing has undergone various stages of filtering and that many

potentially useful and valuable combinations might have been removed from the final ACL

during this process. However, we argue that providing too much information - in this case too

many collocational entries - can be overwhelming for learners. Research has shown that even

with collocations dictionaries at hand, students still struggle to find correct verb collocates in a

given task (Dziemianko, 2010; Laufer, 2011). We believe that the ACL, composed of carefully

selected collocations, can serve as part of a lexical syllabus to raise learners’ awareness of

word co-occurrence and help them prioritize the learning of lexical items.

5 Conclusion

The Academic Collocation List comprises 2,468 entries. Our definition of collocations

refers to word combinations which co-occur more frequently than by chance across academic

disciplines (hence corpus-driven) and are pedagogically relevant in an EAP context (hence

expert-judged). Within the scope of this definition, we primarily focused on lexical collocations

as they contain certain variability and are thus more dynamic while grammatical collocations or

idioms consist of comparatively fixed patterns and are consequently more predictable. The

former is more challenging for learners to master whereas the latter can generally be treated

as holistic units and learners can more easily internalize the usage into their lexicon.

The ACL was compiled using a mixed-method approach of combining computational

analysis of the source corpus with expert judgement and systematization. As pointed out

above, existing corpus-driven multi-word lists often fail to provide immediately usable resources

for language learning, and it is only with expert intervention that raw data can be filtered and

refined in order to extract the most informative and meaningful entries. In the case of the ACL,

the statistical information served as an important reference, but in addition each collocation

included in the final list had been subjected to expert judgement and manual refinement. The

advantage of relying not only on computational analysis but also on human intervention is that

the data-driven approach ensures that important combinations are not missed while expert

judgement ensures that the final entries are appropriate and relevant for EAP.

This approach led to an academic collocation list that will be of much greater use to

language learners and EAP teachers alike. In addition to the Academic Word List and the

Academic Formulas List, the ACL provides a further tool for EAP teachers to construct

appropriate teaching materials and help students focus on frequent lexical items beyond

individual words.

As research has shown, collocations are difficult to learn and retain even with the

assistance of dictionaries. The list can therefore be used to support learning by drawing

attention to collocation per se. By subdividing the ACL, students and teachers can focus

exclusively, for example, on adjective+noun combinations, on common frames or collocation

families. Explicit teaching needs to be complemented by providing students with the

opportunity to encounter collocations when dealing with academic texts. Combining both

explicit and implicit teaching will enhance learners’ receptive and productive collocational

knowledge and thus improve their academic English proficiency. EAP material writers may also

benefit from integrating the ACL into the syllabus.

In terms of directions for future research, as Hyland and Tse (2007) suggested, further

research into an extended ACL may be required to highlight specific collocations that are

exclusively frequent in individual fields of studies. We also propose that future research should

aim at categorizing the entries of the ACL on the basis, for example, of semantics or discourse

functions. For instance, the hedging function of some of the entries, e.g. virtually impossible,

relatively stable, is one feature that came to light during the compilation process. Types of

errors or underuse/overuse in learner language may be identified to help students improve

their academic proficiency as well.

References

Altenberg, B. & Granger, S. (2001). The grammatical and lexical patterning of MAKE in native

and non-native student writing. Applied Linguistics, 22(2), 173–195.

Bahns, J. & Eldaw, M. (1993). Should we teach EFL students collocations? System, 21(1), 101-

114.

Benson, M. (1985). Collocation and idioms. In R. Ilson (Ed.) Dictionaries, lexicography and

language learning (pp. 61-68). Oxford: Pergamon Press.

Biber, D. & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H.

Hasselgard & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp.

181-190). Amsterdam: Rodopi.

Biber, D., Conrad, S. & Cortes, V. (2004). If you look at ...: Lexical bundles in university

teaching and textbooks. Applied Linguistics, 25(3), 371–405.

Biber, D. & Gray, B. (2010). Challenging stereotypes about academic writing: Complexity,

elaboration, explicitness. Journal of English for Academic Purposes, 9(1), 2-20.

Biskup, D. (1992). L1 influence on learners’ renderings of English collocations. A Polish/

German empirical study. In P.J.L. Arnaud & H. Béjoint (Eds.), Vocabulary and Applied

Linguistics (pp. 85-93). London: Macmillan.

Chen, Y. H. & Baker, P. (2010). Lexical bundles in native and non-native academic writing.

Language Learning & Technology, 14(2), 30-49.

Cobb, T. (2003). Analyzing late interlanguage with learner corpora: Quebec replications of

three European studies. Canadian Modern Language Review, 59(3), 393-423.

Cowie, A. P. (1981). The treatment of collocations and idioms in learners’ dictionaries. Applied

Linguistics, 2(3), 223-235.

Cowie, A. P. (1994). Phraseology. In R. E. Asher (Ed.), The encyclopedia of language and

linguistics. Vol. 6 (pp. 3168-71). Oxford: Pergamon.

Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.

Durrant, P. (2009). Investigating the viability of a collocation list for students of English for

academic purposes. English for Specific Purposes, 28(3), 157-169.

Dziemianko, A. (2010). Paper or electronic? The role of dictionary form in language perception,

production, and the retention of meanings and collocation. International Journal of

Lexicography, 23(3), 257-273.

Fang, Z., Schleppegrell, M. & Cox, B. (2006). Understanding the language demands of

schooling: Nouns in academic registers. Journal of Literacy Research, 38(3), 247-273.

Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae.

In A. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145-160). Oxford:

Oxford University Press.

Granger, S. & Meunier, F. (Eds.), (2008). Phraseology: An interdisciplinary perspective.

Amsterdam: John Benjamins.

Granger, S. & Paquot, M. (2008). Disentangling the phraseological web. In S. Granger &

F. Meunier (Eds.), Phraseology: An interdisciplinary perspective. Amsterdam & Philadelphia:

John Benjamins.

Halliday, M. A. K. (1985). An introduction to functional grammar. London: Continuum.

Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.

Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1),

24-44.

Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.

Hyland, K. (2008). Academic clusters: Text patterning in published and postgraduate writing.

International Journal of Applied Linguistics, 18(1), 41–62.

Hyland, K. & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2),

235-253.

Laufer, B. (2011). The contribution of dictionary use to the production and retention of

collocations in a second language. International Journal of Lexicography, 24(1), 29-49.

Laufer, B. & Waldman, T. (2011). Verb-Noun collocations in second language writing: A corpus

analysis of learners' English. Language Learning, 61(2), 647-672.

Lorenz, G. (1998). Overstatement in advanced learners' writing: stylistic aspects of adjective

intensification. In S. Granger (Ed.), Learner English on computer (pp. 53-66). London and

New York: Addison Wesley Longman Limited.

Martinez, R. & Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33(3), 299-

320.

Nation, I.S.P. (2001). Learning vocabulary in another language. Cambridge: Cambridge

University Press.

Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some

implications for teaching. Applied Linguistics, 24(2), 223–242.

Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: John Benjamins.

Quirk, R., Greenbaum, S., Leech, G. & Svartvik, J. (1985). A comprehensive grammar of the

English language. London: Longman.

Schmitt, N. (2004). Formulaic sequences: Acquisition, processing, and use. Amsterdam: John

Benjamins.

Simpson-Vlach, R. & Ellis, N. C. (2010). An academic formulas list: New methods in

phraseology research. Applied Linguistics, 31(4), 487-512.

Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.

Tognini-Bonelli, E. (2001). Corpus linguistics at work: Studies in corpus linguistics, Vol. 6.

Amsterdam: John Benjamins.

West, M. (1953). A general service list of English words. London: Longman.

Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford University Press.

Acknowledgements

We thank Douglas Biber, Regents' Professor in the Applied Linguistics Program at Northern

Arizona University, and Bethany Gray, Assistant Professor in the Department of English and

the TESL/Applied Linguistics Program at Iowa State University, for conducting the

computational analysis of the source corpus.

We would like to express our gratitude to Andrew Roberts, Computational Linguist, for tagging

the initial collocation list and conducting the validation study of the Academic Collocation List.

We are also grateful to the members of the expert panel: David Crystal, Honorary Professor of

Linguistics, University of Bangor; Geoffrey Leech, Emeritus Professor of English Linguistics,

Lancaster University; Diane Schmitt, Senior Lecturer in EFL/TESOL, Nottingham Trent

University; Della Summers, Dictionary Consultant; Professor Lord Randolph Quirk, FBA; and

Chris Fox, Managing Editor, Pearson.

We would also like to thank Mike Mayor, Editorial Director, Dictionaries & Reference,

Pearson for contributing valuable advice and John H.A.L. De Jong, Senior Vice President,

Standards and Quality Office, Pearson for his support throughout the project.

Last but not least, we thank the anonymous reviewers for their helpful and insightful

comments and suggestions.

The complete Academic Collocation List can be accessed via the following link:

http://www.pearsonpte.com/research/Pages/CollocationList.aspx
