Aijmer & Altenberg - Advances in Corpus Linguistics
Corpus linguistics has made spectacular advances since the early 1960s when
computer corpora were first made available for research. The use of corpora has
spread to practically every branch of linguistics and has become indispensable in
many practical applications of linguistic research, from lexicography and
terminology extraction to information retrieval and computer-assisted translation.
Corpora have become bigger and more diversified: apart from large general-
purpose corpora, a number of specialised corpora are now being used for research
in such areas as historical linguistics, sociolinguistics, dialectology, LSP,
interlanguage research, contrastive linguistics and translation studies. In addition,
CD-ROM newspaper collections and the Internet have become increasingly
important resources for language study. Hand in hand with these developments a
variety of research tools have been created for exploring, annotating and
processing language data in various ways.
However, the most important achievement of corpus linguistics is
undoubtedly that it has put the use of language at the centre of linguistics. In
theoretical as well as practical approaches to language, computer corpora have
placed linguistics on a firm empirical footing, emphasising the functional and
communicative basis of language.
This volume contains twenty-two papers presented at the 23rd
International Conference on English Language Research on Computerized
Corpora of Modern and Medieval English (ICAME) held at Göteborg, Sweden, in
May 2002. They cover a wide range of topics and, though few of them represent
the technical or computational side of the discipline, they illustrate clearly the
diversity of research that is characteristic of corpus linguistics today. The
contributions have been divided into six broad and inevitably overlapping
categories, under the following headings:
We have chosen to call the volume Advances in Corpus Linguistics. This may
seem a bold title, as it suggests a systematic account of recent developments in
the field. However, advances in linguistics seldom take the form of big leaps.
2 Karin Aijmer and Bengt Altenberg
This is particularly true of corpus linguistics where each study can be seen as a
small step in the expansion of a vast and complex discipline, whether the focus is
theoretical, descriptive or methodological. Corpus linguistics is a constantly
changing field and ICAME conferences generally provide a good reflection of
this. The theme of the 2002 conference was The Theory and Use of Corpora. In
what ways can the present volume be said to represent advances in these
respects? Rather than presenting a summary of the individual contributions, we
will try to point out some issues and tendencies that we think are characteristic of
the volume as a whole.
The role of corpus linguistics and the relationship between data and theory
have been debated ever since the rise of corpus linguistics. The debate is also
clearly reflected in the present volume. That there is a need for such a debate may
suggest that corpus linguistics has not advanced in the past decades, but it can
also be regarded as a sign of the vitality of the field. A constant re-examination of
the goals of corpus linguistics and a critical discussion of theoretical and
methodological questions are necessary if corpus linguistics is to make significant
progress in the future.
The following issues are brought up for discussion in the first three
programmatic articles of the volume:
These are old questions in corpus linguistics. Although they are largely
methodological in character, they all have theoretical relevance. They are also
closely related.
The transcription of speech and the grammatical annotation of corpora
both involve imposing an analysis of the corpus data. This also means that they
allow the researcher's intuition and, in the case of annotation, a preconceived
theoretical model to play a role at an early stage in the research process. But as
Michael Halliday points out in his contribution, transcription and annotation are
different in nature: while transcription of prosodic features provides an essential
part of the meaning of spoken discourse, grammatical annotation adds a
received linguistic description to the data, a description that may be incomplete,
obsolete or incorrect and therefore bound to distort the analysis before it has
started. Halliday recognises the problems involved in prosodic transcription but
also emphasises the desirability of marking such meaningful features as
intonation and rhythm. John Sinclair makes a distinction between mark-up and
annotation and argues that both should be kept separate from the raw text.
According to Sinclair, annotation should be avoided except in corpora used for
Introduction 3
Although corpus studies have a natural starting point in data, Leech objects to the
common assumption that corpus linguistics is concerned with mere data
collection and description. Explaining usage (or, in Leech's case, changes in
usage) inevitably involves theoretical considerations. The explanation of usage
may be language-internal or language-external, i.e. motivated by social factors.
As Leech demonstrates, corpus linguistics is naturally suited to usage-based
conceptions of linguistics which (unlike the Chomskyan paradigm) assume that
there is a bridge between the study of naturally-occurring data and the cognitive
and social workings of language (p 78).
Another problem that is no doubt familiar to most corpus linguists but
seldom discussed is the fact that corpus research is by necessity biassed in the
direction of lexis. Corpora are organised lexically and accessed via the
orthographic word. As a result, phenomena at the lexical end of the
lexicogrammatical continuum are more accessible than those at the grammatical
end. As Halliday points out, this problem is especially acute in the study of the
spoken language where meaning is more highly grammaticalized and more
covert. What is needed, according to Halliday, are ways of designing a corpus for
the study of phenomena at the grammatical end of the continuum. This need is
especially great in the area of spoken language where, prototypically, meaning is
made and the frontiers of meaning potential are extended (p 11).
To judge from the present volume, Halliday's appeal for research on the
grammar of spoken discourse is warranted. The great majority of the studies
represented in the volume either focus on the lexical end of the continuum or
explore grammar or text via lexis. Moreover, few of them are specifically devoted
to spoken discourse as such. Two exceptions are Bernard De Clerck's
examination of the pragmatic function of let's in the spoken part of ICE-GB and
Clive Souter's study of children's vocabulary in the Polytechnic of Wales (PoW)
Corpus. However, the focus of De Clerck's investigation is on the functional
variation of let's utterances in different speech categories and it is rather an
example of another important use of corpora, viz. the exploration of language
variation. Similarly, Souter's aim is to demonstrate the usefulness of a small but
richly annotated corpus for studies of children's vocabulary development and, in
particular, how this is affected by such extra-linguistic factors as sex and age.
Several other contributors explore register or regional variation. Jonathan
Charteris-Black uses corpora to compare metaphors in British and American
political discourse and Peter Tan, Vincent Ooi and Andy Chiang investigate the
spoken character of personal advertisements placed on the Web by ESL
speakers in South East Asia, using spoken and written portions of the Singapore
component of ICE as a standard of comparison.
More directly concerned with the structure of discourse are the papers by
Michael Hoey and Hilde Hasselgård. Both argue that corpus-linguistic techniques
can be used to study patterns in text. However, their starting points are different.
Hoey claims that every lexical item is primed for use in textual organisation
(p. 174) and consequently examines textual patterns via lexis. Hasselgård, on the
other hand, starts with a grammatical construction. Her paper investigates the
discourse and information structural functions of it-cleft constructions with an
adverbial in focus position.
As mentioned, the corpus linguist now has access to a wide variety of
corpora, ranging from very large corpora (the Cobuild Bank of English, the
British National Corpus) and carefully designed and annotated million-word
corpora in the tradition of the Brown and LOB corpora (e.g. Frown, FLOB and
the regional variants of ICE) to various smaller corpora collected for specific
purposes (e.g. the Helsinki Corpus, the ICLE corpus, the PoW Corpus). Many of
these are tagged and parsed, permitting the user easy retrieval of specific
grammatical categories. In addition, there is a rapidly growing number of
multilingual corpora with English as one of the languages compared. The
usefulness of all these types of corpora is amply illustrated in the present volume.
Yet, for certain purposes (in particular the study of specific domains or
genres that are absent from, or insufficiently represented in, the general-purpose
corpora) the researcher has to collect his/her own corpus. Here material available
on the Web has proved to be a useful additional resource. No less than six of the
contributions to the present volume make use of such material (Charteris-Black,
Kübler, Renouf et al., Tan et al., Hoey, Tognini Bonelli and Manca). However,
using the Web as an unrestricted language resource presents several problems. As
Antoinette Renouf, Andrew Kehoe and David Mezquiriz point out in their
contribution, the nature of the Web as a random accumulation of heterogeneous
texts, many being less conventionally text-like, poses problems for the corpus
linguist who tries to access it through existing search engines (p. 403). Reporting
on a project designed to develop a user-friendly and more selective search tool for
the Web (WebCorp), they discuss some of the difficulties involved and how these
might be overcome or reduced. Their report is the only contribution representing
software development in the volume.
The volume also illustrates a variety of methodological approaches and, in
particular, that the choice of method is to a large extent determined by the
purpose of the study. One well-established method is the use of concordances,
which reveal syntagmatic lexicogrammatical patterns and make it possible
for the researcher to classify and describe the data in general theoretical terms.
This approach is of course especially useful in studies focusing on the lexical end
of the continuum and when the researcher knows which word or expression to
start from. Sometimes, however, there is no obvious lexical starting point. A case
in point is the study of metaphor. In this case the researcher first has to make an
educated guess about which lexical items are likely to serve as vehicles of
metaphors of a certain type (e.g. body parts, terms of war, etc), make a tentative
list of potentially rewarding items, and adjust the list after pilot searches in the
selected corpus material. An example of a study based on such intuitive
sampling is Charteris-Black's comparison of metaphors in British and American
political discourse mentioned above.
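
The two procedures described in this paragraph can be illustrated with a minimal sketch. It is not taken from any of the contributions: the function name kwic, the sample sentence and the seed list are all hypothetical, and a real study would run over a full corpus rather than a single string. The first part builds keyword-in-context lines for a known node word; the second mimics a pilot search, in which a tentative list of metaphor vehicles is trimmed to those items that actually occur.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for each whole-word match of `node`."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the node words line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample; a real study would read corpus files instead.
sample = ("The government will fight inflation. We must fight for jobs, "
          "and the fight against poverty goes on.")

for line in kwic(sample, "fight", width=25):
    print(line)

# Pilot search: start from an intuitive list of candidate vehicles
# (assumed here) and keep only those attested in the material.
seed_items = ["fight", "battle", "heart"]
hits = {word: len(kwic(sample, word)) for word in seed_items}
kept = [word for word, n in hits.items() if n > 0]
print(kept)
```

The concordance lines still have to be read and classified by the researcher; the code only retrieves and aligns them.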
Another example of the methodological problems facing the corpus
linguist is Thomas Kohnen's investigation of the history of English directive
speech acts. With speech acts there is no predictable link between form and
function and consequently no systematic and reliable way of retrieving relevant
forms. Kohnen gives a summary of some of the methodological problems
involved and advocates a procedure called structured eclecticism. The method
implies the deliberate selection of typical patterns, such as the use of the
imperative or a performative clause, which are then traced throughout the history
of English. Kohnen's diachronic study is also a good illustration of how several
corpora can be combined to throw light on linguistic change (in Kohnen's case
the Helsinki Corpus, the electronic version of the Middle English Dictionary and
the Brown and LOB corpora). Another illustration is Geoffrey Leech's
examination of recent changes in English grammar on the basis of data from six
corpora spanning the last four decades of the 20th century.
However, diachronic change can also be demonstrated on the basis of
synchronic variation in recent corpora. Liselotte Brems investigates signs of
delexicalization and synchronic grammaticalization revealed by patterns in the
use of measure nouns in the Cobuild Corpus. In a similar fashion, Göran Kjellmer
combines information from the OED with indications of synchronic variation in
recent corpora (the Cobuild Corpus and the BNC) to explain referential changes
of reflexive pronouns through the centuries.
Contrastive studies based on multilingual corpora require special
methodologies of their own. Here the languages compared serve as mirror images
of each other, highlighting cross-linguistic differences and similarities. For those
concerned with contrastive lexicology, such as Helge Dyvik and Åke Viberg,
translation corpora clearly reveal such phenomena as overlapping polysemy,
diverging meaning extensions and language-specific lexical relations (synonymy,
hyponymy, etc). The procedure used in these studies is truly corpus-driven,
although theoretical frameworks guide the analysis at different stages. For Anna-
Lena Fredriksson, who investigates the notion of clausal theme in an English-
Swedish perspective, parallel corpus data help to define a tertium comparationis
and to identify a cross-linguistic theoretical model.
In contrastive research based on corpora of comparable texts from
different languages (rather than translations) the method has to be different. Here
a comparison must be made between typical expressions of concepts and
functions used in comparable situations in the compared languages. This is well
illustrated in Elena Tognini Bonelli's and Elena Manca's comparison of
meanings encoded in English and Italian descriptions of farmhouse holidays on
the Web.
Natalie Kübler's contribution is also cross-linguistic in character but has a
more clearly defined applied purpose. It reports on an experiment in corpus-
driven learning in the area of cross-linguistic lexicography. Trawling the Web by
means of the WebCorp tool (described by Renouf et al) and comparing the results
with data from multilingual corpora, students are taught to evaluate different
methods and sources for the purpose of building customised dictionaries for
machine translation.
Interlanguage studies on the basis of learner corpora such as the International
Corpus of Learner English (ICLE) also require a special contrastive methodology.
Patterns of usage in the learners' production that deviate from those of native
English writers may be due to contrastive differences between the learners' L1 and
the target language. Conversely, contrastive differences can be used to formulate
hypotheses about interlanguage problems that can be checked against data in learner
corpora. As a result, research on learner corpora generally requires comparisons with
corpora representing both the learners' native language and the target language. This
is well illustrated in Roumania Blagoeva's study of the use of demonstrative
pronouns by advanced Bulgarian learners of English.
Corpus linguistics can be combined with different theoretical approaches.
Whether corpus-driven or corpus-based, most of the contributions make some
link with theory. The aim of Joybrato Mukherjee's paper on the verb give, for
example, is to bridge the gap between corpus-based research into actual language
use and cognitive grammar. Caroline David attempts to refine existing syntactico-
semantic classifications of putting verbs on the basis of corpus-data. Similarly,
The spoken language corpus: a foundation for grammatical
theory
M.A.K. Halliday
University of Sydney
1. Introductory
I felt rather daunted when Professor Karin Aijmer invited me to talk at this
Conference, because it is fifteen years since I retired from my academic
appointment and, although I continue to follow new developments with interest, I
would certainly not pretend to keep up to date, especially since I belong to that
previous era when one could hope to be a generalist in the field of language
study, something that is hardly any longer possible today. But I confess that I was
also rather delighted, because if there is one topic that is particularly close to my
heart it is that of the vast potential that resides in a corpus of spoken language.
This is probably the main source from which new insights can now be expected
to flow.
I have always had greater interest in the spoken language, because that in
my view is the mainspring of semogenesis: where, prototypically, meaning is
made and the frontiers of meaning potential are extended. But until the coming of
the tape recorder we had no means of capturing spoken language and pinning it
down. Since my own career as a language teacher began before tape recorders
were invented (or at least before the record companies could no longer stop them
being produced), I worked hard to train myself in storing and writing down
conversation as it occurred; but there are obviously severe limits on the size of
corpus you can compile like that. Of course, to accumulate enough spoken
language in a form in which it could be managed in very large quantities, we
needed a second great technical innovation, the computer; but in celebrating the
computerized corpus we should not forget that it was the tape recorder that broke
through the sound barrier (the barrier to arresting speech sound, that is) and made
the enterprise of spoken language research possible. It is ironical, I think, that
now that the technology of speech recording is so good that we can eavesdrop on
almost any occasion and kind of spoken discourse, we have ethics committees
and privacy protection agencies denying us access, or preventing us from making
use of what we record. (Hence my homage to Svartvik and Quirk, which I still
continue to plunder as a source of open-ended spontaneous dialogue.)
So my general question, in this paper, is this: what can we actually learn,
about spoken language and, more significantly, about language, by using a
computerized corpus on a scale such as can now be obtained? What I was
suggesting by my title, of course (and the original title had the phrase "at the
foundation of grammatics", which perhaps makes the point more forcefully), was
that we can learn a great deal: that a spoken language corpus does lie at the
12 M.A.K. Halliday
as the window, not just into written language but into language. Now, thanks to
the new technology, things have changed; we might want to say: well, now, we
can study written texts, which will tell us about written language, and we can
study spoken texts, which will tell us about spoken language.
But where, then, do we find out about language? One view might be:
there's no such thing as language, only language as spoken and language as
written; so we describe the two separately, with a different grammar for each, and
the two descriptions together will tell us all we need to know. The issue of same
or different grammars has been much discussed, for example by David Brazil,
Geoffrey Leech and Michael Stubbs; there is obviously no one right answer: it
depends on the context and the purpose, on what you are writing the grammar for.
The notion "there is no such thing as language; there are only ...", whether only
dialects, only registers, only individual speakers or even only speech
events, is a familiar one; it represents a backing away from theory, in the name of
a resistance to totalizing, but it is itself an ideological and indeed theoretical
stance (cf. Martin's 1993 observations on ethnomethodology). And while of all
such attempts to narrow down the ultimate domain of a linguistic theory the
separation into spoken language and written language is the most plausible, it still
leaves language out of account, and hence renders our conception of semantics
particularly impoverished: it is the understanding of the meaning-making power
of language that suffers most from such a move.
It was perhaps in the so-called modern era that the idea of spoken
language and written language as distinct semiotic systems made most sense,
because that was the age of print, when the two were relatively insulated one
from the other, although the spoken standard language of the nation state was
already a bit of a hybrid. Now, however, when text is written electronically, and
is presented in temporal sequence on the screen (and, on the other hand, more and
more of speech is prepared for being addressed to people unknown to the
speaker), the two are tending to get mixed up, and the spoken/written distinction
is increasingly blurred. But even without this mixing, there is reason for
postulating a language, such as English, as a more abstract entity encompassing
both spoken and written varieties. There is nothing strange about the existence of
such varieties; a language is an inherently variable system, and the spoken/written
variable is simply one among many, unique only in that it involves distinct
modalities. But it is just this difference of modality, between the visual-synoptic
of writing and aural-dynamic of speech, that gives the spoken corpus its special
value, not to mention, of course, its own very special problems!
I think it is not necessary, in the present context, to spend time and energy
disposing of a myth, one that has done so much to impede, and then to distract,
the study of spoken language: namely the myth that spoken language is lacking in
structure. The spoken language is every bit as highly organized as the written; it
couldn't function if it wasn't. But whereas in writing you can cross out all the
mistakes and discard the preliminary drafts, leaving only the finished product to
offer to the reader, in speaking you cannot do this; so those who first transcribed
spoken dialogue triumphantly pointed to all the hesitations, the false starts and the
backtrackings that they had included in their transcription (under the pretext of
faithfulness to the data), and cited these as evidence for the inferiority of the
spoken word, a view to which they were already ideologically committed. It
was, in fact, a severe distortion of the essential nature of speech; a much more
faithful transcription is a rendering in ordinary orthography, including ordinary
punctuation. The kind of false exoticism which is imposed on speech in the act of
reducing it to writing, under the illusion of being objective, still sometimes gets in
the way, foregrounding all the trivia and preventing the serious study of language
in its spoken form. (But not, I think, in the corridors of corpus linguistics!)
It's been going to've been being taken out for a long time. [of a package
left on the back seat of the car]
All the system was somewhat disorganized, because of not being sitting in
the front of the screen. [cf. because I wasn't sitting ...]
Drrr is the noise which when you say it to a horse the horse goes faster.
Excuse me is that one of those rubby-outy things? [pointing to an object
on a high shelf in a shop]
And then at the end I had one left over, which you're bound to have at
least one that doesn't go.
That's because I prefer small boats, which other people don't necessarily
like them.
This court won't serve. [cf. it's impossible to serve from this court]
features patterns of intonation and rhythm at the phonological level (that is,
identifying them as meaningful options); meanwhile we might explore the value
of something which is technically possible already but less useful for
lexicogrammar and semantics, namely annotation of speech at the phonetic level
based on analysis of the fundamental parameters of frequency, amplitude and
duration.
But, as I suggested, a more serious problem is that of over-transcribing,
especially of a kind which brings with it a false flavour of the exotic: speech is
made to look quaint, with all its repetitions, false starts, clearings of the throat and
the like solemnly incorporated into the text. This practice, which is regrettably
widespread, not only imparts a spurious quaintness to the discourse (one can
perhaps teach oneself to disregard that) but, more worryingly, obscures, by
burying them in the clutter, the really meaningful sleights of tongue on which
spoken language often relies: swift changes of direction, structures which Eggins
and Slade call "abandoned clauses", phonological and morphological play and
other moments of semiotic inventiveness. Of course, the line between these and
simple mistakes is hard to draw; but that doesn't mean we needn't try. Try getting
yourself recorded surreptitiously, if you can, in some sustained but very casual
encounter, and see which of the funny bits you would cut out and which you
would leave in as a faithful record of your own discourse.
But even with the best will, and the best skill, in the world, a fundamental
problem remains. Spoken language isn't meant to be written down, and any
visual representation distorts it in some way or other. The problem is analogous,
in a way, to that of choreographers trying to develop notations for the dance: they
work as aids to memory, when you want to teach complex routines, or to preserve
a particular choreographer's version of a ballet for future generations of dancers.
But you wouldn't analyse a dance by working on its transcription into written
symbols. Naturally, many of the patterns of spoken language are recognizable in
orthographic form; but many others are not: types of continuity and
discontinuity, variations in tempo, paralinguistic features of tamber (voice
quality), degrees of (un)certainty and (dis)approval; and for these one needs to
work directly with the spoken text. And we are still some way off from being able
to deal with such things automatically.
The other major problem lies in the nature of language itself; it is a
problem for all corpus research, although more acute with the spoken language:
this is what we might call the lexicogrammatical bind. Looking along the
lexicogrammatical continuum (and I shall assume this unified view, well set out
by Michael Stubbs (1996) among the principles of Sinclair's and my approach,
as opposed to the bricks-&-mortar view of a lexicon plus rules of syntax), if we
look along the continuum from grammar to lexis, it is the phenomena at the
lexical end that are the most accessible; so the corpus has evolved to be organized
lexically, accessed via the word, the written form of a lexicogrammatical item.
Hence corpuses have been used primarily as tools for lexicologists rather than for
grammarians.
If you are researching the forms of expression of the meaning "cause", you can
identify a set of verbs which commonly lexify this meaning in written English
(verbs like cause, lead to, bring about, ensure, effect, result in, provoke) and
retrieve occurrences of these together with the (typically nominalized) cause and
effect on either side; likewise the related nouns and adjectives in be the cause of,
be responsible for, be due to and so on. It takes much more corpus energy to
retrieve the (mainly spoken) instances where this relationship is realized as a
clause nexus, with cause realized as a paratactic or hypotactic conjunction like
so, because or as, for at least three reasons: (i) these items tend to be polysemous
(and to collocate freely with everything in the language); (ii) the cause and effect
are now clauses, and therefore much more diffuse; (iii) in the spoken language
not only semantic relations but participants also are more often grammaticalized,
in the form of cohesive reference items like it, them, this, that, and you may have
to search a long way to find their sources. Thus it will take rather longer to derive
a corpus grammar of causal relations from spoken discourse than from written;
and likewise with many other semantic categories. Note that this is not because
they are not present in speech; on the contrary, there is usually more explicit
rendering of semantic relationships in the spoken variants; you discover how
relatively ambiguous the written versions are when you come to transpose them
into spoken language. It is the form of their realization (more grammaticalized,
and so more covert) that causes most of the problems.
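
The asymmetry between the two kinds of retrieval can be seen in a small sketch. Everything in it is an assumption made for illustration: the function count_items, the abbreviated item lists and the two invented sample sentences. It counts whole-word matches of some lexical causal verbs in a written-style sentence and of the conjunctions so, because and as in a spoken-style one.

```python
import re
from collections import Counter

# Abbreviated, hypothetical item lists (inflected forms listed by hand).
CAUSAL_VERBS = ["cause", "causes", "caused", "lead to", "leads to", "led to",
                "bring about", "result in", "results in", "resulted in"]
CONJUNCTIONS = ["so", "because", "as"]

def count_items(text, items):
    """Count whole-word, case-insensitive matches of each item in `text`."""
    counts = Counter()
    for item in items:
        pattern = r"\b" + re.escape(item) + r"\b"
        counts[item] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

written = "Heavy rain caused flooding, which led to delays."
spoken = ("it was so wet, so we stayed in because as soon as "
          "we went out it rained")

verb_hits = count_items(written, CAUSAL_VERBS)
conj_hits = count_items(spoken, CONJUNCTIONS)

# Both verb matches in the written sentence are causal.
print({k: v for k, v in verb_hits.items() if v})
# The conjunction matches are not: one "so" is an intensifier and both
# instances of "as" belong to temporal "as soon as".
print(dict(conj_hits))
```

Of the five conjunction matches in the spoken sample only two ("so we stayed in", "because ...") express cause, so every match has to be inspected by hand, and the cause and effect it links are whole clauses rather than nominal groups on either side of a verb.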
Another aspect of the same phenomenon, but one that is specific to
English, is the way that material processes tend to be delexicalized: this is the
effect whereby gash slash hew chop pare slice fell sever mow cleave shear and so
on all get replaced by cut. This is related to the preference for phrasal verbs,
which has gained momentum over a similar period and is also a move towards the
grammaticalizing of the process element in the clause. Ogden and Richards, when
they devised their Basic English in the 1930s, were able to dispense with all but
eighteen verbs, by relying on the phrasal verb constructions (they would have
required me to say "were able to do away with all but eighteen verbs"); they
were able to support their case by rewording a variety of different texts, including
biblical texts, using just the high frequency verbs they had selected. These are, as
I said, particular features of English; but I suspect there is a general tendency for
the written varieties of a language to favour a more lexicalized construal of
meaning.
So I feel that, in corpus linguistics in general but more especially in
relation to a spoken language corpus, there is work to be done to discover ways of
designing a corpus for the use of grammarians, or rather, since none of us is
confined to a single role, for use in the study of phenomena towards the
grammatical end of the continuum. Hunston and Francis, in their work on
pattern grammar (1999), have shown beyond doubt that the corpus is an
essential resource for extending our knowledge of the grammar. But a corpus-
driven grammar needs a grammar-driven corpus; and that is something I think we
have not yet got.
hence ongoingly modified and extended when the semogenic energy of their
lexicogrammar is brought to bear on the material and semiotic environment,
construing it, and reconstruing it, into meaning. In this process, written language,
being the more designed, tends to be relatively more focussed in its demands on
the meaning-making powers of the lexicogrammar; whereas spoken language is
typically more diffuse, roaming widelier around the different regions of the
network. So spoken language is likely to reveal more evidence for the kind of
middle range grammar patterns and extended lexical units that corpus studies
are now bringing into relief; and this in turn should enrich the analysis of
discourse by overcoming the present disjunction between the lexical and the
grammatical approaches to the study of text.
Already in 1935 Firth had recognized the value of investigating
conversation, remarking "it is here we shall find the key to a better understanding
of what language really is and how it works" (1957: 32). He was particularly
interested in its interaction with the context of situation, the way each moment
both narrows down and opens up the options available at the next. My own
analysis of English conversation began in 1959, when I first recorded spoken
dialogue in order to study rhythm and intonation. But it was Sinclair, taking up
another of Firth's suggestions, the study of collocation (see Sinclair 1966), who
first set up a computerized corpus of speech. Much later, looking back from the
experience with COBUILD, Sinclair wrote (1991: 16): "a decision I took in
1961 to assemble a corpus of conversation is one of the luckiest I ever made". It
would be hard now to justify leaving out conversation from any corpus designed
for general lexicogrammatical description of a language. Christian Matthiessen,
using a corpus of both spoken and written varieties, has developed text-based
profiles: quantitative studies of different features in the grammar which show up
the shifts in probabilities that characterize variation in register. One part of his
strategy is to compile a sub-corpus of partially analysed texts, which serve as a
basis for comparison and also as a test site for the analysis, allowing it to be
modified in the light of ongoing observation and interpretation. I have always felt
that such grammatical probabilities, both global and local, are an essential aspect
of what language really is and how it works. For these, above all, we depend on
spoken language as the foundation.
References
Baker, M., G. Francis and E. Tognini-Bonelli (eds) (1993), Text and technology:
in honour of John Sinclair. Amsterdam: John Benjamins.
Brazil, D. (1995), A grammar of speech. Oxford: Oxford University Press.
Carter, R. (2002), Language and creativity: the evidence from spoken English.
[The Second Sinclair Open Lecture, Department of English, University of
Birmingham]
Carter, R., and M. McCarthy (1995), Grammar and the spoken language.
Applied Linguistics 16: 141-158.
28 M.A.K. Halliday
Key:
Indented lines represent the contributions of the interviewer, the asterisks
in the informant's speech indicating the points at which such
contributions began, or during which they lasted.
The hyphens (-, --, ---) indicate relative lengths of pauses.
Proper names are fictitious substitutes for those actually used.
The informant is a graduate, speaking RP with a normal delivery.
i is this true I heard on the radio last night that er pay has gone net pay
but er -- retirement age has gone up - *for you chaps*
*yes but er*
to seventy*
5 *yes I think that's scandalous*
*but is it right is it true*
*yes it is true yes it is true*
*well it's a good thing*
yes *but the thing is that er -* everybody wants more money --
10 *I mean you've got your future secure*
but er the thing is you know -- er I mean of course er the whole thing is
absolutely an absolute farce because -- really with this grammar school
business it's perfectly true that - that you're drawing all your your brains of
the country are going to come increasingly from those schools - therefore
15 you've got to have able men - and women to teach in them - but you want
fewer and better ** - that's the thing they want
*hm*
- fewer grammar schools and better ones --- *because at the
*Mrs Johnson was saying*
20 moment* it's no good having I mean we've got some very good men where I
am which is a bit of a glory hole -- but er there's some there's some good
men there there's one or two millionaires nearly there's Ramsden who
cornered the - English text book market -- *and er* - yes he's got a net
income of
25 *hm*
about two thousand five hundred a year and er there's some good chaps there
I mean you know first class men but it's no good having first class men -
dealing with the tripe that we get *--* you see that's the trouble that you're
wasting it's
30 *hm*
a waste of energy -- um an absolute waste of energy - your - your er method
of selection there is all wrong -- *um
*but do you think it's better to have -- er teachers who've had a lot of
experience - having an extra five years to help solve this - problem of
of fewer teachers -- er or would you say - well no cut them off at at
sixty-five and let's get younger*
The spoken language corpus 31
A: Yes; that's very good. I wouldn't be able to have that one for some reason
you see: this checker board effect I recoil badly from this. I find I hadn't
looked at it, and I think it's probably because it probably reminds me you
know of nursing Walter through his throat, when you play checker boards or
something. I think it's it reminds me of the ludo board that we had, and I
just recoiled straight away and thought [mm] not not that one, and I didn't
look inside; but that's very fine, [mm mm] isn't it? very fine, yes.
B: It's very interesting to try and analyse why one like abstract paintings, 'cause
I like those checks; just the very fact that they're not all at right angles means
that my eyes don't go out of focus chasing the lines [yes] they can actually
follow the lines without sort of getting out of focus.
A: Yes I've got it now: it's those exact two colours you see, together. He had
he had a blue and orange crane, I remember it very well, and you know one
of those things that wind up, and that's it.
B: It does remind me of meccano boxes [yes well] the box that contains
meccano, actually.
A: Yes. Well, we had a bad do you know; we had oh we had six or eight
weeks when he had a throat which was [mhm] well at the beginning it was
lethal if anyone else caught it. [yeah] It was lethal to expectant mothers with
small children, and I had to do barrier nursing; it was pretty horrible, and the
whole corridor was full of pails of disinfectant you know [mm], and you went
in, and of course with barrier nursing I didn't go in in a mask I couldn't
with a child that small, and I didn't care if I caught it, but I mean it was
ours emptied outside you see [mm] and you had to come out and you brought
all these things on to a prepared surgical board [mm mm] and you stripped
your gloves off before you touched anything [mm] and you disinfected oh it
was really appalling [mm]. I don't think the doctor had expected that I would
do barrier nursing you see [mm] I think she said something about she
wished that everybody would take the thing seriously you know, when they
were told, as I did, 'cause she came in and the whole corridor was lined [mm]
with various forms of washing and so on, but after all I mean you can't go
down and shop if you know that you're going to knock out an expectant
mother. It was some violent streptococcus that he'd got and he could have
gone to an isolation hospital but I think she just deemed that he was too small
[yes mm mm] for the experience, and then after we'd had him, you know,
had him for a few days at home this couldn't be done. [mhm] She made the
decision for me really, which at the time I thought was very impressive, but
she didn't know me very well: I think she thought I was a career woman who
would be only too glad and would say oh well he's got to go into a hospital,
you know, so she made the decision for me and then said it's too late now to
put him into an isolation hospital; I would have had to do that a few days
ago which, I thought, I didn't want her to do!
B: Do nurses tend to be aggressive, or does one just think that nurses are
aggressive?
A: Well, that was my doctor [oh], and she didn't at that time understand me very
well. I think she does now.
. . . ) and I / think she's a/ware of this and I / think you / know she . . . // 4
) I / think one / thing that'll / happen I / think that . . . // 1 ) that / Mike may
en/courage her // 1 ) and I / think that'll be / all to the / good //
P. // 4 ) to / what ex/tent are / these / ) the / three / theories that she se/lected // 1
truly repre/sentative of / theories in this / area //
A. // 1 that's / it / ) // 1 that's / it //
P. // 1 ) they / are in/deed //
S. // 1 yeah //
P. // 1 oh // 2 they are / the / theories //
A. // 1 that's about / it //
P. // 1 they are / not / really repre/sentative / then //
S. // 1 well there are // 1 ) there are / vari/ations // 1 ) there are / vari/ations // 1
on / themes but . . . // 4 ) but / I don't / know of any / major con/tender ) there
/ may be // 1 ) well / I don't / know of / anything that / looks much / different
from the / things she's . . . ) she has / looked at in the spe/cific / time //
A. // 4 ) ex/cept for the / sense that
P. // 1 ) so / nobody / nobody would at/tack her on / that ground / then if she
//
A. // 1 oh no / I don't / think so // 4 ) I think the / only / thing that would be
sub/stantially / different would be a // 1 real / social / structuralist who would
/ say // 4 ) you / don't have to / worry about cog/nitions // 1 what you have to
/ do is / find the lo/cation of these / people in the / social / structure // 1- ) and
/ then you'll / find out how they're / going to be/have with/out having to / get
into their / heads at / all // 4 ) and / that // 1 hasn't been / tested // 1- ) ex/cept
in / very / gross / kinds of / ways with // 1 macro / data which has / generally /
not been / very satis/factory // 1 yeah / ) // 1 ) so I can / tell her that // 3 )
you / know I
S. // 1 ) she's / won //
Text 5. Choreographic notation for the clause complex of spoken language (cf.
forms of notation in Martin 1992). Clause complex from Text 3 above.
The doctor probably expected
I would say
2
that he had to go into hospital
so instead of asking me
+
she said
2
it's too late now to put him into a hospital
=2
I should have had to do that a few days ago
+2
and I thought to myself
2
I didn't want to
you to do that
Note: Written originals are those lettered (a) in Set 1 and those in the left hand
column of Set 2.
1. (1a) Strength was needed to meet driver safety requirements in the event
of missile impact.
(1b) The material needed to be strong enough for the driver to be safe if it
got impacted by a missile.
(2a) Fire intensity has a profound effect on smoke injection.
(2b) The more intense the fire, the more smoke it injects (into the
atmosphere).
(3a) The goal of evolution is to optimize the mutual adaption of species.
(3b) Species evolve in order to adapt to each other as well as possible.
(4a) Failure to reconfirm will result in the cancellation of your
reservations.
(4b) If you fail to reconfirm your reservations will be cancelled.
(5a) We did not translate respectable revenue growth into earnings
improvement.
(5b) Although our revenues grew respectably we were not able to
improve our earnings.
John Sinclair
Abstract
Some corpus linguists prefer to research using plain text, while others first
prepare the texts by adding various analytic annotations. The former group
express reservations about the reliability of intuitive data, whereas the latter
group, if obliged to choose, will reject corpus evidence in favour of their intuitive
responses. This paper attempts to move from the broad differences expressed
above to a small number of specific points of contrast between the two
approaches.
1. Introduction
As the study of language in corpora continues to grow and diversify, differences
of methodology emerge, and there is room for misunderstanding. Aarts (1991,
2002a, 2002b) has monitored the development of the relationship between the
management of corpora and the theory of language, and Tognini Bonelli (2001)
has described contrasting conceptualisations of the relation between theory and
data.
The key concept here is '-driven', in the phrase 'corpus-driven linguistics'.
'-driven' has several characteristic usages, among which we may focus on two,
which might be paraphrased as 'motivated' and 'controlled'. Its use in relation
to corpus linguistics can be traced back to Johns (e.g. 1990) and his 'data-driven
learning'. Here the matter of motivation is on top, as it was found that learners
have unbounded curiosity when they are allowed to interrogate corpora, and
apparently natural learning mechanisms to profit from the curiosity. Francis
(1993) shifted the focus to 'corpus-driven grammar', where 'controlled' is
perhaps the more appropriate gloss. The grammar should follow the corpus,
accounting for as much as possible of the patterning, and being cautious in
ascribing to the language a pattern that is not attested in the corpus.
Tognini Bonelli (op. cit.) noted that in much corpus research the
theoretical and descriptive positions were carefully insulated from the findings of
the corpus investigations. Though researchers acknowledged that one legitimate
use of a corpus was to test hypotheses, there was no serious testing of the
governing theories. These, it was held, had been forged over many years, and
thoroughly tested against intuitive responses, and they were extremely abstract.
The myriad details of actual usage could provide some helpful reflections of the
40 John Sinclair
theory, but there was no question of threatening the theory with evidence from
usage, however compelling that might be.
Tognini Bonelli called this position 'corpus-based linguistics', and
contrasted it with 'corpus-driven linguistics', which specifically places the theory
in a vulnerable position, to be justified or modified according to the results of
investigations: the classic posture of the empirical scientist.1 There may be
intermediate positions between these poles, but I cannot imagine any. Either
one's whole cathedral of linguistic structures is ready to receive the scaffolding,
or it is not.
Aarts (op. cit.) offers a penetrating discussion of this dichotomy, and
suggests that the two approaches contrast in their methodologies in two
important places: the role they see for a person's intuition, and the place, value
and legitimacy of annotating corpora.
Regarding intuition, he anticipates quite opposed positions; 'the corpus-
based linguist allows his intuition to overrule his corpus data and hence gives
primacy to the former' (2002a: 8) and Aarts expects the corpus-driven linguist to
do the opposite.
There are two observations to be made here. One is that Aarts moves
smoothly from considering the use of intuitive data to the more general point
of the role of the intuition in the process of making linguistic descriptions. But it
is quite reasonable to differentiate between these two positions, to reject the
former and keep an open mind on the latter as I do, and as I think most corpus-
driven linguists would also do. The other point is that I wonder if corpus-based
linguists have ever thought seriously about the priority they assign to intuitive
data. Would they really just set aside a mass of information about how people
use a language when their still, small voice tells them something different?
Leave aside the details, the one-offs, the peculiarities; corpus linguistics is
about generalities if anything. Would they really feel secure in preferring their
intuition against measurable, incontrovertible objective evidence? Their only
hope would be to find an explanation for the apparent conflict, and although that
is a laudable aim, it is rarely resorted to because of the low prestige of empirical
data in the last half-century's linguistics.
In no way do I intend this argument to devalue the importance of
intuition, as will become apparent in a little while. But I urge caution. When
cheap pitch meters became available in phonetics, it was possible to discover
exactly what the pitch contours of an utterance were. It was discovered that
people believed things that were at variance with the facts. They believed, for
example, that questions were spoken on a rising intonation, although in British
English they usually are not, and they would hear the pitch going up when in
fact it was going down. Intuition is not some kind of gut reaction to events; it is
educated in various ways, and sophisticated.
On the topic of annotation, Aarts considers that the contrast between the
two approaches to corpus linguistics is at its most marked in this area. He
deduces that corpus-driven linguists are bound to reject annotation, because it
could hamper their wish to be as close to the plain text as possible, whereas
Intuition and annotation 41
corpus-based linguists, who do not share their concerns, rely on annotation as the
main means by which they express their analysis and make it available to others.
It is, in Aarts' uncompromising phrase, 'an indispensable tool' for them.
Certainly there are contrasting attitudes to annotation among corpus
linguists of different styles, but not perhaps as extreme as suggested by Aarts. I
would like to continue this valuable discussion with an examination of the roles
of intuition and annotation in corpus linguistics, because I think that some
misunderstandings have arisen. These are quite understandable in their context,
and I am certain that Aarts is striving to be completely fair in his representation
of all points of view, and particularly those that he does not share. I can, of
course, talk only from my own perspective as following the corpus-driven
approach to research.
2. Intuition
In considering the role of intuition, for example, two issues have in recent years
tended to undermine confidence in the reliability of this elusive faculty. Let us
examine each of them briefly.
idiomatic in the sense that their meaning does not simply combine the meanings
of the individual words. Examples are take place, take a photograph, take
control, take time.
While these latter phrases were well known to people dealing with
English, their prominence in texts was not, and they had clearly been under-
assessed in reference books; their frequency was overwhelming, and the
intuitively-favoured meaning was insignificant in comparison.4 This was the
beginning of distrust of intuition: why did one's intuition fail to come up with
the massively common uses of a word, but instead reported a rather rare one?
The term 'delexicalisation' was used of the process whereby the original
meaning of the verbs which appeared in these patterns was watered down or lost
completely, overlaid with a new meaning that arose from regular collocation. At
that time no-one was questioning the ideas (a) that each word had one or more
meanings, and (b) that one of the meanings had special status as the original or
core meaning. But gradually confidence in these ideas was eroded as it was
realised that a model based on these ideas only fitted the facts marginally, and
left most of the meaningful patterning unresolved in layers of ambiguity.
Estimates of the proportion of text that consists of multi-word lexical units rose
to as high as 80% in some circumstances. The link between the word and the
meaning gradually crumbled.
'Delexicalisation' is thus an unfortunate term. The word only appears to
lose meaning when the model has no higher unit to show where the meaning of a
multi-word unit is actually created. With the higher unit, the lexical item,
established, we can return to the role of intuition, and take a different view of its
accuracy and relevance.
The lemma TAKE contributes to many lexical items in coselection with
e.g. prepositions and particles to make phrasal verbs, and nouns to maintain the
preference of English for simple verbs such as take a risk instead of risk. These
are not strictly meanings of TAKE, but uses of the word in combinations. If
these and other coselections are removed from the concordance of TAKE then it
might well be that the main remaining meaning of this lemma is as reported by
the intuition.5 The intuition was probably right after all, but if so this has been
obscured by an inadequate model of interpretation.
Problems of the intuition are not always resolved so easily, but the fairly
objective evidence from corpora allows us to study intuitive positions and
reactions with greater clarity, and there is less chance that the intuition will be
dismissed as irrelevant on future occasions.
From these brushes between the corpus and the intuition, it is easy to see
how word could get around that intuition was not to be trusted, and that it tended
to take up a position that could not be supported by corpus evidence. However,
the failing, as so often, is more likely to be in the model, or theory, of language
through which we perceive linguistic events and with which we interpret them.
Take the case of inventing sentences: what are you asking your intuition to do?
Any utterance which is part of a communicative event is heavily dependent on
the events preceding it, to the extent that many contextual settings are already
established before the utterance takes place, and the utterance is interpreted with
reference to those settings. If a user of English is asked to produce a sentence of
English in the absence of these settings, it is a most unnatural request, and it is
unlikely that the subject will be able to imagine a suitable communicative event,
master all the relevant settings, mentally construct enough of the preceding
utterances to provide an adequate cotext, and then think up a sensible
contribution although he or she is not involved in the hypothetical event. No
wonder we usually make a hash of it!
Because our basic models of language structure concentrate on sub-
sentential matters, and do not assign a central importance to interaction, we
formulate requests that appear simple enough from the perspective of our model,
but involve processes which are almost impossible to control, as we see when we
look through a richer model.
In the second instance, where a unit of meaning can spread over several
words, the intuition was delivering a perfectly reasonable answer to the question
as asked, but our resident models misinterpreted it, and so we blamed the
intuition and lost confidence in it. This kind of confusion is likely to characterise
future encounters with the intuition as well, until our models are rich enough to
cope with the information they are receiving, both from corpus and from the
intuitive reactions of people with command of the language.
So we both trust our intuitions and keep a wary eye on the strong
possibility of misunderstanding what we are observing. To hint at another area
where this could arise, it has been noticed informally that people recall phrases
that are frequent exceptions rather than normal constructions. Grammar deals
with the regular, and so is resonant with frequency; however many words appear
commonly in phrases which are uncharacteristic of the normal usage of the
words. The intuition will tend thus to retrieve the non-standard structure. For
example the corpus tells us that there are a number of adjectives that do not
usually appear in front of nouns, in what is called the attributive position. Instead
they normally occur after the verb be in the predicative position. However, in
certain collocations, fixed phrases and idioms this restriction is lifted; so users of
English have to remember both the rule and the exceptions. It seems that the
intuition, if queried about a particular adjective, tends to report on the
participation of the adjective in lexical patterns rather than grammatical ones: in
phrases with particular collocates that are uncharacteristic of the grammar, rather
than the regular structures. From a grammatical point of view this seems
perverse (it is the exceptions that come to the surface first), but from a lexical
point of view it is a sensible response, since if an established multi-word
expression has a structure that is exceptional compared with the normal usage of
one of the words in it, this point has to be remembered in connection with the
individual word.
This intuitive response first came to our notice in Cobuild with the
adjective glad, and I have commented before about it (Sinclair 1991).
Overwhelmingly glad is used predicatively, and in some complex constructions.
However, many English speakers, when asked about the usage of this word, cite
a phrase from the translation of the Bible published in 1611, 'glad tidings of
great joy', that is still alive and well in the speech community. Apart from this
relic, and a few minor phrases, glad will be found, on thousands and thousands
of occasions, in predicative position. Without one of the tiny number of
collocations like tidings, it will sound very odd indeed as an attributive adjective.
There are good reasons for this which I will not go into now, because my point is
that here is another place where our intuitions may appear to report falsely about
the facts of the language. Following a grammar-predominant model, such as we
have, glad will be classified as predicative without question on the basis of
corpus evidence; the intuition may hold on fiercely to the few phrases that
contravene this convention. A model which is more balanced between grammar
and lexis should mediate successfully between these apparently opposed
positions.
It seems that glad is not alone in presenting a different pattern to the
grammar and the lexis. The adjectives ill, safe and likely are all found
predominantly in the predicative position, but it is easy to think up phrases where
they are used attributively: an ill wind, ill effects; safe haven, safe sex; a likely
story, a likely lad. So in each of these cases we can interpret the role of intuition
as preserving memory of those phrasings which are characteristic of the lexical
patterning, especially when the more general and freer usage of the adjective is
in a contrasting grammatical structure. When what is exceptional in grammar is
typical in lexis, the phrasings are stored as individual items.
While there is at least an interpretive problem in the cases that we have
discussed, there is one process where the intuition can be safely trusted. In the
evaluation of corpus evidence the researcher has virtually no option but to yield
to the organising influence of his or her intuition. Complex patterns of
coselection are immediately interpreted semantically and classified broadly with
respect to each other. The same mental resource that we have seen is unable to
manage coselections outside participation in a genuine communicative context is
apparently razor sharp and completely reliable in a receptive mode.
To illustrate this I present the results of asking The Bank of English what
the principal collocates were of the pattern on the _ of.6 Any single word form
might occur in the gap, but it should be noted that the collocates are not restricted
to occurrence in that position, but might be anywhere in a ten-word window
around the phrase.
The leading collocates, according to their t-scores, are listed in Table 1. I
believe that anyone with normal fluency in English would find it difficult to scan
this table without making tentative groupings of the words along various
dimensions.
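The kind of query described here can be sketched in a few lines of Python. This is an illustration only, not the software behind The Bank of English: the function name, the tokenising rule, the reading of the ten-word window as five words on either side of the phrase, and the choice to count the gap word itself as a collocate are all assumptions. The t-score is calculated with one commonly used approximation, (observed - expected) / sqrt(observed), where the expected count assumes that words are scattered independently through the corpus.

```python
import math
import re
from collections import Counter


def pattern_collocates(text, half_window=5):
    """Score collocates of the pattern 'on the _ of' in a plain-text corpus.

    Returns a dict mapping each collocate to (observed_count, t_score).
    """
    # Crude tokenisation (an assumption): lower-case alphabetic word forms,
    # keeping internal hyphens as in 'half-time'.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    freq = Counter(tokens)

    # Start positions of every 'on the X of' sequence, whatever X is.
    hits = [i for i in range(len(tokens) - 3)
            if (tokens[i], tokens[i + 1], tokens[i + 3]) == ("on", "the", "of")]

    observed = Counter()
    for start in hits:
        end = start + 4  # first position after the phrase
        observed.update(tokens[max(0, start - half_window):start])  # left
        observed.update([tokens[start + 2]])                        # the gap
        observed.update(tokens[end:end + half_window])              # right

    # Number of positions inspected per hit: the window plus the gap.
    slots = 2 * half_window + 1
    scores = {}
    for word, count in observed.items():
        expected = freq[word] * len(hits) * slots / len(tokens)
        scores[word] = (count, (count - expected) / math.sqrt(count))
    return scores
```

Sorting the result by its t-score would give a ranked list of the kind referred to above. On a toy text the scores mean little; the measure only becomes informative with a corpus of many millions of words.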
1. Timing expressions, such as eve, night, morning, day. Here we note that eve is
an unusual word, usually found in poetry and oratory. This is a clue to the
meaning of these expressions, which are used in the timing of important events.
On the stroke of is also a somewhat dramatic timing expression, which needs a
particular time after it, the kind of time that is likely to be signalled by a clock
striking or something similar. As well as the hours, especially midnight, and half-
time, full-time, those unfamiliar with the game of cricket might be surprised to
find on the stroke of lunch/tea in there as well.
2. Spatial indicators, such as back, side, surface, and floor, corner. Site attracts
collocates to do with buildings. Outskirts, streets, banks are more specific spatial
references. Isle and island are parts of place names. Some uses of edge, verge,
brink are also spatial, but on the brink of and on the verge of are commonly used
as complex prepositions introducing mainly dreadful things.
3. The phoric nouns subject, issue, question, whose referents are to be found in
the surrounding cotext; in this phrasing probably just after the of.
4. The complex prepositions on the basis of, on the grounds of, on the strength
of, indicating the reason for a decision.
5. In some cases the lexical item extends beyond the designated phrase; for
example most of the occurrences of face in this phrasing are in the phrase on the
face of it, with a variant on the face of things. On the heels of is usually preceded
by hot or hard, or one of a few variants like close.
6. More generally, part fits into a phrasing X on the part of Y, where X is some
action, usually described in derogatory terms, and Y is the actor. Future attracts
talks and similar events on its left, and political problems on its right. Effect is
part of an item which can be represented as X has a Y effect on Z, where X is
some event, Y is an adjective like adverse, dramatic, and Z is something like a
political programme. Cover is usually preceded by the name of a celebrity, and
followed by the name of a journal. Sale predictably attracts the vocabulary of
financial dealings. On the evidence of has a remarkable tendency to come at the
beginning of its clause, introducing the reason for an action which is reported
later.
7. The remaining nouns that occur between on and of are frequent but
unremarkable collocationally, like number, amount, advice.
8. Depends, depending, depend are verb forms which are likely to come in front
of the expression, as also focus, focused, based, comment, report. Evidence is
typically preceded by one of these verbs. Impact, emphasis and restrictions are
much more likely to precede this phrase than to be the missing noun.
3. Annotation
Aarts (2002a: 10) went so far as to say that annotation is anathema to
corpus-driven linguists. This is a fairly serious misunderstanding, and to clarify
my own position it is necessary to define terms carefully.
Let us first distinguish between mark-up and annotation. They are not
always kept distinct in usage, and their domains may overlap, but they are worth
distinguishing. Both of them are processes which provide additional information
to what is called 'plain text'. Plain text is a straightforward concept, but there
are some who claim not to understand it, so we will start there.
Imagine that you had a long thin reel of paper to write on rather than a
rectangular sheet, like a reel of sticky tape but made of paper. You have in front
of you a piece of writing that you want to record onto this reel of paper, just a
paragraph. How would you do it? I expect that you would ignore line ends,
remove hyphens that marked words split at line-ends, and otherwise produce a
continuous stream of letters, numbers and punctuation marks in the same
sequence as the original. That is plain text, and it consists of an alphanumeric
stream.
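The procedure just described can be written down as a small Python sketch. The function name is hypothetical, and the rule for line-end hyphens is deliberately naive: it cannot tell a hyphen that merely marks a split word from one that belongs to a compound which happened to fall at a line end, so a compound such as corpus-driven would lose its hyphen if broken there.

```python
import re


def to_plain_text(page: str) -> str:
    """Turn a laid-out page into one continuous alphanumeric stream."""
    # Rejoin words that were split by a hyphen at a line end.
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", page)
    # Ignore the remaining line ends: any run of white space becomes
    # a single space, and the sequence of characters is otherwise kept.
    return re.sub(r"\s+", " ", joined).strip()
```

For example, `to_plain_text("a contin-\nuous stream\nof letters")` gives `"a continuous stream of letters"`.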
If you continue transferring written text in this way, however, you will
soon encounter problems: bold face, italic, underlinings for example, and
headings, large fonts and other layout matters. Mark-up is the process of
recording these additional pieces of information by making notes interspersed in
the alphanumeric string. So just before a section of bold face will be a tag that
says 'from here on there is bold face' and just after the section there will be a tag
that says 'from here on we return to normal face'. The tags are coded in a mark-
up language, of which the most widely used has been SGML, now giving way to
XML. So for each note there will be two tags.
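The pairing of tags can be illustrated with a short Python sketch. The tag names used here (such as bold) are invented for the example and do not belong to any particular SGML or XML scheme; the text is assumed to arrive as a list of runs, each with an optional style. The second function shows that the plain text remains recoverable once the tags are removed.

```python
import re
from xml.sax.saxutils import escape, unescape


def mark_up(runs):
    """Build a tagged string from (text, style) runs; style may be None."""
    parts = []
    for text, style in runs:
        body = escape(text)  # protect any literal &, < or > in the text
        # One opening and one closing tag around each styled section.
        parts.append(f"<{style}>{body}</{style}>" if style else body)
    return "".join(parts)


def strip_markup(marked):
    """Recover the plain alphanumeric stream from a tagged string."""
    return unescape(re.sub(r"</?[a-z]+>", "", marked))
```

Given the runs `[("This is ", None), ("plain text", "bold"), (".", None)]`, `mark_up` produces `This is <bold>plain text</bold>.` and `strip_markup` returns `This is plain text.`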
In marking text up, then, the aim is to preserve information that would
otherwise be lost in the transfer of text to electronic form. Annotation, which we
will come to later, uses the same conventions as mark-up but has no limits on the
kind of information that is provided. Specifically, it encodes information which
is not directly recoverable from the original text, but is added by a researcher.
Returning to mark-up, now imagine that instead of a written text to be
transferred to the reel of paper, you were faced with a recording of a
conversation. Here there are many more decisions to be taken, because the sound
wave has to be interpreted as an alphanumeric stream. Let us say that you do not
attempt a phonetic transcription, but you adopt the mode of transcription called
orthographic, using ordinary spelling wherever possible, but noting all the false
starts, laughs, coughs and stutters.
If you do this conscientiously you will end up with a legible text, but one
which has lost a lot of the original information in the sound wave. Intonation is
poorly represented in punctuation, stress is usually not marked in writing, and all
sorts of emotional and attitudinal meanings will not transfer. You may want to
mark up the transcription to record some of these important items, again using a
tag coding.
There is good motivation for preserving this information, and there are
various ways of preserving it. However, it is important to note that a simple
orthographic transcription has a definite status in and for itself, even though it
may be enhanced by good mark-up. It is legible, and a fluent speaker is usually
able to infer enough of the missing information to understand the transcript with
only occasional difficulty, much as he or she would adjust to a speaker of an
unfamiliar variety of English. You will have included word spaces in your
transcription without difficulty, though you did not hear most of them, and
perhaps speaker change, with some attempt to recognise the various speakers.
You will have made a stab at sentence and paragraph boundaries, and used full
stops and capital letters with confidence.
In the very first corpus of spoken language in electronic form (Sinclair et
al. 1970 and forthcoming) there was no difference made between capital letters
and small ones because in the early sixties computers could only cope with one
alphabet. There was no punctuation and no indication of speaker change because
transcribers were asked not to include these. Word spaces were present, and this
led to criticism from some purists, but I find no problem in using the
transcribers' ability to detect word spaces to improve legibility.7
Conventions such as SGML originated when computers were not nearly
as powerful and flexible as they are today. We have reached a stage where a
recorded conversation can be digitised and all the features of the sound wave
which are relevant to language can be retained in the computer and presented to a
researcher as required; so, for example, an orthographic transcription can be
aligned with the sound wave from which it was transcribed, and segments of the
recording can be played back to order, so there is no further need for mark-up.
Similarly, documents can be digitised, retaining all relevant aspects of their
format, layout and typography, and again this information, kept separate from the
alphanumeric stream, can be aligned as and when required.
So the mark-up languages represent a stage of development of computer
text processing which is now obsolete. The updating of existing corpora will be
slow because a lot of material has been tagged (and often re-tagged to keep up
with changes in best practice), and there are some contingent problems which
will be mentioned below.
There is an issue here of the integrity of texts. While it is conceded that no
electronic representation of a text is identical with the original, the object of
making an electronic copy is surely to preserve at least the alphanumeric stream
in its original sequence. Any disturbance to that will lead to difficulties later on,
particularly now that many corpora are much too large for human inspection.
The principal problem is that it is not possible to be sure that all the tags have
been removed, without the accidental removal of some genuine text. There are
two sources of error here: one is the accuracy with which the tags have been
inserted, and despite the availability in recent years of SGML parsing and
checking programs there are all sorts of opportunities for error. The other is that
strict adherence to the rules is laborious, and there are a number of short-cuts that
are commonplace, and not necessarily retrievable. The situation was summed up
by Vlado Keselj in a message to the Corpora List in April 2002: "Actually,
writing a correct and general SGML detagger would be a *very* difficult task."
Thankfully, there is an easy way of avoiding this problem. The
alphanumeric stream, the plain text file, can be just one of several parallel data
streams, and mark-up tags can be another. When required, these two streams can
be merged, and a single string alternating text and tag can be made. This does not
affect the integrity of the plain text file, and the process can be repeated and
elaborated as required. This system has been in everyday use for some fifteen
years now, but it is still common to find tagged corpora that are not available in
plain text form, and can only be separated by a laborious process of doubtful
accuracy.
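The idea of parallel streams can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system referred to above: the Tag record, the merge function, the use of character offsets and the sample sentence are all hypothetical names and choices introduced here.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    """One stand-off mark-up tag (hypothetical record for this sketch)."""
    offset: int  # character position in the plain text where the tag belongs
    label: str   # the tag itself, e.g. "<b>" or "</b>"


def merge(text: str, tags: List[Tag]) -> str:
    """Build a single string alternating text and tags.

    Neither the plain text nor the tag list is modified, so the
    merge can be repeated with different tag streams as required.
    """
    pieces = []
    position = 0
    # sorted() is stable, so tags sharing an offset keep their given order.
    for tag in sorted(tags, key=lambda t: t.offset):
        pieces.append(text[position:tag.offset])
        pieces.append(tag.label)
        position = tag.offset
    pieces.append(text[position:])
    return "".join(pieces)


# The plain text stream and the mark-up stream are stored separately.
plain = "Mark-up records bold face and italic."
markup = [Tag(16, "<b>"), Tag(25, "</b>")]

merged = merge(plain, markup)
print(merged)  # Mark-up records <b>bold face</b> and italic.
print(plain)   # the plain text stream is unchanged
```

Because the tags are never written into the plain text file, no detagging step is needed to recover it, which is the point at issue in the paragraph above.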
We can summarise the arguments around mark-up as follows:
With this in mind, we can now turn to annotation. Annotation uses the same
conventions as mark-up, but is not restricted to features of the original text or
recording. The classic annotation is POS-tagging, which means inserting after
each word in a corpus a code denoting its part of speech, but there are now many
others, some quite unusual and informal, and many corpora are very heavily
annotated.
I would certainly not condemn all annotation, and I make judicious use of
it myself; but I have reservations about some practices, and about the wisdom of
relying on a platform of annotated text in our present state of knowledge. The
idea that annotation is anathema to people who share my views no doubt arises
because of these stated reservations.
In order to clarify my position, I would like now to make a distinction
between a corpus which is prepared for general use by a community of
researchers, students and workers in the language industries, and one which is
put together for a particular application. My comments and particularly my
reservations largely concern the former type of corpus, often known as a
generic corpus, where I take the simple view that all the information apart from
the plain text should be optional, because (a) some important groups of users
require only that, and (b) most researchers will only require a small subset of the
annotations that might be available. Researchers using statistical methods usually
need a large amount of plain text, as do those searching for lexical patterns.
Information from mark-up and annotation would only be of interest in problem
cases, and statistical studies rarely get down to that level of detail. Also, as
annotations become more varied and verbose, no-one will want to make use of
all of them, and if the corpus is only available in fully annotated form, they will
be carrying a lot of baggage around with them.
The other type of corpus, one that is designed and built for a pre-
determined application, will give top priority to the needs of the job, quite
rightly. The type and level of mark-up and annotation will depend on the kind of
queries that the investigation requires, most of which will be knowable in
advance. In such circumstances, which are common in commercial applications,
the best that one can do is appeal for the researchers to observe good practice so
that their corpus may be reusable for other purposes.8 The same situation is
found in the growing practice of putting together quick, highly specialised
corpora, perhaps from the internet, in order to carry out a limited set of tasks,
with no intention of retaining the corpus afterwards (the "disposable corpus"). In such
cases any short-cut is justified and it is irrelevant to suggest that researchers
conform to good practice (see Pearson and Bowker 2002).
So I am only concerned with generic corpus resources. Many prospective
users of such corpora expect to be offered POS tagging and sometimes full
parsing and semantic and pragmatic tagging as well, and there is no reason why
such annotations should not be available with generic plain text corpora, but they
should be optional, and they should conform to the conditions set out above, for
mark-up, and below. Many projects start out with a request to the corpus
linguistics community for a corpus already tagged in a particular way. In the best
scientific tradition, researchers use previous research as their platform, and probe
beyond their predecessors.
Here is where my reservations start. In the first place, all the annotation
systems that I know of that code linguistic information have an element of
human input, of which the smallest-scale intervention is the human correction
of the computer's mistakes. In many procedures the computer plays a fairly
minor role in the decision-making and is used just to manage the data; in others
there is a preliminary stage where the input text is manually edited and then
processed automatically.
I have argued for some years that annotation which is not fully automatic
has no place in the toolboxes of generic corpora. It is unavoidable in many
applications because of their need for practical outcomes, and because there are
no suitable tools which are fully automatic. While it is claimed that better and
better analyses are made by researchers working in partnership with the
computers, in Aarts' words, "at some moment the descriptive model and the
annotation tool derived from it must be frozen if the desired result is to be
achieved" (Aarts 2002b). That is a fact of life in applications.
Unfortunately, too many researchers nowadays expect, and accept, off-
the-shelf tools that they do not examine too closely; the tools may be of some
antiquity, but they are not carefully evaluated. There is thus no incentive to work
towards a new generation of fully automatic tools which derive from a corpus-
sensitive analysis, and which may present a rather different picture of the
language from the present ones. The whole procedure of annotation is pretty
frozen at present, and has moved very little in the last decade, because the
theories are not accessible for modification by the data.
There are two compelling reasons why annotation of this kind should not
be offered as part of a generic package.
One is that the models of language used in today's taggers date from a
time before evidence from a corpus was available, and some of them derive from
models which ignored empirical evidence entirely. A corpus can certainly be
used to evaluate and correct the descriptions that come from these models, and
eventually the models themselves, and this does happen in a very small way
concerning some of the details of classification. But, as Tognini Bonelli points
out in the quote early in this paper, for many scholars there is no impetus to
expose the theory to such scrutiny. Overwhelmingly the consensus view of
researchers is that the models are basically correct, and while they can be tidied
up by corpus evidence there is no need to open up the whole complexity of
language theory and description for the sake of some minor blemishes. Better to
get on with the job.
In the view of corpus-driven linguists, the picture is quite different. Their
perception is that corpus study provides a constant, subtle undermining of the
received models of language. The evidence is piling up all the time, but it is
invisible to anyone who looks only through the categories of the received model.
Claims of a high-per-cent accuracy of tagging are misleading, because the
decisions about what is correct and what is wrong are not supported with
linguistic evidence. Also most wrong assignments are systematically wrong,
because the machines are consistent at least, and the researcher is left with two
misgivings: (a) perhaps the computer is offering valuable new information rather
than making mistakes, and (b) the places where the computer is unreliable are
probably just the places where the researcher would like to rely on it.
The other argument against conventional tagging causes some problems
when put, as I sometimes do, in the form "Annotation loses information". It
would seem at first sight that annotations add to the information in the corpus,
and indeed terms like "enrichment" are sometimes rather rashly used to promote
annotated text. Let us start with a simple case and follow it through. Let us agree
that boy, bicycle and brat are all nouns. They each are given the tag N. Once
this is done, they are all identical from the point of view of the tags; their
individuality is lost.
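The simple case can be made concrete with a small sketch. The token list and variable names below are hypothetical, introduced only to illustrate the point; the tag N is the one used in the example above.

```python
# Hypothetical tagged tokens: each word is paired with its part-of-speech tag.
tagged = [("boy", "N"), ("bicycle", "N"), ("brat", "N")]

words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

# Seen as words, the three items are distinct.
distinct_words = len(set(words))

# Seen only through the tags, they collapse into one category.
distinct_tags = len(set(tags))

print(distinct_words, distinct_tags)  # 3 1
```

The mapping from word to tag is many-to-one, so a query that consults the tags alone cannot tell the three words apart.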
The proponents of annotation argue at this point (a) that there is a gain in
generality in the recognition of what is shared among members of the class N,
and (b) that the individuality of the word is not lost, because the word itself is
still there in the linear stream. These points need to be explored carefully.
First, the gain in generalisation, which is certainly a valid point as long as
generalisation can be demonstrated, but here the informality of the received parts
of speech weakens the argument considerably. No formal definition exists of the
flexibility is vital to any theory-based research. This is another way of seeing the
individuality of words, which is denied them as soon as they are given a tag.
From this discussion it is clear that non-automatic annotations are best
confined to applications, where they can expect to remain in use for some time.
Their inclusion among generic resources, however, is misplaced and hazardous,
and it holds back progress substantially. Instead of research projects pushing
ahead with the improvement of fully automatic annotation, a considerable
proportion of the available funding goes into this very flawed activity.
Any unavoidable human role in the process of analysing corpora holds
back progress along many dimensions, but none so obvious as in the size of
corpus to be managed. Generic corpora are now measured in the hundreds of
millions of words, and this figure will rise and rise because each rise in the order
of magnitude shows the need for the next one, and there is no reason why this
should stop at some arbitrary size. Any human input, no matter how tiny, that
grows with the size of the corpus adds so much to the cost and time, as well as
opening an opportunity for inaccuracy, that either the size of the corpus has to be
kept down or costs will soar.
To summarise this complex area, my reservations about annotation are
quite specific, and concern only their inclusion in the resources around generic
corpora. Because they impose one particular model of language on the corpus,
they restrict the kind of research that can be done; because the practice of
annotation normally requires human intervention, it is not a replicable process
and therefore fails the first test of scientific method. Because the models imposed
by current conventions of annotation are unlikely to be informed by corpus
evidence, I believe researchers who use them are likely to make unnecessary
problems for themselves.
None of these reservations are relevant when researchers are concerned
with an application and considering matters such as cost-effectiveness, and are
not interested in any factors outside the application. Annotation as an
exploitation of the mark-up facility is typical of the kind of tool that emerged in
the early days of computing: simple, extremely flexible and useful. The other
side of the coin is that it can be uncontrolled, invasive and overwhelming; I
believe that most of the research projects in corpus linguistics that are in progress
at the present time are not examining their languages at all, but are examining the
tags. The particular choices of word combinations that corpora uniquely offer us
are impossible to retrieve using tags.
As a matter of personal practice, I have very little need for non-automatic
annotation, and I use plain-text corpora whenever possible. This is because I am
primarily interested in the implications of corpus study for the development of
language theory and description. If I was obliged to use only annotated corpora
to work with (which is the settled policy of, for example, the Arts and
Humanities Research Board in the UK, which funds most of the relevant research)
then my work would be hampered if not rendered impossible.
This is where we come to the crunch about annotation, where I think I
part company not only with Jan Aarts but with quite a proportion of the ICAME
4. Conclusion
This has been an exercise in clarification, because for many linguists working
with corpora it might seem bizarre that one group distinguish themselves by
denying any role for the intuition, and condemning the normal practice of
annotation. I cannot, of course, speak for all researchers who might see
themselves as sharing a corpus-driven perspective, but I hope that I reflect their
general position fairly. They have a great respect for intuition, and cannot work
without it. The "cannot" applies in two meanings: they are constantly guided by
it, and they could not get rid of it if they wanted to. As part of their professional
stance they cultivate the skills of degeneralisation, allowing them to stand back a
little from participating in the language events they observe as researchers, and
to defer momentarily the intuitive response; this gives them a small amount of
independence from their intuitions.
They appreciate, moreover, that intuitive responses need careful
interpretation, and they respect the limits of intuitive competence; in particular
they do not expect that if they invent a sentence their intuitions will ensure that it
has all the features of a naturally-occurring one.
At present corpus-driven linguists are not likely to have much use for
annotation, because most of the available systems suffer from the twin
drawbacks that their underlying model of language is pre-corpus, and that they
fit the corpus so badly that human intervention is necessary. Annotation,
however, even of the limited kind we have, has its place in applications, where
quick results are needed and rough-and-ready ones will suffice.
Perhaps the main difference between the two methodological stances in
corpus linguistics is their attitude to the use of annotations, of the present-day
variety, in purely descriptive studies. To the corpus-based linguist they are
indispensable, whereas to the corpus-driven linguist they are obfuscating.
But provided that the various safeguards discussed above are respected
(including those raised in connection with mark-up) there is no objection to the
practice of annotation in itself; used without understanding of its limitations it is
a hazardous practice. Perhaps newcomers to the growing profession of corpus
linguist should be given a few warnings: that annotation is a coding convention
that has no controls beyond the grammar of the code, that the appearance of an
annotated corpus belies the fact that it is an alternation of two separate and
incompatible codes (in the sense that plain text is also a code), that the two
coding streams should always be maintained separately, and that non-automatic
annotation is essentially subjective.
Notes
1. The pattern grammars mark a first step in following the corpus evidence
with little or no grammatical preconceptions, and Hunston and Francis (1999)
give a thorough explication of this approach.
2. See the discussion in Sinclair (1984).
3. The phrase "used language" is from Brazil (1995); while a little whimsical to
be a regular term, it allows us to avoid the issue of authenticity that is such a
humbug in this kind of discussion.
4. It is always conceded that frequency is a crude measure of importance,
and more an indication of a criterion than a criterion in itself. But where two
uses of a word show massive discrepancies in frequency, and the less common
one is the one that first comes to mind, then there is some explaining to be
done.
5. There are 755784 instances of TAKE in the Bank of English, so it would be a
considerable though worthy labour to check this. I have looked at several
small samples, and I have not so far found any convincing examples of the
core meaning, but I would expect them to be few and far between.
6. The Bank of English stood at a little less than 500 million words when these
data were retrieved. Details of the corpus can be found at
http://www.cobuild.collins.co.uk. I am grateful to The University of
Birmingham, co-owners of the corpus, for access to it.
7. An example of this kind of text can be found in the file LEXIS at
http://ota.ahds.ac.uk/, being transcripts of recordings made at the University of
Edinburgh in the early 1960s.
8. See Wynne (ed) (forthcoming) for an example of such guidance.
9. Today's example: "I badged my way into the lobby", said by a police
inspector arriving at a crime scene (Patterson 2002: 23).
10. Some might say that if the description is inaccurate then the machine will
never work properly, and that there is evidence in the performance of such
devices that supports this position. But it is an empirical question.
11. Attitudes change quickly in this area of study, and I can only be sure that in
the few years up to the composition of this paper in 2002, SGML format was
regarded as the standard among the advisers to AHRB. The advisers have
changed, thankfully, and there may now be a greater understanding of the
numbing effect of having to view one's data through the imperfect vision of
another.
References
Sinclair, J., J. Payne and C. Hernandez (eds) (1996), Corpus to corpus: a study
of translation equivalence. International Journal of Lexicography 9 (3)
(Special Issue): 172-196.
Tognini Bonelli, E. (2001), Corpus linguistics at work. Amsterdam and
Philadelphia: John Benjamins.
Wynne, M. (ed) (forthcoming), Developing linguistic corpora: a guide to good
practice (provisional title). Web and Print versions. Oxford Text Archive.
Recent grammatical change in English: data, description,
theory
Geoffrey Leech
Lancaster University
Abstract
1. Introduction
Since that time, the more empiricist and more rationalist trends in linguistics have
diverged so far as to be almost irreconcilable. However, I still find the
formulation in (2) useful, although I would now prefer to insert the words in
square brackets [and functional], showing my preference for a combination of
formal and functional explanation which corpus linguistics is characteristically
attracted to. The other words in brackets [e.g. corpora] are of course a
reminder that corpus linguistics finds its raison d'être at the observational or
data-collection stratum of these three, the one that Chomsky found to be of such
little importance. However, my overarching goal in the present chapter is to
explore the relation between these three interrelated levels, and to argue against
the common assumption that corpus linguistics is concerned with mere data
collection or mere description.
Alongside this, I also have a more practical goal, which is to exhibit as a case
study a particular area of linguistic description: recent quantitative change in
English grammar, as observed through the comparison of the LOB and FLOB
corpora. Although the main study has been focused on the LOB and FLOB
corpora, and therefore on written British English, it has been supplemented where
practicable by work on other corpora permitting a similar comparison between
English in the early 1960s and in the early 1990s. I will use this case study as a
means of illustrating the relation between the three levels of theory, description
and data collection; or, to put them in the order which would more naturally
occur to a corpus linguist: data collection, description and theory.
2.1 Data collection: using the LOB, FLOB, and other corpora
To begin with the level of observation: we began with a study of the two
matching corpora LOB and FLOB, which had already been part-of-speech
tagged, through the combined processing of two taggers: CLAWS4 and Template
Tagger (see Smith 1997 on the tagging techniques).1 By using the powerful
annotation-aware search and retrieval tool XKwic (Christ 1994), we found it
possible to extract occurrences of a whole range of grammatical categories that
have been suspected, with varying degrees of empirical backing, to have become
more frequent or less frequent in the recent past. The main areas of grammar we
focus on in this chapter are (a) the modal auxiliaries, together with the mixed
array of verbal constructions conveniently termed "semi-modals", and (b) a range
of grammatical phenomena associated with a suspected trend of
colloquialization.2
Although we began with the LOB and FLOB corpora, we extended our
study to a selective use of some other comparable corpora spanning
approximately the same period of 30 years, as shown in Table 2.
The family of four matching corpora, Brown, LOB, Frown and FLOB
(henceforward termed "the Brown family"), is well placed to provide evidence of
frequency changes in British and American English over the period between 1961
and 1991-2. Unfortunately no comparable corpora for spoken English exist, but
we were reluctant to confine our attention to written (printed) language,
especially considering that much grammatical innovation is likely to originate in
the spoken language. With the permission and help of Bas Aarts and Gerry
Nelson at University College London, we were able to identify small comparable
spoken subsets from two other million-word corpora developed at UCL with data
from around the early 1960s and the early 1990s.3 These were the corpus of the
Survey of English Usage (SEU), of which a large spoken part was computerized
and distributed as the London-Lund Corpus, and the International Corpus of
English (the British variant known as ICE-GB). Because of difficulties of
matching samples, the spoken mini-corpora from SEU and ICE-GB were even
smaller, indeed much smaller, and were moreover less closely matched than the
Brown family of corpora. One difficulty was that, although the SEU corpus had
been collected over a period of about 30 years, comparability with LOB and
Brown dictated that we rejected any material not contemporaneous with the
written corpora, a constraint we interpreted rather liberally to exclude any
material outside the time frame 1959-1965. Another problem was that the SEU
corpus was subdivided into texts of 5000 words each, whereas the ICE-GB texts
were of 2000 words each. Hence a one-by-one matching of texts between the two
spoken mini-corpora was not feasible, and partial and overlapping matchings had
to be allowed.
Because of these drawbacks, particularly the restriction of the mini-corpora
of speech to a mere 80,000 words each, our findings from the spoken corpora
could only be seen as highly tentative indicators of what was happening to spoken
English over this period. Nevertheless, we felt that such a study, however
inadequate and provisional, would be preferable to a survey of recent
grammatical change which took no account of the spoken language. In fact,
differences observed between the mini-corpora in the frequency of modals and
semi-modals were tantalizingly even greater than those observed between LOB
and FLOB. A summary of the contents of the two spoken mini-corpora is given in
Table 3.
The sophisticated ICECUP software available for searching the ICE-GB
could not be used with SEU-mini-sp, and so to ensure comparability we decided
to use the WordSmith retrieval package and XKwic for both mini-corpora.
This section of the chapter has been called "Data collection", and under this
heading we can bring together the basic evidence-providing tools of the corpus
linguist's stock in trade. Obviously, these include the corpora used for this
particular study, and the software used to extract the relevant grammatical
phenomena: in this case, the search and retrieval tools XKwic and WordSmith.
Basic retrieval products such as concordances and frequency lists, especially
when they incorporate the results of simple grammatical analysis such as POS
tagging, might be considered to take us beyond mere data collection, and to bring
us to the threshold of the descriptive level of analysis. However, the scale of
abstraction represented by the three levels of data collection, description, and
theory is best assumed to consist of many small steps, rather than three giant
strides. I return to the matter of data collection versus description in 2.2 below.
Although so far my presentation of the three levels has worked from the
bottom up, this is of course by no means inevitable in the methodology of corpus
linguists. Some studies are problem-driven, where the need to investigate a
particular theoretical or descriptive hypothesis may determine the collection or
selection of a suitable corpus, and the selection of particular corpus data to be
studied. But in the present case, the bottom-up methodology prevailed. We did
not start with a particular theoretical claim (say about the process of historical
change) or a particular descriptive hypothesis (say about the English modals),
although our study led to these. It was the existence of the LOB and FLOB
corpora, and the particular equivalence relation between them (found also
between Brown and Frown) which enticed us to follow the example already set
by Hundt, Mair and others, and to use these corpora to investigate recent changes
in grammar.4
being made about a particular set of corpora, rather than about the language that
they exemplify. We could call this level of statement "data description": an
intermediate step between data collection and linguistic description.
Table 4
       British English                  American English
Modal     LOB   FLOB  Log lik.  Diff (%)   Brown  Frown  Log lik.  Diff (%)
would    3028   2694      20.4     -11.0    3053   2868       5.6      -6.1
will     2798   2723       1.2      -2.7    2702   2402      17.3     -11.1
can      1997   2041       0.4      +2.2    2193   2160       0.2      -1.5
could    1740   1782       2.4      +2.4    1776   1655       4.1      -6.8
may      1333   1101      22.8     -17.4    1298    878      81.1     -32.4
should   1301   1147      10.1     -11.8     910    787       8.8     -13.5
must     1147    814      57.7     -29.0    1018    668      72.8     -34.4
might     777    660       9.9     -15.1     635    635       0.7      -4.5
shall     355    200      44.3     -43.7     267    150      33.1     -43.8
ought     104     58      13.4     -44.2      70     49       3.7     -30.0
need       78     44       9.8     -43.6      40     35       0.3     -12.5
Total   14667  13272      73.6      -9.5   13962  12287      68.0     -12.2
listed in order of frequency in LOB, and exactly the same order of frequency,
with the exception of should and must, applies to Brown. It will also be seen that
a roughly similar pattern of falling frequency is observed in both BrE and AmE
corpora. Broadly, the most frequent modals decline least, and the least frequent
modals decline most in percentage terms, the rare modals shall, ought (to) and
need (+ bare infinitive) having become much rarer. Some middle-order modals
(especially must and may) also show very significant falls in frequency.
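The two derived columns for each pair of corpora can be illustrated with a short sketch. It rests on assumptions that are not stated in the text: that the percentage column is the change in raw frequency relative to the earlier corpus, that the significance column is a two-corpus log-likelihood score of the usual kind, and that each corpus contains exactly one million words. The function names are hypothetical.

```python
import math


def percent_change(before: int, after: int) -> float:
    """Change in frequency as a percentage of the earlier figure."""
    return (after - before) / before * 100.0


def log_likelihood(freq_a: int, freq_b: int,
                   size_a: int = 1_000_000,
                   size_b: int = 1_000_000) -> float:
    """Two-corpus log-likelihood score for one item.

    The default corpus sizes are an assumption (a nominal one
    million words each), not figures taken from the text.
    """
    total = freq_a + freq_b
    # Expected frequencies if the item were equally likely in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


# Frequencies of "must" in LOB (1961) and FLOB (1991), as tabulated above.
must_change = round(percent_change(1147, 814), 1)
must_score = log_likelihood(1147, 814)

print(must_change)  # -29.0
print(must_score)
```

Under the equal-size assumption the score for must comes out near, but not exactly at, the tabulated 57.7; the true corpus sizes differ slightly from one million words, so an exact match should not be expected.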
The most interesting observation from Table 4, however, is that the overall
frequency of modals is highest in LOB and lowest in Frown, with FLOB and
Brown in intermediate positions. Alongside the decline between 1961 and 1991-
2, there is an equally important difference between AmE and BrE, which invites
interpretation as a time lag. It is as if BrE is following rather reluctantly in the
wake of a change in AmE, with something like a generation gap. This is shown
graphically (though not strictly to scale) in Figure 1.
It might be proposed that the apparent decline in modal usage is due to the
rise, in recent centuries, of the so-called semi-modals, such as be going to and
have to, which are presumed to be still increasingly used. Perhaps these are
gradually encroaching on the territory of the canonical modals. Such a hypothesis
can be tested, up to a point, by noting the differences of frequency of semi-
modals in the four corpora, as shown in Table 5. Although the class of semi-
modals is not a well-defined set, those in Table 5 may be taken as fairly
representative.
Ostensibly, there is no strong connection between the patterns shown by the
modals and the semi-modals.7 Altogether, the semi-modals are very much less
frequent (in written English) than the modals, and their changes in frequency
show a mixed picture. Some of them seem to have increased their usage
massively in the period 1961-1991/2, but others have declined. One of the
differences at first glance lending credence to the encroachment hypothesis is that
AmE shows a greater increase in the semi-modals (+18.6%) in comparison with
BrE (+10.0%), a mirror image of what is happening with the modals.
Unexpectedly, however, the overall frequency of semi-modals is found to be
greater in the BrE than in the AmE corpora in both periods!
68 Geoffrey Leech
Note: The figures in parentheses show frequency per million words, and are therefore comparable to
the figures for the written corpora given in Table 4.
Recent grammatical change in English 69
(ii) During this period, individual modals have been declining at different rates,
but there is a tendency for very common modals to hold their own (e.g. will,
can), and for infrequent modals (e.g. shall, ought to, need) to decline
sharply and to appear almost moribund. Some middle-ranking modals (e.g.
may and must) have also declined sharply.
(iii) Alongside the decline of modals, there is no clear overall picture regarding
semi-modals: although in general, semi-modal usage is increasing, some
semi-modals are declining, and semi-modals as a whole are much less
frequent than true modals.
If we ignore the italicised phrase above ("On the basis of the evidence from the
corpora"), these statements are descriptive: they claim to tell us something that is
true about the language, English. But rather than accept them uncritically, we
have to bear in mind some hazardous assumptions which can be made in moving
from data description to language description:
1. That the corpora are large enough and varied/balanced enough to allow us to
extrapolate from corpus findings to what is happening in (relevant varieties of) the
language in general.
2. That the corpora are sufficiently comparable in terms of samples of the varieties
represented, and in using the same sampling methods.
3. That statistically significant results can be attributed to real linguistic differences,
rather than to extraneous factors such as cultural shifts or faulty sampling.
4. That the grammatical categories are defined and used in a way that other
grammarians or linguists find reasonable.
5. That the extraction of data from the corpora has been acceptably (if not totally) free
from error.
(a) Many results are highly significant as measured by log likelihood ratio.
(b) Trends are consistent across different items, e.g. the general frequency
decline of the modals is replicated in almost every single modal auxiliary.
(c) Trends are often consistent across different subcorpora, e.g. if we
subdivide each of the Brown family corpora into the genre categories Press (A-C),
General Prose (D-H), Learned (J), and Fiction (K-R), often similar trends
are observed in all these four subcorpora. An instance of this is the decline
of the passive from LOB to FLOB (see Table 8). The passive is less
frequent in FLOB as a whole by 12.4%, a trend repeated in a similar way
for each subcorpus (Press 12.5%; Gen Prose 12.4%; Learned 16.6%;
Fiction 3.6%).
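As an illustration of point (a), the following minimal sketch shows how a log likelihood score and a percentage change can be computed for one item in two corpora. It uses the genitive counts reported in Table 8 (4935 in LOB, 6122 in FLOB). Two things are assumptions rather than facts from this chapter: the corpus sizes (each is treated here as exactly one million words) and the choice of the simple two-corpus formula, since the text does not state which variant of the statistic was used. The function and variable names are hypothetical.

```python
import math

# Critical value of chi-square with 1 degree of freedom at p < 0.05.
CRITICAL_P05 = 3.84


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log likelihood for the frequency of one item."""
    total = freq_a + freq_b
    expected = (
        size_a * total / (size_a + size_b),
        size_b * total / (size_a + size_b),
    )
    score = 0.0
    for observed, exp in zip((freq_a, freq_b), expected):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / exp)
    return 2 * score


def percent_change(freq_a, freq_b, size_a, size_b):
    """Change in relative frequency from corpus A to corpus B, in percent."""
    rate_a = freq_a / size_a
    rate_b = freq_b / size_b
    return (rate_b - rate_a) / rate_a * 100


# Assumption: both corpora are treated as exactly one million words.
LOB_SIZE = 1_000_000
FLOB_SIZE = 1_000_000

genitive_ll = log_likelihood(4935, 6122, LOB_SIZE, FLOB_SIZE)
genitive_change = percent_change(4935, 6122, LOB_SIZE, FLOB_SIZE)

print(f"log likelihood: {genitive_ll:.1f}")
print(f"change: {genitive_change:+.1f}%")
print("significant at p < 0.05:", genitive_ll > CRITICAL_P05)
```

Under the equal-size assumption the two values should come out close to the figures given for genitives in Table 8 (128.5 and +24.1); any small discrepancy would reflect the actual word counts of the two corpora, which are not given here.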
Taking further the descriptive study of the LOB and FLOB corpora, we now turn
to a wider-ranging set of grammatical categories, mostly belonging either to the
verb phrase or to the noun phrase. What brings all these categories together is that
they can all be associated with a trend towards colloquialization, that is a
tendency for the written language gradually to acquire norms and characteristics
associated with the spoken conversational language. Quantitatively,
colloquialization can be shown in two ways: (a) by an increasing frequency of
phenomena associated with spoken language, and (b) by a decreasing frequency
of phenomena associated with the written language. Type (a) changes
predominate in Table 8 below, but Type (b) changes are also seen, in the
decreasing frequency of the passive, of the of-construction, and of the relative
pied-piping construction.
Miscellaneous colloquialization features outside the verb phrase
f. Questions (all)                                             2572    2816    11.1     +9.5
g. Verbless questions                                           310     424    17.7    +36.6
h. Tag questions                                                 63      65     0.1     +4.5
j. Genitives                                                   4935    6122   128.5    +24.1
k. Of-phrases                                                 33715   32139    37.9     -4.7
l. Of-phrases competing with the genitive (2% sample only)      124      95     3.9    -23.6
Relative clauses
m. Wh-relative pronouns                                        6971    6376    26.7     -8.5
n. Zero relative with stranding (sample)                         18      73    36.4   +310.0
p. Pied-piping relatives                                       1394    1158    21.9    -16.9
Of the categories within the verb phrase, the first four (a.-d.) all show very
convincing increases between LOB and FLOB. Previous corpus studies (e.g.
Biber et al. 1999: 461-463) have shown the progressive to be more common in
conversation than in written genres, and this is a justification for treating
colloquialization as a possible explanation for a. and b. (However, the growing
use of the progressive aspect can also be linked with grammaticalization, going
back over 500 years.) The passive (e.), on the other hand, is strongly associated
with the written medium (see for example Biber et al. 1999: 476-477), and so its
decline in frequency can count as a negative manifestation of colloquialization.
The next set of categories in Table 8 (f.-h.) is more mixed. In fact f. and h.
(questions) should arguably be excluded from the list of colloquialization
phenomena, as the increase of quoted speech in FLOB compared with LOB (see
Note 9) provides a readier explanation for the increasing occurrence of questions
(+9.5%) and tag questions (+4.5%).
We have begun to investigate two further colloquialization themes in the
noun phrase (see j.-p. in Table 8): the s-genitive vs. the of-phrase; and zero or
that-relative clauses vs. wh- relative clauses. Results so far point in the direction
of (a) a rise in the genitive with a corresponding decline in of-phrases; and (b) a
rise in zero relative clauses ending with a stranded preposition and a
corresponding decline in wh- relative clauses. The rise in stranding accords with
an unsurprising and significant fall in the use of pied-piping constructions in
which the wh-relative pronoun is preceded by a preposition (in which, of whom,
etc.).
still work in progress. (In fact I have gone so far as to suggest that it is in the
nature of corpus research to be provisional.) Second, the hazardous assumptions
listed in section 2.3 have to be kept in mind throughout, and opportunities found
to probe them further. I have yielded above to the temptation to talk in terms of
the language change between LOB and FLOB: a kind of dynamic metaphor used
to explain what are actually sets of synchronic observations about a 1961 corpus
and a 1991 corpus. But the claims that these observations represent changes in the
(use of the) language ultimately remain hypotheses, in need of further probing
and confirmation.
deontic must and the rise of deontic should, have to and need to). For this
kind of explanation in the realm of modality, see Myhill (1995).
Americanization: the influence of North American habits of expression and
behaviour on the UK (and other nations). This apparently shows up in the
loss of frequency of the modals, as depicted in Figure 1.12
Table 10. Some principles of usage-based models of language (after Barlow and
Kemmer 2000)
1. The intimate relation between linguistic structures and instances of the
use of language.
2. The importance of frequency.
3. Comprehension and production are integral, rather than peripheral, to the
language system.
4. Focus on the role of learning and experience in language acquisition.
5. Importance of usage data in theory construction and description.
6. The intimate relation between usage, synchronic variation, and diachronic
change.
7. The interconnectedness of the linguistic system with non-linguistic
cognitive systems.
8. The crucial role of context in the operation of the linguistic system.
Notes
6. A caveat about frequency: most of the frequency figures in this study are very
close approximations rather than guaranteed 100% accurate. Both manual
procedures and automatic procedures can give rise to error, although the
incidence of error is likely to be totally insignificant. The one exception to this
is the margin of error arising from POS tagging (about 2% in the present
context). Although we were able to use the results of manual correction for
the LOB Corpus and most of the FLOB corpus, for the fictional genres (K-R)
of FLOB and for the Frown Corpus we had to rely on automatic tagging only.
A method of approximation was devised on the basis of comparing automatic
tagging and manual tagging outcomes in cases where they were both
available, and hence calculating an error coefficient for each tag. The
procedure is described in the Appendix to Mair et al. (2003).
7. However, the decline of must may have some connection with the increase in
use of have to and need to; see Smith (2003). In general, the varied
behaviour of the semi-modals in this corpus confirms the impression that they
comprise a miscellaneous category. In Quirk et al. (1985: 136-148), where it is
argued that they form a gradient between auxiliary and full verbs, four
intermediate categories are distinguished: marginal modals, modal idioms,
semi-auxiliaries, and catenative verbs.
8. On representativeness, Biber (1993) is the classic reference; but Biber's
position has also been criticised (e.g. by Váradi 2001). There is no test that
could be used to ensure that statements about the LOB and FLOB corpora are
representative of the varieties of English of which they are samples, except to
collect independent samples of data of the same text types: in effect, to
replicate the LOB and FLOB corpora but with different text samples.
9. Nick Smith has undertaken a count of quoted material in the LOB and FLOB
corpora, helped by a program written by Izumi Tanaka. He found that the
number of words within quotation marks in FLOB was c. 127,000, compared
with c. 116,000 words in LOB, an increase of c. 9.5%. This figure of +9.5%
is a reasonably close approximation, but needs to be followed up by further
checks and edits.
10. Actually genitives are not so frequent in conversation as in some varieties of
written English, especially news writing (see Biber et al. 1999: 302). This can
be largely explained by the fact that nouns are notably infrequent in the
spoken language: a construction which is rich in nouns (a description that
applies both to the genitive construction and the of-construction) is therefore
comparatively rare. However, if we consider the likelihood of choosing a
References
Leech, G. (2003), Modality on the move: the English modal auxiliaries 1961-
1992, in: R. Facchinetti, M. Krug and F. R. Palmer (eds), Modality in
contemporary English. Berlin & New York: Mouton de Gruyter, 223-240.
Leech, G., B. Francis and X. Xu (1997), The odds in favour of the genitive: a
study of gradience in English, in: K. Yamanaka and T. Ohori, The locus
of meaning: Papers in honor of Yoshihiko Ikegami. Tokyo: Kuroshio, 187-
208.
Mair, C., M. Hundt, G. Leech and N. Smith (2002), Short term diachronic shifts
in part-of-speech frequencies: a comparison of the tagged LOB and FLOB
corpora, International Journal of Corpus Linguistics, 245-264.
Mair, C. (1997), Parallel corpora: a real-time approach to language change in
progress, in: M. Ljung (ed.), Corpus-based studies in English: Papers
from the 17th International Conference on English Language Research on
Computerized Corpora (ICAME 17). Amsterdam: Rodopi, 195-209.
Mair, C. (1998), Corpora and the study of the major varieties of English: issues
and results, in: H. Lindqvist et al. (eds), The major varieties of English.
Växjö: Växjö University Press, 139-157.
Myhill, J. (1995), Change and continuity in the functions of the American
English modals, Linguistics 33: 157-211.
Popper, K. (1972), Objective knowledge (revised edition). Oxford: Oxford
University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rayson, P., A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds) (2001),
Proceedings of the Corpus Linguistics 2001 Conference. Lancaster
University: UCREL Technical Papers 13.
Serpollet, N. (2001), The mandative subjunctive in British English seems to be
alive and kicking... Is this due to the influence of American English?, in:
Rayson et al. (2001), 531-542.
Smith, N. (1997), Improving a tagger, in: R. Garside, G. Leech and A. McEnery
(eds), Corpus annotation: Linguistic information from text corpora.
London: Longman, 137-150.
Smith, N. (2003), Changes in the modals and semi-modals of strong obligation
and epistemic necessity in recent British English, in: R. Facchinetti, M.
Krug and F. R. Palmer (eds), Modality in contemporary English. Berlin &
New York: Mouton de Gruyter, 241-266.
Váradi, T. (2001), The linguistic relevance of corpus linguistics, in: Rayson et
al. (2001), 587-593.
Corpus data in a usage-based cognitive grammar
Joybrato Mukherjee
University of Giessen
Abstract
The present paper is intended to bridge the long-established gap between corpus-
based research into actual language use on the one hand and cognitive models of
the abstract language system (in terms of speakers' competence) on the other.
For this purpose, a very useful, non-generative framework is provided by
Langacker's usage-based cognitive grammar. In general, the consideration of
corpus data in cognitive grammar leads to an innovative and realistic model of
speakers' linguistic knowledge, i.e. a model which is data-oriented and
frequency-based, functionalist and lexicogrammatical in nature. This theoretical
from-corpus-to-cognition approach will be illustrated by discussing corpus data
on the use of the ditransitive verb GIVE and by sketching out how the data may
be included in a truly usage-based model of the lexicogrammar of GIVE.
The patterns of a word can be defined as all the words and structures
which are regularly associated with the word and which contribute to
its meaning. [...] as a word can have several different patterns, so a
Special emphasis is placed here on the actual use of the linguistic system. In
general, this clearly mirrors the Hallidayan assumption that system and use are
inseparable because language use instantiates the system (cf. Halliday 1991: 31).
More specifically, a model of language cognition should be able to account for
actual usage, so that the model has to be based on actual use in the first place. It is
exactly here that corpus data may play a major role in refining cognitive grammar
and increasing its usage-basedness: corpora are samples of actual use of the
linguistic system; the schematic networks, low-level schemas and linguistic
conventions correspond largely to the lexicogrammatical patterns and routines
that can be identified by drawing on corpus data.
Table 1. Corpus-based insights into actual language use and their implications
for a usage-based cognitive grammar
some typical features of language use     | implications for a usage-based
as attested in corpora                    | cognitive grammar
------------------------------------------+-------------------------------------
linguistic forms differ with regard to    | knowledge about these frequencies
frequency and distribution                | and distributions should be part of
                                          | the model
------------------------------------------+-------------------------------------
language use is to a large extent based   | the model should account not only
on recurrent patterns of different kinds  | for linguistic creativity but also for
                                          | linguistic routine
------------------------------------------+-------------------------------------
quantitative findings can often be        | these principles/factors are part of
explained by considering functional and   | speakers' linguistic knowledge and
context-dependent principles/factors      | should be included in the model
------------------------------------------+-------------------------------------
lexical and grammatical choices are       | lexicogrammatical patterns should
interdependent                            | be at the basis of the model
pattern.4 In the examples, the preceding pattern is given in italics, and the over-
lapping GIVE-pattern is underlined.
(2) But it then means that the more things they put on the menu the tinier the
amount they give you <ICE-GB:S1A-018 #24:1:B>
(3) I would anticipate doing one or two units per year and would be grateful
for any financial assistance that the college could give me
<ICE-GB:W1B-022 #152:13>
(4) I must thank you, Simon and your parents officially for the slow cooker
and table cloth you gave us for our wedding <ICE-GB:W1B-004 #12:1>
For the passive type IP, there are many factors that seem to play a role in
the process of pattern selection. The cluster of relevant factors is summarised in
(5). It is not at all surprising that in more than 96% of all instances the by-agent is
left out. An important reason for choosing type IP thus lies in the optionality of
the agent. Additionally, two further factors seem to be responsible for the fact that
the recipient (corresponding to the indirect object in the default active type-I
pattern) is placed in the initial slot, thus serving as the grammatical subject. First,
this pattern tends to be chosen whenever the direct object is significantly heavier
than the initial element and is therefore placed in final position according to the
principle of end-weight (cf. Quirk et al. 1985: 1362). The correlation between
weight and pattern selection is illustrated in examples (6) and (7). This factor
alone accounts for 50% of all 84 cases. Second, in some 10% of all cases it is the
recipient that has already been activated before and is thus taken up as the first
element in the type-IP pattern. This is in line with the principle of end-focus (cf.
Quirk et al. 1985: 1357) according to which there is a general tendency to place
given information before new information. In examples (7) to (9), the previously
activated element which is part of (or provides the initial element for) the GIVE-
pattern at hand is italicised.
(6) [...] Margaret Thatcher cannot be given all the credit for our record levels
of radioactivity both at sea and on land <ICE-GB:W2B-014 #11>
(7) and rather nastily she had been tied to a chair until she was fourteen by her
blind mother and never actually given any form of uhm sound or language
communication <ICE-GB:S1B-003 #102>
(8) After all Saddam Hussein uh led his people they although they were not
given much choice in the matter in an eight year war [...]
<ICE-GB:S1B-035 #66>
(9) The Italian peoples were bound to fight in Rome's wars at their own
charge [...] Some peoples were actually given Roman citizienship [...]
<ICE-GB:W2A-001 #006/8>
heavy (39 of 123 cases = 31.7%)
(11) so we can have an acid and alcohol and give it to the esterase which is a
useful product <ICE-GB:S2A-034 #39>
(12) A clutch of opinion polls gave comfort to both sides in the simmering
civil war yesterday <ICE-GB:W2C-006 #76>
(13) but when you follow that through you've got the means to give rise to a
change in the method of accounting that's adopted in the company
<ICE-GB:S2A-037 #122>
Type IIP is the passive form that can be derived from the type-II pattern.
Note that the systematic correspondence between the two patterns stems from the
fact that in both cases the indirect object is realised as a to-phrase. As shown in
(14), all the kinds of factors that are involved in the choice of the passive pattern
IP are also involved in type IIP: previous activation of the initial element (6 of 23
cases = 26.1%), heaviness of the post-verbal element (8 of 23 cases = 34.8%),
and the frequent omission of the by-agent (22 of 23 cases = 95.6%). In the light of
the 23 cases at hand, we may also assume that two further factors may at times tip
the balance in favour of type IIP: (i) the need to put the indirect object in focus
according to the principle of end-focus; (ii) the use of a lexical item (e.g. thought)
in the passive subject which may be habitually associated with the preposition to.
The cluster of all five factors and their explanatory power in quantitative terms
are summarised in (14). Example (15) illustrates the relevance of the principle of
end-focus (here in order to contrast the two italicised elements at the end of the
two dependent clauses). Example (16) refers to the influence of the lexical item in
subject position on the selection of the type-IIP pattern.
(15) At the start of the conflict you said more time should have been given to
sanctions but now you're saying that more time should have been given to
pursue those diplomatic initiatives <ICE-GB:S2B-018 #92-94:2:D>
(16) It is not clear that enough thought has been given to the consequences of
these proposals for the movement of traffic outside the areas immediately
affected <ICE-GB:W1B-027 #41:4>
What all type-III patterns have in common is the fact that the indirect
object is omitted. Note that in many of these cases, the verb GIVE is not parsed
as ditransitive but as monotransitive in ICE-GB. For various reasons, how-
(18) So for instance we can give a very nice account of coarticulation [...]
<ICE-GB:S2A-030 #12>
(19) It helps to clarify the poet's ambiguous comments beforehand by giving an
actual example of what he means <ICE-GB:W1A-018 #33>
(20) And it's that sort of thing that gave the impression which I'm sure he was
trying to do <ICE-GB:S1B-038 #103>
From type III, the passive form IIIP can be derived. Again, the optionality
of the by-agent is most important for the process of pattern selection because it is
omitted in 31 out of 38 cases (81.6%). Additionally, specific lexical items in the
subject position (i.e. the subjectivised direct objects of the type-III pattern) tend to
be closely associated with this pattern. That is to say, not only is the type-IIIP
pattern used whenever neither the agent nor the recipient needs to be explicitised
but also when particular words refer to the patient of the action. In (21) those
words are listed that occur at least twice in this pattern in ICE-GB, accounting for
some 45% of all instances. Some of them are exemplified in (22) to (24).7
left out (31 of 38 cases = 81.6%)
recurrent lexical items (≥2): approval (2), limit (2), information (2), detail (7),
time (2), directions (2) (17 of 38 cases = 44.7%)
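The two proportions reported for type IIIP can be checked with a few lines of code. The sketch below is only an illustration: it takes the item counts and the total of 38 instances as given in the text, and the variable names are hypothetical.

```python
# Counts of recurrent lexical items in subject position for type IIIP,
# as listed in the text (items occurring at least twice in ICE-GB).
recurrent_items = {
    "approval": 2,
    "limit": 2,
    "information": 2,
    "detail": 7,
    "time": 2,
    "directions": 2,
}

TOTAL_IIIP = 38        # all type-IIIP instances reported in the text
AGENT_OMITTED = 31     # instances without a by-agent
MIN_COUNT = 2          # threshold for counting an item as recurrent

covered = sum(n for n in recurrent_items.values() if n >= MIN_COUNT)
recurrent_share = covered / TOTAL_IIIP * 100
agent_omitted_share = AGENT_OMITTED / TOTAL_IIIP * 100

print(f"recurrent items: {covered} of {TOTAL_IIIP} = {recurrent_share:.1f}%")
print(f"by-agent omitted: {AGENT_OMITTED} of {TOTAL_IIIP} = {agent_omitted_share:.1f}%")
```

With only 38 instances, a single additional occurrence shifts either share by more than two and a half percentage points, so the figures are best read as rough indications.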
(22) He's called Malachi in the opening verse but no biographical information
is given about him <ICE-GB:S2A-036 #78>
(23) uh directions are given from Ushant uh from the Scillies uh from the
South coast of Ireland down to Cape Ortegal or Finisterre
<ICE-GB:S2B-043 #20>
(24) More specific implementation details are given at the end of the report
<ICE-GB:W1A-005 #5:1>
(26) and he will also know of the increased uh support given uh in the uh
announcement last week by my right honourable friend the Social Security
Secretary <ICE-GB:S1B-056 #46:1:B>
(27) [...] it also is of relevance when considering the evidence given by Mr Holt
because there is a clear conflict [...] <ICE-GB:S2A-068 #40:1:A>
(28) But what I have simply done is to trace on to a map the directions that are
given which give you some indication [...] <ICE-GB:S2B-043 #19:1:A>
I:  (S) GIVE [Oi:NP] [Od:NP]
Ib: [Od:NP (antecedent)] (rel. pron.) [S] GIVE [Oi:NP]
IP: [S < Oi active] BE given [Od:NP] (by-agent)
The actual use of the eight most frequent GIVE-patterns and the relevant
principles of pattern selection as described above provide an empirically sound
basis for a truly usage-based cognitive model of the lexicogrammar of GIVE.
Such a usage-based model on the basis of ICE-GB is visualised in Figure 2.
In two regards, the tentative model suggested in Figure 2 is more
elaborated and more usage-based, as it were, than traditional lexical networks in
cognitive grammar (as, for example, shown in Figure 1). Firstly, the thickness of
the lines between GIVE and its patterns depends on the frequency of GIVE in
each pattern. Figure 2 thus puts into operation what has been suggested, among
others, by Lamb (2002: 91), namely that different [d]egrees of entrenchment
[can be] accounted for by variability in the strengths of connections. Secondly,
at all lines connecting GIVE and its patterns there is information on why a
particular pattern is used in a given context. Such principles of pattern selection
can be identified only by looking at large amounts of natural data in context and
have so far not been taken into consideration in cognitive grammar. More
specifically, traditional network models in cognitive grammar have focused on
what is structurally possible. Corpus data, however, provide information on what
is likely to occur and why. As I have argued elsewhere (cf. Mukherjee 2002),
both aspects are part of speakers' linguistic knowledge and should therefore be
covered by a truly usage-based cognitive grammar.
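The two refinements described above, frequency-weighted links and explicit principles of pattern selection, can be pictured as a small data structure. The following is a rough sketch only, not the model shown in Figure 2: it covers just the three passive patterns whose instance counts are stated in the text (IP: 84, IIP: 23, IIIP: 38), and the class name, the function name and the normalisation over these three patterns alone are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PatternLink:
    """One link between GIVE and a pattern (hypothetical representation)."""
    pattern: str
    frequency: int
    selection_factors: List[str] = field(default_factory=list)


def connection_strengths(links: List[PatternLink]) -> Dict[str, float]:
    """Relative strength of each link: its share of all counted instances."""
    total = sum(link.frequency for link in links)
    return {link.pattern: link.frequency / total for link in links}


# Counts and factors as reported in the text; only three of the eight
# patterns are included, so the shares are illustrative, not the real weights.
links = [
    PatternLink("IP", 84, [
        "by-agent omitted",
        "heavy direct object (end-weight)",
        "recipient previously activated",
    ]),
    PatternLink("IIP", 23, [
        "initial element previously activated",
        "heavy post-verbal element",
        "by-agent omitted",
        "end-focus on the indirect object",
        "lexical item associated with to",
    ]),
    PatternLink("IIIP", 38, [
        "by-agent omitted",
        "recurrent lexical item in subject position",
    ]),
]

strengths = connection_strengths(links)
for link in sorted(links, key=lambda l: strengths[l.pattern], reverse=True):
    print(f"{link.pattern}: {strengths[link.pattern]:.2f}", link.selection_factors)
```

Keeping the selection factors on the link itself, rather than on the verb or the pattern, mirrors the point that the reasons for choosing a pattern belong to the connection between GIVE and that pattern.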
The present paper is informed by the belief that corpus linguistics and cognitive
linguistics are not at all mutually exclusive but can fruitfully complement each
other in developing a genuinely usage-based model of language cognition, i.e. of
speakers' knowledge of the underlying language system. A genuinely usage-
based model defies the rigid Chomskyan dichotomy between competence and
performance.8 In fact, such a model is intended to bridge the gap between
system and use and to mirror speakers' linguistic knowledge along the lines of
Hymes's (1972, 1992) concept of communicative competence, in which the
ability to use linguistic forms and structures idiomatically (e.g. in terms of
frequently co-occurring forms) and appropriately (e.g. in terms of pragmatic
principles) is integral to speakers' knowledge of the language. This view is
closely related to the Hallidayan idea that language use and language system are
intricately interwoven, which makes it possible and reasonable to derive from a
corpus-based analysis of actual language use a usage-based model of the
cognitive entrenchment of the language system. In effect, this approach
capitalises on Schmid's (2000: 39) From-Corpus-to-Cognition Principle:
Frequency in text instantiates entrenchment in the cognitive system. In
particular, I hope to have shown that lexical network models in cognitive
grammar can be refined in two regards by taking into account corpus data: not
only is it possible to introduce frequency-based information on different strengths
of linkage between lexical items and constructions but also to introduce in the
Notes
1. Note that Langacker (1999: 122) himself states that lexicon and grammar
grade into one another so that any specific line of demarcation would be
arbitrary. This description is of course largely reminiscent of the Hallidayan
approach to lexicogrammar as a unified phenomenon, a single level of
wording, of which lexis is the most delicate resolution (Halliday 1991:
31-32).
2. It should be noted that the data in Table 2 are based on a manual analysis of
all occurrences of GIVE and not on the parsing information included in ICE-
GB. The reason why the data were analysed manually is the fact that many
instances of GIVE are not parsed as ditransitive in ICE-GB but, for example,
as monotransitive (especially in the case of type-III patterns) or as complex-
transitive (especially in the case of type-II patterns). In contrast, I regard all
instances of GIVE as examples of ditransitivity on cognitive and semantic
grounds (cf. Goldberg 1995 and Newman 1996). It is for this reason that
phrasal verbs such as GIVE AWAY, GIVE IN and GIVE UP have not been
taken into account, because their semantics tends to be quite different from
GIVE. Note also that not all instances of GIVE can be grouped into any of the
patterns listed in Table 2. However, such miscellaneous cases are rare and
thus of a marginal nature.
3. The pattern formulas are based on the following notational conventions: [...]
obligatory element; [...(...)] obligatory element with a specific form/function;
(...) optional element; Oi/Od clause element which is not part of the
lexicogrammatical pattern at the level of syntactic surface structure (although
the corresponding argument role is taken to be implicitly evoked by GIVE at a
cognitive level).
4. In fact, this is reminiscent of what Hunston and Francis (2000: 211) refer to as
pattern flow: Pattern flow occurs whenever a word that occurs as part of the
pattern of another word has a pattern of its own.
5. Since, from a lexico-semantic point of view, the existence of a recipient is
already inherent in the event type evoked by the ditransitive verb GIVE, there
is no need to explicitise the recipient as an indirect object at the level of
syntactic surface structure in these cases. For example, in phrases such as give
a lecture and give a talk some kind of recipient is always implied (e.g. an
unspecified audience). Accordingly, Newman (1996: 54), in his cognitive
study of GIVE, describes such implicit argument roles as unfilled elaboration
sites.
6. As a matter of fact, many of the lexical items could be complemented by other
items of the same semantic field that also occur in this GIVE-pattern in ICE-
GB, e.g. give a lecture/a talk (+ a paper, a speech, a statement...), give
instructions (+ advice, help, orientation...) and give a message (+ an answer,
an outline, a response, a warning...). The important point here is that the lexis
in direct-object position is semantically restricted.
7. Note that the analysis of type IIIP is based on 38 instances only. One could
easily hypothesise that the list of recurrent lexical items would have been
much more similar to the list given for the type-III pattern if some 250 cases
had been scrutinised. Here, larger corpora are needed.
8. It is for this reason that the term competence is not used in the present paper.
A cognitive model that is based on corpus evidence, as suggested in the
present paper, has not much in common with a generative model of
competence. Thus, it is not very useful to take over and extend or redefine the
term competence, which would automatically lead to terminological
confusion (cf. Taylor 1988). Instead, I prefer to speak of a usage-based model
of speakers' linguistic knowledge.
9. Note that many issues that have only been mentioned in passing in this
section, including the implications of the concept of communicative
competence, the issue of constructional networks and the place of genre
distinctions in a usage-based model of speakers' linguistic knowledge, will be
discussed in much more detail in a book-length study that is underway (cf.
Mukherjee, forthcoming).
References
Caroline David
Université de Poitiers
Abstract
1. Introduction
Beneath homogeneous semantic and cognitive features, the class of Verbs of
Putting, as Levin (1993) and Dixon (1991) label it, displays significant variations
in their syntactic organisation. On the basis of corpus evidence, I will first tackle
the problem from a semantic point of view, showing that the synonymy of put,
set, lay and place is only superficial, and that the constraints they impose on their
prepositions depend on the semantic content of each (verb and preposition).
Indeed, the general semantic value of these three-place predicates (which
represent most of their uses) is the result of a semantic balance between the
preposition and the verb itself. Then I will examine the verb load whose syntactic
behaviour seems to make it closer either to FILL or COIL-verbs according to the
structure it is in. Finally, the analysis of fill will bring out (i) the complex and
intricate relations that tie up the three arguments of the verb (the subject, the
object, and the goal location), that is to say the syntactico-semantic mechanisms
which underlie it, and (ii) the intrinsic semantic properties of each argument that
are required by fill.
Number of idioms 94 21 12 0
The high frequency of put and the generality of its meaning, that is to
say its polysemy compared with set, lay and place, both lead me to the same
conclusion: put, set, lay and place are not synonymous, and therefore cannot be
classified under the same label of PUT-verbs (sub-class no. 1), because they do
not seem to represent the same process of putting things. What has
been obvious for lay since Levin (1993) described it as a Verb of Putting in
Spatial Configuration should therefore also hold for set and place.
Defining lay as the way things are put, the way the object is displaced,
adds more information to the process of putting, which is not included in the
light meaning of put, if I can put it this way. Similarly, it seems that set and
place behave like lay in the sense that they also describe the way things are
moved, if we rely on the following definitions given by dictionaries such as the
Collins Cobuild English Dictionary, the Oxford English Dictionary and the
Longman Dictionary of Contemporary English (the highlighting in boldface is
mine):
SET 1. If you set something somewhere, you put it there, especially in a careful
or deliberate way. He took the case out of her hand and set it on the
floor. When he set his glass down he spilled a little drink.
Putting putting verbs to the test of corpora 105
LAY 1. If you lay something somewhere, you put it there in a careful, gentle, or
neat way. Mothers routinely lay babies on their backs to sleep.
SET 1. To put in a definite place (the manner of the action being implied either
in the verb itself or in the context); to put (more or less permanently) in a
definite place.
For each verb, we find the notions of deposit, dispose, arrange or place in a
certain, specified or particular position, not to forget the notion of carefulness
which is quite often underlined. The OED even specifies for set that the manner
of the action is implied either in the verb itself or in the context.
Therefore, I will propose a new distribution of the different classes, where
on the one hand put, alone, is considered the prototypical verb of the general
process of putting with little additional information regarding the way things are
displaced and, on the other hand, set, lay, place etc., which are more specific in
meaning than put, are classified together as a kind of manner of putting.
According to Lyons' tests (1977: 292), which allow us to establish the position of
a lexeme in a hierarchical lexical field, setting, placing and laying things are all
kinds of putting things; that is to say, put is a hyperonym and the others are hyponyms.
This conception of the semantic organisation of the verbs might be represented as
in Figure 1 (see note 4).
[Figure 1: PUT (hyperonym, prototype) above its hyponyms SET, PLACE and LAY]
Let us now move on to some other verbs of the large class of Verbs of
Putting, viz. SPRAY/LOAD-verbs.
(1) Similarly there had been hay bales. Similarly, now there were for us
school trunks. Three times a year I loaded school trunks on to the car
and took them to the station, and three times a year loaded them on the car
and brought them home from the station. (LOB G19 163-164)
No. of PP-phrases 6 6 9 5 24
No. of with-phrases 12 7 4 11 34
To judge from the frequencies in the Brown set of corpora (Table 6), load does
not really favour one structure more than the other (see note 6). This syntactic feature
(locative alternation), which can be viewed as the common feature of the whole
class, links them closely to either COIL-verbs or FILL-verbs.
The first structure (1a) resembles the syntactic construction of put and can be
glossed as I put school trunks on to the car by loading them. Load and put
are thus syntactically and semantically close to each other: in both cases a subject (I)
triggers the action of moving an object (trunks) to a destination (the car). What
sets them apart is only the manner of putting the trunks on to the car, which ties
in with what I have just said about the manner of movement for set, lay and
place. Thus, in (1a) the object is transferred and located relative only to the
destination, that is, pure displacement. What I should add is that the affectedness
of the object tends to take on a holistic interpretation; in other words, the default
interpretation is that all the trunks are loaded, irrespective of whether the car is
full or not. As with put, in this structure, we can analyse the process of loading
in two steps (see Figure 2): first and foremost the relation [I - load - trunks] is
established (a strong link between the subject and the object), and secondly the
trunks are located relative to the car, and they are completely loaded. The order of
construction of the two relations in (1a) emphasises one thing: the importance of
the second argument, that is to say the direct object, which is also found in the
construction of put and of course in one of the possible structures of COIL and
POUR-verbs, as illustrated by pour and spill, respectively (see note 7):
[Figure 2. Process of loading: pure displacement (1a). Subject: I; object: trunks; Quantification (see note 8)]
(2) Spoon the mixture over the potatoes and then pour the cheese sauce over
the top. (FLOB E19 122)
[Figure: subject: I; object: the car; Qualification (see note 9)]
Before moving on to the FILL-verbs, let us examine a trickier and more
unusual example of load, which shows that the with-construction is decisive for
the notion of qualification but not for quantification:
(4) She knew, for Gideon dispatched formal reports whenever he could, that
the ship had made her way safely to Australia and disgorged her
passengers and cargo safely. She had loaded with Mundy wool from the
warehouses in Melbourne and was on her way back home. (FLOB P03
153)
(5) Norah will pack us up some sandwiches, and I will fill the flasks with tea.
(FLOB P25 146)
(6) Just then, to his delight, the gray fills with snow, as though someone
standing below the window had broken open seedpods and tossed up
fistfuls of white puffs. (Frown N22 17)
4. FILL-verbs
If we now turn to the structure of FILL-verbs, we find that we can have Norah
filled the flasks with tea, but not *Norah filled the tea into the flasks. The flask,
at once locative, container and the object affected by the tea, has changed properties,
since it goes from the state of emptiness to the state of fullness. It changes from a
flask (a tea flask) to a flask of tea. The destination (the underlying element in this
pattern) is first any kind of container (any flask) and then it becomes an object
with a specified content (a flask which has the property of being a flask full of
tea and not a flask of water, or of wine). Therefore, fill (like the FILL-verbs in
general) is more restrictive on the type of object it can take than load: we can
only take into account the [fill-container] relation and then qualify it. Hence, we
have first [fill-flask] and then [tea: process of filling], which is very similar to the
second case of load (1b). This [fill-container] relation is very strong, since we can
have a structure with only two nominal arguments (two participants), which is
statistically quite frequent (28% of the transitive uses of fill are two-participant
structures; see note 12):
(7) He fetched the Nescafe and camp stove from his provisions, then went to
fill the bottle. (Frown N21 54)
In fact, load gives less information about the type of locative it accepts and about
the way things are moved than fill. The object constructed with fill must satisfy
all the properties of something which can be filled, whereas almost anything
can be loaded. As FILL-verbs format, give shape to, and put more restrictions
on the container than load, the only possible constructions are Norah filled the
flasks with tea, and the gray fills with snow (as in example 6).
In short, beyond the locative alternation which shows a certain syntactic
consistency across the members of the class, the properties of the object in one
type of structure and of the destination in the other reveal how the verb behaves
in a more significant way. The choice of one structure over another, e.g. the
choice of pour and coil over fill, serves to highlight the motion of the content,
rather than the change in fullness of the container. This idea is notably supported
by Gropen et al. (1991: 161) in their discussion of verbs like pour and fill:
5. Conclusion
All the verbs put, set, lay, place, coil, load, and fill need a PP which provides
different kinds of additional specification of the destination from a semantic point
of view; at the same time, a characterisation is given of the nature of the link
between the verb and its different arguments.
If we return to Levin's classification (see Appendix) and run through her
categories, we have first a semantically light verb (put) which imposes few
restrictions on the type of object, on the destination point and thus on the
preposition. This verb can combine with the largest number of prepositions: 39.
Then, verbs such as set, lay and place add more specification and information on
the process of putting but still without heavy constraints on the object and on the
destination as is shown by the various definitions of the dictionaries and the verb-
preposition combinations: 24 different prepositions for place and only 20 for set
and lay. Next, we come to the POUR-verbs and the COIL-verbs which combine
with a restricted set of prepositions reflecting the semantics of the verb and the
focus on the quantity of the object displaced. At the boundary between the COIL-
verbs and the FILL-verbs, we find the SPRAY/LOAD-verbs which, depending on
the type of structure (a PP-construction or a with-construction), are closer to
either of these classes. LOAD, according to our findings in the Brown set of
corpora, does not really favour either of these two constructions: 24 PP-phrases and
34 with-phrases. Moreover, it must be stressed that 82% of the examples of load
with in the Brown set are passive, a distribution that contrasts with the preference
for active examples in reference books. Next, if we return to SPRAY, the figures
are reversed, since we found 13 occurrences with PP-structures and 5 with-
structures, a tendency which once again could not be induced without a corpus
analysis. Finally, the FILL-verbs specify and severely limit the goal location (the
container) by qualifying it with its object. Let us recall that 28% of the transitive
uses of fill in the Brown set do not have any goal location in their structures.
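The proportions quoted in this section can be recomputed from the raw counts given in the text and in notes 6 and 12. The short Python sketch below is only an illustration: the dictionary layout and the helper name share are assumptions introduced here and are not part of the original study.

```python
# Raw counts quoted in the text and in notes 6 and 12 (Brown set of corpora).
# The data layout and the helper name `share` are illustrative assumptions.
counts = {
    "load": {"PP": 24, "with": 34},
    "spray": {"PP": 13, "with": 5},
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of PP-constructions among all PP- and with-constructions per verb.
pp_share = {
    verb: share(c["PP"], c["PP"] + c["with"]) for verb, c in counts.items()
}

# Two-participant uses of fill: 163 examples out of 583 (note 12).
fill_two_participant = share(163, 583)

for verb, pct in pp_share.items():
    print(f"{verb}: {pct}% PP-construction")
print(f"fill: {fill_two_participant}% two-participant structures")
```

On these counts load occurs in the PP-construction in roughly two cases out of five, whereas spray does so in nearly three cases out of four, which is the reversal the text describes.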
Large electronic corpora of the kind used here provide a fruitful resource
for new approaches to the study of language, such as the use of data for a more
qualitative analysis and for testing hypotheses about syntactic structures. Corpora
describe and reflect a type of language reality which does not always correspond
to the picture presented in traditional descriptions.
Notes
1. Dixon (1991: 99) goes even further, since he gathers put, set, place, fill and
load in the same class labelled rest verbs - put subtype, which refers to
causing something to be at rest at a Locus.
2. Note that deverbal uses have been included in this count.
3. Excluding deverbal cases.
4. In this study, I use some of the concepts in Culioli's Theory of Enunciative
Operations (Culioli 1990), in particular quantification (QNT) and
qualification (QLT) (see below). However, it should be noted (J. Chuquet,
private communication) that the concept of notional domain that one might be
tempted to associate with the put/set/lay/place system cannot be relevant here,
as it bears little or no relation to a prototype theory.
5. The use of the definite or indefinite article with the location and the object
transferred has been discussed by Laffut (1997, 1998).
6. Conversely, spray is more used in a PP-structure (13 examples) than in a with-
structure (5 examples) in the Brown set of corpora.
7. For more details on how COIL-verbs function, see Beatty (1979).
8. Quantification refers to the existence of an entity in Culioli's theory.
9. As a rough approximation, qualification refers to the properties of an entity
(for a full treatment, see Culioli 1999).
10. They give the following examples: Her books translate well. The sentence
reads clearly. My shirts have dried very quickly. The sheets washed easily. My
teapot pours without spilling.
11. By contrast, it seems that spray has only two structures, (1a) and (1b).
12. 163 examples out of 583.
Appendix: Levin's (1993) classification of Verbs of Putting
1. Put Verbs
arrange, immerse, install, lodge, mount, place, position, put, set, situate, sling,
stash, stow
2. Verbs of Putting in Spatial Configuration
dangle, hang, lay, lean, perch, rest, sit, stand, suspend
3. Funnel Verbs
bang, channel, dip, dump, funnel, hammer, ladle, pound, push, rake, ram, scoop,
scrape, shake, shovel, siphon, spoon, squeeze, squish, squash, sweep, tuck, wad,
wedge, wipe, wring
4. Verbs of Putting with a Specified Direction
drop, hoist, lift, lower, raise
5. Pour Verbs
dribble, drip, pour, slop, slosh, spew, spill, spurt
6. Coil Verbs
coil, curl, loop, roll, spin, twirl, twist, whirl, wind
7. Spray/Load Verbs
brush, cram, crowd, cultivate, dab, daub, drape, drizzle, dust, hang, heap, inject,
jam, load, mound, pack, pile, plant, plaster, ?prick, pump, rub, scatter, seed,
settle, sew, shower, slather, smear, smudge, spatter, splash, splatter, spray, spread,
sprinkle, spritz, squirt, stack, stick, stock, strew, string, stuff, swab, ?vest, ?wash,
wrap
8. Fill Verbs
adorn, anoint, bandage, bathe, bestrew, bind, blanket, block, blot, bombard,
carpet, choke, cloak, clog, clutter, coat, contaminate, cover, dam, dapple, deck,
decorate, deluge, dirty, douse, dot, drench, edge, embellish, emblazon, encircle,
encrust, endow, enrich, entangle, face, festoon, fill, fleck, flood, frame, garland,
garnish, imbue, impregnate, infect, inlay, interlace, interlard, interleave,
intersperse, interweave, inundate, lard, lash, line, litter, mask, mottle, ornament,
pad, pave, plate, plug, pollute, replenish, repopulate, riddle, ring, ripple, robe,
saturate, season, shroud, smother, soak, soil, speckle, splotch, spot, staff, stain,
stipple, stop up, stud, suffuse, surround, swaddle, swathe, taint, tile, trim, veil,
vein, wreathe
9. Butter Verbs
asphalt, bait, blanket, blindfold, board, bread, brick, bridle, bronze, butter,
buttonhole, cap, carpet, caulk, chrome, cloak, cork, crown, diaper, drug, feather,
fence, flour, forest, frame, fuel, gag, garland, glove, graffiti, gravel, grease,
groove, halter, harness, heel, ink, label, leash, leaven, lipstick, mantle, mulch,
muzzle, nickel, oil, ornament, panel, paper, parquet, patch, pepper, perfume,
pitch, plank, plaster, poison, polish, pomade, poster, postmark, powder, putty,
robe, roof, rosin, rouge, rut, saddle, salt, salve, sand, seed, sequin, shawl, shingle,
shoe, shutter, silver, slate, slipcover, sod, sole, spice, stain, starch, stopper, stress,
string, stucco, sugar, sulphur, tag, tar, tarmac, tassel, thatch, ticket, tile, turf, veil,
veneer, wallpaper, water, wax, whitewash, wreathe, yoke, zipcode
10. Pocket Verbs
archive, bag, bank, beach, bed, bench, berth, billet, bin, bottle, box, cage, can,
case, cellar, cloister, coop, corral, crate, dock, drydock, file, fork, garage, ground,
hangar, house, jail, jar, jug, kennel, land, lodge, pasture, pen, pillory, pocket, pot,
sheathe, shelter, shelve, shoulder, skewer, snare, spindle, spit, spool, stable,
string, tin, trap, tree, warehouse
Esphoric reference and pseudo-definiteness
Peter Willemse
University of Leuven
Abstract
1. Introduction
It seems that the category of esphora, as defined by Martin (1992), covers several
phenomena that are still quite different in nature. It is a broad category within
which further subcategories can be distinguished. Most importantly, real
definite esphoric NPs should be distinguished from what Davidse (1999: 231) has
referred to as pseudo-definite NPs. An example of a truly definite esphoric type
of NP is an NP containing a restrictive relative clause functioning as a
postmodifier, for instance the man who spoke first at the meeting. Davidse (2000:
1112) gives a convincing argument for this, building on Langacker's (1991)
interpretation of the restrictive relative clause (RRC) as part of the type
specification evoked by the nominal. The RRC makes the type specification more
specific and thereby restricts it. Davidse points out that in NPs which contain a
definite determiner, a reference mass, viz. the set of all instances corresponding
to that particular type in the discourse context, is defined by the type
specification. For an NP containing a definite article and a singular count noun to
be unambiguous, it is necessary that only one contextually relevant instance
corresponds to the specified type (viz. the one referred to). When an NP contains
an RRC, the latter may narrow down the reference mass (by making the type
specification more specific) to only one contextually relevant instance, thus
making the NP truly definite and justifying the use of a definite determiner. In the
example quoted above, for instance, the reference mass defined by the type
specification man presumably contains several instances; the added RRC who
spoke first at the meeting narrows down this reference mass to just one instance
and in this way makes the NP definite.
The focus of the present paper is on the other type of esphoric NPs, viz.
the pseudo-definite ones. Consider the following example:
If one imagines that this sentence occurs in a context from which it is clear that
the slave in question has more than one son, the definite form of the attribute NP
is unexpected. The definite article seems to be motivated here by some
relationship within the NP in which it occurs.
In (3), the existential states how much instantiation of the type usherettes there is
in the contextually specified situation. A cardinal existential indicates the
cardinality of the instantiation of the type expressed by the type specification in
the Existent NP (Davidse 1999: 238). Typically, cardinal existentials express
cardinal quantification (hence the name), i.e. they measure the intrinsic
magnitude of the designated mass in terms of a quantitative scale (see Langacker
1991: 84). Importantly, moreover, the existent (postverbal) NP is typically
indefinite: the designated instance(s) is/are being introduced into the discourse
and are not presumed known to the hearer. An enumerative existential, on the
other hand, enumerates in ordinal fashion, with implied reference to a
contextually specified type, instances sharing a superordinate type which
corresponds to that contextual type (Davidse 1999: 240-241). Let us look at
example (4) to make this clearer. In this example, three Existent NPs (the
British title, the Commonwealth, the European) are ordinally enumerated. They
all share a superordinate type, viz. something like major athletics competition.
This general type is further specified and situated contextually as roughly major
competitions for which Carl Lewis considered returning to England. The
instances of this type, designated by the three Existent NPs, are then held up
for consideration one by one. It is also possible that only one instance
corresponding to a contextually defined type is mentioned in the enumerative
existential. An important distributional difference with the cardinal existential is
The corpus analyzed for this paper is an extraction from COBUILD's The Bank
of English, a corpus of 450 million words containing both spoken and written
material.2 A concordance was extracted on any third-person finite form of to be
(i.e. is, are, was, were) preceded by there and followed by a complex noun phrase
consisting of a definite noun phrase and another noun phrase, connected by the
preposition of. This query was considered to be the most practical way of tracing
many, if not most, of the cases of postverbal NPs in cardinal existentials having
pseudo-definite status and involving some form of esphoric retrieval. Indeed,
pseudo-definite reference requires a definite determiner with its first noun for its
apparent definiteness and an indefinite or zero determiner with its second
noun to realize its true indefinite status. These are necessary but not sufficient
conditions for pseudo-definiteness, as many NP complexes of this form are truly
definite. Naturally, this extraction covers only those pseudo-definite NPs
occurring in unmarked existentials. At least some additional subtypes which were
not attested in the existential corpus will be included in the overview for the sake
of completeness. Some of these additional subtypes were found in the additional
corpus containing 50 attributive clauses (which was consulted only for types that
did not occur in the existential corpus). For the existential corpus, in total 200
examples were extracted. This relatively low number of examples is due to the
marked nature of esphora as such, and to its highly marked occurrence in the
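The query described above can be approximated over plain text with a regular expression. The Python sketch below is an assumption-laden illustration, not the query actually run on The Bank of English: the pattern, the three-word limit on the first NP and the function name find_candidates are all hypothetical, and the contracted form there's is included only because it occurs in the corpus examples. As the text notes for the real query, matches are merely candidates, since many NP complexes of this form are truly definite and have to be sorted manually.

```python
import re

# Hypothetical approximation of the query described in the text:
# "there" + is/are/was/were (or the contraction 's) + "the" + one to three
# words + "of" + the first word of the second NP.
PATTERN = re.compile(
    r"\bthere(?:\s+(?:is|are|was|were)|'s)\s+"  # existential there + finite be
    r"(the(?:\s+[\w-]+){1,3}?)\s+of\s+"           # NP1: definite article + 1-3 words
    r"([\w-]+)",                                  # first word of NP2
    re.IGNORECASE,
)


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (NP1, first word of NP2) for every candidate match in text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


# Sample sentences taken from the examples quoted in this paper.
sample = (
    "At that moment there was the sound of a door opening. "
    "By several accounts, there is the possibility of an Iraqi attack. "
    "There's the making of a good team here. "
    "There was a limp wad of lettuce whose leaves glistened."
)

for np1, np2_start in find_candidates(sample):
    print(np1, "| of |", np2_start)
```

The last sample sentence is not matched because its first NP is indefinite, which mirrors the requirement that the first noun carry a definite determiner.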
2.1 Type/Subtype-constructions
The first category contains the complex Existent NPs in which the first NP
indicates a subtype of the second NP.
(5) Iona McLeish's vast concrete set, a wire mesh-gated compound within the
ravaged Troy, is a happening in itself. <p> And all around, and above,
there is the sort of action reminiscent of war movies: the clatter of
helicopter rotors, whistling jet airstrikes and, when the city burns, a fire so
realistic you could toast bread on it.
(6) They could still be our friends. But now they never will be. Now there is
the sort of hatred I've described, the sort of cruelty, savagery, barbarity.
to be pseudo-definite rather than truly definite (or truly indefinite, for that
matter): although on one level of interpretation, the identity of the referent is
recoverable (viz. the type of thing), on another level, an instance is still being
introduced into the discourse as a new entity, which is, consequently, not
presumed known to the hearer. We can briefly refer here to another type of
pseudo-definite NP which involves a similar reference mechanism, viz. NPs
containing specific types of postdeterminers. Postdeterminers are elements which
occupy the slot following the determiner slot in the NP and which fulfill a
function ancillary to the functions of (definite/indefinite) identification and/or
(absolute/relative) quantification realized primarily by the determiner (see
Davidse 2001 for a more comprehensive discussion of postdeterminers). Consider
the following examples:
(7) The Woody Allen-Mia Farrow breakup, and Woody's declaration of love
for one of Mia's adopted daughters, seems to have everyone's attention.
There are the usual sleazy reasons for that, of course - the visceral thrill of
seeing the extremely private couple's dirt in the street, etc. [San Francisco
Chronicle, 24/8/92; cited in Ward and Birner 1995: 732]
(8) There was the usual crop of letters to the Member of Parliament
concerned, and about once every twelve months a really abusive one from
the tortured victim himself.
Ward and Birner (1995: 732) describe this kind of postverbal NP as having dual
reference, both to a type and a token. They point out that, although the type has
hearer-old status, which justifies the use of the definite article, the token or
instance referred to is hearer-new, which accounts for the acceptability in
existential contexts. It is clear that the underlined NPs in examples (7) and (8)
introduce new entities into the discourse. At the same time, a definite determiner
is used to signal the identifiability, not of the instances, but of the type. In (7), for
instance, although the specific reasons for the public fascination with the break-
up are being introduced into the discourse and thus not presumed known to the
hearer, the speaker does assume that the hearer knows the type of reasons that are
usually the basis for such public attention. The postdeterminer fulfills a secondary
identifying function, in that it helps the hearer to make mental contact with the
right type: it provides a clue for the hearer to identify the correct type.
It should be clear that a very similar explanation holds for type-subtype
constructions, although in the latter type of construction the reference to the type
is given an explicit linguistic realization in the form of the head noun the sort of,
whereas the notion of type is not lexicalized in the case of a postdeterminer with
dual reference.
(9) In a room outside the court he talked with the French prosecuting counsel,
who showed him some of the evidence he was going to submit. There was
the shrunken head of a Polish boy whose crime had been that he had fallen
in love with a German girl. The head, mounted on a plaque like some
trophy of the hunt, had been found in a German official's house, used as an
ornament.
The use of the definite article in an example like this can be explained
straightforwardly in terms of an esphoric bridging relationship. Bridging is
defined by Martin (1992: 124) as a type of indirect reference in which the identity
of a part is recovered through an experiential connection which exists between
that part and another part of the same whole or between that part and the whole it
belongs to, or vice-versa. In the type of construction under discussion here the
bridging relationship is an esphoric one because it holds between the first and the
second NP within the same complex NP. NP2 introduces the entity Polish boy
into the discourse; it realizes presenting reference and consequently uses an
indefinite form (the indefinite article a). The first NP has presuming reference
and consequently uses a definite form (the definite article the); the identity of its
referent is recoverable by virtue of an experiential connection with the entity
introduced by the second NP: a head is part of (the body of) a boy.
Examples like these shed an interesting light on Martin's (1992) taxonomy
of retrieval types. Bridging and esphora are more intertwined than it would seem
at first sight. In fact, bridging is often the motivating factor for the use of a
definite form in a pseudo-definite, esphoric NP. In such cases, the esphoric nature
of the NP lies in the fact that the information necessary to identify its referent,
and thus to justify the use of a definite determiner, is to be found further on in the
same NP; the actual information is then retrievable through bridging from the
second NP.
In the majority of the cases attested in the corpus, the part-whole relation
was of a more abstract nature. A first group contains examples such as the
following:
(10) there was a limp wad of lettuce whose leaves glistened with a fine film of
oil; there was a clean piece of wood jutting out with a shining nail bent at
the end of it; there were several eggshells showing bits of yellow yolk;
there was the stump of a cigar bearing the marks of a man's teeth; and
there was a clump of fluffy dust freshly gathered from some floor
(11) It comes from a pleasant er beach in Cornwall I won't want to say exactly
where it is in case it affects the tourist potential of the beach but er on the
slope where that photograph was taken from there is the remains of an
<ZF1> old tin mine <ZF0> old tin mine <ZGY> and it so happens that
that particular tin mine had quite a lot of uranium in the ore <ZF1> as a
<ZF0> as a sort of by-product.
(12) He added: `We still don't know how many are required. I just wish there
were 20 games left. There's the making of a good team here. There is real
ability. In the first half we definitely suffered from tension after their goal.
But in the second half we looked good."
(13) In this sample, he uses only a few, such as the s for plural, although there
is the beginning of a negative construction in `no book" and the beginning
of a question form in `where going?" although he does not yet have the
auxiliary verb added to the question.
In (12), a process such as creating a team is implied; in (13), the process which
is implicitly activated is that of the child acquiring (linguistic constructions).
The use of the definite determiner in the first NP in these examples can be
explained in the same way as for examples (10) and (11): the first NP refers to a
phase of the process associated with, and thus implied by, the referent of the
second NP.
In a number of similar examples, finally, the process is explicitly realized
in the form of a gerundive or a nominalization:
(14) There is the birth of healing and that may be a silly thing to say but I think
if I may be allowed to develop the theme I think that there is a p <ZF1> a
<ZF0> a feeling of healing and time passing in nineteen-ninety-six that
isn't again just locked into
(15) And I think if Mary Matalin wants to go out and raise those questions, she
could be doing the president an enormous disservice because there--there's
the beginning of discussion of--of his side of this too.
(16) Next month the Network is bringing Dr Kenneth Kaunda, the former
president of Zambia, to Scotland (he is the son of a Church of Scotland
minister) as part of the crusade to fight the `cancer of debt" in the world's
poorest countries.
(17) The 7ft 1in centre, who has a home in West Hampstead, is the son of a
Nigerian diplomat but has lived in England since he was two.
In these cases, the definite article is again motivated by forward bridging and
again, a conceptual-associative link rather than a hyponymy or meronymy
relationship is the basis for the bridging: kinship nouns evoke in their conceptual
structure other people who fulfill certain roles in relation to the person they refer
to. For instance, a son is always someone's son; the concept son evokes in its
structure the concept of parents, or at least of one specific parent (mother or
father). The referent of the second NP corresponds to this concept.
1.3.1 Appositive
In by far the majority of my corpus examples, the relationship between NP1 and
NP2 is one of apposition, and more specifically of restrictive or close apposition.
Van Langendonck (1999: 113) points out that in close appositional structures,
there are two appositives which together form an intonational unit and cannot
always be interchanged. Several subtypes can be distinguished, according to the
sort of noun that functions as the head noun of NP1.
nouns of modality
(18) By several accounts, there is the possibility of an Iraqi attack, either by
missile or by bomb on the air base--the allied air base where the US forces
are in Saudi Arabia at Dhahran.
(19) He reckons the beans cause wind, and he feels there's a danger the eggs
can be underdone and there's the chance of sickness.
phenomenal nouns
(20) At that moment there was the sound of a door opening.
(21) The crowd surged as the musicians pounded and whined. There was the
scent of sweat, and the stench of arrack on hot breath.
other
(24) And for the two overall leaders after the semi-final stage of the
competition there is the bonus of a day out at the FA Cup final at
Wembley on May 11.
(25) Noosa is a great course it's fast and the climate is good. <p> And there's
the added motivation of a $25,000 car, so I'm giving it my best shot."
(26) Then the low whine of the vacuum cleaner came to his ears, and when it
stopped there was the musical flow of water in the bathroom.
(27) There was the smell of pot all over the apartment. [quoted in
Woisetschlaeger 1983: 142]
(28) Suddenly, from nowhere, there was the sound of a very fast Forbes saying
that all Americans need only pay a single income tax rate of less than 20
per cent. The middle classes should be given a big break.
(28) *There was a sound of a very fast Forbes saying that all Americans need
only pay a single income tax rate of less than 20 percent.
(29) The title tells it all, and there's the flavour of Whiskey Galore and The
Titfield Thunderbolt about the movie, which offers chuckles and beautiful
Welsh locations as a group of villagers insist on their hill" officially
recorded as a mountain the very first mountain in Wales.
(30) Every time there was a lull, every time there was the hint of an opportunity
for any Tory to giggle at her personally, in came the trolleys again rank
upon rank of them, as patients queued while wicked Conservatives `tore
the National Health Service limb from limb'.
As is clear from examples such as (29) and (30), lexical extension mechanisms
such as metaphor play a role in these cases: a phenomenal noun occurs as the
head noun of NP1, but a literal categorization of the referent of the second NP
in terms of this type of phenomenon is not intended.
1.3.2 Non-Appositive
The last type of pseudo-definite NPs occurring in my corpus includes the ones in
which a symbolic relation, and more specifically a relation of representing or
depicting, exists between the referents of NP1 and NP2.
(31) Turn your back on the rock and follow the coastal path the other side of
the church to the covered fontaine de St They. There is the statue of a saint
in one niche and, until a few years ago, the other contained a stone,
apparently also revered, showing that the old practices of Morgan's people
have not wholly faded away.
(32) There was the wedding picture of a young black couple among his papers.
[quoted in Woisetschlaeger 1983, example 15f]
The motivation for the use of the definite article in the first NP is, again, a
forward bridging relation between the two NPs in the NP complex. As remarked
earlier, a relationship of collocation or association is a
possible basis for bridging. In these cases, the concepts evoked by NP1 and NP2
are strongly associated with each other; we can therefore assume that one concept
evokes the other fairly automatically. Note that in this type of construction, the
second, and not the first noun, is the node of the collocation. However, these are
still cases of forward bridging, because the definite article is motivated by, and
thus points forward to, the second NP.
3. Conclusion
The category of esphoric reference as it has been defined and discussed in the
literature (Martin 1992) covers a number of constructions which are still quite
different in nature. First of all, truly definite NPs may involve esphora, in that
the information needed to identify the referent of the NP is present in the NP
itself, for instance in the form of a restrictive relative clause.
Besides these real definites, there are a number of pseudo-definite
types of esphoric NPs, which, although they show formal signs of definiteness,
realize indefinite reference to instances that are being introduced into the
discourse. This paper has zoomed in on these constructions and has studied them
in an environment which specifically excludes truly definite NPs, viz. the
postverbal position in the unmarked type of existential sentences. On the basis of
a corpus analysis, a classification of the different types of pseudo-definite NPs
was made. The construction types realized by the pseudo-definite NPs and the
semantic relation between NP1 and NP2 formed the basis of the classification. An
important question that was asked for each type concerned the motivation of the
use of the definite article in the first NP of the pseudo-definite NP complexes:
why does the first NP contain a definite determiner even though the unit it is part
of really realizes indefinite reference? It turns out that there are two basic
explanations for this.
The first possible motivation is what I have called, using a term from
Ward and Birner (1995), dual reference. The definite article can in that case be
explained by definite reference to a generic concept, to a known type. The NP
complex is, however, only pseudo-definite and can hence take the postverbal
position in an unmarked existential, because at the same time there is reference to
new instances, which are being introduced into the discourse. The reference to
new instances is a pragmatic inference which must be made due to the particular
nature of the grammatical environment of the unmarked existential, which does
not allow true definites.
The second explanation is a relation of what I have termed forward
bridging within the NP. The first NP in the NP complex takes a definite article
because its referent is identifiable by virtue of a bridging relationship to the
information supplied in the second NP. The definite article is thus esphorically
motivated, with bridging as the ultimate foundation for the esphoric relationship.
The basis for the bridging relation may in its turn be hyponymy or meronymy or,
alternatively, a collocational or associative link.
Notes
References
Jonathan Charteris-Black
University of Surrey
Abstract
This paper compares choice of metaphor in two political corpora: the Inaugural
speeches of American Presidents and party political manifestos of two British
political parties during 1974-1997. Initially metaphors are classified according
to their source domain; they are then analysed from a cognitive semantic
approach. The major findings are that metaphors from the domains of conflict,
journeys and building are common to both corpora. However, the British corpus
includes metaphors that draw on the source domain of plants whereas the
American corpus contains metaphors that draw on source domains such as fire
and light and the physical environment that do not occur in the British corpus.
These variations suggest differences in metaphors between British and American
political discourse and provide insight into cultural differences.
The cognitive analysis reveals the importance of the conceptual metaphors
POLITICS IS CONFLICT, PURPOSEFUL SOCIAL ACTIVITY IS TRAVELLING ALONG A PATH
TOWARD A DESTINATION and A WORTHWHILE ACTIVITY IS A BUILDING in both
corpora. However, SOCIAL PURIFICATION IS HEAT and A SOCIAL CONDITION IS A
WEATHER CONDITION occur only in the American corpus. There is some evidence
that British political discourse has borrowed metaphors based on the concept
POLITICS IS RELIGION from American political discourse.
1. Introduction
3. Research method
What are the similarities and differences between the metaphors employed in
American inaugural speeches and British election manifestos?
Two corpora, one American and one British, were used to assist in answering
this question. The first was a corpus comprising the 51 Inaugural addresses of
American Presidents, spanning approximately 200 years from George Washington
to Bill Clinton, and was 98,237 words in length. The second was a corpus
comprising the party political manifestos of the Labour and Conservative Parties
in the period 1945-1997 inclusive and was 132,775 words in length.1 For
convenience I will refer to the corpus of Inaugural speeches as the American
corpus and the corpus of political manifestos as the British corpus.
While diachronic variation was not the main focus of this study, it is
possible that some of the differences observed between the two corpora may be
partially attributed to the difference in the time period they cover. However, since
Inaugural speeches often refer intertextually to the speeches of earlier presidents
it was considered acceptable to treat them as a coherent and homogeneous body
of texts. In the same way, the structure of the British party manifestos has
remained relatively unchanged during the period covered and so this was also
considered a coherent body of texts. In both cases the function of language
combines communication of ideas with persuasion and this similarity of
communicative purpose makes them comparable genres. However, diachronic
factors could be taken into account in interpreting the findings and may indeed
form the central focus of future research in this area.
The methodology combines qualitative with quantitative approaches.
Initially, qualitative analysis of a sample of each corpus revealed a set of words
that have the potential to be used as metaphors. Identification of metaphor was
based on the definitions discussed above. The procedure was to analyse
potentially metaphorical linguistic forms in the two corpora to establish whether
on each occasion of use they should be classified as metaphor.
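The context-by-context check described above can be supported by a simple concordance listing. What follows is a minimal sketch in Python, not the procedure actually used in this study: the function name `kwic`, the `width` parameter and the two sample sentences are assumptions introduced purely for illustration, and the sample sentences are invented rather than taken from either corpus. Each hit is printed with its left and right context so that an analyst can decide, occurrence by occurrence, whether the form is metaphorical.

```python
import re

def kwic(text, pattern, width=40):
    """Return (left, match, right) context triples for every whole-word hit.

    `pattern` is a regular-expression fragment, e.g. r"steps?" to cover
    both the singular and the plural form of a candidate word.
    """
    regex = re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    hits = []
    for m in regex.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

# Invented sample text (hypothetical, not drawn from either corpus).
sample = (
    "We have taken the first step on the road to renewal. "
    "I speak to you today from the white steps of this building."
)

hits = kwic(sample, r"steps?")
for left, word, right in hits:
    # Right-align the left context so that the keyword forms a column.
    print(f"{left:>40} [{word}] {right}")
```

In this invented sample the first hit would be classified as a journey metaphor and the second as a literal use. The listing only brings the contexts together; the classification itself remains a matter of the analyst's judgement.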
For example, 'windfall' typically refers to apples blown down by the
wind; however, it is also found in New Labour discourse in expressions such as
'windfall tax'; this innovative use in a political context is the basis for identifying
all uses in this corpus as metaphor. Admittedly, for some speakers 'windfall' may
more commonly be used in the context of taxation than that of fruit or the weather;
however, there are invariably subjective issues of variation between language
users that influence metaphor identification. A further example from the
American corpus is that words such as 'path' and 'step' may be used as
metaphors that draw on the domain of journeys (see example 9); however, the
President sometimes refers to the white steps on which he is standing while
speaking. Evidently, such a use of 'step' refers literally to the steps of the White
House. It is necessary to examine each of the contexts of words and phrases that
4. Findings
I will first present an overview of the findings then address each part of the
research question. The findings as regards the resonance of metaphor source
domains are shown in Table 1.
From the bottom row we can see that more than twice as many types of
metaphor were identified in the American corpus as in the British
one. However, the larger British corpus contained many more tokens of metaphor,
indicating that in the British corpus there is a tendency for metaphors to repeat the
same linguistic forms. There are areas of similarity in metaphor domains between
the two corpora since conflict, journeys and buildings are the three most resonant
lexical fields in each corpus. These domains account for 66% of total resonance
in the American corpus and 89% of total resonance in the British corpus;
however, there are also variations in metaphor use.
Metaphor in British and American political discourse 137
First, we may notice that while conflict is the most common lexical field
for metaphor in both corpora, it is more resonant in the British than in the
American corpus. This may be explained by the combative discourse
function of the pre-election party political manifesto as compared with the
post-election inaugural speeches, where there is less need to combat a defeated
opposition party.
Another interesting distinction is that while journey metaphors are more
common in the American corpus, building metaphors are more common in the
British corpus, where they account for nearly a quarter of all metaphors. This
difference in resonance can be explained with reference to
different cultural experiences: new experiences arising from journeys are salient
for Americans, while the sense of security and solidity arising from buildings is
salient for the British.
A few lexical fields occurred in only one of the corpora; for example,
metaphors based on fire and light were only found in the American corpus. Fire
and light metaphors often convey idealism in the American corpus, while in the
British corpus religious metaphors such as vision are used for this purpose.
Another major difference between the two political corpora is that the
American corpus tends to employ physical environment metaphors in situations
where plant metaphors are employed in the British corpus (cf. example 10
below); in each case these two lexical fields constitute 9% of the total resonance.
This may have a cultural explanation in that gardening is a major pastime in
British society; the garden is a domain of private but external space. Conversely,
American cultural and historical experience draws on undomesticated space and
this is reflected in the use of words such as valley, horizon, jungle, mountain
Conflict metaphors
Metaphors from the lexical field of conflict were originally identified in relation
to spoken language in terms of debate and represented as ARGUMENT IS WAR (cf.
Lakoff and Johnson 1980). Conflict is the most common lexical field in both
corpora and provides evidence of a conceptual metaphor POLITICS IS CONFLICT. I
suggest that metaphors of conflict are chosen to emphasise the personal sacrifice
and physical struggle that is necessary to achieve social goals while subliminally
creating the opportunity for positive evaluation of actual military conflict. Table 2
shows some examples from the American corpus.
Nearly all the conflict metaphors have a very similar rhetorical pattern: in
pragmatic terms the choice of a conflict metaphor determines the nature of the
speaker's evaluation. The conflict is either for abstract social goals that are
positively evaluated, such as rights, freedom, faith etc., or against social
phenomena that are negatively evaluated, such as poverty, disease, injustice etc.;
these social ills are conceptualised as enemies. In addition, the stages through
which social progress is to be made are conceptualised in terms of the stages of a
military action: the trumpet that calls to action, attack, retreat, truce and eventual
victory or surrender. POLITICS IS CONFLICT implies an isomorphic relationship
between the domains of politics and war.
Similarly, in the British corpus both parties defend abstract social goals
that are positively evaluated by their own party but, perhaps because of the
combative function of the election manifesto, imply that such goals are under
threat from political opponents:
(1) We will defend the fundamental right of parents to spend their money on
their children's education should they wish to do so. (Conservative)
(2) While continuing to defend and respect the absolute right of individual
conscience (Labour)
Parties may also defend social institutions or groups in society that are positively
evaluated:
(3) Labour created the National Health Service and is determined to defend it.
(Labour)
However, while defence metaphors are used in similar ways by both the major
British parties, attack metaphors are used rather differently. The following
examples show that the Labour party fights against or attacks a general range of
social ills while the Conservative party defends social virtues:
(5) Economic success is not an end in itself. For the Labour party, prosperity
and fairness march hand in hand on the road to a better Britain. During the
next Parliament, we intend to continue our fight against all form of social
injustice. (Labour)
(6) We will fight against crime and violence which affects all Western
societies ... At the same time, we shall attack the social deprivation which
allows crime to flourish. (Labour)
(7) We have to compete to win. That means a constant fight to keep tight
control over public spending and enable Britain to remain the lowest taxed
major economy in Europe. It means a continuing fight to keep burdens off
business. (Conservative)
(8) We will continue to fight for free and fair trade in international
negotiations. (Conservative)
Journey metaphors
They came here--the exile and the stranger, brave but frightened--to
find a place where a man could be his own man. (7)
First, justice was the promise that all who made the journey would share in
the fruits of the land. (8)
Think of our world as it looks from the rocket that is heading toward Mars.
It is like a child's globe, hanging in space, the continents stuck to its side
like colored maps. We are all fellow passengers on a dot of earth. (16)
For this is what America is all about. It is the uncrossed desert and the
unclimbed ridge. It is the star that is not reached and the harvest sleeping
in the unplowed ground. Is our world gone? We say "Farewell." Is a new
world coming? (31)
To these trusted public servants and to my family and those close friends
of mine who have followed me down a long, winding road, (32)
The journey metaphors attempt to evoke the original historical experience of the
Pilgrim Fathers (7 and 8); the opening up of the American west (31) and the
space programme (4 and 16). These are integrated with the more general use of
journey metaphors to describe human relationships as implied by Lakoff and
Johnson's LIFE IS A JOURNEY metaphor, as in (32). We can see from the
distribution of these metaphors that they form a path through the text inviting the
listener to participate in a journey. In metaphorical terms the President is
represented as a guide; since only the guide knows the destination, the speech
provides a type of map towards it. Since journeys may be to unknown
destinations, the choice of this conceptual basis has the rhetorical goal of
persuading the American people to accept innovation and social change.
However, travel can be slow and arduous because of impediments to
movement and hence there will be barriers to overcome and burdens to bear.
Lakoff and Johnson (1999: 188) represent metaphoric use of words such as
burden as DIFFICULTIES ARE IMPEDIMENTS TO MOVEMENT. In a political context
these metaphors express the need for patience since it takes time and effort to
reach a destination. This is rhetorically effective because it implies that the
electorate should not expect instant results and that, at times, they may need to
suffer to achieve goals; it also implies that hardships are to be tolerated because
these goals are worthwhile. The extracts shown in the following examples all
share the notion of a burden whose weight should be endured (or ignored
altogether) because of the value placed on the destination:
(10) indeed all free men, remember that in the final choice a soldier's pack
is not so heavy a burden as a prisoner's chains. (Dwight Eisenhower)
(11) Let us accept that high responsibility not as a burden, but gladly--gladly
because the chance to build such a peace is the noblest (Richard Nixon)
Building metaphors
Metaphors from the source domain of building are typically evaluative, carry a
strong positive connotation and are employed to express aspiration towards
desired social goals such as peace, democracy and progress towards a better
future. They emphasise social cohesion, social purpose and control of one's
environment. These metaphors can be divided into two types: those that refer
to the parts of a building (foundations, threshold, doors, etc.) and
those that refer to types of building, such as house or bridge.
The most frequent part of a building that is used metaphorically is the
foundations. In such metaphors an abstract phenomenon is positively evaluated
which permits us to infer a conceptual representation A WORTHWHILE ACTIVITY IS
A BUILDING. Laying foundations is a conventional metaphor for a solid and
valuable policy although it may not in fact be taken through to completion. We
know that any building which is to be durable must first have foundations and
that these may take a long time to construct; however, we also know that the
laying of foundations does not necessarily imply the completion of a building. If
the money to buy materials or to pay builders runs out then the building will not
be built. So in reality it is very difficult to predict the extent to which laying
foundations will guarantee the successful completion of a construction.
Building metaphors make an interesting comparison with journey
metaphors. Building and travelling are conceptually related, as they are both
activities in which progress takes place in stages towards a predetermined goal.
Topographically, both involve increase in the surface that is covered; in the case
of journeys this is linear movement along a horizontal path whereas for buildings
there is three-dimensional increase along a vertical path. Both activities highlight
the need for patience since they require time and effort. Difficulties entail a need
to make sacrifices and not to expect instant outcomes. Since we think of
achieving goals as inherently good, in pragmatic terms, both journey and building
metaphors imply a positive evaluation of political policy. They require a plan or
map and an architect or guide, and it may be that this conceptual proximity accounts
for their resonance in the corpus.
I will now address the second part of the research question by considering
metaphors that only occurred in one of the corpora.
The analysis suggests that light and fire metaphors are particular to American
political discourse. The lexical field of light has traditionally been linked with the
target domain of understanding and metaphors that draw on it are motivated by a
conceptual metaphor KNOWING IS SEEING (cf. Lakoff and Johnson 1999: 53-54).
However, for this political data I suggest a conceptual metaphor HOPE IS LIGHT
that invariably implies a positive evaluation. It is likely that spiritual notions will
be evoked because of the importance of hope in religious discourse. As we can
see from Table 3, light is contrasted with darkness that is associated with
ignorance, failure to understand and evil.
Light is always positive because of its polarity with darkness. In other
circumstances fire metaphors can also be used for positive evaluation. This is
because George Washington first used the fire metaphor in an inaugural address
and the metaphorical link between fire and liberty has become a source of
intertextual reference in presidential addresses as we can see from the following
examples:
(12) since the preservation of the sacred fire of liberty and the destiny of the
republican model of government (George Washington)
(13) The preservation of the sacred fire of liberty and the destiny of the
republican model of government are justly (Franklin D. Roosevelt)
(14) He would extinguish the fire of liberty, which warms and animates the
hearts of happy millions (James Polk)
Fire is represented as the guarantor of liberty. This may be because it implies that
some form of burning or destruction will be necessary: this is in keeping with
America's revolutionary wars, struggle for independence and Civil War. In these
metaphors conflict can be represented as a means to peace. Consider the
following examples:
(15) And it is imperative that we should stand together. We are being forged
into a new unity amidst the fires that now blaze throughout the world. In
their ardent heat we shall, in God's Providence, let us hope, be purged of
faction and division (Woodrow Wilson)
(16) Mill fires were lighted at the funeral pile of slavery. (Benjamin Harrison)
metaphor. For example, it seems that when words such as 'kindled' or 'flames'
are used metaphorically to convey notions of anger, it is the speed and rate of
burning that are important rather than heat. In this corpus heat is a positive rather
than a negative attribute of fire because it is associated, in a scientific sense, with
the notion of purification (as when impure metals are converted to pure ones by
the application of heat). Similarly, fire metaphors are also positive when they
highlight the capacity of fire to produce light, as in metaphorical uses of 'beacon'.
Whether a President conveys a positive or negative evaluation thus depends on
which aspect of the source domain is highlighted. Such malleability
makes fire a useful and potent cognitive domain as it can combine different
aspects of our knowledge of an element to convey an evaluation that is
appropriate to a specific discourse context. Similarly, light and darkness provide
prototype poles for creating contrasts between spiritual or moral notions of
goodness and evil. Light and fire metaphors therefore share both a cognitive and
pragmatic role in American political discourse.
(18) To build a responsible society which protects the weak but also allows the
family and the individual to flourish. (Conservative)
In these cases 'flourish' identifies those social entities that are highly valued. In
some cases these are the same for both parties, for example 'families', but in
others they are specific to parties, for example 'business' is claimed to 'flourish'
under the Conservatives and 'democracy' is claimed to 'flourish' under Labour.
There is also evidence of effective use of plant metaphors for the purpose
of political persuasion. Let us consider the use of the term 'windfall'. As in the
Labour Party 1997 manifesto, it is always used in a nominal compound form,
'windfall levy'. The use of this metaphor is important in that it conceals agency:
it is not clear that this is in fact a tax imposed by the government of the day. The
Bank of English corpus shows that the other familiar collocations of this word are
'windfall tax', 'cash windfall' and 'windfall profits'. Here public revenue is
conceptualised as being obtained without any effort because it is through the
natural process of the wind blowing. There is no victim and no effort involved in
obtaining a social benefit. This is an example of a creative use of metaphor that
(20) More realistic attitudes to profit and investment take root. (Conservative)
I decided to combine two sub-domains that are both related to the physical
environment; these are weather metaphors and metaphors for natural
geographical features. Such metaphors may appeal particularly to that significant
minority of the North American population that inhabits rural and semi-rural
areas such as the vast Midwest.
Weather metaphors are a conventional source domain for conveying
abstract notions of change and associated ideas; they have been related in the
cognitive linguistic literature to a conceptual key CIRCUMSTANCES ARE WEATHER
(e.g. Grady et al. 1997: 109). For example, our knowledge that wind brings about
a change in the weather provides a useful metaphorical representation of cause
and effect.
(21) Thus across all the globe there harshly blow the winds of change. (Dwight
Eisenhower)
(22) in the shadows of the Cold War assumes new responsibilities in a world
warmed by the sunshine of freedom but threatened still by ancient hatreds
and new plagues. (Bill Clinton)
(23) Together let us explore the stars, conquer the deserts, eradicate disease, tap
the ocean depths, (John F. Kennedy)
(24) Vitality has been preserved. Courage and confidence have been restored.
Mental and moral horizons have been extended. (Franklin D. Roosevelt)
(25) A spring reborn in the world's oldest democracy, that brings forth the
vision and courage to reinvent America (3)
Though we march to the music of our time, our mission is timeless. (5)
We must bring to our task today the vision and will of those who came
before us. (16)
Our democracy must be not only the envy of the world but the engine of
our own renewal. (19)
The brave Americans serving our nation today in the Persian Gulf, in
Somalia, and wherever else they stand are testament to our resolve. (35)
An idea ennobled by the faith that our nation can summon from its myriad
diversity the deepest measure of unity. (40)
And so, my fellow Americans, at the edge of the 21st century, let us begin
with energy and hope, with faith and discipline, and let us work until our
work is done. The scripture says, "And let us not be weary in well-doing,
for in due season, we shall reap, if we faint not." (41)
Clearly, the references to 'vision', 'faith', 'mission' etc. form a cohesive chain
that prepares the way for the strongly religious theme of this coda. This is a
further example of how metaphor can be used systematically to create coherence
in a political text.
I propose that the New Labour Party in Britain has borrowed from
American political discourse to introduce the lexical field of religion into British
political metaphors. It is no secret that Tony Blair had a close social relationship
with Bill Clinton as well as sharing a similar political allegiance to social
democracy. Example (26) shows some typical uses of vision metaphors in the
1997 New Labour manifesto:
(26) But a Government can only ask these efforts from the men and women of
this country if they can confidently see a vision of a fair and just society.
(New Labour)
The vision is one of national renewal, a country with drive, purpose and
energy. A Britain equipped (New Labour)
Our vision for Britain is founded on these values. Guided by them, we will
make our country more (New Labour)
An independent and creative voluntary sector, committed to voluntary
activity as an expression of citizenship, is central to our vision of a
stakeholder society (New Labour)
10. Conclusion
Notes
References
Black, M. (1962), Models and metaphors. Ithaca, N.Y.: Cornell University Press.
Charteris-Black, J. (2000), Metaphor and vocabulary teaching in ESP
economics, English for Specific Purposes 19: 149-165.
Charteris-Black, J. and T. Ennis (2001), A comparative study of metaphor in
English and Spanish financial reporting, English for Specific Purposes
20: 249-266.
Charteris-Black, J. and A. Musolff (2003), Battered hero or innocent victim? A
comparative study of metaphors for euro trading in British and German
financial reporting, English for Specific Purposes 22: 153-176.
Charteris-Black, J. (2004), Corpus Approaches to Critical Metaphor Analysis.
Basingstoke: Palgrave-MacMillan.
Gibbs, R.W. (1994), The Poetics of the mind: figurative thought, language and
understanding. Cambridge: Cambridge University Press.
Goatly, A. (1997), The Language of metaphors. London & New York: Routledge.
Grady, J.E., T. Oakley and S. Coulson (1997), Blending and metaphor, in: R.W.
Gibbs and G.J. Steen (eds), Metaphor in cognitive linguistics. Amsterdam
& Philadelphia: Benjamins, 101-124.
Halliday, M.A.K. (1985), An introduction to functional grammar. 2nd ed.
London: Edward Arnold.
Jansen, S.C. and D. Sabo (1994), The sport/war metaphor: hegemonic
masculinity, the Persian Gulf war, and the New World order, Sociology of
Sport Journal 11: 1-17.
Lakoff, G. (1991), The Metaphor System used to justify war in the Gulf,
Journal of Urban and Cultural Studies, 2(1): 59-72.
Lakoff, G. and M. Johnson (1980), Metaphors we live by. Chicago: University of
Chicago Press.
Lakoff, G. and M. Johnson (1999), Philosophy in the flesh: the embodied mind
and its challenge to Western thought. New York: Basic Books.
Lakoff, G. and M. Turner (1989), More than cool reason: a field guide to poetic
metaphor. Chicago: University of Chicago Press.
Ortony, A. (1979), Metaphor and thought. Cambridge: Cambridge University
Press.
Richards, I.A. (1936), The philosophy of rhetoric. New York and London: Oxford
University Press.
Signalling spokenness in personal advertisements on the Web:
The case of ESL countries in South East Asia
Abstract
The continuing impact of the World Wide Web (or the Web) on everyday life
focuses our attention on the ways in which the notions of speech community,
culture and language are patterned in this mega corpus of all time. This paper
investigates how people in South East Asia, in particular Brunei, the
Philippines, Malaysia and Singapore, use English in personal advertisements on
the Web. The study is part of a Web corpus project investigating related questions
in computer-mediated communication (see Herring 1996). The corpus is
currently being built and is derived entirely from the Web.
In ESL (English as a Second Language) nations, or outer circle (Kachru
1992) countries, English is often relegated to the position of a neutral and
transactional (as opposed to interactional) language where affect (emotion)
is played down and less developed in the private and personal (as opposed to
public) domains. We might assume English used for informal purposes to be less
developed. Yet, Web gurus recommend the use of spoken, as opposed to written,
norms when writing for the Web. This paper then focuses on how this tension is
resolved.
Using a combination of a pen-and-paper and corpus-based approach (see
Ooi 2001), we specifically focus on the use of appraisal, attested by Eggins and
Slade (1997) to characterise spoken language. Specifically, we examine a range
of amplification items. We compare the frequencies of the items found in our
personal advertisement sub-corpus and selected written and spoken portions of
the Singapore component of the International Corpus of English (ICE-SIN) and
attempt to account for the patterns discovered.
The results suggest that although South East Asian netspeak is aligned to
spoken language, this alignment is partial.
1. Preamble
This paper arose out of a project at the National University of Singapore whose
main aim was to investigate E-English (Netspeak, English in cyberspace or
computer-mediated communication).1 We have, at the moment, collected the bulk
of the data, which run into 3.6 million words. The question that we asked was
152 Peter Tan, Vincent Ooi and Andy Chiang
whether we could discern a sense of a speech community through examining the
kind of English used (see Herring 1996).
With the help of a research assistant, we targeted data associated with four South
East Asian nations: Singapore, Malaysia (West and East), the Philippines and
Brunei (see Figure 1).
One of the things that distinguish these nations from others in the region,
like Indonesia, Thailand or the Indo-Chinese nations, is that these nations have
undergone the colonial experience under English-speaking colonial powers
Britain and the US. These nations have therefore had a longer history of having
employed the English language and, it might be surmised, a higher likelihood of
having indigenised forms of English.
At the moment, the corpus consists of four sections: (a) news, (b)
electronic discussion groups, (c) personal advertisements and (d) electronic chat.
There has been much discussion about Netspeak and a very popular assumption is
that Netspeak is much closer to spoken language than written language. Many
might very easily be led to believe this particularly as Web gurus and style guides
(e.g. Hale and Scanlon 1999) push for more spoken styles. There has been a lot of
sociological, but hardly any linguistic, investigation into the nature of Netspeak.
Crystal (2001) gives a very useful coverage of the issue and his conclusion is that
We can also therefore see this in terms of languages associated with the public
and private spheres. This can be regarded as being analogous to the position of
Latin in medieval Europe. Like Latin, English is the language employed largely
only in writing and in situations where one is on one's best behaviour.
Yet there are parts of the Web that deal with situations where affect is
important and which veer more towards the private sphere such as personal
advertisements. We might expect a variety more associated with spoken English
to be employed here.
Secondly, where more informal or colloquial versions of English exist,
these tend to be more divergent from the informal varieties of the inner circle.
This stands to reason: standardisation arose out of the need to minimise variation
(see, for example, Bex and Watts 1999; Milroy and Milroy 1999; Crowley 1989),
and therefore standard, written Englishes tend to be more similar to each other
than informal, spoken Englishes.
Social commentators have already noted that genres and text types are not rigid
and unchanging. For example, Fairclough (1994) and others have commented on
the conversationalisation of public discourse like advertising, so that print
advertisements can take on features associated with spoken conversation.
Conversationalisation is, for him, in part to do with shifting boundaries between
written and spoken discourse practices (Fairclough 1994: 260). This observation
is of course not entirely new. Indeed, years earlier, Leech had already commented
on the use of the public-colloquial style for advertising (Leech 1966: 75). And
we are also aware that language associated with computers is fluid and tolerances
are being tested (Bruthiaux 2001).
The question is whether this could be said of personal advertisements on
the Web as well.
2. The focus
3. Methodology
[Figure: part of the appraisal system. Appraisal includes Appreciation (of text/process), subdivided into reaction, composition and valuation, and Affect (emotion), subdivided into (un)happiness, (in)security and (dis)satisfaction.]
We select from Eggins and Slade's (1997) appraisal system the sub-system of
amplification. Within that, we select a range of what we will call augmenters and
mitigators. Elsewhere in the literature other labels are used. Carter and McCarthy
(1995) talk about intensifiers and hedges in relation to the Cambridge and
Nottingham Corpus of Discourse in English (CANCODE). Biber (1986a, 1988),
and Conrad and Biber (2001) subdivide each of our categories into two.
Unmarked augmenters are called amplifiers (completely, greatly), whereas
informal ones are labelled emphatics (for sure, a lot). Similarly, unmarked
mitigators are termed downtoners (almost, merely), whereas informal ones are
christened hedges (more or less, sort of). Emphatics and hedges seem to occur
together and are dominant in informal conversation (Biber 1986b). Even with all
four categories taken together, the total mean frequencies for written genres are
distinct from those for conversational genres: 7.7 for academic prose and 10.9 for
romantic fiction, as opposed to 21.8 for face-to-face conversation and 21.0 for
telephone conversation (Biber 1988: 255, 260, 264, 265). The use of augmenters
can also be a feature of Opinionated (as opposed to Objective) style (Biber
1986a: 18), one dimension distinguishing spoken and written texts.
The augmenters that we have chosen to examine are: very, a lot, really,
too, ever, incredibly and lah. These items were selected partly on the basis of
Eggins and Slade's examples, and partly with the aim of combining intuitively
common items like very and less common items like incredibly. The inclusion of
lah needs further explanation. We follow Gupta (1992) in her analysis of lah as a
pragmatic particle with an assertive function.2 (Pragmatic particles in her analysis
can serve one of three functions: contradictory, assertive or tentative.) Lah as a
pragmatic particle is available in the informal varieties of English in Singapore,
Malaysia and Brunei and is potentially employable by the majority of the
advertisers.
The mitigators that we will examine are: only, just, a bit and somewhat.
To ensure that the selected items represented a combination of more
frequent and less frequent items, we subjected the sub-corpus to a CLAWS
(Constituent Likelihood Automatic Word-Tagging System) tagging and examined
the rank of the adverbs. Multi-word items could not be examined (which meant
the exclusion of a lot and a bit); and of course lah, being a regionalism, could not
be included; incredibly was also not included. The result showed items from a
range of rankings, including a number of high ranking ones (see Table 1). We
were satisfied with our selection of high- and low-ranking augmenters and
mitigators.
Table 1. Ranking of adverbs using CLAWS tagging

Item        Rank
just           1
very           3
really         9
only          16
too           21
ever          26
somewhat     151

Tables 5 and 7 (and the accompanying Figures 3 and 4) show the number of
occurrences and the normalised frequencies of the augmenters and mitigators
selected. We use the abbreviations PA, SP and WR for the personal
advertisements sub-corpus, the selected spoken sub-corpus from ICE-SIN and the
selected written sub-corpus from ICE-SIN respectively. Normalised figures
represent the number of tokens per 10,000 words.
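
The normalisation just described can be sketched in a few lines. This is only an illustration: the sub-corpus sizes are not stated in this section, so the 110,000-word figure below is an assumed size, chosen because it is roughly what the PA figures in Table 5 imply. The function names are hypothetical.

```python
def per_10k(tokens: int, corpus_words: int) -> float:
    """Normalised frequency: tokens per 10,000 running words."""
    return tokens / corpus_words * 10_000


def implied_corpus_words(tokens: int, normalised: float) -> float:
    """Back-compute an approximate corpus size from a token count and its
    normalised frequency. Rounding in the published figure makes this a
    rough estimate only."""
    return tokens / normalised * 10_000


# ASSUMPTION: 110,000 words is an illustrative size, not a figure from the paper.
ASSUMED_PA_WORDS = 110_000

# 'very' has 327 tokens in PA (Table 5).
very_pa = per_10k(327, ASSUMED_PA_WORDS)

# Going the other way: 327 tokens at 29.7 per 10,000 words.
pa_size_estimate = implied_corpus_words(327, 29.7)

print(round(very_pa, 1), round(pa_size_estimate))
```

Normalising in this way is what allows sub-corpora of different sizes to be compared directly.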
Table 5. Augmenters

                   PA                   SP                   WR
             tokens  normalised   tokens  normalised   tokens  normalised
incredibly        1         0.1        0         0.0        2         0.1
lah               2         0.2    1,677        77.2        0         0.0
ever             37         3.4       34         1.6       22         0.8
a lot            56         5.1      331        15.2       18         0.6
really          167        15.1      495        22.8       32         1.2
too             176        16.0      334        15.4      113         4.1
very            327        29.7    1,087        50.1      270         9.7
Total           766        69.6    3,958       182.3      457        16.5

[Figure 3. Normalised frequencies of the augmenters (incredibly, lah, ever, a lot, really, too, very) in PA, SP and WR]

All the items were also checked for statistical significance using chi-square (χ²,
p < 0.05); each item in each sub-corpus was checked against the same item for
each of the other sub-corpora. The results for the augmenters are found in Table
6.
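
A pairwise check of this kind can be sketched as a 2 x 2 Pearson chi-square test (item vs. all other words, in two sub-corpora). The section does not state which variant of the test was used, nor the sub-corpus sizes, so both are assumptions here: the sizes below are rounded estimates back-computed from Table 5 (tokens divided by normalised frequency), and no continuity correction is applied. The critical value 3.841 is the standard threshold for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square for one item in two corpora (no continuity correction)."""
    observed = [
        [hits_a, size_a - hits_a],  # corpus A: item, all other words
        [hits_b, size_b - hits_b],  # corpus B: item, all other words
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [hits_a + hits_b, total - (hits_a + hits_b)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


CRITICAL_P05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom

# ASSUMPTION: estimated sizes, not figures reported in the paper.
PA_WORDS = 110_000
SP_WORDS = 217_000

too_pa_sp = chi_square_2x2(176, PA_WORDS, 334, SP_WORDS)
very_pa_sp = chi_square_2x2(327, PA_WORDS, 1087, SP_WORDS)

print("too :", round(too_pa_sp, 2), too_pa_sp > CRITICAL_P05_DF1)
print("very:", round(very_pa_sp, 2), very_pa_sp > CRITICAL_P05_DF1)
```

Under these assumed sizes, the PA/SP difference for too falls below the critical value while that for very lies well above it.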
We should probably disregard items where all three sub-corpora register very low
normalised frequencies (say, below 5 per 10,000 words): this would push out
incredibly and ever. All the other augmenters have a significantly higher
frequency in SP than in WR. The most dramatic case is lah, where SP had a
normalised frequency of over 77 and WR had a frequency of 0. Thus far, then, we
can say that the frequencies of the augmenters provide a fairly reliable index to
the spokenness or writtenness of a text and therefore confirm the tendencies seen
in Biber (1988).3
The differences between the PA frequencies and the SP and WR frequencies are
also statistically significant, with one exception: for too, the difference
between the PA and SP scores is not statistically significant. In most cases,
PA frequencies fall between those of SP and WR. In fact, in the case of too (and
of ever, which we discarded from consideration earlier), the PA frequency
exceeded that of SP. By and large the figures confirm the expectation: that PA
tends towards SP norms but not quite reaching them, in most cases. However,
the situation is not always that clear-cut. We could arguably say that the PA
frequency for a lot tends towards WR norms. However, the outstanding item is
lah, which only occurred twice, once in the Malaysian portion of the sub-corpus
and once in the Bruneian portion. There is certainly strong resistance to the use of
lah in personal advertisements in the region.
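
One rough way to see where each PA figure sits relative to the other two sub-corpora is to express it as a position on a scale running from the WR frequency (0.0) to the SP frequency (1.0). This index is not used in the paper; it is an illustrative device, and the function name is hypothetical. The figures are the normalised frequencies from Table 5, leaving out incredibly and ever, which were set aside above.

```python
# Normalised frequencies (per 10,000 words) copied from Table 5.
AUGMENTERS = {
    "lah":    {"PA": 0.2,  "SP": 77.2, "WR": 0.0},
    "a lot":  {"PA": 5.1,  "SP": 15.2, "WR": 0.6},
    "really": {"PA": 15.1, "SP": 22.8, "WR": 1.2},
    "too":    {"PA": 16.0, "SP": 15.4, "WR": 4.1},
    "very":   {"PA": 29.7, "SP": 50.1, "WR": 9.7},
}


def position(freqs: dict) -> float:
    """Position of PA between WR (0.0) and SP (1.0); values above 1.0
    mean the PA frequency exceeds the SP frequency."""
    return (freqs["PA"] - freqs["WR"]) / (freqs["SP"] - freqs["WR"])


for item, freqs in AUGMENTERS.items():
    print(f"{item:7s} {position(freqs):.2f}")
```

On this scale lah sits almost at the WR end, a lot lies nearer WR than SP, and too is the only item that overshoots SP.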
Table 7. Mitigators

                   PA                   SP                   WR
             tokens  normalised   tokens  normalised   tokens  normalised
somewhat          5         0.5        6         0.3       16         0.6
a bit            45         4.1      132         6.1       10         0.4
only            151        13.7      369        17.0      478        17.2
just            398        36.1    1,094        50.4      167         6.0
Total           599        54.3    1,601        73.7      671        24.2

[Figure 4. Normalised frequencies of the mitigators (somewhat, a bit, only, just) in PA, SP and WR]

Tables 7 and 8, together with Figure 4, show the figures for mitigators. What is
interesting is that mitigators do not seem to display as robust a difference between
the sub-corpora as the augmenters. Figure 4 shows three curves more or less
keeping step with each other until we reach just. The distributions of somewhat
and only in the sub-corpora are not always significantly different. Of those that
are (a bit and just), the pattern seen in the augmenters is replicated. The PA
frequencies lie between those of SP and WR, and closer to SP than to WR.
This might suggest that some mitigators are more important than others in
showing up the differences between the three modes. Interestingly, the
importance of a mitigator is not dependent on whether it is a high- or low-
frequency item: only, which is more frequent than a bit and less frequent than
just, is not important as a distinguisher between the three modes.
How then can we account for the less dramatic difference in the case of the
mitigators? We could perhaps hypothesise that it might be more important to
mitigate (as opposed to augment) in written texts. This also makes sense in the
light of the distinction between the Opinionated and Objective style (Biber
1986a): augmenters contribute to the Opinionated style but mitigators do not. The
difference here is also understandable if we consider that in a stereotypical
written genre, academic writing, it is important not to over-generalise and to delimit
one's conclusions.
6. Conclusion
So, to what extent are the resources of spoken discourse relied on in PA? On the
basis of the augmenters and mitigators selected, we could say that personal
advertisers tend to make use of features of spokenness. The analysis shows,
corroborating Crystal, that the language of PA represents 'written language which
has been pulled some way in the direction of speech' (Crystal 2001: 47). Given
the focus on the interactional function of language, it is not surprising that
advertisers try to take on board these features. This is the case despite the fact that
the sub-corpus is from outer circle countries where English tends to be used for
more transactional functions.
However, the situation is not entirely cut and dried. There is very strong
resistance to the employment of the pragmatic particle lah in personal
advertisements. It is not entirely clear to us why this should be the case, although
it is not impossible that the notion of a borderless cyberspace might discourage
advertisers from employing items like lah that point towards the local or suggest
an insular or parochial outlook. A non-local spoken model might also be
preferred if advertisers are open to responses from non-local sojourners in the
region. It would therefore be premature to say at this stage that Netspeak in South
East Asia is closely associated with the norms of spoken language although it
seems to be an important contributor to the norms associated with personal
advertisements.
We obviously need to examine other parts of the corpus, e.g. the chat data,
where localisation does not seem to be such a taboo.
Notes
References
Baron, N. (2000), Alphabet to email: How written English evolved and where it's
heading. London: Routledge.
Besemeres, M. and A. Wierzbicka (forthcoming), Pragmatics and cognition: the
meaning of the particle lah in Singapore English, Journal of
Pragmatics and cognition 11(1): 1-36.
Bex, T. and R. J. Watts (eds) (1999), Standard English: the widening debate.
London: Routledge.
Biber, D. (1986a), On the investigation of spoken/written differences, Studia
Linguistica 40(1): 1-21.
Biber, D. (1986b), Spoken and written textual dimensions in English: resolving
the contradictory findings, Language 62(2): 384-414.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge
University Press.
Biber, D. (2001), On the complexity of discourse complexity: a
multi-dimensional analysis, in: S. Conrad and D. Biber (eds), Variation in
English: multi-dimensional studies. London: Longman. 215-240.
Brown, G. and G. Yule (1983), Discourse analysis. Cambridge: Cambridge
University Press.
Bruthiaux, P. (2001), Missing in action: verbal metaphor for information
technology, English Today 67 (Vol. 17 No. 3): 24-30.
Carter, R.A. and M.J. McCarthy (1995), Grammar and the spoken language,
Applied Linguistics 16(2): 141-158.
Collot, M. and N. Belmore (1996), Electronic language: a new variety of
English, in: S. Herring (ed.), 13-28.
Conrad, S. and D. Biber (2001), Multi-dimensional methodology and the
dimensions of register variation in English, in: S. Conrad and D. Biber
(eds), Variation in English: multi-dimensional studies. London: Longman.
13-42.
Crowley, T. (1989), Standard English and the politics of language. London:
Palgrave Macmillan.
Crystal, D. (2001), Language and the Internet. Cambridge: Cambridge University
Press.
Eggins, S. and D. Slade (1997), Analysing casual conversation. London: Cassell.
Fairclough, N. (1994), Conversationalisation of public discourse and the
authority of the consumer, in: R. Keat, N. Whiteley and N. Abercrombie
(eds), The authority of the consumer. London: Routledge. 253-268.
Gupta, A.F. (1992), The pragmatic particles of Singapore colloquial English,
Journal of Pragmatics 18: 31-57.
Hale, C. and J. Scanlon (1999), Wired style: principles of English usage in the
digital age. New York: Broadway Books.
Herring, S. (ed.) (1996), Computer-mediated communication: linguistic, social
and cross-cultural perspectives. Amsterdam: John Benjamins.
Hudson-Ettle, D. and J. Schmied (1999), Manual to accompany the East African
component of the International Corpus of English: Background
information, coding conventions and list of source texts. Chemnitz:
Department of English, Chemnitz University of Technology.
Hunston, S. and G. Thompson (2001), Evaluation in text: authorial stance and
the construction of discourse. Oxford: Oxford University Press.
Kachru, B.B. (ed.) (1992), The other tongue: English across cultures, 2nd edn.
Urbana: University of Illinois Press.
Kadir, M.A. (2000), Love @ cyberspace: A corpus-based study of personal ads
on the Web. Unpublished MA dissertation, National University of
Singapore.
Labov, W. and J. Waletzky (1967), Narrative analysis: Oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts.
Seattle: University of Washington Press. 12-44.
Leech, G.N. (1966), English in advertising: a linguistic study of advertising in
Great Britain. London: Longman.
Milroy, J. and L. Milroy (1999), Authority in language: investigating standard
English, 3rd edn. London: Routledge.
Ooi, V.B.Y. (2001), Investigating and teaching genres on the World Wide Web,
in: M. Ghadessy, A. Henry and R. L. Roseberry (eds), Small corpus
studies and ELT: theory and practice (Studies in Corpus Linguistics, Vol.
5). Amsterdam: John Benjamins. 175-204.
Yates, S.J. (1996), Oral and written linguistic aspects of computer conferencing,
in: S. Herring (ed.), 29-46.
Item Frequency
I 3,192
and 2,272
to 1,937
a 1,706
the 1,142
am 820
you 811
in 796
me 770
my 769
Textual colligation: a special kind of lexical priming
Michael Hoey
University of Liverpool
Abstract
Corpus linguistics has not attended much to text-linguistic issues. This paper
argues that lexical choice has a major effect on features such as cohesion, Theme
choice and paragraph division and that corpus investigation can shed light on the
nature of the lexical choices made. It is argued that some lexis has a bias towards
(or against) certain textual functions and that this is an inherent property of such
lexis. It is also argued that lexical choices interlock, creating what I term
colligational prosody.
1. Introduction
Corpus linguistics has for perfectly understandable reasons focused most of its
attention upon lexical and grammatical matters. Although there are a few text-
linguistic and spoken discourse studies that make use of corpus linguistic
techniques (e.g. Hoey 1997; Partington 2003; Partington and Morley 2002), they
are few and far between and are rooted in no explicitly articulated theory. This
paper attempts to redress this lack by articulating a theoretical relationship
between lexis and text-linguistics.
The paper is divided into three uneven parts. In the first, I focus on current
perceptions about the organisation and nature of written discourse, identifying the
features that might be open to investigation in corpora. In the second, I contrast
two theoretically opposite positions, one of which I take to be essentially
incompatible with corpus investigation and the other of which is amenable to
such investigation. Unsurprisingly I shall favour the latter! In the third and much
the longest part, I want to present sample results from a corpus-linguistic
investigation of textual questions. These are hesitantly presented and are intended
only as hints as to how corpus linguistics might proceed. Despite the hesitancy
with which my findings are presented and the weak evidential base on which they
rest, they point to a theoretical suggestion that, if accepted, would place lexis and
text linguistics on a very different footing vis-à-vis each other.
Text is linearly developed. By this I mean that each sentence builds upon
what has gone before, from the speaker or writer's point of view; from the
listener or reader's point of view, each sentence that is reached prospects
the sentence or sentences to follow. There are two aspects to this feature of
text. In the first place, the speaker/writer is seeking to meet the
listener/reader's expectations and the listener/reader has expectations on
the basis of what the speaker/writer has already said. This point, though
widely agreed, is differently articulated, depending on whose position one
considers. Amongst linguists in whose work one finds some aspect of this
position are Sinclair (1993), Halliday and Hasan (1985), Winter (1971,
1979, 1982), Crombie (1985), Graustein and Thiele (1979, 1987),
Beekman (1970), Beekman and Callow (1974), Bolivar (2001) and Tadros
(1985, 1993). In the second place, this covers the much-described
phenomenon of Theme-Rheme, both as described in the Prague School
(Firbas 1966, 1986; Daneš 1974) and within the Systemic-Functional
tradition (most notably, Halliday 1994).
Thiele 1979, 1987; Mann and Thompson 1986, 1988; van Dijk and
Kintsch 1978). Some posit a patterning or structuring of some kind or
other without assuming that the chunking thereby created accounts for all
texts or all of any particular text (e.g. Labov 1972; Labov and Waletzky
1967; Longacre 1968, 1979, 1983; Hoey 1979, 1983, 2001; Swales 1981,
1990; Halliday and Hasan 1985; Martin 1992). There is a great deal of
commonality amongst these positions but not much actual agreement.
Still, some kind of chunking is acknowledged to exist and is reflected in
our cultural habit of writing books with chapters, sections (in the case of
academic texts) and paragraphs.
There seem to be two possible ways of modelling the relationship between lexis
and the features of text I have just outlined. The first is that the relationships
found in text, whether they are interactive, linear, cohesive, hierarchical or
structural, are independent of the lexis of the language. According to this view,
each sentence is constructed according to the grammar, collocations and
colligations of the language in response to textual needs but without constraints
broader than those particular needs. In other words, each text imposes its own
demands and has its own unique sentence requirements. If this view is correct,
corpus linguistics cannot offer anything useful for text-linguistics except in so far
as it might offer tools for exploring individual spoken or written texts.
The other way of modelling the relationship between lexis and text is to
see textual relationships (interactive, linear, cohesive, hierarchical and structural)
as dependent upon and created by the lexis of the language in a manner not
exhausted by the demands of the individual text. According to this view, each
sentence of every text is constructed along lines that have been laid down by all
the texts that the speaker/writer has encountered in the course of his or her life,
such that the production of a text is in fact in part a reproduction of previous
texts along strictly controlled lines. If this view is correct, corpus linguistics is the
key to the future of text-linguistics.
The first view is presumably the default position. The second is however, I
hope you will agree, a much more interesting position. You'll reply that reality
hasn't the least obligation to be interesting. And I'll answer you that reality may
avoid that obligation but that hypotheses may not.2
What I want to claim in this paper is that every lexical item is primed for
use in textual organisation. The notion of priming is taken from psychology and
in this context means that our encounters with a word accustom us to expect it to
be used in certain kinds of ways to such an extent that these potential uses
become part of our knowledge of the word and to some extent constrain the way
we are likely to use the word ourselves.
More specifically I want to make the following claims:
1. Every lexical item (or combination of lexical items) may have a positive
or negative preference for participating in cohesive chains.
2. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of Theme in a Theme-Rheme
relation.
3. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of a specific type of semantic
relation, e.g. contrast, time sequence, exemplification.
4. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring at the beginning or end of an
independently recognised chunk of text, e.g. the paragraph.
5. If a lexical item (or combination of lexical items) has any of the above
preferences, it may only or especially be operative in texts of a particular
type or genre or designed for a particular community of users, e.g.
academic papers.
The positive and negative preferences of a lexical item with regard to the textual
features just described are what I would term its textual colligations. The claims
of course allow for the possibility of a lexical item having not only a positive or
negative preference but also a neutral preference for each of these features. If,
however, the great majority of lexical items in a language were to prove to be
neutral with regard to one of the features, then the specific claim would be
disconfirmed with regard to the feature in question, and if that were to prove true
of all the features, then the more general claim would fall also.
If on the other hand the claims were to prove correct, we could envisage a
description of the language from a text-linguistic point of view including a giant
matrix of all the words of the language looking something like that in Table 1. I
will flesh out this matrix with some real examples near the end of this paper. For
the moment, though, we need to examine the evidence for the claims made above.
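
As a purely illustrative sketch of what such a matrix, and the disconfirmation condition stated above, might look like as a data structure: the items and their values below are invented placeholders, not findings, and the 0.9 cut-off for "the great majority" is an assumption, since no threshold is specified in the text. All names are hypothetical.

```python
from enum import Enum


class Preference(Enum):
    POSITIVE = "+"
    NEGATIVE = "-"
    NEUTRAL = "0"


# One column per textual feature named in claims 1-4.
FEATURES = ("cohesive_chain", "theme", "semantic_relation", "chunk_boundary")

P, N, O = Preference.POSITIVE, Preference.NEGATIVE, Preference.NEUTRAL

# PLACEHOLDER DATA: invented rows for illustration only.
toy_matrix = {
    "item_a": dict(zip(FEATURES, (P, O, N, O))),
    "item_b": dict(zip(FEATURES, (N, P, P, O))),
    "item_c": dict(zip(FEATURES, (O, P, O, O))),
    "item_d": dict(zip(FEATURES, (P, N, P, O))),
}


def neutral_share(matrix: dict, feature: str) -> float:
    """Proportion of lexical items that are neutral for one feature."""
    values = [row[feature] for row in matrix.values()]
    return sum(v is Preference.NEUTRAL for v in values) / len(values)


def claim_disconfirmed(matrix: dict, feature: str, great_majority: float = 0.9) -> bool:
    """True if (almost) all items are neutral for the feature.
    ASSUMPTION: the 0.9 threshold is arbitrary."""
    return neutral_share(matrix, feature) >= great_majority


for feature in FEATURES:
    print(feature, neutral_share(toy_matrix, feature), claim_disconfirmed(toy_matrix, feature))
```

In the toy data only the chunk_boundary column is neutral throughout, so only the corresponding specific claim would count as disconfirmed.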
3.1 Claim 1: Every lexical item (or combination of lexical items) may have
a positive or negative preference for participating in cohesive chains
The first claim was that every lexical item may have a positive or negative
preference for participating in cohesive chains, where a cohesive chain is a set of
at least three lexical (and grammatical) items that either co-refer to a single entity
(identity chains) or cross-connect because of their similarity of meaning
(similarity chains) (Hasan 1984; Hasan in Halliday and Hasan 1985; Parsons
1995). I shall give least attention to this claim, partly for reasons of space and
partly because I have presented detailed evidence elsewhere (Hoey, forthcoming).
Here I will simply note that all the following lexical items occur in my corpus as
members of cohesive chains occurring in a number of different texts:3
army, baby, Blair, gay, lake, music, pit, planet, political, spleen
Preliminary investigations suggest that all the following lexical items show no
tendency in my corpus to occur in cohesive chains, despite their all being fairly
frequent words in my corpus. It is of course much harder to establish that
something does not occur and it is a painfully slow process to move from each
concordance line into the original text to check for possible cohesion, so the
following list must be regarded as provisional:
Orthography is significant in this matter. The lexical item crossroads did not
form cohesive chains; the lexical item Crossroads, on the other hand, which
refers to a defunct British soap, chains freely.
The following words are neutral in my corpus with regard to cohesion;
they do not occur in chains often and when they do, the chains tend to be short,
e.g. The first reason, The second reason:
These lists may not seem particularly surprising and it is tempting to account for
their inclusion in other ways. My point here is that a lack of cohesive potential is
as much a quality of the word surprising as the fact that it is evaluative (as in the
previous sentence) and therefore unlikely to be a topic. Notice, too, that lexical
items that do not have a preference for appearing in cohesive chains are every bit
as common in the English language as those that do appear in cohesive chains
(and in many cases more so), despite the fact that one might have predicted that
infrequency would make it less likely that a word would participate in chains.
It is possible to express claims about the cohesive potential more subtly
than I have done here. The chains of particular words may favour repetitions, co-
hyponyms or pro-forms, for example; see Hoey (forthcoming b) for examples and
details. A crude check on the chaining potential of repetition-favouring words can
be obtained by examining the distribution plot that WordSmith (Scott 1999)
calculates for a word.
3.2 Claim 2: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of Theme in a
Theme-Rheme relation
The second claim, that every lexical item may have a positive or negative
preference for occurring as part of Theme in a Theme-Rheme relation, like all the
claims, requires more detailed support than I am able to give it here. The
following statistics suggest however that the claim is not meritless. 250 instances
of years were examined, and it was found that 37% occurred as part of Theme.
This is slightly higher than would be expected on the basis of random distribution
across Theme and Rheme, though it is hardly a striking result; what is more
interesting is that the great majority of the instances of Thematic years occur as
part of a fronted Adjunct rather than as part of Subject.4 In other words, when
years is thematised, it is usually marked.
Another example is the distribution across Theme and Rheme of instances
of consequence. 1615 of these were analysed (excluding instances of the rarer
importance sense), and it was found that the word consequence occurs in Theme
43% of the time, a considerably higher percentage than would occur on a random
distribution. Again the Adjunct use seems significant. Almost half of the
occurrences of thematised consequence occurred as part of an Adjunct. The word
sixty occurs in Theme 75% of the time, on the basis of a sample of 294 instances.
Again, orthography is relevant: 60 shows no such tendency.
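
To give a sense of how much weight samples of these sizes can bear, the sketch below attaches an approximate 95% interval to each reported proportion. This calculation is not part of the analysis reported here: it uses the normal approximation to the binomial and assumes, for illustration, that each sample behaves like a random sample of the word's occurrences.

```python
import math


def proportion_ci(p: float, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a proportion
    (normal approximation to the binomial)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Proportion in Theme and sample size, as reported in the text.
SAMPLES = {
    "years": (0.37, 250),
    "consequence": (0.43, 1615),
    "sixty": (0.75, 294),
}

for word, (p, n) in SAMPLES.items():
    low, high = proportion_ci(p, n)
    print(f"{word:12s} {p:.0%} of {n}: roughly {low:.1%} to {high:.1%}")
```

The interval for years spans about six percentage points either side of 37%, against about two and a half either side for consequence, which is consistent with treating the years result cautiously.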
The claim I have just made can be made more subtly and more complexly.
Some words, I would argue, have a tendency to appear in marked Theme (e.g.
years and consequence, as we have just seen); others have a propensity for
appearing as unmarked Theme, i.e. as Subject. Furthermore, it is possible to
combine this feature with the previous one. All the cases I have given of
Thematic preference have either negative or neutral cohesive preference. This
means that sixty, consequence and years are not going to participate in Thematic
progression (Daneš 1974). There may prove to be a correlation between Marked
Theme and a negative preference for cohesive chains; this would need
investigating.
If a lexical item has a positive preference for both Theme and cohesive
chains, it will inevitably have a positive preference for Thematic Progression;
again, there may be a correlation between a preference for unmarked Theme and
a preference for appearing in cohesive chains, though my claim is not of course
dependent upon such a correlation. It follows that as before one might be subtler
and expect some lexical items to be primed for participation in Simple Thematic
Progression or Linear Thematic Progression, etc.
3.3 Claim 3: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of a specific
type of semantic relation
The third claim was that every lexical item (or combination of lexical items) may
have a positive or negative preference for occurring as part of a specific type of
semantic relation, e.g. contrast, time sequence, exemplification. Such relations
may be the relations between clauses or parts of clauses or between larger chunks
of text; they may also reflect relations between speaker and listener, for example
indicating the relation between a speaker's or writer's utterance and a listener's or
reader's utterance. I give here just two examples of what this claim is intended to
cover: ago and reason.
An example of a lexical item associated with a semantic relation is ago.
More specifically ago has an association with contrast when it is part of Theme.
Of 65 Thematised instances of ago examined, 23 (35%) were followed by a
contrast and 5 (8%) preceded by a contrast. If 10 instances of not long ago and as
long ago as are removed from the calculation, the percentage associated with
contrast rises to 51%. These are small figures and not too much can be claimed of
them. But informal examination of larger quantities of data suggests that they are
not misleading. When ago is not part of Theme, there is still an association with
contrast but the manifestations are somewhat different (see Hoey forthcoming a).
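The percentages quoted for Thematised ago can be recomputed from the raw counts given above. The following is only a minimal sketch of that arithmetic; the variable and function names are illustrative, and it assumes that none of the 10 excluded instances (not long ago, as long ago as) was itself associated with a contrast, which the text implies but does not state.

```python
# Counts reported in the text for Thematised instances of "ago".
total = 65
followed_by_contrast = 23
preceded_by_contrast = 5
excluded = 10  # "not long ago" / "as long ago as"


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


with_contrast = followed_by_contrast + preceded_by_contrast

print("followed by contrast:", pct(followed_by_contrast, total), "%")
print("preceded by contrast:", pct(preceded_by_contrast, total), "%")
print("any contrast:", pct(with_contrast, total), "%")

# Assumption: the excluded instances carried no contrast, so only the
# denominator shrinks when they are removed.
print("after exclusion:", pct(with_contrast, total - excluded), "%")
```

On these assumptions the last figure comes out at 51%, matching the percentage reported in the text.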
The word reason may seem a rather obvious choice of word to illustrate
the association of a lexical item with a particular semantic relation, and of course
reason is intimately associated with the reason-result relation and other similar
relations, but it is not this association that I wish to draw attention to here.
Consider Table 2, which shows the distribution of the different structures of
postmodification of reason. Five postmodifying options are considered: reason
postmodified by a Ø-clause (as in the reason he can continue to do that), reason
postmodified by a that-clause (as in the reason that this lobbying has had little
effect), reason postmodified by a prepositional phrase headed by for (as in part of
the reason for this), reason postmodified by a why-clause (as in another reason
why pop shows are getting better), and reason postmodified by to + V (as in any
reason to celebrate).
The table shows how these options distribute across three of the functions
available in the clause (Subject, Complement and Object) and indicates whether
they are associated with an affirmation or a denial. Affirmation occurs when a
reason is asserted. Denial occurs when a reason is declared to be of no
importance, invalid or not known. Examples of reasons affirmed are:
The council's neglect was the reason the flats were falling apart.
The reason that this lobbying has had little impact is that the industry has
failed to construct a convincing case.
The negatives in the latter sentence are of course not denials of the reason but
denials about lobbying and the making of convincing cases.
Examples of reasons denied are:
The ratio of positive to negative clauses in general English is 9:1 (Halliday and
James 1993). Where therefore the ratio of affirmation/denial for a particular
syntactic choice is significantly skewed and where the frequency is high as a
proportion of the total number of cases considered, it has been highlighted in the
table. Where it is unclear whether it is the reason or some other aspect of the
clause that is being denied or any other problem of allocation to the categories
arises, the higher figure that results from inclusion of such cases is included in
brackets; it will be noticed that all such cases occur in the Subject options.
Of 7238 instances examined (excluding the doubtful cases), 4747 are
affirming the reason, 2491 denying it, a ratio of close to 2:1. This points strongly
to reason being associated with denial; that means that when reason is used, it
has a good chance of being part of a pre-emptive move by the writer/speaker to
say that s/he does not want to (or cannot) answer the reader/listener's expected
question 'Why?' or that any counter-arguments that might be offered to his/her
position, whether by the reader/listener or by a third party, cannot be supported
with evidence. Either way, the word provides evidence of association with
affirmation-denial (Winter 1979, Williames 1985), a pivotal feature of
writer/reader and speaker/listener relationships.
Looked at more closely, the table allows us to state such an association
more precisely. In the first place, notice that the Subject function is strongly
associated with affirmation (1895:66, a whopping 29:1 ratio of affirmation to
denial), whereas Complement is associated with denial (1740:1608, close to a
50-50 ratio). (Object has a less marked association with denial.) So if you want to
affirm your reason, put it in the Subject. If you plan to reject it or say that it is
irrelevant or unknown, use the Complement (or Object). This is a useful example
of a complex textual colligation where it is the operation of reason in a particular
grammatical function that has a particular textual implication.
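The ratios discussed above can be set against the 9:1 baseline for positive and negative clauses. The sketch below simply recomputes them from the counts quoted in the text; the names are illustrative, and treating the 9:1 polarity figure as the expected affirmation share is an assumption that follows the reasoning of the paragraph rather than a separately established baseline.

```python
# Affirmation and denial counts for "reason", as quoted in the text.
counts = {
    "all functions": (4747, 2491),
    "Subject": (1895, 66),
    "Complement": (1740, 1608),
}

# Assumption: the 9:1 positive/negative ratio (Halliday and James 1993)
# serves as the expected share of affirmation.
BASELINE_SHARE = 9 / (9 + 1)


def summarise(affirm, deny):
    """Return (affirmation:denial ratio, affirmation share of all cases)."""
    return affirm / deny, affirm / (affirm + deny)


for label, (affirm, deny) in counts.items():
    ratio, share = summarise(affirm, deny)
    direction = "above" if share > BASELINE_SHARE else "below"
    print(f"{label}: {ratio:.1f}:1, affirmation share {share:.0%} "
          f"({direction} the {BASELINE_SHARE:.0%} baseline)")
```

Read this way, only the Subject function exceeds the baseline share of affirmation; the overall figure and the Complement figure both fall well below it, which is the sense in which reason is said to be associated with denial.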
Looked at another way, the different postmodifying structures with which
reason appears also distribute themselves differently between affirmation and
denial. So reason + Ø-clause is associated with affirmation (in a ratio to denial of
15/1); the only structure to come near this weighting towards affirmation is the
relatively infrequent reason + that-clause, with a ratio of 10/1 in favour of
affirmation. On the other hand, reason + why-clause is associated with denial,
there being an absolute majority of cases of denial in all three grammatical
functions. This is another instance of a complex textual colligation, where the
colligation of reason with one or other kind of clause as postmodifier is the
condition that has to be met for a textual colligation to be observable.
Two instances on their own prove nothing, but it is hoped that the two
examples I have given at least elucidate what is meant by the third claim and
suggest it may be worth further investigation.
3.4 Claim 4: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring at the beginning or end
of an independently recognised chunk of text.
The fourth claim was that lexical items may have a positive or negative
preference for occurring at the beginning or end of an independently recognised
chunk of text, e.g. the paragraph. The problem with verifying this claim is of
course that it is not easy to find independently recognised chunks of text that
have validity.
In writing, of course, there is rough chunking associated with
paragraphing, but it is not difficult to demonstrate that paragraphs have no
internal structure (except in so far as several generations of Freshman English
students in the United States have been taught to write paragraphs with a
particular arbitrarily imposed structure and this structure is becoming a self-
fulfilling prophecy). I have argued elsewhere (Hoey 1985) that paragraphs are a
device used by writers to signal to readers how parts of the text relate to each
other. They are not therefore an ideal starting-point for demonstrating the validity
of my fourth claim. Nevertheless, in the absence of other chunking devices, it is
possible to use paragraph boundaries, section divisions and of course the
beginnings of texts to test the claim.
The hypothesis here is that certain Thematised words or phrases have a
preference for occurring at the beginning or end of a paragraph, or a preference
for avoiding such positions. So a particular sentence-initial word might have a
preference for being paragraph-initial. There is no assumption here that the
sentence-initial word has to have a colligational preference for being sentence-
initial in order to be eligible for consideration with regard to the paragraph-initial
claim. It is perfectly feasible that a particular word or phrase might have no
special preference for being sentence-initial or indeed even have preference for
being in a non-sentence-initial position and still have, when in sentence-initial
position, a preference for being paragraph-initial.
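One way such a hypothesis might be operationalised is sketched below. This is not the procedure used in the study; it is a minimal illustration, assuming a corpus already split into paragraphs and sentences, of how the share of sentence-initial occurrences that are also paragraph-initial could be compared with a random baseline. The function name, the toy data and the choice of baseline (paragraphs divided by sentences) are all assumptions made for the example.

```python
def paragraph_initial_preference(paragraphs, word):
    """Compare observed and expected paragraph-initial rates for a word.

    paragraphs: list of paragraphs, each a list of sentence strings.
    Returns (observed, expected) or None if the word never begins a sentence.
    observed = share of sentence-initial uses that also begin a paragraph.
    expected = share of all sentences that begin a paragraph.
    """
    sentence_initial = 0
    paragraph_initial = 0
    total_sentences = 0
    paragraph_count = 0
    for paragraph in paragraphs:
        if paragraph:
            paragraph_count += 1
        for position, sentence in enumerate(paragraph):
            total_sentences += 1
            tokens = sentence.split()
            # Crude tokenisation: first whitespace-separated token, minus punctuation.
            if tokens and tokens[0].strip(".,;:").lower() == word.lower():
                sentence_initial += 1
                if position == 0:
                    paragraph_initial += 1
    if sentence_initial == 0:
        return None
    return paragraph_initial / sentence_initial, paragraph_count / total_sentences


# Invented toy data, for illustration only.
toy = [
    ["Lee was calm.", "He waited."],
    ["He advanced.", "Lee replied.", "It rained."],
]
print(paragraph_initial_preference(toy, "Lee"))
```

An observed rate well above the expected one would point to a positive preference for paragraph-initial position, and one well below it to a negative preference. The sketch does not discount one-sentence paragraphs, which a fuller treatment would need to handle.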
In Hoey (1997) I report an experiment carried out to discover whether
there is any relationship between paragraphing and the predilection of certain
lexical items for appearing in paragraph-initial position. The experiment took the
form of asking 67 students to paragraph a short passage from a history textbook
that had been previously deparagraphed.5 The passage in question was the
following:
The students were not told how many breaks to make; this was left to their
discretion. The number of breaks varied from one to eight, with slightly under
half of the students making three breaks. The choices of paragraph break made by
all informants are given in Table 3, choices being represented in terms of the lines
in which the sentences begin.
Some sentences were clearly seen as strong candidates for beginning a
paragraph (e.g. the sentence beginning on line 13) while others were not chosen
by any informant (e.g. the sentence beginning on line 24). Equally clearly there
was no unanimity as to where to break, with no paragraph boundary finding
universal approval. This undermines any claim that might be made for paragraphs
having a structural status, unless of course my students are thought to have been
deficient in this respect (and since they were not for the most part deficient at the
level of the clause or the group, that would itself be of interest). Instead it points
to there being some non-structural explanation. One such explanation lies in the
textual relations in the passage, and this explanation I have explored fully
elsewhere (Hoey 1985, 1997). However, a more interesting (and not
incompatible) explanation is, as mentioned above, that the students were being
cued by the lexical items that begin the sentences; in other words, it is possible
that the students were choosing to paragraph in one place rather than another
because of the way the sentences began.
Table 3: The distribution of paragraph break choices across the range of possible
break points

Line on which            Number of informants beginning   % of informants
sentence starts          a paragraph at this point        making the choice
2 (He..)                 0                                -
4 (His..)                11                               17%
7 (He..)                 22                               33%
9 (He..)                 0                                -
13 (Lee..)               62                               94%
16 (Lee..)               7                                11%
20 (As..)                32                               49%
23 (In..)                32                               49%
24 (It..)                0                                -
25 (Had..)               2                                3%
27 (Had..)               1                                2%
29 (Fundamentally..)     42                               64%
30 (Lee..)               0                                -
32a (The..)              13                               20%
32b (It..)               5                                8%

With this hypothesis in mind, I set about examining the lexical items that
began each of the sentences that were candidates for beginning a paragraph.
When I undertook that work, my corpus was considerably smaller than it now is, and
the numbers looked at were not large: typically between 40 and 100 instances,
and in a couple of cases fewer than this. What I was concerned to do, however,
was focus my hypothesis, not prove by weight of numbers the paragraph priming
of certain words or phrases.
The results of my analysis were provisionally supportive of the hypothesis.
To begin with, my analysis suggested that exactly 50% of single surnames (like
Grant and Lee) in sentence-initial position are also paragraph initial; there are of
course four places where a surname is the first word of a sentence (lines 1, 13, 16,
30) (and a further four where it appears within Theme: lines 20, 25, 27, 29).6
There did not however appear to be any tendency for single surnames to be
sentence-initial; rather the opposite. So we have the hypothesis of a negative
priming for surnames in Theme but a positive priming for Thematised surnames
in paragraph-initial position.
The exact opposite appeared for he, which begins three sentences in the
passage (lines 2, 7, 9). The evidence pointed strongly towards he having a strong
priming for Theme. Of 100 instances consulted 30 were sentence-initial, and this
of course discounts instances of he occurring within Theme but not in first position
in the sentence. On the other hand there was no tendency for sentence-initial he to
be also paragraph-initial. In fact he occurred in paragraph-initial position in my
corpus two and a half times less often than would have been expected on the basis
lexical item with my current corpus of 100 million words, and with a consequent
786 instances of fundamentally, there were still only 20 instances in sentence-
initial position for investigation. Clearly the word does not like to begin
sentences. On the basis of my original data, I came to the conclusion that
fundamentally had a positive colligational preference for paragraph-initial
position, with 50% more likelihood of beginning a paragraph than was explicable
in terms of random distribution. With the still thin but better data of the later
corpus, I come to the same conclusion. Of the 20 sentence-initial cases, six begin
paragraphs and 13 do not; the final instance begins a one-sentence paragraph, and
this is discounted in the analysis. The average length of the paragraphs is 5
sentences (though this is distorted upwards by one particularly long paragraph).
Again, fundamentally turns out to begin paragraphs 50% more often than one
might expect.7
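The comparison with random distribution can be made explicit using the figures just given. The sketch below is one reading of that calculation, not a reconstruction of the original one: it assumes that the random baseline is one paragraph-initial sentence per average paragraph (1 in 5), and that the one-sentence paragraph is simply dropped from both counts. The variable names are illustrative.

```python
# Figures reported in the text for sentence-initial "fundamentally".
sentence_initial = 20
one_sentence_paragraphs = 1    # discounted in the analysis
paragraph_initial = 6
average_paragraph_length = 5   # sentences per paragraph

considered = sentence_initial - one_sentence_paragraphs

# Observed share of sentence-initial cases that also begin a paragraph.
observed = paragraph_initial / considered

# Assumption: under random distribution, one sentence in every
# average-length paragraph is paragraph-initial.
expected = 1 / average_paragraph_length

lift = observed / expected
print(f"observed {observed:.0%}, expected {expected:.0%}, ratio {lift:.2f}")
```

On this reading the observed rate is somewhat more than one and a half times the expected rate, which is in the region of the figure reported; since the average paragraph length is said to be distorted upwards by one long paragraph, the true baseline may be higher and the excess correspondingly smaller.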
On the basis of this analysis, it was possible to correlate the positive and
negative priming for paragraph-initial position with the decisions that the students
had made.8 The correlation can be seen in Table 4. I have starred those results
which seem anomalous. It will be seen that for the most part there is a good
match between actual student choice of paragraph boundary and predicted
boundary breaks on the basis of corpus evidence. Where there is a discrepancy,
there are good reasons for it. In terms of the structure of the passage, the sentence
starting at line 4 represents a deviation from the smooth parallelism of the
comparison. Those paragraphing at line 4, despite the negative colligation of his
for paragraph initiation, were doing so to mark this deviation. The rather smaller
number who broke at line 7 were breaking, again in defiance of the negative
colligation of he, in order to mark a return to the parallelism. Those breaking, on
the other hand, at line 16 were doing so with no text-structural grounds for their
decision; the only thing going for such a break is the positive colligation of names
with paragraph initiation. The failure to break at line 30 is less interesting; the
fact that the passage is coming to an end at this juncture would have been a
deterrent to some informants, irrespective of the merits of a potential break at this
point.
To test whether these claims were correct and to discover whether the
colligations were having any effect on the judgements of students, I then doctored
the original text slightly as follows; changes are indicated in bold:
The changes were designed to test whether students had been influenced by the
textual colligations. I hypothesised that the change to line 4 would reinforce the
structural pressure to break at this point and that there would be a consequent
increase in the popularity of this sentence as a paragraph boundary. I likewise
The final claim will be given little attention, but it is an important one. This is
that all of the above claims should be regarded as domain-specific. In other
words, a word is not primed for textual use in all contexts but only under certain
conditions. Thus, for example, ago may, as I have argued here, be associated with
contrast in newspaper text and it might be that it has a similar association in academic
articles; a corpus of fictional narrative on the other hand would be unlikely to
throw up this textual colligation. It might well be that both news articles and
fictional narratives favour the word appearing in text-initial position (certainly I
have evidence for this in news texts), but it is extremely doubtful whether there is
any such association in advertisements. And so on. Textual colligation claims
must be tied to particular genres, text-types, domains, communities of users
(defined temporally as well as in terms of employment and place) and the like. It
is probably this property of domain specificity that has led to textual colligation
being overlooked in corpus studies hitherto, in that large general corpora are ideal
for picking up patterns across a wide range of domains but have to be used
carefully to pick up features true of only certain kinds of text.
4. Colligational prosody
It will be observed that the individual words that make up the phrase sixty years
ago today share a number of properties. Thus sixty, ago and today share the
property of having a negative preference for cohesive chains, and sixty, years
and ago share a preference for being Thematised. We can label this colligational
prosody and its presence is both explained by and helps to explain the power of
collocation.
Table 7 represents the colligational prosodies of our chosen phrase.
5. Conclusions
I have been arguing in this paper for a new perspective on text-linguistics, one
that is rooted in the lexical item, not, as previously, a perspective that sees lexis as
a network in the text contributing to its cohesion or as contributing to the
signalling of the text organisation (though this perspective is not rendered
obsolete), but a perspective that makes no distinction between the description of
the text and the description of its component lexis. I have tried to show that the
properties of text can all be tackled through the concept of textual colligation.
More specifically I have argued that lexis is primed for textual use, such
that the choice of a lexical item is simultaneously the choice of its primings. Any
lexical item is primed positively or negatively with respect to cohesion, semantic
relations in the text, Theme and textual divisions. This is not to say that the
choice of a lexical item compels certain textual developments but it certainly
makes those developments more likely. The case of paragraphing in particular
suggests that some thorny text-linguistic problems might be amenable to solution,
or at least clarification, if a lexical perspective is adopted. Just as importantly, the
work I have reported here, if supported in subsequent investigations, suggests that
a corpus-centred account of the lexical item that stops at the phrase may be
unnecessarily limited. A full account of the word may be some way off yet.
Notes
References
Labov, W. (1972), Language of the inner city: Studies in the Black English
vernacular. Philadelphia, Pa: University of Pennsylvania Press.
Labov, W. and J. Waletzky (1967), Narrative analysis: oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts,
Seattle: University of Washington Press. 25-42.
Longacre, R.E. (1968), Discourse, paragraph and sentence structure in selected
Philippine languages. S.I.L. Publications in Linguistics & Related Fields,
No 21, Vols. 1 & 2. Dallas, Texas: Summer Institute of Linguistics
Publications.
Longacre, R.E. (1979), The paragraph as a grammatical unit, in: T. Givón (ed.),
Discourse and syntax, New York: Academic Press. 115-34.
Longacre, R.E. (1983), The grammar of discourse. New York: Plenum Press.
Mann, W.C. and S.A. Thompson (1986), Relational processes in discourse,
Discourse Processes 9.1: 57-90.
Mann, W.C. and S.A. Thompson (1987), Rhetorical Structure Theory: a theory of
text organization. Marina del Rey, Ca: Information Sciences
Institute/University of Southern California.
Martin, J.R. (1992), English text: system and structure. Amsterdam: John
Benjamins.
Morgan, J.L. and M.B. Sellner (1980), Discourse and linguistic theory, in: R.J.
Spiro (ed.), Theoretical issues in reading comprehension: Perspectives
from cognitive psychology, linguistics, artificial intelligence and
education, Hillsdale, N.J.: Lawrence Erlbaum Associates. 165-200.
Nystrand, M. (1986), The structure of written communication: Studies in
reciprocity between writers and readers. Orlando: Academic Press.
Nystrand, M. (1989), A social interactive model of writing, Written
Communication 6.1: 66-85.
Parsons, G. (1995), Measuring cohesion in English texts: the relationship
between cohesion and coherence. Ph.D. Thesis, University of
Nottingham.
Partington, A. (2003), The linguistics of political argument: The spin-doctor and
the wolf-pack at the White House. London: Routledge.
Partington, A. and J. Morley (2002), From frequency to ideology: comparing
word and cluster frequencies in political debate. Paper given at the 5th
TALC (Teaching and Language Corpora) conference, Bertinoro, 26-31
July.
Scott, M. (1999), WordSmith Tools, Version 3, Oxford: Oxford University Press.
Sinclair, J.McH. (1993), Written discourse analysis, in: J. Sinclair, M. Hoey and
G. Fox (eds), Techniques of description: spoken & written discourse. A
festschrift for Malcolm Coulthard, London: Routledge. 6-31.
Smith, F. (1978), Understanding reading (2nd ed.). New York: Holt, Rinehart &
Winston.
Swales, J. (1981), Aspects of article introductions. Aston ESP Monographs No 1,
Birmingham: Aston University.
Hilde Hasselgård
University of Oslo
Abstract
On the basis of material from the International Corpus of English this paper
presents a study of IT-clefts with an adverbial in cleft position. Most of these IT-
clefts are of the informational-presupposition type (Prince 1978), i.e. the cleft
clause conveys new information. Various textual functions of the IT-clefts are
explored, using the classification of Johansson (2002). Unexpectedly, the function
of giving contrastive focus to the cleft constituent does not seem to be
predominant in this material. An alternative hypothesis is explored: the IT-cleft
construction is seen primarily as a thematizing device, whereby the cleft
constituent receives thematic focus (which may imply contrast) and the theme-
rheme division is made particularly explicit.
1. Introduction
The IT-cleft construction has been studied both as a focusing device (e.g. Prince
1978, Gundel 2002) and as a thematizing device (e.g. Gómez-González 2000).
The construction is of interest in studies of information structure because it
allows a speaker/writer to spread the information of a single proposition over two
clauses and, consequently, two information units. It is normally assumed that the
cleft construction is a means of steering the focus towards the clefted constituent
(e.g. Gundel 2002: 118). The IT-cleft can have various types of phrases and
clauses as its focus, as shown below.
1
IT-cleft = IT + BE + clefted constituent [NP, PP, AdvP, (non-)finite clause] + cleft clause
Before going further, I shall briefly outline the kinds of adverbials that occur in
cleft focus position in the ICE-GB. Table 1 shows the occurrence of different
semantic types of adjunct in cleft position. The distribution of semantic types of
adjuncts in cleft position is not surprising; time and place adjuncts are the most
common types of adjuncts in most registers irrespective of position (cf. Biber et
al. 1999: 783 f.).
The most common assumption about the information structure of IT-clefts is that
the clefted constituent represents new, often contrastive, information (e.g. Biber
et al. 1999: 959). The subordinate clause typically conveys presupposed
information (e.g. Prince 1978: 896). Prince calls these stressed focus IT-clefts, thereby
indicating the discourse function of such clefts, namely to give special focus to
the clefted constituent. Gundel (2002: 118) refers to this information structure in
clefts as prototypical. Similarly, Collins (1991: 84) follows Halliday in claiming
that this information structure constitutes the unmarked type of IT-cleft: the
theme/new combination is unmarked: the construction creates, through
predication, a local structure, the superordinate clause, in which information
focus is in its unmarked place, at the end.4 This is illustrated in Table 3, from
Halliday (1994).
The term unmarked here does not, however, reflect quantitative data. In
Collins's comprehensive study of cleft constructions, only 36% of the IT-clefts
have a new clefted constituent and a given cleft clause (1991: 111), although the
clefted constituent is new in a clear majority of cases.
Another type of information structure in IT-clefts is described by Prince
(1978) (and others after her: Collins 1991, Delin and Oberlander 1995, Johansson
2002), namely the informative presupposition cleft, in which the cleft clause
conveys new information. The clefted constituent may contain either given or
new information in an informative presupposition cleft. In (2), both the clefted
constituent and the cleft clause are new, since the sentence occurs text-initially.
(2) It was just about 50 years ago that Henry Ford gave us the weekend.
(Prince 1978: 898)
According to Prince (1978: 898) the information in the cleft clause is encoded as
a (non-negotiable) fact. Although it is new, it is presupposed rather than asserted,
i.e. it is marked as known to some people although not yet known to the intended
hearer (ibid: 899). Prince says that "The whole point of these sentences is to
inform the hearer of that very information" (ibid: 898). Delin (1992: 296),
(3) However there are worrying signs for the Republicans in the contests for
state governors. Because of the shift in population to the warmer parts of
the country states like Florida Texas and California are to be given extra
seats in Congress. The governors of those states will have a big say in
redrawing the boundaries. And it's here that the Democrats have made
significant headway. They have won the elections for governor in both
Florida and Texas from the Republicans although Mr Bush's party appears
to have held on to the biggest prize of all California. (S2b 006#15-19)
were anaphoric, cf. examples (1), (3), and (7). The great majority of the IT-clefts,
close to 90%, were of the informative presupposition type.
Based on the distribution of given and new information over the clefted
constituent and the cleft clause, Johansson (2002: 185 ff.) arrives at four patterns
of IT-clefts:
There are quite a few differences between the results given in Table 4 and the
corresponding results of Johansson (2002: 188), which embraces all types of
clefted constituent. First and foremost, Johansson finds that Types A and B are
equally frequent (38% each), while Type C is least frequent in his material (10%).
Type D accounts for 14% in Johansson's material of English original texts.
Although the number of IT-clefts with adverbials is rather too low to give
conclusive results, the comparison of Johansson's material (based on 240
examples) and mine suggests that the information structure of clefts with
adverbials differs markedly from that of clefts with noun phrases. The most
important difference is that the IT-clefts with adverbials occur by far most
commonly with cleft clauses conveying new information (86%), while the cleft
clauses of IT-clefts in general seem to be divided about equally between given and
new information (Johansson 2002: 188). It may, however, be noted that Collins
(1991: 111) reports that 63% of his IT-clefts have new/contrastive information in
the cleft clause. It is possible that the differences may be due to the fact that
Collins's material as well as my own from ICE-GB contains both spoken and
written material while Johansson's contains only written material.
Studying the IT-clefts with adverbials in context, I found that they seem to have a
range of textual functions in the organization of the information flow. Collins
(1991: 106) makes a similar observation: The use of adverbials in cleft focus
position seems to have important textual functions, e.g. by acting as a bridge from
one topic to another or launching a (new) discourse topic. Returning to (3), for
example, we can note that the cleft sentence marks a transition between two
sections of the text. The discourse topic before the cleft sentence is demographic
and political features of Florida, Texas and California. After the cleft sentence
the text has moved on to the success of the Democratic Party, a topic that was
introduced in the cleft clause.
Johansson (2002: 193) proposes four main discourse functions of IT-
clefts (irrespective of the type of clefted constituent). I decided to use these
categories in order to facilitate comparison with his results, and was able to
identify all of them in the material for this study. The categories are the
following:
cases where it may be argued that the cleft construction represents a merger of
two categories, and additional categories will be suggested.
5.1 Contrast
In the canonical cleft (Gundel 2002: 118) the clefted constituent conveys new
information which is explicitly contrasted with something mentioned in the
preceding context. The cleft clause represents information that is known to the
speaker. In (4) the subject of reading books at an early age has already been
talked about, while the clefted constituent introduces a reader of a different age
from those previously discussed.
(4) I struggled terribly with them in my early teens and had no success at all.
It wasn't till I was perhaps twenty-five or thirty that I read them and
enjoyed them <S1A-013 #237-238:1:E>
It is also possible to express contrast in examples where the cleft clause conveys
new information. In such cases there is usually also another discourse function
associated with the IT-cleft. For instance, the clefted constituent may launch a
new discourse topic at the same time as marking a contrast (see next section), or
there may be a transition between two discourse topics (section 5.3).
(5) We must try to work out security arrangements for the future so that these
terrible events are never repeated <,> and we shall I promise you <,> bring
our own forces back home just as soon as it is safe to do so <,> It is to
those men and women serving our country in the Middle East <,> that my
thoughts go out most tonight # and to all of their families here at home <,>
To you I know this is not a distant war. It is a close and ever present
anxiety <,> I was privileged to meet many of our servicemen and women
in the Gulf last week <,> <S2B-030 #63-68:1:A>
(6) Shortages of food have been a repeated feature of recent Soviet experience
<,> with heavy dependence on grain imported from the United States as
the Soviets' own production has failed <,> But the spotlight has been on
empty shops in the towns rather than empty larders in the countryside <,>
It is to Africa that the television cameras go to show what happens when
local natural resources are so inadequate for the population living off them
that drought or continuous small-arms war causes famine for the people
counted in millions and death for many of them <,> Apart from such
disasters in succession in the same or in different places infant mortality is
the main counter to the birth rate's effect in Africa <,> and in parts of the
continent the heterosexual incidence of Aids may prove to have halted or
even reversed the growth in the population of potential parents and
condemned large numbers of children to early death <,> <S2B-048
#80:1:A>
It is the two-part structure of the IT-cleft that allows it to link together two
discourse topics: the current discourse topic is referred to in the clefted
constituent, while a new discourse topic is introduced in the cleft clause. I would
actually prefer to call this function Transition since it not only links together
two topics but also provides a bridge between two sections of a text. In other
words, the cleft sentence becomes a vehicle for topic shifting. Because the new
topic is presented as known or presupposed according to Prince, it is an
unobtrusive way of introducing new information which can then be the starting
point for the next section of the text. In (3) the cleft clause the Democrats have
made significant headway marked the beginning of a new section of the text. In
example (7) the idea of topic shifting is even clearer, because C's attempt to shift
the topic is refused by A, who wants to spend more time on the previous topic.
Thus, in (7), the transition does not fulfil its function, while in (3) and (8) the
topic is successfully shifted.
(7) C: But really what's happened with my sort of history is when I met uh did a
little recording with Chandos Records uhm and the Ulster orchestra who
was conducting there came up with enough money to do their first record
and they got Chandos interested. It was then that uh I fell in love with
music like Hamilton Harty and a bit of Stanford <,> and the Arn the
Arnold Bax Saga became something quite uh excellent.
A: Well that's a day we certainly want to come back to a bit later. But if we
could just for a moment concentrate on the latter years of the nineteenth
century. <S1B-032 #22:1:C>
Contrast + transition
Example (8) has a combination of marking a contrast with the clefted constituent
and creating a transition by means of the cleft clause. The contrast is between the
Villa Somalia mentioned earlier and the office building. The introduction of the
letters in the cleft clause starts off a new section of the discourse.
(8) {BEGINNING OF TEXT} The Villa Somalia which was Siad Barre's
official residence in Mogadishu still lies abandoned <,> guarded by a
handful of young men from the United Somali Congress the rebel force
which took control of Mogadishu at the end of January <,> But it was in
one of the office buildings that I discovered the letters <,> thousands of
them <,> addressed to His Excellency President Mohammed Siad Barre
but all unopened <,> I picked up one from Britain <,> It had been posted
in September nineteen eighty-eight and was signed by a retired
schoolteacher from Guildford in Surrey <,> writing on behalf of Amnesty
International to plead for the release of a blind Somali preacher who'd
been imprisoned for his religious beliefs <,,> <S2B-023 #61:3:A>
5.4 Summative
Summative IT-clefts tend to occur towards the end of a text or a section of a text,
and represent a kind of conclusion or rounding off. Example (9) occurs at the
very end of a speech and contains two cleft constructions. They share a clefted
constituent which is inferable. It is not contrastive, but may have the uniqueness
feature noted by Delin and Oberlander (1995: 469). The cleft clause in the
second cleft is new. Although it is backgrounded by means of subordination
(Delin and Oberlander 1995: 473), it represents a kind of punchline and softens
the war-talk in a clever way.
Contrast + summative
In example (10) we see both contrast and the summative function, towards the
end of an obituary. The clefted constituent resolves an either-or relationship
(Perzanowski and Gurney 1997: 218), i.e. a writer for children as opposed to
adults, while the summative function is evident from the cleft clause.
(10) Dahl's books often portrayed children battling against evil adults. [...] As
an adult author Dahl's fame was to come much later when his Tales of the
Unexpected were transferred to television. Yet it will be as a children's
writer he'll be remembered. His lasting legacy includes another two books
still to be published. Roald Dahl who's died at the age of seventy-four <,>
{END OF TEXT} <S2B-011 #17:1:B>
5.5 Thematization
In certain cases the main function of the cleft seems to be to make extra clear
what is to be understood as the theme and the rheme of a sentence. A good
example is (11), which represents a complete text. Thus there can be no contrast
involved, nor any topic-linking, topic-launching, or summary. Rather, in this
case, the writer wants to give thematic prominence to the regret he/she feels. It
may be noted that a non-cleft version (11a) cannot easily have the same
constituent in thematic position.
(11) It is with much regret that I find it necessary to send you a copy of the
enclosed letter which is self explanatory. <W1B-026 #121:15> {= entire
text}
(11a) ? With much regret I find it necessary to send you a copy of the enclosed
letter which is self explanatory.
describes the IT-cleft as a special theme construction, i.e. one that marks off the
theme of a sentence and gives it extra focus. Similarly, Collins (1991: 171) notes
that the theme in clefts carries a textual form of prominence. If the IT-cleft is
seen as a construction for thematization, it should follow that the theme in this
construction, like other themes, can be given, new, contrastive or non-contrastive.
Perzanowski and Gurney (1997: 214) note that 'certain types of it-clefts
[...] frequently occur in negative contexts'. This is also a finding of the present
study, where three examples had 'not until' as the clefted constituent, such as
(12). In such cases, a non-cleft version is not without problems, given that the
speaker wants a particular theme-rheme structure. That is, the corresponding non-
cleft version will require subject-verb inversion (12a). The IT-cleft may thus be a
way of using a marked construction in order to avoid one that is even more
marked.
(12) However it wasn't until his fourth album that the instrument's capabilities
were more fully explored <,> <S2B-023 #22:1:A>
(12a) Not until his fourth album were the instrument's capabilities more fully
explored.
Gundel (2002) and Johansson (2002) both document that IT-clefts are more
common in Norwegian/Swedish than in English. Looking for more examples
similar to (12) above, I have, however, found several examples of English
IT-clefts corresponding to other thematization structures in Norwegian, where
fronting of adverbials is more common and less marked than in English
(Hasselgård 1997: 14). In example (13), from the English-Norwegian Parallel
Corpus (ENPC), the Norwegian original does not have a cleft, but a fronted
adverbial. The translator, struggling to keep the thematic structure intact, opts for
a cleft, possibly again to avoid an even more marked structure.
(13) Først på den tredje dagen hadde Aua våknet. (ENPC: MN1)
Lit: 'first [=only] on the third day had Aua awakened.'
It was not until the third day that Aua awakened. (MN1T)
However, the thematizing function of IT-clefts does not only occur where a non-
cleft alternative would be awkward. In (14), for example, a non-clefted alternative
with a fronted adverbial is quite acceptable (14a), although there may be slightly
less focus on the adverbial. The clefted constituent does not mark any contrast,
nor does it close or launch a topic. According to Collins (1991: 175) the IT-cleft
enables an unambiguous mapping of theme on to new information in the
unmarked instance, as themes in IT-clefts are likely to convey new information
(Collins 1990: 111). In (12)-(14) the clefted constituent is indeed new.
(14) It was in nineteen hundred and six that the Queen's great-grandfather King
Edward the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers <ICE-GB:S2A-011 #91:1:A>
(14a) In nineteen hundred and six the Queen's great-grandfather King Edward
the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers
(15) And the Field Officer Brigade waiting rides up to Her Majesty the Queen.
He was not granted security for officer training when he joined the
regiment in nineteen sixty-eight because of his Polish ancestry. It was as a
Guardsman that he came to the Second Battalion which now he
commands and eventually became a lance sergeant instructor at the Guards
Depot. When he was finally accepted for Sandhurst he went on to win the
Sword of Honour and has since served as an officer with every company
of each battalion of the regiment. <S2A-011 #122-125:1:A>
Johansson (2002: 193) suggests that the discourse functions of clefts are
associated with the different patterns of information structure outlined in section
4 above. According to his findings, type A correlates with the discourse functions
contrast and summative, type B with topic linking, type C with topic launching,
and type D with contrast. Table 5 presents a summary of the occurrence of the
various discourse functions of IT-clefts with adverbials in the present material.
This has been correlated with the type of information structure identified in each
cleft sentence.
Because few of the IT-clefts with adverbials were of the stressed focus
type, there are few examples of contrast. There are thus not enough examples of
this function to make a valid comparison with Johansson's results, although it is
interesting that several of the contrast examples belong to type C (all new). It may
be noted that when a clefted constituent conveying new information expresses
contrast, the implication is 'contrary to expectation' rather than 'contrary to what
has been claimed'. Type B (given + new) seems to be a good indicator of Topic
Since the ICE-GB includes a range of different spoken and written registers, it
was possible to check whether the registers differed as to the use of adverbials in
IT-clefts. Collins (1991: 181) reports a slightly higher frequency of IT-clefts in
writing (the LOB Corpus) than in speech (the London-Lund Corpus). In the ICE-
GB the difference between speech and writing was the opposite as regards the
frequency of IT-clefts with adverbials: approximately 0.6 vs. 0.4 occurrences per
10,000 words, respectively. However, when the spoken category was divided into
scripted vs. unscripted, a further difference emerged, as shown in Table 6. The
category of scripted speech, making up only 6% of the corpus, accounts for 24%
of the clefts with adverbials. The unscripted spoken categories and the written
categories are then left with the same frequency of clefted adverbials.
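The disproportion just described can be checked with simple arithmetic. The following Python sketch is an illustration added here, not part of the original analysis. The function name representation_ratio is hypothetical, and the inputs are the rounded percentages quoted above (scripted speech: 6% of the corpus, 24% of the clefts with adverbials), so the results are only approximate.

```python
def representation_ratio(share_of_hits: float, share_of_corpus: float) -> float:
    """Return how many times more (or less) frequent a construction is in a
    category than that category's share of the corpus alone would predict."""
    if share_of_corpus <= 0:
        raise ValueError("share_of_corpus must be positive")
    return share_of_hits / share_of_corpus


# Rounded figures quoted in the text: scripted speech holds 6% of the words
# but 24% of the clefts with adverbials.
scripted = representation_ratio(0.24, 0.06)

# All remaining categories together: 94% of the words, 76% of the clefts.
rest = representation_ratio(1 - 0.24, 1 - 0.06)

print(f"scripted speech: {scripted:.1f}x, all other categories: {rest:.2f}x")
```

On these rounded figures, clefted adverbials are about four times as frequent in scripted speech as its share of the corpus would predict, and somewhat under-represented elsewhere.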
One may speculate that the discourse functions of clefts are well suited to the
(rhetorical) purposes of the scripted speech categories. These categories typically
belong to expository genres, particularly lectures and broadcast narration, but
there are also some official speeches. Possible reasons why the informative-
presupposition clefts with adverbials are handy may have to do with the
possibility of assigning unambiguous thematic prominence to the clefted
constituent and the possibility of presenting new information in the cleft clause
without asserting it. Further, as Delin (1992: 300) claims, the information in the
(presupposed) cleft clause is presented as a non-negotiable fact, which clearly has
its rhetorical advantages. Further exploration of such rhetorical properties of
clefts is, however, beyond the scope of this paper.
7. Concluding remarks
The starting point for the present study was a hypothesis that IT-clefts with
adverbials behave differently from other IT-clefts in discourse. The background
for this hypothesis was that clefts with adverbials in focus position seemed to
have an unexpected information structure, particularly that there seemed to be
many examples of the given + new pattern. One of the conclusions of the present
study must be that these perceived differences have to do with quantity rather
than with quality, as the information structure and the discourse functions found
with clefted adverbials have also been identified and described with other types of
clefted constituent (e.g. by Collins 1991, Johansson 2002). Presumably, many of
the differences arise from the fact that IT-clefts with adverbials tend to be
informative-presupposition clefts, while other IT-clefts are more likely to be
stressed-focus clefts.
The typical information structure of the IT-clefts with adverbials involves a
clefted constituent carrying given information and a cleft clause carrying new
information or, alternatively, one in which both parts of the cleft construction are
new. In both Collins (1991:11) and Johansson (2002: 188) these two types are
less frequent.
It is clear that clefts with adverbials have a range of textual meanings, or
discourse functions. It is equally clear that one must study the clefts in context in
order to get at these functions. The material offered examples of IT-clefts serving
contrastive, topic-launching, transitional, and summative functions. It was
suggested in section 5.5 that these discourse functions can all be regarded as
somehow ancillary to the function of theme (or to the theme-rheme nexus in the
case of transition).
The use of an IT-cleft enhances the textual prominence of the theme. As a
consequence, the construction is well suited for marking off the theme as new or
contrastive. However, IT-clefts are also used when the clefted constituent is
neither new nor contrastive, in which case the construction may simply serve to
make the theme-rheme division of the message extra clear. This may be the case
in clefts marking transition as well as in clefts where none of the other discourse
functions outlined in section 5 can be identified.
The function of transition seems particularly prominent with clefted
adverbials. Clefts with this function typically have a given, often anaphoric,
adverbial in cleft focus position, while the cleft clause introduces a topic for the
subsequent discourse. The speaker/writer thus achieves a smooth transition
between two topics, juxtaposing them by means of a relational clause, and
launching the new topic unobtrusively in a subordinate clause.
It was also shown that there are cases of IT-clefts being used to place an
adverbial in thematic position that would otherwise be difficult to place clause-
initially. This was seen with negative adverbials (e.g. 'not until'), which would
have required subject-operator inversion in a corresponding non-cleft sentence,
and with adverbials such as 'with much regret' (example 11), which probably could
not have occurred in initial position in a non-cleft sentence. Furthermore, the IT-
cleft can give extra thematic focus to clefted constituents that would have come
across as relatively unmarked themes in non-cleft sentences, particularly time
adverbials (e.g. example 14).
In the present study I have made frequent comparison with other studies of
clefts, particularly Collins (1991) and Johansson (2002). There are some
weaknesses involved in these comparisons. First of all, the three studies are based
on rather different corpora. More importantly, assigning information values and
discourse functions is no exact science, and the subjective element involved in
this work may account for some of the differences between previous studies and
my own. Ideally, the present study should have been extended to include IT-clefts
with nominal constituents in the ICE-GB corpus. This might have made the
comparisons with other IT-clefts more reliable. However, this task must be left to
a later study.
Another possible extension of the study would be to explore further the
different uses of IT-clefts in various genres. The investigation reported in section
5.6 showed a clear difference in frequency of the construction across genres,
cutting across the spoken/written dimension. A further study might look more
closely into such genre differences as well as more specific rhetorical uses of the
IT-cleft.
Notes
1. I follow the terminology of e.g. Gundel (2002) and Delin (1992), describing
the IT-cleft construction in terms of a clefted constituent and a cleft clause. No
discussion of the syntactic status of the relative-like subordinate clause will
be undertaken here.
2. In the corpus examples, the IT-cleft construction has been underlined, with the
clefted constituent in italics.
3. The category of manner adjuncts has been defined quite widely and includes
adjuncts of means and comparison.
References
Corpus material
On the pragmatic functions of let's utterances
Bernard De Clerck
University of Ghent
Abstract
This paper presents the results of research into the pragmatic functions of let's
utterances in the spoken component of the ICE-GB.1 The first part of the paper
gives an overview of the grammatical features and the pragmatic uses of let's
utterances as described in the literature. The second part presents a detailed
analysis of the attested let's utterances in the corpus. Apart from testing the force
and accuracy of the existing descriptions, the paper also examines the
frequencies of occurrence of these functions and possible relationships with the
different text categories they occur in. The goal is to provide an answer to such
questions as who uses let's utterances where, why, and how.
1. Introduction
Constructions with let's are intriguing. When one considers the possible
meanings of the pair
(a) Let us have a drink
(b) Let's have a drink
one can see that (b) is not just an informal variant of (a) with the abbreviated
objective pronoun us. On a semantic and a pragmatic level the picture is clearly
more complex than that. The meaning of example (a) is ambiguous and can be
interpreted in two ways. On the one hand, it can be interpreted as a non-inclusive
request for permission (i.e. the hearer does not belong to the group referred to by
us), which can be paraphrased as 'Allow us to have a drink'. On the other hand, it
can be interpreted as a hearer-inclusive proposal or suggestion for joint action,
involving both the speaker and the hearer. Example (b), however, is restricted in
semantic scope and has lost its non-inclusive interpretation. It no longer has the
meaning 'allow us to have a drink'. In contrast with (a), its illocutionary function
is restricted to a hearer-inclusive proposal for joint action and as such Shall we
There has been a great deal of debate on the syntactic properties of let and let's in
the existing literature, including the way they should be labelled or categorised.
What is interesting about this discussion is that it shows the shortcomings of
traditional grammatical distinctions whenever they are confronted with the hybrid
syntactic nature of a certain language item. The discussions also exemplify the
interconnectedness of the pragmatics, semantics and grammar of a language and
the descriptive and analytical problems that arise when the consequences of
certain changes in a construction affect these three levels at different rates.
Indeed, when reviewing the relevant literature one can see that the syntactic
properties of let and let's in the constructions at hand are actually described as a
mixture of auxiliary and non-auxiliary-like syntactic properties, whose syntactic
behaviour is often explained in terms of idiosyncratic construction-specific
characteristics.
One way of accounting for these properties is found in Seppänen (1977),
who identifies let as a hybrid modal auxiliary with a mixture of features,
characteristic of central and marginal modals. Seppänen points to the fact that,
like a regular modal, let occurs only in combination with a main verb, forming
with it a complex predicate where the semantic contribution of let is the notion of
volition (Seppänen 1977: 517).3 Furthermore, like the modals, let is always
followed by a bare infinitive form of the main verb. According to Seppänen, other
shared characteristics include the absence of non-finite forms, the lack of
inflection in the third person singular and the past tense, its use for negation and
emphatic stress. However, unlike the modals, negation with do is possible (Don't
let's do it vs. Let's not do it), as is sometimes the case with the semi-modals
ought to and used to. Yet, in order to explain differences in use between let in
let's constructions and these semi-modals, Seppänen has to resort to idiosyncratic
properties. This is especially the case with regard to the use of let as the operator
of the sentence in combination with do (Do let's try it again vs. Didn't they ought
to like it?). Furthermore, in his analysis, he regards the NP following let as its
subject, which forces him to conclude that another unique and idiosyncratic
property of let in these constructions is that they require the subject to be in the
accusative (objective) form when it is a pronoun, hence, us, me, him, her, them.
The problem with Seppänen's argumentation is that he is forced to attribute these
properties to the idiosyncratic nature of let as a modal verb.4 If let is analysed as
the imperative form of a full verb, the latter properties can be explained more
prosaically.
Treating let as the imperative of a full verb, however, is far from problem-free
either. In one approach, put forward by Costa (1972), no distinction is made
between the full lexical let and let as it occurs in let's constructions. In the latter
case, Costa still regards let as a straightforward imperative of a full lexical verb,
i.e. 'allow'. The effect of let as an imperative is described as 'exhorting the
second person to allow the desired event (...) to take place' (1972: 142).5 This
view on the meaning of let, however, cannot account for the ambiguous meaning
of Let us have a drink, described above, and seems to ignore the fact that in the
contracted variant Let's have a drink the interpretation 'request for permission'
(similar to 'allow us') has all but disappeared.6 There are other distinctive features
of let in let's constructions which remain unexplained in Costa's approach. One
of the most obvious characteristics of let in these constructions is of course the
fact that the accusative form of we, i.e. the pronoun us, can be, and most of the
time is, contracted to 's. In all other types of imperatives (including those with a
full lexical let) us cannot be contracted. Other features that distinguish the let's
construction from the imperative with the full lexical verb include (a) the
occurrence of shall we instead of will you in tag questions, (b) the
non-omissibility of let's in ellipsis, (c) the difference in semantic scope in negative
utterances and (d) the fact that let's cannot occur with a subject. A full discussion
of these features is to be found in Huddleston and Pullum (2002: 934-935). I will
restrict myself to giving a few examples that illustrate these contrastive
distinctions.
(a) Let's have another drink, shall we? Let her have another drink, will you?
(b) Let's go with her. *Yes, do. *No, don't.
Yes, let's. No, let's not. (Huddleston and Pullum 2002: 934)
(c) 1a. Don't let's go with her. 1b. Don't let her go with you.
2a. Let's not go with her. 2b. Let her not go with you.
(Huddleston and Pullum 2002: 935)
(d) *You let's go. You let her go.
In (c) there is a clear difference in meaning between the ordinary imperatives (1b)
and (2b): in (1b) let is inside the scope of the negation, so it is paraphrasable as
'Don't allow X to do Y'. In (2b) let is outside the scope of the negation: in this
case the utterance can be paraphrased as 'Allow X (not) to do Y'. All these
distinctive properties clearly show that the interpretation of let as an imperative of
the full lexical verb let ('allow'), as proposed by Costa, does not account for these
differences.
More recent views on let and let's, set out by Quirk et al. (1985) and Biber
et al. (1999), label let's as a pragmatic particle. Quirk et al. (1985: 148), for
example, say that it is a pragmatic particle with a quasi-modal status, an
unanalysed particle pronounced /lets/. Along the same lines Biber et al. (1999:
1117) say that in present-day English it is for practical purposes an invariant
pragmatic particle introducing independent clauses in which the speaker makes a
proposal for action by the speaker and the hearer. According to Quirk et al., the
particle status of let's is also supported by the existence, in familiar AmE, of the
pleonastic variant let's us, let's don't and of the construction let's you and me in
which the addition of the second person pronoun indicates that 's is no longer
associated with us.
Huddleston and Pullum (2002) make similar observations and point out
existing differences between what they generally call dialects A and B of the
English language. Dialect B is characterised as more lenient towards
constructions such as Let's you and I, which are similar to the pleonastic
variants given by Quirk et al. with regard to AmE. In this dialect these uses
would appear to be widely enough used to qualify as acceptable informal style in
standard English (Huddleston and Pullum 2002: 935). According to them, these
constructions indicate that syntactically the specialisation of let has been taken a
significant step further:
However, when talking about the less lenient dialect A (i.e. a dialect which does
not have the pleonastic variants illustrated above), Huddleston and Pullum say
that there is no compelling reason to suggest that there has been a reanalysis of
the syntactic structure (and hence no reason to regard let's as a pragmatic
particle). In their opinion, the data are compatible with an analysis where let is
still a catenative verb, used with an NP object (us or 's) and (except in ellipsis) a
bare infinitival clause as second complement. In their view, then, the analysis of
let's as a pragmatic particle proposed by Quirk et al. (1985) and Biber et al.
(1999) would only apply to AmE and not (yet) to BrE. Davies (1986) holds a
similar view in regarding let in let's constructions grammatically as the
imperative of the full verb let, with additional and exceptional features, the
possibility of contracting let us to let's being one of them. Syntactically, it still
has the status of an imperative of a main verb, but '[t]o provide a plausible
account of both the form and the interpretation of the let-construction, then, it
seems necessary to acknowledge a certain lack of correspondence between the
two' (Davies 1986: 250).
The literature also mentions uses of let's which move away from this idea of joint
agency. Quirk et al. (1985: 830), for example, mention that in very colloquial
English, let's is sometimes used for a first person singular imperative as well:
Let's give you a hand.9 Biber et al. (1999: 1117) also refer to this use, saying
there is also a tendency veering towards a first person singular (exclusive)
meaning, which they equate with Let me.10 Similarly, Huddleston and
Pullum (2002: 936) refer to cases where the action is in fact just carried out by
just one (typically the speaker). In an example such as Let's open the window, it
is possible that the actual aim is that of securing your agreement to my opening
it, rather than a proposal to open the window together (ibid.). This shift in
meaning is even more obvious in cases where the hearer cannot (appropriately)
perform the action presented in the verb, but where the agreement and
co-operation of the hearer are needed in order to carry it out successfully. This use of
let's would then explain why it is often found in a medical context, when a doctor
or specialist is talking to a patient:
Let's have a look at your tongue (Biber et al. 1999: 1117)
In these cases, the speaker is trying to find agreement with the hearer for an
action that will be carried out by the speaker only. Although the hearer has a role
to play in the process of having her throat examined, the kind of action that is
expected is not the same as the one presented in the verb following let's. The
actual perlocutionary effect is to get the patient to open her mouth and not to have
a look at her own throat together with the doctor. Rather than a genuine proposal
for joint action, it is more a (self-)exhortative announcement of the next step in
the examining process with the implication that the hearer will have to open her
mouth at one stage.
There is another tendency in the use of let's utterances that moves in the
opposite direction and merges with a second person singular imperative meaning,
i.e. the hearer is the intended agent of the action presented in the let's utterance.
Biber et al. (1999) refer to this use as a second quasi-imperative meaning, as
this use proposes an action which is clearly to be carried out by the hearer. This
crypto-directive style (Biber et al. 1999: 1117), which aims to camouflage an
authoritative speech act as a collaborative one, is used especially by adults when
addressing children:
You all have something to do for Ms. <name>? Let's do it please.
(Biber et al. 1999: 1117)
De Rycker (1990: 311) also notes this use and says that '(...) it has been
frequently observed that they [= let's utterances] also function as thinly disguised
addressee-oriented directive acts'. In these cases the let's utterance can be
paraphrased roughly as a second person singular imperative. Used as a
crypto-directive, Let's open our books on page 5 then actually means 'Open your books
on page 5'. The tact involved in this use of let's utterances can be perceived, not
as a face-maintaining strategy, but rather as a signal of insincerity and
condescension (De Rycker 1990: 313). As let's utterances that serve as indirect
directives are largely restricted to these rank-sensitive contexts, where direct
types [i.e. 2nd pers. sg. imperatives] of directive realisation patterns are involved,
they may well be interpreted as an unnecessary display of superiority on the part
of the speaker (ibid.).
In a similar way, one can also find the use of the first person plural
pronoun we to refer to the hearer only. This use often occurs in the interaction
between medical staff and patients, where this aspect of pretended togetherness is
found in questions such as How are we feeling today?, Did we sleep well last
night?, Did we take our pill?, where we is used to refer to the patient only. This
convivial use is often felt to be patronising as it resembles the use of we that is
sometimes found in the discourse of parents or teachers when addressing children
(e.g. Did we do our homework yesterday?). It will become clear from the analysis
below that let's utterances can also be used (deliberately) in a patronising way,
especially when they are seemingly used to smooth out rank differences where
there are none.
In the literature, the uses of let's utterances as action instigators (whether joint or
not) are mostly illustrated by examples that involve a non-linguistic action (cf. the
typical example Let's go for a drink). It is presented as the means par excellence
to make a proposal for the speaker, the hearer and possibly others to do
something together outside the linguistic boundaries of the ongoing conversation.
De Rycker (1990: 229), however, remarks that let's utterances can also have
specific functions within the linguistic context of the ongoing interaction itself.
He says that, as conversational imperatives (...), they frequently serve no other
function than managing part of the topical and structural development of the
interaction itself and perform actions relevant to the talk exchange. Their
illocutionary functions have to do with conversational activity (Levinson 1983:
228): speaking, to stop with speaking, paying attention, listening, and
(1) B: in the Academy they taught them to use more pencil but in the College
more rubber <,>
A: Well let's talk about Arnold Bax because the name's already come up in
our conversation uhm and uh he's obviously a very important figure and
both of you have recorded quite a good deal of music by Bax
A: Where did he look for his sources of inspiration?
(ICE-GB/S1B-032#90-92; broadcast discussions)
Other let's utterances that work at the level of ongoing interaction are
those which function as pragmatic formatives of the commentary type (Fraser
1987: 187) or as prospective or retrospective metapragmatic comments
(Thomas 1985: 770). They signal how the primary illocutionary act (performed
by the utterance of which they are part) fits into the ongoing conversational
structure (Fraser 1987: 187). Some examples are Let's say or Let's face it. Their
use as true suggestions for a joint activity of saying or considering something is
secondary to their use as metalinguistic utterances that provide clues to the
interpretation of the utterance as a whole.
In the next section, we will have a closer look at the frequency of the uses
mentioned above in the spoken component of the ICE-GB and investigate
whether their description in the literature can account for all attested uses.
Attention will be paid to the frequency of let's utterances in different text
categories, their different pragmatic functions and the relationships that can be
established between these functions and the speakers who use them. A distinction
will be made between conversational and non-conversational uses and between
truly joint and speaker/hearer-oriented let's utterances, and related to these,
whether they are to be seen as negotiable proposals or as conversational moves.
In this study, I examined the use of let's utterances in the spoken component of
the ICE-GB corpus. The spoken component consists of 300 texts, hierarchically
organised in different text categories, which are represented in Table 1.
164 instances of let's utterances were attested, unevenly spread across
the different text categories (see note 11). In the dialogues the frequency of let's is 1/3000
On the pragmatic functions of lets utterances 221
and in the monologues 1/5000 words (see note 12). Moving further down to the text
categories, one can trace more specific differences in frequency.
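The two ways of stating frequency used here (one token per N words in the running text, tokens per 10,000 words in Figure 1) can be converted into each other. The sketch below is only an illustration of that arithmetic; the function names are assumptions introduced here and are not part of the study.

```python
def per_base(one_in: int, base: int = 10_000) -> float:
    """Convert a rate of one token per `one_in` words into tokens per `base` words."""
    return base / one_in


def normalised_frequency(tokens: int, words: int, base: int = 10_000) -> float:
    """Tokens per `base` words, given a raw count and a (sub)corpus size in words."""
    return tokens / words * base


# Rates reported in the text for the spoken component of the ICE-GB.
dialogue_rate = per_base(3000)   # one let's utterance per 3000 words
monologue_rate = per_base(5000)  # one let's utterance per 5000 words

print(f"dialogues:  {dialogue_rate:.2f} per 10,000 words")
print(f"monologues: {monologue_rate:.2f} per 10,000 words")
```

On this conversion the dialogue rate corresponds to roughly 3.3 and the monologue rate to 2.0 tokens per 10,000 words, which puts both on the scale used for the individual text categories.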
Figure 1. Distribution of let's in the spoken categories of the ICE-GB (tokens per
10,000 words)
I first investigated the use of let's utterances in the corpus by using the distinction
that is made in the literature between joint, speaker and hearer-oriented agency.
The pie chart in Figure 2 shows the distribution of the intended agents of let's
utterances in the corpus.
Figure 2. Intended agents of let's utterances: joint agency 54%, speaker 38%,
hearer 8%
As we can see, just over half of the let's utterances are proposals for joint action in
which the hearer and the speaker are the intended agents of the proposed action.
The following corpus example illustrates this use:
(2) A: Let 's have a good uh
A: So let 's play Trivial Pursuit as well after or something
B: Mm
A: Shall we
(ICE-GB/S1A-048#123-126; face-to-face conversations)
Both speaker and hearer will be involved and actively participate in the proposed
action. In 38% of the let's utterances the intended agent of the action is the
speaker. In most of these cases, the hearer neither gives nor has the opportunity to
give a verbal response and is undergoing the action performed by the speaker. In
(3) we have an example of this speaker-oriented use in which a woman addresses
herself while trying to solve a slide feed problem. The use of let's is similar to
that of let me in this case:
(3) A: Think I have a slide feed problem <,,>
A: Here let's try the next one <,>
(ICE-GB/S2A-029#97-98; unscripted speeches)
Hearer-oriented utterances only comprise 8% of the cases. Example (4) illustrates
this use:
(4) A: So you go up on O and come down on Ooo and see if we can get to it that
way <,,>
B: That was a bit That was certainly easier <,>
A: Well let's do it again only this time your little <unclear-words>
meantime
B: Yeah
(ICE-GB/S1A-044#114-117; face-to-face conversations)
The music teacher is using a let's utterance to give instructions, but does not carry
out the proposed action herself.
Figure 3. Distribution of joint, speaker-oriented and hearer-oriented let's
utterances in dialogue and monologue (percentages)
Figure 3 shows the distribution of these three types in the monologue and
dialogue text categories. A closer examination of these different sub-categories
provides the picture shown in Table 2.
Apart from distinguishing between the different kinds of agent, the analysis also
focused on the pragmatic function as conversational and non-conversational
imperatives. The first important observation is the high frequency of
conversational let's utterances in the ICE-GB. No less than 67% of joint and
speaker-oriented let's utterances consisted of process types that were aimed at
influencing the conversational flow of the interaction. An example of this
conversational use is shown in (5), where the let's utterance is used by a
university lecturer to structure the topical organisation of his talk:
(5) A: So we 've got the nerve is having a kind of trophic action on muscle but
ma muscle actually also a a a acts in a trophic way towards the nerve
A: But let 's just stick with the nerve affecting the muscle for the moment.
(ICE-GB/S1B-009#159-160; classroom lessons)
The correspondence between speaker-oriented let's utterances and their
conversational uses was most obvious in the monologue text categories of
unscripted speeches, demonstrations and broadcast talks. About 70% of the let's
primary illocutionary act (performed by the utterance of which they are part) fits
into the ongoing conversational structure (Fraser 1987: 187). Their use is briefly
illustrated in the following examples:
(9) A: And the bar for a start
A: Right
A: which is unlikely, let's face it.
B: So I said to him
(ICE-GB/S1A-008#284-287; face-to-face conversations)
(10) A: But I still believe that <,> what is
A: I mean let 's say Charles Dickens is communicating through you
A: It 's still <,> got to be through your through the medium of you and
therefore the writer
(ICE-GB/S1B-026#235-237; broadcast discussions)
(11) A: Well it 's not that wonderful a film really <,>
A: let 's be honest
A: I 'm sure we 'll find something
B: No
(ICE-GB/S1A-006#167-169; face-to-face conversations)
When used as a pragmatic formative, let's face it can be seen in the first place as
an indication or a signal that the speaker feels very strongly about the primary
communicative act. According to De Rycker (1990: 403), it may also have a
defensive side to it: it expresses the speaker's awareness that s/he is saying
something that is either controversial or obvious, but which in each case may well
lead to a potentially unfavourable reaction by the listener. Let's is used as a
connective and in this way it still retains some of its prototypical pragmatic
meaning as an act of suggesting a desirable joint activity. Similarly, it can be
argued that let's say is an indication that part of what follows counts as a
hypothetical example, a rough guess or anything else about which the speaker is
not entirely sure (De Rycker 1990: 402). Of course it can also be used as a
convenient hesitation marker, thereby allowing the speaker to take more time to
formulate his or her thoughts. In this way let's say can be both a conversational
imperative and a pragmatic formative. Let's be honest also gives indications as to
how the utterance of which it is part should be interpreted: it seems to indicate
that although the information in the proposition might be controversial, it is still
something that the speaker supports. By using the connective let's and appealing
to the hearer's sense of honesty, the speaker tries to repair common ground or at
least indicates that s/he is aware of a possible discrepancy between the
propositional attitudes of the partners in the conversation. On the whole, it
appears that the function of let's utterances is primarily interactional when used
as conversational imperatives. Especially in the case of stereotyped or formulaic
uses taken from a stock of ready-made utterances, their illocutionary force as a
proposal for joint action is fairly weak and secondary to their use as idiomatic
overtures, hesitation markers and pragmatic formatives.
In this example one can see that a negative evaluation of the hearer's behaviour
has been poured into a let's utterance. Bearing in mind the fact that
hearer-oriented let's utterances are primarily used in rank-sensitive contexts (cf. above),
one could see this as a conscious use of patronising language by the speaker in
order to be deliberately offensive, ironic or sarcastic in criticising the hearer's
behaviour. From this position of assumed authority the speaker reprimands the
5. Conclusion
Apart from their typical function as a (genuine) proposal for joint action, let's
utterances have various other pragmatic uses. From the analysis it has become
clear that, in the ICE-GB, their most frequent function is that of a conversational
imperative, aimed at regulating the conversational flow of the interaction or the
structure of the speaker's own talk. This use is especially frequent when the let's
utterances are speaker-oriented or in cases of joint agency when they are used by
the interactionally more powerful speaker. In such cases, their illocutionary force
of proposal for joint action is secondary to their use as conversational managers.
Other non-typical functions were attested in non-conversational uses of
let's utterances which were not really oriented towards joint action, but which
aimed at presenting evaluative statements or feelings on the part of the speaker at
an interpersonal level. Rather than being proposals for joint action, they can be
seen as retrospective evaluations.
Notes
1. The research reported on in this paper was made possible by the Research
Fund of Flanders.
2. There are other more specialised uses of let's utterances, which will be
focused on in Sections 3 and 4. None of these uses, however, is completely
compatible with the strict notion of request, as in 'allow'.
3. Seppänen rightly remarks that in this way, let is, both syntactically and
semantically, close to the modal may, as used in wishes: May I (we) never see
that day! (Seppänen 1977: 517).
4. Another argument Seppänen uses to support his subject interpretation of the
following NP, is the existence of utterances such as
which have the nominative form and can hence not be explained in the
analysis of let as the imperative of a main verb. As an example of a parallel
development he refers to Dutch, where similarly both the accusative and the
nominative are used with the verb laten: Laat me/ik voorzichtig zijn, Laat
hem/hij maar komen, Laat ons gaan/even Laten we gaan. The parallelism
with Dutch, however, is far from complete. In fact, the only instances where
these rare examples of the nominative form after let are found in English, are
restricted to cases where the NP consists of two co-ordinated NPs and where
the NP is separated from the verb. According to Davies (1986), the
nominative form in (a) could be felt to result from the same sort of
hypercorrection that is responsible for the now frequent use of forms like
between you and I, or That's for you and I to decide, while (b) could be
considered to result from hypercorrection in reaction to use forms like us
three even as subjects (Davies 1986: 237). Apart from that, no examples of
*Let we see or *Let I see can be attested in the English language, which, all in
all, makes the argument of parallelism with the uses of laten in Dutch rather
doubtful.
5. A similar view is taken by Ukaji (1978). With regard to let's constructions,
the meaning is described as one in which the speaker prays the hearer to
allow a group of persons among whom the speaker and the hearer are
included to carry out a particular action (Ukaji 1978: 120).
6. Davies (1979) errs in the other direction by assigning special status to all
examples involving let, even those examples which are obviously instances of
the imperative form of the lexical verb.
7. To support their argument Huddleston and Pullum refer to the occurrence of a
negative construction used by some speakers of dialect B that provides
evidence for the reanalysis: let's don't bother. Huddleston and Pullum (2002:
935) remark, though, that this is much less common than the construction
with an NP after let's, and cannot be regarded as acceptable in standard
English. Still, according to Huddleston and Pullum (ibid.), its syntactic
interest is that it shows conclusively that let is no longer construed as a verb: a
subjectless don't could not appear in the complement of a catenative verb.
This, in fact, corroborates what Quirk et al. say about let's don't in AmE.
They do not say, however, that it is not acceptable as standard English.
8. It is possible, of course, that some of these constructions do occur in certain
dialects of English. However, from the fact that they do not occur in the ICE-
GB, we might tentatively conclude that they are not used widely enough to
qualify as acceptable informal style in standard British English.
9. There are other uses of us with a first person reference in colloquial English.
Instances such as Give us a hand, Tell us a story, Give us a kiss all feature
uses of us with a first person singular reference. In spoken language us is
often abbreviated to /s/, which makes its resemblance to this particular use of
let's even more striking.
10. Although Biber et al. (1999) say that the meaning of let's is equivalent to let
me in these cases, I tend to believe that this equivalence is not complete. It
seems to me that there is a slight difference in illocutionary force and in the
freedom given to the addressee to reject the proposal. Let me, just like let us,
can still be used or interpreted in two ways: as a true request for permission
from a person or as a (self-addressed) exhortative. As mentioned, the
permissive interpretation of let's has faded and an example like Let's have a
look at your throat can no longer be interpreted as a true request for
permission the way Let me have a look at your throat can. Both utterances
aim at getting the hearer's agreement, but do so in a slightly different way. By
playing with the requestive interpretation of the ambiguous let me, one can
give the impression of actually looking for or asking for agreement, whereas
when one uses the convivial let's (ranks being equal), agreement is taken
for granted and can be taken as a starting-point to proceed with the proposed
action. Consequently, one could say in a more accurate way that the meaning
of first person let's is equivalent to the exhortative meaning of let me only.
11. All in all, the corpus contains 178 let's utterances, 14 of which were found in
the written part of the corpus.
12. The mixed category of broadcast news did not feature any let's utterances.
13. Clearly, the fact that some text categories allow certain specific pragmatic
uses more easily than others, is just one possible way of explaining these
References
Thomas Kohnen
University of Cologne
Abstract
1. Introduction
The study of the diachronic development of speech acts with the help of
electronic corpora raises serious questions which challenge both the reliability of
existing data collections and the results of the investigations which are based on
them. Any attempt to write a corpus-based illocutionary history is faced with
basic problems involving the methodology of historical pragmatics and the design
and use of historical corpora. This paper aims to give a summary of some
important problems which I encountered in my corpus-based research, illustrating
them with several studies on the history of English directive speech acts. It falls
into two parts. The first part addresses some basic methodological issues; the
second part is devoted to some illustrative results of studies exploring aspects of
the history of English directives.
2. Methodological issues
we may assume that this illocutionary function remains stable throughout the
history of English. But we do not know in advance what linguistic form a speaker
or writer may employ for his directive. We can only rely on the fact that people
tend to use more or less fixed phrases or patterns in order to perform certain
speech acts. Since corpus searches must be based on forms rather than functions,
the study of a history of directives has to start with a selection of forms we would
consider typical manifestations of directives in the different periods of the
English language.
What are the most important manifestations of directives in the history of
English? The most straightforward examples which come to mind are explicit
performatives (I order you to carry this message to the king), imperative
sentences, constructions with let (let's do it) and constructions involving the
subjunctive, especially those with inverted word order (go we) (see note 1). Clearly,
however, there are many other manifestations of directive speech acts. The
number of possible candidates becomes even greater if we include those
realisations which are sometimes called indirect, because they involve sentence
types different from the imperative format (see note 2). Some typical examples are
declarative sentences with the second person pronoun plus a modal involving
obligation (you must leave, you ought to do this etc.), declarative sentences with a
first person pronoun plus a verb involving volition (I want you to do this, I would
like you to do this etc.) and different kinds of interrogative manifestations (Can
you open the door? Will you do the washing up? Why don't you come in? cf.
Quirk et al. 1985: 1477-78).
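Because corpus searches have to start from forms, the manifestations listed above can be thought of as a small set of search patterns. The following is a minimal sketch under strong assumptions: the regular expressions and category names are invented here for illustration, they only cover Modern English spellings of the quoted examples, and they are not the search procedure used in the studies reported below. Every hit is merely a candidate directive that still requires interpretation in context.

```python
import re

# Hypothetical form-based patterns, one per manifestation named in the text.
PATTERNS = {
    "performative": re.compile(r"\bI (?:order|command|charge) you\b", re.I),
    "let": re.compile(r"\blet(?:'s| us)\b", re.I),
    "modal_obligation": re.compile(r"\byou (?:must|ought to)\b", re.I),
    "volition": re.compile(r"\bI (?:want|would like) you to\b", re.I),
    "interrogative": re.compile(r"^(?:can|will|why don't) you\b.*\?$", re.I),
}


def classify(sentence: str) -> list[str]:
    """Return the names of all patterns whose form occurs in the sentence."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(sentence)]


# The example sentences are the ones quoted in the paragraph above.
for example in [
    "I order you to carry this message to the king",
    "Let's do it",
    "You must leave",
    "I would like you to do this",
    "Why don't you come in?",
]:
    print(f"{example!r}: {classify(example)}")
```

A form such as the inverted subjunctive (go we) is not matched by any of these patterns, which illustrates why the list of forms is necessarily open-ended.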
Quite clearly, this enumeration could be continued. It is difficult to give a
comprehensive list of all the typical manifestations of directive speech acts for all
periods of the English language. What does this mean for a corpus-based analysis
of speech acts? Basically, there seem to be two kinds of procedure. First, since
we are faced with an open, heterogeneous and highly variable set of forms we can
restrict our analysis to an eclectic illustration of the speech act under
consideration. That is, we look around in the periods of English and see what
typical realisations we find, for example, showing some imperatives and inverted
constructions in Middle English texts and some interrogative constructions in
Early Modern texts, perhaps adding some intuitive judgments about changes
which we assume to be typical. Secondly, we can base our analysis on a
deliberate selection of typical patterns which we trace by way of a representative
analysis throughout the history of English. For example, we could examine the
development of imperatives, constructions with let or interrogative directives
throughout the history of English. I call the first kind of procedure illustrative
eclecticism, the second structured eclecticism. Given the fact that the research is
doomed to be eclectic, I think corpus linguists should opt for structured
eclecticism.
A second methodological problem concerns difficulties of interpretation. Speech
act assignment is in many cases a matter of interpretation which requires careful
consideration of contextual knowledge. For example, we are liable to find
imperative constructions which do not serve as directives, but as imprecations or
Methodological problems in corpus-based historical pragmatics 239
wishes (Quirk et al. 1985: 831-832, Biber et al. 1999: 220). This problem, of
course, becomes more serious with indirect manifestations because their
indirectness is due to their openness to different speech-act assignments. These
are difficulties with functional interpretation, which apply to any linguistic data,
historical or contemporary. But in the history of English directives we also
encounter difficulties which relate to semantic or syntactic changes. This often
results in what I would call pragmatic false friends, constructions which, against
a contemporary background, suggest a wrong pragmatic interpretation. I will
present four examples.
The first example is taken from Chaucer's Canterbury Tales. Here a young
knight urgently needs to know what women desire most.
Quite clearly, the meaning of the last line is not 'Could you please instruct me? I
would certainly reward your efforts', that is, an indirect request, but rather 'If you
could tell me (knew how to instruct me), I would reward your efforts'. The
difficulty of interpretation is here due to the fact that the inverted clause pattern
(koude ye) could be interrogative or conditional in Middle English and that the
verb cunne was still used as a full verb in this period.
The second example is taken from an official letter by Henry V:
(2) we wol and charge you. þat ye se and ordeyne þat hasty restitucion
of þe forsaide goodes be maad and þat ye do compelle our saide
sougettes to make restitucion abouesaid
(Helsinki Corpus, Letters, 1418/1419, Henry 5, 99)
The difficulty of interpretation is here due to the fact that willan had an additional
speech-act meaning in Old English and Middle English. Thus the expression we
wol must be taken as a performative phrase, where wol has the speech-act
meaning of 'order, command'. The speech-act meaning seems likely since wol is
in a co-ordinate construction with another performative directive (charge). Thus
we are not dealing with some kind of indirect directive along the lines of 'I would
like you to ...' but with a performative expression.
In the third example, which is from Shakespeare, Falstaff has already been
informed that Ford wants to talk to him. Thus the utterance Would you speak with
me? is not a request ('Would you talk to me?') but rather a real question which
serves to identify the man who wanted to talk to Falstaff ('Did you want to talk to
me?'). In Modern English this interpretation would not be possible because would
cannot be taken as referring to the past.
The fourth example contains a construction with let us. It is found at the ending
of Stevenson's play Gammer Gvrtons Nedle. Here the phrase let vs haue a
plaudytie clearly is an invitation addressed to the audience to give applause,
which should be paraphrased with 'allow us / cause us to have some applause'.
Thus, although this construction is an imperative, it cannot be considered a
hortative or periphrastic imperative construction (cf. Rissanen 1999: 279). Rather
it is a construction with the full verb let. This is quite disturbing against a
contemporary background since we would like to rely on the assumption that
constructions with let us are always periphrastic constructions.
The discussion of the excerpts has shown that a corpus-based analysis
which selects items on the basis of form must be extremely careful with the
interpretation of the examples. Each individual item which we assume to be a
manifestation of a directive speech act requires careful consideration if we want
to avoid pragmatic false friends.
The third methodological issue involves the relationship between the
examples of particular manifestations of directives found in a corpus and the
underlying total number of directives. If we compare the different frequencies
of selected manifestations in the periods of English, do the increasing or
decreasing numbers only reflect an increase or decrease in the respective
manifestations or do they suggest a general change in the use of directives as
well? For example, if we find more directive performatives in Late Middle
English letters than in Modern English ones, is this because people choose to use
different means for expressing their requests in letters today or is it because they
use fewer requests? In other words, would a decreasing frequency of
performative directives point to an increase in alternative manifestations (e.g.
imperatives, constructions with let, interrogative directives) or to a general,
underlying decrease of directives? I think this problem can only be tackled if we
base our analysis on comparable text types or genres and if we assume a more or
less stable functional profile for these text types or genres. For example, we could
assume that religious instruction requires directive speech acts in the Middle
Ages as well as today. Or we might assume that text types involving spoken
individual text types or genres may raise doubts about the representativeness of
the analysis.
To sum up this methodological section, I would like to advocate the
procedure which was called structured eclecticism. It makes up for the
heterogeneity of the data by systematic selection, comprehensive diachronic
statistical analysis and careful consideration of each item. In addition, it has been
shown that a diachronic analysis of speech acts should be embedded in a
reasonably stable functional profile of text types. This as well as the notorious
lack of data call for more extensive text-type specific corpora for historical
pragmatic studies.
What is the outline of a history of directives in English? It seems best to start with
a general consideration of the speech-act class of directives. Since a directive
aims at an act to be performed by the addressee, it can be seen as a threat against
the addressee's freedom of action and freedom from imposition, that is as a threat
against what is usually called the addressee's negative face (Brown and Levinson
1987). It seems that in the history of English considerations of face have assumed
increasing importance, changing the manifestations of directives more towards
polite and indirect realisations. This tendency can be illustrated by a decrease of
direct realisations of directives, for example performatives and imperatives, and
by an increase of prototypical indirect manifestations, for example interrogative
directives and constructions with let.
With regard to performatives, I found that the frequency of directive
performatives in the Old English section of the Helsinki Corpus is seven times as
high as that found in the LOB Corpus (4 vs. 0.55 per 10,000 words). In addition,
the frequency of performative verbs referring to acts of ordering and
commanding is far higher in the Old English section than in the LOB Corpus (1.5
vs. 0.07). And whereas the LOB Corpus shows a clear predominance of
suggest/advice verbs, the Old English part of the Helsinki Corpus has none (for
a detailed discussion see Kohnen 2000). So it seems that performatives, which at
least today are a rather direct and mostly face-threatening manifestation of
directives, are significantly less common in contemporary (written) English than
they were during the Anglo-Saxon period. If they are used today, they tend to be
employed in rather mild requests, like suggestions or advice. On the assumption
that people perform as many directives in written English today as they did in
Anglo-Saxon times we may infer that the performative option of directives has
fallen out of favour and that other possibly less face-threatening means are
employed instead.
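The comparisons in this paragraph rest on simple ratios of normalised frequencies. As a quick check of the arithmetic, the sketch below recomputes them from the figures quoted above (per 10,000 words); the function name is an assumption introduced here for illustration only.

```python
def times_as_high(older: float, newer: float) -> float:
    """Ratio of two normalised frequencies given on the same base."""
    return older / newer


# Frequencies per 10,000 words as quoted in the text:
# Old English section of the Helsinki Corpus vs. the LOB Corpus.
all_directive_performatives = times_as_high(4.0, 0.55)
ordering_and_commanding = times_as_high(1.5, 0.07)

print(f"directive performatives: {all_directive_performatives:.1f} times as high")
print(f"ordering/commanding verbs: {ordering_and_commanding:.1f} times as high")
```

The first ratio comes out at about 7.3, in line with the 'seven times as high' stated above; for verbs of ordering and commanding the difference is roughly threefold larger, at about 21.4.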
With regard to the imperative manifestation of directives it can be shown
that the number of imperatives decreases from Middle English to Modern
English. I looked at imperatives in the religious treatises in the Penn-Helsinki
Parsed Corpus of Middle English (PPCME2, Kroch and Taylor 2000) and in the
Brown Corpus and LOB Corpus. Religious treatises, both in their Late Medieval
and their modern form, can be assumed to have a basic instructional function,
which makes it likely for imperatives to be used there. Since I wanted to focus on
instructional imperatives which directly serve the purpose of religious instruction
associated with the text type, I excluded those sentences which appear in direct
speech, that is imperatives contained in narrative sections, quotations from the
Bible, etc. (see Figure 1).
Figure 1: Imperatives (excluding direct speech) in the PPCME2, the LOB Corpus
and the Brown Corpus (tokens per 1000 words).
  PPCME2, 1200-1375: 4.5
  PPCME2, 1390-1450: 1.9
  Brown Corpus: 0.8
  LOB Corpus: 0.6
Figure 1 shows that the frequency of imperatives in the period 1200-1375 is fairly
high (4.5). It goes down significantly in Late Middle English (1.9). But the
frequency in the LOB Corpus (0.6) and in the Brown Corpus (0.83) is less than
half of the Late Middle English figure (see note 4). Since we may assume that all texts share
the same basic function of religious instruction, the decrease in imperatives can
be explained either by the hypothesis that religious instruction uses fewer
directives today (and, for example, more representative speech acts instead) or by
the hypothesis that imperatives no longer enjoy wide currency as a means of
directives in religious instruction but are replaced by other manifestations, for
example indirect directives. It is difficult to determine with certainty which
hypothesis is correct, but there is some evidence which renders the second option
more probable than the first one. Frequent constructions with let us and with
modals point to the fact that directives are still being used in religious instruction.
And, of course, a number of directives employing the imperative can still be
found in the contemporary data. There is, however, a remarkable difference
between most of these imperatives and the typical imperatives in the Middle
(5) ... þerfore do þou þi-silf alle þe gode deedis wiþ-oute deuocioun, þe
whiche þou didist bifore with deuocioun.
(a 1396, Walter Hilton, Hilton's Eight Chapters on Perfection,
PPCME2 - 4.23)
(6) ... and preie ȝe hertly for hem, that God of his greet mercy ȝeue to
hem very knowing of scripturis, and meekenesse, and charite.
(a 1397, John Purvey, Purvey's General Prologue to the Bible,
PPCME2 I, 49.2033)
(7) What about religion and politics? They are not in two watertight
compartments. Think of the number of laws that have just as much to
do with a man's soul as with his body.
(LOB, D16, 11-13)
(8) If the people had kept the Lord before them and observed His words
through the former prophets, things would have been far otherwise.
And what was His word now through Zechariah, but just what it had
been through them. Take Isaiah's first chapter as an example. He
accused the people of moral corruption, whilst maintaining
ceremonial exactitude.
(LOB, D11, 39-44)
Since the data indicates that the frequency of more direct and possibly face-
threatening manifestations of directives decreases in the history of English, it
makes sense to ask whether these forms are replaced by other, possibly more
polite directives. To find a possible answer, it is instructive to look at the
(9) Sam. I pray you let mee intreat you: foure or five houres is not so
much.
Dan. Well, I will goe with you.
(Helsinki Corpus, 1593, George Gifford, A Dialogue Concerning
Witches B2R)
(10) Venat. Come my friend, Piscator, let me invite you along with us;
I'll bear you charges this night, and you shall bear mine to morrow;
for my intention is to accompany you a day or two in Fishing.
(Helsinki Corpus, 1676, Izaak Walton, The Compleat Angler 212)
4. Conclusion
Although the research presented here is the result of what I called structured
eclecticism and although corpus-based historical pragmatics faces several
methodological problems, the general picture of the history of English directives
is quite consistent. During the history of English, directives become less explicit,
less direct and less face-threatening. By contrast, the number of indirect
manifestations increases. The important period for the evolution of indirect
manifestations seems to be the Early Modern period, whereas the frequency of
direct manifestations seems to decrease after the Middle English period. The
underlying motivation seems to be the growing importance of considerations of
politeness.
Notes
1. Cf. Fischer (1992: 248) and Rissanen (1999: 228-229, 279-280); for a survey
of directives, see Quirk et al. (1985: 827-833).
2. See, for example, Searle (1976) and Levinson (1983: 263-276).
3. For information on the Helsinki Corpus, see Kytö (1996).
4. Text 9 from the Brown Corpus (Organizing the local church, 2056 words)
was excluded from the analysis, since, as instruction for instructors, it is
focused on organisational matters and does not contain religious instruction in
its proper sense.
5. Biber et al. (1999: 222) say that in contemporary academic prose imperatives
are used as a means of guiding the reader in interpreting the text.
References
Kytö, M. (1996), Manual to the diachronic part of the Helsinki Corpus of English
Texts. 3rd ed. Helsinki: University of Helsinki.
Levinson, S.C. (1983), Pragmatics. Cambridge: Cambridge University Press.
MED: Middle English Dictionary, electronic version, in: F. McSparran et al.
(eds) (1999), The Middle English compendium. Ann Arbor, Mi: University
of Michigan Press.
OED: The Oxford English dictionary second edition on compact disc (1994),
Version 1.13. Oxford: Oxford University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rissanen, M. (1999), Syntax, in: R. Lass (ed.), The Cambridge history of the
English language, vol. III: 1476-1776. Cambridge: Cambridge University
Press. 187-331.
Searle, J.R. (1969), Speech acts. Cambridge: Cambridge University Press.
Searle, J.R. (1976), A classification of illocutionary acts. Language in Society 5:
1-24.
Measure noun constructions: degrees of delexicalization and
grammaticalization
Lieselotte Brems
University of Leuven
Abstract
In a narrow sense, the term Measure Noun (MN) refers to such nouns as acre
and kilo, which typically measure off a well-established and specific portion of
the mass or entity specified in a following of-phrase, e.g. a kilo of apples. When
used like this, the MN is generally considered to constitute the lexical head of the
bi-nominal noun phrase. However, the notion of MN can be extended to include
such expressions as a bunch of and heaps of, which, strictly speaking, do not
designate a measure, but display a more nebulous potential for quantification.
The structural status of MNs in this broader sense, then, is far from
straightforward and most grammatical reference works of English are either
hesitant or silent with regard to the issue. Two main analytical options seem to
suggest themselves. Either the MN is interpreted as constituting the head of the
NP, with the of-phrase as a qualifier of this head, or the MN is analysed as a
modifier, more specifically a quantifier, of the head, which in this case is the
noun in the of-phrase.
Starting from the structural analyses of MN constructions offered by such
linguists as Halliday and Langacker, my paper goes on to discuss a corpus study
aimed at charting and elucidating the structural ambivalence observed in MN
constructions. The framework eventually opted for is that of
grammaticalization. The focus of the corpus study is synchronic
grammaticalization (Lehmann 1985). More specifically, it investigates the
degree of grammaticalization of the various MNs looked at, viz. bunch(es) of,
heap(s) of, pile(s) of and load(s) of.
The type of construction in which they are used is a bi-nominal
noun phrase of the kind illustrated in the following set of examples. All examples
in this paper are from the Cobuild corpus.3
(1) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway
(2) A jilted girlfriend got revenge on the boyfriend who dumped her by
dumping a foot-high pile of manure in his bed.
(3) We still have to move loads of furniture and other stuff.
(4) The surrogate mum to princes William and Harry shared heaps of fun
with them at a fair yesterday while father Charles was otherwise engaged.
(5) I would take up a pile of commonplace books like Lord David Cecil's
Library Looking Glass, John Julius Norwich's Christmas Crackers, Rupert
Hart-Davis's A Beggar in Purple, etc.
(6) Then I noticed, under a pile of other books on my nightstand, the worn
journal my father had given me those weeks ago.
The central question with regard to MN constructions is the status of the MNs
bunch, pile, loads and heaps within their respective NPs. Does the MN constitute
the head noun, or does it function as a quantifier of the head noun in the of-
phrase? Naturally, assessing the status of the MN within the bi-nominal NP has
repercussions at clause level, most notably on the question of subject-finite
concord whenever the MN-nominal occurs in subject position. The central
question of the study is of a comparative nature and focuses on possible
differences in the extent to which the various MNs have already come to function
as a quantifier (Section 3).
A quick glance at the above set of examples illustrates the specific rub of
MN constructions. Sentences (1), (2) and (6) are rather unproblematic: in all
three the MN is the head noun, displaying the literal and collocationally
restricted meaning of bunch and pile. This analysis is reinforced by the verb
agreement between hangs and bunch in (1), the preposition under in (6) and the
premodifier foot-high in (2), which stresses the fact that it is a literal pile
taking up a certain space. In (4), on the other hand, the MN functions as a
quantifier of the noun in the of-phrase (N2).
The lexical constellation specifics of heaps have bleached into a mere
quantifier, which allows the MN to be used with an abstract noun like fun that
surely cannot be made into actual heaps. Examples (3) and (5) are more
problematic: do the furniture and stuff constitute actual loads or does the sentence
simply mean a large quantity of furniture and stuff without it necessarily being
arranged in a literal load? Do the commonplace books in (5) together make up an
actual pile or is it merely implied that the number of books could constitute a
pile?
Examples (3) and (5) are intermediate between examples such as (1) and
(2), with the original and fully lexical meaning of the MN, on the one hand, and
the grammatical quantifier meaning of (4) on the other. I will suggest that the
developments observed in MN constructions are best looked at as a case of
ongoing delexicalization and grammaticalization in MNs (Section 2.3). The by
now more or less full-blown quantifiers a lot of and lots of can be considered
historical precursors in these developments.
Synchronic variation in verb agreement patterns is an important argument
for claiming that the structural status of the MN is changing from head to
quantifier. Wherever subject-finite concord is observable, analysis of data reveals
consistent patterns. When the MN is head of the bi-nominal group, it controls verb
agreement (examples 7 to 9); when it functions as a quantifier, the finite agrees
with N2 (examples 10 to 12).
(7) I can show you a van-load of weapons that was confiscated at the gate.
(8) Three plane-loads of food have been ferried into the town in the past three
weeks.
(9) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway.
(10) A bunch of drunken, braindead louts seem determined to disgrace our
team.
(11) But then, when I needed one, there were a load of excuses as to why I
couldn't borrow one.
(12) They are threatening to kill off a bunch of select committees that have
been around for a long time.
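The concord test illustrated in (7) to (12) can be restated as a small decision procedure. The following Python sketch is an illustration only and is not part of the original study: the function name agreement_controller and the labels it returns are hypothetical, and the sketch assumes that the grammatical number of the MN, of N2 and of the finite verb has already been annotated by hand. Where MN and N2 have the same number, concord cannot discriminate between the two analyses, and the function says so.

```python
def agreement_controller(mn_number, n2_number, verb_number):
    """Decide which noun the finite verb agrees with.

    All three arguments are either "singular" or "plural" and are assumed
    to come from manual annotation (hypothetical input format).
    """
    if mn_number == n2_number:
        # Both nouns would trigger the same verb form: no evidence either way.
        return "undecidable"
    if verb_number == mn_number:
        return "MN controls agreement (head reading)"
    if verb_number == n2_number:
        return "N2 controls agreement (quantifier reading)"
    return "undecidable"


# Example (7): "a van-load of weapons that was confiscated"
print(agreement_controller("singular", "plural", "singular"))
# Example (10): "A bunch of ... louts seem determined"
print(agreement_controller("singular", "plural", "plural"))
```

A procedure of this kind only covers the cases where subject-finite concord is observable at all; other clues to head or quantifier status, such as premodification, fall outside it.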
The dependency layer, on the other hand, analyses the nominal group as a
univariate structure, viz. in terms of the recursive head-modifier relationship
displayed by the nominal group, as shown in Table 2.
In the default case the Thing of the experiential layer and the logical Head
coincide. However, there are a few types of nominal group where Head and
Thing do not coincide, and those involving a measure of something (...) [i.e.]
measure nominals (Halliday 1985: 173) are an example of such a discrepancy
between Head and Thing. Halliday goes on to analyse so-called measure
expressions (id.: 169) in the following way:
In the logical structure, the measure word (pack, slice, yard) is Head,
with the of phrase as Postmodifier. The Thing, however, is not the
measure word but the thing being measured: here cards, bread, cloth.
The measure expression functions as a complex Numerative.
(Halliday 1985: 173)
This dual analysis can be visualised by the box diagram for a pack of cards
shown in Table 3.
[i]t is not that one [analysis] is right and the other wrong; but that in
order to get an adequate account of the nominal group, [...] we need
to interpret it from both these points of view at once. [Italics LB]
(Halliday 1985: 172-173).
suggesting that in each use the MN is always both. Against this, the description of
MN constructions proposed in this article will involve two synchronically distinct
analyses, with the second one being treated as a (diachronic) re-analysis of the
first:
Langacker (1991) turns to the issue of MNs in his general discussion of the
function of quantification in the NP. What is interesting about his observations is
that they immediately address the question of MNs from a diachronic angle, i.e.
MNs as an emergent means of quantification.
Langacker's observations pertain to bi-nominal MN phrases such as a
bunch of carrots, a bucket of water and a lot of sharks. He observes that the
nouns which appear as heads constitute a diverse and open-ended class. [italics
LB] (Langacker 1991: 88). MNs are by default attributed head status, despite the
ambivalent semantics of appear. He continues by remarking that
In addition, most of these nouns have developed a more figurative sense. Such
metonymic extensions are possible because the above MNs all incorporate a
conception of their typical size, which is part of their encyclopedic
characterization. In the extended senses the physical entity designated by the
MNs has become secondary to the size specification provided by the noun: For
instance, a bathtub may contain a bucket of water without there being any bucket
in it - it is only implied that the water would fill a bucket were it placed in one
(Langacker 1991: 88). Or in other words (Langacker 1991: 88-89), The notion
of a discrete physical object has faded, leaving behind the conception of a
schematically characterized mass (the mass that, in the original sense, either fills
The aim of the corpus study reported on in this paper is to provide some answers
to both proposed re-analyses, viz. has N1 shifted from head to quantifier and N2
from postmodifier to head?
In conclusion to Langacker's account, we can say that it is interesting that
he notes a diachronic shift with regard to the structural status of MN from head to
quantifier, instead of the mere synchronic ambivalence proposed in mainstream
grammars or the simultaneous layers in Halliday's analysis. Langacker also
suggests that this grammatical re-analysis is paralleled by lexical extension and
desemanticization of the MN. He therefore does accommodate the dynamic
aspect of MN constructions by working with two distinct diachronic stages in the
structural development of MN constructions.
The framework which seems most suitable for tackling the specific developments
encountered in MN constructions is that of grammaticalization theory, which not
only does justice to these developments but also explains them.
Grammaticalization itself has been defined in several ways (e.g.
Haspelmath 1989, Fischer 1999 and Bybee 2000), but its essence is captured by
Lehmann's (1985) definition, which is appropriately general and consists of a
number of interesting parameters. Lehmann also distinguishes between
diachronic and synchronic grammaticalization, a distinction that will prove useful
for this corpus study. Lehmann defines both types of grammaticalization as
follows:
For this corpus study heap(s) of, pile(s) of, bunch(es) of and load(s) of were
extracted from the Cobuild Corpus, The Bank of English. In each case the plural
and singular variants of the MNs were regarded as distinct expressions. The
corpus data were analyzed as either head or quantifier, or vague. The last
category subsumes those MN uses which activate both head and quantifier
features. Typically, it contains expressive stretches of discourse in which both the
lexical meaning and the quantificational potential of the MN are exploited.
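By way of illustration, the retrieval step can be approximated with a simple pattern search. The Python sketch below is an assumption about how such an extraction might be set up, not a description of the Cobuild query actually used: the pattern MN_PATTERN and the function find_measure_nouns are hypothetical names. In keeping with the decision to treat singular and plural variants as distinct expressions, each hit is returned together with its grammatical number.

```python
import re

# Hypothetical search pattern for the four MNs followed by "of".
MN_PATTERN = re.compile(r"\b(heap|pile|bunch|load)(s|es)?\s+of\b", re.IGNORECASE)


def find_measure_nouns(text):
    """Return (lemma, number) pairs for every MN + of sequence in the text."""
    hits = []
    for match in MN_PATTERN.finditer(text):
        lemma = match.group(1).lower()
        number = "plural" if match.group(2) else "singular"
        hits.append((lemma, number))
    return hits


sample = "We still have to move loads of furniture. A bunch of grapes hangs too high."
print(find_measure_nouns(sample))
```

A search of this kind only retrieves candidate strings; deciding whether a given hit is a head, quantifier or vague use remains a matter of manual analysis.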
On the basis of these quantified data, differences in the degree of
(synchronic) grammaticalization between the various MNs can be studied, which
will be the main focus in the discussion of the corpus results.
The relative frequencies of quantifier, head or vague uses of the various
MNs examined are represented in the following tables, which also contain
examples of the respective categories. Adjectives or nouns premodifying MN or
N2 are underlined, as well as verbs or other elements of structure that serve as
important clues for either head or quantifier status.
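How relative frequencies of this kind are derived from annotated concordance lines can be shown with a minimal Python sketch. The function relative_frequencies is a hypothetical helper, and the labels in toy_annotations are invented for the sake of the example; they are not the Cobuild figures reported in the tables.

```python
from collections import Counter

CATEGORIES = ("head", "quantifier", "vague")


def relative_frequencies(labels):
    """Return the percentage share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts[category] for category in CATEGORIES)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: round(100 * counts[category] / total, 1)
            for category in CATEGORIES}


# Invented toy annotations, one label per concordance line.
toy_annotations = {
    "heaps of": ["quantifier", "quantifier", "head", "vague"],
    "pile of": ["head", "head", "head", "quantifier"],
}

for expression, labels in toy_annotations.items():
    print(expression, relative_frequencies(labels))
```

Because singular and plural variants are kept apart, each expression receives its own set of percentages, which is what makes a comparison between, for instance, pile of and piles of possible.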
The percentages indicated in the tables immediately show that there are
differences between the various MNs in terms of their degree of
grammaticalization or quantifier potential. In order to represent visually the
degrees to which the MNs have grammaticalized, they are set out on a scale of
synchronic grammaticalization in Figure 1 (see Section 2.3). The percentages for
lot of and lots of were also obtained through analysis of Cobuild data. They are
included in the cline as precursors in the present MN-developments.
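[Figure 1. Scale of synchronic grammaticalization of the MNs; only the values 2.1% and 99.8% are preserved from the figure.]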
Loads of, bunch of, load of and heaps of have all grammaticalized strongly, while
piles of, pile of and bunches of have hardly grammaticalized at all.
Two main factors seem to play a part in these observed differences in
degree of grammaticalization, viz. dissimilarity in the degree of delexicalization
and collocational broadening, and differences in expressive value of the MNs,
which also involves a phonetic factor.
Considering the fact that the grammaticalization processes of MNs are
largely semantically driven (Section 2.3), it seems only natural that differences in
grammaticalization level can be explained by differences in the preliminary
delexicalization processes. Differences in quantifier potential between the
semantically related heap and pile, for example, have to be put down to
differences in delexicalization potential between the two MNs. These differences
are dependent on certain lexico-semantic properties inherent in the concepts of
pile and heap, which are resistant and conducive to semantic generalization
respectively.
The blocking factor in pile is the feature of verticality and constructional
solidity it calls up. These semantic features are, so to speak, too specific to bleach
into a mere quantity meaning. The concept of heap on the other hand is in itself
more vague and simply profiles an undifferentiated mass, from which it is much
easier to detach a mere quantifier meaning. The lack of delexicalization potential
in pile is matched by a very restricted collocational extension, mainly limited to
prototypically stackable concrete nouns like rubble, paper, bricks etc. Heap(s) of,
on the other hand, has loosened its collocational requirements systematically. In
addition to the prototypically stackable nouns it combines with when used as
head, it has extended to concrete nouns irrespective of their semantics, to human
nouns (e.g. (25)) and abstract nouns (e.g. (26)). Heaps of has hence developed a
systematic quantifier use which is more or less devoid of its original lexical
semantics.
The non-head/quantifier uses of pile of and piles of distinguished in the
tables above are all restricted to very specific contexts, as in (34), highly
expressive stretches of discourse, as in (30) and (35), or dependent on metonymy,
as in (31) and (36) for example. Compare the following two MN nominals, which
alternately have heaps and piles combined with the same set of nouns; the
collocational restrictions on the quantifier use of piles of are immediately
obvious:
piles of stones/paper/people
heaps of stones/paper/people
Piles of evokes a vertical, layered constellation with all three nouns, which
renders the combination piles of people highly marked. By contrast, with heaps a
mere quantifier reading is at least as unmarked as a literal interpretation for all
three nouns; in the case of people the quantifier reading is the most natural one.
Bunch of is another MN of which the high level of grammaticalization can
be explained by a process of extensive collocational broadening. As opposed to
pile or heap, the delexicalization process of bunch has a readily identifiable first
stage in which it designates a very specific cluster-like constellation with an
accordingly restricted set of collocates in the N2 slot, e.g. grapes, flowers,
carrots. Gradually, the specific cluster meaning starts to bleach and the
collocational scatter broadens to include concrete nouns beyond that limited set,
as well as abstract nouns and human nouns, which are in fact the predominant N2
type when bunch functions as a quantifier. The following table represents the
various extensions.
Bunch of, just like loads of and heaps of, also illustrates the expressivity and
affective values grammaticalizing MNs often develop, at times leading to new
patterns of collocational consolidation. Especially when used with human nouns,
bunch of goes beyond merely quantifying N2 and also qualifies it, usually
negatively. This qualitative function can be reinforced by the addition of
qualitative adjectives premodifying the MN, as in (53) and (56).
The negative qualification expressed by bunch is best described as
negative semantic prosody, as defined by Sinclair (1992) and Bublitz (1996): a
negative, or occasionally positive, semantic aura spreading from node to
collocate. Bunch radiates a specific halo, it prospects ahead and sets the
scene (Sinclair 1992: 8) for a particular type of subsequent item (Bublitz 1996:
11). This strong predictive power with regard to N2 can create new collocational
requirements and idiom-like patterns of collocational consolidation, as in (48),
(50), (53) and (56), both with human nouns and abstract nouns. It is only with
regard to nouns such as guys, lads, kids, etc. that bunch of radiates a positive
semantic prosody. In such expressions as a bunch of guys/lads/etc. there is the
additional suggestion of bondedness, of a close-knit group of amicable people.
This can be seen as a metaphorical revival of the original cluster semantics.
The specific qualitative meaning of bunch of brings us to the second
important factor motivating the grammaticalization of MNs, viz. the expressive
value they can acquire. As a means of quantification heaps of and loads of are
very hyperbolic in nature, which can be stressed by repeating them as in (43).
Differences in expressive value might also explain why the plural versions of
heap, pile and load have grammaticalized more strongly than the singular
variants. The plurality in terms of grammatical number adds to the hyperbolic
meaning it expresses as a quantifier. In addition, the intrinsic mass meaning of
plural nouns (Langacker 1991) likewise enlarges the magnitude already expressed
by the MN. Phonetically, these plurals contain a vowel that can easily be
lengthened, producing a similar effect of exaggerating the quantity of N2. In this
respect, the extensive grammaticalization of loads of might be enhanced by the
graphemic and phonetic resemblance to lots of, with the added bonus of a
strongly prolongable diphthong in front of a voiced consonant.
In the case of bunch, on the other hand, it is precisely the other way round:
the plural form displays a near-exclusive head use, while it is the singular form
which has a prevailing quantifier use. The resistance of bunches to
grammaticalize into a quantifier might be due to prosodic features which do not
lend themselves well to expressive use, such as the extra syllable the plural
morpheme gives rise to (p. c. Halliday). Grammaticalizing MNs, with their
typical blend of lexical and grammatical potential, thus satisfy the language
user's needs for a quantifier as well as the desire to be expressive.
4. Conclusion
Notes
1. I would like to thank all people at the 2nd Workshop of the Systemic
Functional Research Community (FWO - Fund for Scientific Research
Flanders grant no. WO.018.00N) in Leuven, 21-24 November 2001, as well
as those at ICAME 2002 for their much appreciated comments on earlier
versions of this paper.
2. These are just two of the many names they are commonly labelled with.
Others are quantifying nouns in Biber et al. (1999) and NP-like quantifiers
in Akmajian and Lehrer (1976).
3. All examples are extracted from the Cobuild Corpus, The Bank of English,
and reproduced here with the kind permission of HarperCollins.
4. This concept is alternatively referred to as semantic attrition,
desemanticization and demotivation (Lehmann 1985: 307).
References
Yourself: a general-purpose emphatic-reflexive?
Göran Kjellmer
Göteborg University
Abstract
In general, the reference of English personal pronouns has been relatively stable
over the centuries: I (and its forerunners) can normally be taken to refer to the
first person singular, and so on. If this is the general picture, it is necessary to
add some qualifications, most of them of a minor kind. For instance, I is
sometimes used to refer to the second person singular ("I shouldn't disturb him
at this time of night"), we is sometimes used with reference to the first person
singular, the authorial and the royal we ("We are not amused"), to the
second person singular ("How are we today?") and with general reference ("We
should not underestimate the defence of honour"), and they can also be used
with general reference ("They say that ill weeds grow apace"). However, you
and its reflexive-emphatic correspondence yourself stand out in this respect and
differ from their pronominal cousins, both in that their referential changes have
been more generally remarkable over the centuries, and in that such changes are
still in progress. This paper will attempt to chart some of those changes in
modern English with the help of large modern corpora.
1. Introduction
one may well receive a jolt. Our traditional grammars would have us expect the
professor to say see for yourselves and the boy to say repeating myself. A
natural question then is if such linguistic events are due to accidental slips, and
hence of no great interest, or if something is happening to the use of yourself. I
would suggest that among the bewildering mass of uses that yourself can be put
to an interesting pattern can be seen to emerge.
2. Development of you
In order to look into the matter of yourself we shall use material from the
CobuildDirect and the BNC corpora. It is indeed one of the great advantages of
corpora that they can provide us with material that has not yet become
established as mainstream varieties. But before discussing yourself, let us first
consider very briefly the development of the closely related pronoun you. You,
starting out as an object form of the Old English second person plural personal
pronoun , came in late Middle English and Early Modern English times to be
used as a plural subject form and, about the same time, as a singular pronoun,
used both as a subject and an object form (OED You I-II). In Early Modern
English a secondary use developed, Denoting any hearer or reader; hence as an
indef. pers. pron.: One, any one (OED You III:6). From Early Modern English
times onwards you can thus be used in the same way as today, in the plural and
singular and with general reference. Modern examples, from the Cobuild Corpus,
are:
(4) The great thing about meeting someone through an agency is that you find
a lot about them the first time you meet.
(Cobuild: ukmags/03. Text: N0000000375)
Out of the generic use there has developed a more specific use where you mostly,
but not always, means I or we (and which is not in the OED; cf. Quirk et al.
1985:354):
(8) `There's another one in the back as well Mr Giggins added: `For all the
world it looked as though there were people asleep in the car although
when you looked again you realised they had been shot ) (you = I/we)
(Cobuild: times/10. Text: N2000951208)
(9) but I shouldn't think it's probably all that much different <F01> Mm.
<F02> except we used to finish off putting chairs on the tables hands
together and eyes <tc text=laughs> closed you know before you went
home every night. (you = we)
(Cobuild: ukspok/04. Text: S9000000758)
(10) Balancing the lust for a story against the demands of self-preservation,
conquering your own fear and crawling that extra exclusive maggot-
infested mile before remembering you were a mother with responsibilities
back home. Home. It was time to call her husband. Her nervousness, for
which she had no explanation - or, at least, none she could remember -
came flooding back. (you = she)
(Cobuild: ukbooks/08. Text: B0000001117)
3. History of yourself
As for yourself, its early history is partly dependent on that of you, as could be
expected. The Middle English plural ȝe ȝou selve(n) became ȝour(e) self(e) in the
early part of the fourteenth century, and like you the latter form came to be used
with singular reference in late Middle English and Early Modern English (OED
yourself II, originally as an honorific plural). And then towards the end of the
fifteenth century the present s-plural ourselves, yourselves came into existence
and eventually became the standard forms (Wright and Wright 1924: 323; see
also Visser 1962-73 I 455). The forms with -selves are [...] the normal plural
usage by the middle of the sixteenth century (Barber 1997: 159). So the form
yourselves gradually becomes the standard one for use in the plural. If yourself,
on the other hand, was thus originally a plural form, as in
its standard modern use is as a singular reflexive form (OED Yourself II: 6), as in
(12) Now you never thought of yourself as a fan. You were a journalist
covering sports.
(Cobuild: npr/07. Text: S2000901019)
(13) Vu: You used to molest other kids yourself? <p> Mary: Mm-hmm.
(Cobuild: npr/07. Text: S2000911102)1
Let us start with the standard use of yourself, where it refers to a singular
addressee:
(16) Before buying a single share of stock, force yourself to answer one
question: are you reasonably sure that you can keep your money invested
for 7 to 10 years?
(Cobuild: usbooks/09. Text: B9000000404)
(17) If you have just spent £329,000 on a red Ferrari F50 then why not treat
yourself to the perfect number plate?
(Cobuild: times/10. Text: N2000960217)
How then are we to know whether, and how often, yourself in fact refers to a
number of addressees? It is difficult to answer that question as, just as in the case of
you, the speaker or writer may not always have made a distinction between
singular and plural but may be addressing himself indifferently to an audience of
one or several. The context is often of little or no help. However, by an indirect
route we might get an idea of the size of the phenomenon. The reflexives myself,
himself, herself, itself have plural correspondences, ourselves and themselves. If
we assume that the relation between reflexive singulars and plurals is very
approximately constant throughout the system, we can investigate the matter in a
corpus like Cobuild and draw our conclusions. The figures are shown in Table 1.
The discrepancy between 27-28% and 4% suggests that a great number of the
yourself instances have plural reference.
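The reasoning behind this indirect route can be made explicit with a short calculation. The Python sketch below rests on assumptions that go beyond the text: it takes the percentages to express the share of plural forms among all reflexive tokens of a given person, it uses invented token counts rather than the Cobuild figures of Table 1, and the function names plural_share and estimated_plural_yourself are hypothetical. Given the reference share found for the other persons, it estimates how many yourself tokens would have to carry plural reference for the second person to match that share.

```python
def plural_share(singular_count, plural_count):
    """Share of plural forms among all reflexive tokens (0.0 to 1.0)."""
    total = singular_count + plural_count
    if total == 0:
        raise ValueError("no tokens to compare")
    return plural_count / total


def estimated_plural_yourself(yourself_count, yourselves_count, reference_share):
    """Estimate how many 'yourself' tokens have plural reference.

    Assumes the proportion of plural reference is roughly the same for the
    second person as for the reference persons (the assumption stated in the text).
    """
    total = yourself_count + yourselves_count
    expected_plural = reference_share * total
    return max(0.0, expected_plural - yourselves_count)


# Invented counts for illustration only.
reference = plural_share(singular_count=75, plural_count=25)  # 0.25
print(plural_share(singular_count=960, plural_count=40))      # 0.04
print(estimated_plural_yourself(960, 40, reference))          # 210.0
```

Any estimate obtained in this way is only as reliable as the assumption of a constant singular-plural relation across the pronoun system, which the text itself describes as very approximate.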
When yourself can be interpreted as referring to plural addressees, as in
(15) - (17), one further step in its development follows naturally, viz. that when
yourself unambiguously refers to plurals, and plurals only. This step constitutes a
break with traditional descriptions of the word; it is not described in our standard
grammars. Sentence (1) is one example, and some further examples follow.
(19) Well can you sort that out amongst yourself and then after you've done
that then present it to the February sales meeting
(BNC: JN6 142)
(20) If come Valentine's Day you girls found yourself still manless after
deploying every known method to hook that rare breed of muscle, there
was only one place to be.
(Cobuild: ukmags/03. Text: N0000000722)
(22) I have some good news for those of you who didn't manage to pull
yourself together enough to get tickets to Creamfields
(Cobuild: sunnow/17. Text: N9119980502)
One can see the process in operation whereby yourself is supplanting yourselves
in examples like the following, where the speaker is hesitating between the two
forms and deciding on yourself:
(25) So what subjects did you take then at er <ZF1> S<ZF0> School
Certificate? <ZF1> What what <ZF0> what were your pushing yourselves
to yourself towards?
(Cobuild: ukspok/04. Text: S0000000834)
As suggested above, analogy with you is probably at work here. There is also a
slim chance that a few instances of plural yourself, labelled by the OED as
obsolete,3 are a deliberate continuation of the Middle English plural and hence
imitative of Middle English usage. This may be the case in an example like (23),
where the tone is solemn and somewhat archaic.
Examples like (18) - (24) above, where yourself is used with direct
reference to several addressees, are frequent enough in the corpora. (It is hardly
possible to give statistics, because yourself is a very frequent word,4 and evidence
of the number of addressees, if it occurs at all, may occur anywhere in the
context, often at some distance from yourself.) On the other hand, a further step
in the development of the word, where it is still plural but no longer limited to the
second person, is not recorded as frequently. This step could be represented by
cases like
(26) When I went to that stress management course we were told to use
physical resources like deep breathing and actually making yourself sit
down and making yourself go floppy. and let every muscle let it relax.
(BNC: KBF 8025)
(27) Fiona Me and, did you see me and Sarah [at the show] ...
Jessica No. No, cos we were sitting down down by yourself
(BNC: KBL 2998)
This usage is clearly colloquial and scarcely acceptable in the standard language.
The shifts in the usage of yourself that we have seen so far represent a
widening of its sphere of application, from reference to second person singular to
reference to second person singular and plural, and from there, in addition, to
reference to other plurals. It has, in other words, become more general in its
application. By a slightly different route it concurrently acquires a generic sense,
as we shall now see.
When yourself, in the wake of you, was used to refer to singular and plural
addressees indifferently, the semantic distinction between what might be called
"specific addressing", where you means e.g. "you, Benjamin" ("You should avail
yourself of this opportunity"), and "general addressing", where you means "one"
("When you are young, without a job, ... it is your passions that often define
you"), became blurred, particularly in general contexts. Ever since late Middle
English times English has lacked a distinctive generic pronoun, corresponding to
French on and German man,5 but you (and one) have come to fill that place.
Consequently yourself, too, could be used in a generic sense, as in the following
examples:
(29) Knowing how to present yourself # can really make or break you,
Charmaine said.
(Cobuild: oznews/01. Text: N5000950205)
(31) Janet Parsons knows what it is to find yourself a victim of crime. Her
husband, Leslie was killed at the wheel of his lorry by two joyriders
racing each other.
(BNC: K1K 3765)
This very clear step towards generality is also shown by the fact that yourself in
this sense can refer back to generic one:
(33) There's a danger that in a science course one concentrates purely on how
and why nature works, or in an engineering course one concerns yourself
only with how to apply and harness phenomena, not to understand
sufficiently the nature of the phenomena and what are the inherent
limitations.
(BNC: KRW 36)
(35) I'd have loosened my tie, but they had taken it away along with my wallet,
gun, belt and shoelaces. I wondered how easy it would be to hang yourself
with your shoelaces.
(BNC: GVL 1718)
The general phrasing refers to the speaker's specific problem, but both the
general and the specific meaning of yourself are part of the full meaning of the
sentence. The relevant part means both 'to hang oneself with one's shoelaces'
and 'to hang myself with my shoelaces'. This type of usage can be seen as a
transition to the final stage, that where the reference of yourself is exclusively
specific (and not always I or we, as in (39)). Some examples are:
(37) I know I, er in the past when I've felt myself going off to sleep in those
situations, I've been pinching myself and, and really making yourself do
something rather than just sitting there doing nothing, - - - we've read and
heard about people that have gone to sleep on motorways haven't they?
(BNC: KBX 687)
(39) Pete's gone down to the shop and got yourself a bottle whisky.
(BNC: KCT 7304)
As the contexts make clear, these sentences do not mean '... repeating you', '...
making you', etc., and they could not mean '... repeating oneself', '... making
oneself', etc. yourself is clearly specific here.6
The different types of usage that have been presented above could of
course be described as related in several different ways, none of which is
necessarily the correct one. If they are set out as suggested here, the stages in
the development of yourself can be seen as implicational in Figure 1:
This means, for instance, that those who use yourself to refer to the second
person plural (d) will also use it to refer to the second person singular and plural
indifferently (c), but not necessarily to other plurals (e).
5. Conclusions
As we have seen, yourself has changed a good deal through the ages, with
striking results in some variety or varieties of the language. We need not assume,
however, that the development of yourself in the standard language will
inevitably follow suit. This is one line of development among several, in its later
phases very much a minority option. Nevertheless, it is an interesting option in
that it represents the phenomenon of 'pattern neatening', to borrow a phrase
from Jean Aitchison (1991). From being distributionally and semantically quite
different from its corresponding personal pronoun you (deviating in number as
well as type of reference), yourself has become a close reflexive-pronoun copy of
it by getting rid of constraining features in its later stages of development. In
those stages it would appear justifiable to regard yourself as a general-purpose
emphatic-reflexive pronoun.
276 Göran Kjellmer
Notes
1. There is occasional ambiguity between the reflexive and the emphatic use, as
in
You gave yourself to the poor,
meaning either 'You dedicated yourself to the poor' or 'You yourself gave
to the poor'.
2. '... it is not always clear in present-day English whether the second person
pronoun refers to one or more people' (Biber et al. 1999: 330).
3. Yourself I. In plural sense: now replaced by yourselves.
4. There are 6758 occurrences of yourself in Cobuild and 10587 in the BNC.
Yourself: a general-purpose emphatic-reflexive? 277
5. Old English man with that meaning developed into Middle English me and
became obsolete in late Middle English times.
6. A case like
'I shouldn't worry yourself, Dolly,' said Carrie, with apparent innocence
(BNC HHC 240)
is probably different, in that I shouldn't do that is often used to mean 'You
shouldn't do that'; I shouldn't worry yourself then means 'You shouldn't
worry yourself'.
References
Aitchison, J. (1991), Language change: progress or decay? 2nd ed. Cambridge:
Cambridge University Press.
Aston, G., and L. Burnard (1998), The BNC handbook. Edinburgh: Edinburgh
University Press.
Barber, C. (1996), Early Modern English. 2nd ed. Edinburgh: Edinburgh
University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Longman.
BNC = British National Corpus, see Aston and Burnard (1998).
CobuildDirect Corpus, cf. Sinclair (1987).
OED = Simpson, J. A., and E. S. C. Weiner (eds) (1989), The Oxford English
dictionary, 2nd ed. Oxford: Clarendon.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London & New York: Longman.
Sinclair, J. M. (ed.) (1987), Looking up. An account of the COBUILD project in
lexical computing. London and Glasgow: Collins.
Visser, F. Th. (1962-73), An historical syntax of the English language I-III.
Leiden: Brill.
Wright, J., and E. M. Wright (1924), An elementary historical new English
grammar. London, etc.: Oxford University Press.
Aspects of spoken vocabulary development in the Polytechnic of
Wales Corpus of Children's English
Clive Souter
University of Leeds
Abstract
The Polytechnic of Wales Corpus was collected in the late 1970s for the study of
syntactic and semantic development of native English-speaking children aged
between six and twelve. This paper demonstrates that interesting lexical
information can be gleaned from this corpus for EFL instructors and curriculum
designers, even though the size of the corpus (61,000 words) makes it too small
for dictionary development. The Corpus was organised to permit researchers to
observe changes across age groups, and differences between the sexes and
between children of different socio-economic backgrounds. Five investigations
illustrate:
rate of vocabulary growth with age in this Corpus;
the extent to which vocabulary is sex-specific;
differences between sexes in the use of affirmatives and negatives, and in
the use of male and female personal pronouns;
the extent to which vocabulary size is related to socio-economic class;
persistence of errors in applying regular verb endings to irregular verbs.
The Corpus does show active vocabulary size increasing with age, at a rate of
only around 50 words per year (in the limited activities used to elicit speech from
the children). Surprisingly, around half of the words used by each of the sexes are
limited to that sex. Boys make more use of positive expressions, whereas girls
make greater use of negatives. Both sexes use he far more than she. There is no
clear evidence that social class differences influence vocabulary size. Errors
caused by applying regular verb endings to irregular verbs seem to diminish in
children between ages six and eight, and have disappeared by age ten.
Although it is clear that data sparsity influences these results, they are still
useful (and thought-provoking) to curriculum developers and coursebook
designers in EFL, as well as researchers in sociolinguistics of child language.
1. Introduction
The motivation for such a study came from my belief that, until recently,
the Polytechnic of Wales (POW) Corpus had never been used for vocabulary
study. (It was originally collected for the study of children's syntactic and
semantic development.) This omission can perhaps be explained by the small size
of the corpus: only 61,000 words. Lexicographers building dictionaries of adult
vocabulary have had access to far larger English corpora, such as LOB and
Brown, and more recently the British National Corpus and the COBUILD/Bank
of English. For dictionary-building purposes, clearly the POW corpus is nothing
like large enough, and may have been overlooked for this reason alone. However,
it does have great value for researchers into child language development, TEFL
syllabus designers and course-book authors.
The POW Corpus is unique in containing children's spoken language,
organised clearly by age, sex and class, and in being richly syntactically
annotated. I hope to show that there are some interesting features to be uncovered
even in such a small corpus, by modern standards. Such features should hopefully
catch the attention of the designers of school syllabi for English language
learning. In many EU countries, there is pressure on the education system to
introduce foreign language learning earlier in the curriculum, at primary rather
than secondary school age. This is not without difficulty: there are few primary
school teachers trained to teach foreign languages. Space needs to be found in the
curriculum and working week of primary schools. An appropriate syllabus needs
to be designed to engage younger learners. Finally, the impact on the secondary
curriculum needs to be addressed, particularly if some children have been
introduced to a foreign language already, but others haven't. For this reason, a
team at the Freie Universität Berlin in Germany led by Dieter Mindt has also
recently been using the POW Corpus to assess which vocabulary and
grammatical items should be introduced to younger German learners of English,
and in what order. A paper describing their work was also presented by Norbert
Schlüter at the ICAME conference in May 2002 in Göteborg, Sweden.
pronunciation examples
intonation and prosody examples
awareness of accents
This paper will deal primarily with lexical variations between types of speaker,
and illustrate some of the lexical errors produced by younger native speakers of
English.
The POW Corpus was collected by Robin Fawcett and Mick Perkins between
1978 and 1979, for the purpose of studying development of syntax and semantics in
children aged between 6 and 12. The corpus was carefully balanced for age, sex
and socio-economic class. In total, there were 96 child informants, subdivided by
age (within 3 months of 6, 8, 10 and 12 years old), sex (B, G) and class (A, B, C,
or D). Such a division resulted in 32 homogeneous groups of 3 children. Each
group was recorded in a play session (PS) performing a lego building task, and
each child was interviewed (I) separately by the same adult to discuss favourite
games, TV programmes etc.
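The balanced design can be checked with a short enumeration. The sketch below is illustrative only: it is written in Python, and the variable names are assumptions rather than part of the corpus documentation. It lists every combination of age, sex and class, and confirms the group and informant totals given above.

```python
from itertools import product

AGES = (6, 8, 10, 12)            # target ages in years
SEXES = ("B", "G")               # boys, girls
CLASSES = ("A", "B", "C", "D")   # socio-economic class labels
CHILDREN_PER_GROUP = 3

# One homogeneous group per (age, sex, class) combination.
cells = list(product(AGES, SEXES, CLASSES))
n_groups = len(cells)
n_children = n_groups * CHILDREN_PER_GROUP

print(n_groups, n_children)  # 32 96
```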
The recordings were then transcribed orthographically, annotated
prosodically and published in four volumes (Fawcett and Perkins 1980). A
machine-readable version of the corpus was produced in 1980 with full syntactic
analysis for each utterance, using Fawcett's Systemic Functional Grammar
(Fawcett 1981). This version omitted the prosodic annotation, and separated the
speech of each individual child into one text file. For example, the file 6ABICJ
contains the speech of a six-year-old, social class A boy in the interview situation,
whose initials are CJ. The corresponding utterances during the play session for
this individual are in the file 6ABPSCJ (but not those of his playmates). This is
beneficial for our present purpose, but does make analysis of dialogue difficult.
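The file-naming scheme lends itself to automatic decoding, which is convenient when subcorpora have to be selected by age, class or sex. The following is a minimal sketch, assuming that every file name follows the pattern suggested by the two names quoted above (age, class letter, sex letter, I or PS, then initials). The pattern, the class PowFileName and the function parse_pow_file_name are hypothetical; actual file names in a given distribution may differ.

```python
import re
from typing import NamedTuple


class PowFileName(NamedTuple):
    age: int            # 6, 8, 10 or 12
    social_class: str   # "A" to "D"
    sex: str            # "B" (boy) or "G" (girl)
    situation: str      # "I" (interview) or "PS" (play session)
    initials: str       # the child's initials


# Pattern inferred from the names 6ABICJ and 6ABPSCJ; an assumption.
_NAME_RE = re.compile(r"(6|8|10|12)([ABCD])([BG])(PS|I)([A-Z]+)")


def parse_pow_file_name(name: str) -> PowFileName:
    """Split a POW-style file name into its descriptive fields."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised POW file name: {name!r}")
    age, social_class, sex, situation, initials = match.groups()
    return PowFileName(int(age), social_class, sex, situation, initials)


for name in ("6ABICJ", "6ABPSCJ"):
    print(name, parse_pow_file_name(name))
```

With such a function, the files for one cell of the design (for instance all six-year-old class A boys) can be gathered by filtering on the parsed fields rather than by hand.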
The original machine-readable version contains around 65,000 words, but
the corpus is now more commonly distributed as the Edited Polytechnic of Wales
Corpus (EPOW: O'Donoghue 1991). EPOW contains only 60,784 word-forms
(3,730 word-types), because the texts have been edited to remove typographical
errors which, for example, led to part-of-speech categories wrongly being counted
as words. This total corresponds to around 11,000 utterances.
The corpus was initially collected and used for the study of the linguistic
development of older children (Perkins 1983). It was later used for the machine
learning of probabilistic models of lexis and grammar for computer parsing
programs (O'Donoghue 1993, Weerasinghe 1994, Souter 1989, 1996).
4. Investigations
Three investigations are presented here into vocabulary range by age, across the
sexes, and by socio-economic class. We then investigate errors in use of irregular
verbs, and the extent to which speakers develop their use of syntactically
ambiguous words.
We can use the corpus to investigate how children's vocabulary expands with
age. Taking the part-of-speech tagged version of the EPOW corpus as our data
source, we can extract the number of unique word + word-tag pairs for each age
group. This is achieved using standard unix operating system commands on the
text files of the corpus, once they have been verticalised with only one word +
word-tag per line. For instance, the unix command
and shows that there are 1,821 unique word + word-tag pairs used by the entire
group of six-year-olds. Extracting the same for the older children gives us an
indicative growth rate over each two-year span of around 6% (Table 1). Note that
we are not talking about growth rates and vocabulary sizes for individuals here,
but of the combined vocabulary of 24 children in each age group. It does however
give us some indication of the typical upper bound for word + word-tag pairs
used by children. The number of unique word-forms is somewhat lower: the
number of unique words in the corpus is 3,730, compared with 4,618 unique word
+ word-tag pairs.
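The counting step can be illustrated independently of any particular command. The following is a minimal Python sketch and not the command used in this study: it assumes verticalised input with one word and one tag per line, separated by whitespace. The function name unique_pairs and the sample lines are hypothetical.

```python
def unique_pairs(lines):
    """Return the set of distinct (word, tag) pairs in verticalised text."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, tag = fields
        pairs.add((word, tag))
    return pairs


# Hypothetical miniature sample in "WORD TAG" format.
sample = ["LIKE M", "LIKE P", "HOUSE H", "LIKE M", ""]

pairs = unique_pairs(sample)
word_types = {word for word, _tag in pairs}

print(len(pairs), len(word_types))  # 3 2
```

The sample also shows why the count of word + word-tag pairs exceeds the count of word-forms: a form such as LIKE contributes one word type but two pairs.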
From intuition, we may expect that vocabulary size should grow with age for
older children. We might also expect that the corpus had been carefully controlled
so that there were equal numbers of word-forms in each age cohort, but this was
not the case. As can be seen from the third row of Table 1, there are more tokens
in each cohort as the ages increase.
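Because the cohorts differ in size, type counts are easier to compare when they are read off at equal token counts. One way of doing this is to record the number of types seen after every fixed number of tokens. The sketch below is an assumption about how such a curve could be computed, not a description of how the figures in this paper were produced; the function name and the toy token list are hypothetical.

```python
def type_token_curve(tokens, step):
    """Return (tokens_read, types_seen) after every `step` tokens."""
    seen = set()
    curve = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position % step == 0:
            curve.append((position, len(seen)))
    return curve


# Toy data: eight tokens, five distinct types.
toy_tokens = "a b a c b d a e".split()
curve = type_token_curve(toy_tokens, step=2)

print(curve)  # [(2, 2), (4, 3), (6, 4), (8, 5)]
```

Curves computed in this way for each age group can be compared at any token count that all cohorts reach, which removes the effect of unequal cohort sizes.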
[Figure: number of word types (y-axis) plotted against number of tokens (x-axis) for each age group: Age 6, Age 8, Age 10, Age 12]
Using similar unix commands, we can easily separate the data by sex and age.
Table 2 shows the range of word + word-tag pairs used by boys and girls.
Although the overall total for the corpus for each sex is almost the same, this
parity is only maintained in the subcorpus for eight- and ten-year-olds. Six-year-
old boys appear to have a significantly smaller vocabulary than six-year-old girls,
whereas the reverse is the case for twelve-year-olds, at least to judge from the
POW corpus.
What is interesting to observe here, and which is made more obvious in Table 3,
is the number of word types being used only by boys, or only by girls.
There are 3,730 unique words (word types) being used in the corpus as a whole.
Table 3 columns 1 and 2 show how many of these are used specifically by just the
boys or just the girls. Columns 3-6 show how many types are used by the six-
year-olds (of either sex), eight-year-olds, ten-year-olds, and twelve-year-olds,
respectively. Columns 3-6 are indicative of fairly steady vocabulary growth in
children aged between six and twelve.
Boys use 2,491 words and girls 2,487, which are remarkably similar totals.
However, only around 1,240 of the words in the corpus are being used by both
sexes, and the other half is specific to the speaker's sex. We might perhaps expect
that the overlap between sexes would increase if we had a larger corpus, or if the
speakers were adult, but perhaps this distribution is demonstrating a genuine
socio-linguistic phenomenon as well.
We can explore the words used only by boys or only by girls by deleting
those used by both from an alphabetically sorted lexicon extracted from the
corpus. Appendix 2 contains such words (beginning with A) extracted from the
corpus.
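The separation into shared and sex-specific vocabulary amounts to simple set operations. The sketch below is a hedged illustration in Python: the two short word lists are a hypothetical sample built from words mentioned in this paper, and the function names are assumptions. The second function applies inclusion-exclusion to the totals quoted above, on the assumption that all three totals count the same kind of unit (word types).

```python
def split_by_sex(boys_words, girls_words):
    """Return (boys-only, shared, girls-only) word lists, each sorted."""
    boys, girls = set(boys_words), set(girls_words)
    return sorted(boys - girls), sorted(boys & girls), sorted(girls - boys)


def shared_type_count(n_boys, n_girls, n_total):
    """Inclusion-exclusion: shared = boys + girls - (boys or girls)."""
    return n_boys + n_girls - n_total


# Hypothetical miniature sample, for illustration only.
boys_sample = ["AIRCRAFT", "AIRPORT", "ASTRONAUT", "HOUSE", "DOOR"]
girls_sample = ["AIR-HOSTESS", "ANIMAL-MAGIC", "HOUSE", "DOOR"]
boys_only, shared, girls_only = split_by_sex(boys_sample, girls_sample)
print(boys_only, shared, girls_only)

# Totals quoted in the text: 2,491 (boys), 2,487 (girls), 3,730 (all).
n_shared = shared_type_count(2491, 2487, 3730)
print(n_shared, 2491 - n_shared, 2487 - n_shared)  # 1248 1243 1239
```

On these assumptions the totals imply 1,248 shared types, which is in line with the figure of around 1,240 given above.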
An obvious area of difference is in the use of proper nouns. Male names
are prominent in the 'boys only' list, and female names in the 'girls only' list. The
corpus also displays stereotypical examples for favourite toys, careers, games etc
for each sex. Beyond this, we have to speculate as to whether the appearance of a
word in one column or the other is due to data sparsity, or whether it really is
indicative of a difference between the sexes.
There is evidence for both, I would argue. Data sparsity is evidenced by
the occurrence of amusement twice in boys' speech (but not in girls'), and
amusements once in girls' speech (but not in boys'). Boys talk about aeroplane,
aircraft, air-force and airport, whereas only air stewardess and air hostess
feature on the girls side. Boys talk about antennas and airholes, action men and
astronauts, whereas girls talk about animal magic, all creatures great and small,
and Alice in Wonderland.
Clearly, in a list such as Appendix 2, many of the items occur only once in
the corpus. If we instead consider the most frequent words used by boys and girls,
can we see any differences? Appendix 3 contains the 100 most frequent word +
word-tag pairs in the boys' and girls' sub-corpora. If we consider the most
common words which express affirmation or negation, we can see a clear
difference between the sexes. In the POW Corpus, words like yes and no are
labelled with the part of speech F (formula). Given that the corpus contains equal
quantities of text spoken by each sex, boys tend overall to use more positives than
girls do, whereas girls use more negative words, as illustrated in Table 4. There
are, of course, other ways of expressing affirmation and negation, but these are
the ones found most frequently in the corpus. (The use of no as a quantifier has
been omitted from the table.) Either this reflects a general trend between the sexes
in children's spoken language, or it is an artifact of the tasks performed in corpus
collection. Perhaps Lego building elicits more positive responses from boys, and
more negative responses from girls. Perhaps being interviewed by a friendly male
adult has an impact.
In line with the data for all the children, regardless of sex, the personal pronoun
he occurs far more frequently than she. One might expect this in the boys'
language (239 instances of he against only 56 instances of she), but even the girls
use he (178 occurrences) more frequently than she (123 occurrences).
Few clear patterns are evident. Vocabulary range is not always highest for the
class A children, although it is for the ten- and twelve-year-olds. For eight-year-
olds, it is the class D children who have the widest vocabulary. Given the
judgmental approach to allocation of socio-economic class labels, it is perhaps
not worth exploring this area any further.
Running a spelling checker on the Edited POW Corpus, and ignoring the many
proper nouns, we can find some examples of native learner errors, such as regular
past tense forms for irregular verbs. Table 6 shows alphabetically which errors of
this kind are found in the corpus, and the source file in each case. One six-year-
old girl is the source of many of these. There are only 11 such errors among the
six-year-olds. Eight-year-olds have produced only four, and thereafter it appears
that these children have learned to use the irregular forms correctly.
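As an alternative to a spelling checker, candidate errors of this kind could be searched for directly. The sketch below is a rule-based illustration in Python and not the procedure used in this study: it builds the regular past tense form of each verb in a small list of irregular verbs and reports any token that matches. The verb list, the function names and the sample tokens are hypothetical, and are not taken from Table 6.

```python
def regular_past(base):
    """Form the regular past tense of a verb (simplified rule)."""
    return base + "d" if base.endswith("e") else base + "ed"


def find_overregularised(tokens, irregular):
    """Return (token, base, expected_past) for each regularised form found."""
    expected = {regular_past(base): (base, past)
                for base, past in irregular.items()}
    return [(token, *expected[token.lower()])
            for token in tokens if token.lower() in expected]


# Hypothetical verb list and utterance, for illustration only.
irregular_verbs = {"go": "went", "make": "made", "catch": "caught"}
utterance = ["I", "goed", "home", "and", "maked", "it"]

hits = find_overregularised(utterance, irregular_verbs)
print(hits)  # [('goed', 'go', 'went'), ('maked', 'make', 'made')]
```

A search of this kind only finds forms derived from the verbs listed, so it complements rather than replaces a check of all unrecognised word-forms.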
e) Lexical ambiguity
One of the reasons for using the tagged POW corpus in these investigations was
to discover whether there was an increase in the range of syntactic uses of a word
with age, between the ages 6-12. Do children of these ages know how to use the
word cut as a noun, verb, and adjective? Table 7 shows the number of lexically
ambiguous word types used by each age group, as a percentage of the total
number of types of word + word-tag pairs. This proportion remains remarkably
static across the four age groups. Perhaps children have already learned all such
syntactic differences before the age of six, but I would think that unlikely. More
probably, the corpus elicitation tasks were too constrained to demonstrate this
feature adequately.
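The proportion described here can be computed from the set of word + word-tag pairs for an age group. The following Python sketch is one possible reading of that calculation, under the assumption that a word type counts as lexically ambiguous when it occurs with more than one tag. The function name is hypothetical, and the ten sample pairs are a small selection from the frequency lists in the appendices, not a complete subcorpus.

```python
from collections import defaultdict


def ambiguity_rate(pairs):
    """Return (ambiguous word types, percentage of all word + tag pairs)."""
    tags_by_word = defaultdict(set)
    for word, tag in pairs:
        tags_by_word[word].add(tag)
    ambiguous = sorted(word for word, tags in tags_by_word.items()
                       if len(tags) > 1)
    n_pairs = sum(len(tags) for tags in tags_by_word.values())
    percentage = 100.0 * len(ambiguous) / n_pairs if n_pairs else 0.0
    return ambiguous, percentage


# Small illustrative selection of word + tag pairs.
sample_pairs = [
    ("LIKE", "M"), ("LIKE", "P"), ("THERE", "AX"), ("THERE", "STH"),
    ("ON", "AX"), ("ON", "P"), ("HOUSE", "H"), ("DOOR", "H"),
    ("I", "HP"), ("THE", "DD"),
]

ambiguous, percentage = ambiguity_rate(sample_pairs)
print(ambiguous, percentage)  # ['LIKE', 'ON', 'THERE'] 30.0
```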
5. Conclusions
The five investigations have hopefully illustrated some of the possibilities for
discovery of distinguishing features of children's vocabulary development.
Whilst in some areas it is clear that the data are too sparse (to inform the
compilation of a children's dictionary, for example), there are others which are
more promising and perhaps disturbing, from the point of view of syllabus and
course material designers. The POW corpus evidence suggests that many of the
words we use between the ages of 6-12 are not regularly used by the opposite sex
in similar contexts. This feature is worth a good deal more investigation. Growth
in vocabulary with age has also been demonstrated, although perhaps not at a rate
of increase we might expect. It would be interesting to compare the vocabulary of
children aged 6-12 with that of adults in the better known corpora, but the limited
tasks for speech collection used in the POW Corpus would confound a
straightforward comparison.
For syllabus and coursebook designers, there are also some warnings to be
made with respect to the Welsh dialect features of the POW Corpus. Although the
collectors sought to minimise Welsh language influence in the data, there are
some dialectal features which show through quite strongly. Two of these are the
disproportionately high occurrence of tag questions (including the use of isn't it
without person agreement with the main clause verb), and the use of Welsh
dialect locative adverbs by-here and by-there, instead of here and there, which
becomes more prevalent in the older age groups.
Further warnings should be made regarding the domain-based lexis. The
most frequent common nouns in POW are house, door, man, window and car,
because of the Lego-building task which the children were set.
From the point of view of syntactic structures, the POW corpus illustrates
just how ill-behaved speech can be, especially when uttered by children.
Around 30% of the constituents in the parsed corpus are lacking a grammatical
head, mainly because of ellipsis or interruption, so there is a wide range of
grammatical structures not typically found in written corpora.
The POW Corpus is a small corpus for lexical work, but it still reveals
some interesting comparative and quantitative linguistic features of children of
different ages and across the sexes. It is almost unique as a lexico-grammatical
resource for children's spoken language. I have not tried to show all such
features, by any means, but I hope to have demonstrated that it is worth
exploring, particularly if you have an interest in learning and teaching language.
References
Atwell, E., P. Howarth and C. Souter (2003), The ISLE Corpus: Italian and
German spoken learners' English, ICAME Journal 27: 5-18.
Fawcett, R.P. (1981), Some proposals for systemic syntax. Journal of the
Midlands Association for Linguistic Studies (MALS), 1.2, 2.1, 2.2 (1974-
76). Re-issued with light amendments, 1981, Department of Behavioural
and Communication Studies, Polytechnic of Wales.
Fawcett, R.P. and M. Perkins (1980), Child language transcripts 6-12 (with a
preface, in 4 volumes). Department of Behavioural and Communication
Studies, Polytechnic of Wales.
Granger, S. (1993), The International Corpus of Learner English, in: J. Aarts, P.
de Haan and N. Oostdijk (eds), English language corpora: design,
analysis and exploitation. Amsterdam: Rodopi, 57-69.
Granger, S. (ed.) (1998), Learner English on computer. London and New York:
Addison Wesley Longman.
Menzel, W., E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton and C.
Souter (2000), The ISLE Corpus of non-native spoken English, in: M.
Gavrilidou, G. Carrayannis, S. Markantionadou, S. Piperidis and G.
Stainhaouer (eds), Proceedings of LREC2000: Language Resources and
Evaluation Conference, vol. 2, 957-964. European Language Resources
Association.
O'Donoghue, T.F. (1991), Taking a parsed corpus to the cleaners: the EPOW
corpus, ICAME Journal 15: 55-62.
O'Donoghue, T.F. (1993), Reversing the process of generation in Systemic
Grammar. Ph.D. thesis. School of Computer Studies, Leeds University.
Perkins, M.R. (1983), Modal expressions in English. London: Frances Pinter.
Souter, C. (1989), The COMMUNAL Project: Extracting a grammar from the
Polytechnic of Wales corpus, ICAME Journal 13: 20-27.
Souter, C. (1996), A corpus-trained parser for systemic-functional syntax. Ph.D.
Thesis. School of Computing, University of Leeds.
Weerasinghe, A.R. (1994), Probabilistic parsing in Systemic Functional
Grammar. Ph.D. thesis. School of Computing Mathematics, University of
Wales College of Cardiff.
68 'S OX 73 ON AX 71 UP AX 73 GET M
67 THERE STH 71 KNOW M 70 NOT N 71 ALL DQ
67 IF B 70 WHAT HWH 70 'VE OX 66 WITH P
66 NOT N 68 THEM HP 70 'S OX 66 'VE OX
64 THEM HP 68 IF B 67 NOW AX 64 LOOK M
62 SOME DQ 67 'S OX 67 MAKE M 63 OUT AX
61 THIS DD 64 NOT N 64 HOUSE H 63 LIKE M
60 HERE AX 57 IS OM 59 LITTLE AX 62 THESE DD
59 MAN H 56 LIKE P 58 WHAT HWH 62 HERE AX
58 DO M 55 WITH P 58 GO M 61 IS OM
57 TO P 55 NOW AX 58 GET M 61 BY-THERE AX
56 UP AX 55 'M OX 57 ON AX 59 LITTLE AX
56 DOOR H 51 WAS OM 57 IS OM 58 ROOF H
55 WHAT HWH 51 PLAY M 57 GOOD AX 58 IF B
55 BUT & 50 AND-THEN & 56 HERE AX 58 DO M
54 HOUSE H 46 HAVE-TO XM 55 WITH P 57 WAS OM
53 'VE OX 46 AN' & 54 LIKE P 56 UP AX
51 ME HP 45 THINK M 54 LIKE M 55 ONES HP
51 BE M 45 SHE HP 53 THESE DD 55 FOR P
50 ALL DQ 45 GET M 53 BUT & 54 JUST AI
49 AND-THEN & 45 FOR P 53 AND-THEN & 51 GOING-TO X
48 WAS OM 45 COULD OM 51 DOOR H 51 BUILD M
48 DO O 44 WHERE AXWH 51 BY-THERE AX 50 SOME DQ
47 JUST AI 43 BUT & 48 WHEN B 50 HAVEN'T OXN
46 ONE DQ 42 OUT AX 48 THINK M 49 TO P
46 FOR P 42 DOOR H 47 WINDOWS H 48 CAN'T OMN
46 COME M 41 UP AX 47 JUST AI 48 'S OX
45 WANT M 40 CAN'T OMN 46 FOR P 47 BUT &
45 'LL OM 39 TO P 45 ONE DQ 45 MAKE M
44 LITTLE AX 38 LIKE M 43 IN AX 44 HAVE-TO XM
44 GOOD AX 37 WHEN B 43 'RE OM 44 GOT-TO XM
43 HAVEN'T OXN 37 NEED M 42 WINDOW H 43 WANT M
42 CAR H 37 DO O 42 NEED M 43 RED AX
41 NEED M 36 LITTLE AX 41 TO P 42 ONE DQ
40 CAN'T OMN 36 GOT-TO XM 41 ROOF H 42 OFF AX
37 LIKE M 36 BUS-STOP H 40 SOME DQ 41 MY DD
37 GOING-TO X 35 TWO DQ 40 DO O 40 PLAY M
36 THINGS H 35 SOME DQ 39 GOING-TO X 40 NEED M
36 PLAY M 35 LEGO HN 38 HAVE-TO XM 39 SHE HP
36 MINE HP 35 IN AX 37 YEH F 39 OR &
35 WAS OX 35 BIG AX 37 BECAUSE B 39 IN AX
1 AIR-FORCES 2 AGES
1 AIRCRAFT 1 AHEAD
1 AIRHOLE 1 AHEAD-OF
1 AIRPORT 3 AIR-HOSTESS
1 AL 1 AIR-STEWARDESS
1 ALARM 6 ALEX
1 ALF 1 ALICE-IN-WONDERLAND
1 ALFRED-HITCHCOCK 3 ALIVE
1 ALL-OF-A-SUDDEN 1 ALL-ABOARD
1 ALL-THE-WAY 1 ALL-CREATURES-GREAT-AND-SMALL
1 ALL-TOGETHER 1 ALL-RIGHT-THEN
2 ALMOST 2 ALLEY
1 ALRIGHT-ALRIGHT 1 ALONE
2 AMUSEMENT 1 ALRIGHT-THEN
2 AMUSING 1 ALTOGETHER
2 ANDERSON 1 AM...
1 ANDRE 1 AMN'T
3 ANGRY 1 AMOUNT
1 ANIMAL-SNAP 1 AMUSEMENTS
1 ANTENNA 1 AN-ALL
1 ANY-MORE 1 AND'
1 ANY-WHERE 1 AND-FILEY
4 ANYMORE 3 ANDREA
1 ANYONE 2 ANGELS
2 ANYWHERE 1 ANGLES
2 APART-FROM 1 ANIMAL-MAGIC
1 APPLE 1 ANY-HOW
1 ARBEE 1 ANY-RATE
1 ARCADE 1 ANY-WAY
1 AREA 1 ANYHOW
2 ARGENTINA 2 ARCHES
1 ARGUED 1 ARGUE
1 ARROW 1 AROUNDS
2 ARROWS 1 ARRESTED
2 ART 1 AS-FAR-AS
2 ARTIST 1 AS-IF
2 AS-WELL-AS 2 AS-LONG-AS
1 ASTRONAUT 1 AS-SOON-AS
2 ASTRONOMY 1 ASKED
1 AT-FIRST 1 ASLEEP
1 AT-LAST 2 ASSEMBLY
1 ATH-LYMPICS 1 ATTACHED
1 ATTACK 1 ATTENTION
2 ATTACKING 1 AVE
1 AWKWARD 1 AW-MAMMY
Boys Girls
Frq Type Tag Frq Type Tag
1190 I HP 1489 I HP
1186 THE DD 1064 THE DD
942 A DQ 959 A DQ
800 IT HP 801 AND &
749 AND & 727 YOU HP
571 YOU HP 725 IT HP
571 'S OM 602 'S OM
565 WE HP 552 WE HP
561 YEAH F 477 THAT DD
543 THAT DD 361 THEY HP
354 GOT M 337 GOT M
288 IN P 336 YEAH F
274 NO F 311 NO F
249 THEY HP 282 TO I
241 TO I 266 IN P
240 PUT M 242 PUT M
239 HE HP 232 THERE AX
212 OF VO 227 THERE STH
209 THIS DD 223 DON'T ON
193 ON AX 214 YES F
190 ONE HP 211 ONE HP
188 DON'T ON 202 MY DD
179 CAN OM 200 LOOK M
173 ON P 197 HAVE M
167 'LL OM 194 CAN OM
166 THERE AX 193 ON P
156 THERE STH 188 THIS DD
151 MY DD 188 OF VO
149 LOOK M 180 KNOW M
149 DO M 178 HE HP
148 BE M 174 NOT N
Roumiana Blagoeva
Abstract
1. Introduction
3. The corpora
A learner corpus is very different from a native corpus because of the nature of
the material collected. A native corpus contains data from a natural language and
can be used on its own for the investigation of characteristic features of this
language. A learner corpus presents evidence of an interlanguage; and an
Demonstrative reference as a cohesive device 299
4. Theoretical framework
noted here, namely that with extended reference and with reference to a fact
only singular forms can be used.
In English the use of demonstratives to refer to extended text, including
text as fact [...] applies only to the singular forms this and that used without
a following noun (Halliday and Hasan 1976: 66). Whereas extended reference
differs from usual instances of reference only in extent (the referent is more than
just a person or object, it is a process or sequence of processes: grammatically, a
clause or string of clauses, not just a single nominal), text reference differs in
kind: the referent is not being taken at its face-value but is being transmuted into
a fact or report (Halliday and Hasan 1976: 52).
In Bulgarian, as Krastev (1992: 78) notes, the singular form tova (near),
but not onova (remote), has a special place in the system and is one of the most
frequent and most economical words in the language. Only the demonstrative
tova can replace any word, combination of words, phrases and even whole
stretches of text. Thus in Bulgarian only one form of the singular demonstratives
performs the functions of extended reference and reference to fact, which in
English are shared between the two singular forms.
Using WordSmith Tools (Scott 1997), frequency lists and concordances were
produced for all the investigated items in each of the four corpora. The raw data
were then examined to exclude all examples that were irrelevant to the present
study, namely cases where that was used as a conjunction or relative pronoun,
and cases where it was used as an adverb in front of an adjective to express the
degree of a quality. The total number of tokens that were extracted from the
corpora after these first searches is shown in Table 2.
Most often a first step in a quantitative study of any language feature is to look at
the number of occurrences of the items examined, which can give a preliminary
idea of the spread of the feature through entire collections of texts. So when
examining the cohesive function of demonstratives it seems reasonable to start
with a comparison of the total number of tokens found in the corpora. A first
glance at the figures in Table 2 shows a striking similarity between the
frequencies of this/these and that/those in Corpus 1 and Corpus 2. Moreover, the
frequencies are nearly twice as high as those in Corpus 3 (the BNC) and slightly
higher than those in Corpus 4 (the Bulgarian language corpus).
However, these data could be misleading and could bring us to the rash
conclusion that there is no over- or underuse of demonstratives by the Bulgarian
learners of English. Instead, it may be that the use of demonstratives is
determined by the different text types represented in the learner and non-learner
corpora, as their number is greater in the argumentative essays than in the BNC
sub-corpus and the Bulgarian language corpus, both of which consist of other
types of non-fiction texts. However, if we make a distinction between near and
remote types of demonstratives and look at each of these types separately, the
picture changes, as shown in Tables 3 and 4.
(1) I know a little boy, for example, whose father is a scientist. This nine-year
old boy reads only Science Fiction and I can never persuade him to read a
fairy tale or fable or a folk tale. He is not interested even in books about
famous adventurers, about sailors and pirates, books which I read with
great interest and pleasure when I was his age. That boy reads only about
robots, machines, spacecraft, numbers. I agree that Science Fiction
somehow stirs children's imagination but it creates a world controlled by
machines, rather than one controlled by human beings. Probably the
science fiction stories will be the fairy tales of the new era. (BUCICLE)
The other typical group of examples observed involves the use of demonstratives
to refer to extended text, including text as fact. In English this function applies
only to the singular forms this and that used without a following noun (see
Halliday and Hasan 1976: 66) as in:
In English the choice of this or that to refer to something that has been said before
is clearly related to that of 'near (the speaker)' versus 'not near'; what I have
just mentioned is, textually speaking, 'near me' whereas what you have just
mentioned is not (Halliday and Hasan 1976: 60). At the same time the notion
of proximity has various interpretations; and in such cases there is no very clearly
felt distinction between this and that (Halliday and Hasan 1976: 61).
In Bulgarian the demonstrative tova (singular, neuter, near), which
according to most traditional Bulgarian grammars (Krastev 1992; Pashov 1994;
Andreichin et al. 1998) expresses the idea of near in time and space, has a very
wide spectrum of uses and has a special place in the system of Bulgarian
demonstratives. As mentioned above in Section 4, apart from its use as pronoun
or determiner to refer to any singular neuter object or person, it is the only
demonstrative that can convey extended reference relations in a text. Here the
distinction near/remote is lost and the reference of tova is derived from the
immediate context in or outside the textual world irrespective of the idea of
proximity. Thus in this particular function its use coincides with both this and
that in English and we may expect a great overuse of this by Bulgarian learners.
The functions of onova (singular, neuter, remote) are always either Head
or Modifier so it can never be used in extended reference and reference to fact;
and as the data demonstrate (Table 6) it is rare in Bulgarian. Yet, this infrequent
use of onova does not cause an underuse of its English equivalent that by the
Bulgarian learners. On the contrary, Table 2 shows a clear overuse of that in
Corpus 1 in comparison with Corpora 2 and 3. It is true that the total number of
singular forms is nearly the same in the learner material, the native-speaker
student writing and the Bulgarian language corpus, as shown in Table 5 and this
at first glance may blur some differences.
One possible reason could be the fact that most teaching materials used in
Bulgaria overlook the distinction between the English counterparts of tova and
onova and learners are left with the impression that it is unimportant and that both
this and that, having a very wide range of referents, could be used
indiscriminately to point to any word, phrase or longer stretch of text.
The lower frequency of singular forms in Corpus 3 than in the other
corpora could be attributed to the differences between the text types involved.
One could argue that since the distinction near/remote in the use of the
singular forms is not as clear-cut in English as in Bulgarian, the
interchangeability of this and that is permissible and might not lead to serious
communication breakdowns. Still, it is my view that it could interfere with a
receiver's comprehension of a text and could contribute to the production of
unclear textual references by learners of English. In the following example the
choice of this or that would only slightly change the point of view of the writer:
That is probably preferred because the fact it refers to in the preceding sentence is
not explicitly linked to the personal feelings of the writer; it is perceived rather as
being officially stated by a third party. In such cases this could easily substitute
for that and make the whole statement more involved.
But sometimes this tendency goes too far and in their desire to vary their
style and avoid repetition learners use this and that as absolute synonyms.
Consider the following examples from BUCICLE:
(5) [...] my opinion is that dreaming and imagination are still part of our
society. Even if it weren't so, I do not see what the problem is. The world
is changing, developing all the time and if it does not need these, it gets rid
of them as something useless, that is just the way it goes. And if someone
cannot live without dreams they either adapt to the new conditions or keep
dreams in their souls which is a question of personal choice.
In (5) it is unclear why the referents of these (dreaming and imagination) are
perceived as being closer to the writer of the passage than the fact that is referred
to by means of that. The idea of proximity is even more confused in (6) where
one and the same fact is referred to by both this and that in the same sentence:
(6) But is it really so, or it is just another old-dated "fairy tale" we are taught
to believe in and which is so trivial that we have learned it by heart. We
fight for freedom, we strive for equality, we talk about democracy and
having equal rights, but that is just an illusion, with which our minds are
washed away and we are all blind, because we believe in this. Human
beings are not equal. Inequality is determined by history. History is the
reflection of our lives.
6. Conclusions
The observations of the data presented in this paper demonstrate: (1) an overuse
of demonstratives in argumentative writing by both Bulgarian learners of English
and native-speaker students; (2) a tendency for Bulgarian learners to use
that/those in spite of the very low frequency of occurrence of their Bulgarian
equivalents; (3) a similar frequency of this/these in Bulgarian learner writing and
English native-speaker student writing; (4) a similar frequency of this/these and
their Bulgarian equivalents.
These findings shed light on some aspects of Bulgarian learner discourse
that are still unexplored and need further investigation. At this stage of the study
some of the similarities between the production of Bulgarian learners and native
speaker students might point to an influence on learner production by the nature
of the text type. A task-based learner corpus requiring students to produce one
particular text type might not reveal features of other text types. Yet, an academic
essay gives students freedom to write what they want, and more importantly what
they can, on a variety of topics, and in this sense a corpus of this kind can tell the
researcher a lot about learners' abilities to produce coherent texts in any real-life
References
Helge Dyvik
University of Bergen
Abstract
The paper reports from the project From Parallel Corpus to Wordnet at the
University of Bergen (2001–2004), which explores a method for deriving wordnet
relations such as synonymy and hyponymy from data extracted from parallel
corpora. Assumptions behind the method are that semantically closely related
words ought to have strongly overlapping sets of translations, and words with
wide meanings ought to have a larger number of translations than words with
narrow meanings. Furthermore, if a word a is a hyponym of a word b (such as
tasty of good, for example), then the possible translations of a ought to be a
subset of the possible translations of b.
Based on assumptions like these, a set of definitions is formulated,
defining semantic concepts such as synonymy, hyponymy, ambiguity and
semantic field in translational terms. The definitions are implemented in a
computer program which takes words with their sets of translations from the
corpus as input and performs the following calculations: (1) On the basis of the
input different senses of each word are identified. (2) The senses are grouped in
semantic fields based on overlapping sets of translations, such overlap being
assumed to indicate semantic relatedness. (3) On the basis of the structure of a
semantic field a set of features is assigned to each individual sense in it, coding
its relations to other senses in the field. (4) Based on intersections and inclusions
among these feature sets a semilattice is calculated with the senses as nodes.
According to our hypothesis, hyponymy/hyperonymy, near-synonymy and other
semantic relations among the senses now appear through dominance and other
relations among the nodes in the semilattice. Thus, the semilattice is supposed to
contain some of the semantic information we want to represent in wordnets. (5)
In accordance with this assumption, thesaurus-like entries for words are
generated from the information in the semilattice.
In the project these assumptions are tested against data from the English-
Norwegian Parallel Corpus ENPC (Johansson 1997).
312 Helge Dyvik
1. Introduction
Parallel corpora, in which original texts are aligned with their translations into
another language, are a rich source of semantic information. Translations come
about when translators evaluate the degree of interpretational equivalence
between linguistic expressions in specific contexts. In many ways such
evaluations, made without any theoretical concerns in mind, seem more reliable
as sources of semantic information than the careful paraphrases of the semanticist
or the meaning descriptions of the lexicographer. Assuming that this is the case,
can we then retrieve some of the semantic properties of expressions by going
backwards from the network of translational relations in situated texts? Can we
reconstruct semantic properties from the translational properties manifested in a
parallel corpus?
The idea that semantic information can be gleaned from multilingual data
has been explored by others. Resnik and Yarowsky (1997), discussing word sense
disambiguation, suggest that in distinguishing between senses it may be fruitful to
restrict attention to such distinctions as are lexicalised differently in other
languages. Nancy Ide has explored the connections between semantics and
translation in several papers; in Ide et al. (2002) the authors study versions of the
same novel in seven languages and attempt to identify subsenses of words by
considering how the translations of a given word cluster in the six other texts.
The output of the method presented here is a structure containing some of the
information which we find in wordnets. A wordnet is a semantically structured
lexical database. The Princeton WordNet (Fellbaum 1998), which has been built
manually, distinguishes between the senses of words and groups senses across
words into synsets according to near-synonymy. Pointers between such synsets
express semantic relations like hypero- and hyponymy, antonymy, and holo- and
meronymy. Wordnets for various European languages were developed within the
project Eurowordnet (http://www.illc.uva.nl/EuroWordNet/).
Wordnets are important resources for many applications within language
technology. They can be used in meaning-based information retrieval (searching
for concepts rather than specific word forms), in logical inference (if a document
mentions dogs, a wordnet allows the inference that it is about animals), in word
sense disambiguation (providing the search space of alternative meanings), etc.
A related kind of semantic resource is the thesaurus. As an example we
may consider the entry for the adjective conspicuous in the Merriam-Webster
Collegiate Thesaurus (http://www.m-w.com/home.htm), where two senses are
distinguished, each with its own sets of synonyms, antonyms etc.:
Translations as semantic mirrors 313
We may compare this with the thesaurus-like entry for conspicuous below, which
has been generated automatically from parallel corpus data by the method to be
described in this paper:
conspicuous
Sense 1
(Norwegian: avstikkende.)
Sense 2
Hyperonyms: great, hard, large.
Subsense (i) (Norwegian: synlig, tydelig.)
Near-synonyms:
clear, conclusive, definite, distinct, distinctive, obvious,
plain, substantial, unmistakable, vivid.
Hyponyms: apparent, evident, pervasive, visible.
Subsense (ii) (Norwegian: fremtredende, kraftig,
sterk, stor.)
Near-synonyms: outstanding, primary.
Subsense (iii) (Norwegian: oppsiktsvekkende.)
Near-synonyms: amazing, spectacular, startling,
surprising, unusual.
Antonyms and contrasted words are not included in the latter entry, since the
method only allows the derivation of relations of semantic similarity (synonymy,
hyperonymy and hyponymy) from the parallel corpus data. The entry displays a
major division into two senses (of which the first one in this case has no
information associated with it apart from a Norwegian translation), and
furthermore a division into subsenses within the more informative second sense.
Sense 1 in this example is probably a spurious consequence of sparsity of data
in the corpus. A better example of a major division into senses (although even
there we would have liked sense 1 to have been merged with sense 4) is
provided by the following automatically derived entry for the Norwegian noun
rett, which is contrastively ambiguous between a number of senses, among which
we find 'course in a meal' and 'court of law'. Some of the related words listed in
this entry are surprising, while most of them are to the point:
rett N
Sense 1
(English: course.)
Sense 2
(English: court, justification.)
Near-synonyms:
argument, begrunnelse, berettigelse, domstolsbehandling,
gård, gårdsplass, plass, sak, ting.
Sense 3
Subsense (i) (English: option.)
Hyponyms: tilbud.
Subsense (ii) (English: rightN.)
Hyponyms: adgang, rettighet.
Subsense (iii) (English: order.)
Near-synonyms:
bestemmelse, klasse, krav, lov, løsning, mte, orden, regel,
regelverk, stand, system, vedtak.
Sense 4
(English: dish, food, supper.)
Near-synonyms:
aftens, aftensmat, fat, føde, gryte, kar, kopp, kosthold,
kveldsmat, lunsj, mat, matvare, middag, måltid, næring, skål,
tallerken.
The thesaurus entries above are generated from semantic lattices, which in their
turn are derived automatically from the translational data. Figure 1 below is an
example of such a lattice, representing the semantic field associated with sense 4
of rett in the entry above (labelled rettN2 in the lattice):
According to the hypothesis behind the method, senses on dominating nodes are
hyperonyms of senses on dominated nodes. Thus, a sense of mat 'food'
dominates senses of rett 'dish', middag 'dinner', måltid 'meal', lunsj 'lunch',
kveldsmat 'supper', aftensmat 'supper', and aftens 'supper', all of which are
plausible hyponyms of mat. Less convincingly, lunsj also dominates aftensmat.
Formally the lattice expresses inclusion and overlap relations among sets
of translationally derived features, as described in section 2.3 below.
The English-Norwegian Parallel Corpus (ENPC), from which the above results
are derived, comprises approximately 2.6 million words, originals and
translations included. The corpus contains fiction as well as non-fiction and
English originals translated into Norwegian as well as the other way around. The
corpus is aligned at sentence level (Johansson et al. 1996), while it is a part of our
present project to align the ENPC at word level, in order to be able to extract the
sets of translations of a given word automatically. Our present data has been
derived from the sentence-aligned corpus, however, which means that the
translational data for each word in our data set has been extracted manually.
For example, searching for the Norwegian word form bemerkelsesverdig
returns the sentences containing bemerkelsesverdig coupled with the
corresponding English sentences in the parallel text (translation or original).
Based on a set of heuristic criteria to decide whether a word can be said to
correspond to a given word in the translation or not, the set of translations of
bemerkelsesverdig is extracted by the human analyser:
Sets of such lemmas with their associated sets of translations from the corpus
constitute the input to the procedure deriving semantic lattices and thesaurus
entries, by principles which we now proceed to describe.
2. Semantic mirrors
We assume that contrastive ambiguity, such as the ambiguity between the two
unrelated senses of the English noun bank ('money institution' and 'riverside'),
tends to be a historically accidental and idiosyncratic property of individual
words. That is, we don't expect to find instances of the same contrastive
ambiguity replicated by other words in the language or by words in other
languages. Furthermore, we don't expect words with unrelated meanings to share
translations into another language, except in cases where the shared word is
contrastively ambiguous between the two meanings. By the first assumption there
should then be at most one such shared word.
Given these assumptions contrastive ambiguity should be discoverable in
the patterns of translational relations. We may consider the Norwegian noun tak,
contrastively ambiguous between the meanings 'roof' and 'grip'. Figure 2 shows
the first t-image of tak in the right-hand box, and the first t-images of each of
those English words again in the left-hand box. We refer to the last-mentioned set
of sets as the inverse t-image of tak.
The point worth noticing is that the images of roof and ceiling overlap in hvelving
in addition to tak, while the images of grip and hold overlap in grep in addition to
tak. This indicates that roof and ceiling are semantically related, and similarly
grip and hold, while no overlap (apart from tak) unites grip/hold and roof/ceiling.
Grip/hold and roof/ceiling hence seem to represent unrelated meanings, and the
conclusion is that tak is ambiguous.
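The two movements described above (from tak to its translations, and from each translation back to Norwegian) can be sketched in a few lines. This is only an illustration: the dictionary contains just the words and overlaps named in the text, the rest of Figure 2 is omitted, and the function names are hypothetical rather than taken from the project software.

```python
# Toy translation data. Only the overlaps named in the text are included;
# this is not the full content of Figure 2.
NO_TO_EN = {
    "tak": {"roof", "ceiling", "grip", "hold"},
    "hvelving": {"roof", "ceiling"},
    "grep": {"grip", "hold"},
}


def invert(mapping):
    """Build the reverse dictionary (English word -> Norwegian translations)."""
    inverse = {}
    for source, targets in mapping.items():
        for target in targets:
            inverse.setdefault(target, set()).add(source)
    return inverse


EN_TO_NO = invert(NO_TO_EN)


def first_t_image(word, mapping):
    """The set of translations of `word`."""
    return set(mapping.get(word, set()))


def inverse_t_image(word, forward, backward):
    """For each translation of `word`, that translation's own first t-image."""
    return {t: first_t_image(t, backward) for t in first_t_image(word, forward)}


inv = inverse_t_image("tak", NO_TO_EN, EN_TO_NO)
for english, norwegian in sorted(inv.items()):
    print(english, sorted(norwegian))

# Overlap apart from "tak" itself is taken to signal semantic relatedness.
print((inv["roof"] & inv["ceiling"]) - {"tak"})   # roof and ceiling share hvelving
print((inv["roof"] & inv["grip"]) - {"tak"})      # roof and grip share nothing else
```

Representing each t-image as a Python set keeps the overlap test a plain set intersection.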
The overlap patterns are necessarily preserved within the first t-image of tak
when we make our third movement and find all the first t-images in English of
the words in the inverse t-image, as shown in Figure 3. We refer to this set of sets
as the second t-image of tak.
As shown in Figure 3, the second t-image can be divided into three
clusters or groups of sets, each group being held together by overlap relations (we
only consider overlaps in the restriction of the second t-image to the members of
the first t-image). On the basis of these groups the first t-image of tak can be
partitioned into the three sense partitions shown in Figure 4.
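The partitioning step can be approximated as finding connected components: two translations of tak fall into the same group when their own translation sets share a member other than tak. The sketch below is a simplification that works directly on the inverse t-image rather than on the second t-image, and its data are limited to the four English words named in the text, so it yields two groups rather than the three reported for the corpus data. The function name is hypothetical.

```python
from itertools import combinations

# Hypothetical inverse t-image of "tak": each English translation mapped to
# its Norwegian translations. Only words and overlaps named in the text are
# included, so this toy yields two groups, not the three of the corpus data.
INVERSE_T_IMAGE = {
    "roof": {"tak", "hvelving"},
    "ceiling": {"tak", "hvelving"},
    "grip": {"tak", "grep"},
    "hold": {"tak", "grep"},
}


def sense_partitions(word, inverse_t_image):
    """Group the translations of `word` into clusters linked by overlap.

    Two translations are linked when their translation sets share a member
    other than `word` itself; the clusters are the connected components.
    """
    groups = {t: {t} for t in inverse_t_image}
    for a, b in combinations(inverse_t_image, 2):
        if (inverse_t_image[a] & inverse_t_image[b]) - {word}:
            merged = groups[a] | groups[b]
            for member in merged:
                groups[member] = merged
    # Deduplicate the shared set objects and return them in a stable order.
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique)


print(sense_partitions("tak", INVERSE_T_IMAGE))
```

Each resulting group corresponds to one candidate sense of tak.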
Once senses are individuated in the manner described, they can be grouped into
semantic fields. Traditionally, a semantic field is a set of senses that are directly
or indirectly related to each other by a relation of semantic closeness.
In our translational approach, the semantic fields are isolated on the basis
of overlaps among the first t-images of the senses. Since we treat translational
correspondence as a symmetric relation (disregarding the direction of translation),
we get paired semantic fields in the two languages involved, each field assigning
a subset structure to the other. Figure 5 gives a rough illustration of the principle
(arrows indicate the t-image of each sense; for simplicity, the indicated sets are
just suggested and in no way reflect the corpus data accurately).
The subset structure of a semantic field, assigned by its partner field in the other
language, contains rich information about the semantic relations among its
members. For example, senses with a wide meaning (such as good) will in
general have a larger number of alternative translations than words with a
narrower meaning (such as tasty). The number of translations is of course directly
reflected in the number of subsets of which the sense is a member. Thus the
senses at the peaks in the semantic fields will have the widest meanings.
We may illustrate this by means of a constructed and artificially simple
example. Assume that we find the translational pattern illustrated in Figure 6,
where hingst 'stallion' is found translated into animal, horse and stallion, while
dyr 'animal' is translated into animal, horse, stallion, mare and dog, etc.
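The two assumptions at work in this constructed example (wider senses have more translations, and the translations of a hyponym form a subset of those of its hyperonym) can be checked mechanically. In the sketch below the sets for hingst and dyr follow the text, while the set for hest is a hypothetical addition and the function names are illustrative only.

```python
# Translation sets for the constructed example. The entries for hingst and
# dyr follow the text; the entry for hest is a hypothetical addition.
TRANSLATIONS = {
    "dyr": {"animal", "horse", "stallion", "mare", "dog"},
    "hest": {"animal", "horse", "stallion", "mare"},  # assumed
    "hingst": {"animal", "horse", "stallion"},
}


def by_width(translations):
    """Order words from widest to narrowest by number of translations."""
    return sorted(translations, key=lambda w: len(translations[w]), reverse=True)


def hyponym_candidate(a, b, translations):
    """True if the translations of `a` are a proper subset of those of `b`."""
    return translations[a] < translations[b]


print(by_width(TRANSLATIONS))
print(hyponym_candidate("hingst", "dyr", TRANSLATIONS))
print(hyponym_candidate("dyr", "hingst", TRANSLATIONS))
```

The proper-subset test is directional, which is what allows the same data to separate hyponym from hyperonym.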
The next step is to encode, for each sense, its position within the semantic field,
along with its translational relations to the members of the other field. This is
done by means of feature sets, automatically derived from the set structure. In
accordance with traditional semantic componential analysis, the intention is that
wide senses should have few features, while more specific senses should have
more features, some of which are inherited from wider, superordinate senses. This
is achieved by starting from the tops in two paired fields (i.e. the sense pair
which is both translationally interrelated and whose members belong to the
largest number of subsets), which in Figure 7 gives us the pair dyr1 and animal1.
A feature [dyr1|animal1] is constructed from this pair and assigned to both its
members dyr1 and animal1. Then the feature is inherited (non-transitively) by
lower senses according to the following principle: all senses in the first t-image
of animal1 and ranked lower than dyr1 (i.e. belonging to fewer subsets than dyr1)
inherit the feature, and conversely, all senses in the first t-image of dyr1 and
ranked lower than animal1 inherit the feature. Then the procedure moves to the
next highest, translationally interrelated, peaks hest1 and horse1, constructs a
feature from that pair, and assigns it according to the same principle. The result is
shown in Figure 7.
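The inheritance principle can be sketched on a toy pair of fields. The sense names dyr1, animal1, hest1 and horse1 come from the text; the remaining senses and all translation links are assumptions. The rule used here to select peak pairs (mutual translations of equal rank) is likewise a simplifying assumption that happens to suit this toy data, not a description of the project's actual procedure.

```python
# Toy paired fields. dyr1, animal1, hest1 and horse1 are named in the text;
# the remaining senses and all translation links are assumptions.
NO_TO_EN = {
    "dyr1": {"animal1", "horse1", "stallion1", "mare1", "dog1"},
    "hest1": {"animal1", "horse1", "stallion1", "mare1"},
    "hingst1": {"animal1", "horse1", "stallion1"},
    "hoppe1": {"animal1", "horse1", "mare1"},
    "hund1": {"animal1", "dog1"},
}

# Translational correspondence is treated as symmetric.
EN_TO_NO = {}
for no, ens in NO_TO_EN.items():
    for en in ens:
        EN_TO_NO.setdefault(en, set()).add(no)

T_IMAGE = {**NO_TO_EN, **EN_TO_NO}
# Rank = number of subsets a sense belongs to = size of its first t-image.
RANK = {sense: len(image) for sense, image in T_IMAGE.items()}


def assign_features():
    """Assign pair-derived features and let lower-ranked senses inherit them."""
    features = {sense: set() for sense in T_IMAGE}
    # Assumed pairing rule: mutual translations of equal rank form a peak pair.
    pairs = [(no, en) for no in NO_TO_EN for en in NO_TO_EN[no]
             if RANK[no] == RANK[en]]
    pairs.sort(key=lambda p: RANK[p[0]], reverse=True)
    for no, en in pairs:
        feature = f"[{no}|{en}]"
        features[no].add(feature)
        features[en].add(feature)
        # Senses in the t-image of `en` ranked lower than `no` inherit it ...
        for sense in T_IMAGE[en]:
            if RANK[sense] < RANK[no]:
                features[sense].add(feature)
        # ... and conversely for the t-image of `no`.
        for sense in T_IMAGE[no]:
            if RANK[sense] < RANK[en]:
                features[sense].add(feature)
    return features


FEATURES = assign_features()
for sense in ("dyr1", "hest1", "hingst1", "hund1"):
    print(sense, sorted(FEATURES[sense]))
```

As intended, the widest sense dyr1 ends up with a single feature, while the more specific hingst1 carries its own feature plus those inherited from hest1 and dyr1.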
The feature sets in Figure 7 define a lattice based on inclusion relations among
them, as shown in Figure 8.
In Figure 8 the daughters of a node N have supersets of the feature set associated
with N. In this constructed example the lattices evidently also reflect hyperonym /
hyponym relations among the senses.
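Deriving mother/daughter links from inclusion among feature sets amounts to computing the covering relation of the subset order. A minimal sketch, using hypothetical feature sets with abbreviated feature names (A, H, D, S, M stand in for pair-derived features):

```python
# Hypothetical feature sets for one side of a toy field; the single letters
# abbreviate pair-derived features and are not taken from the paper.
FEATURES = {
    "dyr1": {"A"},
    "hest1": {"A", "H"},
    "hund1": {"A", "D"},
    "hingst1": {"A", "H", "S"},
    "hoppe1": {"A", "H", "M"},
}


def daughters(features):
    """Map each node to its daughters in the inclusion ordering.

    D is a daughter of N when D's feature set is a proper superset of N's
    and no third node's feature set lies strictly between the two.
    """
    result = {node: [] for node in features}
    for n, fn in features.items():
        for d, fd in features.items():
            if fn < fd and not any(fn < fx < fd for fx in features.values()):
                result[n].append(d)
    return {node: sorted(kids) for node, kids in result.items()}


for mother, kids in daughters(FEATURES).items():
    print(mother, "->", kids)
```

With these sets the result is a simple tree: dyr1 dominates hest1 and hund1, and hest1 in turn dominates hingst1 and hoppe1.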
The lattices in Figure 8 are simple trees, while actual derived lattices tend
to be more complex. In the first place, senses may inherit features from more than
one peak in the semantic field, which gives rise to multiple mothers in the
lattice. In the second place, nodes may have intersecting feature sets without
either of the sets including the other, so that there is no mother/daughter
relationship between the nodes in question. When no actual sense is associated
with the intersection, x-nodes (cf. Figure 1) are introduced, carrying the
intersection of the feature sets of their daughters. Thus the x-nodes can intuitively
be seen as virtual hyperonyms of their daughters. It is the presence of x-nodes
which guarantees that the structure is a semilattice (i.e. all nodes with intersecting
feature sets are guaranteed to be dominated by a node carrying the intersection).
In the semilattice, two senses are assumed to be more closely related the more of
their features they share, i.e. the shorter the distance is to their common
dominating node.
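The introduction of x-nodes can be illustrated as follows. The sketch makes a single pass over pairs of nodes and adds a virtual node wherever two feature sets intersect, neither includes the other, and no existing node already carries the intersection. The sense and feature names are hypothetical, and a full implementation would also have to consider intersections involving the newly added x-nodes.

```python
from itertools import combinations


def add_x_nodes(features):
    """Return a copy of `features` with virtual x-nodes added.

    An x-node is created for two nodes whose feature sets intersect without
    either including the other, unless some node already carries exactly
    that intersection. Single pass only: intersections among the x-nodes
    themselves are not explored.
    """
    nodes = {name: set(fs) for name, fs in features.items()}
    existing = {frozenset(fs) for fs in features.values()}
    counter = 0
    for (a, fa), (b, fb) in combinations(features.items(), 2):
        common = fa & fb
        if common and not (fa <= fb or fb <= fa) and frozenset(common) not in existing:
            counter += 1
            nodes[f"x{counter}"] = set(common)
            existing.add(frozenset(common))
    return nodes


# Hypothetical senses s1..s3 with hypothetical features f1..f4.
FIELD = {"s1": {"f1", "f2"}, "s2": {"f1", "f3"}, "s3": {"f4"}}
WITH_X = add_x_nodes(FIELD)
print(sorted(WITH_X))

# If a real sense already carries the intersection, no x-node is needed.
WITH_MOTHER = add_x_nodes({"m": {"f1"}, "s1": {"f1", "f2"}, "s2": {"f1", "f3"}})
print(sorted(WITH_MOTHER))
```

The number of shared features, `len(FIELD["s1"] & FIELD["s2"])`, then serves as a simple measure of how closely two senses are related.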
Returning now to the actual corpus-based lattice in Figure 1, it is defined
by the feature sets on the nodes according to the principles just described. For
instance, mat2 is associated with the singleton feature set {[mat2|supper3]},
kveldsmat1 with {[mat2|supper3], [kveldsmat1|meal1]}, and aftensmat2 with
{[mat2|supper3], [kveldsmat1|meal1], [lunsj2|meal1], [aftensmat2]}. In Figure 1,
x-nodes with only one feature (such as x1) are displayed with the feature beside
them.
Sweet1 is also dominated by several nodes outside this sublattice; size limitations
prevent displaying a more complete graph. The node sweet1 is associated with the
following feature set: {[god3|good1], [fin2|nice2], [pen1|gentle3],
[vakker1|soft2], [snill1|pleasant1], [deilig1|splendid3], [frisk4|sweet1],
[blid3|sweet1]}. Finding hyperonyms, near-synonyms and hyponyms of sweet1
now first involves considering which other senses in the lattice share features
with sweet1. The features in question are assigned to the following senses in the
complete semilattice (we will refer to the sets of senses as the denotations of the
features):
[god3|good1]:
(able1 accurate1 adept1 adequate2 affectionate1 all_right2 amiable2 appropriate5
attractive4 beautiful2 beneficial1 benign3 bright2 burning3 charming2 clean1 clear1 close3
comfortable2 comforting3 competent2 confident2 correct1 cozy2 cute1 decent2 delicious1
delightful2 detailed3 dishy1 easy1 efficient2 elegant3 excellent2 fair2 fancy1 favourable1
fine1 firmA1 first-class3 first-rate2 fit3 fortunate1 fresh3 friendly2 full2 genuine2 good1
handsome2 happy3 healthy2 high3 hot2 joyful2 kind1 kindly1 long3 lovely2 lucky2
magnificent3 marvellous1 neat2 nice2 okay1 peaceful1 perfect3 placid2 pleasant1 pleased2
pleasing1 pleasurable1 plentiful1 plenty1 polite2 positive1 pretty2 proficient1 quite_certain1
real2 reassuring2 respectable3 right2 ripe1 safe2 satisfactory1 satisfying1 secure2 sizeable1
smart2 smooth3 soft2 solid2 sound2 spectacular2 steady1 strong3 successful2 suited1
superb2 superior5 sure1 sweet1 talented2 thorough1 tidy1 well2 whole2 wholesome1
wonderful3 worthy2)
[fin2|nice2]:
(attractive4 beautiful2 breathtaking2 charming2 comfortable2 cute1 delicate3 dishy1 easy1
elegant3 enchanting1 excellent2 fancy1 fine1 first-class3 gentle3 glorious4 graceful2
handsome2 impressive2 lovely2 magnificent3 marvellous1 neat2 nice2 okay1 perfect3
pleasurable1 polite2 pretty2 pure2 slight3 smart2 soft2 splendid3 sweet1 thin2 wonderful3)
[pen1|gentle3]:
(attractive4 beautiful2 charming2 clean1 cute1 dishy1 elegant3 enchanting1 fancy1 fine1
first-class3 formal1 gentle3 graceful2 handsome2 lovely2 neat2 pleasant1 polite2 pretty2
soft2 sweet1 tidy1)
[vakker1|soft2]:
(attractive4 charming2 cute1 delightful2 dishy1 enchanting1 fair2 fancy1 graceful2
handsome2 lovely2 magnificent3 mild2 ornate2 pleasant1 pleasurable1 pretty2 soft2 sweet1)
[snill1|pleasant1]:
(all_right2 amiable2 benign3 friendly2 good-humoured1 good-natured3 jolly1 kind1 kindly1
mild3 pleasant1 pleasing1 polite2 smiling2 sweet1)
[deilig1|splendid3]:
(beautiful2 charming2 cute1 enchanting1 delicious1 delightful2 pleasureable1 splendid3
sweet1)
[frisk4|sweet1]:
(all_right2 brisk5 eager2 fit3 fresh3 healthy2 new1 pert2 sweet1 well2)
[blid3|sweet1]:
(amiable2 blithe3 cheerful4 cheery1 good-humoured1 good-natured3 jolly1 kind1 kindly1
merry1 mild3 smiling2 sweet1)
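The first step, finding which senses share features with sweet1, can be sketched by counting in how many feature denotations each sense co-occurs with sweet1. For brevity the sketch uses only three of the eight denotations listed above, copied from the lists; the function name is hypothetical, and the counts therefore cover only this subset.

```python
from collections import Counter

# Three of the feature denotations listed above (the other five are omitted
# for brevity, so the counts below reflect this subset only).
DENOTATIONS = {
    "[snill1|pleasant1]": (
        "all_right2 amiable2 benign3 friendly2 good-humoured1 good-natured3 "
        "jolly1 kind1 kindly1 mild3 pleasant1 pleasing1 polite2 smiling2 sweet1"
    ).split(),
    "[frisk4|sweet1]": (
        "all_right2 brisk5 eager2 fit3 fresh3 healthy2 new1 pert2 sweet1 well2"
    ).split(),
    "[blid3|sweet1]": (
        "amiable2 blithe3 cheerful4 cheery1 good-humoured1 good-natured3 "
        "jolly1 kind1 kindly1 merry1 mild3 smiling2 sweet1"
    ).split(),
}


def shared_feature_counts(target, denotations):
    """Count, for every other sense, how many features it shares with `target`."""
    counts = Counter()
    for senses in denotations.values():
        if target in senses:
            counts.update(s for s in senses if s != target)
    return counts


COUNTS = shared_feature_counts("sweet1", DENOTATIONS)
for sense in ("amiable2", "all_right2", "brisk5"):
    print(sense, COUNTS[sense])
```

Senses that share more features with sweet1 lie closer to it in the semilattice.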
OverlapThreshold = 0.05:
sweet
Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: blid, deilig, fin, god, pen,
snill, søt, vakker.)
Near-synonyms:
amiable, amused, attractive, beautiful, benign, blithe, charming,
cheerful, cheery, cute, delicious, delightful, dishy, easygoing,
enchanting, fair, fancy, friendly, good-humoured, good-natured,
graceful, handsome, jolly, kind, kindly, lovely, magnificent, merry,
mild, ornate, picturesque, pleasant, pleasing, pleasurable, polite,
pretty, smiling, soft.
Hyponyms: all_right.
OverlapThreshold = 0.1:
sweet
Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: deilig, fin, god, pen, søt,
vakker.)
Near-synonyms:
attractive, beautiful, charming, cute, delicious, delightful, dishy,
enchanting, fair, fancy, graceful, handsome, lovely, magnificent,
3. Conclusion
Notes
1. The analyses in this paper are based on corpus data resulting from work by
Martha Thunes, Gunn Inger Lyse and the author. The software producing the
semantic analyses has been developed by the author and reimplemented and
improved by Paul Meurer. I am grateful to Martha Thunes for useful
comments on an earlier version of this article.
References
Åke Viberg
Uppsala University
Abstract
The major English physical contact verbs strike, hit and beat are compared with
their primary Swedish translation equivalent slå on the basis of data from the
English-Swedish Parallel Corpus. The analysis is carried out within two
theoretical frameworks concerning the underlying conceptual representation and
the linguistic cues that can be used for word sense identification. In addition to a
rather detailed account of points of contrast in the fairly extensive patterns of
polysemy that are characteristic of the verbs, an attempt is made to provide a
general characterisation in contrastive terms. In comparison with the English
verbs, the conceptual representation of slå is grounded more firmly in
sensorimotor experience and the fact that hitting prototypically is a hand action.
As in other languages such as Chinese, the main verb of hitting in Swedish has
extended senses that refer to other types of hand actions. With respect to word
sense identification, the semantic classification of the subject and object is a
prominent cue for the distinction between the major meanings of the main
physical contact verbs but to various degrees in English and Swedish. Several
examples are also given of cases where linguistic cues are not sufficient and
disambiguation must be based on topical or pragmatic information.
1. Introduction
This paper will present a contrastive lexical analysis of the major English
physical contact verbs strike, hit and beat in comparison to the Swedish verb slå
which is the closest equivalent to all three English verbs. The semantic analysis is
based on an earlier paper on the verbs of physical contact in Swedish (Viberg
1999). The verb slå has a complex pattern of polysemy and many extended
meanings which require a wide range of translations in English. The rich
polysemy tends to be characteristic of verbs with the same prototypical meaning
across a wide range of languages (for Chinese, see Gao 2001).
The comparison of Swedish and English that will be presented in this
paper is based on the English-Swedish Parallel Corpus, ESPC (Aijmer et al. 1996,
Altenberg and Aijmer 2000), which contains original text samples in English and
Swedish together with their translations. The text samples represent both fiction
and non-fiction and the total number of words from each source language is about
328 Åke Viberg
half a million. The corpus will be used for contrastive purposes, whereas matters
such as translation problems or the general characteristics of translated texts will
not be dealt with (see Johansson 1998 on the various uses of parallel corpora).
The aim of the present paper is primarily to present a systematic
contrastive account of the data but the general theoretical significance will be
briefly indicated within two frameworks. The first concerns the conceptual
representation of lexical items accounting for the patterns of polysemy and their
cognitive motivations. This will be oriented towards cognitive semantics and in
particular prototype theory (Taylor 1989). Another important cognitive semantic
idea is the notion of embodiment which implies that our concepts to a large extent
are shaped by our bodies and brains (Lakoff and Johnson 1999). In particular,
bodily movement will be shown to play an important role for the conceptual
representation of the main verbs of physical contact.
The second framework concerns the contextual representation of lexical
items and the process of word sense identification accounting for the interaction
between word meaning and cues in the linguistic context in the disambiguation
process and in the choice of translation equivalents. According to Miller and
Leacock (2000), each meaning of a word must be associated with a contextual
representation, which can be either local or topical. Experimental work has shown
that people can identify various meanings of a polysemous word with a relatively
high degree of success if they are presented with a window of ±2 words of
context, but local context is not always enough. Local cues turned out to be very
precise when they occurred but all too often they simply did not occur (op. cit.
p. 156). Miller and Leacock also give an account of the use of topical context
which refers to the general topic of a text or conversation. Topical context has
been tested with various statistical classifiers run on computers. In one such
experiment, only the words occurring in the same sentence as the target word
were presented (in random order). With three or more senses to distinguish of
words such as line and serve the statistical classifiers reached close to 75%
correctness. Human subjects who were presented with lists of words co-occurring
with line in reverse alphabetical order only managed to identify the correct sense
a little better than the statistical classifiers, which justified the conclusion that the
result obtained with the classifiers was close to the ceiling for what can be
achieved with topical context alone.
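The idea of identifying a sense from topical context alone can be illustrated with a minimal sketch. It assumes a naive Bayes classifier over an unordered bag of sentence words; this is only one possible kind of statistical classifier and not necessarily the one used in the experiments reported by Miller and Leacock. The sense labels and the training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sense, context_words) pairs; word order is ignored."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(model, words):
    """Return the sense with the highest naive Bayes score for a bag of words."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        n_tokens = sum(word_counts[sense].values())
        score = math.log(count / total)  # prior probability of the sense
        for w in words:
            # add-one smoothing so that unseen words do not zero out a sense
            score += math.log((word_counts[sense][w] + 1) / (n_tokens + len(vocab)))
        scores[sense] = score
    return max(scores, key=scores.get)

# Invented toy contexts for three senses of "line" (hypothetical data).
TOY_EXAMPLES = [
    ("cord", ["fishing", "rod", "hook", "water"]),
    ("queue", ["wait", "people", "ticket", "stand"]),
    ("phone", ["call", "telephone", "busy", "operator"]),
]

model = train(TOY_EXAMPLES)
print(classify(model, ["ticket", "long", "people"]))
```

Because the classifier only counts words, presenting the context in random or reverse alphabetical order makes no difference to the result, which mirrors the experimental set-up described above.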
Table 1 shows the most frequent Swedish equivalents of strike, hit, beat
and knock. Due to the relatively limited number of occurrences, originals and
translations in each language have been pooled together, which is not ideal, but a
separate account would be difficult to grasp. (Originals and translations are
separately coded in the underlying analysis of the data.) The row named Total
English verbs shows the total number of occurrences of the four verbs in the
ESPC. The following rows show the most frequent Swedish equivalents. It turns
out that the most frequent translation equivalent of all these verbs except knock is
the verb slå, which is clearly the dominant physical contact verb in Swedish. The
two verbs strike and hit share the verbs drabba 'affect negatively' and träffa in
the sense 'hit a target' as the second and third most frequent equivalents, whereas
beat and knock only share the verb slå. As for knock, the verb knacka serves as
the major equivalent when the verb refers to knocking on a door; otherwise slå is
the major equivalent. The rightmost column shows the total number of
occurrences of the Swedish verbs in the corpus.
Table 1 rather clearly reflects the fact that the semantic field of physical contact
verbs has one central member in Swedish, the verb slå, which is the major
equivalent of the three verbs strike, hit and beat in English. In percentage terms,
slå accounts for between 47% (strike) and 33% (hit) of the equivalents of these
three verbs. On the other hand, these verbs account for only a small proportion of
the English equivalents of slå: together they make up only 18% of the
equivalents of slå. In spite of this, at least strike and hit are usually experienced
as the closest equivalents of slå by Swedes who know English; this is probably
due to the fact that these two verbs account for close to half (47%) of the
equivalents of slå in its prototypical meaning as a physical contact verb. In
addition, as many as 29 other English verbs which can be regarded as physical
contact verbs are used as equivalents of slå (e.g. bang, pound, punch, slam, slap).
As will be shown below, there are also many English equivalents which belong to
other semantic fields than physical contact due to the extensive patterns of
polysemy which characterize slå. The next section provides an analysis of the
most frequent meanings of the major English physical contact verbs. This is
followed by an account of the extensive pattern of polysemy of Swedish slå and
how it is reflected in the English equivalents.
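The asymmetry described above arises because the same number of aligned pairs is divided by two different totals. The following sketch shows the arithmetic only; the counts are hypothetical and are not the ESPC figures from Table 1.

```python
def share(pair_count, verb_total):
    """Percentage of a verb's occurrences that correspond to a given equivalent."""
    return round(100 * pair_count / verb_total, 1)

# Hypothetical counts, chosen only to illustrate the asymmetry.
strike_total = 100      # assumed occurrences of the English verb
sla_total = 500         # assumed occurrences of the Swedish verb
strike_sla_pairs = 47   # assumed cases where the two verbs correspond

print(share(strike_sla_pairs, strike_total))  # share of the Swedish verb among equivalents of strike
print(share(strike_sla_pairs, sla_total))     # share of strike among equivalents of the Swedish verb
```

With these invented figures the first proportion is 47.0% and the second 9.4%: a high-frequency, highly polysemous verb can dominate the equivalents of a less frequent verb without the reverse being true.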
In Table 2, an attempt is made to show the relationships between the major senses
of strike, hit and beat as they are reflected in the ESPC. Unfortunately, the
number of occurrences is rather limited but it is still possible to sketch the basic
semantic relationships. The frequencies (F) given for each verb in the last three
columns refer to the total number of occurrences with a certain meaning and
typical subject and include some cases where the major Swedish equivalent is not
used.
Table 2. Main senses of strike and hit and beat with their major Swedish
equivalents
The verbs strike, hit and beat can all be used about a human being moving the
arm and bringing the hand (or something held in the hand) into contact with
something in order to have an impact on it. This use as a bodily action verb can
be taken as prototypical. When the object is also a human being, which is
frequently the case, the intention is usually antagonistic: to hurt (or even to kill) or
defeat the other human, not just to touch in a friendly way (cf. pat, stroke,
caress). It is hard to find any clear semantic contrast between strike and hit in this
use, whereas beat is frequentative and generally indicates a more intensive effect.
The dominant Swedish equivalent of this use is slå. Equivalents clearly
expressing the intention are also used, in particular as equivalents of beat (e.g.
misshandla 'batter', klå upp 'beat up, thrash', ge stryk 'give a beating, lick').
The verbs can also be used with various classes of inanimate subjects to
describe various types of physical events (i.e. events which can be experienced
with our senses). In this case, there are several clear contrasts between hit, strike
and beat. Since the database is so limited, it is useful to compare the patterns in
the ESPC with the large BNC corpus. Table 3 shows which nouns are salient as
subjects according to Kilgarriff's WASPBENCH, a tool which shows which
collocates appear with more than chance frequency together with a certain target
word according to a statistical formula producing a salience index (Kilgarriff and
Tugwell 2002; see also the demo at http://www.itri.bton.ac.uk/peopleindex.html).
The columns marked F show the frequency of the noun as subject of the verb and
the columns marked Sal. show the salience index. The subjects are listed in
descending order of this index.
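The statistical formula behind the salience index is not given here. As a rough illustration of how such an index can rank collocates, the sketch below assumes a score formed as the product of pointwise mutual information and log frequency; this formula is an assumption and may differ from the one implemented in WASPBENCH. The pair and verb frequencies are taken from Table 3, whereas the noun frequencies and the corpus size are hypothetical.

```python
import math

def salience(pair_freq, verb_freq, noun_freq, corpus_size):
    """Assumed salience score: pointwise mutual information times log frequency."""
    pmi = math.log2(pair_freq * corpus_size / (verb_freq * noun_freq))
    return pmi * math.log(pair_freq + 1)

CORPUS_SIZE = 100_000_000   # assumed corpus size (roughly the size of the BNC)
STRIKE_FREQ = 7149          # total for strike in Table 3

# Noun frequencies below are hypothetical.
rare_noun = salience(65, STRIKE_FREQ, 2_000, CORPUS_SIZE)      # e.g. lightning
common_noun = salience(74, STRIKE_FREQ, 500_000, CORPUS_SIZE)  # e.g. thing

print(rare_noun > common_noun)
```

A measure of this kind explains why a relatively infrequent subject such as lightning can rank above a much more common noun such as thing even when the raw co-occurrence frequencies are similar.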
The type of subject is also important for the choice of Swedish translation.
In particular, projectiles such as bullets influence the choice of Swedish
translations in the direction of träffa 'hit a target'. When used as a physical
contact verb, träffa focuses on the moment when contact occurs, whereas slå (see
below) prototypically describes a complete bodily action (stretching of arm
followed by contact between hand and target):1
As can be observed in Table 3, bullet appears as one of the most salient subjects
both of strike and hit and it is reasonable to regard it as a prototypical projectile.
(Among the salient subjects of hit, there are further examples: ball, shot, bomb,
missile, shell, pellet. Hit is the dominant alternative when the subject is a
projectile even in the ESPC according to Table 2.) However, not only nouns that
are lexically marked as projectiles favour the choice of träffa in Swedish. Any
concrete object that forcefully moves through the air can be interpreted as a
projectile:
(3) Hade hon kommit bara lite tidigare kunde hon ha träffats i huvudet av
    istappen (MG)
    If she had come out just a little earlier, the icicle might have hit her.
(4) Mannen började springa och Kollberg sköt igen och den här gången träffade
    han honom i knävecket. (SW)
    The man started running, and Kollberg shot again and this time hit him in
    the knee.
(5) We try to aim as close as possible without actually hitting them. (MA)
    Vi försöker sikta så nära som möjligt utan att verkligen träffa dem.
The verbs meaning 'shoot' and 'aim', respectively, which form part of the topical
context, serve as the major cues to the choice of Swedish equivalent of hit.
The typical and most frequent object of strike, hit and beat in the ESPC is
a human being when the verbs appear in their prototypical use as bodily action
verbs. This is, however, only a tendency, whereas it is more or less a requirement
of Swedish slå (see below). There are a number of more abstract uses where these
verbs have an object which refers to a human experiencer. In prototypical uses
such as Harry struck/hit/beat Peter, there is usually an implication that the agent
wants to dominate or defeat the object. This implication tends to be strongest with
beat and this may be the reason why beat is used when only the abstract sense
'defeat' is present. The most frequent Swedish equivalent is slå but even more
abstract verbs such as besegra 'defeat' can be used:
(6) He was quick and good at tic-tac-toe and checkers, and cunning and
    aggressive; he easily beat me. (OS)
    Han var snabb och duktig i luffarschack och damspel, och listig och
    offensiv; han slog mig utan besvär.
(7) I was better at maths and science and practical things; you only had to
    show him a lathe in the metal workshop for him to pretend he had a
    fainting fit; but when he wanted to beat me, he beat me. (JB)
    Jag var bättre i matte och naturvetenskap och praktiska övningsämnen; man
    behövde bara visa honom en revolversvarv på metallslöjden för att han
    skulle låtsas svimma; men när han ville besegra mig så gjorde han det.
Table 3. Salient subject collocates of strike, hit and beat according to
Kilgarriff's WASPBENCH (F = frequency as subject, Sal. = salience index)

strike          F  Sal.   hit             F  Sal.   beat            F  Sal.
Total BNC    7149                      9777                      7552
subject      4417   0.6   subject      6106   0.7   subject      3987   0.5
lightning      65  24.6   smash          33  24.0   heart         198  27.5
disaster       52  22.7   recession      99  23.7   drum           15  14.7
clock          80  22.0   bullet         45  19.3   pulse          19  13.0
thought        95  19.7   car            90  14.0   side           50  12.1
bullet         21  14.5   ball           42  13.7   stick          11  11.3
tragedy        17  14.2   shot           23  12.3   England        27  11.1
contrast       14  12.5   bomb           24  12.0   sun            31  11.0
blow           13  12.2   missile        14  11.9   team           52  10.8
similarity     11  12.1   squall          7  11.4   wing           15  10.1
bargain        10  11.8   downturn        7  11.3   rain           20   9.4
thing          74  11.3   blast          11  10.9   keeper          7   8.3
lightening      4  10.9   drought         8  10.6   gang           10   8.1
band           22  10.4   shell          13  10.6   whites          7   7.9
cyclone         6  10.4   wave           27  10.3   United          9   7.4
it            511   9.9   cyclone         5   9.9   Surrey          4   7.1
fact           28   9.4   chart          11   9.6   goal           13   7.1
burglar        13   9.2   loss           21   9.3   man            67   7.1
deal           15   9.2   hurricane       7   9.2   they          368   6.9
jinx            4   9.1   blow            9   9.0   Liverpool       6   6.8
raider          7   9.0   crisis         14   8.8   Rangers         5   6.8
thief          11   9.0   pellet          6   8.7
earthquake      6   8.7   slowdown        4   8.6
right          26   8.5   kick            8   8.3
sun            19   8.3   depression      8   8.3
plague          6   8.2   header          7   8.2

These two examples also illustrate how the meaning and the choice of translation
in certain cases can be identified only pragmatically by the wider discourse
context. When both the subject and object are human, the meaning 'beat
physically' is possible but ruled out by the fact that a game such as tic-tac-toe has
been mentioned earlier, as in the first example. On many occasions, the cues are
even more indirect, for example when they reflect the general topic of
conversation such as sports. The meaning 'defeat', however, is also represented in
the list of salient subjects of beat in Table 3. Many of the subjects are (parts of)
names of teams (England, United, Surrey, Liverpool, Rangers). In addition, there
is the noun team itself and a relatively large proportion of the examples of they
also refer to teams. Most of the examples of the salient subject side also belong
here (e.g. Skem boss Dave Maloney, who watched his side beat Glossop 2-1 on
Saturday).
A prominent class of subjects that appear with hit and strike but not with
beat are nouns referring to events with negative effects for humans such as
natural disasters, economic crises, wars and diseases. Several of the salient
subjects in Table 3 are of this semantic type (strike: disaster, tragedy, cyclone,
earthquake, plague; hit: recession, downturn, drought, cyclone, loss, hurricane,
crisis, slowdown, depression). The object typically refers to human groups and
institutions of various types. The dominant Swedish equivalent in this case is
drabba, which basically means 'affect negatively':
(8) When a severe drought struck the land towards the end of his reign
    [...] (KAR)
    Mot slutet av Ahabs styre, när en svår torka drabbade landet [...]
Since the negative consequences of the event for humans are in focus, the verb
very often appears in the passive, which places the human experiencer in subject
position:
(10) In 1665 yet another plague hit the capital (SUG)
     1665 hemsöktes London av ännu en pest
(11) I slutet av 1870-talet inträffade en mycket svår lågkonjunktur med en
     lång rad svenska konkurser som följd. (TR)
     Sweden was hit by a very deep recession at the end of the 1870s,
     resulting in a large number of Swedish bankruptcies.
A peculiar fact about the use of hit in this meaning is that around 50% of the
occurrences in the ESPC have the passive form. (The passive forms are not as
prominent, 3 out of 14, with strike used with the same meaning, but this will not
be discussed in detail due to the relatively small number of examples.) One
reason for this is the general tendency of human arguments to be realized as
subject. At the same time, the frequent use of the passive form serves as an
indication that hit is being used as a psychological predicate rather than a physical
action verb. A comparison with Swedish drabba is interesting. There are 182
occurrences of drabba in the ESPC corpus, 103 (62%) of which are passive.
Besides hit and strike, its English correspondences are verbs which have a basic
meaning close to 'affect (negatively)' such as affect (23 examples), afflict (12)
and befall (5). The most frequent equivalent is actually the verb suffer (from)
(33), which takes a human Experiencer as subject in an active sentence:
(12) Men Joe var för tidigt född och hade drabbats [Passive] av syrebrist
     under förlossningen. (SCO)
     But Joe was born too early and had suffered from lack of oxygen during
     his birth.
Negative events of the type just described are in principle observable with our
senses, even if the psychological reaction of the Experiencer is in focus. The
subject can also refer to a purely mental event. A clear case is when the noun
thought is used as subject.
(13) Den första tanken slog mig när jag vaknade nästa morgon och tände
     ljuset. (RJ)
     That thought struck me the following morning when I woke up and
     switched on the light.
In the ESPC, only strike is used with this meaning ('the sudden appearance of a
thought'). The dominant equivalent in Swedish is slå. In both languages, this
meaning is usually tied to the construction it + Verb + NP + that-S (or wh-S):
(14) I know that at one stage it struck me how utterly out of place I was in
     that cathedral. (BR)
     Jag vet att det vid ett tillfälle slog mig hur ytterligt malplacerad jag
     var i den där katedralen.
The use of strike with a mental meaning is also reflected in the list of salient
subjects in Table 3. The noun thought appears close to the top. Among the other
salient subjects, the nouns thing and fact tend to serve as the abstract head of
sentential complements (e.g. The first thing that struck me about Dana's poems
was his incredibly tiny script and I was struck by the fact that there were no
spokes) and the salience of it as a subject of strike is no doubt due to expressions
of the type it struck me that-S.
The verb strike (often in combination with as) can also be used to describe
how something appears to a human Experiencer. In this case, Swedish
slå cannot be used as an equivalent and various mental verbs are
preferred instead, such as te sig or tyckas 'appear':
(15) Det enda som tycktes honom avvikande var ett litet krucifix som satt på
     väggen intill dörren till pentryt. (HM)
     The only thing that struck him as being odd was a little crucifix on the
     wall by the kitchen door.
(16) Yes, I think that's how she struck me. (JB)
     Ja. Det var väl ungefär så jag upplevde henne.
To sum up, an important cue for word sense identification and for the choice of
Swedish translation of strike and hit is the semantic class of the subject. However,
there is a wide range of other linguistic cues some of which will be dealt with in
the following account of slå, but as will become evident these cues are not as
prominent as for the Swedish verb. There are also cases where only the wider
discourse context or general pragmatic factors are decisive. With respect to the
conceptual representation, the Bodily action component of strike, hit and beat is
less prominent than in Swedish as will be demonstrated in the next section.
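The role of the semantic class of the subject as a cue can be summarized in a small lookup sketch. The class labels, the noun lists and the function name are hypothetical simplifications introduced here; the sketch covers only the dominant equivalents mentioned above and ignores the cases where the wider discourse context is decisive.

```python
# Hypothetical assignment of subject nouns to semantic classes.
SUBJECT_CLASS = {
    "bullet": "projectile", "ball": "projectile",
    "bomb": "projectile", "missile": "projectile",
    "drought": "negative_event", "recession": "negative_event",
    "plague": "negative_event", "earthquake": "negative_event",
    "thought": "mental_event",
}

# Dominant Swedish equivalent of strike/hit for each class, as described in the text.
CLASS_TO_SWEDISH = {
    "human": "slå",
    "projectile": "träffa",
    "negative_event": "drabba",
    "mental_event": "slå",
}

def swedish_equivalent(subject, is_human=False):
    """Return the dominant Swedish equivalent, or None when the cue is missing."""
    semantic_class = "human" if is_human else SUBJECT_CLASS.get(subject)
    return CLASS_TO_SWEDISH.get(semantic_class)

print(swedish_equivalent("bullet"))
print(swedish_equivalent("drought"))
print(swedish_equivalent("Harry", is_human=True))
```

Returning None for an unknown subject reflects the observation that local cues are precise when they occur but often simply do not occur.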
In Swedish, there is one nuclear physical contact verb, slå, which has a much
higher frequency than any other verb in the field. The meaning of Swedish slå is
analyzed in greater detail in Viberg (1999). In brief, slå in its prototypical use as a
physical contact verb involves Intentional action, Body movement, primarily with
the arm and hand, which results in contact between the hand and some (optionally
specified) part of the body of some other human being, as in the following corpus
example: Mor slog far i ansiktet (IB) 'Mother struck father in the face'. The
various aspects of the meaning of slå can be related to a number of experiential
levels as outlined in Table 4.
(18) Min femåriga arm som med all kraft lyfter handen för att slå
     tillbaka. (MS)
     and my five-year-old arm raising my hand to hit back with all its might.
The use of the body part as subject in this rather exceptional use also backgrounds
the cognitive level conceptualizing the hitting as an uncontrolled event.
Hitting can be experienced both from within as a sensorimotor activity and
from outside as motion through space. The similarity between the visual
perception of the fist moving through the air and a projectile moving through the
air and hitting its target links examples like Harry hit Peter and A bullet hit Peter
in English. This example also shows that languages exploit potential links
differently in polysemy. As described earlier, Swedish would use slå in the first
case (Harry slog Peter) and träffa (Kulan träffade Peter) in the second. The verb
träffa, however, is not completely ruled out when referring to bodily action in
examples such as Harry träffade Peter med ett välriktat slag 'Harry hit Peter with
a well-aimed blow'. What motivates the use of träffa in this example is that the
trajectory of the fist and in particular the exact location of its end-point is
focused. Examples where the meaning of slå is based primarily on spatial
perception will be presented later in this section.
One characteristic of Swedish slå is that the direct object is usually also
human unless there is a verbal particle (see below). When it is non-human, the
target of the contact is usually realized by a formally more marked form as a
There is a strong implication that the contact has a clear effect or impact on the
object. This distinguishes hitting from touching. When the object is human, the
effect is usually psychological. The agent's intention to hurt or defeat the other
human is part of the prototypical meaning of slå. Swedish slå can also be used
when the result is death. The object in this case refers to a human or an animal
(cf. the meaning of the English cognate slay) but in this case slå is usually
combined with the particle ihjäl (etymologically 'into Hel', the kingdom of the
dead in Old Norse mythology). Slå ihjäl is in most of the cases translated by kill
which is unmarked for manner, but the more direct equivalent beat to death also
occurs:
(21) Han kunde slå ihjäl mig utan att blinka. (SG)
     He'd kill me without giving it a second thought.
(22) Klappar det på porten är hans första impuls att gripa yxan och rusa ut
     och slå ihjäl. (IU)
     A knock at the door? His first impulse is to seize an axe, rush out and
     beat his visitor to death.
As in many of the other cases where slå is combined with a particle, the particle
signals the result, whereas the verb primarily contributes a manner component. A
sentence such as Peter slog ihjäl ormen can be paraphrased as 'Peter killed the
snake (by hitting it)'. However, slå without a particle has the conventional
meaning 'kill' when the subject refers to a bear: Björnen slog ett lamm 'The bear
got a lamb'.
The verb slå is associated with an extensive pattern of polysemy. The
relationships between a number of the most basic meanings are shown in Figure 1
(see Viberg 1999 for discussion) and the major English equivalents tied to various
meanings are shown in Table 5. In Figure 1, the prototype is shown in the box in
the middle. Above the prototype, a number of uses are displayed where some part
of the prototypical meaning is focused. A relatively frequent use focuses on the
limb movement without any resulting contact. The typical English equivalent is a
motion verb:
(23) Pastor Tureson slog uppgivet ut med händerna. (HM)
     Pastor Tureson threw up his hands in acknowledgment.
(24) Zablonsky spread his hands. (FF)
     Zablonsky slog ut med händerna.
(25) Hon for upp och sprang runt i köket, slog armarna runt kroppen, och
     hulkade och snyftade. (AP)
     She leapt up and ran round the kitchen, flinging her arms round her
     body, sobbing and sniffing.
The verbs strike, hit and beat only have a few uses where limb movement is
focused, as in the following example:
(26) Han hade börjat skaka av köld och slog armarna om sig själv. (KE)
     He had begun shaking with cold, so he kept beating his arms round his
     chest [...]
Examples such as Per slog ut med armarna 'Per spread his arms', where slå
describes limb movement, serve as a model for the conventionalized use of slå to
describe the motion of petals in expressions like Blommorna slog ut 'The flowers
came out'. In the corpus, there is one example which shows that similar
extensions are productive to some extent:
An example like this one is based on the spatial perception of a movement that
looks like a certain type of arm movement (perhaps via the conventionalized
extension describing flowers coming out). There is no direct connection to the
sensorimotor experience in this example.
The result of defeating someone can also be focused. In English, this is
possible only with beat. In the following example, the discourse context makes it
clear that the physical part of the meaning of slå and beat should be suppressed:
[Figure 1 (not reproduced). Surviving labels: Focusing; Stationary motion
(Blommorna slog ut 'The flowers came out'); Resultative strengthening;
Specialized meanings.]
lemma in Swedish, but from a semantic point of view slåss is closely associated
with the prototypical meaning of slå. Basically, it refers to a fight with the fists
(Pojkarna slåss 'The boys are fighting') but it is often extended to a fight with
other physical means and can be extended into abstract domains as evident from
the second example below:
(29) Somliga söp och slogs så det var inte klokt. (SW)
     Some of them used to drink and fight like you wouldn't believe.
(30) Kanske slåss dom mot tystnaden, men mera troligt är att dom följer med
     den tystnad dom upptäckt. (SC)
     They may struggle with the silence but more often they coexist with the
     silence they have discovered.
The most frequent equivalent of slåss is fight but other alternatives such as
struggle, compete, contend, contest, vie and scramble for also occur.
In the construction slå sig ner (slå + Reflexive + 'down'), slå functions
semantically as a postural verb. The dominant English equivalent is sit down as in
the following example:
(31) Dag slog sig ner på golvet bredvid Ludde. (MG)
     Dag sat down on the floor beside Ludde.
(32) A fly alighted on his lower lip [...] (BO)
     En fluga slog sig ner på hans underläpp [...]
(33) Svenska och finska nybyggare slog sig ner i kolonin, som kallades Nya
     Sverige. (AA)
     Swedes and Finns settled in the colony which received the name of New
     Sweden.
Hitting a physical object can have various physical effects such as setting the
object in motion, breaking it, producing a new object or producing a sound. Such
(34) Natalie not caring about the way she makes Jane break plates
     matters; (FW)
     Att Natalie inte bryr sig om ifall hon får Jane att slå sönder tallrikar
     har också betydelse [...]
(35) Vem är det som slagit ert hö, sa främlingen. (SC)
     "Who mows your hay?" asked the stranger.
The verb slå can also be used in phrases with the meaning 'cause to form a unit'
but in that case a verbal particle such as samman 'together' or ihop (etymol. 'in
+ heap') must be used. Even if it is possible to interpret combinations such as slå
ihop or slå samman concretely involving the striking of two objects against one
another, all occurrences in the ESPC have a more abstract meaning. The most
frequent equivalent is merge but join also occurs in a couple of examples:
(37) I denna stad hade kungen sin gård, och i Sigtuna slogs också de äldsta
     daterbara mynten i landet. (AA)
     The King had his residence in that town, and the oldest dated coins
     were minted there.
Interestingly, the expression slå mynt av (lit. 'strike coins out of') has primarily
survived in modern Swedish in a metaphorical sense 'to produce a benefit for
oneself', i.e. to take advantage of a certain situation:
(38) "You'll pay for this," Con said, already seeing opportunities for
     cashing in on this young fool's misfortune. (JC)
     "Det här ska du få betala för", sa Con, som redan hade insett att det
     gick att slå mynt av den unge klåparens misslyckade försök.
The verb slå can also be used in the sense 'set in motion by hitting' as in the
example Per slog bollen över nät 'Per hit the ball over the net'. There is also a
more extended use of slå as a motion verb where the object is a liquid. The most
frequent equivalent of slå in this use is pour:
(39) Det fick dra ett tag innan gästgiverskan slog på en skvätt mjölk och
     lät den koka in. (KE2)
     When they'd soaked it all up, the innkeeper's wife poured in some milk
     and let it all putter.
In examples like this one, slå no longer refers to hitting but to a movement with
the arm and hand that is partly similar: to move liquid by tilting a container held
in the hand. (There is also a verb hälla 'pour' in Swedish which has this as its
basic meaning.)
There are several other uses more or less closely linked to the prototypical
meaning where slå refers to some specialized kind of movement with the arm and
hand. One such hand action, loosely associated with the prototypical
motion of arm and hand, is expressed by slå på/slå av, referring to the turning
of a switch on or off. The two major equivalents are turn on/off or switch on/off:
(42) Han tog ut en dyrbar och vackert ornamenterad pärm och slog upp den
     framför sig på skrivbordet. (HM)
     He took out an expensive and beautifully decorated portfolio and opened
     it before him on the desk.
(43) Jag lade ifrån mig pennan eller slog ihop boken. (AP)
     I put my pen down or closed my book.
The most frequent equivalents are open and close. When the object refers to
books and other physical objects consisting of pages joined together (newspapers,
journals, menus, etc.), slå + particle refers to opening and closing in a neutral
way. There is, however, another large group of objects referring to doors,
windows and other barriers that can be moved to allow passage (such as lid). In
this case, the use of slå + particle indicates that the action is carried out briskly
and forcefully. In addition to the neutral use of the verb open alone, there are
various equivalents that mirror the manner component:
(44) När dörren ut till hallen ånyo slogs [Passive] upp (KOB)
     When the door from the exhibition hall opened again
(45) Plötsligt slogs dörren upp (LH)
     Then the door flew open
(46) I detta nu slogs dörren upp (ARP)
     Then the door crashed open
(47) Djupt inne i mitt medvetande slogs dörrar upp (GT)
     Deep in my consciousness doors were thrown open
The expression slå igen dörren usually implies that the door was closed so
forcefully that a loud noise was produced, and this is mirrored by the frequent
equivalent slam the door:
(48) "När går ni av skiftet i kväll?" frågade han i samma ögonblick som en
     av dem slog igen bildörren. (JG)
     "When do you get off your shift?" he asked the one in the back as she
     slammed the car door.
The use of slå upp and slå igen to refer to opening and closing is so well-
established that it can be further extended to uses where hand action is not
involved. Slå upp can be used about the opening of the eyes:
(49) Eriksson slog upp ögonen. (SC)
     Eriksson opened his eyes.
Both slå upp and slå igen can be used with nouns meaning 'door' (or movable
barrier in general) as subject. In examples like the following, there is no clear
implication that a human was involved:
Another use expressing a hand action loosely associated with striking is when slå
refers to the dialling of a telephone number. In this case, the direct object is
usually numret 'the number' or siffrorna 'the numbers' and the dominant
equivalent is dial:
(51) Hon låste upp bilen och slog numret till kontoret i Ystad på
     biltelefonen. (HM2)
     She unlocked the car and dialed the number of the Ystad office on the
     car phone.
This is also an interesting example illustrating the cues that can be used for sense
identification and the choice of translation. The major cue in this case is the
semantic class of the object, which in addition to nouns meaning 'number' can be
any combination of digits which can serve as a telephone number: Peter slog 112
'Peter dialled 112'. Another example which has been discussed above is the class
of objects that can appear when slå refers to mowing or cutting hay and related
objects.
In Swedish, slå can be combined with a large number of particles. But
even in these cases the semantic class of the object is an important cue. The
combination slå upp, for example, is related to different senses and translations
depending on the semantic class of the object. The meaning 'open' appears when
the object refers to (1) door or other movable barrier, (2) book or other printed
matter consisting of pages joined together or (3) eyes. The meaning 'pour'
appears when the object refers to a liquid, especially a drink or beverage:
(52) Han slår upp vattnet och lägger i några citronklyftor. (MS)
     He pours out the water and puts a few slices of lemon in each glass.
The combination slå upp can also refer to the finding of information by opening a
book or other printed matter. This meaning is metonymically related to the
meaning 'open', which is transformed into a manner component ('find
information by turning the pages in a book'). The usual English equivalent in this
case is look up:
(53) I looked up the name Gahan. (SG)
     Jag slog upp namnet Gahan.
Typical objects in this case are words which refer to verbal or numerical
information such as 'name' and 'telephone number' but in principle any word
used metalinguistically could appear as object: Peter slog upp skiftnyckel (i sin
ordbok) 'Peter looked up wrench (in his dictionary)'. In print, (single) quotes are
often used to signal that a word is used metalinguistically but in speech topical or
situational cues must be used.
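The dependence of the English equivalent on the semantic class of the object can be sketched as a two-step lookup. The class labels and the small object list are hypothetical and cover only the object types named above; the function name is introduced here for illustration.

```python
# Hypothetical assignment of Swedish object nouns to semantic classes.
OBJECT_CLASS = {
    "dörren": "barrier",         # 'the door'
    "boken": "printed_matter",   # 'the book'
    "ögonen": "eyes",            # 'the eyes'
    "vattnet": "liquid",         # 'the water'
    "namnet": "information",     # 'the name'
}

# English equivalent of the particle verb for each object class, following the text.
EQUIVALENT_BY_CLASS = {
    "barrier": "open",
    "printed_matter": "open",
    "eyes": "open",
    "liquid": "pour",
    "information": "look up",
}

def translate_sla_upp(obj):
    """Return the English equivalent suggested by the object, or None if unknown."""
    return EQUIVALENT_BY_CLASS.get(OBJECT_CLASS.get(obj))

for noun in ("ögonen", "vattnet", "namnet"):
    print(noun, "->", translate_sla_upp(noun))
```

A lookup of this kind cannot handle metalinguistic uses, where any word may appear as object and only topical or situational cues separate 'look up' from the other senses.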
In comparison with strike and hit, the semantic class of the subject plays a
less prominent role for the interpretation of slå since human subjects dominate so
strongly. Inanimate physical objects do occur as subjects but only to a certain
extent. Natural forces occur as subjects of slå to approximately the same extent as
with the English verbs. When the subject refers to lightning, the equivalent is
always strike but when it refers to rain and waves or fire and smoke, a wide range
of physical contact verbs are used (bang, batter, beat, crash, hammer, smack) in
addition to a few motion verbs (gush, sprout, sweep). Usually, various fine-
(54) Regnbyarna slog mot vindrutan. (HM2)
     Rain squalls hammered against the windshield.
(55) Grått regn slår mot glas. (PCJ)
     Grey rain batters the glass.
The verb slå can also be used as a mental verb and take a proposition or a mental
noun such as tanke 'thought' as subject. (The uses of slå with a mental subject are
treated together with other mental uses in Table 5.) A sentential subject is usually
extraposed and introduced by a dummy subject (det 'it') as in the English
construction it struck me that-S (Swed. det slog mig att-S):
(56) Efteråt slog det mig att det kanske inte går att drömma att man
     dör. (BL)
     Later it struck me that it is perhaps not possible to dream that you die.
(57) Det slog mig att det var mycket länge sedan jag känt mig generad. (LH)
     It occurred to me it had been quite a while since I'd felt embarrassment.
(58) Det slår mig att han antagligen inte alls hör till kongressen. (MS)
     The thought crosses my mind that he probably doesn't have anything to
     do with the convention.
(59) And this, it suddenly came to her, might well be the wages of sin. (FW)
     Och detta, slog det henne plötsligt, skulle mycket väl kunna vara
     syndastraffet.
Mental nouns such as tanke 'thought, idea' can be used as subjects when the
object is human:
(60) Tanken slog mig att Pekka kanske hade seglat iväg med MacDuffs
     kvinna (BL)
     It came to my mind that Pekka had perhaps sailed away with MacDuff's
     woman
There are two phrasal combinations with slå that are relatively frequent in the
ESPC, especially in the non-fiction texts, viz. slå fast and slå vakt om. The phrase
slå fast means literally 'fasten by hitting'. As a mental metaphor it refers to
forming a decision that one sticks to. A number of different equivalents are used,
such as establish, specify, state:
(63) Jag tycker ocks att man hr borde In my view, we should have used
ha tagit chansen att sl fast att this opportunity to establish that
parlamentets ordfrande skall utses the President of Parliament should
p fem r [] (ESJO) be elected for five years []
The phrasal combination sl vakt om (lit. strike guard of) is not transparent in
present-day Swedish. The most frequent equivalents are safeguard and protect:
4. Conclusion
The present paper is relatively data-oriented and an account has been given of a
rather large number of cases where English and Swedish contrast. However, an
attempt has also been made to characterize the contrasts between the two
languages in general terms based on two different frameworks. With respect to
the conceptual representation, Swedish slå is grounded more firmly in
sensorimotor experience of limb movement than strike, hit and beat, even if
sensorimotor experience plays an important role also for the conceptualization of
the English verbs. At a general level, the extensions of the major verb of hitting to
other types of hand action probably represent a universal tendency. The polysemy
of the Chinese equivalent dǎ 'hit' is to a great extent motivated by the fact that
the prototypical meaning refers to hand action according to Gao (2001).
However, a comparison at a more detailed level with Swedish slå shows that
there appears to be great variation with respect to the specific hand actions (out of
the many potential ones) that are conventionally associated with the verb whose
prototypical meaning is 'hit'.
With respect to the process of word sense identification, there is also a
general tendency. In both English and Swedish, there are many types of linguistic
disambiguation cues. It appears, however, that the major equivalents of strike and
hit can be identified with the help of the semantic class of the subject, whereas the
semantic class of the subject is helpful in fewer cases in Swedish due to the
relative dominance of human subjects of slå. The semantic class of the object, on
the other hand, is utilized as a cue to distinguish a rather great number of senses
of slå and appears to be more important for slå than it is for hit, strike and beat.
The relative importance of various types of cues varies a great deal within a
language depending on the type of lexical item. The major meanings of Swedish
få 'get; may' such as Possession, Modal, Causative can be identified with the help
of the syntactic frame (or construction), whereas the subtle but important contrast
between the two modal meanings Permission and Obligation is identified
primarily with the help of pragmatic factors (Viberg 2002).
The semantic class of the subject and object referred to in this paper can be
compared to the notion of local context (Miller and Leacock 2000; see the
introduction). To a large extent it will be available within such a narrow window
as 2 words and is local in that sense. The concept of argument structure of which
subject and object form a central part is, however, different from simple co-
occurrence. In a lexical study, it appears to be justified to provide the more
structured information even if it is still an open question exactly how this
information is used by human or machine. As has been exemplified several times
in this paper, topical and pragmatic information will be needed in many cases to
reach the correct interpretation.
The comparison of Swedish and English has turned up many differences in
semantic structure in spite of the fact that the two languages are rather closely
related. As a matter of fact, most of the verbs treated in this paper have cognates
in the other language: slå - slay, strike - stryka 'stroke', hit - hitta 'find' (a
As can be seen, the extension of slå into the mental domain (it struck him that-S)
has a parallel in French in addition to English, whereas the extensions to meanings
such as opening and pouring appear to be language-specific characteristics of
Swedish in spite of the fact that they represent natural extensions from the
prototypical conceptual representation of Swedish slå. To be able to say what is
universal, languages that are genetically and geographically more distant from
Swedish must be taken into consideration, but as already mentioned certain types
of extension such as the extension from hitting to various other hand actions have
parallels in non-European languages such as Chinese.
Note
1. In the following corpus examples the original text is placed first. For an
explanation of the text codes, see
http://www.englund.lu.se/research/corpus/corpus/webtexts.html.
References
Viberg, Å. (1985), Hel och trasig. En skiss av några verbala semantiska fält i
svenskan, in: Svenskans beskrivning 15: 529-554. Göteborg: Göteborgs
universitet.
Viberg, Å. (1996), The meanings of Swedish dra 'pull': a case study of lexical
polysemy. EURALEX'96. Proceedings. Part I, 293-308. Department of
Swedish, University of Göteborg.
Viberg, Å. (1999), Polysemy and differentiation in the lexicon. Verbs of physical
contact in Swedish, in: J. Allwood and P. Gärdenfors (eds), Cognitive
semantics. Meaning and cognition. Amsterdam: Benjamins. 87-129.
Viberg, Å. (2002), Polysemy and disambiguation cues across languages. The
case of Swedish få and English get, in: B. Altenberg and S. Granger (eds),
Lexis in contrast. Amsterdam: Benjamins. 119-150.
Exploring theme contrastively: the choice of model
Anna-Lena Fredriksson
Göteborg University
Abstract
The aims of this paper are to discuss different approaches to the notion of theme
and to show how parallel corpora can successfully be used for cross-linguistic
analyses of theme.1 The realisation of theme is language-specific which can be
problematic for contrastive studies of thematic structures. In this paper, I start by
describing theme in English following Systemic Functional Grammar (Halliday
1994) and discuss questions concerning the delimitation of the theme from the
rheme in English, which is relevant also for monolingual and cross-linguistic
studies. In a brief overview of various approaches to theme in other languages,
monolingual as well as cross-linguistic, I then demonstrate that the positions
taken to theme differ and the original approach, which is English-based, may
have to be modified to suit other languages simply because different languages
have different ways of realising this function.
1. Introduction
Parallel corpora offer great possibilities for contrastive text analysis.2 In recent
years studies have covered a variety of features in the languages involved and
often combined a syntactic and a textual feature. Studies have for example
focussed on the thematic uses of non-referential there in English-Finnish texts
(Mauranen 1999), sentence openings and textual progression (English-Swedish)
(Svensson 2000), connectors and sentence openings (English-Swedish)
(Altenberg 1998), word order and thematic structure in English and Norwegian
(Hasselgård 1998, 2000), and thematic development in English-German texts
(Ventola 1995). To my knowledge, Ghadessy and Gao (2001), investigating
English and Chinese, is the only purely quantitative study of thematic
development in parallel texts. The usefulness of this kind of research for
translators and translator training as well as for machine translation is often
stressed.
The present paper originates in problems that I have encountered in my
ongoing thesis work on passives from a corpus-based contrastive English-
Swedish perspective. It is well-known that the passive is a multifunctional
structure that provides a useful way of omitting the agentive subject where it can
be ignored, or of postponing an agentive subject by making it the agent in cases
where we want to give it end focus. At the same time, it gives thematic status to
the affected entity (cf. Svartvik 1966, Granger 1983, Quirk et al. 1985: 1390f.,
Péry-Woodley 1991, Teleman et al. 1999: 4: 379ff. among others). Such
operations facilitate a smooth development of the text. Its important role in text
organisation gives rise to the question of how passive sentences in original texts
are treated by translators. To what extent is the thematic structure preserved or
altered in translation? Baker points out that '[r]endering a passive structure by an
active structure, or conversely an active structure by a passive structure in
translation can affect the amount of information given in the clause, the linear
arrangement of semantic elements such as agent and affected entity, and the focus
of the message' (1992: 106). But how can we compare thematic structure across
languages?
Due to the simple fact that language systems and their realisations differ,
difficulties often arise when we want to study text structure across languages. We
can assume that in all languages the clause has some kind of text-related
organisation, and we can acknowledge theme and rheme as basic notions for the
organisation of the message presented in clauses. However, the realisation of
these notions may be specific to each language (e.g. Fries 1995a: 15). Even in
English and Swedish, which are both SVO languages, it is sometimes difficult to
determine which elements are to be considered thematic. Consider (1):
(1) (a) EO: Recently, some £2 billion has been invested in the area; (SUG1)3
(b) ST: Nyligen har ca 2 miljarder pund investerats i Docklands;
Lit: Recently has approximately 2 billion pounds invested-PAST-PASS in
Docklands.
In the Swedish translation (1b) the finite operator precedes the subject. The
inversion occurs because Swedish, like many other Germanic languages, is a
verb-second (V2) language which requires the verb to occupy second position in
declarative main clauses. Consequently, each time a non-subject occurs in initial
position, subject-predicate inversion takes place. Such a typological difference
may influence the choice of model for a thematic analysis.
In cross-language research we need descriptions of the way languages
organise the clause thematically and syntactically, and from there we may
proceed to finding a model of analysis that fits the languages compared. The
present paper discusses the theme-rheme system within Systemic Functional
Grammar (SFG) (Halliday 1967, 1994) which provides a much used model for
thematic analysis in English. Despite the fact that SFG has a strong orientation
towards English which is a potential problem for using it in other languages, the
theory has had considerable influence on translation theorists and on translation
studies of various kinds (cf. Hatim and Mason 1990, 1997, Baker 1992, House
1997, Steiner 2001, Teich 2001), and it has been applied to a variety of
languages. The main focus of this paper is on cross-linguistic descriptions of the
theme-rheme structure. How has the theme been interpreted, defined, and
delimited from the rheme in various languages? Can the notion of theme be
modified for contrastive purposes? I will show that studies of this kind need to be
corpus-based, and that parallel corpora prove useful for describing the theme-
rheme structure both monolingually and contrastively.
The paper is organised as follows. Section 2 gives a presentation of the
concept theme in English following Halliday (1994) and also discusses how far
into the clause the theme reaches. Section 3 contains a brief overview of some
approaches to theme in other languages, and Section 4 discusses different models
used in cross-linguistic theme-rheme analysis. Concluding remarks are given in
Section 5.
2. Theme in English
As explained above, SFG identifies two textual units in the clause in English: the
theme and the rheme, which appear in the clause in that order.4 The theme can be
described positionally and functionally. Basically, the theme can be identified by
its initial position in the clause. Functionally, Halliday defines the theme as '[t]he
element which serves as the point of departure of the message; it is that with
which the clause is concerned' (1994: 37). In other words, '[i]t is the element the
speaker selects for grounding what he is going to say' (Halliday 1994: 34).
Although thematic structure and information structure (Given-New) are
separate notions in SFG, there is a strong correlation between them, and we may
say that the theme typically contains information that is contextually or otherwise
retrievable (given information) (Halliday 1994: 299). The rheme, on the other
hand, consists of that which the speaker says about the theme. In terms of
newsworthiness, the rheme typically has a higher degree of newsworthiness than
the theme. The notion of theme is connected with the mood system in that the
choice of theme depends on the choice of mood. For example, in the unmarked
case in declaratives, the theme is conflated with the subject as in (2):
(2) EO: We [Exp-Th/Pa] had never seen builders work like this. Everything [Exp-
Th/Pa] was done on the double: scaffolding [Exp-Th/Pa] was erected and a
ramp of planks [Exp-Th/Pa] was built before the sun was fully up, the
kitchen window and sink [Exp-Th/Pa] disappeared minutes later [...] (PM1)5
Every unit given in bold in (2) is an unmarked theme. The concept of markedness
can be understood as a scale on which an unmarked theme is the option
representing the most typical choice in terms of probability and frequency of
usage. An unmarked theme is placed at one end of the scale and the further we
move away from the unmarked option(s), the more marked the choice is.
According to Halliday, the most marked theme of a declarative clause functions
as complement as in (my emphasis and notation) A bag-pudding [Exp-Th/Pa] the
King did make (Halliday 1994: 44). At an intermediate position we find clause-
initial circumstantial adjuncts (adverbial groups and prepositional phrases) which
make up the entire theme:
(3) EO: A few months later [Exp-Th/C] Henry was called in to Detroit again [...]
(RL1)
The themes we have seen so far are all experiential themes denoting
participants or circumstantial phenomena. This theme type belongs within the
experiential metafunction which constitutes one of the three metafunctions of
language according to Halliday. The other two are the interpersonal metafunction
and the textual metafunction, both of which may also contribute to forming a
theme. According to Halliday (1994: 52ff.), the theme always includes one and
only one experiential element, which is called the topical theme, but this item
may be preceded by one or several textual and/or interpersonal elements resulting
in a multiple theme. Figure 1 illustrates an extended multiple theme in English
with subtypes of the textual and interpersonal components.
well          but         then         Ann       surely  wouldn't  the best idea  be to join the group
continuative  structural  conjunctive  vocative  modal   finite    topical
textual                                interpersonal               experiential
Theme                                                                             Rheme
Figure 1. Extended multiple theme (Halliday 1994: 55).
What are the principles behind this stacking of thematic items? First, some textual
and interpersonal elements (e.g. connectors, modal adjuncts, and relative
pronouns) regularly take clause-initial position, and because of this their thematic
status is somewhat attenuated (Halliday 1994: 52). Second, their overall
function can be regarded as orienting (cf. Gómez-González 1998: 83, Mauranen
1993) and as a consequence it is difficult to say that they express what the clause
is about. Therefore, when such elements occur in initial position, they do not
exhaust the thematic potential of the clause but allow a referential element to be
part of the theme. According to Halliday, the unmarked order of components
within the structure of a multiple theme is textual < interpersonal <
experiential/topical. While the experiential element typically comes last in the
theme and constitutes topical theme, the order of the textual and interpersonal
components may be switched. Finally, everything that follows the topical theme
constitutes the rheme. Example (4) illustrates a multiple theme of a more modest
length than that in Figure 1:
(4) EO: Unfortunately [Int-Th/Mo], part two of the lecture (Why The Earth Is
Becoming Flatter) [Exp-Th/Pa] was interrupted by a crack of another burst
pipe, and [Txt-Th/St] my education [Exp-Th/Pa] was put aside for some virtuoso
work with the blow-lamp. (PM1)
Flatter) which is also the subject. Further, the conjunction and is a textual theme
preceding the topical my education.
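The delimitation described above can be stated procedurally. The following Python sketch is an illustration only and is not part of Halliday's account: it assumes that each clause has already been segmented by hand into constituents labelled with a metafunction, and the names `Constituent` and `halliday_theme` are hypothetical. The data are the two clauses of (4).

```python
from typing import List, NamedTuple, Tuple


class Constituent(NamedTuple):
    text: str
    metafunction: str  # "textual", "interpersonal" or "experiential"


def halliday_theme(
    clause: List[Constituent],
) -> Tuple[List[Constituent], List[Constituent]]:
    """Split a clause into (theme, rheme).

    The theme runs from the start of the clause up to and including the
    first experiential constituent (the topical theme); the rest is rheme.
    """
    for i, constituent in enumerate(clause):
        if constituent.metafunction == "experiential":
            return clause[: i + 1], clause[i + 1:]
    raise ValueError("no experiential constituent, so no topical theme")


# Hand-segmented clauses from example (4); the segmentation is an assumption.
clause_4a = [
    Constituent("Unfortunately", "interpersonal"),
    Constituent("part two of the lecture", "experiential"),
    Constituent("was interrupted by a crack of another burst pipe", "experiential"),
]
clause_4b = [
    Constituent("and", "textual"),
    Constituent("my education", "experiential"),
    Constituent("was put aside for some virtuoso work with the blow-lamp", "experiential"),
]

for clause in (clause_4a, clause_4b):
    theme, rheme = halliday_theme(clause)
    print([c.text for c in theme], "|", [c.text for c in rheme])
```

Because the function stops at the first experiential constituent, it can never return more than one topical element, which is exactly the restriction that the extended notions of theme discussed next set out to relax.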
As we have seen, multiple themes come in slightly different shapes, which
opens the question of where the transition between theme and rheme takes place.6
Matthiessen suggests that the boundary of the theme be moved. Consider (5)
(adapted from Matthiessen 1992: 51):
(5) A. Do you mean we're overdressed? said the charming father of the
Family.
B. [Place:] In England, [Time:] at this moment, [Purpose:] for this
occasion, [Participant:] we would be quite over-dressed.
(6) Towards the end of his life [Exp-Th/C], Freud [Exp-Th/Pa] concluded that he
was not a great man (Downing 1991: 127).
(7) This of course was not because the government failed in its supposed duty
as provider but largely because energy prices rose considerably in relation
to other prices
Further, just as there may be more than one textual and/or interpersonal item in a
multiple theme, an Extended multiple Theme may contain not only one but
several experiential elements, marked or unmarked, resulting in complex topical
themes.
It is important to consider the significance of the theme in the overall
development of the text. A number of studies (e.g. Francis 1989, Fries 1983) have
shown that the theme plays an important role in the organisation of discourse, or
as Halliday puts it, '[t]he choice of Theme, clause by clause, is what carries
forward the development of the text as a whole' (1994: 336). As shown by Daneš
(1974) the thematic progression (or 'method of development', Fries 1983) of a
piece of text tends to follow certain identifiable patterns. Thus, this discourse
perspective supports Matthiessen's (1992) proposal for an extension of theme.
Consider (8) from Matthiessen (1992: 51):
(8) Autumn passed and winter [passed], and in the spring the Boy went out
to play in the wood. While he was playing, two rabbits crept out from the
bracken and peeped at him.
(9) EO: The Pope himself probably survived only because he isolated himself
from everybody else in his huge palace. I suppose isolation was a very
natural impulse. Everywhere in Europe [Exp-Th/C] people [Exp-Th/Pa] resorted
to it, whether [Txt-Th/St] they [Exp-Th/Pa] were noblemen or priests or
intellectuals or ordinary peasants. (ABR1)
The topical theme of the subclause (they) has the same referent as the second
theme of the main clause (people), and following Matthiessen the latter is part of
a complex theme, whereas Halliday has it as part of the rheme. Example (10)
starts with a multiple theme consisting of one interpersonal and one experiential
component. Here we find the first mention of the participant I in this stretch of
text. The next sentence has a complex experiential theme in the first clause (The
next morning and I) and I is taken up as theme both in the subclause and in the
subsequent sentences:
(10) EO: With regret [Int-Th/Mo] I [Exp-Th/Pa] put the diary into my other trouser
pocket. The next morning [Exp-Th/C] I [Exp-Th/Pa] supposed, I [Exp-Th/Pa] would
have to telephone his office with the dire news. I [Exp-Th/Pa] couldn't
forewarn anyone as I [Exp-Th/Pa] didn't know the names, let alone the phone
numbers, of the people who worked for him. I [Exp-Th/Pa] knew only that he
had no partners, as he had said several times that the only way he could
run his business was by himself. (DF1)
Again, a safe starting point seems to be to assume that in all languages the clause
has some kind of text-related organisation. The concept of theme is thought of as
a language universal, meaning that there is always one unit expressing what the
clause is concerned with (or is about), and one unit, the rheme, saying
something about the other unit. The realisation of the theme, however, is
language-specific: in English it is realised by initial position, whereas in Japanese
for example, it is expressed by the postposition particle wa (Halliday 1994: 36f.;
see also Rose 2001). Basically then, theme can be viewed from at least two
different angles: from its functional definition and from its realisation.
In their account of theme in Danish, Andersen et al. (2001) find that both
the aboutness aspect and the position aspect apply to Danish in the same way as
they apply to English. In other words, theme represents the point of departure of
the clause as message and all theme types occur in clause-initial position.
However, the Danish system of theme differs radically from the English system
in at least one respect: no distinction is made between topical and interpersonal
theme since it is found that a theme may consist of interpersonal information
only. Consider the following examples taken from Andersen et al. (2001: 175f.):
Being experiential in meaning, the themes in examples (11a) and (11c) are
analogous with English themes. The difference between (11a) and (11b) is that
the latter has a fronted modal adjunct which is interpersonal in meaning, and in
contrast to any English model, this element can and does make up the whole
theme. A further contrast here is that the subject is placed in postverbal position
in accordance with the V2 constraint. Example (11d) is another instance in which
the theme, here the finite operator, is primarily interpersonal and forms the
entire theme (Andersen et al. 2001: 177). A multiple theme in Danish may
encompass textual items followed by an interpersonal or experiential item.
Andersen et al. follow the initial position criterion and describe how theme is
realised in Danish in different clause types, but do not further discuss the
functional definition.
Steiner and Ramm (1995) offer an account of theme in German, also a V2
language, in which they establish a close connection between theme and the
traditional notion of Vorfeld in German grammar, and as a consequence there is
no stipulation that there is always an ideational element in the theme (1995: 62).
They find that a simple theme may consist of a constituent from either the textual,
the interpersonal, or the experiential metafunction. The theme in (12) can be
either textual (trotzdem), or interpersonal (vielleicht) (1995: 81):
However, it is doubtful whether we can say that textual and interpersonal items
such as trotzdem and vielleicht, and the interpersonal måske in Danish, express
what the clause is about, or that with which the clause is concerned in
Halliday's (1994: 37) wording. Rather, they only serve an orienting function
(Gómez-González 1998: 83, Mauranen 1993).
There is no doubt that a parallel corpus may provide data for modelling a way of
analysing theme-rheme structures contrastively. The data obtained often reveal
both the strengths and the weaknesses of the model one is using. Since thematic
structure is clearly discourse-related, it is crucial that the model is tested on
(13) (a) EO: Surely [Int-Th/Mo] I [Exp-Th/Pa] 'd been freed from those painful
memories long ago. (ABR1)
(b) ST: Visst [Int-Th/Mo] hade jag [Exp-Th/Pa] för länge sedan blivit befriad från
de där plågsamma minnena.
Lit: Surely had I for long ago become freed from those painful
memories.
Example (13b) shows that the second thematic element of the English text in
(13a) has been postponed to post-auxiliary position. The question is then: where
does the theme end and the rheme begin? As we have seen, Andersen et al.
(2001), as well as Steiner and Ramm (1995), interpret only the interpersonal
modal adjunct as theme in cases like (13b). In many other situations English and
Swedish behave in similar ways, but still the English model is not ideal for an
English-Swedish contrastive analysis. Clearly, a model developed for one
language is not necessarily applicable to another one. A number of researchers
have in fact pointed to the difficulties of finding models that can be used for
contrastive analyses and in the remainder of this section we will look at a few
corpus-based alternative solutions that modify the English definition of theme.
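The mismatch in (13) can be made concrete with a small sketch. It assumes hand-annotated constituents and a hypothetical helper, `prefinite_theme`, which treats everything before the finite verb as thematic, in the spirit of the position-based readings mentioned above. The role labels and the segmentation are illustrative assumptions, not an annotation scheme taken from the studies cited.

```python
from typing import List, NamedTuple


class Constituent(NamedTuple):
    text: str
    role: str  # e.g. "modal adjunct", "subject", "finite", "rest"


def prefinite_theme(clause: List[Constituent]) -> List[str]:
    """Return the texts of all constituents that precede the finite verb."""
    theme = []
    for constituent in clause:
        if constituent.role == "finite":
            break
        theme.append(constituent.text)
    return theme


# Example (13), segmented by hand (an assumption made for this sketch).
english_13a = [
    Constituent("Surely", "modal adjunct"),
    Constituent("I", "subject"),
    Constituent("'d", "finite"),
    Constituent("been freed from those painful memories long ago", "rest"),
]
swedish_13b = [
    Constituent("Visst", "modal adjunct"),
    Constituent("hade", "finite"),
    Constituent("jag", "subject"),
    Constituent("för länge sedan blivit befriad från de där plågsamma minnena", "rest"),
]

print(prefinite_theme(english_13a))  # modal adjunct and subject
print(prefinite_theme(swedish_13b))  # modal adjunct only
```

Under this purely positional criterion the English clause has a two-part theme, whereas the V2 constraint leaves the Swedish clause with the modal adjunct alone, so the participant that is thematic in the original falls outside the theme in the translation.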
Mauranen (1999), who has investigated English and Finnish on the basis
of a parallel corpus, suggests a model consisting of an orienting theme realised
by fronted material, e.g. connectors and adverbials, and a topical theme realised
by nominal groups (Finnish) and a subject (English) (Mauranen 1999: 72):
In this model, the cut-off point between the theme and the rheme is placed before
the verb, and the rheme hence contains the verb plus optional constituents.
Despite the fact that English and Finnish are typologically different in many
ways, a cross-linguistic comparison of thematic structure is possible (see also
Mauranen 1993).
English and Norwegian (and Swedish) are more closely related than
English and Finnish. Nevertheless, Hasselgård (1998, 2000) observes difficulties
in applying the SFG model of theme for comparing English and Norwegian, and
has used different definitions of theme. The crucial point is again the V2
constraint requiring the finite verb to occur in second position. The basic
definition in Hasselgård (1998) includes in the theme the initial part of the
sentence up to and including the first experiential constituent. However, since the
finite verb is by default the second constituent, each time a non-subject occurs in
initial position a choice is made to regard this finite verb as a structural theme [a
subtype of textual theme], so that in cases where the fronted non-subject is a
conjunct or a disjunct adverbial, the theme will include the first experiential
element after the finite verb (1998: 148). This is seen as analogous with the
thematic structure of polar interrogatives in Halliday which have a two-part (i.e. a
multiple) theme consisting of the finite verb followed by the subject (Halliday
1994: 46): Is anybody at home? and Can you find me an acre of land? However,
an objection can be raised against this identification of theme, since it may result
in clauses consisting of only a theme and no rheme, as in (15):
The process in (15a) consists of the phrasal verb gå under 'go under' which is to
be treated as a lexical unit, and the theme hence extends over the whole clause.
An alternative approach is to disregard word order differences between the
languages and adhere to the strict Hallidayan definition taking the first
experiential element as the topical theme (Hasselgård 2000). The data, taken from
the English-Norwegian Parallel Corpus, show clearly the differences in the
structure of themes that this approach results in. Consider (16) and (17) from
Hasselgård (2000: 15):
(17) (a) But [Txt-Th/Str] first [Txt-Th/Cj] I [Exp-Th/Pa] needed this brief withdrawal.
(b) Men [Txt-Th/Str] først [Txt-Th/Cj] trengte [Exp-Th/Pr] jeg denne kortvarige
ensomheten.
Lit: But first needed I this brief withdrawal.
assume that the subject typically conveys Given information and the predicator
typically New information, and that the unmarked order of these components is
Given before New. Moreover, in the unmarked case, a speaker will choose the
Theme from within what is Given and locate the focus, the climax of the New,
somewhere within the rheme (Halliday 1994: 299). Having a process/verb
typically conveying New information in thematic position is therefore counter-
intuitive. New information in the theme does indeed occur (cf. Fries 1983,
1995b), but is seen as a marked alternative in English. On the other hand, as
Hasselgård points out, word order is not open to speaker choice in this case but is
governed by syntactic rules, and the subject-predicator inversion is not likely to
have any major consequences on the overall thematic structure or information
structure of a text.
An approach similar to that of Hasselgård (2000) is taken by McCabe
(1999) in a contrastive analysis of thematic patterns in English and Spanish
history texts. She counts as thematic everything up to and including the first
experiential element encountered in the clause. As in English, theme in Spanish is
realised by clause-initial position. Because VSO word order is permitted in
Spanish, an unmarked theme can also be realised by a process, creating a pattern
of theme that is different from English.
Teich (2001) draws partly on Steiner and Ramm's account of theme in
German (see Section 3) in her corpus-based English-German analysis. The
English theme is analysed according to the original SFG model, but, due to the
V2 constraint in German, the German theme is equated with Vorfeld which
incorporates anything which comes before the finite verb. Hence, only elements
occurring before the finite verb are seen as thematic. Excluding the finite
auxiliary from the theme when it occurs before the first experiential element (as
in (19b)) deviates from the Hallidayan model. These definitions result in themes
as in (18) and (19) (Teich 2001: 202):
The results show different theme patterns in English and German. In (19b) a
textual adjunct forms the entire theme, whereas in English (19a) there is a
multiple theme with a textual adjunct and a subject. In contrast to some other
contrastive approaches discussed here, Teich, like McCabe, does not attempt to
find one model that fits both languages, but chooses to use two different
interpretations of theme. Since theme in German is realised differently from
theme in English, two different definitions are used.
(20) Autumn [Exp-Th/Pa] passed and [Txt-Th/St] winter [Exp-Th/Pa] [passed], and [Txt-
Th/St] in the spring [Exp-Th/C] the boy [Exp-Th/Pa] went out to play in the wood.
While [Txt-Th/St] he [Exp-Th/Pa] was playing, two rabbits crept out from the
bracken and peeped at him. [my notations and emphasis added]
This example, as well as (9) and (10), repeated here as (21) and (22), show that
not only clauses or sentences in isolation, but also the context has to be taken into
account when deciding on the theme-rheme transition point.
(21) The Pope himself probably survived only because he isolated himself from
everybody else in his huge palace. I suppose isolation was a very natural
impulse. Everywhere in Europe [Exp-Th/C] people [Exp-Th/Pa] resorted to it,
whether [Txt-Th/St] they [Exp-Th/Pa] were noblemen or priests or intellectuals or
ordinary peasants. (ABR1)
(22) With regret [Int-Th/Mo] I [Exp-Th/Pa] put the diary into my other trouser pocket.
The next morning [Exp-Th/C] I [Exp-Th/Pa] supposed, I [Exp-Th/Pa] would have to
telephone his office with the dire news. I [Exp-Th/Pa] couldn't forewarn
anyone as I [Exp-Th/Pa] didn't know the names, let alone the phone numbers,
of the people who worked for him. I [Exp-Th/Pa] knew only that he had no
partners, as he had said several times that the only way he could run his
business was by himself. (DF1)
An extended theme which includes all preverbal elements thus allows not only
one but several experiential elements in the theme. So how does this work in
Swedish? Altenberg observes that in translations from English into Swedish, the
components of an English multiple theme have to be split up and spread out
beyond the finite verb due to the V2 constraint in Swedish (1998: 138). The
Swedish translation of (20) reads as follows:
(23) Hösten [Exp-Th/Pa] gick och [Txt-Th/St] vintern [Exp-Th/Pa] [gick], och [Txt-Th/St] på
våren [Exp-Th/C] gick pojken [Exp-Th/Pa] ut för att leka i skogen. Medan
[Txt-Th/St] han [Exp-Th/Pa] lekte kröp två kaniner fram ur ormbunken och kikade på
honom.
Lit: Autumn passed and winter [passed], and in the-spring went the-boy
out to play in the-wood. While he played crept two rabbits out of
the-bracken and peeped at him.
Comparing the English text in (20) with (23), we can see that the distribution of
themes differs. Consider the long multiple theme in (20), and in the spring the
boy, which is split into two chunks when translated into Swedish. A preliminary
term for this type of theme is split theme (cf. Hasselgård 2000: 24). A split theme
(in a declarative clause) can be defined as including all elements preceding the
finite verb plus the postverbal subject. Preverbal elements may be any
combination of textual, interpersonal, and experiential elements occurring in this
position. There is always an experiential element in the theme. Examples (24)–(26)
illustrate the definition of theme suggested here. First, (24) has subjects in
initial position which are simple unmarked themes:
The languages behave in similar ways in such structures. Let us now look at some
multiple themes involving textual, interpersonal, and experiential elements:
(25) (a) EO: Nevertheless [Txt-Th/Cj] he [Exp-Th/Pa] loved her dearly, and [Txt-Th/St]
over the week past [Exp-Th/C] he [Exp-Th/Pa] had come to love her even
more [...]. (RDA1)
(b) ST: Inte desto mindre [Txt-Th/Cj] älskade han [Exp-Th/Pa] henne djupt, och
[Txt-Th/St] under den vecka som gått [Exp-Th/C] hade han [Exp-Th/Pa] kommit
att älska henne ännu mer [...].
Lit: Nevertheless, loved he her dearly, and over the week past had he
come to love her even more [...].
(26) (a) SO: "Frankly [Int-Th/Mo], I [Exp-Th/Pa] 'm assuming somebody killed him."
(SG1)
(b) ET: "Uppriktigt sagt [Int-Th/Mo] är jag [Exp-Th/Pa] övertygad om att någon
dödade honom."
In (25), there are two experiential themes in both English and Swedish. In other
words, all types of theme components, not only textual and interpersonal ones,
may be stacked and they do not necessarily occur in any typical order (Halliday
1994: 53). The English multiple themes in (25a) comprise the elements Textual <
Experiential in the first clause, and Textual < Experiential < Experiential in the
second clause. In Swedish, on the other hand, we have split themes. In the first
clause the theme is made up of the components Textual < non-thematic element <
Experiential, and in the second clause we find the elements Textual <
Experiential < non-thematic element < Experiential. The initial conjunctive
adjunct and time adverbial trigger inversion of the subject and the finite operator,
and the same holds for the modal adjunct in (26a).
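The split-theme definition given above (all elements preceding the finite verb, plus a postverbal subject) can be made operational in a few lines. The following is only a minimal sketch under stated assumptions: the `Element` class, its `kind` labels and the `split_theme` function are hypothetical names introduced for illustration, clauses are assumed to be pre-segmented and labelled by hand, and the inverted subject is assumed to follow the finite verb directly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    text: str
    kind: str  # "textual", "interpersonal", "experiential", "finite" or "other"
    is_subject: bool = False


def split_theme(clause: List[Element]) -> Tuple[List[Element], List[Element]]:
    """Return (theme, remainder) for one declarative clause."""
    finite = next((i for i, e in enumerate(clause) if e.kind == "finite"), None)
    if finite is None:  # no finite verb found: nothing to split on
        return list(clause), []
    theme = list(clause[:finite])
    remainder = list(clause[finite:])
    # V2 inversion: if no subject precedes the finite verb, pull in the
    # subject that directly follows it (adjacency is an assumption here).
    if not any(e.is_subject for e in theme):
        if len(remainder) > 1 and remainder[1].is_subject:
            theme.append(remainder.pop(1))
    return theme, remainder


# Second clause of (25b), segmented by hand.
swedish = [
    Element("och", "textual"),
    Element("under den vecka som gått", "experiential"),
    Element("hade", "finite"),
    Element("han", "experiential", is_subject=True),
    Element("kommit att älska henne ännu mer", "other"),
]
# Second clause of (25a): the subject already precedes the finite verb.
english = [
    Element("and", "textual"),
    Element("over the week past", "experiential"),
    Element("he", "experiential", is_subject=True),
    Element("had", "finite"),
    Element("come to love her even more", "other"),
]

sv_theme, sv_rest = split_theme(swedish)
en_theme, en_rest = split_theme(english)
print([e.text for e in sv_theme])
print([e.text for e in en_theme])
```

In both languages the sketch yields the same three thematic elements (Textual, Experiential, Experiential); the only difference is that the Swedish finite verb is left behind as a non-thematic element between them.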
Since there may be more than one experiential item in a theme it is not
possible to determine whether one element is more topical than another.
Consequently, the concept topical theme has no function in this approach.
Circumstances and participants acting as theme may simply be referred to as
circumstantial theme and participant theme (Rose 2001: 127).
The model proposed here developed out of a need. There simply did not
seem to be a well-functioning model to compare theme in English and Swedish.
The main advantage of this approach is that it is operational and suits the
purposes of my study. A second advantage is that there is an underlying discourse
basis that is larger than the clause - the role of an item in the surrounding context
was taken into consideration in determining the transition point between theme
and rheme. We cannot neglect the fact that themes contribute to the method of
development of a text, which is why we need to take a global view of the notion
of theme (cf. Baker 1992: 129).
5. Concluding remarks
The main purpose of this paper has been to discuss contrastive theme analysis on
the basis of parallel corpora. It has been shown that the theme-rheme definition in
SFG may serve as the basis for an analysis of a number of languages both
monolingually and contrastively, but it is also clear that the original approach has
to be modified when used to analyse languages other than English.
It has been claimed that a parallel corpus can be used for trying out a
suitable method for analysing thematic structure cross-linguistically. A parallel
corpus reveals the ways in which system differences between languages create
differences in the realisation of thematic structure. A parallel corpus is then a
valuable tool for testing existing models and for constructing new ones.
Notes
1. This work was carried out with funding from the Bank of Sweden
Tercentenary Foundation. I am grateful to Karin Aijmer and Bengt Altenberg
for their valuable comments on earlier drafts of this paper, and to Joe Trotta
for proofreading. Any remaining flaws are mine.
2. There is a great deal of terminological confusion concerning the labels
parallel corpus, comparable corpus, and translation corpus which are
used for different types of monolingual, bilingual, and multilingual corpora
(cf. Baker 1993: 248, 1995: 228ff., Johansson 1998: 4f., McEnery and Wilson
1996: 57f.). In this paper, the expression parallel corpus is thought of as an
umbrella term covering both translation corpus (original texts and their
translations), and comparable corpus (original texts in different languages or
original and translated texts in the same language). Such texts are comparable
in terms of for example genre and domain. A majority of the examples in this
paper were taken from the English-Swedish Parallel Corpus (ESPC) which
consists of original texts in English and Swedish together with their respective
translation into the other language. The corpus is described in detail at
http://www.englund.lu.se/research/corpus/index.phtml.
3. The code in parenthesis shows that the example was taken from the ESPC,
and refers to the text from which the example was extracted (see Corpus
texts). EO refers to English original text, ST to Swedish translated text.
Further on, SO refers to Swedish original text and ET to English translated
text. A word-for-word translation of the Swedish sentences is provided.
4. There are a number of approaches to the concepts that SFG calls theme and
rheme. The reader is referred to e.g. Gómez-González (2001) who provides an
extensive overview.
5. See the Appendix for an explanation of the abbreviated theme types within
square brackets. Themes are marked in bold type.
6. Within the concept of communicative dynamism in the Prague School
theory of Functional Sentence Perspective (FSP) a division is made between
the theme and the non-theme in which the non-theme consists of the
transition and the rheme. The transition consists of elements performing
the linking function. The TMEs [the temporal and modal exponents of the
finite verb] are the transitional element [sic] par exellence: They carry the
lowest degree of CD [communicative dynamism] within the non-theme and
are the transition proper. The highest degree of CD, on the other hand, is
carried by the rheme proper (Firbas 1986: 54, italics in the original).
7. As pointed out by Rose (2001: 126), Halliday (1994: 66) does refer to a
participant following a circumstantial theme as a displaced theme and
explains that it is a topical element which would be unmarked Theme (in the
ensuing clause) if the existing marked topical Theme was reworded as a
dependent clause.
8. In this and other examples taken from sources other than the ESPC I have
sometimes removed any original notation and added my own.
Corpus texts
Brink, A. (1984), The wall of the plague. London: Fontana Paperbacks. (ABR1)
Davies, R. (1985), What's bred in the bone. Harmondsworth: Penguin Books.
(RDA1)
Ferguson, R. (1991), Henry Miller: a life. London: Hutchinson. (RF1)
Francis, D. (1989), Straight. London: Michael Joseph. (DF1)
Grafton, S. (1990), D is for deadbeat. London: Pan Books. (SG1)
Lacey, R. (1986), Ford. The man and the machine. Boston: Little, Brown & Co.
(RL1)
Larsson, B. (1992), Den keltiska ringen. Stockholm: Albert Bonniers. (BL1)
Mayle, P. (1989), A year in Provence. London: Hamish Hamilton. (PM1)
References
McCabe, A.M. (1999), Theme and thematic patterns in Spanish and English
history texts, vol. I. PhD thesis, Aston University.
McEnery, T. and A. Wilson (1996), Corpus linguistics. Edinburgh: Edinburgh
University Press.
Péry-Woodley, M.-P. (1991), French and English passives in the construction of
text, Journal of French Language Studies 1: 55-70.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rose, D. (2001), Some variations in theme across languages, Functions of
language 8: 109-145.
Steiner, E. (2001), Intralingual and interlingual versions of a text - how specific
is the notion of translation, in: E. Steiner and C. Yallop (eds), Exploring
translation and multilingual text production: Beyond content. Berlin &
New York: Mouton de Gruyter. 161-190.
Steiner, E. and W. Ramm (1995), On Theme as a grammatical notion for
German, Functions of Language 2: 57-93.
Svartvik, J. (1966), On voice in the English verb. The Hague & Paris: Mouton.
Svensson, M. (2000), Sentence openings and textual progression in English and
Swedish, in: C. Mair and M. Hundt (eds), Corpus linguistics and
linguistic theory. Papers from the Twentieth International Conference on
English Language Research on Computerized Corpora (ICAME 20),
Freiburg im Breisgau 1999. Amsterdam & Atlanta, GA: Rodopi. 355-370.
Teich, E. (2001), Towards a model for the description of cross-linguistic
divergence and commonality in translation, in: E. Steiner and C. Yallop
(eds), Exploring translation and multilingual text production: Beyond
content. Berlin & New York: Mouton de Gruyter. 191-227.
Teleman, U., S. Hellberg and E. Andersson (1999), Svenska Akademiens
grammatik, 1-4. Stockholm: Norstedts.
Ventola, E. (1995), Thematic development and translation, in: M. Ghadessy
(ed.), Thematic development in English texts. London & New York:
Pinter. 85-104.
Appendix: Abbreviations
Abstract
1. Introduction
differences and/or correspondences can reveal cultural and typological facets and
that these have to be reckoned with in the process of translation.
2. The corpora
Our data is derived from a set of two comparable corpora (Teubert 1996) in
English and Italian in the fields of Agriturismo in Italy and Farmhouse
Holidays in the U.K. Perhaps the easiest way to characterise the common
denominator between these two fields is to say that they offer their customers a
relaxing holiday in the countryside and with it a number of country activities
related to life on the farm. So, guests are often invited to engage in walking,
hiking, riding, fishing, birdwatching, swimming, etc. and are encouraged to enjoy
the proximity and contact with farm animals. One can expect a comparable
typology in terms of the offer and in the way this offer is put across, although, of
course, allowances have to be made for differences, due to geographical location,
national habits and preferences and, in general, for the specific requirements of
the two different markets.2 In spite of these differences, we assume that certain
more general concepts will have a fairly straightforward equivalent in terms of
their linguistic realisations.
We will henceforth refer to our two corpora as the Agriturist corpus in
Italian and the Farmhols corpus in English. We have assembled these corpora
from web pages and the Agriturist corpus now provisionally contains 115,000
words while the Farmhols one stands at 203,000 words. They can be considered
comparable in that the language they represent has a similar function and aims to
sell a similar product.
As a first step we consulted the frequency list for the Farmhols corpus and
identified the word welcome as a particularly frequent one, as Table 1 shows. A
series of interviews with the owners of different www pages for farmhouse
holidays confirmed the centrality of the word which repeatedly appeared in
definitions such as this one:
FARMHOLS CORPUS      AGRITURIST CORPUS
Welcome              Benvenuto/a/i/e
324 instances        4 instances
The difference in frequency was so marked that we had to ask ourselves why the
concept of welcoming people which appears to be equally central in both the
fields of Agriturismo and Farmhouse Holidays could be realized so differently in
its formal realizations. In spite of our initial assumptions we had to face up to the
problem of non-equivalence.
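Because the two corpora differ in size, the raw counts are easier to compare once normalised. The sketch below is illustrative only: it uses the provisional corpus sizes and the counts reported above, and the helper name `per_10k` is an assumption introduced here, not part of the original study.

```python
def per_10k(count: int, corpus_size: int) -> float:
    """Relative frequency expressed as occurrences per 10,000 words."""
    return count / corpus_size * 10_000


# Counts and provisional corpus sizes as reported in the text.
farmhols = per_10k(324, 203_000)    # welcome in the Farmhols corpus
agriturist = per_10k(4, 115_000)    # benvenuto/a/i/e in the Agriturist corpus

print(round(farmhols, 2), round(agriturist, 2))
print(round(farmhols / agriturist, 1))
```

On these figures welcome occurs roughly 16 times per 10,000 words against about 0.35 for benvenuto/a/i/e, so the gap remains very large after normalisation.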
In this context non-equivalence goes beyond the absence of a match
between L1 and L2. Sometimes when we compare languages we recognise non-
equivalence when there is no match to a certain word: take for instance the
English word hangover which needs to be paraphrased in Italian because there is
no direct equivalent. Sometimes a justification for this phenomenon is possible in
cultural terms. In our case the mismatch occurs when a word like welcome, which
is prominent in terms of frequency in L1, appears only very rarely in L2. The
problem we have to consider, then, is how to identify an equivalent function
given that this may be realised in different ways at the formal level. The other
possibility is that, of course, for some reason, whether cultural or ideological, the
word might not have a direct equivalent.
In order to ascertain whether indeed the concept of welcoming is so
dramatically absent in the Italian of Agriturismo or whether it is simply expressed
differently, we adopted a different approach and decided to address the issue of
translating a word starting from the context in which it is most frequently
embedded. We will explain in the sections that follow our assumptions and our
methodology.
The view we take is that equivalence should not, and often cannot, be
established at simple word level; when indeed a certain type of equivalence
exists, this should be established at the wider level of functionally complete units
of meaning (Tognini Bonelli 1996a/b, 2001). Our aim here is to show how a
systematic contextual and co-textual analysis of the data can help the translator to
identify this wider notion of equivalence built on a network of collocates rather
than on single items. This enlargement of the issue is specially necessary when
374 Elena Tognini Bonelli and Elena Manca
That many textual meanings arise from the co-selection of more than one
word.
That habitual co-selection tends to specialise the function of one or more
of the words concerned.
That co-selection is largely covert and subliminal, which increases its
importance in communication.
This paper aims to take this work on co-selection (see also Francis 1993,
Partington 1998) one step further and considers the implications of its centrality
in translation with particular attention to methodology.
In the process of establishing equivalence, we will also observe how a
systematic enlargement of the unit of meaning in terms of patterns of co-
occurrence can help to define a typology of the extra-linguistic features
associated with it: the type of product offered and also the specific ways in which
it is offered. We will examine differences which are not only due to the different
geographical provenance of the text but also to cultural diversity.
Welcoming children, pets and guests 375
4. Procedure
Our initial word in L1 is welcome which, for lack of space, will be discussed
here only in its adjectival function. The choice of this word is supported by the
fact that the word welcome is very prominent in the Farmhols Corpus. A simple
word-frequency list reveals immediately that welcome is almost top of the list of
lexical words. However, as we mentioned, there is no direct equivalent to it in the
Agriturist corpus, this in spite of the existence of a prima facie equivalent such
as benvenuto. Tables 2 and 3 illustrate the frequencies of welcome in the two
corpora.
BENVENUTO/A/I/E (4 instances)
The mismatch between the frequencies is very clear and, because of this, we shall
try to identify TEs in L2 going through several stages of contextualisation and
relating each item to its environment. We shall identify the collocational profile
of each item both in L1 and in L2 and establish the possible correspondences
between larger units. So, at first, by analysing the concordance to the initial node
in the Farmhols corpus we shall locate the node's most frequent collocates. For
each of the collocates we shall posit a prima facie translation equivalent (TE1,
TE2, TE3, etc.): each of these will be investigated in its own right as a node in the
Agriturist Corpus and it is within their collocational range that we shall try to
locate an equivalent to welcome. Our methodological steps are outlined in
Figure 1.
Collocate1/L1 (children)          TE1/L2 (bambini)
Collocate3/L1 (visitors/guests)   TE3/L2 (ospiti)
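The procedure just outlined (collocates of the L1 node, a prima facie TE for each collocate, then the collocational profile of each TE in L2) can be sketched as follows. This is a rough illustration rather than the authors' implementation: the function name `collocates`, the span value and both token lists are assumptions, and the sample sentences are invented, not drawn from either corpus.

```python
from collections import Counter
from typing import Dict, List


def collocates(tokens: List[str], node: str, span: int = 4) -> Counter:
    """Count words occurring within `span` tokens of each occurrence of `node`."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])      # left context
            counts.update(tokens[i + 1:i + 1 + span])      # right context
    return counts


# Toy token lists, invented for illustration only.
l1_tokens = "children are welcome and pets are welcome by arrangement".split()
l2_tokens = "sconti per bambini e riduzione per bambini".split()

# Prima facie pairings as listed in Figure 1.
prima_facie_te: Dict[str, str] = {
    "children": "bambini",
    "visitors": "ospiti",
    "guests": "ospiti",
}

# Step 1: collocational profile of the initial node in L1.
l1_profile = collocates(l1_tokens, "welcome", span=2)

# Step 2: for every collocate actually found, profile its TE as a node in L2.
l2_profiles = {
    te: collocates(l2_tokens, te, span=2)
    for collocate, te in prima_facie_te.items()
    if collocate in l1_profile
}

print(l1_profile.most_common(3))
print(l2_profiles)
```

In practice one would filter out function words and work from full concordances; the sketch only shows how each collocate in L1 becomes a new node in L2.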
The first step in contextualisation will consider the word welcome as a unit taken
together with its most frequent collocate, children. A quick examination of the
concordance shows quite clearly two points (a few citations are reported in Table
4). First, the close association between children and pets or dogs; we have not
enough data to discuss this in detail, but it certainly should be noted because it
seems rather unusual to find them in the same category. Second, that when
children do not share this association with pets, there is always some kind of
restriction or limitation to their presence in the farm, whether it be some age
restriction (over 10 .., over 5 ..) or the fact that no discount is available, for
example.
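A concordance of the kind examined here can be produced with a simple keyword-in-context routine. The following is a minimal sketch; the function name `kwic`, the context width and the sample sentence are assumptions made for illustration, and the sentence is invented rather than quoted from the Farmhols corpus.

```python
from typing import List


def kwic(tokens: List[str], node: str, width: int = 3) -> List[str]:
    """Return one keyword-in-context line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the node words line up.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


sample = "Children over 10 are welcome but no discount is available".split()
lines = kwic(sample, "welcome", width=2)
for line in lines:
    print(line)
```

Sorting such lines by the word to the right of the node makes restrictions such as age limits easy to spot.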
The specific age restriction is confirmed by other citations in the same corpus
where the noun children is not combined with the adjectival use of welcome, as
shown in Table 5.
We should remember that this type of holiday on the farm in the U.K. is often
centred around domestic animals and their young and part of the fun offered is to
observe them in their own farm environment. The type of conditioned welcome
that we see in the instances above, rather than qualifying a warm and friendly
reception, seems to function as damage limitation when a face-threatening
situation, such as a restriction on the offer, arises. It also reflects well the
situational and cultural context in Britain where the children are not always
welcomed even in places such as farmhouses, where the presence of farm animals
and pets would seem to be an incentive for their presence.
In three instances we find children associated with discount offers (see
Table 6), but these are fairly rare (2.9%) compared, as we shall see in Table 7,
with the Agriturist corpus.
Let us now proceed to the second step in contextualisation, that is examining the
patterns of co-selection associated with our prima facie TE of children, viz.
bambini, in the Agriturist Corpus. Table 7 gives some examples.
The patterning shown in the citations in Table 7 is very typical. Bambini are
never associated with expressions of welcome or denoting an explicit permission
to stay in the Agriturismo. However, they regularly seem to be connected with the
semantic field of discounts identified by words such as riduzione, sconti e
agevolazioni, gratis and gratuito, which, even if they point to the welcome
only implicitly, certainly show it in tangible and concrete terms. In Table 6 we
reported the only three instances of this type in the Farmhols Corpus. In the
Agriturist corpus this is the most typical pattern associated with bambini.
As in the Farmhols Corpus bambini are associated with some age
limitations (fino a 3 anni .., da 2 a 6 anni .., 2/10 anni..), but these only refer to
the discounts and the reductions offered and not to the actual acceptance of
bambini in the Agriturismo.
To sum up this section, we can say that the contextual analysis of the data
in the two languages has shown no match for the word welcome in the context of
children. This is true not only in terms of a similar grammatical pattern - we had
started from the lack of correspondence welcome/benvenuto - but also with other
lexical or grammatical patterns that might have realised a similar function. Can
we then ask ourselves whether this absence of welcome in the Italian of
Agriturismo means that children are not really welcomed in Italian Agriturismo
while they are in British farmhouses? We maintain that the analysis should
always be extended to the context and the overall function of the unit. So,
considering the data we have analysed, perhaps the best answer would be to
remind ourselves again of a citation from the Farmhols Corpus where the
welcome certainly cannot be taken as encouragement,
and to conclude that the English welcome, when applied to children, may not
necessarily convey the warmth and the friendliness that we associate with it; a
qualified welcome is perhaps to be interpreted as discouragement to those
excluded by the qualification. On the other hand, the fact that no explicit welcome
is stated in relation to bambini should also be interpreted in the context of the
regular statements about discounts and reductions made available to children, and
these should be taken as encouragement for the presence of children in the Italian
Agriturismo. It seems to be taken for granted that children are welcome.
Pets and dogs are the recipients of the welcome in 20% of the instances in the
Farmhols corpus. In half of these occurrences, however, this welcome is
accompanied by a limitation on the offer, as was the case with children. As one
can see in Table 8 below, this conditioned welcome is realised here by a variety of
expressions ranging from provided, providing and but to by arrangement and on
corpus; in the Farmhols corpus the issue seemed to be more that pets should be
well-behaved or kept under control.
From the point of view of the translation equivalence the result is quite
satisfactory because, while we could not find a one-to-one equivalent for welcome
in general, we were able to locate a perfectly good equivalent for the English pair
welcome-pets in the Italian accettare/ammettere-animali. At the level of
functionally complete units of meaning, the pragmatic dimension of the unit is
realised by the expressions of limitation associated with it both in English and in
Italian. This suggests that the use of welcome in this context in English is just a
euphemism for accepted.
The patterning associated with welcome in the context of guests and visitors
differs from both the patterning with children and pets; here we consistently find
the structure Vb-BE + welcome + to-inf. as in Our visitors are welcome to
explore the farm. The concordance in Table 10 groups together some citations for
visitors, guests and also the pronoun you which addresses the potential visitor or
guest in the text from the web pages.
We note here that the structure in which welcome is embedded has a
different impact on the meaning: if with children and pets the welcome conveys
the meaning of permission and implies that they are allowed to join in the
farmhouse holiday, subject to certain specific conditions; with visitors and guests
we find a straight invitation to take advantage of all the leisurely activities offered
by the farmhouse.
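The structure Vb-BE + welcome + to-inf. can be approximated with a regular expression over raw text. The sketch below is only a rough heuristic and not the authors' retrieval method: without part-of-speech tags it cannot truly distinguish infinitival to from the preposition, so a small list of determiners (an assumption introduced here) is used to discard cases such as welcome to the farm. All names in the block are hypothetical.

```python
import re
from typing import List

BE = r"(?:am|is|are|was|were|be|been|being)"
# BE, an optional intervening word (e.g. "most"), then "welcome to" + one word.
PATTERN = re.compile(rf"\b{BE}\s+(?:\w+\s+)?welcome\s+to\s+(\w+)", re.IGNORECASE)
# Words after "to" that signal a noun phrase rather than an infinitive.
NOT_VERBS = {"the", "a", "an", "our", "this", "that", "these", "those"}


def welcome_to_inf(text: str) -> List[str]:
    """Return the word following 'welcome to' wherever it may be an infinitive."""
    hits = []
    for match in PATTERN.finditer(text):
        word = match.group(1).lower()
        if word not in NOT_VERBS:
            hits.append(word)
    return hits


print(welcome_to_inf("Our visitors are welcome to explore the farm."))
print(welcome_to_inf("Children are welcome."))
```

The first call returns the infinitive explore from the example cited above, while the bare adjectival use in the second sentence yields no match.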
Let us now consider the Italian equivalent of guests and visitors, that is
ospiti. Again, we note the absence of the typical TE of welcome as suggested by
traditional reference books, the fully lexical benvenuto/i. Some examples are
given in Table 11.
In the concordance in Table 11 it is pretty clear that the equivalent of the English
structure Vb-BE + welcome + to-inf is conveyed in Italian by the modal potere
'to be able to' in its inflected forms. Here we have the example of a fully lexical
word such as welcome in L1 that has primarily a grammatical realisation in L2.
The phrase vi è la possibilità di ('there is the possibility to') carries the same
modal meaning but in a lexicalised form. In spite of this lexical status it belongs
under the same umbrella of modality that in traditional linguistics is usually
understood as Grammar. This is a potential trap for translators because the
lexical choice implicitly carries more weight and as such may become a more
visible, and therefore preferred, option when translating. We can certainly say
that it is the purely lexical meaning that tends to be the focus of traditional
reference books, so welcome is translated as benvenuto, and no guidance is given
about the likely use of the modal potere. In this case a translation corpus could
help us to identify the favourite choices of translators, to verify for instance if the
grammatical translation of welcome is indeed used and if so, if it is used
appropriately.
The noun ospiti shows a frequent association with another expression, also
related to modality: a disposizione di. Let us consider some examples in Table 12.
One thing to notice which, for lack of space, is only mentioned in passing here, is
the fact that the phrase a disposizione degli ospiti in the Agriturist corpus is
mainly associated with the type of accommodation offered (e.g. quattro camere
doppie 'four double rooms'), while welcome + to-inf. is connected with the
different leisure activities offered by the farmhouse holiday package. This points
to the specificity of the semantic preference within similar units of meaning and
to the fact that collocational restriction is based on semantic criteria. It is certainly
something that should be investigated further, especially in view of the impact it
can have on the translation process at the level of appropriateness.
The data discussed in the sections above show that while the single word
denoting welcome cannot be translated satisfactorily in Italian, each of the
collocational pairs welcome-children, welcome-pets and welcome-guests has an
appropriate TE (even if this is 0-equivalence in the case of children) that conveys
welcome either in terms of permission or in terms of invitation.
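The finding that equivalence holds for collocational pairs rather than for the single word can be pictured as a lookup keyed on the collocate. The table below simply restates the pairings reported in this paper for the restricted language of farmhouse holidays; the dictionary name, the function name and the use of `None` for 0-equivalence are assumptions made for the sake of the sketch.

```python
from typing import Dict, Optional, Tuple

# Candidate Italian realisations of "welcome", keyed by its English collocate.
# None stands for 0-equivalence (no explicit expression of welcome).
TE_FOR_WELCOME: Dict[str, Optional[Tuple[str, ...]]] = {
    "children": None,
    "pets": ("accettare", "ammettere"),
    "guests": ("potere",),
    "visitors": ("potere",),
}


def translation_equivalent(collocate: str) -> Optional[Tuple[str, ...]]:
    """Look up the TE of 'welcome' for a given collocate; unknown collocates raise."""
    key = collocate.lower()
    if key not in TE_FOR_WELCOME:
        raise KeyError(f"no collocational evidence for {collocate!r}")
    return TE_FOR_WELCOME[key]


print(translation_equivalent("pets"))
print(translation_equivalent("children"))
```

Raising an error for unlisted collocates reflects the point that no equivalent can be assumed outside the contexts for which corpus evidence has been examined.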
By enlarging the translation unit to encompass the more systematic
patterning associated with the initial collocation pair, a typology of the offer
specific to each type of guest emerges. We have seen how certain guests (children
and pets in the Farmhols corpus, animali in the Agriturist corpus) invited the
presence of restrictions while others (bambini and ospiti in the Agriturist corpus,
guests in the Farmhols corpus) did not. The type of restrictions, we have seen,
were not the same in the two languages and reflected cultural and ideological
preferences; so while the presence of children was restricted in terms of age in the
Farmhols corpus, in the Agriturist corpus the only qualification was on the type of
discount accorded. With pets the restrictions demanded that they should be under
control and that they should be well-behaved in the Farmhols corpus while the
parallel term animali in the Agriturist corpus seemed to invite restrictions on size
rather than behaviour, and that specific arrangements for their presence should be
made in advance.
The typology of the offer for children included a large safe area, explorer
trails, ample space as well as some specific facilities like cots, highchair and
child minding. The equivalent offer for bambini in the Agriturist corpus showed
predominantly the semantic area of children's games and game-parks with words
such as giochi per bambini, spazi attrezzati per bambini, piscina rotonda per
bambini.
6. Conclusion
This paper started off exploring the notion of translation equivalence at word
level between two items which had similar grammatical, lexical and even
morphological realizations in English and Italian. The assumption of equivalence
appeared very plausible because the concept in question, the idea of welcome in
the field of eco-tourism and farmhouse-style accommodation is central both in
English and in Italian. It seemed therefore likely that there would be a fairly
straight-forward match between welcome in English and its Italian counterpart
retrieved in L2. So, although both welcome and children can be individually
translated in Italian, this does not mean that the unit of meaning in which they are
combined can be translated.
The upshot of our discussion is that any translating activity should start by
considering very carefully the context in which a certain word or expression is
embedded and the one into which it is going to be transferred. While we cannot
maintain that welcome in general language is always to be translated as accettare
or potere, we can certainly say that welcome should be translated with some form
of the verb accettare when it applies to pets and with some form of the verb
potere when it applies to guests in the specific restricted language of Farmhouse
Holidays in the U.K. That is, if we want our translation to sound natural and
avoid the unmistakable ring of translationese (Gellerstam 1986).
Corpus evidence gives us a privileged start by allowing us to examine
simultaneously the syntagmatic and paradigmatic dimensions of meaning. We
have tried to show that it is only by comparing possible TEs in the presence of
their syntagmatic patterning and their paradigmatic associations in the two
languages that it is possible to identify functional equivalence.
This study has not specifically focused on the typology of the offer in
Italian Agriturismo and British Farm-house holidays. However, in the course of
our observations, it was apparent that some very interesting insights can be
gained from a close look at the data from a typological perspective. In this
context we only want to point to the possibility of identifying the parameters of
this offer in a systematic way. We believe that anybody wanting to advertise their
offer in a foreign language should be aware of the comparable offer available to
their target customers, not only in terms of linguistic realisations but also in terms
of the facilities they advertise. This will be the focus of further research in the
future.
Notes
1. A first version of the work reported here was presented at the A.I.A.
conference in Catania in September 2001 (published in Textus XV, no. 2,
2002). This version, presented at ICAME 2002 (Göteborg) greatly benefited
from the careful and stimulating comments of the editors of this volume,
Karin Aijmer and Bengt Altenberg, as well as the discussion and the questions
that followed the presentation.
2. See for instance the importance of genuine food and the pleasures linked to a
traditional country cuisine which is central in the Agriturist offer in Italy and
has no real equivalent in the Farmhols Corpus.
3. The word welcome, as well as an adjective and an exclamation, is also used as
a verb (see Manca 2001). In this study we will only consider the adjectival
function in some detail.
References
Natalie Kübler
Abstract
In this paper, we present an experiment that was carried out to use finite corpora
and WebCorp in the classroom with a pedagogical objective that was different
from language teaching. The use of WebCorp and corpora was embedded within
the wider framework of teaching students how to approach machine translation
by building a customised dictionary with the aid of available tools and resources.
The issue of exploiting finite corpora and the Web as a corpus was raised in this
framework and will be discussed here. Although there is no simple and definite
answer, the experiment led students to investigate the Web as a source of
information and to better understand the issues involved in corpus building and
corpus use.
1. Introduction
In this paper, we present an experiment that was carried out using finite corpora
and WebCorp in the classroom with objectives that were different from mere
language teaching (see section 2.1). Corpus-based, or corpus-driven teaching as
Johns (1988) termed it, can be adapted to using the Web as a corpus; in this
context, WebCorp can be a useful tool for language teachers and students. Our
purpose was however slightly different. Although WebCorp was tested in a
pedagogical situation, its use was embedded within the wider framework of
teaching students how to extract lexical and syntactic information to build
customised dictionaries for machine translation (MT) in languages for specific
purposes (LSPs). In the light of this specific context, we shall tackle the issue of
finite corpus use as opposed (or not) to WebCorp use.
The first part of this paper presents the pedagogic and scientific context of
the experiment. Some details must be given about the project in which the
experiment took place, since it has an impact on the type of results that were
expected from the WebCorp search.
In the second part, the resources and tools that were used are described.
In the third part, samples of the results obtained with WebCorp and with
the finite corpora will be presented and explained. We will show how WebCorp
can be used to complement and update search for linguistic information in finite
corpora. This part will also discuss the benefits of using WebCorp parallel to
querying finite corpora.
The conclusion will deal with future prospects and enhancement
requirements for WebCorp.
2. Experiment context
The objectives of this project involved not only teaching the students the various
skills which will be described below, but also considering the limits of finite
corpus use versus Web as a corpus use. This approach is very profitable to
young people who are computer-literate, and for whom the Web is regarded as
the fount of all knowledge. Comparison helps them find the advantages and
disadvantages of the two approaches; it is also aimed at showing them that
information extracted from the Web must be carefully examined and not be taken
for granted. This also raised the issues at stake in corpus-building as opposed to
using texts collected without specific criteria, or using the Web.
Below are listed the kinds of competence students should have acquired at
the end of the course; they should be able to:
Using WebCorp in the classroom 389
The projects in which WebCorp was used and tested consist in translating texts in
the computer science area, using a customisable machine translation system.
Some texts to be translated from English into French were dictionary definitions,
extracted from a Web-based computing dictionary;2 the other type of texts were
some of the Linux HOWTOs that have not yet been translated. The Linux
HOWTOs are the user manuals of the Linux operating system; they have been
translated into several languages by the various Linux communities.3 The French
Linux community is quite active and has translated most HOWTOs. However, as
new HOWTOs, or updates of previous ones, are regularly released, there are still
some documents that remain to be translated. Our students thus had to translate
some of the most recent HOWTOs.
The machine translation system that was used was Systran, and more
precisely Systranet, which is Systran's customisable on-line translation system. It
allows users to create their own (bilingual or multilingual) term bases to improve
translation results; this feature can give quite good results in specialized
translation. Students had to create their own customised dictionaries, in order to
test them with Systranet.
To create term bases (or customized dictionaries) from scratch, the first
step involved automatically extracting term candidates from the English text to be
translated and then finding their French equivalents. The first dictionary would
then be used to translate the text.
Systranet offers the possibility of aligning the source and target text, and,
in the aligned target text, of highlighting unknown terms in red and the user's
dictionary terms in green. These features make it possible for the user to add to
the dictionary all the words that are not recognized by Systran's home
dictionaries. The second step is more demanding in terms of linguistic work:
students compare source and target texts to complement and modify the
dictionary until no more dictionary change can improve the translation result.
When the dictionary is saturated, i.e. no more change can be made to
improve the translation result, the final translation of the text is achieved; the
result will then be proofread and post-edited to correct the translation errors that
could not be solved by modifying the dictionary.
Finite corpora and the Web as a corpus are key elements in the process of
building and correcting dictionaries, and of proofreading the final translation
result. After extracting term candidates from the source texts, students must
decide which candidates are actual terms. Corpus query must then be applied to
answer this question. Parallel corpora are then necessary to help find the French
equivalents for the terms. Corpus use is not only essential to finding terms and
their equivalents, it is also often the only possible means of finding syntactic
information for the terms, especially for verbs and adjectives; verbs and
adjectives are in fact not always considered terms, and little linguistic information
about these classes can therefore be retrieved.
Finite corpora are not the only resources that are essential to creating
customised dictionaries; it will be shown later how the Web as a corpus can
complete and update the information extracted from finite corpora.
This section describes the tools and resources that were used to fulfil the
assignments in the project. The two most important resources for the tasks under
consideration in this paper are WebCorp and the finite corpora that were used.
3.1 WebCorp
WebCorp is a tool developed in a project that was set up at the Research and
Development Unit for English Studies at the University of Liverpool. Its
objectives were to investigate the usability of the Web as a linguistic resource.
The project also had to identify and address problems of retrieval and analysis.
WebCorp allows the user to type in a request for linguistic information that is processed and
fed into the selected Web search engines. The search engine returns a list of
URLs that WebCorp accesses directly; it then returns concordances or collocates
for the query. We will show below how it can be used to retrieve useful linguistic
information to create bilingual term bases in LSPs. A detailed description of
WebCorp has been given by Renouf (2003) and Kehoe and Renouf (2002).
3.2 Corpora
The finite corpora that were available for the students were first developed at the
Laboratoire de Linguistique Informatique at the University of Paris 13. They have
been augmented and enhanced at the University Denis Diderot Paris 7 for several
years. These corpora, parallel and comparable, are accessible via a Web-based
interface,4 in which a concordancer allows visitors to use Perl-like regular
expressions, as described in Foucou and Kübler (2000). The following corpora
were used by the students:
a) The parallel English-French HOWTO corpus, which has been used for several
years at Paris 7. It is made up of the Linux HOWTOs (user manual files of the
Linux operating system), which were originally written in English. The
HOWTOs have been translated into several languages, including French. The
source language and target language texts were aligned at section level. The size
of the parallel corpora is approximately 500,000 words each. It is possible to ask
for concordances and then have an aligned view of the section in which the term
or expression occurs. Concordances with regular expressions are very useful for
extracting refined linguistic information about terms. Furthermore, by looking at
the equivalent section in French, it is possible to find the French equivalents of
the term or expression.
was adapted to specific pedagogical needs, allowing the teacher to create the
groups and to have access to all the students' dictionaries, as well as partial access to
the logs of the sessions.
The most interesting feature of our project, apart from the translation
engine as such, was the possibility for the user to create and compile customised
dictionaries.
Dictionaries contain more than just a correspondence between a source
word (in this case in English) and a target word (in French), since users can enter
what is called "advanced linguistic information" in them. The information can be
divided into several levels:
part-of-speech information: basic part-of-speech information can be attached to
the entries, such as verb, noun, proper noun, adjective, and sentence, which
deals with adverbs, adverbial phrases, or whole idioms, such as your mileage may
vary.
syntactic information, such as the governed prepositions for nouns, verbs, and
adjectives, or direct objects for verbs. A verb which governs a preposition is
shown in example (1).
(1) access (verb)(noprep) = accéder (verb)(prep:à)
semantic information, such as the conceptual class of the possible direct object
of a verb, as shown in example (2). In this example, the coding for the verb
run indicates that the direct object must belong to the semantic class [OS], which
means all terms sorted under the operating system class. Below the verb, the
noun Unix is marked as belonging to the [OS] class. This means Unix can be the
direct object of run.
(2) to run (verb)(context:OS)
Unix (noun) (SEMCAT:OS)
morphological information, such as the plural form of a noun in any language,
the gender of a noun in French, or altering the number in the target or source
language. Example (3) shows how the gender of cache can be altered to
masculine. In general French, the noun cache ('hiding place') is feminine,
whereas in computer science French, it is masculine and means 'cache'.
(3) cache (noun) = cache (noun) (masculine)
The term URL takes a plural in -s in English, i.e. URLs, whereas in French, it is
invariable; this type of information can be coded in the dictionary, as is shown in
example (4).
(4) URL (noun) (plural:URLs) = URL (noun) (plural:URL)
translational information, such as DNT, which means that the string must not
be translated, i.e. it must remain as it is in the translation process. This feature is
quite useful in computer science, as there are command names for example that
are never translated, such as the Unix command cd, or mkd.
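The levels of information listed above can be pictured as fields of a single record per entry. The following is a minimal sketch only: the class and field names are invented for illustration and do not reproduce Systranet's internal format, and treating cd as a noun is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Side:
    """One side (source or target) of a dictionary entry. Field names are hypothetical."""
    lemma: str
    pos: str                          # e.g. "verb", "noun", "sentence"
    prep: Optional[str] = None        # governed preposition, None if there is none
    gender: Optional[str] = None      # e.g. "masculine" for a French noun
    plural: Optional[str] = None      # explicit plural form
    semcat: Optional[str] = None      # semantic class the word belongs to
    context: Optional[str] = None     # semantic class required of the direct object


@dataclass
class Entry:
    source: Side
    target: Optional[Side] = None     # left empty when the string is not translated
    dnt: bool = False                 # "do not translate" flag


entries = [
    Entry(Side("access", "verb"), Side("accéder", "verb", prep="à")),
    Entry(Side("cache", "noun"), Side("cache", "noun", gender="masculine")),
    Entry(Side("URL", "noun", plural="URLs"), Side("URL", "noun", plural="URL")),
    Entry(Side("cd", "noun"), dnt=True),   # POS chosen arbitrarily for the sketch
]


def can_take_object(verb: Side, noun: Side) -> bool:
    """True if the noun's semantic class satisfies the verb's object constraint."""
    return verb.context is None or verb.context == noun.semcat


run = Side("run", "verb", context="OS")
unix = Side("Unix", "noun", semcat="OS")
print(can_take_object(run, unix))
```

Keeping the semantic constraint on the verb and the class label on the noun mirrors example (2): the check succeeds only when the two labels agree.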
To extract term candidates from the source texts, a very simple and user-friendly
tool was applied, viz. Terminology Extractor. This tool works for English and
French and gives several types of results. First, it extracts all the words that are
recognised by its dictionaries, plus all the non-words, i.e. words that are not in the
dictionaries. The non-word feature is interesting, as it usually gives a list of very
specialised words which are not in general dictionaries. Then it extracts in a
window of two to ten words all the sequences that appear at least twice in the
text. This feature allowed the students to have a list of term candidates among
which they could choose the actual terms with the help of the various corpora and
WebCorp.
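The second kind of result, sequences of two to ten words that occur at least twice, can be approximated in a few lines. This is a sketch of the general technique, not of Terminology Extractor itself; the function name, the tokenisation rule and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def repeated_sequences(text, min_len=2, max_len=10, min_count=2):
    """Return word sequences of min_len..max_len words seen at least min_count times."""
    words = re.findall(r"[\w-]+", text.lower())
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}


# Invented sample text; a real run would use the document to be translated.
sample = ("The buffer overflow was fixed. "
          "A buffer overflow can crash the kernel.")
candidates = repeated_sequences(sample)
print(candidates)
```

The output is only a list of candidates: sequences that recur are not necessarily terms, which is why the students still had to check them against the corpora.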
Finite corpora and the Web as a corpus were the main resources used in the
project. There were also secondary sources, such as on-line glossaries, or on-line
term bases. These were presented to the students to help them understand why
data-driven information is essential to this type of work, and why dictionaries and
glossaries are not always satisfactory. Figure 4 shows the type of information that
can be accessed in a Web-based bilingual term base. The search for the
translation of the English word buffer yielded the translation mémoire-tampon,
and three synonyms and translations of these, but no syntactic or phraseological
information. There were no compounds of the word buffer, although it is very
common in computer science English.
ENGLISH               FRENCH
Buffer                mémoire tampon n. f.
Syn.                  Syn.
buffer storage        tampon n. m.
buffer memory         mémoire intermédiaire n. f.
intermediate memory   zone tampon n. f.
Figure 4. The term buffer and its French translations in Le Grand Dictionnaire
Terminologique.
Taking our experiment in the classroom into account, we want to show how the
use of finite corpora and WebCorp is neither contradictory nor incompatible.
Available finite corpora, such as the HOWTO corpus and the smaller ones in
subdomains of computing, can give the user a lot of information. But as
computing is a very quickly changing domain, new terms are coined all the time,
which means that available corpora tend to become insufficient or slightly
obsolete, even though they can be regularly updated. In the subject area of
computer science, most neologisms can be found on the Web. So being able to
query the Web as a non-finite corpus is a fruitful way of obtaining missing
information. Taking the above-mentioned example of buffer, we will describe and
discuss this.
The problem is that the HOWTO translators have not always translated the whole
text, or they may have modified sentences in such a way that some words just
disappear. As a result, some compounds can be found, but not all, and not always
their French equivalents. This indicates the limitation of finite corpora. New
terms that were created after the collection of the corpus, or translations that have
been radically modified, cannot be found in a finite corpus. Term bases are
generally not complete enough. Because of this, the information must be looked
for on the Web. As not only lexical information but also phraseological and
translational information is necessary, a tool that makes it possible to extract
concordances from the Web is likely to be appropriate. The next sub-sections deal
with examples of Web search, using WebCorp, and demonstrate how the
necessary information can be found.
As the Web is not an aligned corpus, heuristics must be applied to find the French
equivalents for English words. One possibility consisted in searching for an
English term on a French Web-site. In the current state of WebCorp, the only way
of doing that was to look for URLs in the French domain, i.e. ending in .fr. In
French, computer scientists often use the English term for a given concept. Some
translators therefore use the English term and often give its French equivalent in
parentheses at the beginning of the document and then no more. Others use the
French term, but add the English word in parentheses. This permitted us to find
translations and also more terms, as illustrated in Figure 6, which shows a
concordance for buffer extracted with WebCorp. These concordance lines yield
two multi-word units in English, viz. buffer overflow and heap buffer overflow,
and their equivalents in French.
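The parenthesis heuristic can be sketched as a small routine that, given an English term, returns the few words standing immediately before each parenthesised occurrence of it. The function name, the window size and the sample sentence are all invented for illustration; the routine only proposes a left context, from which the equivalent still has to be picked by hand.

```python
import re


def left_context_of_gloss(text, english_term, window=4):
    """Return the `window` words preceding each '(english_term)' found in text."""
    pattern = re.compile(r"\(\s*" + re.escape(english_term) + r"\s*\)", re.IGNORECASE)
    results = []
    for m in pattern.finditer(text):
        words = text[:m.start()].split()
        results.append(" ".join(words[-window:]))
    return results


# Invented French sentence in which the English source term is given in parentheses.
sample = ("Cette faille provoque un débordement de tampon (buffer overflow) "
          "dans le serveur.")
print(left_context_of_gloss(sample, "buffer overflow"))
```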
Not all searches provide the reader with the English source term in parentheses.
In the case of dial-in line, for example, only part of the term is translated into
French, and no indication of the source term is given. Figure 7 shows an
occurrence of ligne de dial-in, in which only part of the term is translated.
However, other occurrences of dial-in in French text show that this is the correct
way of using it in French.
An argument that appears after the preposition under can also be used after on,
but the opposite is quite rare. Building a customised dictionary means listing, as
exhaustively as possible, the different verb arguments that can occur in the
different positions in a sentence. Finite corpora can produce a quite exhaustive
answer, which needs to be complemented and updated by using the Web as a
corpus. Figure 8 shows how the expression run * * on, which uses two
wildcards instead of words before the preposition on, can give significant results
on the arguments that can fill the syntactic positions. These arguments could not
be found in the HOWTO corpus, nor in the smaller finite corpora.
Another useful feature offered by WebCorp is the collocate function; it gives the
most frequent collocates of the sequence. Frequent collocates of the verb to run,
for example, are Debian, Alpha and messages, the first two being product names
in computer science. As WebCorp is limited in the number of sites that can be
opened, it is possible to filter out the collocates and discard the URLs in which
they occur. It can be done by using the exclude feature (using the - sign, as in
search engines). This allows WebCorp to extract concordances from other URLs,
which then provide more linguistic information.
The same operation can be applied to extract linguistic information about
the French equivalent of the verb, i.e. tourner. As shown in Figure 9, the first
pass is not always conclusive, since there are occurrences that have nothing to do
with computer science. The sequence tourn* * * sur will find all the words
beginning with tourn, followed by two words, followed by the preposition sur
(on).
fonctionner avec Windows, il peut tourner ou pas sur des cartes vidéo ou
de type Unix qui peut tourner entre autres sur PC. Il est installé par
des ordinateurs distants Pour faire tourner un programme sur une machine
distante dont l'adresse
texte ASCII par un module tournant sous Windows (sur PC) et devrait bientôt
Figure 10. Occurrences of tourn using filters.
4.4. Discussion
These few examples show occurrences of terms and their phraseological contexts
that could not be found in the finite corpora on computer science. Studying
terminology and phraseology for practical purposes raises issues that are different
from describing the language as such. Describing languages for specific purposes
means working in well-defined subject areas, which does not need huge corpora
as in the study of general language (if there is such a thing as general language).
A few hundred thousand words, sometimes fewer than a hundred thousand, are
enough to describe the characteristics of a language for specific purposes.
However, applying this type of description for practical purposes, such as
creating a dictionary that will be integrated into a machine translation system,
raises the issue of exhaustiveness. Machine translation needs human input to
achieve satisfactory translation results. In this case, a small, specialised corpus is
not enough. Moreover, the issue of up-to-date information arises. WebCorp, as a
tool enabling the user to make daily updates, is ideal for complementing and
updating the information extracted from time-bound specialised finite corpora.
However, using finite corpora presents some advantages over WebCorp
that will be difficult for a concordancer using the Web as a corpus to overcome.
Finite corpora have the significant advantage of presenting controlled and
balanced information. The texts collected in a corpus have been selected in
preference to other candidates. Using the Web as a corpus implies that one has no
control over the content of the documents that are extracted. The huge quantity of
documents is also a problem.
5. Conclusion
While, in our case, finite corpora were used as the basis for the creation of
customised dictionaries, WebCorp provided us with more complete and up-to-
date linguistic information. In the classroom situation, students were faced with
those issues, i.e. finding information in finite corpora, discovering they needed
more, and using WebCorp instead of collecting a bigger corpus in the domain.
Students learned how to use heuristics to find appropriate information using
WebCorp; this also led them to note the advantages of WebCorp over classical
search engines, namely the availability of concordances, collocates, regular
expressions, and the possibility of limiting and filtering the linguistic information.
WebCorp still needs some improvements, such as refining language
identification, and domain filters. Linguistic information extracted with WebCorp
would be more accurate if domain filters could be used to restrict the search to
one domain. Refined regular expressions would allow users to extract more
accurate phraseological information. As these improvements are integrated into
the next release of WebCorp, the next step will be to test them and see if the
results are significantly improved.
Notes
References
University of Liverpool
Abstract
The Web is a text store which can potentially supplement traditional corpora as a
source of up-to-date linguistic data. The WebCorp project investigates this
potential, and in its second year tackles some residual problems inherent in the
nature of Web text, thereby refining its retrieval and analysis tool for the
facilitation of corpus linguistic study.
1. Introduction
Several approaches could be taken to extracting linguistic data from the Web and
processing it online. The WebCorp system has adopted a straightforward
approach, as shown in Figure 1. WebCorp has six basic stages of operation. It
first registers the user's request for linguistic information. Then it translates the
request and feeds it to a search engine. The search engine locates relevant texts,
returning a list of URLs to WebCorp, which accesses these directly, processes the
associated texts in memory, and then returns concordance results to the user
interface.
[Figure 1. Diagram connecting the User Interface, WebCorp, the Search Engine and Web Texts through stages numbered 1 to 5.]
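The final stages, processing the retrieved texts in memory and returning concordance lines, can be illustrated with a small key-word-in-context routine. This is a sketch under stated assumptions and not WebCorp's own code: the function name and the span parameter are invented, and the Web texts are passed in as plain strings so that no search engine or download is involved.

```python
import re


def concordance(texts, term, span=40):
    """Return key-word-in-context lines for every match of term in texts."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        flat = " ".join(text.split())          # collapse line breaks and runs of spaces
        for m in pattern.finditer(flat):
            left = flat[max(0, m.start() - span):m.start()]
            right = flat[m.end():m.end() + span]
            # Right-align the left context so that the search term lines up.
            lines.append(f"{left:>{span}} [{m.group(0)}] {right}")
    return lines


# Invented stand-ins for texts already fetched from the URLs returned by a search engine.
pages = [
    "Critics attack the policy with the term Enronomics in print.",
    "Nothing relevant here.",
]
for line in concordance(pages, "Enronomics", span=20):
    print(line)
```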
daily basis. The neologism Enronomics was not found in existing corpora in May
2002. It is derived from Enron, a US company that in early 2002 was discovered
to have conducted large-scale financial malpractice. The name now carries
connotations of the particular kinds of shady business dealing and poor
management style involved, and is used to characterise companies and practices
exhibiting similar qualities. Contexts for this neologism could already be
extracted from the Web by WebCorp in May 2002. They indicated that the root
form Enron was extremely productive, already appearing in a range of derived
forms. In the sample output for Enronomics in Figure 3, we also find Enronyms,
Enronitis, Enronify, Enronethics, Enronizing, enronish, Enronitize and enronomy.
In addition, we note that Enronomics is probably modelled on Reaganomics, as is
Clintonomics.
attack Bushe economic policies with the term Enronomics (a phrase that
originated
to Believe He Knows About the Economy? Enronomics = Contributors Get
Richer
corporate malfeasance. Recently spotted Enronyms: Enronitis, Enronify,
Enronomics
laid bare by what rivals call Enronomics the political fable of the Enron
corporation
slogan and neutralize the Enronomics accusations, may I coin the term
Enronethics
Team Bush - talk of Enronomics, or Enronizing Social Security and
Medicare
believing their press, watch out. It's Enronomics, folks. The rich seducing
the poor
to be enronish and to practice Enronomics. We've seen ugly, enronish
sights before
The Looting of America: Reagonomics, Clintonomics and Enronomics
Strategy) . Enronomics Explained (deliberately driving the country into
spent two weeks talking about Bush's Enronomics and Enronizing
Social Security.
It blows the lid off Bush's Enronomics, and his plan to Enronitize Social
Security
hardest hit by the Bush trickle down enronomics. Now it looks like the Bush
enronomy
Figure 3. WebCorp output for search term Enronomics Domain: .uk or .com
Alternatively, one might wish to check the neologistic status of a word through a
Web search. In an article on Health Obsessions in the Observer of 14.04.02, the
vogue term medicalisation is presented in inverted commas as though a
neologism. Though there is no consistent meta-information for date on the Web
to support the chronological extraction of word occurrences, WebCorp can
The accidental corpus 407
retrieve at least some in-text dates indicating that the word is not new, but has
been used as early as 1974, as shown in Figure 4.
Figure 4 also includes evidence of the vogue use of medicalisation to mean 'treat
medically a natural condition as if it were a disease', in the context of words such
as ageing, childbirth, everyday life, death, and psychologisation, as well as more
established uses. In the context of abortion or drugs, medicalisation is used to
mean 'decriminalisation'; while in the context of terminal conditions, it can also
mean 'treating with medicine', collocating with such words as palliative. The
rarity of inverted commas here indicates that the word is no longer considered to
be a new coinage, the one use (in 16) being to indicate the novelty of its status
back in 1974.
408 Antoinette Renouf, Andrew Kehoe and David Mezquiriz
During the development phase, we have established many of the needs of users
via our feedback mechanism. These have led us to face a number of retrieval and
processing issues, which we shall outline below, together with solutions that we
have found. The major areas of concern are:
3.1 Scope
All things being equal, it seems a good idea to maximise the scope of Web search
in order to garner as many examples as possible. However, a Web search is
limited to the scope of indexing of the various search engines. A report (Bergman
2001) stated that the foremost search engine, Google, had indexed 2 billion Web
pages, but estimated that it only searched 10% of the Deep Web. The use of
multiple search engines (currently Google, AltaVista, Metacrawler, FAST,
Northern Light and SearchEngine.com) is a remedy that we have applied to
increase coverage.
3.2 Speed
Any Web language retrieval system will be subject to speed constraints. These
are imposed by each agent in the loop, including local server, university resources
and Web traffic. An arrangement which allows direct access to the Web via the
index built by one of the search engines is likely to increase speed. In the case of
WebCorp, this improvement is achieved by linking into SearchEngine.com, a
major UK-based system. Speedier processing can also be achieved through the
parallelisation of the downloading and processing of Web pages. Neither measure
brings huge benefits, however; a new order of processing power is required, of
the scale envisaged for the post-Internet era of distributed computing.
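The parallelisation of downloading and processing can be sketched with a thread pool. The code below is an illustration of the general idea rather than WebCorp's implementation: the function names are invented, and a dictionary stands in for the Web so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(urls, fetch, process, max_workers=8):
    """Fetch and process each URL in a worker thread; results keep the order of urls."""
    def handle(url):
        return process(fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handle, urls))


# Stand-in for real downloads: a hypothetical URL-to-text mapping.
fake_web = {
    "http://example.org/a": "the boat sank",
    "http://example.org/b": "the ship sank fast",
}
word_counts = process_pages(fake_web, fake_web.get, lambda text: len(text.split()))
print(word_counts)
```

Threads suit this case because most of the time is spent waiting for remote servers; as noted above, the gain is bounded by the slowest agent in the loop.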
Search engines require careful monitoring since they are constantly changing:
opening up, closing down, amalgamating, adding new functionality, and
imposing new restrictions.
The Web page is in a state of disorder from every point of view that concerns
linguistic processing. To begin with, the basic unit of the word, even the boundary
between words, is erratic. Then, spelling is variable and presents a problem
analogous with that which has preoccupied generations of historical linguists.
Punctuation is haphazardly sprinkled, and frequently omitted (or suppressed by
some intermediate processing), a tendency that presents a particular dilemma in
that it removes the sole means of processing the surface text for sentence
boundary.
Web pages are a mixture of text and metatext (including URLs and other
links). For some purposes, the linguist requires access to the text itself; for others,
such as the study of meta-terms for specialised dictionary creation (see Kübler
and Foucou 2000), access to the metatext. Scarcely any purpose is served by a
system which retrieves a mixture of both. A partial solution here is to construct a
retrieval routine that identifies and ignores the kind of text, such as link text, on
the Web page which is not required.
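A retrieval routine of this kind can be sketched with a standard HTML parser that collects running text and ignores whatever sits inside link elements. The class name and the set of skipped elements are assumptions for illustration; the sketch does not reproduce the routine actually used in WebCorp.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect page text while skipping anything inside <a>, <script> or <style>."""

    SKIPPED = {"a", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented page fragment with one link.
page = '<p>The ferry sank. <a href="/news">More news</a> Rescue began.</p>'
print(body_text(page))
```

For the opposite purpose, the study of metatext, the same parser could keep the skipped material and discard the rest.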
There are a number of variables that serve a linguist and are readily producible.
With reference to the WebCorp GUI, we offer options for case
sensitive/insensitive search, URL display and full text hyperlink, specifiable span
(ideally up to a maximum of the total text), and selected formats (including
HTML, ASCII and HTML Tables).
owned first quarter losses after cutting costs in its South African and
Scandinavian operations Ananova: Melissa computer virus creator
gets 20 years in prison David Smith, who admitted creating the
Melissa virus that swamped computer networks worldwide and
caused millions of dollars in damage in 1999, was sentenced today to
20 months in prison, prosecutors said.
So another approach to finding sentence boundary has been tested with WebCorp,
in which it simply searches backwards through the text, left of the search term,
for the previous upper-case initial word. This simple measure is surprisingly
successful in identifying a sentence start, or at least a clause start, which is often a
satisfactory compromise in terms of the interpretability of a context. However, its
success is determined by various factors, such as grammar. For instance, it works
well with the verb swamped because the previous upper-case initial word is very
often the noun (or proper name) designating the clause subject. (This word relates
to David Blunkett's unfortunate remark in 2002 about schools being swamped
with immigrants). Our output is shown in Figure 5.
David Smith, who admitted creating the "Melissa" virus that swamped
computer networks worldwide and caused millions of dollars in damage in
1999, was sentenced today to 20 months in prison, prosecutors said.
January 2000 "Swamped!
Technology Summary: Swamped!
By combining research in autonomous character design, automatic camera
control, tangible interfaces and action interpretation, Swamped!
Academic Papers: Swamped!
Sorry, I have been swamped with other stuff but
Or, as with any developer, you're probably swamped with bugs.
Some of the competitors, however, persisted in racing until they were
swamped.
Birmingham City's ticket offices were bracing themselves to be swamped
by eager football fans today hoping for a ticket for the Division One play-off
final.
Call centers of high-tech companies are swamped, and consumers are
fuming
Figure 5. Potentially sentence-length contexts for swamped
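The backward search for an upper-case initial word can be sketched as follows. This is a simplified reading of the heuristic, with an invented function name: it stops at the nearest capitalised word to the left of the search term, so a capitalised proper noun in mid-sentence will cut the context short, and any refinements used in WebCorp itself are not reproduced.

```python
import re


def context_from_capital(text, term):
    """Return the text from the nearest preceding capitalised word up to the term."""
    match = re.search(r"\b" + re.escape(term) + r"\b", text)
    if match is None:
        return None
    start = 0
    # Scan the words to the left of the term and remember the last capitalised one.
    for word in re.finditer(r"\S+", text[:match.start()]):
        if word.group(0)[0].isupper():
            start = word.start()
    return text[start:match.end()]


# The first clause is invented; the second sentence is taken from the figure above.
text = ("the race went on. Some of the competitors, however, "
        "persisted in racing until they were swamped.")
print(context_from_capital(text, "swamped"))
```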
He grabbed the stapler, and sulkily asked me to make him a cup of tea.
Her husband, who is driving, frowns sulkily.
"I suppose so," the other sulkily replied, as he crawled out of the umbrella.
"Sorry," they mumbled sulkily.
Cilla: (sulkily) All right, fine
Ed sulkily.
Elinor responded sulkily as she smoothed the folds of her long cambric
overdress.
An obvious strategy for improving the output is to download the text for post-
processing, at which point the potential of grammatical and other factors for
sentence identification may be exploited.
Word           Total   Counts in positions L4-R4 (non-empty cells only)   Left Total   Right Total
wage           36      1, 34, 1                                           1            35
national       15      15                                                 15           0
rate           6       3, 3                                               0            6
Please         5       2, 1, 2                                            2            3
set            5       1, 4                                               0            5
UK             4       3, 1                                               4            0
National       4       2, 2                                               2            2
standards      4       1, 3                                               1            3
requirements   4       1, 2, 1                                            0            4
level          4       4                                                  0            4
guide          3       2, 1                                               3            0
new            3       1, 1, 1                                            3            0
rates          3       1, 2                                               1            2
section        2       1, 1                                               2            0
maximum        2       2                                                  0            2
regulations    2       1, 1                                               1            1
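A table of this kind, with collocates counted by position up to four words to the left and right of the search term, can be computed as sketched below. The function names are invented, and the node word minimum and the two sample sentences are assumptions chosen for illustration; they are not taken from the data behind the table.

```python
from collections import Counter, defaultdict


def collocate_table(tokenised_texts, node, span=4):
    """Count each collocate of node by position: -4..-1 are L4..L1, 1..4 are R1..R4."""
    table = defaultdict(Counter)
    for tokens in tokenised_texts:
        for i, token in enumerate(tokens):
            if token.lower() != node:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    table[tokens[j]][offset] += 1
    return table


def totals(positions):
    """Return (total, left total, right total) for one collocate."""
    left = sum(c for off, c in positions.items() if off < 0)
    right = sum(c for off, c in positions.items() if off > 0)
    return left + right, left, right


# Invented sentences, already split into words.
texts = [
    "the national minimum wage rate".split(),
    "a new minimum wage was set".split(),
]
table = collocate_table(texts, "minimum")
for word, positions in table.items():
    print(word, dict(positions), totals(positions))
```

Because the collocates are keyed by their original spelling, National and national would be counted separately, as they are in the table.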
Lexical items are often common combinations of two or more words, in more or
less fixed patterns. It is possible with WebCorp to search on the Web for these,
and also for discontinuous phrases, which can be effected through the use of a
wildcard character. So the * sank retrieves a series of phrases containing some
of the collocational set which sits between the words the and sank, which is:
the boat sank, the ship sank, the ferry sank, etc.
Multiple wildcard characters within the pattern the * * sank can expand the
search to discover some of the members of each of the two collocational sets that
sit between the words the and sank, which include: the unsinkable ship sank, the
Russian submarine sank, etc.
It is also possible to support a search for variable strings using wildcards.
These can match inflections and suffixes, such that run* will represent run,
running, runs, runner, runners, but also runt, rune, rung. However, wildcard use
in the matching of initial word elements (e.g. *ing) is not supported by search
engines, though there are obvious off-line post-editing remedies to apply.
Square brackets and pipe characters (as separators) are additional
measures for introducing grammatical or orthographic variation into the search,
as for instance 'the boat s[a|u]nk'. Square brackets around lexical variants,
e.g. 'the [boat|ship] sank', allow a search for the alternatives specified.
Brackets can be used to allow more flexibility and/or specificity, so that
run can be explicitly expanded to r[un|an|unning|uns], which will retrieve
instances of run, runs, running, and ran.
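The wildcard and bracket notation described above can be approximated off-line with regular expressions, for instance when post-processing downloaded text. The following is a minimal sketch under stated assumptions, not WebCorp's actual implementation: `pattern_to_regex` is a hypothetical helper, a free-standing `*` is taken to mean exactly one word, a word-final `*` any continuation of the word, and bracketed alternatives are assumed to contain no spaces.

```python
import re

def pattern_to_regex(pattern):
    """Translate a WebCorp-style pattern into a compiled regex (assumed semantics)."""
    parts = []
    for token in pattern.split():
        out = []
        i = 0
        while i < len(token):
            ch = token[i]
            if ch == "*":
                # Bare '*' = one whole word; '*' inside a token = any word characters.
                out.append(r"\w+" if token == "*" else r"\w*")
                i += 1
            elif ch == "[":
                close = token.index("]", i)
                options = token[i + 1:close].split("|")
                out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
                i = close + 1
            else:
                out.append(re.escape(ch))
                i += 1
        parts.append("".join(out))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

one_gap = pattern_to_regex("the * sank").search("Then the ferry sank quickly")
two_gaps = pattern_to_regex("the * * sank").search("the Russian submarine sank")
suffixes = pattern_to_regex("run*").findall("run running runs runt rerun")
variants = pattern_to_regex("r[un|an|unning|uns]").findall("run ran running runs runner")
print(one_gap.group(), two_gaps.group(), suffixes, variants)
```

Because the translation is applied to text already retrieved, it can also handle word-initial wildcards, which the search engines themselves do not support.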
Wildcards allow the discovery of new or unconventional forms, of the kind
that supports the testing of a user's hypothesis that electronic communication
encourages greater inflectional variation, especially in youth-speak. For example,
the query '[he|she|I] text* [him|her|me]' confirms this and moreover reveals
that text functions not only as a verb but also as an uninflected past-tense
verb:
I sent him my picture and he text-ed me back that I look like his wife
I was almost speechless when she text'd me the last one below
Yesterday he texted me in a meeting with 'you want to go out?'
The next time I text him, he didn't reply
I texted her and invited her to meet us
The pattern can be further specified in the light of first run results, as in:
Text type and genre can be specified via the Open Directory or Yahoo
Some indication of document date (typically last update) can be identified,
where it is provided, using the WebCorp output option that displays URLs
Search may be limited to the whole or part of a particular URL, such as
bbc.co.uk, or .gov
Search may be limited to certain (and multiple) domains, using Boolean
terms as follows: .net OR .org; .ac.uk OR .edu
A word filter may be used, specifying that the search term, e.g. plant, must
occur in a text also containing, or excluding, a particular word or words,
such as +flower -nuclear
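Two of the refinements listed above, the domain restriction and the word filter, could be applied to retrieved pages roughly as follows. This is an illustrative sketch only: the function names are hypothetical, the domain check looks at the host name alone, and the filter syntax assumes that a leading + marks a required word and a leading - an excluded one.

```python
import re
from urllib.parse import urlparse

def in_domains(url, suffixes):
    """True if the URL's host is, or ends with, one of the given domains."""
    host = urlparse(url).hostname or ""
    for suffix in suffixes:
        bare = suffix.lstrip(".")
        if host == bare or host.endswith("." + bare):
            return True
    return False

def parse_word_filter(spec):
    """Split a filter such as '+flower -nuclear' into required and excluded words."""
    required, excluded = set(), set()
    for term in spec.split():
        if term.startswith("+"):
            required.add(term[1:].lower())
        elif term.startswith("-"):
            excluded.add(term[1:].lower())
    return required, excluded

def passes_word_filter(text, spec):
    """True if the text contains every required word and no excluded word."""
    words = set(re.findall(r"\w+", text.lower()))
    required, excluded = parse_word_filter(spec)
    return required <= words and not (excluded & words)

# Invented examples: '.ac.uk OR .edu' is expressed as a list of alternatives.
print(in_domains("http://www.bbc.co.uk/news", ["bbc.co.uk"]))
print(in_domains("http://www.example.com/", [".ac.uk", ".edu"]))
print(passes_word_filter("The plant has a red flower", "+flower -nuclear"))
```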
3.5.2 Internationalisation
other languages. We have in the last year or so built some of this functionality
into WebCorp.
The user may wish to refine his/her search by specifying the language of the
context surrounding the chosen search term. One possibility is to specify a
particular country code. However, our findings are that there is no one-to-one
correlation between a country code and its associated language. The country code
can retrieve text in languages other than the one associated with the country. A
search on the term gracejar, a Portuguese word meaning 'to joke', might be
expected to generate relevant output, but even with the specification of a country
code, in this case .pt, it does not, as shown in Figure 8.
Perhaps the best one can say is that the country code refines the scope of
reference to one of interest to inhabitants of that country, and this tends to favour
texts in the native language. Ultimately, success in retrieving a particular
language via the country code comes primarily with search terms that are unique
to the language associated with it. The exclusively French word blaguer with
French domain setting retrieves only French language contexts, as in Figure 10.
Even so, if the search term is cited rather than used, it could occur anywhere, as
we see in Figure 11 below, where we submitted the search term blaguer to
Portuguese text domains and nevertheless managed to retrieve it in Portuguese
contexts.
Feature analysis
Secondly, one could identify a language through Feature Analysis of a
candidate Web text. Much work has been done on the automatic identification of
particular languages, not least by the Leeds team of Eric Atwell, Clive Souter,
and their postgraduate students (Souter et al. 1994). The two approaches that we
have so far isolated as promising are what we shall call Negative Feature
Analysis, and Positive Feature Analysis.
The principle of negative feature analysis is that a text is deemed not to be
in a particular language if it contains features not associated with that language.
The features could be a sequence of characters drawn from text of a given
language. This approach is exemplified by the work of a team of undergraduate
computer scientists at the University of Paris VII (Longuemaux et al. 2001). They
have built exemplar corpora in selected major languages, and they match a Web
email to each in turn, ranking the unlikelihood of the email being in each
language. The text is judged to be more likely to have been written in the
language of which it contains fewest untypical or impossible features. The
advantage of their system is that a one-page corpus furnishes sufficient features
for matching, and the language of the unknown text can be identified after very
few character combinations. The system can also rank the relative probabilities of
the languages present in a Web text or page that contains more than one language.
This would differentiate between the main language in use and subsidiary
languages, say those occurring in links to text headers in other languages.
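The principle of negative feature analysis can be illustrated with a toy example. This sketch is not the Paris VII system, whose details are not given here: it assumes that the features are character bigrams, that a bigram unattested in a language's exemplar text counts as untypical of that language, and it uses two invented one-sentence exemplars in place of real corpora.

```python
def bigrams(text):
    """Character bigrams over letters and spaces, lower-cased."""
    chars = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

def build_profiles(exemplars):
    """For each language, record the set of bigrams attested in its exemplar."""
    return {lang: set(bigrams(sample)) for lang, sample in exemplars.items()}

def rank_by_unlikelihood(text, profiles):
    """Rank languages by how many of the text's bigrams are unattested (fewest first)."""
    grams = bigrams(text)
    scores = {
        lang: sum(1 for g in grams if g not in attested)
        for lang, attested in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Invented miniature exemplars; real exemplar corpora would be larger.
exemplars = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat sat on the mat",
    "fr": "le renard brun saute par dessus le chien paresseux et puis le chat dort sur le tapis",
}
profiles = build_profiles(exemplars)
ranking = rank_by_unlikelihood("the dog sat on the mat", profiles)
print(ranking)
```

The language at the head of the ranking is the one in which the text contains the fewest untypical features; the remaining entries give the relative ordering of the other candidates.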
The principle of positive feature analysis, as devised by Souter and team,
is to build a character-bigram (or trigram) model of text in each of the languages
that it is desirable to identify, then to compare new incoming text against each
letter-bigram/trigram model. This isolates the right language in a few characters,
because each language has specific patterns rarely found in other languages. It
can sometimes function even with a single word as its input data. We are still
finalising our method for the WebCorp tool, but language identification does not
seem to be problematic.
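Positive feature analysis can be sketched in a similar way. The code below is an assumption-laden illustration rather than the Leeds method or the WebCorp implementation: it builds a character-trigram frequency model per language from invented one-sentence samples and scores incoming text by add-one smoothed log probability, a scoring scheme chosen here for simplicity.

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of the lower-cased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(sample):
    """Return trigram counts and their total for one language sample."""
    counts = Counter(trigrams(sample))
    return counts, sum(counts.values())

def log_score(text, model, vocab_size):
    """Add-one smoothed log probability of the text under one model."""
    counts, total = model
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size)) for g in trigrams(text)
    )

def identify(text, samples):
    """Pick the language whose trigram model gives the text the highest score."""
    models = {lang: train(sample) for lang, sample in samples.items()}
    vocab_size = len({g for counts, _ in models.values() for g in counts})
    return max(models, key=lambda lang: log_score(text, models[lang], vocab_size))

# Invented miniature training samples.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat sat on the mat",
    "fr": "le renard brun saute par dessus le chien paresseux et puis le chat dort sur le tapis",
}
print(identify("the cat", samples), identify("le chien", samples))
```

With realistic training text the same approach can often decide on very short input, since each language has trigrams that are rare or absent in the others.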
4. Next steps
In the next phase, we will carry on this research within the framework of the
University of Liverpool ULGRID initiative. This is concerned with the design
and implementation of the next generation of the Internet, with reference to the
new types of software, middleware and hardware that are required to facilitate
the larger tasks and greater traffic anticipated for the future. Greater in-university
processing power and distributed processing initiatives will help to increase the
speed of WebCorp response. In terms of improving access to more linguistically
usable Web-based text, we will be making recommendations, to the Semantic
Web and other initiatives, to enrich and standardise Web text mark-up for
document language and linguistically vital information such as date of authorship.
A fledgling markup infrastructure exists, but its adoption and uniform use by
Web page creators is slow.
Acknowledgement
References
Bergman, M.K. (2001), The deep Web: surfacing hidden value:
http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf.
Kehoe, A. and A. Renouf (2002), WebCorp: applying the Web to linguistics and
linguistics to the Web, in: Proceedings of 11th International World Wide
Web Conference, Honolulu, Hawaii, 7-11 May 2002 (http://www.
2002.org/CDROM/poster/67/)
Kübler, N. and P.-Y. Foucou (2000), A Web-based environment for teaching
technical English, in: L. Burnard and T. McEnery (eds), Rethinking
language pedagogy. Papers from the Third International Conference on
Language and Teaching. Frankfurt am Main: Peter Lang. 65-73.
Longuemaux, F., F. Morandeau, A. Riviere, R. Tadayoni-Rouchon, P. Vaz
Martinho (2001), Reconnaissance de la langue à partir de facteurs
interdits. Unpublished manuscript, Univ. Paris VII Denis Diderot.