Conference Paper · December 1994


AMALGAM:
Automatic Mapping Among
Lexico-Grammatical Annotation Models

Eric Atwell, John Hughes, and Clive Souter
Centre for Computer Analysis of Language And Speech
School of Computer Studies, Leeds University, Leeds LS2 9JT, UK
eric@scs.leeds.ac.uk  john@scs.leeds.ac.uk  cs@scs.leeds.ac.uk

Abstract

Several Corpus Linguistics research groups have gone beyond collation of 'raw' text, to syntactic annotation of the text. However, linguists developing these linguistic resources have used quite different wordtagging and parse-tree labelling schemes in each of these annotated corpora. This restricts the accessibility of each corpus, making it impossible for speech and handwriting researchers to collate them into a single very large training set. This is particularly problematic as there is evidence that one of these parsed corpora on its own is too small for a general statistical model of grammatical structure, but the combined size of all the above annotated corpora should deliver a much more reliable model.

We are developing a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the above corpora. We plan to develop a Multi-tagged Corpus and a MultiTreebank, a single text-set annotated with all the above tagging and parsing schemes. The text-set is the Spoken English Corpus: this is a half-way house between formal written text and colloquial conversational speech. However, the main deliverable to the computational linguistics research community is not the SEC-based MultiTreebank, but the mapping suite used to produce it - this can be used to combine currently-incompatible syntactic training sets into a large unified multicorpus. Our architecture combines standard statistical language modelling and a rule-base derived from linguists' analyses of tagset-mappings, in a novel yet intuitive way. Our development of the mapping algorithms aims to distinguish notational from substantive differences in the annotation schemes, and we will be able to evaluate tagging schemes in terms of how well they fit standard statistical language models such as n-pos (Markov) models.(1)

(1) This research began with grants from the UK Science and Engineering Research Council (SERC) and the Universities Funding Council's Knowledge Based Systems Initiative (UFC KBSI), and is now funded by the UK Engineering and Physical Sciences Research Council (EPSRC) and the Higher Education Funding Councils' New Technologies Initiative (HEFCs' NTI); we gratefully acknowledge their financial support. We are also grateful to the Corpus Linguistics research teams who have generously provided background information on their tagging and parsing schemes.

Introduction

Several research projects around the world are building grammatically analysed corpora; that is, collections of text annotated with part-of-speech wordtags and syntax trees. Tagged and parsed English corpora (Bank of English [54]; BNC [44], [22]; Brown [24]; ICE [12], [28], [64]; Lancaster-IBM [26], [22]; LOB [1], [3], [38], [42]; London-Lund [62]; Nijmegen [12]; PoW [23], [55], [57]; SEC [63]; TOSCA [46], [31], [12]; UPenn [53], [45]; etc) are used, among other things, as authoritative examples by researchers in English Language Teaching and Lexicography (e.g. [44]), and as training data for statistical syntactic constraint models to improve recognition accuracy in speech and handwriting recognisers (e.g. [37], [10]).

However, projects have used quite different wordtagging and parsing schemes. In contrast to the Speech research community, which has reached broad agreement on an uncontentious set of labelling conventions for phonetic/phonemic analysis, there is no general consensus in the international Natural Language research community on analogous conventions for grammatical analysis. Developers of corpora adhere to a variety of competing models or theories of grammar and parsing, with the effect of restricting the accessibility of their respective corpora, and the potential for collation into a single fully parsed corpus.

In view of this heterogeneity, we have begun to investigate and develop methods of automatically mapping between the annotation schemes of the most widely known corpora, thus assessing their differences and improving the reusability of the corpora. Annotating a single corpus with the different schemes allows for comparisons, and will provide a rich test-bed for automatic parsers.

The most widely known tagged corpora for English are: the Lancaster-Oslo/Bergen (LOB) Corpus; the Brown Corpus; and the London-Lund Corpus. In addition, the International Corpus of English (ICE) should be included as its tagset has now been published [28]. Parsed corpora for English include: the Lancaster-IBM Treebank; the Lancaster-IBM Spoken English Corpus (SEC) Treebank; the Lancaster-Leeds Treebank; the Polytechnic of Wales (POW) Corpus; the Nijmegen Corpus; the TOSCA Corpus; and the University of Pennsylvania (UPenn) Treebank. We plan to include the parsed ICE-GB (Great Britain component of ICE) and the BNC (British National Corpus) in the project when they become available.

As a development and testing resource, we are using the text of the Lancaster-IBM Spoken English Corpus (SEC). The SEC is a collection of recordings of radio broadcasts with accompanying annotated transcriptions, collected by Lancaster University and IBM UK as a general research resource. The SEC is available from the International Computer Archive of Modern English (ICAME) based at the Norwegian Computing Centre for the Humanities (in Bergen, Norway). The corpus exists in several forms and annotations: the digitised acoustic waveform; the graphemic transcription annotated with prosodic markings; and a part-of-speech analysis (using the LOB Corpus tagset). Skeletal parsing has been added to create the SEC Treebank, and this forms a subset of the Lancaster-IBM Treebank.

Gerry Knowles (Lancaster) and Peter Roach (Leeds) are collaborating in an ESRC-funded project to set up a time-aligned database of recorded speech, accompanied by phonetic and graphemic transcriptions. Our proposal will produce, as a side-effect, several alternative tagged and parsed versions of the SEC which will be made available to the SEC database project collaborators. It will also be able to act as a test-bed for the comparison and evaluation of parsing schemes.

Objectives of the project

The main objectives are as follows:

• To design and implement algorithms for mapping between corpus annotation schemes, for both wordtag sets and phrase structure grammar schemes.

• To empirically evaluate the accuracy and shortcomings of the developed mapping algorithms, by applying them to the tagged SEC and the SEC Treebank. The outcome of this evaluation will be to highlight the notational and substantive differences between the alternative tagging and parsing schemes.

• To build a Multi-Tagged Corpus, by enhancing the Spoken English Corpus with different wordtagging schemes.

• To build a Multi-Treebank, by enhancing the Spoken English Corpus with grammatical analyses according to several alternative grammatical theories.

• To investigate the use of the Multi-Treebank as a benchmark for grammars and parsers.

Initially, we considered adopting the 'Interlingua approach' to mapping, as used in Machine Translation projects such as EUROTRA. This would require us to develop tagset mappings between the LOB Corpus (our primary tagset Interlingua) and each of the 'major' tagged corpora: BROWN, ICE, Lancaster-IBM, and UPenn. Next, full grammar mappings would be developed between the Lancaster-IBM Treebank (our primary parsing scheme Interlingua) and each of: the UPenn Treebank and the Lancaster-Leeds Treebank. The ICE and BNC tagsets and parsing schemes could be included when they become available. Mapping between tagsets will involve relabelling of words, whereas mapping between grammar schemes also involves structural manipulation. These treebanks have been chosen for their skeletal parsing schemes, which are of relatively similar structure apart from a small number of systematic differences.

We have chosen the SEC as a 'core' text for this project, because

1. the tagged SEC uses the same tagset as the LOB Corpus (widely considered to be the UK standard and our proposed primary tagset);

2. the parsed SEC uses the same grammatical scheme as the Lancaster-IBM Treebank (our proposed primary parsing scheme);

3. these are the annotation schemes which we have most prior experience of;

4. the text material, BBC radio broadcasts, is a neutral compromise between written and conversational spoken English genres.

Our aim is to develop bidirectional mappings for the above tagsets and grammar schemes, although we appreciate that for mapping from simple to delicate schemes this will not be possible, and that mappings will be imperfect. As mapping algorithms are developed and tested, and whilst building the Multi-Tagged Corpus and Multi-Treebank, we will compile "handbooks" of common errors (i.e. mismatches) and their corrections. These will help future users of the developed mapping algorithms to straightforwardly post-edit their mapped corpora and treebanks, thus maximising resource reusability. To map between two tagsets other than LOB, two mappings will be necessary (via the primary tagset, our "interlingua" representation); similarly for non-terminal grammar schemes. We appreciate the danger of propagating incorrect mappings.

If there is sufficient time, we hope to go on to investigate mapping algorithms for other (more detailed) grammar schemes; for example the parsed POW Corpus (Systemic Functional Grammar), and the parsed Nijmegen Corpus (Extended Affix Grammar). The non-corpus-based Generalised Phrase Structure Grammar (GPSG) (as used in the Alvey Natural Language Toolkit ANLT) should also be included. Mapping from these to the Lancaster-IBM Treebank grammar scheme would only be uni-directional, i.e. from a detailed to a skeletal analysis.

The Multi-Treebank will be produced by applying the final version of each grammar scheme mapping algorithm to the SEC Treebank. Similarly, for the Multi-Tagged Corpus, the final version of each tagset mapping algorithm will be applied to the tagged SEC. The resulting annotations will then be intensively proofread and post-edited. This will require consultations with authorities in each of the tagsets and grammar schemes involved.

Progress to date

We envisage three main stages to the project: implementation of algorithms for mapping between tagsets; implementation of algorithms for mapping between phrase structure grammatical analysis schemes; and investigating applications of the mapping programs, multi-tagged corpus, and multi-treebank.

We are currently in the first of these. Mapping algorithms are being designed and implemented between the LOB Corpus tagset and each of: the tagged BROWN Corpus, the tagged ICE, the Lancaster-IBM Treebank, the UPenn Tagset (and the BNC tagset will be added when published). Each tagset is being considered in turn:

1. Analysis of the notational and substantive differences between the LOB tagset and the 'current' tagset.

2. Design and implementation of a mapping algorithm (two-way, where possible).

3. Evaluate success of algorithm by applying it to the tagged SEC; incrementally improve in light of common errors and linguistic intuition.

A side-effect of this phase is the production of a Multi-Tagged Corpus: the SEC text annotated with each tagset.

A standard format for tagged and parsed corpora

As well as using different tagsets and parsing schemes, different annotated corpora come in a range of different formats - see [57], [59], [60]. A non-trivial first step in merging tagged and parsed corpora is to decide on a unitary standardised format. Although the Text Encoding Initiative (TEI) [61], [18] offers general guidelines for text formatting standards, and some corpora (including BNC, ICE) aim to be "TEI-conformant", in practice it seems almost as hard for Corpus linguists to agree to accept a single annotation format as it is to agree on a single annotation scheme. Our mapping software will use a standardised internal format for taggings and parse-trees, but will have to be able to accept input and produce output in a range of existing formats.

Hand-crafting a detailed mapping

One approach to obtaining a mapping between two tagsets is to use expert linguistic knowledge in identifying the relationship between particular tags, and is exemplified in, for example, [58]. In this work, Souter drew up a mapping between the parts of speech used in the CELEX database [17] (which were themselves derived largely from those in LDOCE [51]), and the systemic functional grammar (SFG) used to hand parse the Polytechnic of Wales corpus [23], [55].

The aim was to provide a large lexicon to support SFG-based parsing programs. The original CELEX lexicon, which contained some 80,000 English wordforms, was transformed into a lexicon with SFG tags, using a semi-automatic mapping program written in the AWK programming language. The resulting lexicon was then compatible with a large corpus-based systemic grammar consisting of over 4,000 phrase-structure rules [56]. Together they can then support relatively robust probabilistic parsing programs.

The problems encountered in trying to specify such a mapping result from disparity in the level of delicacy in the two tagging schemes. Mapping from a coarse- to fine-grained grammar must be achieved manually, using subcategorisation information contained in the lexicon, contextual information, or exception lists. Souter's program contained simple one-to-one mappings, many-to-one mappings, and one-to-many mappings supported by exception lists and subcategorisation information. In his work, contextual information could not be used to support the mapping because the source material was a lexicon and not a tagged corpus. A small part of the mapping code (used to map between pronoun labels) is shown in Figure 1.

    else if (category == "PRON")        # pronouns
    {
      if ($8 == "Y") printf("HWH)(")    # WH-pronouns
      else if (($1 == "no-one") ||\
               ($1 == "nobody") ||\
               ($1 == "nothing")||\
               ($1 == "noone")  ||\
               ($1 == "none"))
           printf("HPN)(")              # negative pronouns
      else printf("HP)(")               # other pronouns
    }

Figure 1: Fragment of an AWK mapping from CELEX to SFG tagsets

Here, the coarse-grained CELEX tag PRON (pronoun) is mapped to three SFG tags, HWH, HPN and HP. The default mapping is to HP (pronominal head of the nominal group), but if the CELEX lexicon contains subcategorisation information in the form of a Y in column 8, then we can assign the label for wh-pronoun head (HWH). An exception list is used to map to the third SFG pronoun label (HPN), for negative pronoun heads.

Incremental refinement through feedback

Earlier work at Leeds [32] explored ways that a probabilistic grammar may be improved with positive feedback from a human user; this has direct implications

for how to improve the mappings incrementally. As the mapped annotations are to be hand-corrected by experts, this provides positive feedback. Rather than tag the core text completely using the best derived mapping, a better idea would be to do it in sections and then have the expert correct the errors in each section in turn. After each section is complete, the mapping rules will be updated to incorporate the new information. Hopefully this will enable future sections to be mapped more accurately. This method is similar to that used by the Nijmegen corpus parsing group [47].

Problems with the interlingua-based approach

The interlingua idea seems sound for several reasons. Amongst these is the saving made in required mappings. For instance, five tagging schemes would require twenty mappings if each pair is mapped directly in both directions. However only eight mappings are required if one of the tagging schemes acts as an interlingua. This saving becomes greater as more tagging schemes are considered. The interlingua also helps the mappings attain a level of consistency, as the interlingua is the basis of all possible mappings from one tagging scheme to another. However, the interlingua may cause problems in the instances where it is coarse-grained relative to other tagging schemes. For instance, the LOB tagset has no notion of verb transitivity whereas the ICE tagset does. If a mapping is being made between two tagging schemes both of which incorporate the concept of transitivity, then a tag may be wrongly allocated as the sense of transitivity is lost via the LOB tagset interlingua.

One problem with the work plan is the strong emphasis on this problem of imperfect mappings due to coarse-grained parts of the interlingua. It may turn out after experimentation that the contextual information of the surrounding tags and words makes up for this. The sentence, S, the interlingua tags, a, and the other tagging scheme's tags, b, can be represented as follows:

    Sentence   Tagging 1   Tagging 2
    S1         a1          b1
    S2         a2          b2
    S3         a3          b3
    ...        ...         ...
    Sn-2       an-2        bn-2
    Sn-1       an-1        bn-1
    Sn         an          bn

Using a window of the closest four neighbours, say, the tag ai, when presented with a difficult tag, bi, has the additional context information of Si-2, Si-1, Si+1, Si+2, ai-2, ai-1, ai+1 and ai+2 to work on. Previous research (including [4], [5], [33], [34], [35], [36]) has indicated that there is a high degree of useful contextual information implied by the surrounding items. This contextual information can be used with clustering techniques to classify words. These techniques could be applied to the tagging scheme with little modification. The classifications could then be examined to see which types of tag are difficult to group. Tags which are difficult to cluster may be useful in identifying problem areas early. As an example, Figure 2 shows a clustering dendrogram for the LOB Corpus tagset.

One possible problem might be: given the set of words in (S1...Sn) and the interlingua tags (a1...an), how are the other tagging scheme's tags (b1...bn) derived? Obviously, the tagging scheme itself can be used to map directly the words (S1...Sn) onto the tags (b1...bn) without knowledge of the interlingua tags (a1...an). Also, mappings could be made solely between the interlingua tags (a1...an) and the other tags (b1...bn). This could be done by having an expert tag a section of the 'core text' with both the interlingua and the other tagging scheme. Probabilistic rules could be derived indicating how the tags match up. These rules would be strengthened if the context of the surrounding tags was incorporated.

When the two pieces of information are combined, it is hoped that a more accurate mapping can be achieved. This can be done by mapping directly from the word plus tag to the new tag. For instance:

    Si + ai -> bi

However, rules could grow very large even when regularities are used to reduce their number.

Alternatively, a mapping could be made for Si -> bi1 according to the standard annotation rules for the non-interlingua tagging scheme; and a mapping could be done for ai -> bi2 according to the procedure outlined above. These mappings produce two potential tags independently. One algorithm might always accept bi1 when bi1 = bi2. When bi1 ≠ bi2, a decision needs to be made as to which, if any, of the tags should be chosen. Brill [14] developed a clever and highly accurate tagging scheme which could have implications for this problem. He tagged every occurrence of a word with its most probable tag if there was more than one choice. A second pass of the corpus would update the tags according to a set of automatically acquired rules. A similar idea could be utilised to choose between the tags. Perhaps the most probable tag would always be selected on the first pass, but we would allow that decision to be altered on a second pass according to rules derived from earlier sections of the corpus that had been tagged and checked by the expert linguist. This, then, is another example of incremental learning.

Combining Symbolic and Statistical Approaches to Language

Our research is particularly relevant to this workshop, as it is clear we will have to combine rule-based symbol-mapping knowledge with statistical disambiguation models. We envisage that the bulk of a

mapping can be done using simple one-to-one symbol replacement rules; but some rules will map one source symbol onto a set of more than one target symbol. A statistical part-of-speech tagger along the lines of the CLAWS tagger used on LOB [42], [3], [25] can then be used to select between the reduced candidate set using a statistical context model; probably a 1st-order Markov or Bi-Pos model is most appropriate, as this is the simplest 'standard' statistical language model and is widely used and understood; see for example [2], [48], [7], [37], [9], [49].

We can arrive at such a model of source-to-target mapping by 'working backwards': first run a CLAWS-style Markovian target-tagset tagger over the text, ignoring the source tags; proofread the output to note where this makes mistakes (assigns incorrect target tags); and then devise source-to-target tag mapping rules only for these cases. We are aware from our own initial attempts at devising tag-mappings that this requires a high level of specialist linguistic knowledge, of both source and target tagset; this "symbolic patching of the statistical model" approach minimises the 'linguistic expertise' we need to capture (and first develop!) to devise symbolic mapping rules. We have learnt of corpus-tagset mapping work by a number of other researchers (including [13], [30], [20], [39], [41], [52]), but generally such research in the past has been merely a means to an end (to create a re-tagged Corpus), so the full mapping algorithms have not been formalised or published; but if all we need is a limited number of mapping-rules to "patch" the Markov model then we may be able to glean sufficient details from informal notes etc. It may appear that we are promoting bad Software Engineering principles in advocating symbolic "patches" to fix the flaws in the statistical model as and when we spot them - patching up a program as the bugs seep out. However, we prefer to view this as a principled, well-founded approach to combining symbolic and statistical models, minimising the 'overlap' by ensuring that each has a separate useful contribution to make to the overall mapping task.

We envisage combining a CLAWS-style tagger module for each of the target tagsets into a single Multi-tagger program. This accepts as input a stream of words annotated with tag(s) from one (or more) of the source tagsets; the output is a stream of words plus tags from ALL target tagsets. This model allows for inclusion of mapping rules both direct from source to target tagset, and via an interlingua (a backup default to try if there are no direct mapping rules).

Future Work

The remainder of the project will be devoted to the two other phases of the plan listed earlier: mapping between phrase-structure parsing schemes, and investigating applications of the multi-tagged corpus and multi-treebank.

Figure 2: A Clustering of LOB Tags
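The Bi-Pos selection step described in this section (one-to-many mapping rules leave each word with a set of candidate target tags, and a 1st-order Markov model over tag bigrams chooses among them) can be sketched in a few lines of Python. This is a minimal illustration rather than the CLAWS implementation: the tag names are LOB-style (AT, NN, VB) but the transition counts are invented, and an exhaustive search stands in for the usual Viterbi lattice.

```python
from itertools import product as cartesian

# Toy bigram (Bi-Pos) transition counts between target tags; in the
# setting described above these would be estimated from a corpus already
# tagged with the target tagset. The counts here are invented.
TRANSITIONS = {
    ("START", "AT"): 8, ("START", "NN"): 2,
    ("AT", "NN"): 9, ("AT", "VB"): 1,
    ("NN", "VB"): 6, ("NN", "NN"): 3,
    ("VB", "AT"): 5, ("VB", "NN"): 2,
}

def trans_prob(prev, cur):
    """Relative frequency of the tag bigram (prev, cur), with a small
    floor so unseen transitions are unlikely rather than impossible."""
    total = sum(c for (p, _), c in TRANSITIONS.items() if p == prev)
    return (TRANSITIONS.get((prev, cur), 0) + 0.1) / (total + 1.0)

def disambiguate(candidates):
    """Pick one tag per word from its candidate set so that the product
    of bigram transition probabilities is maximised (exhaustive search
    here; a Viterbi lattice would be used on real corpora)."""
    best_seq, best_score = None, 0.0
    for seq in cartesian(*candidates):
        score, prev = 1.0, "START"
        for tag in seq:
            score *= trans_prob(prev, tag)
            prev = tag
        if score > best_score:
            best_seq, best_score = list(seq), score
    return best_seq

# First word maps one-to-one; the other two are left ambiguous by the
# symbolic mapping rules.
print(disambiguate([["AT"], ["NN", "VB"], ["VB", "NN"]]))
```

A real mapper would, as the text proposes, add symbolic "patch" rules only for the cases where this statistical choice is observed to go wrong.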

Implementation of Algorithms For Mapping Between Grammar Schemes

Initially mapping algorithms will be designed and implemented between the Lancaster-IBM Treebank grammar scheme, and each of the UPenn Treebank and the Lancaster-Leeds Treebank. Each grammar scheme will be considered in turn:

1. Analysis of the notational and substantive differences between the Lancaster-IBM grammar scheme and the 'current' grammar scheme.

2. Manually parse a subset of the SEC according to the 'current' grammar scheme. This subset should be sufficient to allow a prototype mapping algorithm to be induced.

3. Apply mapping algorithm to the parsed SEC; incrementally improve in light of common errors and linguistic intuition.

Depending on how much time is available, mapping algorithms for more detailed grammar schemes will be investigated: parsed POW Corpus, parsed Nijmegen Corpus, GPSG, and the BNC grammar scheme (when published). A side-effect of this phase will be the production of a Multi-Treebank: the SEC automatically annotated with each grammar scheme.

The all-in-one Multi-tagger architecture outlined above can be carried over to a Multi-parser. Instead of a CLAWS-style Markovian tagger, for each target parsing scheme a grammar and parser can be extracted directly from the corresponding training Treebank. A Context-Free Grammar can be elicited directly by extracting each non-terminal and its immediate-daughter-sequence, to become the left-hand-side and right-hand-side respectively of a context-free grammar rule [6]; frequencies of constituents in the training treebank can be used to make this a Probabilistic Context Free Grammar [49], useable in a treebank-trained probabilistic parser such as those in [2], [29], [9], [50]. Rather than producing a single, fully correct parse-tree for each input sentence, these probabilistic Treebank-trained parsers generally output an ordered list of possible parse-trees, with a probability or weight attached to each. As with the procedure for developing a partial tag-mapping, we need only devise source-to-target parse-tree-constituent mappings in cases where the target-parser's 'best' parsetree is not fully correct.

Assessment of the Multi-Treebank as a Benchmark for Grammars

This requires analysis of the substantive differences between different parses of the SEC sentences; detailed analysis of how many and which constructs differ in the different language models. It may be possible to divide the sentences in the SEC into two subsets: a common core of "uncontentious" sentences which all or most theories analyse in much the same way; and a "troublesome" subset of sentences which linguists can concentrate their debate on.

One possible criticism of a lot of work in Corpus Linguistics, including the AMALGAM proposed workplan, is that we restrict ourselves to variants of existing tagging and parsing schemes which are specifically crafted for Corpus annotation, but which are quite different from grammar models being advocated and developed by non-Corpus-based theoretical linguists, such as GPSG or HPSG (see e.g. [27]). Unfortunately, we know of no English corpus parsed according to such a feature-based unification-oriented formalism, so one cannot readily be included in the AMALGAM project; however, we would like to hear from theoretical linguists who we could collaborate with in extending the multi-parser to a unificational grammar formalism. It is not clear that our multi-corpus will be a 'fair' benchmark for testing grammars and parsers from such widely-differing theories; it will be interesting to see whether the partition between "uncontentious" and "troublesome" sentences is also applicable in assessment of unificational grammars.

Another constraint of the AMALGAM project is that we are not considering Corpus-based semantic tagging schemes (e.g. [40], [22], [21]), only syntactic tagging schemes. Again, it will be interesting to see whether the syntactically "troublesome" sentences are also semantically complex or anomalous; but this is a question for another, follow-up, project.

Comparison of the Multi-Treebank with other parsed corpora

We will compare the SEC data with other parsed texts (LOB, UPenn, POW, Nijmegen, etc), to assess differences in the range and frequency distributions of grammatical constructs. The SEC consists of transcripts of scripted (and probably rehearsed) radio broadcasts. Some natural language researchers may feel that the Spoken English dataset is thus inappropriate for their work, since the grammars and parsers they are developing are designed for a different type of language, for example, unrehearsed informal spoken dialogue as found in the London-Lund Corpus and British National Corpus spoken section, or more formal published (written) text as found in the Brown and Lancaster-Oslo/Bergen Corpora. It may be appropriate to augment the SEC dataset with additional material from alternative sources. On the other hand, it may be that the main differences are in vocabulary rather than syntax, and that the coverage of the SEC, though not complete or perfect, is adequate for most applications. We will try to find empirical evidence for or against the acceptability of 'scripted' Spoken English to the NL community.
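The treebank-to-PCFG extraction outlined in the Multi-parser discussion above (each non-terminal and its immediate-daughter sequence becomes a context-free rule, with constituent frequencies turned into rule probabilities) can be sketched as follows. This is an illustrative sketch, not the AMALGAM code; the toy trees, labels, and tuple encoding are invented for the example.

```python
from collections import Counter, defaultdict

def rules_from_tree(tree):
    """Yield (mother, daughter-sequence) pairs for every non-terminal
    node. Trees are nested tuples (label, child, ...); leaves are strings."""
    label, children = tree[0], tree[1:]
    daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, daughters)
    for c in children:
        if not isinstance(c, str):
            yield from rules_from_tree(c)

def extract_pcfg(treebank):
    """Count each extracted rule, then normalise counts into the
    conditional probability P(right-hand-side | left-hand-side)."""
    counts = Counter(r for t in treebank for r in rules_from_tree(t))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs] for (lhs, rhs), n in counts.items()}

# Two skeleton parses in the style of a simplified treebank (invented).
treebank = [
    ("S", ("NP", "the", "dog"), ("VP", "barked")),
    ("S", ("NP", "a", "cat"), ("VP", "saw", ("NP", "the", "dog"))),
]
pcfg = extract_pcfg(treebank)
print(pcfg[("S", ("NP", "VP"))])  # both S nodes expand to NP VP
```

A chart parser that weights each derivation by the product of its rule probabilities then yields the ranked list of candidate parse-trees that such treebank-trained parsers output.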

Assessment of Multi-Treebank as a Benchmark for Parsers

This will involve attempting to parse the SEC text with other parsers, available from a variety of sources. To avoid the need for intensive manual proofreading or checking of results, a (semi-)automatic assessment procedure will be developed.

Anticipated Results

The tangible 'deliverables' of use to the Speech and Language research community include:

• Final implementations of algorithms for mapping between pairs of tagsets
• Final implementations of algorithms for mapping between pairs of Treebanks
• Handbooks of common errors and corrections for post-editing
• The Multi-Tagged Corpus
• The MultiTreebank
• Reports on the above

The mapping software, Multi-tagged Corpus and MultiTreebank (along with postediting handbooks and documentation) will be delivered to ICAME and Oxford Text Archive for public distribution; they will also be available for incorporation into the SEC Speech Database. Reports on the findings of the three stages of investigations will be made widely available to all interested parties through SALT and ELSNET (UK and European Networks of Excellence) and other channels including conference presentations and journal papers.

Applications

The implemented mapping algorithms will be made widely available to the UK and international speech and language research community. They will allow research groups who are using corpus-based training data to make use of other corpora straightforwardly, without substantial modifications. Any current and future users of corpora will have a much expanded resource.

The Multi-Tagged Corpus and the Multi-Treebank will be distributed, along with the main Spoken English Corpus, through ICAME. They will also be available for incorporation into the SEC Speech Database currently being created by Gerry Knowles and Peter Roach, further enhancing the SEC as a general research resource. Both the Multi-Treebank and the Multi-Tagged corpus will potentially be used by speech and language ... benchmark for parsers (explored in the workplan). It would also be a rich resource for grammar-learning experiments - a research topic of growing interest (see e.g. [8], [11], [16], [33]).

We envisage supplying the computational linguistics research community with a valuable research resource, and the ACL Workshop will be an invaluable opportunity for us to survey potential customer requirements and preferences!

References

[1] Eric Steven Atwell. 1982. LOB Corpus Tagging Project: Manual Post-edit Handbook. Departments of Computer Studies and Linguistics, Lancaster University.

[2] Eric Atwell. 1983. Constituent Likelihood Grammar. In Journal of the International Computer Archive of Modern English (ICAME Journal), No. 7, pages 34-66. Norwegian Computing Centre for the Humanities, Bergen University.

[3] Eric Steven Atwell, Geoffrey Leech and Roger Garside. 1984. Analysis of the LOB Corpus: progress and prospects. In Jan Aarts and Willem Meijs (eds), Corpus Linguistics: Proceedings of the ICAME 4th International Conference on the Use of Computer Corpora in English Language Research, pages 40-52. Amsterdam: Rodopi.

[4] Eric Steven Atwell. 1987. A parsing expert system which learns from corpus analysis. In Willem Meijs, editor, Corpus Linguistics and Beyond: Proceedings of the ICAME 7th International Conference, pages 227-235. Amsterdam: Rodopi.

[5] Eric Steven Atwell and Nikos Drakos. 1987. Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text. In Bente Maegaard, editor, Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics. New Jersey: Association for Computational Linguistics.

[6] Eric Steven Atwell. 1988. Transforming a parsed corpus into a corpus parser. In Merja Kyto, Ossi Ihalainen, and Matti Risanen, editors, Corpus Linguistics, Hard and Soft: Proceedings of the ICAME 8th International Conference, pages 61-70. Amsterdam: Rodopi.

[7] Eric Steven Atwell. 1988. Grammatical analysis of English by statistical pattern recognition. In Josef Kittler, editor, Pattern Recognition: Proceedings of
technology groups for many research and teaching pur- the 4th International Conference Cambridge, pages
poses, including: training data for speech-recognisers, 626-635. Berlin, Springer-Verlag.
optical text recognisers, word processor text-critiquing [8] Eric Steven Atwell. 1992. Overview of grammar
systems, machine translation systems, natural language acquisition research. In Henry Thompson, editor,
interfaces, and NLP applications generally; and for pro- Workshop on sublanguage grammar and lexicon ac-
viding examples for English Language Teaching (ELT) quisition for speech and language: proceedings, pages
grammar textbooks and training material. In addi- 65-70. Human Communication l:tesearch Centre, Ed-
tion, the Multi-Treebank may be used as a testbed and inburgh University.

[9] Eric Steven Atwell. 1993. Corpus-based statistical modelling of English grammar. In Clive Souter and Eric Atwell, editors, Corpus-Based Computational Linguistics, pages 195-214. Amsterdam: Rodopi.

[10] Eric Steven Atwell. 1993. Linguistic Constraints for Large-Vocabulary Speech Recognition. In Eric Steven Atwell (ed), Knowledge at Work in Universities: Proceedings of the second annual conference of the Higher Education Funding Councils' Knowledge Based Systems Initiative, pages 26-32. Leeds: Leeds University Press.

[11] Eric Steven Atwell, Simon Arnfield, George Demetriou, Stephen Hanlon, John Hughes, Uwe Jost, Rob Pocock, Clive Souter, and Joerg Ueberla. 1993. Multi-level disambiguation grammar inferred from English corpus, treebank and dictionary. In Proceedings of the IEE Two One-Day Colloquia on Grammatical Inference: Theory, Applications and Alternatives (Ref 1993/092). London: Institution of Electrical Engineers (IEE).

[12] Henk Barkema. 1994. The TOSCA Analysis Environment for ICE. Technical Report, Department of Language and Speech, Katholieke Universiteit Nijmegen, The Netherlands.

[13] Nancy Belmore. 1991. Tagging Brown with the LOB tagging suite. In Journal of the International Computer Archive of Modern English (ICAME Journal), No. 15, pages 63-86. Norwegian Computing Centre for the Humanities, Bergen University.

[14] Eric Brill. 1991. A Simple Rule-Based Part of Speech Tagger. Technical Report, Department of Computer Science, University of Pennsylvania.

[15] Eric Brill and Mitchell Marcus. 1992. Tagging an Unfamiliar Text with Minimal Human Supervision. In Robert Goldman, editor, Working notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language. AAAI Press.

[16] Eric Brill, David Magerman, Mitchell Marcus, and Beatrice Santorini. 1992. Deducing Linguistic Structure from the Statistics of Large Corpora. In Carl Weir and Ralph Grishman, editors, Proceedings of AAAI-92 Workshop Program: Statistically-Based NLP Techniques, San Jose, California.

[17] Gavin Burnage. 1990. CELEX - A Guide for Users. Nijmegen: Centre for Lexical Information (CELEX).

[18] Lou Burnard. 1991. What is the TEI? In D. Greenstein, editor, Modelling Historical Data. Goettingen: St. Katharinen.

[19] K. Church. 1992. Parts of Speech Tagging. Fifth Annual CUNY Conference on Human Sentence Processing.

[20] Aviv Cohen. 1994. Personal communication.

[21] George C. Demetriou and Eric Steven Atwell. 1994. Machine-Learnable, Non-Compositional Semantics for Domain Independent Speech or Text Recognition. To appear in Proceedings of 2nd Hellenic-European Conference on Mathematics and Informatics (HERMIS), Athens University of Economics and Business.

[22] Elizabeth Eyes and Geoffrey Leech. 1993. Progress in UCREL research: Improving corpus annotation practices. In Jan Aarts, Pieter de Haan, and Nelleke Oostdijk, editors, English Language Corpora: design, analysis and exploitation; Proceedings of the 13th ICAME conference, pages 123-144. Amsterdam: Rodopi.

[23] Robin Fawcett and Michael Perkins. 1980. Child Language Transcripts 6-12 (with a preface, in 4 volumes). Department of Behavioural and Communication Studies, Polytechnic of Wales.

[24] W.N. Francis and H. Kučera. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for use with Digital Computers (Corrected and Revised edition). Department of Linguistics, Brown University, Providence, Rhode Island.

[25] Roger Garside, Geoffrey Leech, and Geoffrey Sampson (editors). 1987. The Computational Analysis of English: A Corpus-Based Approach. Longman, London and New York.

[26] Roger Garside, Geoffrey Leech and Tamás Váradi. 1990. Manual of Information for the Lancaster Parsed Corpus. Technical Report, Department of Linguistics and Modern English, University of Lancaster, UK.

[27] Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in POP-11: An Introduction to Computational Linguistics. Addison Wesley.

[28] Sidney Greenbaum. 1993. The Tagset for the International Corpus of English. In Clive Souter and Eric Atwell (eds), Corpus-based Computational Linguistics. Amsterdam: Rodopi.

[29] Robin Haigh, Geoffrey Sampson and Eric Atwell. 1988. Project APRIL - a progress report on the Leeds annealing parser project. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL), pages 104-112. New Jersey: Association for Computational Linguistics (ACL).

[30] Robin Haigh. 1993. Personal communication.

[31] Hans van Halteren and Nelleke Oostdijk. 1993. Towards a syntactic database: the TOSCA analysis system. In Jan Aarts, Pieter de Haan, and Nelleke Oostdijk, editors, English Language Corpora: design, analysis and exploitation; Proceedings of the 13th ICAME conference, pages 145-162. Amsterdam: Rodopi.

[32] John Hughes. 1989. A Learning Interface to the Realistic Annealing Parser. Technical Report, School of Computer Studies, The University of Leeds.

[33] John Hughes and Eric Steven Atwell. 1993. Automatically acquiring and evaluating a classification of words. In Proceedings of the IEE Two One-Day Colloquia on Grammatical Inference: Theory, Applications and Alternatives (Ref 1993/092). London: Institution of Electrical Engineers (IEE).

[34] John Hughes. 1994. Automatically Acquiring a Classification of Words. PhD Thesis, School of Computer Studies, The University of Leeds.

[35] John Hughes and Eric Steven Atwell. 1994. A Methodical Approach to Word Class Formation Using Automatic Evaluation. In Lindsay Evett and Tony Rose, editors, Proceedings of AISB workshop on Computational Linguistics for Speech and Handwriting Recognition. Leeds University.

[36] John Hughes and Eric Steven Atwell. 1994. The Automated Evaluation of Inferred Word Classifications. In Tony Cohn (ed), Proceedings of the 11th European Conference on Artificial Intelligence, Amsterdam.

[37] F. Jelinek. 1990. Self-organised language modelling for speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition, pages 450-506. Morgan Kaufmann.

[38] Stig Johansson, Eric Atwell, Roger Garside and Geoffrey Leech. 1986. The Tagged LOB Corpus: Users' Manual. The Norwegian Centre for the Humanities, Bergen.

[39] Stig Johansson. 1994. Personal communication.

[40] Uwe Jost and Eric Steven Atwell. 1993. Deriving a probabilistic grammar of semantic markers from unrestricted English text. In Proceedings of the IEE Two One-Day Colloquia on Grammatical Inference: Theory, Applications and Alternatives (Ref 1993/092). London: Institution of Electrical Engineers (IEE).

[41] Judith Klavans. 1994. Personal communication.

[42] Geoffrey Leech, Roger Garside and Eric Atwell. 1983. The automatic grammatical tagging of the LOB Corpus. In Journal of the International Computer Archive of Modern English (ICAME Journal), No. 7, pages 13-33. Norwegian Computing Centre for the Humanities, Bergen University.

[43] Geoffrey Leech and Roger Garside. 1991. Running a grammar factory: The production of syntactically analysed corpora or "treebanks". In Stig Johansson and Anna-Brita Stenström, editors, English Computer Corpora: Selected Papers and Research Guide. Berlin: Mouton de Gruyter.

[44] Geoffrey Leech. 1993. 100 Million Words of English: The British National Corpus (BNC) Project. English Today.

[45] Mitch P. Marcus and Beatrice Santorini. 1992. Building Very Large Natural Language Corpora: The Penn Treebank. In N. Ostler, editor, Proceedings of the 1992 Pisa Symposium on European Textual Corpora.

[46] Nelleke Oostdijk. 1989. TOSCA Corpus Manual. University of Nijmegen.

[47] Nelleke Oostdijk. 1991. Corpus linguistics and the automatic analysis of English. Amsterdam: Rodopi.

[48] Marian Owen. 1987. Evaluating automatic grammatical tagging of text. In Newsletter of the International Computer Archive of Modern English (ICAME NEWS), No. 11, pages 18-26. Norwegian Computing Centre for the Humanities, Bergen University.

[49] Rob Pocock and Eric Atwell. 1993. Extracting statistical grammars from the Lancaster-IBM Spoken English Corpus Treebank. Technical Report 93.29, School of Computer Studies, Leeds University.

[50] Rob Pocock and Eric Atwell. 1993. Probabilistic grammatical models for treebank-trained lattice disambiguation. Technical Report 93.30, School of Computer Studies, Leeds University.

[51] Paul Procter. 1978. Longman Dictionary of Contemporary English. London: Longman.

[52] Geoffrey Sampson. 1994. Personal communication.

[53] Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn treebank project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania.

[54] John Sinclair. 1987. Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins, Glasgow.

[55] Clive Souter. 1989. A short handbook to the Polytechnic of Wales Corpus. Bergen: Norwegian Computing Centre for the Humanities, Bergen University.

[56] Clive Souter. 1990. Systemic functional grammars and corpora. In J. Aarts and W. Meijs, editors, Theory and Practice in Corpus Linguistics, pages 179-211. Amsterdam: Rodopi.

[57] Clive Souter and Eric Steven Atwell. 1992. A richly annotated corpus for probabilistic parsing. In Carl Weir and Ralph Grishman, editors, Proceedings of AAAI workshop on Statistically-Based NLP Techniques, San Jose, CA, pages 28-38.

[58] Clive Souter. 1993. Harmonising a lexical database with a corpus-based grammar. In Souter and Atwell, editors, Corpus-based Computational Linguistics, pages 181-193. Amsterdam: Rodopi.

[59] Clive Souter. 1993. Towards a standard format for parsed corpora. In Jan Aarts, Pieter de Haan, and Nelleke Oostdijk, editors, English Language Corpora: design, analysis and exploitation; Proceedings of the 13th ICAME conference, pages 197-214. Amsterdam: Rodopi.

[60] Clive Souter and Eric Steven Atwell. 1994. Using Parsed Corpora: A review of current practice. In Nelleke Oostdijk and Pieter de Haan (eds), Corpus-based Research Into Language, pages 143-158. Amsterdam: Rodopi.

[61] C. Sperberg-McQueen and L. Burnard. 1990. Guidelines for the encoding and interchange of machine-readable texts (TEI P1). Technical report, Universities of Chicago and Oxford.

[62] Jan Svartvik (ed). 1990. The London-Lund Corpus of Spoken English: Description and Research. Lund University Press, Lund, Sweden.

[63] L.J. Taylor and G. Knowles. 1988. Manual of Information to Accompany the SEC Corpus. Technical Report, Unit for Computer Research on the English Language, University of Lancaster, UK.

[64] Ni Yibin. 1993. The ICE Tagset - A Complete List of Tags used by the Tag-Selector for the Reference of Tag-Selectors and Researchers. Technical Report, Department of English, University College London, UK.
