
NPtool, a detector of English noun phrases 

Atro Voutilainen
Research Unit for Computational Linguistics
P.O. Box 4 (Keskuskatu 8)
FIN-00014 University of Helsinki
Finland
E-mail: avoutila@ling.Helsinki.FI

cmp-lg/9502010   13 Feb 1995

Abstract

NPtool is a fast and accurate system for extracting noun phrases from English texts for the purposes of e.g. information retrieval, translation unit discovery, and corpus studies. After a general introduction, the system architecture is presented in outline. Then follows an examination of a recently written Constraint Syntax. Section 6 reports on the performance of the system.

[This paper, published in the Proceedings of the Workshop on Very Large Corpora held June 22, 1993 at Ohio State University, is based on a longer manuscript with the same title. The development of ENGCG was supported by TEKES, the Finnish Technological Development Center, and a part of the work on finite-state syntax has been supported by the Academy of Finland.]

1 Introduction

This paper outlines NPtool, a noun phrase detector. At the heart of this modular system is a reductionistic word-oriented morphosyntactic analysis that expresses head–modifier dependencies. Previous work on this approach, largely based on Karlsson's original proposal [Karlsson, 1990], is documented in [Karlsson et al., forthcoming]. Let us summarise a few key features of this style of analysis.

• Like most parsing frameworks, the present style of analysis employs a lexicon and a grammar. What may distinguish the present approach from most other frameworks is the considerable degree of attention we pay to the morphological and lexical description: morphological analysis is based on an extensive and detailed description that employs inflectional and central derivational categories as well as other morphosyntactic features that can be useful for stating syntactic generalisations. In this way a carefully built and informative lexicon facilitates the construction of accurate, wide-coverage parsing grammars.

We use tags to encode morphological distinctions, part of speech, and also syntactic information; for instance:

    I    PRON @HEAD
    see  V PRES @VERB
    a    ART @>N
    bird N @HEAD
    .    FULLSTOP

In this type of analysis, each word is provided with tags indicating e.g. part of speech, inflection, derivation, and syntactic function.

• Morphological and syntactic descriptions are based on hand-coded linguistic rules rather than on corpus-based statistical models. They employ structural categories that can be found in descriptive grammars, e.g. [Quirk, Greenbaum, Leech and Svartvik, 1985].

Regarding the at times heated methodological debate on whether statistical or rule-based information is to be preferred in the grammatical analysis of running text (cf. e.g. [Sampson, 1987a; Taylor, Grover and Briscoe, 1989; Church, 1992]), we do not object to probabilistic methods in principle; nevertheless, it seems to us that rule-based descriptions are preferable because they can provide more accurate and reliable analyses than current probabilistic systems, e.g. part-of-speech taggers [Voutilainen, Heikkila and Anttila, 1992; Voutilainen, forthcoming a].[1] Probabilistic or heuristic techniques may still be a useful add-on to linguistic information, if potentially remaining ambiguities must be resolved – though with a higher risk of error.

[1] Consider for instance the question posed in [Church, 1992] whether lexical probabilities contribute more to morphological or part-of-speech disambiguation than context does. The ENGCG morphological disambiguator, which is entirely based on context rules, uniquely and correctly identifies more than 97% of all appropriate descriptions, and this is considerably more than the near-90% success rate achieved with lexical probabilities alone [Church, 1992]. Moreover, note that in all, the ENGCG disambiguator identifies more than 99.5% of all appropriate descriptions; only, some 2–3% of all analyses remain ambiguous and thus do not become uniquely identified. For more details, see [Voutilainen, forthcoming 1993].

• In the design of our grammar schemes, we have paid considerable attention to the question of the resolvability of grammatical distinctions. In the design of accurate parsers of running text, this question is very important: if the description abounds with distinctions that can be dependably resolved only with extrasyntactic knowledge[2], then either the ambiguities due to these distinctions remain to burden the structure-based parser (as well as the potential application based on the analysis), or a guess, i.e. a misprediction, has to be hazarded.

[2] Witness, for instance, ambiguities due to adverbial attachment or modifier scope.

This descriptive policy brings with it a certain degree of shallowness; in terms of information content, a tag-based syntactic analysis is somewhere between morphological (e.g. part-of-speech) analysis and a conventional syntactic analysis, e.g. a phrase structure tree or a feature-based analysis. What we hope to achieve with this compromise in information content is the higher reliability of the proposed analyses. A superior accuracy could be considered as an argument for postulating a new, `intermediary' level of computational syntactic description. For more details, see e.g. [Voutilainen and Tapanainen, 1993].

• Our grammar schemes are also learnable: according to double-blind experiments on manually assigning morphological descriptions, a 100% interjudge agreement is typical [Voutilainen, forthcoming a].[3]

[3] The 95% interjudge agreement rate reported in [Church, 1992] probably indicates that in the case of debatable constructions, explicit descriptive conventions have not been consistently established. Only a carefully defined grammar scheme makes the evaluation of the accuracy of the parsing system a meaningful enterprise (see also [Sampson, 1987b]).

• The ability to parse running text is of a high priority. Not only a structurally motivated description is important; in the construction of the parsing grammars and lexica, attention should also be paid to corpus evidence. Often a grammar rule, as we express it in our parsing grammars, is formed as a generalisation `inspired' by corpus observations; in this sense the parsing grammar is corpus-based. However, the description need not be restricted to the corpus observation: the linguist is likely to generalise over past experience, and this is not necessarily harmful – as long as the generalisations can also be validated against representative test corpora.

• At least in a practical application, a parsing grammar should assign the best available analysis to its input rather than leave many of the input utterances unrecognised e.g. as ill-formed. This does not mean that the concept of well-formedness is irrelevant for the present approach. Our point is simply: although we may consider some text utterance as deviant in one respect or another, we may still be interested in extracting as much information as possible from it, rather than ignoring it altogether. To achieve this effect, the grammar rules should be used in such a manner[4] that no input becomes entirely rejected, although the rules as such may express categorical restrictions on what is possible or well-formed in the language.

[4] E.g. by ranking the grammar rules in terms of compromisability.

• In our approach, parsing consists of two main kinds of operation:

1. Context-insensitive lookup of (alternative) descriptions for input words

2. Elimination of unacceptable or contextually illegitimate alternatives.

Morphological analysis typically corresponds to the lookup module: it produces the desired morphosyntactic analysis of the sentence, along with a number of inappropriate ones, by providing each word in the sentence with all conventional analyses as a list of alternatives. The grammar itself exerts the restrictions on permitted sequences of words and descriptors. In other words, syntactic analysis proceeds by way of ambiguity resolution or disambiguation: the parser eliminates ill-formed readings, and what `survives' the grammar is the (syntactic) analysis of the input utterance. Since the input contains the desired analysis, no new structure will be built during syntactic analysis itself.

• Our grammars consist of constraints – partial distributional definitions of morphosyntactic categories, such as parts of speech or syntactic functions. Each constraint expresses a piecemeal linear-precedence generalisation about the language, and the constraints are independent of each other. That is, they can be applied in any order: a true grammar will produce the same analysis, whatever the order.

The grammarian is relatively free to select the level of abstraction at which (s)he is willing to express the distributional generalisation. In particular, reference to very low-level categories is also possible, and this makes for the accuracy of the parser.
While the grammar will contain more or less abstract, feature-oriented rules, it is often also expedient to state further, more particular restrictions on more particular distributional classes, even at the word-form level. These `smaller' rules do not contradict the more general rules; often it is simply the case that further restrictions can be imposed on smaller lexical classes[5]. This flexibility in the grammar formalism greatly contributes to the accuracy of the parser [Voutilainen, forthcoming a; Voutilainen, forthcoming 1993].

[5] Consider for instance the attachment of prepositional phrases in general and of of-phrases in particular.

2 Uses of a noun phrase parser

The recognition and analysis of subclausal structural units, e.g. noun phrases, is useful for several purposes. Firstly, a noun phrase detector can be useful for research purposes: automatic large-scale analysis of running text provides the linguist with better means to conduct e.g. quantitative studies over large amounts of text.

An accurate though somewhat superficial analysis can also serve as a `preprocessor' prior to more ambitious, e.g. feature-based, syntactic analysis. This kind of division of labour is likely to be useful for technical reasons. One major problem with e.g. unification-based parsers is parsing time. Now if a substantial part of the overall problem is resolved with more simple and efficient techniques, the task of the unification-based parser will become more manageable. In other words, the more expressive but computationally heavier machinery of e.g. the unification-based parser can be reserved entirely for the analysis of the descriptively hardest problems. The less complex parts of the overall problem can be tackled with more simple and efficient techniques.

Regarding production uses, even lower levels of analysis can be directly useful. For instance, the detection of noun phrases can provide e.g. information management and retrieval systems with a suitable input for index term generation.

Noun phrases can also serve as translation units; for instance, [van der Eijk, 1993] suggests that noun phrases are more appropriate translation units than words or part-of-speech classes.

3 Previous work

This section consists of two subsections. Firstly, a performance-oriented survey of some related systems is presented. Then follows a more detailed presentation of ENGCG, a predecessor of the NPtool parser in an information retrieval system.

3.1 Related systems

So far, I have found relatively little documentation on systems whose success in recognising or parsing noun phrases has been reported. I am aware of three systems with some relevant evaluations.

Church's Parts of speech [Church, 1988] performs not only part-of-speech analysis, but it also identifies the most simple kinds of noun phrases – mostly sequences of determiners, premodifiers and nominal heads – by inserting brackets around them, e.g.

    [A/AT former/AP top/NN aide/NN] to/IN
    [Attorney/NP General/NP Edwin/NP
    Meese/NP] interceded/VBD ...

The appendix in [Church, 1988] lists the analysis of a small text. The performance of the system on the text is quite interesting: of 243 noun phrase brackets, only five are omitted. The performance of Parts of speech was also very good in part-of-speech analysis on the text: 99.5% of all words got the appropriate tag. The mechanism for noun phrase identification relies on the part-of-speech analysis; the part-of-speech tagger was more successful on the text than on average; therefore the average performance of the system in noun phrase identification may not be quite as good as the figures in the appendix of the paper suggest.

Bourigault's LECTER [Bourigault, 1992] is a surface-syntactic analyser that extracts `maximal-length noun phrases' – mainly sequences of determiners, premodifiers, nominal heads, and certain kinds of postmodifying prepositional phrases and adjectives – from French texts for terminology applications. The system is reported to recognise 95% of all maximal-length noun phrases (43,500 out of 46,000 noun phrases in the test corpus), but no figures are given on how much `garbage' the system suggests as noun phrases. It is indicated, however, that manual validation is necessary.

Rausch, Norrback and Svensson [1992] have designed a noun phrase extractor that takes as its input part-of-speech analysed Swedish text, and inserts brackets around noun phrases. In the recognition of `Nuclear Noun Phrases' – sequences of determiners, premodifiers and nominal heads – the system was able to identify 85.9% of all nuclear noun phrases in a text collection, some 6,000 words long in all, whereas some 15.7% of all the suggested noun phrases were false hits, i.e. the precision[6] of the system was 84.3%. The performance of a real application would probably be lower, because potential misanalyses due to previous stages of analysis (morphological analysis and part-of-speech disambiguation, for instance) are not accounted for by these figures.

[6] For definitions of the terms recall and precision, see Section 6.

3.2 ENGCG and the SIMPR project

SIMPR, Structured Information Management: Processing and Retrieval, was a 64 person-year ESPRIT II project (Nr. 2083, 1989–1992), whose objective was to develop new methods for the management and retrieval of large amounts of electronic texts. A central function of such a system is to recognise those words in the stored texts that represent the text in a concise fashion – in short, index terms.
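The recall and precision figures quoted above reduce to simple ratios. A minimal sketch, plugging in the numbers reported for LECTER and for the Rausch, Norrback and Svensson extractor (the function names are mine, not the papers'):

```python
def recall(true_found: int, true_total: int) -> float:
    """Share of the noun phrases actually in the text that were found."""
    return true_found / true_total

def precision(correct: int, suggested: int) -> float:
    """Share of the suggested noun phrases that are correct."""
    return correct / suggested

# LECTER: 43,500 of the 46,000 maximal-length noun phrases recognised,
# i.e. roughly the 95% recall reported above.
lecter_recall = recall(43_500, 46_000)

# Rausch et al.: 15.7% of the suggested noun phrases were false hits,
# so precision is the complement, 84.3%.
rausch_precision = 1 - 0.157
```

Note that the two measures trade off against each other: a system that suggests everything has perfect recall and poor precision, which is why the missing `garbage' figure for LECTER matters.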
Term indices created with traditional methods[7] are based on isolated, perhaps truncated, words. These largely string-based statistical methods are somewhat unsatisfactory because many content identifiers consist of word sequences – compounds, head–modifier constructions, even simple verb–noun phrase sequences. One of the SIMPR objectives was also to employ more complex constructions, the recognition of which would require a shallow grammatical analysis. The Research Unit for Computational Linguistics at the University of Helsinki participated in this project, and ENGTWOL, a Twol-styled morphological analyser, as well as ENGCG, a Constraint Grammar of English, were written in 1989–1992 by Voutilainen, Heikkila and Anttila [forthcoming]. The resultant SIMPR system is an improvement over previous systems [Smart (Ed.), forthcoming] – it is not only reasonably accurate, but it also operates on more complex constructions, e.g. postmodifying constructions and simple verb–object constructions.

[7] See e.g. [Salton and McGill, 1983].

There were also some persistent problems. The original plan was to use the output of the whole ENGCG parser for the indexing module. However, the last of the three sequential modules in the ENGCG grammar, namely Constraint Syntax proper, was not used in the more mature versions of the indexing module – only lexical analysis and morphological disambiguation were applied. The omission of the syntactic analysis was mainly due to the somewhat high error rate (3–4% of all words lost the proper syntactic tag) and the high rate of remaining ambiguities (15–25% of all words remained syntactically ambiguous).

Here we will not go into a detailed analysis of the problems[8]; suffice it to say that the syntactic grammar scheme was unnecessarily ambitious for the relatively simple needs of the indexing application. One of the improvements in NPtool is a more optimal syntactic grammar scheme, as will be seen in Section 5.1.

[8] See e.g. [Voutilainen, Heikkila and Anttila, 1992] for details.

4 NPtool in outline

In this section, the architecture of NPtool is presented in outline. Here is a flow chart of the system:

            Preprocessing
                  |
         Morphological analysis
                  |
       Constraint Grammar parsing
          /                 \
    NP-hostile           NP-friendly
    finite-state         finite-state
    parsing              parsing
          |                 |
    NP extraction        NP extraction
          \                 /
      Intersection of noun phrase sets

In the rest of this section, we will observe the analysis of the following sample sentence, taken from a car maintenance manual:

    The inlet and exhaust manifolds are mounted
    on opposite sides of the cylinder head, the
    exhaust manifold channelling the gases to a
    single exhaust pipe and silencer system.

4.1 Preprocessing and morphological analysis

The input ASCII text, preferably SGML-annotated, is subjected to a preprocessor that e.g. determines sentence boundaries, recognises fixed syntagms[9], normalises certain typographical conventions, and verticalises the text.

[9] E.g. multiword prepositions and compounds.

This preprocessed text is then submitted to morphological analysis. ENGTWOL, a morphological analyser of English, is a Koskenniemi-style morphological description that recognises all inflections and central derivative forms of English. The present lexicon contains some 56,000 word stems, and altogether the analyser recognises several hundred thousand different word-forms. The analyser also employs a detailed parsing-oriented morphosyntactic description; the feature system is largely derived from [Quirk, Greenbaum, Leech and Svartvik, 1985]. Here is a small sample:

    ("<*the>"
            ("the" DET CENTRAL ART SG/PL (@>N)))
    ("<inlet>"
            ("inlet" N NOM SG))
    ("<and>"
            ("and" CC (@CC)))
    ("<exhaust>"
            ("exhaust" <SVO> V SUBJUNCTIVE VFIN (@V))
            ("exhaust" <SVO> V IMP VFIN (@V))
            ("exhaust" <SVO> V INF)
            ("exhaust" <SVO> V PRES -SG3 VFIN (@V))
            ("exhaust" N NOM SG))
    ("<manifolds>"
            ("manifold" N NOM PL))

All running-text word-forms are given on the left-hand margin, while all analyses are on indented lines of their own. The multiplicity of these lines for a word-form indicates morphological ambiguity.

For words not represented in the ENGTWOL lexicon, there is a 99.5% reliable utility that assigns ENGTWOL-style descriptions. These predictions are based on the form of the word, but also some heuristics are involved.

4.2 Constraint Grammar parsing

The next main stage in NPtool analysis is Constraint Grammar parsing.
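Constraint Grammar parsing operates on the verticalised cohort representation produced by ENGTWOL, as shown above. A rough reader for that format is sketched below; it is written for this presentation and is not part of NPtool. It groups the lines into cohorts of (word-form, readings) pairs, where each reading keeps its base form and its tag string.

```python
def tags_of(reading_line):
    """Everything after the quoted base form, with the trailing
    parentheses that merely close the reading/cohort stripped off."""
    body = reading_line.split('"')[2].strip()
    while body.count(")") > body.count("("):
        body = body[:-1]
    return body

def read_cohorts(lines):
    """Group verticalised output into (word-form, [(base form, tags)])."""
    cohorts = []
    for raw in lines:
        line = raw.strip()
        if line.startswith('("<'):                  # word-form line
            form = line[3:line.index(">")].lstrip("*")
            cohorts.append((form, []))
        elif line.startswith('("') and cohorts:     # one reading line
            base = line.split('"')[1]
            cohorts[-1][1].append((base, tags_of(line)))
    return cohorts

sample = [
    '("<*the>"',
    '        ("the" DET CENTRAL ART SG/PL (@>N)))',
    '("<manifolds>"',
    '        ("manifold" N NOM PL))',
]
cohorts = read_cohorts(sample)
# -> [("the", [("the", "DET CENTRAL ART SG/PL (@>N)")]),
#     ("manifolds", [("manifold", "N NOM PL")])]
```

A cohort with more than one reading is morphologically ambiguous; the disambiguation stage described next works by deleting readings from exactly this kind of structure.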
Parsing consists of two main phases: morphological disambiguation and Constraint syntax.

• Morphological disambiguation. The task of the morphological disambiguator is to discard all contextually illegitimate morphological readings in ambiguous cohorts. For instance, consider the following fragment:

    ("<a>"
            ("a" <Indef> DET CENTRAL ART SG (@>N)))
    ("<single>"
            ("single" <SVO> V IMP VFIN (@V))
            ("single" <SVO> V INF)
            ("single" A ABS))

Here an unambiguous determiner is directly followed by a three-ways ambiguous word, two of the analyses being verb readings, and one, an adjective reading. A determiner is never followed by a verb[10]; one of the 1,100-odd constraints in the disambiguation grammar [Voutilainen, forthcoming a] expresses this fact about English grammar, so the verb readings of single are discarded here.

[10] Save for no, which can be followed by an -ing-form; cf. no in There is no going home.

The morphological disambiguator seldom discards an appropriate morphological reading: after morphological disambiguation, 99.7–100% of all words retain the appropriate analysis. On the other hand, some 3–6% of all words remain ambiguous, e.g. head in this sentence. There is also an additional set of some 200 constraints – after the application of both constraint sets, 97–98% of all words become fully disambiguated, with an overall error rate of up to 0.4% [Voutilainen, forthcoming b]. The present disambiguator compares quite favourably with other known, typically probabilistic, disambiguators, whose maximum error rate is as high as 5%, i.e. some 17 times as high as that of the ENGCG disambiguator.

• Constraint syntax. After morphological disambiguation, the syntactic constraints are applied. In the NPtool syntactic description, all syntactic ambiguities are introduced directly in the lexicon, so no extra lookup module is needed. Like disambiguation constraints, syntactic constraints seek to discard all contextually illegitimate syntactic function tags. Here is the syntactic analysis of our sample sentence, as produced by the current parser. To save space, most of the morphological codes are omitted.

    ("<*the>"
            ("the" DET (@>N)))
    ("<inlet>"
            ("inlet" N (@>N @NH)))
    ("<and>"
            ("and" CC (@CC)))
    ("<exhaust>"
            ("exhaust" N (@>N)))
    ("<manifolds>"
            ("manifold" N (@NH)))
    ("<are>"
            ("be" V (@V)))
    ("<mounted>"
            ("mount" PCP2 (@V)))
    ("<on>"
            ("on" PREP (@AH)))
    ("<opposite>"
            ("opposite" A (@>N)))
    ("<sides>"
            ("side" N (@NH)))
    ("<of>"
            ("of" PREP (@N<)))
    ("<the>"
            ("the" DET (@>N)))
    ("<cylinder>"
            ("cylinder" N (@>N @NH)))
    ("<head>"
            ("head" V (@V))
            ("head" N (@NH)))
    ("<$,>")
    ("<the>"
            ("the" DET (@>N)))
    ("<exhaust>"
            ("exhaust" N (@>N)))
    ("<manifold>"
            ("manifold" N (@NH)))
    ("<channelling>"
            ("channel" PCP1 (@V)))
    ("<the>"
            ("the" DET (@>N)))
    ("<gases>"
            ("gas" N (@NH)))
    ("<to>"
            ("to" PREP (@AH)))
    ("<a>"
            ("a" DET (@>N)))
    ("<single>"
            ("single" A (@>N)))
    ("<exhaust>"
            ("exhaust" N (@>N)))
    ("<pipe>"
            ("pipe" N (@NH)))
    ("<and>"
            ("and" CC (@CC)))
    ("<silencer>"
            ("silencer" N (@>N)))
    ("<system>"
            ("system" N (@NH)))
    ("<$.>")

All syntactic-function tags are flanked with `@'. For instance, the tag `@>N' indicates that the word is a determiner or a premodifier of a nominal in the right-hand context (e.g. the). The second word, inlet, remains syntactically ambiguous due to a premodifier reading and a nominal head @NH reading – note that the ambiguity is structurally genuine, a coordination ambiguity. The tag @V is reserved for verbs and auxiliaries, cf. are as well as mounted.
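The "a determiner is never followed by a verb" constraint used above can be pictured as a small elimination rule over cohorts. In the sketch below, the data mirrors the a/single fragment, with readings abbreviated to (part of speech, tags) pairs; the actual Constraint Grammar rule is of course stated in the CG formalism, not in Python, and it excepts the determiner no, as the footnote observes.

```python
# Cohorts for the fragment "a single"; readings abbreviated from the sample.
cohorts = [
    ("a",      [("DET", "CENTRAL ART SG")]),
    ("single", [("V", "IMP VFIN"), ("V", "INF"), ("A", "ABS")]),
]

def det_never_followed_by_verb(cohorts):
    """Discard verb readings of a word directly preceded by an
    unambiguous determiner (excepting "no", per the footnote)."""
    out = [cohorts[0]]
    for prev, (word, readings) in zip(cohorts, cohorts[1:]):
        if all(pos == "DET" for pos, _ in prev[1]) and prev[0] != "no":
            # "or readings": never discard a word's last remaining reading
            readings = [r for r in readings if r[0] != "V"] or readings
        out.append((word, readings))
    return out

result = det_never_followed_by_verb(cohorts)
# only the adjective reading of "single" survives
```

The same shape of rule, a negative linear-precedence statement that deletes readings, underlies both the 1,100-odd disambiguation constraints and the syntactic constraints described next.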
The syntactic description will be outlined below.

Pasi Tapanainen[11] has recently made a new implementation of the Constraint Grammar parser that performs morphological disambiguation and syntactic analysis at a speed of more than 1,000 words per second on a Sun SparcStation 10, Model 30.

[11] Research Unit for Computational Linguistics, University of Helsinki.

4.3 Treatment of remaining ambiguities

The Constraint Grammar parser recognises only word-level ambiguities; therefore some of the traversals through an ambiguous sentence representation may be blatantly ill-formed.

NPtool eliminates locally unacceptable analyses by using a finite-state parser [Tapanainen, 1991][12] as a kind of `post-processing module' that distinguishes between competing sentence readings. The parser employs a small finite-state grammar that I have written. The speed of the finite-state parser is comparable to that of the Constraint Grammar parser.

[12] For other work in this approach, see also [Koskenniemi, 1990; Koskenniemi, Tapanainen and Voutilainen, 1992; Voutilainen and Tapanainen, 1993].

The finite-state parser produces all sentence readings that are in agreement with the grammar. Consider the following two adapted readings from the beginning of our sample sentence:

    the/@>N inlet/@>N and/@CC exhaust/@>N
    manifolds/@NH are/@V mounted/@V
    on/@AH opposite/@>N sides/@NH
    of/@N< the/@>N cylinder/@NH head/@V

    the/@>N inlet/@>N and/@CC exhaust/@>N
    manifolds/@NH are/@V mounted/@V
    on/@AH opposite/@>N sides/@NH
    of/@N< the/@>N cylinder/@>N head/@NH

The only difference is in the analysis of cylinder head: the first analysis reports cylinder as a noun phrase head which is followed by the verb head, while the second analysis considers cylinder head as a noun phrase. Now the last remaining problem is how to deal with ambiguous analyses like these: should cylinder be reported as a noun phrase, or is cylinder head the unit to be extracted?

The present system provides all proposed noun phrase candidates in the output, but each with an indication of whether the candidate noun phrase is unambiguously analysed as such, or not. In this solution, I do not use all of the multiple analyses proposed by the finite-state parser. For each sentence, no more than two competing analyses are selected for further processing: one with the highest number of words as part of a maximally long noun phrase analysis, and the other with the lowest number of words as part of a maximally short noun phrase analysis.

This `weighing' can be done during finite-state parsing: the formalism employs a mechanism for imposing penalties on regular expressions, e.g. on tags. A penalised reading is not discarded as ungrammatical; rather, the parser returns all accepted analyses in an order where the least penalised analyses are produced first and the `worst' ones last.

Thus there is an `NP-hostile' finite-state parser that penalises noun phrase readings; this would prefer the sentence reading with cylinder/@NH head/@V. The `NP-friendly' parser, on the other hand, penalises all readings which are not part of a noun phrase reading, so it would prefer the analysis with cylinder/@>N head/@NH. Of all analyses, the selected two parses are maximally dissimilar with regard to NP-hood. The motivation for selecting maximally conflicting analyses in this respect is that a candidate noun phrase that is agreed upon as a noun phrase by the two finite-state parsers just as it is – neither longer nor shorter – is likely to be an unambiguously identified noun phrase. The comparison of the outputs of the two competing finite-state parsers is carried out during the extraction phase.

4.4 Extraction of noun phrases

An unambiguous sentence reading is a linear sequence of symbols, and extracting noun phrases from this kind of data is a simple pattern matching task. In the present version of the system, I have used the gawk program, which allows the use of regular expressions. With gawk's gsub function, the boundaries of the longest non-overlapping expressions that satisfy the search key can be marked. If we formulate our search query as something like the following schematic regular expression:

    [[M>N+ [CC M>N+]*]* HEAD
     [N< [D/M>N+ [CC D/M>N+]*]* HEAD]*]

where

    `[' and `]'  are for grouping,
    `+'          stands for one or more occurrences of its argument,
    `*'          stands for zero or more occurrences of its argument,
    `M>N'        stands for premodifiers,
    `D/M>N'      stands for determiners and premodifiers,
    `HEAD'       stands for nominal heads except pronouns,
    `N<'         stands for prepositions starting a postmodifying
                 prepositional phrase,

and do some additional formatting and `cleaning', the above two finite-state analyses will look like the following[13]:

[13] Note that the noun phrase heads are here given in the base form, hence the absence of the plural form of e.g. `manifold'.

    the
    np: inlet and exhaust manifold
    are mounted on
    np: opposite side of the cylinder
    head, the
    np: exhaust manifold
    channelling the
    np: gas
    to a
    np: single exhaust pipe
    and
    np: silencer system
    .

    the
    np: inlet and exhaust manifold
    are mounted on
    np: opposite side of the cylinder head
    , the
    np: exhaust manifold
    channelling the
    np: gas
    to a
    np: single exhaust pipe
    and
    np: silencer system
    .

The proposed noun phrases are given on indented lines, each marked with the symbol `np:'. The candidate noun phrases are then subjected to further routines: all candidate noun phrases with at least one occurrence in the output of both the NP-hostile and NP-friendly parsers are labelled with the symbol `ok:', and the remaining candidates are labelled as uncertain, with the symbol `?:'. From the outputs given above, the following list can be produced:

    ok: inlet and exhaust manifold
    ok: exhaust manifold
    ok: gas
    ok: single exhaust pipe
    ok: silencer system
    ?: opposite side of the cylinder
    ?: opposite side of the cylinder head

The linguistic analysis is relatively neutral as to what is to be extracted from it. Here we have concentrated on noun phrase extraction, but from this kind of input many other types of construction could also be extracted, e.g. simple verb–argument structures.

5 The syntactic description

This section outlines the syntactic description that I have written for NPtool purposes. The ENGTWOL lexicon and the disambiguation constraints will not be described further in this paper; they have been documented extensively elsewhere (see the relevant articles in Karlsson & al. [forthcoming]).

According to the SIMPR experiences, the vast majority of index terms represent relatively few constructions. By far the most common construction is a nominal head with optional, potentially coordinated, premodifiers and postmodifying prepositional phrases, typically of-phrases. The remainder, less than 10%, consists almost entirely of relatively simple verb–NP patterns.

The syntactic description used in SIMPR employed some 30 dependency-oriented syntactic function tags, which differentiate (to some extent) between various kinds of verbal constructions, syntactic functions of nominal heads, and so on. Some of the ambiguity that survives ENGCG parsing is in part due to these distinctions [Anttila, forthcoming].

The relatively simple needs of an index term extraction utility on the one hand, and the relative abundance of distinctions in the ENGCG syntactic description on the other, suggest that a less distinctive syntactic description might be more optimal for the present purposes: a more shallow description would entail less remaining ambiguity without unduly compromising its usefulness e.g. for an indexing application.

5.1 Syntactic tags

I have designed a new syntactic grammar scheme that employs seven function tags. These tags capitalise on the opposition between noun phrases and other constructions on the one hand, and between heads and modifiers on the other. Here we will not go into details; a gloss with a simple illustration will suffice.

• @V represents auxiliary and main verbs as well as the infinitive marker to in both finite and non-finite constructions. For instance:

    She should/@V know/@V what to/@V do/@V.

• @NH represents nominal heads, especially nouns, pronouns, numerals, abbreviations and -ing-forms. Note that of adjectival categories, only those with the morphological feature <Nominal>, e.g. English, are granted the @NH status: all other adjectives (and -ed-forms) are regarded as too unconventional nominal heads to be granted this status in the present description. An example:

    The English/@NH may like the conventional.

• @>N represents determiners and premodifiers of nominals (the angle bracket `>' indicates the direction in which the head is to be found). The head is the following nominal with the tag @NH, or a premodifier in between. An example:

    the/@>N fat/@>N butcher's/@>N wife

• @N< represents prepositional phrases that unambiguously postmodify a preceding nominal head.
The man in/@N< the moon had
a glass of/@N< ale.

Currently the description does not account for other types of postmodifier, e.g. postmodifying adjectives, numerals, other nominals, or clausal constructions.

• @CC and @CS represent co-ordinating and subordinating conjunctions, respectively:

Either/@CC you or/@CC I will go
if/@CS necessary.

• @AH represents the `residual': adjectival heads, adverbials of various kinds, adverbs (also intensifiers), and also those prepositional phrases that cannot be dependably analysed as a postmodifier. An example is in order:

There/@AH have always/@AH been very/@AH
many people in/@AH this area.

5.2 Syntactic constraints

The syntactic grammar contains some 120 syntactic constraints. Like the morphological disambiguation constraints, these constraints are essentially negative partial linear-precedence definitions of the syntactic categories.
The present grammar is a partial expression of four general grammar statements:

1. Part of speech determines the order of determiners and modifiers.
2. Only likes coordinate.
3. A determiner or a modifier has a head.
4. An auxiliary is followed by a main verb.

We will give only one illustration of how these general statements can be expressed in Constraint Grammar. Let us give a partial paraphrase of the statement Part of speech determines the order of determiners and modifiers: `A premodifying noun occurs closest to its head'. In other words, premodifiers from other parts of speech do not immediately follow a premodifying noun. Therefore, a noun in the nominative immediately followed by an adjective is not a premodifier. Thus a constraint in the grammar would discard the @>N tag of Harry in the following sample sentence, where Harry is directly followed by an unambiguous adjective:

("<*is>"
   ("be" <SVC/N> <SVC/A> V PRES SG3 (@V)))
("<*harry>"
   ("harry" <Proper> N NOM SG (@NH @>N)))
("<foolish>"
   ("foolish" A ABS (@AH)))
("<$?>")

We require that the noun in question is in the nominative because premodifying nouns in the genitive can also occur before adjectival premodifiers; witness Harry's in Harry's foolish self.

5.3 Evaluation

The present syntax has been applied to large amounts of journalistic and technical text (newspapers, abstracts on electrical engineering, manuals on car maintenance, etc.), and the analysis of some 20,000-30,000 words has been proofread to get an estimate of the accuracy of the parser.
After the application of the NPtool syntax, some 93-96% of all words become syntactically unambiguous, with an error rate of less than 1%.[14]
To find out how much ambiguity remains at the sentence level, I also applied an `NP-neutral' version[15] of the finite-state parser to a 25,500-word text from The Grolier Electronic Encyclopaedia. The results are given in Figure 1.

Figure 1: Ambiguity rates after finite-state parsing in a text of 1,495 sentences (25,500 words). R indicates the number of analyses per sentence, and F indicates the frequency of these sentences.

 R    F    R    F    R    F    R    F
 1  960    6   19   12    6   32    2
 2  304    7    3   14    3   48    2
 3   54    8   28   16    5   64    1
 4   93    9    3   24    1   72    1
 5    4   10    3   28    1

Some 64% (960) of the 1,495 sentences became syntactically unambiguous, while only some 2% of all sentences' analyses contain more than ten readings, the worst ambiguity being due to 72 analyses. This compares favourably with the ENGCG performance: after ENGCG parsing, 23.5% of all sentences remained ambiguous due to a number of sentence readings greater than the worst case in NPtool syntax.

6 Performance of NPtool

Various kinds of metrics can be proposed for the evaluation of a noun phrase extractor; our main metrics are recall and precision, defined as follows:[16]

• Recall: the ratio `retrieved intended NPs'[17] / `all intended NPs'
• Precision: the ratio `retrieved intended NPs' / `all retrieved NPs'

[14] This figure also covers errors due to previous stages of analysis.
[15] i.e. a parser which does not contain the mechanism for penalising or favouring noun phrase analyses; see Section 4.3 above.
[16] This definition also agrees with that used in Rausch & al. [1992].
[17] An `intended NP' is the longest non-overlapping match of the search query given in the extraction phase.
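Given a gold list of intended NPs and the extractor's output, the two metrics defined above can be computed directly. The following sketch is not part of NPtool (the function and data names are illustrative); it treats both lists as multisets of phrase strings:

```python
from collections import Counter

def recall_precision(intended, retrieved):
    """Compute recall and precision over multisets of noun phrases.

    recall    = retrieved intended NPs / all intended NPs
    precision = retrieved intended NPs / all retrieved NPs
    """
    intended_counts = Counter(intended)
    retrieved_counts = Counter(retrieved)
    # NPs that are both intended and retrieved (multiset intersection)
    hits = sum((intended_counts & retrieved_counts).values())
    recall = hits / len(intended)
    precision = hits / len(retrieved)
    return recall, precision

intended = ["binary number system", "decimal representation", "arithmetic speed"]
retrieved = ["binary number system", "decimal representation", "output data"]
r, p = recall_precision(intended, retrieved)
# Two of the three intended NPs were retrieved (recall 2/3), and two of
# the three retrieved NPs were intended (precision 2/3).
```

With a perfect extractor both ratios are 1.0; each missed NP lowers recall, and each superfluous candidate lowers precision.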
To paraphrase, a recall of less than 100% indicates that the system missed some of the desired noun phrases, while a precision of less than 100% indicates that the system retrieved something that is not regarded as a correct result.
The performance of the whole system has been evaluated against several texts from different domains. In all, the analysis of some 20,000 words has been manually checked.
If we wish to extract relatively complex noun phrases with optional coordination, premodifiers and postmodifiers (see the search query above in Section 4.4), we reach a recall of 98.5-100%, with a precision of some 95-98%.
As indicated in Section 4.4, the extraction utility annotates each proposed noun phrase as a `sure hit' (`ok:') or as an `uncertain hit' (`?:'). This distinction is quite useful for manual validation: approximately 95% of all superfluous noun phrase candidates are marked with the question mark.

7 Conclusion

In terms of accuracy, NPtool is probably one of the best in the field. In terms of speed, much remains to be optimised. Certainly the computationally most demanding tasks - disambiguation and parsing - are already carried out quite efficiently, but the more trivial parts of the system could be improved.

8 Acknowledgements

I wish to thank Krister Lindén, Pasi Tapanainen and two anonymous referees for useful comments on an earlier version of this paper. The usual disclaimers hold.

References

[Anttila, forthcoming] Anttila, A. (forthcoming). How to recognise subjects in English. In Karlsson & al.
[Bourigault, 1992] Bourigault, D. 1992. Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of the fifteenth International Conference on Computational Linguistics. COLING-92, Vol. III. Nantes, France. 977-981.
[Church, 1988] Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Proceedings of the Second Conference on Applied Natural Language Processing, ACL. 136-143.
[Church, 1992] Church, K. 1992. Current Practice in Part of Speech Tagging and Suggestions for the Future, in Simmons (ed.), Sbornik praci: In Honor of Henry Kučera. Michigan Slavic Studies.
[van der Eijk, 1993] van der Eijk, P. 1993. Automating the acquisition of bilingual terminology. Proceedings of EACL'93. Utrecht, The Netherlands.
[Heikkilä, forthcoming a] Heikkilä, J. (forthcoming a). A TWOL-Based Lexicon and Feature System for English. In Karlsson & al.
[Heikkilä, forthcoming b] Heikkilä, J. (forthcoming b). ENGTWOL English lexicon: solutions and problems. In Karlsson & al.
[Karlsson, 1990] Karlsson, F. 1990. Constraint Grammar as a Framework for Parsing Running Text. In H. Karlgren (ed.), Papers presented to the 13th International Conference on Computational Linguistics, Vol. 3. Helsinki. 168-173.
[Karlsson, forthcoming] Karlsson, F. (forthcoming). Designing a parser for unrestricted text. In Karlsson & al.
[Karlsson et al., forthcoming] Karlsson, F., Voutilainen, A., Heikkilä, J. and Anttila, A. Constraint Grammar: a Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
[Koskenniemi, 1990] Koskenniemi, K. (1990). Finite-state parsing and disambiguation. In Karlgren, H. (ed.) COLING-90. Papers presented to the 13th International Conference on Computational Linguistics, Vol. 2. Helsinki, Finland. 229-232.
[Koskenniemi, Tapanainen and Voutilainen, 1992] Koskenniemi, K., Tapanainen, P. and Voutilainen, A. (1992). Compiling and using finite-state syntactic rules. In Proceedings of the fifteenth International Conference on Computational Linguistics. COLING-92, Vol. I. Nantes, France. 156-162.
[Quirk, Greenbaum, Leech and Svartvik, 1985] Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London & New York: Longman.
[Rausch, Norrback and Svensson, 1992] Rausch, B., Norrback, R., and Svensson, T. 1992. Excerpering av nominalfraser ur löpande text. Manuscript. Stockholms universitet, Institutionen för Lingvistik.
[Salton and McGill, 1983] Salton, G. and McGill, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, Auckland.
[Sampson, 1987a] Sampson, G. 1987. Probabilistic Models of Analysis. In Garside, Leech and Sampson (eds.) 1987. 16-29.
[Sampson, 1987b] Sampson, G. 1987. The grammatical database and parsing scheme. In Garside, Leech and Sampson (eds.) 1987. 82-96.
[Smart (Ed.), forthcoming] Smart (Ed.) (forthcoming). Structured Information Management: Processing and Retrieval. (provisional title).
[Tapanainen, 1991] Tapanainen, P. 1991. Äärellisinä automaatteina esitettyjen kielioppisääntöjen soveltaminen luonnollisen kielen jäsentäjässä (Natural language parsing with finite-state syntactic rules). Master's thesis. Dept. of computer science, University of Helsinki.
[Taylor, Grover and Briscoe, 1989] Taylor, L., Grover, C. and Briscoe, T. 1989. The Syntactic Regularity of English Noun Phrases. In Proceedings of the Fourth Conference of the European Chapter of the ACL. 256-263.
[Voutilainen, forthcoming a] Voutilainen, A. (forthcoming a). Context-sensitive disambiguation. In Karlsson & al.
[Voutilainen, forthcoming b] Voutilainen, A. (forthcoming b). Experiments with heuristics. In Karlsson & al.
[Voutilainen, forthcoming 1993] Voutilainen, A. (forthcoming 1993). Designing a parsing grammar.
[Voutilainen, Heikkilä and Anttila, 1992] Voutilainen, A., Heikkilä, J. and Anttila, A. (1992). Constraint Grammar of English: A Performance-Oriented Introduction. Publication No. 21, Department of General Linguistics, University of Helsinki.
[Voutilainen and Tapanainen, 1993] Voutilainen, A. and Tapanainen, P. 1993. Ambiguity resolution in a reductionistic parser. Proceedings of EACL'93. Utrecht, Holland.

APPENDIX

Here is given the NPtool analysis of a small sample from the CACM text collection. Here is the input text:

The binary number system offers many advantages over a decimal representation for a high-performance, general-purpose computer. The greater simplicity of a binary arithmetic unit and the greater compactness of binary numbers both contribute directly to arithmetic speed. Less obvious and perhaps more important is the way binary addressing and instruction formats can increase the overall performance. Binary addresses are also essential to certain powerful operations which are not practical with decimal instruction formats. On the other hand, decimal numbers are essential for communicating between man and the computer. In applications requiring the processing of a large volume of inherently decimal input and output data, the time for decimal-binary conversion needed by a purely binary computer may be significant. A slower decimal adder may take less time than a fast binary adder doing an addition and two conversions. A careful review of the significance of decimal and binary addressing and both binary and decimal data arithmetic, supplemented by efficient conversion instructions.

Here is the list of noun phrases extracted by NPtool. For the key, see Section 4.4.

ok: addition
ok: advantage
ok: application
ok: arithmetic speed
ok: binary address
ok: binary addressing
ok: binary and decimal data arithmetic
ok: binary computer
ok: binary number system
ok: careful review of the significance of decimal and binary addressing
ok: certain powerful operation
ok: computer
ok: decimal instruction format
ok: decimal number
ok: decimal representation
ok: decimal-binary conversion
ok: efficient conversion instruction
ok: fast binary adder
ok: high-performance, general-purpose computer
ok: greater compactness of binary number
ok: greater simplicity of a binary arithmetic unit
ok: instruction format
ok: man
ok: overall performance
ok: slower decimal adder
ok: time
ok: two conversion
ok: way
?: communicating
?: processing of a large volume of inherently decimal input
?: processing of a large volume of inherently decimal input and output data
?: output data
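The two-level `ok:'/`?:' key used in the listing above lends itself to simple post-processing. As a sketch (the function name is illustrative, and multi-line entries are assumed to have been re-joined into single lines):

```python
def split_hits(lines):
    """Separate NPtool-style output lines into sure hits ('ok:') and
    uncertain hits ('?:'), stripping the key prefix from each phrase."""
    sure, uncertain = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("ok:"):
            sure.append(line[len("ok:"):].strip())
        elif line.startswith("?:"):
            uncertain.append(line[len("?:"):].strip())
    return sure, uncertain

output = [
    "ok: binary number system",
    "ok: decimal representation",
    "?: output data",
]
sure, uncertain = split_hits(output)
# sure      -> ['binary number system', 'decimal representation']
# uncertain -> ['output data']
```

A validator can then accept the sure hits wholesale and review only the uncertain ones, which, as noted in Section 6, carry some 95% of the superfluous candidates.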