AMSTERDAM STUDIES IN THE THEORY AND
HISTORY OF LINGUISTIC SCIENCE
General Editor
E.F.K. KOERNER
(Zentrum für Allgemeine Sprachwissenschaft, Typologie
und Universalienforschung, Berlin)
Volume 292
edited by
Nicolas Nicolov
Umbria, Inc.
Kalina Bontcheva
University of Sheffield
Galia Angelova
Bulgarian Academy of Sciences
Ruslan Mitkov
University of Wolverhampton
Editors’ Foreword ix
III. PARSING
Ming-Wei Chang, Quang Do & Dan Roth
Multilingual dependency parsing: A pipeline approach 55
Sandra Kübler
How does treebank annotation influence parsing? Or how
not to compare apples and oranges 79
V. CLASSIFICATION
Nicolas Nicolov & Franco Salvetti
Efficient spam analysis for weblogs through URL segmentation 125
Jörg Tiedemann
A genetic algorithm for optimising information retrieval
with linguistic features in question answering 177
VII. ONTOLOGIES
Keiji Shinzato & Kentaro Torisawa
A simple WWW-based method for semantic word class acquisition 207
IX. CORPORA
Anne De Roeck
The role of data in NLP: The case for dataset profiling 259
The book covers a wide variety of NLP topics: datasets, annotation, tree-
banks, parallel corpora, information extraction, parsing, word sense disam-
biguation, translation, indexing, ontologies, question answering, document
similarity, spam analysis, document classification, anaphora resolution, re-
ferring expressions generation, textual entailment, latent semantic analysis,
summarization, rhetorical relations, etc. To help readers find their way
we have added an index which contains the major NLP terms used throughout
the volume. We have also included a list of all contributors and their addresses.
We would like to thank all members of the Program Committee. With-
out them the conference, although well organised, would not have had an
impact on the development of NLP. Together they have ensured that the
best papers were included in the final proceedings and have provided in-
valuable comments for the authors, so that the papers are ‘state of the art’.
The following is a list of those who participated in the selection process and
to whom a public acknowledgement is due:
Eneko Agirre (Univ. of the Basque Country, Donostia)
Elisabeth André (University of Augsburg)
Galia Angelova (Bulgarian Academy of Sciences)
Amit Bagga (Ask Jeeves, Piscataway)
Branimir Boguraev (IBM, T. J. Watson Research Center)
Kalina Bontcheva (University of Sheffield)
António Branco (University of Lisbon)
Eugene Charniak (Brown University, Providence)
Dan Cristea (University of Iaşi)
Hamish Cunningham (University of Sheffield)
Walter Daelemans (University of Antwerp)
Ido Dagan (Bar Ilan Univ. & FocusEngine, Tel Aviv)
Robert Dale (Macquarie University)
Thierry Declerck (DFKI GmbH, Saarbrücken)
Gaël Dias (Beira Interior University, Covilhã)
Rob Gaizauskas (University of Sheffield)
Alexander Gelbukh (Nat. Polytechnic Inst., Mexico)
Ralph Grishman (New York University)
Walther von Hahn (University of Hamburg)
Jan Hajič (Charles University, Prague)
Johann Haller (University of Saarland)
Catalina Hallett (Open University, Milton Keynes)
Graeme Hirst (University of Toronto)
Eduard Hovy (ISI, University of Southern California)
Nikiforos Karamanis (University of Wolverhampton)
Martin Kay (Stanford University)
Dorothy Kenny (Dublin City University)
Alma Kharrat (Microsoft Natural Language Group)
Linguistic Challenges
John Nerbonne
University of Groningen
Abstract
Even now techniques are in common use in computational linguistics
which could lead to important advances in pure linguistics, especially
language acquisition and the study of language variation, if they were
applied with intelligence and persistence. Reliable techniques for as-
saying similarities and differences among linguistic varieties are useful
not only in dialectology and sociolinguistics, but could also be valu-
able in studies of first and second language learning and in the study
of language contact. These techniques would be even more valuable
if they indicated not only relative degrees of similarity but also the source of
deviation (contamination). Given the current tendency in linguistics
to wish to confront the data of language use more directly, techniques
are needed which can handle large amounts of noisy data and extract
reliable measures from them. The current focus in Computational
Linguistics on useful applications is a very good thing, but some
further attention to the linguistic use of computational techniques
would be very rewarding.1
1 Introduction
The idea my plea is based on—that there are opportunities for computa-
tional contributions to “pure” linguistics—is not absolutely new, of course,
as many computational linguists have been involved in issues of pure linguis-
tics as well, including especially grammatical theory. And we will naturally
attempt to identify such work as we become more concrete (below). We aim
to spark discussion by identifying less discussed areas where computational
forays appear promising, and in fact, we will not dwell on grammatical
theory at all.
It is best to add some caveats. First, the sort of appeal we aim at can only
be successful if it is sketched with some concrete detail. If we attempted
to argue the usefulness of computational techniques to general linguistic
theory very abstractly, virtually everyone would react, “Fine, but how can
we contribute more concretely?” But we can only provide more concrete
detail on a very limited number of subjects. Of course, we are limited by
our knowledge of these subjects as well, but the first caveat is that this
little essay cannot be exhaustive, only suggestive. We should be delighted
to hear promptly of several further areas of application for computational
techniques we omit here.
Second, the exhortation to explore issues in other branches of linguistics
more broadly takes the form of an examination of selected issues in non-
computational linguistics together with suggestions on how computational
techniques might shed added light on them. Since the survey is to be
brief, the suggestions about solutions—or perhaps, merely perspectives—
of necessity will also be brief. In particular, they will be no more than
suggestions, and will make no pretense at demonstrating anything at all.
Third, we might be misconstrued as urging you to ignore useful, money-
making applications in favor of dedicating yourselves to the higher goal of
collaborating in the search for scientific truth. But both the history of CL
and the usual modern attitude of scientists toward applications convince
me that the application-oriented side of CL is very important and eminently
worthwhile. Perhaps you should indeed turn a deaf ear to the seductions of
filthy mammon and consecrate yourself to a life of (pure) science, but this
is a matter between you and your clergyman (or analyst). You have not
gotten this advice here.
Fourth, and finally, we might be viewed as advocating different sorts
of applications, namely the application of techniques from one linguistic
subfield (CL) to another (dialectology, etc.). In this sense modern genetics
applies techniques from chemistry to biological molecules to determine the
physical basis of inheritance, anthropology applies techniques from nuclear
chemistry (carbon dating) to date human artefacts, and astronomy applies
techniques from optics (glass) and electromagnetism (radio astronomy) to
map the heavens. In all of these cases the primary motivation is scien-
tific curiosity, not utilitarian, and this view is indeed parallel to the step
advocated here.
Martin Kay received the Lifetime Achievement Award of the Associa-
tion for Computational Linguistics in 2005, and his acceptance address to
the 2005 meeting ended in a cri de coeur to computational linguists to keep
studies of language central: “Language [...] remains a valid and respectable
object of study, and I earnestly hope that [computational linguists] will con-
tinue to pursue it.” (Kay 2005:438). Perhaps the present piece can serve
to suggest less likely areas where CL might contribute to progress in the
scientific understanding of language.
2 Computational linguistics
constantly creating. For the most part the techniques we urge you to apply
more broadly have been developed in order to build better and more var-
ied applications, as this has been the great motor in the recent dynamics
of computational linguistics. But some of the techniques have also been
useful in theoretical computational linguistics, and the distinction will play
no role here. In fact, perhaps the simplest view is to acknowledge that ap-
plications and theory make use of common technology, a sort of technical
infrastructure, and to emphasize the opportunities this provides.
3 Dialectology
Fig. 1: In this line map the average Levenshtein distances between 490
Bulgarian dialects are shown for 36 words. Darker lines join varieties with
more similar pronunciations; lighter lines indicate more dissimilar ones
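The core of this measure is simple enough to sketch. The fragment below, a
minimal illustration in Python with names of our own choosing (the actual
study aggregates far more material and may normalise distances differently),
computes the classic Levenshtein distance and averages it over the words
transcribed at two dialect sites:

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (standard dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dialect_distance(site1, site2):
    """Average Levenshtein distance over the words attested at both
    sites; each site maps a word to its phonetic transcription."""
    shared = site1.keys() & site2.keys()
    return sum(levenshtein(site1[w], site2[w]) for w in shared) / len(shared)

Darker lines in a map such as Figure 1 would then simply join site pairs with
low dialect_distance values.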
4 Diachronic linguistics
But these studies deserve follow-ups, tests on new data, and extensions to
other problems. Among many remaining problems we note that it would
be valuable to detect borrowed words, which should not figure in cognate
lists, but which suggest interesting influence; to operationalize the notion
of semantic relatedness relevant to cognate recognition; to quantify how
regular sound change is; or to investigate the level of morphology, which is
regarded as especially probative in historical reconstruction. But we em-
phasize that there are likely to be interesting opportunities for contributions
with respect to detail as well, perhaps in the construction of instruments
to examine data more insightfully, to measure hypothesized aspects, or to
quantify the empirical base on which historical hypotheses are made.
5 Language acquisition
6 Language contact
Because the technique is still under development, we cannot yet report much
more. The differences are indeed statistically significant, which, in itself, is
not surprising. The corpora are quite raw, however, so that the differences
we are finding to date are dominated by hesitation noises and errors in tag-
ging. The promise is in the technique. If we have succeeded in developing
an automated measure of syntactic difference, we have opportunities for
application to a host of further questions about syntactic differences, e.g.,
about where these differences are detectable, and where not; about the time
course of contamination effects (do second-language learners keep improv-
ing, or is there a ceiling effect?); and about the role of the source language in
the degree of contamination. Some crucial computational questions would
remain, however, concerning detecting the source of contamination.
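The measure itself is not detailed here, so the sketch below shows only one
plausible instantiation, consistent with the remark about tagging errors:
compare the part-of-speech trigram distributions of two tagged corpora and
judge the observed difference against chance with a permutation test. All
names are illustrative and the real measure may differ.

import random
from collections import Counter

def trigrams(corpus):
    """Count POS-tag trigrams over a corpus given as a list of
    sentences, each sentence a list of POS tags."""
    counts = Counter()
    for tags in corpus:
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts

def distance(c1, c2):
    """L1 distance between the relative trigram frequencies."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(abs(c1[k] / n1 - c2[k] / n2) for k in c1.keys() | c2.keys())

def permutation_test(corpus_a, corpus_b, trials=1000):
    """Estimate how often a random relabelling of the sentences yields
    a distance at least as large as the observed one (a p-value)."""
    observed = distance(trigrams(corpus_a), trigrams(corpus_b))
    pool = corpus_a + corpus_b
    hits = 0
    for _ in range(trials):
        random.shuffle(pool)
        a, b = pool[:len(corpus_a)], pool[len(corpus_a):]
        if distance(trigrams(a), trigrams(b)) >= observed:
            hits += 1
    return hits / trials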
7 Other areas
As noted in the introduction, this brief survey has tried to develop a few
ideas in order to convince you that there are promising lines of inquiry
for computationalists who would seek to contribute to a broader range of
linguistic subfields. We suspect that there are many other areas, as well.
We have deliberately omitted grammatical theory from the list of poten-
tial near-term adopters of computational techniques. There are two reasons
for eschewing a sub-focus on grammar here, the first being the fact that the
potential relevance of computational work to grammatical theory has been
recognized for a long time, as grammar has been cited since the earliest days
of CL as a likely beneficiary of closer engagement (Kay 2002). But second,
even as computational grammar studies uncover new means of contributing
to the study of pure grammar (van Noord 2004), it seems to be a minority
of (non-computational) grammarians who recognize the value of computa-
tional work. Many researchers have explored this avenue, but the situation
has stabilized to one in which computational work is pursued vigorously
by small specialized groups (Head-Driven Phrase Structure Grammar and
Lexical-Functional Grammar), and largely ignored by most non-mainstream
grammarians. We deplore this situation as do others (Pollard 1993), but it
unfortunately appears to be quite stable.
In addition to the areas discussed above, it is easily imaginable that
CL techniques could play an interesting role in a number of other linguis-
tic subareas. As databases of linguistic typology become more detailed and
more comprehensive, they should become attractive targets for data-mining
techniques. Psycholinguistic studies of processing are promising because they
provide a good deal of empirical data. We shall be content with a single ex-
ample. Moscoso del Prado Martín (2003) reviews a large number of studies
relating the difficulty of processing complex word forms, i.e., those involv-
8 Conclusions
REFERENCES
Aarts, Jan & Sylviane Granger. 1998. “Tag Sequences in Learner Corpora: A
Key to Interlanguage Grammar and Discourse”. Learner English on Com-
puter ed. by Sylviane Granger, 132-141. London: Longman.
Albright, Adam & Bruce Hayes. 2003. “Rules vs. Analogy in English Past
Tenses: A Computational/Experimental Study”. Cognition 90:119-161.
Benedetto, Dario, Emanuele Caglioti & Vittorio Loreto. 2002. “Language Trees
and Zipping”. Physical Review Letters 88:4.048702.
Brants, Thorsten. 2000. “TnT — A Statistical Part of Speech Tagger”. 6th
Applied Natural Language Processing Conference, 224-231. Seattle, Wash-
ington.
Brent, Michael, ed. 1997. Computational Approaches to Language Acquisition.
Cambridge, Mass.: MIT Press.
Brent, Michael. 1999. “An Efficient, Probabilistically Sound Algorithm for Seg-
mentation and Word Discovery”. Machine Learning Journal 34:71-106.
Brent, Michael. 1999. “Speech Segmentation and Word Discovery: A Computa-
tional Perspective”. Trends in Cognitive Science 3:294-301.
Burke, James. 1985. The Day the Universe Changed. Boston, Mass.: Little,
Brown & Co.
Chambers, J.K. & Peter Trudgill. 1998. Dialectology. Cambridge, U.K.: Cam-
bridge University Press.
Cole, Ronald A., Joseph Mariani, Hans Uszkoreit, Annie Zaenen & Victor Zue.
1996. Survey of the State of the Art in Human Language Technology.
National Science Foundation (NSF) and European Commission (EC),
www.cse.ogi.edu/CSLU/HLTsurvey/
Dunn, Michael, Angela Terrill, Geert Reesink, Robert A. Foley & Stephen C.
Levinson. 2005. “Structural Phylogenetics and the Reconstruction of An-
cient Language History”. Science 309:5743.2072-2075.
Eska, Joseph F. & Don Ringe. 2004. “Recent Work in Computational Linguistic
Phylogeny”. Language 80:3.569-582.
Gildea, Daniel & Daniel Jurafsky. 1996. “Learning Bias and Phonological Rule
Induction”. Computational Linguistics 22:4.497-530.
van Coetsem, Frans. 1988. Loan Phonology and the Two Transfer Types in
Language Contact. Dordrecht: Foris Publications.
van Noord, Gertjan. 2004. “Error Mining for Wide-Coverage Grammar Engi-
neering”. 42nd Meeting of the Association for Computational Linguistics,
446-453. Barcelona, Spain.
Watson, Greg. 1996. “The Finnish-Australian English Corpus”. ICAME Jour-
nal: Computers in English Linguistics 20:41-70.
Yang, Charles. 2004. “Universal Grammar, Statistics or Both?”. Trends in
Cognitive Science 8:10.451-456.
NLP: An Information Extraction Perspective
Ralph Grishman
New York University
Abstract
This chapter examines some current issues in natural language pro-
cessing from the vantage point of information extraction (IE), and so
gives some perspective on what is needed to make IE more success-
ful. By IE we mean the identification of important types of relations
and events in unstructured text. IE provides a nice reference point
because it can benefit from a wide range of technologies, including
techniques for analyzing sentence structure, for identifying syntac-
tic and semantic paraphrases, and for discovering the information
structure of domains.
The methods just described will give us a better chance of identifying in-
stances of a particular linguistic expression, but we are still faced with the
problem of finding the myriad linguistic expressions of an event — all (or
most of) the paraphrases of a given expression of an event. A direct ap-
proach is to annotate all the examples of an event in a large corpus, and
then collect and distill them either by hand or using some linguistic rep-
resentation and machine learning method. However, good coverage may
require a really large corpus, which can be quite expensive. Could we do
better?
Finally, there may be situations where we don’t have specific event or rela-
tion types in mind – where we simply want to identify and extract the ‘im-
portant’ events and relations for a particular domain or topic. Riloff (1996)
introduced the basic idea of dividing a document collection into relevant (on
topic) and irrelevant (off topic) documents, and selecting constructs which
occur much more frequently in the relevant documents. Her approach relied
on a relevance-tagged corpus. This idea was extended by Yangarber et al.
(2000) to bootstrap the discovery process from a small ‘seed’ set of patterns
which define a topic. Sudo generalized the form of the discovered patterns
(Sudo et al. 2003) and created a system which started from a narrative de-
scription of a topic and used this description to retrieve relevant documents
(Sudo et al. 2001).
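To make the relevance-ratio idea concrete, here is a minimal sketch in the
spirit of Riloff (1996): candidate patterns are counted in relevant and
irrelevant documents and ranked by their relevance rate weighted by log
frequency. The helper names and the exact scoring formula are illustrative,
not those of the cited systems.

import math
from collections import Counter

def rank_patterns(relevant_docs, irrelevant_docs, min_freq=3):
    """Rank candidate extraction patterns by how strongly they skew
    toward the relevant (on-topic) documents; each document is given
    as the list of pattern instances found in it."""
    rel = Counter(p for doc in relevant_docs for p in doc)
    irr = Counter(p for doc in irrelevant_docs for p in doc)
    scores = {}
    for pattern, r in rel.items():
        if r < min_freq:
            continue                     # too rare to trust
        relevance_rate = r / (r + irr[pattern])
        scores[pattern] = relevance_rate * math.log(1 + r)
    return sorted(scores, key=scores.get, reverse=True)

The top-ranked patterns can then seed the next bootstrapping round, as in
the systems cited above.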
These methods have been used to collect the linguistic expressions for a
specific set of event types, and they are effective when these events form a
coherent ‘topic’ – when they co-occur in documents. Because these methods
are based on the distribution of constructs in documents, they may gather
together related but non-synonymous forms like ‘hire’, ‘fire’, and ‘resign’,
or ‘buy’ and ‘sell’. However, by coupling these methods with paraphrase
REFERENCES
Agichtein, Eugene & Luis Gravano. 2000. “Snowball: Extracting Relations from
Large Plain-Text Collections”. 5th ACM Int. Conf. on Digital Libraries,
85-94. San Antonio, Texas.
Barzilay, Regina & Kathleen R. McKeown. 2001. “Extracting Paraphrases from
a Parallel Corpus”. ACL/EACL 2001, 50-57. Toulouse, France.
Brin, Sergei. 1998. “Extracting Patterns and Relations from the World Wide
Web”. World Wide Web and Databases International Workshop (= LNCS
1590), 172-183. Heidelberg: Springer.
Chieu, Hai Leong & Hwee Tou Ng. 2002. “Named Entity Recognition: A Max-
imum Entropy Approach Using Global Information”. 19th Int. Conf. on
Computational Linguistics (COLING-2002 ), 190-196. Taipei.
Hirschman, Lynette. 1998. “Language Understanding Evaluations: Lessons
Learned from MUC and ATIS”. 1st Int. Conf. on Language Resources and
Evaluation (LREC-1998 ), 117-122. Granada, Spain.
Ji, Heng & Ralph Grishman. 2005. “Improving Name Tagging by Reference
Resolution and Relation Detection”. 43rd Annual Meeting of the Association
for Computational Linguistics (ACL’2005 ), 411-418. Ann Arbor, Michigan.
Kambhatla, Nanda. 2004. “Combining Lexical, Syntactic and Semantic Features
with Maximum Entropy Models for Extracting Relations”. Companion Vol-
ume to the Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL’2004 ), 178-181. Barcelona, Spain.
Kingsbury, Paul & Martha Palmer. 2002. “From TreeBank to PropBank”. 3rd
Int. Conf. on Language Resources (LREC-2002 ), 1989-1993. Las Palmas,
Spain.
Lin, Dekang & Patrick Pantel. 2001. “Discovery of Inference Rules for Question
Answering”. Natural Language Engineering 7:4.343-360.
Meyers, Adam, Ruth Reeves, Catherine Macleod, Rachel Szekely, Veronika Zielin-
ska, Brian Young & Ralph Grishman. 2004. “Annotating Noun Argument
Structure for Nombank”. 4th Int. Conf. on Language Resources and Evalu-
ation (LREC-2004 ), 803-806. Lisbon, Portugal.
Riloff, Ellen. 1996. “Automatically Generating Extraction Patterns from Un-
tagged Text”. 13th National Conference on Artificial Intelligence, 1044-1049.
Portland, Oregon.
Roth, Dan & Wen-tau Yih. 2004. “A Linear Programming Formulation for
Global Inference in Natural Language Tasks”. Conf. on Computational Nat-
ural Language Learning (CoNLL’2004 ), 1-8. Boston, Mass.
Sekine, Satoshi. 2005. “Automatic Paraphrase Discovery Based on Context and
Keywords between NE Pairs”. 3rd Int. Workshop on Paraphrasing, Jeju
Island, South Korea.
Sekine, Satoshi. 2006. “On-demand Information Extraction”. 21st Int. Conf. on
Computational Linguistics and 44th Annual Meeting of the Association for
Computational Linguistics (COLING-ACL’2006 ), Sydney, Australia.
Shinyama, Yusuke, Satoshi Sekine, Kiyoshi Sudo & Ralph Grishman. 2002.
“Automatic Paraphrase Acquisition from News Articles”. Human Language
Technology Conference (HLT-2002 ), San Diego, Calif.
Stevenson, Mark & Mark Greenwood. 2005. “A Semantic Approach to IE Pattern
Induction”. 43rd Annual Meeting of Association for Computational Linguis-
tics, 379-386. Ann Arbor, Michigan.
Sudo, Kiyoshi, Satoshi Sekine & Ralph Grishman. 2001. “Automatic Pattern
Acquisition for Japanese Information Extraction”. Human Language Tech-
nology Conference (HLT-2001 ), 51-58. San Diego, Calif.
Sudo, Kiyoshi , Satoshi Sekine & Ralph Grishman. 2003. “An Improved Ex-
traction Pattern Representation Model for Automatic IE Pattern Acquisi-
tion”. 41st Annual Meeting of the Association for Computational Linguistics
(ACL’2003 ), 224-231. Sapporo, Japan.
Yangarber, Roman, Ralph Grishman, Pasi Tapanainen & Silja Huttunen. 2000.
“Automatic Acquisition of Domain Knowledge for Information Extraction”.
18th Int. Conf. on Computational Linguistics (COLING-2000 ), 940-946.
Saarbrücken, Germany.
Zhao, Shubin, Adam Meyers & Ralph Grishman. 2004. “Discriminative Slot
Detection Using Kernel Methods”. 20th Int. Conf. on Computational Lin-
guistics (COLING-2004 ), 757-763. Geneva.
Zhao, Shubin & Ralph Grishman. 2005. “Extracting Relations with Integrated
Information Using Kernel Methods”. 43rd Annual Meeting of the Association
for Computational Linguistics, 419-428. Ann Arbor, Michigan.
Semantic Indexing using Minimum Redundancy Cut
in Ontologies
F. Seydoux & J.-C. Chappelier
Abstract
This paper presents a new method that improves semantic indexing
while reducing the number of indexing terms. Indexing terms are
determined using a minimum redundancy cut in a hierarchy of con-
ceptual hypernyms provided by an ontology (e.g., WordNet, EDR).
The results of some information retrieval experiments carried out on
several standard document collections using the WordNet and EDR
ontologies are presented, illustrating the benefit of the method.
1 Introduction
Three main uses of semantic knowledge for Information Retrieval are
reported in the literature: query expansion (Voorhees 1994,
Moldovan & Mihalcea 2000), Word Sense Disambiguation (Ide & Véronis
1998, Wilks & Stevenson 1998, Besançon et al. 2001) and semantic index-
ing. This contribution relates to the last of these, the main idea of which is
to use word senses rather than, or in addition to, the words (usually lem-
mas or stems) for indexing documents, in order to improve both recall (by
handling synonymy) and precision (by handling homonymy and polysemy).
However, the experiments reported in the literature lead to contradictory
results: some claim that it degrades the performance (Salton 1968, Harman
1988, Voorhees 1993, Voorhees 1998), whereas for others the gain seems sig-
nificant (Richardson & Smeaton 1995, Smeaton & Quigley 1996, Gonzalo
et al. 1998a & 1998b, Mihalcea & Moldovan 2000).
Although it seems desirable for IR systems to take as much semantic
information into account as possible, the resulting expansion of the data
processed may not let this information develop its full potential. Indeed, the
growth of the number of index terms not only increases the processing time
but could also reduce the precision, as discriminating between documents
using a very large number of index terms is a hard task.
This problem is not new, and various techniques aiming at reducing the
size of the indexing set already exist: filtering by stoplist, part of speech
tags, frequencies, or through statistical techniques as in LSI (Deerwester
et al. 1990) or PLSI (Hofmann 1999). However, most of these techniques are
not adapted to the case where explicit semantic information is available,
Fig. 1: Several indexing schemes: (a) usual indexing with words, stems or
lemmas; (b) synset (or hypernym synset) indexing: each indexing term is
replaced by its (hypernym) synset; this, in principle, reduces the size of
the indexing set, since all the indexing terms that share the same
hypernym are regrouped into one single indexing feature; (c) Minimum
Redundancy Cut (MRC) indexing: each indexing term is replaced by its
dominating concept chosen with MRC. This further reduces the size
of the indexing set, since all the indexing terms that are subsumed by the
same concept in the MRC are regrouped into one single indexing feature.
for example in the form of a thesaurus or an ontology (i.e., with some
underlying formal – not statistical – structure).
The focus of the work presented here is to use external structured se-
mantic resources such as an ontology in order to limit the semantic indexing
set. This work, which is a continuation of (Seydoux & Chappelier 2005),
relates, but from a different point of view, to experiments described in
(Gonzalo et al. 1998b, Whaley 1999, Mihalcea & Moldovan 2000), which
use the synsets (or hypernym synsets (Mihalcea & Moldovan 2000)) of
WordNet as indexing terms. We follow the onto-matching technique de-
scribed in (Kiryakov & Simov 1999), but here select the indexing terms
using the “Minimum Redundancy Cut” (MRC, see Figure 1), an
information-theoretic criterion applied to the inclusive ‘is-a’ relation
(hypernymy) provided by the WordNet (Fellbaum 1998, Miller 1995) and
EDR taxonomies (Miyoshi et al. 1996).
Fig. 2: (a) Lower search: node ni is replaced by ni↓ without the nodes
already covered by other nodes in Γ (e.g., m), i.e., (Γ \ {ni})⇓.
(b) Upper search: node ni is replaced by ni↑, and all nodes covered by ni↑
(i.e., (ni↑)⇓) are removed from Γ.
(see fig. 2a), where n⇓ stands for the transitive closure of n↓ ; and similarly
for ni ↑ (see fig. 2b).
Then, the cut with minimal redundancy among these newly considered
cuts and the current one is kept, and the search continues as long as better
cuts are found. The full algorithm is given hereafter (Algorithm 1).
This algorithm converges towards a local minimum redundancy cut close
to the leaves. Note that this algorithm can be stopped any time, if required,
since it always works on a complete cut.
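A sketch of this greedy search follows. The redundancy scorer is left
abstract, since the information-theoretic criterion itself is not reproduced
here, and the tree-access helpers (children, parent, and the leaf sets of each
node's transitive closure) are assumed interfaces of our own.

def search_mrc(cut, redundancy, children, parent, leaves):
    """Greedy local search for a minimum-redundancy cut.  Every node
    of the current cut is tentatively replaced by its children (lower
    search) or its parent (upper search); the best-scoring complete
    cut is kept until no move improves the redundancy."""
    best, score = set(cut), redundancy(cut)
    improved = True
    while improved:
        improved = False
        for n in list(best):
            candidates = []
            kids = children.get(n, [])
            if kids:
                # lower search: replace n by those children not already
                # covered by the rest of the cut
                others = set().union(*[leaves[m] for m in best if m != n])
                candidates.append((best - {n}) | {c for c in kids
                                                  if not leaves[c] <= others})
            p = parent.get(n)
            if p is not None:
                # upper search: replace n by its parent and drop every
                # cut node the parent subsumes
                candidates.append({m for m in best
                                   if not leaves[m] <= leaves[p]} | {p})
            for cand in candidates:
                if redundancy(cand) < score:
                    best, score, improved = cand, redundancy(cand), True
                    break
            if improved:
                break            # restart the scan from the improved cut
    return best

Because every intermediate state is itself a complete cut, the loop can
indeed be interrupted at any time, as noted above.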
3 Experiments
Fig. 3: Possible configurations when indexing the queries’ terms. Word ‘a’
(not in the ontology) is always indexed by itself, while ‘d’ is always
indexed by ‘C’ and ‘e’, ‘f’ and ‘g’ are always indexed by ‘B’. In the 0up
strategy, ‘b’ and ‘c’ are not indexed; in the 1up strategy, ‘c’ is indexed by
‘C’; and in the 2up strategy, ‘b’ is indexed as well, by ‘B’ and ‘C’.
sort of disambiguation), and only the most frequent sense for EDR.⁵
4. An MRC cut is then computed with the algorithm previously presented.
5. The indexes of the documents and queries are then computed; each token
is substituted by the concepts from the cut which subordinate it. As
the cut was computed only with nodes covering words contained in
the documents (not the queries), tokens covered by the ontology but
not by concepts in the selected cut can occur in the queries. We thus
evaluated the three following strategies (see Figure 3):
“0up” the first strategy simply consists in ignoring these tokens (in
Figure 3: ’b’ and ’c’ are not indexed);
“1up” in a more sophisticated strategy, we check whether the related concepts
(or synsets) subordinate a part of the cut; in this case, the term
is indexed by that part of the cut, otherwise it is
ignored (in Figure 3: ‘b’ is ignored but ‘c’ is indexed by ‘C’);
“2up” the most sophisticated strategy evaluated here consists in also
looking at the hypernyms of the related concepts, i.e., (t↑)↑ (in
Figure 3: ‘c’ is still indexed by ‘C’, but ‘b’ is indexed as well,
by ‘B’ and ‘C’).
Tokens not covered by the ontology are indexed in the standard way
(in the figure, the token ’a’ is always indexed by itself).
⁵ This technique gives better results than keeping all possible senses in EDR (Sey-
doux & Chappelier 2005); this kind of WSD was not possible for WordNet, because
the most frequent sense information was not present in the WordNet version used.
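The three strategies can be sketched as a single fall-back procedure. The
helpers below (a token's concepts, a concept's hypernyms, and the cut
concepts above or below a given concept) are assumed for illustration; only
the control flow mirrors the description above.

def index_query_token(token, concepts, hypernyms, above, below, strategy):
    """Choose indexing terms for a query token.  concepts(token) gives
    its concepts (empty if the token is outside the ontology);
    above(c) the cut concepts subordinating c; below(c) the cut
    concepts that c itself subordinates; hypernyms(c) its direct
    hypernyms.  All four helpers are assumptions, not the paper's API."""
    cs = concepts(token)
    if not cs:
        return {token}           # like 'a' in Fig. 3: the word itself
    hit = {g for c in cs for g in above(c)}
    if hit:                      # like 'd'..'g': the normal case
        return hit
    if strategy == "0up":
        return set()             # ignore the token ('b' and 'c')
    # 1up: index by the part of the cut the token's concepts dominate
    hit = {g for c in cs for g in below(c)}
    if strategy == "2up":
        # 2up: climb one more level, i.e. (t^)^, so 'b' gets 'B' and 'C'
        hit |= {g for c in cs for h in hypernyms(c) for g in below(h)}
    return hit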
used ontology. Whereas with EDR the simple ‘0up’ strategy gives better results,
this strategy is clearly the worst with WordNet, while ‘2up’ seems to
give the best results. This can perhaps be explained by the structure of the
ontologies: all the synsets of WordNet directly cover several leaves, while for
EDR, many of the concepts are just “pure concepts”, only related to other
concepts.
As future work, we foresee confronting the results presented here with
similar experiments using a better WSD procedure (especially for WordNet),
and also using different weighting schemes for the evaluation (for example
the “lnu” weighting, more reliable for handling noisy data and dealing with
long documents). It would also be interesting to generalize this technique
in order to take into account all the relationships between terms given by the
ontology, in addition to the hypernym hierarchy. Finally, we want to
emphasize that the technique presented here is not limited to IR but could
also be applied to other NLP domains, such as document clustering (Hotho
et al. 2003) and text summarization.
REFERENCES
Besançon, R., J.-C. Chappelier, M. Rajman & A. Rozenknop. 2001. “Improving
Text Representations through Probabilistic Integration of Synonymy Rela-
tions”. Xth Int. Symp. on Applied Stochastic Models and Data Analysis
(ASMDA’2001 ), vol. 1, 200-205. Compiègne, France.
Deerwester, S.C., S.T. Dumais, T.K. Landauer, G.W. Furnas & R.A. Harshman.
1990. “Indexing by Latent Semantic Analysis”. Journal of the American
Society of Information Science 41:6.391-407.
Fellbaum, Christiane, ed. 1998. WordNet, An Electronic Lexical Database. Cam-
bridge, Mass.: MIT Press.
Gonzalo, J., F. Verdejo, C. Peters & N. Calzolari. 1998a. “Applying EuroWordNet
to Multilingual Text Retrieval”. Journal of Computers and the Humanities
32:2-3.185-207.
Gonzalo, J., F. Verdejo, I. Chugur & J. Cigarran. 1998b. “Indexing with WordNet
Synsets Can Improve Text Retrieval”. COLING/ACL Workshop on Usage
of WordNet for Natural Language Processing, 38-44. Montreal, Canada.
Harman, D. 1988. “Towards Interactive Query Expansion”. 11th Conf. on Re-
search and Development in Information Retrieval, 321-331. Grenoble.
Hofmann, T. 1999. “Probabilistic Latent Semantic Indexing”. 22nd Int. Conf.
on Research and Development in Information Retrieval, 50-57. Berkeley.
Hotho, A., S. Staab & G. Stumme. 2003. “WordNet Improves Text Document
Clustering”. SIGIR 2003 Semantic Web Workshop. Toronto, Canada.
Ide, Nancy & J. Véronis. 1998. “Word Sense Disambiguation: The State of the
Art”. Computational Linguistics 24:1.1-40.
Kiryakov, A.K. & K.Iv. Simov. 1999. “Ontologically Supported Semantic Match-
ing”. Nordic Conf. on Computational Linguistics. Trondheim, Norway.
Mihalcea, R. & D. Moldovan. 2000. “Semantic Indexing Using WordNet Senses”.
ACL Workshop on IR & NLP. Hong Kong.
Miller, G.A. 1995. “Wordnet: A Lexical Database for English”. Communications
of the ACM 38:11.39-41.
Miyoshi, H., K. Sugiyama, M. Kobayashi & T. Ogino. 1996. “An Overview of
the EDR Electronic Dictionary and the Current Status of Its Utilization”.
16th Int. Conf. on Computational Linguistics (COLING-1996 ), 1090-1093.
Copenhagen.
Moldovan, D.I. & R. Mihalcea. 2000. “Using WordNet and Lexical Operators to
Improve Internet Searches”. IEEE Internet Computing 4:1.34-43.
Richardson, R. & A.F. Smeaton. 1995. “Using WordNet in a Knowledge-based
Approach to Information Retrieval”. Technical Report CA-0395. Dublin,
Ireland: Dublin City University.
Salton, G. 1968. Automatic Information Organization and Retrieval. New York:
McGraw-Hill Book Company.
Salton, G. 1971. The SMART Retrieval System - Experiments in Automatic
Document Processing. Englewood Cliffs, NJ: Prentice Hall Inc.
Seydoux, F. & J.-C. Chappelier. 2005. “Hypernyms Ontologies for Semantic
Indexing”. Methodologies and Evaluation of Lexical Cohesion Techniques in
Real-world Applications (electra’05), 49-55. Salvador, Brazil.
Shannon, C.E. 1948. “A Mathematical Theory of Communication”. The Bell
System Technical Journal 27.379-423.
Smeaton, A.F. & I. Quigley. 1996. “Experiments on Using Semantic Distances
between Words in Image Caption Retrieval”. 19th Int. ACM-SIGIR Conf.
on Research and Development in Information Retrieval, 174-180. Zurich.
Voorhees, E.M. 1993. “Using WordNet to Disambiguate Word Senses for Text
Retrieval”. 16th Annual Int. ACM-SIGIR Conf. on Research and Develop-
ment in Information Retrieval, 171-80. Pittsburgh, Penn.
Voorhees, E.M. 1994. “Query Expansion Using Lexical-Semantic Relations”.
17th Int. ACM-SIGIR Conf. on Research and Development in Information
Retrieval, 61-69. Dublin, Ireland.
Voorhees, E.M. 1998. “Using WordNet for Text Retrieval”. WordNet, An Elec-
tronic Lexical Database 285-303. Cambridge, Mass.: MIT Press.
Whaley, J.M. 1999. “An Application of Word Sense Disambiguation to Infor-
mation Retrieval”. Technical Report PCS-TR99-352. Hanover, New Haven:
Dartmouth College, Computer Science Dept.
Wilks, Yorick & Mark Stevenson. 1998. “Word Sense Disambiguation Using
Optimised Combinations of Knowledge Sources”. 17th Int. Conf. on Com-
putational Linguistics (COLING-1998 ), 1398-1402. Montreal, Canada.
Indexing and Querying Linguistic Metadata and
Document Content
Niraj Aswani, Valentin Tablan, Kalina Bontcheva &
Hamish Cunningham
University of Sheffield
Abstract
The need for efficient corpus indexing and querying arises frequently
both in machine learning-based and human-engineered natural lan-
guage processing systems. This paper presents the ANNIC system,
which can index documents not only by content, but also by their
linguistic annotations and features. It also enables users to for-
mulate versatile queries mixing keywords and linguistic information.
The result consists of the matching texts in the corpus, displayed
within the context of linguistic annotations (not just text, as is cus-
tomary for KWIC systems). The data is displayed in a graphical
user interface, which facilitates its exploration and the discovery of
new patterns, which can in turn be tested by launching new ANNIC
queries.¹
1 Introduction
The need for efficient corpus indexing and querying arises frequently both
in machine learning-based and human-engineered natural language process-
ing systems. A number of query systems have already been proposed;
Mason (1998:20-27), Bird (2000b:1699-1706) and Gaizauskas (2003:
227-236) are amongst the most recent ones. In this paper, we present a full-
featured annotation indexing and retrieval search engine, called
ANNIC (ANNotations-In-Context), which has been developed as part of
GATE (General Architecture for Text Engineering) (Cunningham 2002:168-
175).
Whilst systems such as McKelvie & Mikheev (1998), Gaizauskas (2003:
227-236) and Cassidy (2002) are targeted towards specific types of docu-
ments, Bird (2000b:1699-1706) and Mason (1998:20-27) are general purpose
systems. ANNIC falls in between these two types, because it can index docu-
ments in any format supported by the GATE system. These existing systems
were taken as a starting point, but ANNIC goes beyond their capabilities in
a number of important ways. New features address issues such as exten-
sive indexing of linguistic information associated with document content,
¹ This work was partially supported by an AHRB grant (ETCSL) and an EU grant (SEKT).
2 GATE
3 ANNIC
ANNIC is built on top of Apache Lucene,² a high-performance, full-
featured search engine implemented in Java which supports indexing and
searching of large document collections. Our choice of IR engine is due to
the customisability of Lucene. After a few changes to the behaviour of the
key components of Lucene, we were able to adapt it to our
requirements.
Table 3.1 explains the token stream that contains tokens for the above
annotations. The annotation type itself is stored as a separate Lucene token
with its attribute type “*” and text as the value of “annotation type”. This
allows users to search for a particular annotation type. In order not to
confuse features of one annotation with others, feature names are qualified
with their respective annotation type names. Where there exist multiple
annotations over the same piece of text, only the position of the very first
feature of the very first annotation is set to 1 and it is set to 0 for the
rest of the annotations and their features. This enables users to query over
overlapping annotations and features.
² http://lucene.apache.org
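The position-increment scheme just described can be sketched as follows.
The tuple layout is illustrative rather than Lucene's actual API; the point
is that only the first token of a set of co-located annotations advances the
position, so overlapping annotations remain searchable at a single spot.

def to_token_stream(annotations):
    """Flatten co-located annotations into (term text, term type,
    position increment) triples.  Each annotation contributes one token
    for its type (term type "*") plus one token per feature, the feature
    name qualified by the annotation type; every token after the first
    gets increment 0 so that all of them share one position."""
    stream, first = [], True
    for ann_type, features in annotations:
        stream.append((ann_type, "*", 1 if first else 0))
        first = False
        for name, value in features.items():
            stream.append((value, ann_type + "." + name, 0))
    return stream

# e.g. annotations starting at the same offset, over the word "Emily":
# to_token_stream([("Token", {"string": "Emily", "category": "NNP"}),
#                  ("Person", {})])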
It is possible for two annotations to share the same offsets. If two anno-
tations share both offsets, such annotations are kept at the same position
in the token stream, and otherwise one after the other. But what if anno-
tations overlap each other (i.e., they share only one of the start and end
offsets)? In this case, though the annotations do not appear one after another
in the text, they are stored one after the other. This may lead to incorrect
results being returned, and therefore the results are verified in order to
filter out invalid overlapping patterns.
Before indexing GATE documents with Lucene, we convert them into
the Lucene format and refer to them as GATE Lucene documents. For each
document, the token stream is stored in a separate file as a Java serializable
object in the index directory. Later, it is retrieved in order to fetch the left
and right contexts of a found pattern.
                                       Query Term
JAPE Pattern Token                     Term Text    Term Type
“String”                               String       Token.string
{annotType}                            annotType    “*”
{annotType.featureType == “value”}     value        annotType.featureType
Table 2: JAPE pattern tokens and their respective query terms
We introduce our own query parser (the ANNIC query parser), which accepts
JAPE queries and converts them into Lucene queries. The Lucene query parser,
this query returns all positions from the token stream where the annota-
tion is “Person”. We compare the remaining terms (i.e., Token.string ==
“told”) by fetching the terms after the “Person” annotation and by comparing
the query terms with them.
Kleene operators: ANNIC supports two operators, + and *, to specify
the number of times a particular annotation or a sub-pattern should appear
in the main query pattern. Here, ({A})+n means one and up to n occur-
rences of annotation {A}, and ({A})*n means zero and up to n occurrences
of annotation {A}.
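One straightforward way to support such bounded operators on top of an
engine that lacks them, sketched below as an assumption rather than a
description of ANNIC's internals, is to expand the pattern into every
fixed-length token sequence it can match and submit each as an ordinary
query.

from itertools import product

def expand_kleene(pattern):
    """Expand bounded Kleene operators.  A pattern is a list whose
    items are either a token constraint (a string) or a triple
    (sub_pattern, min_count, max_count); the result lists every
    fixed-length sequence the pattern can match."""
    choices = []
    for item in pattern:
        if isinstance(item, tuple):
            sub, lo, hi = item
            choices.append([sub * k for k in range(lo, hi + 1)])
        else:
            choices.append([[item]])
    return [sum(parts, []) for parts in product(*choices)]

# ({A})+2 {B}  expands to  [{A} {B}]  and  [{A} {A} {B}]:
# expand_kleene([(["{A}"], 1, 2), "{B}"])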
Figure 1 gives a snapshot of the ANNIC search window. The bottom section
in the window contains the patterns along with their left and right context
QP   Patterns
1    {Token.string==“Microsoft”} | “Microsoft Corp”
2    {Person} {Token.cat==“IN”} ({Token.cat==“DT”})*1 {Organization}
3    ({Token.kind==number})+4
     ({Token.string==“/”} | {Token.string==“-”}) ({Token.kind==number})+2
     ({Token.string==“/”} | {Token.string==“-”}) ({Token.kind==number})+2
4    {Title} ({Token.orth==upperInitial} | {Token.orth==allCaps}) ({FirstPerson})*1
5    ({Token.cat==“DT”})*1 {Location} {Token.cat==“CC”}
     ({Token.cat==“DT”})*1 {Location} {Organization} {Token.cat==“IN”}
     ({Token.cat==“DT”})*1 {Location}
QP = Query Pattern
Table 3: ANNIC queries
5 Applications of ANNIC
actual live analysis of the corpus. Each pattern intended as part of a JAPE
rule can be easily tested directly on the corpus and have its specificity and
coverage assessed almost instantaneously.
ANNIC can also be used for corpus analysis. It allows querying the in-
formation contained in a corpus in more flexible ways than simple full-text
search. Consider a corpus containing news stories that has been processed
with a standard named entity recognition system like ANNIE.³ A query like
{Organization} ({Token})*3 ({Token.string==’up’}|{Token.string==’down’}) ({Money} | {Percent})
would return mentions of share movements like “BT shares ended up 36p”
or “Marconi was down 15%”. Locating this type of useful text snippet
would be very difficult and time-consuming if the only tool available were
text search. ANNIC can also be useful in helping scholars analyse linguis-
tic corpora. Sumerologists, for instance, could use it to find all places in the
ETCSL corpus⁴ where a particular pair of lemmas occurs in sequence.
6 Performance results
³ GATE is distributed with an IE system called ANNIE, A Nearly-New IE system.
⁴ http://www-etcsl.orient.ox.ac.uk/
7 Related work
2) ANNIC queries are very specific about the annotation types, i.e., the query
itself describes the annotation type in which the string should be searched.
If the user does not specify an annotation type, ANNIC automatically
searches for strings with the GATE Token annotation type.
REFERENCES
Bird, S., P. Buneman & W. Tan. 2000a. “Towards a Query Language for An-
notation Graphs”. Proceedings of the Second International Conference on
Language Resources and Evaluation, 807-814. Athens.
Bird, S., D. Day, J. Garofolo, J. Henderson, C. Laprun & M. Liberman. 2000b.
“ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation”.
Proceedings of the Second International Conference on Language Resources
and Evaluation, 1699-1706, Athens.
Buneman, P., A. Deutsch, W. Fan, H. Liefke, A. Sahuguet & W.C. Tan. 1998.
“Beyond XML Query Languages”. Proceedings of the Query Language Work-
shop (QL 1998 ). http://www.w3.org/TandS/QL/QL98/pp/penn.ps [Source
checked in April 2006]
Cassidy, S. 2002. “Xquery as an Annotation Query Language: A Use Case
Analysis”. Proceedings of 3rd Language Resources and Evaluation Conference
(LREC’2002 ), Gran Canaria, Spain.
Cunningham, H., D. Maynard, K. Bontcheva & V. Tablan. 2002. “GATE: A
Framework and Graphical Development Environment for Robust NLP Tools
and Applications”. Proceedings of the 40th Anniversary Meeting of the As-
sociation for Computational Linguistics (ACL 2002 ), 168-175. Philadelphia.
Gaizauskas, R., L. Burnard, P. Clough & S. Piao. 2003. “Using the XARA XML-
Aware Corpus Query Tool to Investigate the METER Corpus”. Proceedings
of the Corpus Linguistics 2003 Conference, 227-236. Lancaster, U.K.
Grishman, Ralph. 1997. “TIPSTER Architecture Design Document Version 2.3”.
Technical report, The Defense Advanced Research Projects Agency (DARPA).
http://www.itl.nist.gov/div894/894.02/related_projects/tipster/
[Source checked in April 2006]
Kazai, G., M. Lalmas & A. Vries. 2004. “The Overlapping Problem in Content-
Oriented XML Retrieval Evaluation”. Proceedings of the 27th International
conference on Research and Development in Information Retrieval, 72-79.
Sheffield, U.K.
Mason, O. 1998. “The CUE Corpus Access Tool”. Workshop on Distributing
and Accessing Linguistic Resources, 20-27, Granada, Spain.
http://www.dcs.shef.ac.uk/~hamish/dalr/ [Source checked in February
2006].
McKelvie, D. & A. Mikheev. 1998. “Indexing SGML files using LT NSL”. LT
Index documentation http://www.ltg.ed.ac.uk/ [Source checked in April
2006]
Term Representation
with Generalized Latent Semantic Analysis
Irina Matveeva*, Gina-Anne Levow*, Ayman Farahat** &
Christiaan Royer**
*University of Chicago, **Palo Alto Research Center
Abstract
Representation of term-document relations is very important for doc-
ument indexing and retrieval. In this paper, we present Generalized
Latent Semantic Analysis as a framework for computing semantically
motivated term and document vectors. Our focus on term vectors
is motivated by the recent success of co-occurrence based measures
of semantic similarity obtained from very large corpora. Our exper-
iments demonstrate that GLSA term vectors efficiently capture se-
mantic relations between terms and outperform related approaches
on the synonymy test.
1 Introduction
Fig. 1: Precision with GLSA, PMI and count over different window sizes,
for the TOEFL (left), TS1 (middle) and TS2 (right) tests
3 Related approaches
Iterative Residual Rescaling (Ando 2000) tries to put more weight on docu-
ments from underrepresented clusters of documents to improve the perfor-
mance of LSA on heterogeneous collections. Locality Preserving Indexing
(LPI) (He et al. 2004) is a graph-based dimensionality reduction algorithm
which preserves the similarities only locally. The Word Space Model (WS)
for word sense disambiguation developed by Schütze (1998) and Latent Re-
lational Analysis (LRA) (Turney 2001) are other special cases of GLSA
which compute term vectors directly.
These approaches are limited to only one measure of semantic associ-
ation. They use weighted co-occurrence statistics in the document space
(LSA, LPI), in the space of the most frequent informative terms (WS) and in
the space of context patterns (LRA) to compute input similarities.
4 Experiments
The goal of the experimental evaluation of the glsa term vectors was to
demonstrate that the glsa vector space representation for terms captures
their semantic relations. We used the synonymy and term pairs tests for
the evaluation.
To collect the co-occurrence information for the matrix of pair-wise term
similarities S, we used a subset of the English Gigaword collection (LDC),
containing New York Times articles labeled as “story”. We had 1,119,364
documents with 771,451 terms. We used the Lemur toolkit to tokenize and
index all document collections used in our experiments; we used stemming
and a list of stop words.
The similarities matrix S was constructed using the PMI scores. The
results with the likelihood ratio and χ² were below those for PMI and we omit
them here. We used the PMI matrix S in combination with SVD to compute
GLSA term vectors. Unless stated otherwise, for the GLSA method we report
the best performance over different numbers of embedding dimensions. We
used the PLAPACK package to perform the SVD.
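The pipeline just described can be sketched compactly. The sketch assumes
matrices small enough for a dense, exact SVD (the 771,451-term vocabulary
reported above needs large-scale solvers such as PLAPACK), and the scaling
of the singular vectors is one common choice, not necessarily the exact one
used here.

import numpy as np

def glsa_term_vectors(freq, pair_counts, total, k=300):
    """Build the matrix S of pair-wise PMI scores from co-occurrence
    counts, then keep the top-k SVD dimensions as term vectors.
    freq[i]: corpus frequency of term i; pair_counts[(i, j)]:
    co-occurrence count within the context window; total: number of
    contexts."""
    n = len(freq)
    S = np.zeros((n, n))
    for (i, j), nij in pair_counts.items():
        pmi = np.log(nij * total / (freq[i] * freq[j]))
        S[i, j] = S[j, i] = pmi
    U, s, _ = np.linalg.svd(S)
    return U[:, :k] * np.sqrt(s[:k])    # one row per term

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

A synonymy question is then answered by picking the candidate whose vector
has the highest cosine with the target word's vector.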
Table 3: Precision at the 5 nearest terms for some of the top 100 words by
mutual information with the class label. The table also shows the first 3
nearest neighbors. The word’s label is given in brackets (1=os.windows;
3=hardware; 4=graphics; 5=forsale; 7=autos; 10=crypt;
13=middle-east; 15=motorcycles; 17=hockey; 18=religion-christian;
20=baseball)
label. The top 100 are highly discriminative with respect to the newsgroup
label whereas the top 1000 words contain many frequent words. Our results
show that GLSA is much less sensitive to this than PMI.
First, we sorted all pairs of words by similarity and computed precision
at the k most similar pairs as the ratio of word pairs that have the same
label. Table 1 shows that GLSA significantly outperforms the PMI score.
PMI has very poor performance, since here the comparison is done across
different pairs of words.
The second set of scores was computed for each word as precision at
the top k nearest terms, similar to precision at the first k retrieved docu-
ments used in information retrieval. We report the average precision values
for different values of k in Table 2. GLSA achieves higher precision than
PMI. GLSA performance has a smooth shape, peaking at around 200-300
dimensions, which is in line with results for other SVD-based approaches.
The dependency on the number of dimensions was the same for the top
1000 words.
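This evaluation is easy to state in code; the sketch below assumes a matrix
of term vectors and a list of newsgroup labels, with names of our own
choosing.

import numpy as np

def precision_at_k(vectors, labels, k=5):
    """For each word, the fraction of its k nearest neighbours (by
    cosine over the rows of 'vectors') that share the word's label,
    averaged over all words."""
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)     # exclude the word itself
    scores = []
    for i, row in enumerate(sims):
        nearest = np.argsort(row)[-k:]
        scores.append(np.mean([labels[j] == labels[i] for j in nearest]))
    return float(np.mean(scores))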
In Table 3 we show results for some of the words. The GLSA representation
achieves very good results for terms that are good indicators of particular
newsgroups, such as “god” or “bike”. For frequent words, and words which
have multiple senses, such as “windows” or “article”, the precision is lower.
The pair “car”, “driver” is semantically related for one sense of the word
“driver”, but the word “driver” is assigned to the group “os.windows” with
a different sense.
Our experiments have shown that the cosine similarity between the GLSA
term vectors corresponds well to the semantic similarity between pairs of
terms. Interesting questions for future work are connected to the computa-
tional issues. Like other methods based on a matrix decomposition, GLSA is
limited in the size of the vocabulary that it can handle efficiently. Since terms
can be divided into content-bearing and function words, GLSA computa-
tions only have to include content-bearing words. Since the GLSA document
vectors are constructed as linear combinations of term vectors, the inner
products between the term vectors are implicitly used when the similarity
between the document vectors is computed. Another interesting extension
is therefore to incorporate the inner products between GLSA term vectors
into the language modelling framework and evaluate the impact of the GLSA
representation on the information retrieval task.
We have presented the GLSA framework for computing semantically mo-
tivated term and document vectors. This framework allows us to take ad-
vantage of the availability of large document collections and recent research
on corpus-based term similarity measures, and to combine them with dimen-
sionality reduction algorithms. Using the combination of point-wise mutual
information and singular value decomposition we have obtained term vec-
tors that outperform state-of-the-art approaches on the synonymy test
and show a clear advantage over the PMI-IR approach on the term pairs test.
REFERENCES
Ando, R.K. 2000. “Latent Semantic Space: Iterative Scaling Improves Preci-
sion of Inter-Document Similarity Measurement”. Proceedings of the 23th
annual international ACM SIGIR conference on Research and development
in information retrieval (SIGIR’00 ), 216-223. Athens, Greece.
Bartell, B.T., G.W. Cottrell, & R.K. Belew. 1992. “Latent semantic indexing is
an optimal special case of multidimensional scaling”. Proceedings of the 15th
annual international ACM SIGIR conference on Research and development
in information retrieval (SIGIR’92 ), 161-167. Copenhagen, Denmark.
Cox, T.F. & M.A. Cox. 2001. Multidimensional Scaling. London: CRC/Chapman
and Hall.
Deerwester, S.C., S.T. Dumais, T.K. Landauer, G.W. Furnas & R.A. Harshman.
1990. “Indexing by Latent Semantic Analysis”. Journal of the American
Society of Information Science 41:6.391-407.
Golub, G. & C. Reinsch. 1971. Handbook for Matrix Computation II, Linear
Algebra. New York: Springer-Verlag.
He, X., D. Cai, H. Liu & W. Ma. 2004. “Locality preserving indexing for doc-
ument representation”. Proceedings of the 27th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval
(SIGIR’04 ), 96-103. Sheffield, U.K.
Landauer, T.K. & S.T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent
Semantic Analysis Theory of the Acquisition, Induction, and Representation
of Knowledge”. Psychological Review 104:2.211-240.
Manning, C. & H. Schutze. 1999. Foundations of Statistical Natural Language
Processing. Cambridge, Mass.: MIT Press.
Ponte, J.M. & W.B. Croft. 1998. “A language modeling approach to information
retrieval”. Proceedings of the 21th Annual International ACM SIGIR Con-
ference on Research and Development in Information Retrieval (SIGIR’98 ),
275-281. Melbourne, Australia.
Salton, G. & M.J. McGill. 1983. Introduction to Modern Information Retrieval.
New York: McGraw-Hill.
Schütze, H. 1998. “Automatic Word Sense Discrimination”. Computational Lin-
guistics 24:1.97-123.
Terra, E.L. & C.L. Clarke. 2003. “Frequency Estimates for Statistical Word
Similarity Measures”. Proceedings of the Joint Human Language Technology
Conference and the Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics (HLT-NAACL’2003 ), vol. 1, 165-172.
Edmonton, Canada.
Turney, Peter D. 2001. “Mining the Web for Synonyms: PMI-IR versus LSA
on TOEFL”. Machine Learning: ECML 2001 — Proceedings of the 12th
European Conference on Machine Learning, Freiburg, Germany ed. by Luc
de Raedt & Peter Flach (= Lecture Notes in Artificial Intelligence, 2167),
491-502. Heidelberg: Springer.
Widdows, D. 2003. “A Mathematical Model for Context and Word-Meaning”.
Proceedings of the 4th International and Interdisciplinary Conference on Mod-
eling and Using Context, 23-25. Stanford, Calif.
Multilingual Dependency Parsing:
A Pipeline Approach
Ming-Wei Chang, Quang Do & Dan Roth
University of Illinois at Urbana-Champaign
Abstract
This paper develops a general framework for machine learning based
dependency parsing based on a pipeline approach, where a task is
decomposed into several sequential stages. To overcome the error
accumulation problem of pipeline models, we propose two natural
principles for pipeline frameworks: (i) make local decisions as reli-
able as possible, and (ii) reduce the number of sequential decisions
made. We develop an algorithm that provably satisfies these princi-
ples and show that the proposed principles support several algorith-
mic choices that improve the dependency parsing accuracy signifi-
cantly. We present state of the art experimental results for English
and several other languages.1
1 Introduction
later stage to fix mistakes due to the pipeline. Punyakanok et al. (2005)
and Marciniak & Strube (2005) also address some aspects of this problem.
However, these solutions rely on the fact that all decisions are made with
respect to the same input; specifically, all classifiers considered use the same
examples as their input. In addition, the pipelines they study are shallow.
This paper develops a general framework for decisions in pipeline models
which addresses these difficulties. Specifically, we are interested in deep
pipelines—a large number of predictions that are being chained.
A pipeline process is one in which decisions made in the ith stage (1) de-
pend on earlier decisions and (2) feed on input that depends on earlier de-
cisions. The latter issue is especially important at evaluation time since, at
training time, a gold standard data set might be used to avoid this issue.
We develop and study the framework in the context of a bottom up ap-
proach to dependency parsing. We suggest that two principles should guide
the pipeline algorithm development:
(i) Make local decisions as reliable as possible.
(ii) Reduce the number of decisions made.
Using these as guidelines we devise an algorithm for dependency parsing,
prove that it satisfies these principles, and show experimentally that this
improves the accuracy of the resulting tree.
Specifically, our approach is based on shift-reduce parsing as in (Ya-
mada & Matsumoto 2003). Our general framework provides insights that
allow us to improve their algorithm, and to principally justify some of the
algorithmic decisions. Specifically, the first principle suggests to improve
the reliability of the local predictions, which we do by improving the set of
actions taken by the parsing algorithm, and by using a look-ahead search.
The second principle is used to justify the control policy of the parsing algo-
rithm, namely, which edges to consider at different stages in the algorithm.
We prove that our control policy is optimal in some sense and, overall,
that the decisions we made guided by these principles lead to a significant
improvement in the accuracy of the resulting parse tree.
This section describes our dependency parsing algorithm and justifies its
advantages by viewing and analyzing it as a pipeline model.
Fig.: The three parsing actions L, R and S, each applied to a word pair (a, b)
Definition 3 (Focus point) When the algorithm considers the pair (a, b),
a < b, we call the word a the current ‘focus point’.
Next we describe several policies for determining the focus point of the algo-
rithm following an action. We note that, with a few exceptions, determining
the focus point does not affect the correctness of the algorithm. It is easy
to show that for (almost) any focus point chosen, if the correct action is
selected for the corresponding word pair, the algorithm will eventually yield
the correct tree (but may require multiple cycles through the sentence). In
practice, the actions selected are noisy, and a wasteful focus point policy
will result in a large number of actions, and thus in error accumulation. To
minimize the number of actions taken, we want to find a good focus point
placement policy.
We always let the focus point move one word to the right after S. After
L or R there are several natural placement policies to consider for the focus
point:
Start Over: Move focus point to the first word in T .
Stay: Move the focus point so that the next pair to the right is considered.
That is, for T = (a, b, c) with the focus point on a, an L action leaves a as
the focus point (the pair considered being (a, c)), while an R action results
in the focus being b (the pair considered being (b, c)).
Step Back: The focus point moves to the previous word (on the left). That
is, for T = (a, b, c) with the focus point on b, in both cases a becomes the
focus point (the pair considered being (a, c) in the case of an R action,
and (a, b) in the case of an L action).
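
For concreteness, the three policies can be written as updates on the index of
the focus point in the word vector T. The sketch below (in Python) is our own
rendering, not the authors' code; it assumes the word eliminated by an L or R
action has already been removed from T, and after S the focus simply moves
one position to the right:

    def next_focus(policy, focus):
        # Focus update after an L or R action; `focus` indexes the left word
        # of the pair just considered, in the already-updated word vector T.
        if policy == "start_over":
            return 0                   # back to the first word in T
        if policy == "stay":
            return focus               # next pair: (T[focus], T[focus+1])
        if policy == "step_back":
            return max(focus - 1, 0)   # move to the previous word on the left
        raise ValueError(policy)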
In practice, different placement policies have a significant effect on the
number of pairs considered by the algorithm and, therefore, on the final
accuracy.2
The following analysis justifies the Step Back policy. We claim that
if Step Back is used, the algorithm will not waste any action. Thus, it
achieves the goal of minimizing the number of actions taken in a pipeline
algorithm. Notice that using this policy, when L is taken, the pair (a, b) is
reconsidered, but with new information, since now it is known that c is the
child of b. Although this seems wasteful, we will show in Lemma 1 that this
movement is necessary in order to reduce the number of actions.
As mentioned above, each of these policies yields the correct tree. Ta-
ble 1 provides an experimental comparison of the three policies in terms of
the number of actions required to build a tree. As a function of the focus
point placement policy, and using the correct action for each pair considered,
it shows the number of word pairs considered in the process of generating all
the trees for section 23 of the Penn Treebank. It is clear from Table 1 that the
policies result in very different numbers of actions and that Step Back is
the best choice. Note that since the actions are the gold-standard actions,
2 Note that (Yamada & Matsumoto 2003) mention that they move the focus point back
after R, but do not state what they do after executing L actions, or why. Yamada
(p.c. 2006) indicates that they also moved the focus point back after L.
the policy affects only the number of S actions used, and not the L and
R actions, which are a direct function of the correct tree. The number of
required actions in the test stage (when the actions taken are based on a
learned classifier) shows the same trend and consequently the Step Back
also gives the best dependency accuracy. The parsing algorithm with this
policy is given in Algorithm 2. Note that a good policy should always try to
consider a child before considering its parent, in order to reduce the number
of decisions. As we will show later, Step Back has several properties that
make it a good policy.
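
Since Algorithm 2 itself is not reproduced here, the following sketch (our
reconstruction, with hypothetical names; the WaitLeft action discussed in
Section 3 is omitted) illustrates the control structure of a one-pass parse
under the Step Back policy:

    def parse(words, predict_action):
        # Bottom-up parsing with Step Back; predict_action(a, b, arcs)
        # returns "L", "R" or "S" for the pair (a, b).
        T = list(words)
        arcs = []                      # collected (head, dependent) pairs
        focus, rounds = 0, 0
        while len(T) > 1 and rounds <= len(words):
            if focus >= len(T) - 1:    # end of T: start a new round
                focus, rounds = 0, rounds + 1   # rarely needed in practice
                continue
            a, b = T[focus], T[focus + 1]
            action = predict_action(a, b, arcs)
            if action == "S":          # shift: move the focus one word right
                focus += 1
            elif action == "L":        # b is a child of a
                arcs.append((a, b))
                del T[focus + 1]       # eliminate the child word from T
                focus = max(focus - 1, 0)   # Step Back
            else:                      # "R": a is a child of b
                arcs.append((b, a))
                del T[focus]           # eliminate the child word from T
                focus = max(focus - 1, 0)   # Step Back
        return arcs

With a perfect action predictor the loop finishes in a single round, as shown
below; the rounds guard only protects the sketch against an imperfect
predictor that never attaches some word.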
2.2 Correctness and pipeline properties
We can prove several properties of our algorithm. First we show that the
algorithm builds the dependency tree in only one pass over the sentence.
Then, we show that the algorithm does not waste actions in the sense that it
never considers a word pair twice in the same situation. Consequently, this
shows that under the assumption of a perfect action predictor, our algorithm
makes the smallest possible number of actions, among all algorithms that
build a tree sequentially in one pass. Finally, we show that the algorithm
requires only a linear number of actions. Note that this may not be true if
the action classifier is not perfect, and one can contrive examples in which
an algorithm that makes several passes on a sentence can actually make
fewer actions than a single pass algorithm. In practice, however, as our
experimental data shows, this is unlikely.
The following analysis is made under the assumption that the action
classifiers are perfect. While this assumption does not hold in practice, it
makes the following analysis tractable, and still gives very good insight
into the practical properties of our algorithm, given that, as we show, the
action classifier is quite accurate, even if not perfect.
For example, in Lemma 1, we show that the algorithm requires only
one round to parse a sentence under this assumption. In practice, in the
evaluation stage, we allow the parsing algorithm to run multiple rounds.
That is, when the focus point reaches the last word in T , we will set the
focus point at the beginning of the sentence if the tree is not complete. We
found that over 99% of the sentences can be parsed in a single round and
that the average number of rounds needed to parse a sentence (averaged
over all sentences in the test corpus) is 1.01. Therefore, we believe that
the following analysis, done under the assumption of gold actions, can still
be quite useful.
Before providing our argument, it is important to recall that one child word
will be eliminated from T after performing Left or Right. The Step Back
policy and this elimination procedure are the key points in this lemma.
Note that in Algorithm 2, we use the focus point to choose the word
pair. Since we move the focus point in the word vector T , we can only
consider “consecutive” pairs in T . Note that the elimination after Left or
Right will generate new “consecutive” word pairs in T .
will be considered again only if the focus point “steps back”. Therefore, we
consider two cases here. First, if the next action following S is L, b will get a
new child and the algorithm will step back and consider (a, b) again. Since
b gets new information, it is reasonable to check (a, b) again to see if the
algorithm can recover the deferred action or not. Second, if the next action
following S is R, b will be eliminated from T and (a, b) will not be considered
again. ✷
The next lemma claims a stronger property: the algorithm requires only
O(n) actions, where n is the length of the input sentence.
m ≤ n − 1 + 2l + 2r
where l and r are the number of L and R actions taken, respectively. Note
that l + r ≤ n − 1 since the sentence has only n words. Therefore, m ≤ 3n
and the total number of actions is bounded by 3n. ✷
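
Spelled out, the final step of the proof combines this bound with l + r ≤ n − 1:

    m ≤ n − 1 + 2l + 2r = (n − 1) + 2(l + r) ≤ (n − 1) + 2(n − 1) = 3n − 3 ≤ 3n.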
[Figure 3: a four-word sentence (w, x, y, z); top: the correct dependency relations; bottom: the tree after x is mistakenly attached as a child of w]
Let (w, x, y, z) be a sentence of four words, and assume that the correct
dependency relations are as shown in the top part of Figure 3. If the sys-
tem mistakenly predicts that x is a child of w before y and z become x's
children, we can only consider the relationship between w and y in the next
stage. Consequently, we will never find the correct parent for y and z. The
previous prediction error propagates and impacts future predictions. On
the other hand, if the algorithm makes a correct prediction, in the next
State: encodes the configuration of the environment (in the context of the
dependency parsing this includes the sentence, the focus point and the
current parent and children for each word). Note that State changes
when a prediction is made and that the features extracted for the
action classifier also depend on State.
The search algorithm will perform a search of length Depth. Additive scoring
is used to score the sequence, and the first action in this sequence is selected
and performed. Then, the State is updated, the new features for the action
classifiers are computed and search is called again.
One interesting property of this framework is that it allows the use of
future information in addition to past information. The pipeline model
naturally allows access to all the past information. Since the algorithm uses
a look-ahead policy, it also uses future predictions. The significance of this
becomes clear in Section 3. There are several parameters, in addition to
Depth that can be used to improve the efficiency of the framework. For
example, given that the action predictor is a multi-class classifier, we do
not need to consider all future possibilities in order to determine the current
action. In our experiments, we only consider the two actions with the highest
scores at each level (which was shown to produce almost the same accuracy
as considering all four actions). Note that our look-ahead search is a local
search, so the depth is usually small. Furthermore, we can keep a cache of
the search paths of the look-ahead search for the i-th action. These paths
can be reused in the look-ahead search for the (i + 1)-th action. Therefore,
our look-ahead search is not very expensive.
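
The procedure can be sketched as follows; this is our reconstruction under
the stated assumptions (a beam of the two top-scoring actions per level,
additive scoring over the sequence, and no caching of search paths), with
score_actions and apply_action as hypothetical interfaces to the action
classifier and the State update:

    import copy

    def choose_action(state, score_actions, apply_action, depth=4, beam=2):
        # Depth-limited look-ahead: score action sequences additively and
        # return the first action of the best-scoring sequence.
        # score_actions(state) -> list of (action, score) pairs.
        def best_tail(state, d):
            if d == 0:
                return 0.0
            top = sorted(score_actions(state), key=lambda p: -p[1])[:beam]
            return max(s + best_tail(apply_action(copy.deepcopy(state), a), d - 1)
                       for a, s in top)
        top = sorted(score_actions(state), key=lambda p: -p[1])[:beam]
        return max(top, key=lambda p: p[1] + best_tail(
            apply_action(copy.deepcopy(state), p[0]), depth - 1))[0]

Setting depth=1 recovers the greedy parser with no search.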
3 Experimental study
In this section, we describe a set of experiments done to investigate the
properties of our algorithm and evaluate it. All the experiments here were
done on English. Results on other languages are described in Section 4.
We use the standard corpus for this task, the Penn Treebank (Marcus
et al. 1993). The training set consists of sections 02 to 21 and the test set
is section 23. The pos tags for the evaluation data sets were provided by
the tagger of (Toutanova et al. 2003) (which has an accuracy of 97.2% on
section 23 of the Penn Treebank).
where acti denotes the normalized activation value for the i-th class given
the feature vector x. The activation value comes directly from our machine
learning model.
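
The normalization itself is not reproduced in the text; one standard choice
consistent with the description (an assumption on our part, not the authors'
stated formula) is a softmax over the raw scores yi of the multi-class action
classifier:

    acti = exp(yi) / Σj exp(yj)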
3.2 Features
For each word pair (w1, w2) we use the words, their pos tags, and also these
features of the children of w1 and w2. We also include the words and pos
tags of the 2 words before w1 and the 4 words after w2 (as in (Yamada &
Matsumoto 2003)). The key additional feature we use, relative to (Yamada &
Matsumoto 2003), is that we include the previously predicted action as a fea-
ture. We also add conjunctions of the above features to ensure the expressive-
ness of the model. Yamada & Matsumoto (2003) make use of polynomial
kernels of degree 2, which is equivalent to using even more conjunctive features
(Cumby & Roth 2003). Overall, the average number of active features in
an example is about 50.
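
Schematically, the feature extraction for a pair can be rendered as below;
this is our illustration, the feature-name strings are invented, and the
conjunction step is only indicated:

    def pair_features(sent, i, j, children, prev_action):
        # sent[k] = (word, pos tag); the pair is (w1, w2) = (sent[i], sent[j]);
        # children[k] lists indices of words already attached to word k.
        f = []
        for name, k in (("w1", i), ("w2", j)):
            word, pos = sent[k]
            f += [f"{name}={word}", f"{name}_pos={pos}"]
            for c in children.get(k, ()):
                f += [f"{name}_child={sent[c][0]}",
                      f"{name}_child_pos={sent[c][1]}"]
        for k in range(max(0, i - 2), i):               # 2 words before w1
            f += [f"ctx={sent[k][0]}", f"ctx_pos={sent[k][1]}"]
        for k in range(j + 1, min(len(sent), j + 5)):   # 4 words after w2
            f += [f"ctx={sent[k][0]}", f"ctx_pos={sent[k][1]}"]
        f.append(f"prev_action={prev_action}")          # the key extra feature
        return f                                        # conjunctions omitted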
3.4 Results
We first present the results of several of the experiments that were in-
tended to help us analyze and understand some of the design decisions
in our pipeline algorithm.
To see the effect of the additional action, we present in Table 2 a com-
parison between a system that does not have the WaitLeft action (similar
to Yamada & Matsumoto's (2003) approach) and one that does. For
                              Action set
    Evaluation metric    w/o WaitLeft    w/ WaitLeft
    Dependencies             90.27           90.53
    Root                     90.73           90.76
    Sentence                 39.28           39.74
    Leaf                     93.87           93.94

Table 2: The significance of the action WaitLeft
a fair comparison, in both cases, we do not use the look-ahead procedure.
Note that, as stated above, the action WaitRight is never needed for our
parsing algorithm. It is clear that adding WaitLeft increases the accuracy
significantly.
    Evaluation metric    Depth=1    Depth=2    Depth=3    Depth=4
    Dependencies          90.53      90.67      90.69      90.79
    Root                  90.76      91.51      92.05      92.26
    Sentence              39.74      40.23      40.52      40.68
    Leaf                  93.94      93.96      93.94      93.95

Table 3: The effect of different Depth settings. Increasing the Depth
usually improves the accuracy
Table 3 investigates the effect of the look-ahead and presents results with
different Depth parameters (Depth = 1 means "no search"), showing a con-
sistent trend of improvement.
Table 4 breaks down the results as a function of the sentence length;
it is especially noticeable that the system also performs very well for long
sentences, another indication for its global performance robustness.
                              Sentence length
    Evaluation metric    <11    11-20    21-30    31-40    >40
    Dependencies         93.4    92.4     90.4     90.4    89.7
    Root                 96.7    93.7     91.8     89.8    87.9
    Sentence             85.2    56.1     32.5     16.8     8.7
    Leaf                 94.6    94.7     93.4     94.0    93.3

Table 4: The effect of sentence length (Depth = 4). Note that our
algorithm performs very well on long sentences
Table 5 shows the results with three settings of the pos tagger. The best
result is, naturally, when we use the gold standard also in testing. However,
it is worthwhile noticing that it is better to train with the same pos tagger
available in testing, even if its performance is somewhat lower.
Finally, Table 6 compares the performances of several of the state of the
art dependency parsing systems with ours. When comparing with other
dependency parsing systems it is especially worth noticing that our system
gives significantly better accuracy on completely parsed sentences.
Kromann 2003, van der Beek et al. 2002, Brants et al. 2002, Kawata &
Bartels 2000, Afonso et al. 2002, Džeroski et al. 2006, Civit Torruella &
Martı́ Antonı́n 2002, Nilsson et al. 2005, Oflazer et al. 2003, Atalay et
al. 2003). We apply the techniques described earlier in this section to extend
our algorithm to non-projective languages and to the prediction of both
head words and dependency labels.
Interestingly, we can show that the trees produced by our algorithm are
relatively good even for long sentences, and that our algorithm is doing
especially well when evaluated globally, at a sentence level, where our results
are significantly better than those of existing approaches—perhaps showing
that the design goals were achieved.
We also present preliminary results for the multilingual datasets pro-
vided in the conll-x shared task. These show that dependency parsers
based on a pipeline approach can handle many different languages. Re-
cently, (McDonald & Pereira 2006) proposed using “second-order” feature
information when applying Eisner's algorithm, together with a global
reranking algorithm. The proposed method outperforms (McDonald et al.
2005) and the approach proposed in this paper. Although there is more
room for comparison, we believe that the key advantage there is simply
using a better feature set. Overall, the bottom line from this work, as well
as from the conll-x shared task, is that a pipeline approach is competitive
with a global reranking approach (Buchholz & Marsi 2006).
Beyond enhancing our current results by introducing more and better
features, we intend to consider incorporating more information sources (e.g.,
shallow parsing results) into the pipeline process, to investigate different
search algorithms in the pipeline framework, and to study approaches that
predict dependencies and their types as a single task.
Acknowledgements. We are grateful to Ryan McDonald and the organizers
of conll-x for providing the annotated data set and to Vasin Punyakanok for
useful comments and suggestions. This research was supported by the Advanced
Research and Development Activity (arda)’s Advanced Question Answering for
Intelligence (aquaint) Program and a doi grant under the Reflex program.
REFERENCES
Abeillé, A. ed. 2003. Treebanks: Building and Using Parsed Corpora. (= Text,
Speech and Language Technology, 20). Dordrecht: Kluwer Academic.
Afonso, S., E. Bick, R. Haber & D. Santos. 2002. “Floresta sintá(c)tica: A tree-
bank for Portuguese”. 3rd Int. Conf. on Language Resources and Evaluation
(LREC ), 1698-1703. Las Palmas, Canary Islands, Spain.
Atalay, N. B., K. Oflazer & B. Say. 2003. “The Annotation Process in the Turk-
ish Treebank”. 4th Int. Workshop on Linguistically Interpreted Corpora
(LINC ). Budapest, Hungary.
Aho, A.V., R. Sethi & J. D. Ullman. 1986. Compilers: Principles, Techniques,
and Tools. Reading, Mass.: Addison-Wesley.
Attardi, Giuseppe. 2006. “Experiments with a Multilanguage Non-projective De-
pendency Parser”. 10th Conf. on Computational Natural Language Learning
(CoNLL-X ), 166-170. New York.
Brants, S., S. Dipper, S. Hansen, W. Lezius & G. Smith. 2002. “The TIGER Tree-
bank”. 1st Workshop on Treebanks and Linguistic Theories (TLT ). Sozopol,
Bulgaria.
Böhmová, A., J. Hajič, E. Hajičová & B. Hladká. 2003. “The PDT: A 3-level
Annotation Scenario”. In (Abeillé 2003), ch. 7.
Buchholz, Sabine & Erwin Marsi. 2006. “Conll-X Shared Task on Multilingual
Dependency Parsing”. 10th Conference on Computational Natural Language
Learning (CoNLL-X ), 149-164. New York.
Carlson, A., C. Cumby, J. Rosen & D. Roth. 1999. “The SNoW learning architec-
ture”. Technical Report UIUCDCS-R-99-2101. Urbana, Illinois: University
of Illinois at Urbana-Champaign (UIUC) Computer Science Department.
Chang, M.-W., Q. Do & D. Roth. 2006a. “A Pipeline Framework for Dependency
Parsing”. Annual Meeting of the Association for Computational Linguistics
(ACL), 65-72. Sydney, Australia.
Chang, M.-W., Q. Do & D. Roth. 2006b. “A Pipeline Model for Bottom-up De-
pendency Parsing”. 10th Conf. on Computational Natural Language Learning
(CoNLL-X ), 186-190. New York City.
Chen, K., C. Luo, M. Chang, F. Chen, C. Chen, C. Huang & Z. Gao. 2003. “Sinica
Treebank: Design Criteria, Representational Issues and Implementation”. In
(Abeillé 2003), ch. 13, 231-248.
Civit Torruella, M. & M. A. Martı́ Antonı́n. 2002. “Design Principles for a
Spanish Treebank”. 1st Workshop on Treebanks and Linguistic Theories
(TLT ). Sozopol, Bulgaria.
Cumby, C. & D. Roth. 2003. “On Kernel Methods for Relational Learning”. Int.
Conf. on Machine Learning (ICML), 107-114. Washington, D.C.
Džeroski, S., T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky & A. Žele. 2006.
“Towards a Slovene Dependency Treebank”. 5th Int. Conf. on Language
Resources and Evaluation (LREC). Geneva, Switzerland.
Eisner, J. 1996. “Three New Probabilistic Models for Dependency Parsing: An
Exploration”. Int. Conf. on Computational Linguistics (COLING), 340-345.
Copenhagen, Denmark.
Haghighi, A., A. Ng & C. Manning. 2006. “Robust Textual Inference via Graph
Matching”. Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing, 387-394. Vancouver,
Canada.
Hajič, J., O. Smrž, P. Zemánek, J. Šnaidauf & E. Beška. 2004. “Prague Arabic
Dependency Treebank: Development in Data and Tools”. NEMLAR Int.
Conf. on Arabic Language Resources and Tools, 110-117. Cairo, Egypt.
Kawata, Y. & J. Bartels. 2000. “Stylebook for the Japanese Treebank in VERB-
MOBIL”. Verbmobil-Report 240. Tübingen, Germany: Seminar für Sprach-
wissenschaft, Universität Tübingen.
Kromann, Matthias Trautner. 2003. “The Danish Dependency Treebank and the
DTAG Treebank Tool”. 2nd Workshop on Treebanks and Linguistic Theories
(TLT ) ed. by Joakim Nivre & Erhard Hinrichs, 217-220. Växjö, Sweden.
Lin, D. 1994. “Principar—An Efficient, Broad-coverage, Principle-based Parser”.
Int. Conf. on Computational Linguistics (COLING), 482-488. Kyoto, Japan.
McDonald, R., K. Crammer & F. Pereira. 2005. “Online Large-margin Training
of Dependency Parsers”. Annual Meeting of the Association for Computa-
tional Linguistics (ACL’05 ), 91-98. Ann Arbor, Michigan.
McDonald, R. & F. Pereira. 2006. “Online Learning of Approximate Dependency
Parsing Algorithms”. 11th Conf. of the European Chapter of the Association
for Computational Linguistics (EACL’06 ), 81-88. Trento, Italy.
Marciniak, T. & M. Strube. 2005. “Beyond the Pipeline: Discrete Optimiza-
tion in NLP”. 9th Conference on Computational Natural Language Learning
(CoNLL-2005 ), 136-143. Ann Arbor, Michigan.
Marcus, M. P., B. Santorini & M. Marcinkiewicz. 1993. “Building a Large Anno-
tated Corpus of English: The Penn Treebank”. Computational Linguistics
19:2.313-330.
Nilsson, J., J. Hall & J. Nivre. 2005. “MAMBA Meets TIGER: Reconstructing
a Swedish Treebank from Antiquity”. 15th Nordic Conference of Computa-
tional Linguistics (NODALIDA) Special Session on Treebanks. Copenhagen
Studies in Language 32, 119-132. Copenhagen, Denmark.
Nivre, Joakim. 2003. “An Efficient Algorithm for Projective Dependency Pars-
ing”. Int. Workshop on Parsing Technology (IWPT ), 149-160. Nancy,
France.
Nivre, J. & J. Nilsson. 2005. “Pseudo-projective Dependency Parsing”. 43rd
Annual Meeting of the Association for Computational Linguistics (ACL),
99-106. Ann Arbor, Michigan.
Nivre, J. & M. Scholz. 2004. “Deterministic Dependency Parsing of English
Text”. 20th Int. Conference on Computational Linguistics (COLING-04 ),
64-70. Geneva, Switzerland.
Oflazer, K., B. Say, D. Zeynep Hakkani-Tür & G. Tür. 2003. “Building a Turkish
Treebank”. In (Abeillé 2003), ch. 15.
Punyakanok, V., D. Roth & W. Yih. 2005. “The Necessity of Syntactic Parsing
for Semantic Role Labeling”. Int. Joint Conference on Artificial Intelligence
(IJCAI ), 1117-1123. Edinburgh, Scotland, UK.
Ratnaparkhi, Adwait. 1997. “A Linear Observed Time Statistical Parser Based
on Maximum Entropy Models”. 2nd Conf. on Empirical Methods in Natural
Language Processing (EMNLP ), 1-10. Somerset, New Jersey.
Roth, Dan. 1998. “Learning to Resolve Natural Language Ambiguities: A Uni-
fied Approach”. National Conference on Artificial Intelligence (AAAI ), 806-
813. Montréal, Québec, Canada.
Roth, D. & W. Yih. 2004. “A Linear Programming Formulation for Global Infer-
ence in Natural Language Tasks”. Annual Conf. on Computational Natural
Language Learning (CoNLL) ed. by Hwee Tou Ng & Ellen Riloff, 1-8. Boston,
Mass.
Toutanova, K., D. Klein & C. Manning. 2003. “Feature-rich Part-of-speech Tag-
ging with a Cyclic Dependency Network”. Human Language Technology Con-
ference (HLT-NAACL), 173-180. Edmonton, Canada.
van der Beek, L., G. Bouma, R. Malouf & G. van Noord. 2002. “The Alpino De-
pendency Treebank”. Computational Linguistics in the Netherlands (CLIN ).
Groningen, The Netherlands.
Yamada, H. & Y. Matsumoto. 2003. “Statistical Dependency Analysis with Sup-
port Vector Machines”. Int. Workshop on Parsing Technologies (IWPT ),
195-206. Nancy, France.
How Does Treebank Annotation Influence Parsing?
Or How Not to Compare Apples and Oranges
Sandra Kübler
SfS-CL, University of Tübingen
Abstract
In this paper, we will investigate the influence which different deci-
sions in the annotation schemes of treebanks have on parsing. The
investigation uses the comparison of similar treebanks of German,
Negra and TüBa-D/Z, which are subsequently modified to allow a
comparison of the differences. The results show that some modifications
have a negative effect while others have a positive influence.
1 Introduction
In the last decade, the Penn treebank (Marcus et al. 1994) has become
the standard data set for evaluating pcfg parsers. The fact that most
parsers are solely evaluated on this specific data set leaves the question
unanswered to what degree these results depend on the annotation scheme
of the treebank. This point becomes more urgent in the light of more recent
publications on parsing the Penn or the Negra treebank such as (Dubey
& Keller 2003; Klein & Manning 2003), which show that parsing results
can be improved if the parser is optimized for the input data. Klein &
Manning (2003), e.g., gain more than 1 point in F-score when they mark
each phrase that immediately dominates a verb, thus showing shortcom-
ings in the annotation scheme. These findings raise the question whether
such shortcomings in the annotation can be avoided during the design of
the annotation scheme of a treebank. The question, however, can only be
answered if we know which design decisions are favorable for pcfg parsing.
In this chapter, we will investigate how different decisions in the anno-
tation scheme influence parsing results. In order to answer this question,
however, a method needs to be developed which allows the comparison of
different annotation decisions without comparing unequal categories.
For a comparison of different annotation schemes, one ideally needs one
treebank with two different sets of (manual) annotations. An automatic
conversion from one annotation scheme with flatter structures to another
with deeper structures is impossible because of the high probability that
systematic errors are introduced by the conversion. If no such doubly anno-
tated text is available, two treebanks which are based on the same language
and the same text genre and which are annotated with different annotation
[Figure 1: a Negra tree for the sentence “Im Rathaus-Foyer wird neben dem Fund auch die Forschungsgeschichte zum Hochheimer Spiegel präsentiert.”]
Both treebanks use German newspapers as their data source: the Frank-
furter Rundschau newspaper for Negra and the ‘die tageszeitung’ (taz) news-
paper for TüBa-D/Z. Negra comprises 20 000 sentences, TüBa-D/Z 15 000
sentences. Both treebanks use an annotation framework that is based on
phrase structure grammar and that is enhanced by a level of predicate-
argument structure. Despite all these similarities, the treebank annotations
differ in four important aspects: 1) Negra does not allow unary branching
(nodes with one daughter) while TüBa-D/Z does; 2) in Negra, phrases re-
ceive a flat annotation while TüBa-D/Z uses phrase internal structure; 3)
Negra uses crossing branches to represent long-distance relationships while
TüBa-D/Z uses a projective tree structure combined with functional labels;
4) Negra encodes grammatical functions in a combination of structural and
functional labeling while TüBa-D/Z uses a combination of topological fields
(Höhle 1986) and functional labels, which results in a flatter structure on
[Figure 2: a TüBa-D/Z tree for the sentence “Der Autokonvoi mit den Probenbesuchern fährt eine Straße entlang, die noch heute Lagerstraße heißt.”]
the clause level. The two treebanks also use different notions of gram-
matical functions: TüBa-D/Z defines 36 grammatical functions covering
head and non-head information, as well as subcategorization for comple-
ments and modifiers. Negra utilizes 48 grammatical functions. Apart from
commonly accepted grammatical functions, such as SB (subject) or OA
(accusative object), Negra grammatical functions also comprise a more ex-
tended notion, e.g. RE (repeated element) or RC (relative clause). Because
of the differences in the definition of grammatical functions, a comparison
of grammatical functions is only possible in a task-based evaluation within
an application that uses these grammatical functions as input.
Figure 1 shows a typical tree from the Negra treebank. The syntactic
categories are shown in circular nodes, the grammatical functions as edge
labels in square boxes. The PP “Im Rathaus-Foyer” (in the foyer of the
town hall) and the NP “auch die Forschungsgeschichte zum Hochheimer
Spiegel” (also the research history of the Hochheimer Spiegel) do not contain
internal structure, the noun kernel elements are marked via the functional
labels NK. The fronted PP is grouped under the VP, resulting in crossing
branches. Figure 2 shows a typical example from TüBa-D/Z. Here, the
complex NP “Der Autokonvoi mit den Probenbesuchern” (the car convoy
with the visitors of the rehearsal) contains an NP and the PP with internal
NP, both NPs being explicitly annotated (as NX). The tree also contains
several unary nodes, e.g. the verb phrases “fährt” (goes) and “heißt” (is
called) or the street name “Lagerstraße”. The main ordering principle on
the clause level is provided by the topological fields; long-distance relation-
ships, such as the relation between the noun phrase “eine Straße” (a street)
and the extraposed relative clause “die heute noch Lagerstraße heißt” (which
is still called Lagerstraße), are marked via functional labeling: OA-MOD
specifies that this clause modifies the accusative object OA.
                                     Negra      TüBa-D/Z
    crossing brackets                 1.07        2.27
    labeled recall                   70.09%      84.15%
    labeled precision                67.82%      87.94%
    labeled F-score                  68.94       86.00
    crossing brackets                 1.04        1.93
    function labeled recall          52.75%      73.65%
    function labeled precision       51.85%      76.13%
    function labeled F-score         52.30       74.87
Fig. 3: Negra (left, “Ich stehe ständig unter Strom.”) and TüBa-D/Z (right, “Das Ergebnis ist völlig offen.”) trees with one-word phrases
Fig. 4: The sentence from Figure 2 flattened and without unary branches
[Figure: the sentence from Figure 1 with topological fields and phrase-internal structure added]
Table 2 gives the results for the evaluation of the two types of tests: the
upper half of the table gives results for parsing with syntactic node labels
only while the lower half of the table gives results for parsing syntactic
categories and grammatical functions. The results show that all transfor-
mations approximate the other treebank in annotation as well as in parser
performance.
REFERENCES
Collins, Michael. 1999. Head-Driven Statistical Models for Natural Language
Parsing, Ph.D. dissertation. University of Pennsylvania. Philadelphia.
Dubey, Amit & Frank Keller. 2003. “Probabilistic Parsing for German using
Sister-Head Dependencies”. 41st Annual Meeting of the Association for Com-
putational Linguistics (ACL’2003 ), 96-103. Sapporo, Japan.
Höhle, Tilman. 1986. “Der Begriff ‘Mittelfeld’, Anmerkungen über die Theo-
rie der topologischen Felder”. Akten des Siebten Internationalen Germanis-
tenkongresses 1985, 329-340. Göttingen, Germany.
Klein, Dan & Christopher Manning. 2003. “Accurate Unlexicalized Parsing”.
Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics (ACL’2003 ), 423-430. Sapporo, Japan.
Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann
Bies, Mark Ferguson, Karen Katz & Britta Schasberger. 1994. “The Penn
Treebank: Annotating Predicate Argument Structure”. Proceedings of the
Human Language Technology Workshop (HLT’94 ), 114-119. Plainsboro,
New Jersey.
Schmid, Helmut. 2000. LoPar: Design and Implementation. Technical report.
Universität Stuttgart, Germany.
Skut, Wojciech, Thorsten Brants, Brigitte Krenn & Hans Uszkoreit. 1998. “A
Linguistically Interpreted Corpus of German Newspaper Texts”. ESSLLI
Workshop on Recent Advances in Corpus Annotation. Saarbrücken, Ger-
many.
Telljohann, Heike, Erhard Hinrichs & Sandra Kübler. 2004. “The TüBa-D/Z
Treebank: Annotating German with a Context-Free Backbone”. Proceed-
ings of the Fourth International Conference on Language Resources and
Evaluation (LREC’2004 ), 2229-2235. Lisbon, Portugal.
Thielen, Christine & Anne Schiller. 1994. “Ein kleines und erweitertes Tagset
fürs Deutsche”. Lexikon & Text ed. by H. Feldweg & E. Hinrichs, 215-226.
Tübingen: Niemeyer.
The SenSem Project: Syntactico-Semantic
Annotation of Sentences in Spanish
Abstract
This paper presents SenSem, a project that aims to systematize the
behavior of verbs in Spanish at the lexical, syntactic and semantic
level. As part of the project, two resources are being built: a corpus
where sentences are associated to their syntactico-semantic interpre-
tation and a lexicon where each verb sense is linked to the corre-
sponding annotated examples in the corpus. We also discuss some
tendencies that can be observed in the current state of development.
1 Introduction
The conjunction of all this information will provide a very fine-grained de-
scription of the syntactico-semantic interface at sentence level, useful for ap-
plications that require an understanding of sentences beyond shallow pars-
ing. In the fields of automatic understanding, semantic representation and
automatic learning systems, a resource of this type is very valuable.
In the rest of the paper we will describe the corpus annotation process
in more detail and examples will be provided. Section 2 offers a general
overview of other projects similar to SenSem. In section 3, the levels of
annotation are discussed, and the process of annotation is described in sec-
tion 4. We then proceed to present the results obtained to date and the
current state of annotation, and we put forward some tentative conclusions
obtained from the results of the annotation thus far.
2 Related work
As shown by Levin (1993) and others (Jones et al. 1994, Jones 1995, Kipper
et al. 2000, Saint-Dizier 1999, Vázquez et al. 2000), syntax and semantics are
highly interrelated. By describing the way linguistic layers inter-relate, we
can provide better verb descriptions since generalizations from the lexicon
that previously belonged to the grammar level of linguistic description can
be established (lexicalist approach).
Within the area of Computational Linguistics, it is common to deal
with both fields independently (Grishman et al. 1994, Corley et al. 2001).
In other cases, the relationship established between syntactic and semantic
components is not fully exploited and only basic correlations are established
(Dorr et al. 1998, McCarthy 2000). We believe this approach is interesting
even though it does not take full advantage of the existing link between
syntax and semantics.
Furthermore, we think that in order to coherently characterize the syn-
tactico-semantic interface, it is necessary to start by describing linguistic
data from real language. Thus, a corpus annotated at syntactic and seman-
tic levels plays a crucial role in acquiring this information appropriately.
In recent years, a number of projects related to the syntactico-semantic
annotation of corpora have been carried out. The length of the present
paper does not allow us to consider them all here, but we will mention a
few of the most significant ones.
FrameNet (Johnson & Fillmore 2000) is a lexicographic resource that
describes approximately 2,000 items, including verbs, nouns and adjectives
that belong to diverse semantic domains (communication, cognition, per-
ception, movement, space, time, transaction, etc.). Each lexical entry has
examples from the British National Corpus, manually annotated with ar-
gument structure and, in some cases, also adjuncts.
3 Levels of annotation
Were we annotating the verb iniciar –begin– we would ignore the partici-
pants of the main sentence and only take into account the elements within
the clause. If we were annotating the verb hacer –make– we would anno-
tate the subject to include the relative clause as a whole, without further
analysis, and with the word “president” as the head of the whole structure.
4 Annotation process
The SenSem corpus will describe the 250 most frequent verbs of Spanish. For
each verb, 100 examples are extracted randomly from 13 million words of
corpora obtained from the electronic version of the Spanish newspaper El
Periódico de Catalunya, disregarding uses of the verb as an auxiliary and
idioms.
The manual annotation of examples is carried out via a graphical inter-
face, seen in Figure 1, where the three levels are clearly distinguished.
The interface displays one sentence at a time. First, a verb sense is selected
from the list of possible senses, and its prototypical event structure and
semantic roles are displayed. Then, constituents are identified and analyzed
by selecting the words that belong to them; the heads of the arguments and
their possible metaphorical usage are also signaled. Finally, annotators
specify any applicable semantics at clause level (e.g., anticausative, reflexive,
stative, etc.).
At the constituent level, the semantic role chosen for each phrase is
often predictive of the other labels of that phrase: agents tend to be noun
phrases with subject function, themes tend to be noun phrases with subject
or object function (if they occur in passives, antiagentives, anticausatives
or statives), etc. These associations are exploited to semi-automate the
annotation process: once a role is selected, the category and function most
frequently associated with it and its role as a verb argument are pre-selected
so that the annotator only has to validate the information.
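
As an illustration of this pre-selection step (the role-to-label mapping
below is invented for the sketch; the project derives it from corpus
frequencies):

    # Most frequent (syntactic category, function, argumenthood) per semantic
    # role; the values shown are illustrative only.
    ROLE_DEFAULTS = {
        "agent":    ("NP", "subject", "argument"),
        "theme":    ("NP", "object",  "argument"),
        "location": ("PP", "adjunct", "adjunct"),
    }

    def preselect(role):
        # Pre-fill the annotation interface; the annotator then validates
        # or corrects the suggestion.
        return ROLE_DEFAULTS.get(role, (None, None, None))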
The final corpus will be available to the linguistic community by means
of a web-based interface (http://grial.uab.es/projectes/sensem).
At this stage of the project1 , 77 verbs have already been annotated, which
implies that the corpus at this moment is made up of 7,700 sentences
(199,190 words). A total of 900 sentences out of these 7,700 have already
been validated, which means that a corpus of approximately 25,000 words
has already undergone the complete annotation process.
The annotated sentences show a predominance of events (74.26% of the
sentences) over processes (20.67%) and states (8.96%). This skewed distri-
bution of clause types, with a clear predominance of events, may be exclusive
to the journalistic genre. We have yet to investigate its distribution in other
genres.
As concerns syntactic functions, the most frequent category is the direct
object, covering 40% of the labels, well ahead of subjects (23%). This is
not surprising if we take into account that Spanish is a pro-
drop language. Prepositional objects (13%) and indirect objects (2%) are
less frequent.
As for semantic roles, themes are predominant, as would be expected
given that the most common syntactic function is that of direct object, and
that there is a high presence of antiagentive, anticausative and passive con-
structions. Within the different types of the semantic role theme, unaffected
themes (moved objects) appear most frequently.
Guidelines for annotation have been established (Vázquez et al. 2005). They
serve as a reference for annotators, and we believe they will increase the
overall consistency of the resulting corpus.
REFERENCES
Carletta, J., A. Isard, S. Isard, J. C. Kowtko, G. Doherty-Sneddon & A. H. Anderson.
1996. “HCRC Dialogue Structure Coding Manual”. HCRC Technical report
HCRC/TR-82. Edinburgh, U.K.: Univ. of Edinburgh.
Cohen, J. 1960. “A Coefficient of Agreement for Nominal Scales”. Educational
& Psychological Measure, vol. 20, 37-46.
Corley, S., M. Corley, F. Keller, M. W. Crocker & S. Trewin. 2001. “Finding
Syntactic Structure in Unparsed Corpora”. Computers and the Humanities
vol. 35, 81-94.
Dorr, B., M. A. Martı́ & I. Castellón. 1998. “Spanish EuroWordNet and LCS-
Based Interlingual MT”. Proceedings of the ELRA Congress. Granada, Spain.
http://citeseer.ist.psu.edu/dorr97spanish.html
Fellbaum, C. 1998. “A Semantic Network of English Verbs”, WordNet: An Elec-
tronic Lexical Database ed. by C. Fellbaum, 69-104. Cambridge, Mass.: MIT
Press.
Garcia de Miguel, J.M. & S. Comesaña. 2004. “Verbs of Cognition in Span-
ish: Constructional Schemas and Reference Points”, Linguagem, Cultura
e Cognição: Estudos de Linguı́stica Cognitiva ed. by A. Silva, A. Torres &
M. Gonçalves, 399-420. Coimbra: Almedina.
Grishman, Ralph, C. Macleod & A. Meyers. 1994. “Comlex Syntax: Build-
ing a Computational Lexicon”. International Conference on Computational
Abstract
The generation of referring expressions—noun phrases which speak-
ers use to identify intended referents for their hearers—is the most
well-established and agreed-upon subtask in natural language gen-
eration. This paper surveys the developments in the field that have
brought us to this point, and characterises the state of the art in the
area; we also point to unsolved problems and new directions waiting
to be pursued.
1 Introduction
Text planning is concerned with taking as input some high level commu-
nicative goal—for example, my desire to tell you what I did today—
and deriving from this some structured plan for what has to be said
to achieve this goal. Typically this plan is considered to be tree-like in
nature, corresponding to the hierarchical decomposition of the orig-
inal goal into subgoals, subgoals of those subgoals, and so on. So,
for example, my plan to tell you about my activities today might be
decomposed into the three subgoals of telling you what I did in the
morning, what I did in the afternoon, and what I did in the evening;
and each of these subgoals might in turn consist of lower-level goals,
and so on recursively until some level of atomicity, perhaps corre-
sponding to utterances that can be realised as single sentences or
clauses, is reached.
We might expect this tree structure to be labelled to indicate the
rhetorical relations that hold between the constituent elements of the
plan. So, for example, my goal to tell you about a particular event
that occurred might be preceded by a goal to explain some relevant
background information in order to allow you to understand the sig-
nificance of the event, and the relationship between these two goals
in the plan might be labelled as being one of Background. The termi-
nal nodes or leaves of this plan then correspond to the propositions
to be conveyed in the resulting output, each of these propositions
being a semantic chunk that might correspond to a major clause in
the language. These proposition-like elements are often referred to as
messages.
Micro-planning then uses these messages to build a sequence of sentence
plans that determine the content to be realised in each sentence of
the output. This may involve combining messages to build complex
sentences, a process sometimes referred to as aggregation. So, for
example, two propositions that might independently be realised as
the sentences I went to the zoo and Mary went to the zoo can instead
be combined to produce a more complex proposition that might be
realised as the sentence Both Mary and I went to the zoo.
More importantly for our purposes, the micro-planning stage is
generally considered to be the place where we determine how the
entities mentioned in the messages should be referred to. Up until
this point the entities are represented by symbolic identifiers, such as
person001, person002 and zoo167; in our present example, referring
expression generation determines that person001, the speaker, should
be identified by a first person singular pronominal form, person002
should be referred to by her given name, and zoo167 should be re-
ferred to simply in terms of the basic type of the object in question,
with no additional modifiers.
Linguistic realisation is the final stage of the process; this maps these
sentence plans, which are still in the form of semantic constructs,
into the appropriate lexico-syntactic material of the target natural
language to produce the final output surface forms. It is at this point
that the words I, Mary and zoo are finally selected.
Within this view of the architecture of natural language generation, our
interest here, then, is in a component process in the micro-planning stage:
the system has determined that it wants to refer to a given entity, indi-
cating this by incorporating the internal symbol for that entity in one or
more messages, and the task of referring expression generation is then to
determine the semantic content of a referring expression that identifies this
entity for the hearer.2
reasons about the effects on the hearer obtained by using particular elements
of content in a referring expression.
needed, since e1 and e3 have different brands. So, for e1 , the attributes color
and brand provide us with a distinguishing description.
On the other hand, if the intended referent was e2 , the color or size alone
would be sufficient; and if the intended referent was e3 , the brand alone
would be sufficient. Verifying that the algorithm does indeed provide these
results is left as an exercise for the reader.
In reality, since noun phrases always (or almost always) contain head
nouns, the head noun indicating the entity’s type is also incorporated into
the resulting description; for completeness, the algorithm in Figure 1 would
have to be augmented to ensure this. In the case of e1 above, the resulting
description might then be realised as the red Stabilo pen.
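
To make the algorithm's behaviour concrete, here is a minimal sketch of
such an incremental distinguishing-description builder over a domain
reconstructed from the running example; the brand of e3 and the color and
size values for e2 are our assumptions, chosen so that the outcomes stated
above hold:

    # Domain reconstructed from the example; some values are assumed.
    DOMAIN = {
        "e1": {"type": "pen", "color": "red",  "size": "small", "brand": "Stabilo"},
        "e2": {"type": "pen", "color": "blue", "size": "large", "brand": "Stabilo"},
        "e3": {"type": "pen", "color": "red",  "size": "small", "brand": "Bic"},
    }
    PREFERENCE = ["color", "size", "brand"]   # attributes tried in this order

    def distinguishing_description(referent):
        distractors = {e for e in DOMAIN if e != referent}
        description = {}
        for attr in PREFERENCE:
            value = DOMAIN[referent][attr]
            ruled_out = {e for e in distractors if DOMAIN[e][attr] != value}
            if ruled_out:                     # keep attributes that do some work
                description[attr] = value
                distractors -= ruled_out
            if not distractors:
                break
        description["type"] = DOMAIN[referent]["type"]   # add the head noun
        return description

For e1 this yields {color: red, brand: Stabilo, type: pen}, realisable as
“the red Stabilo pen”.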
The algorithm we have just presented is intended to be quite general and
domain-independent; wherever we can characterize the problem in terms
of the basic assumptions laid out earlier, this algorithm can be used to
4 Outstanding issues
In the foregoing we have surveyed, albeit somewhat briefly, the major as-
pects of referring expression generation that have been the focus of attention
in the development of algorithms over the last 15 years. There are many
other algorithms in the literature beyond those presented here, but the
problems addressed are essentially the same. In this section, we turn to
aspects of the problem that have not yet received the same level of detailed
exploration.
on generating these forms of reference, they have not received the degree of
attention that has been lavished on definite descriptions, and in particular
there has been little work that has seriously attempted to integrate the var-
ious forms of reference in a unified algorithmic framework. This is perhaps
most surprising in the case of pronominal reference, given that this is seen
as such a central problem in natural language understanding.
Other forms of definite reference have also been neglected: in particular,
there is no substantive work that explores the generation of associative
anaphoric expressions, such as the use of a reference like the cap in a context
where the hearer can infer that this must be a first mention of the cap of a
particular pen that is already salient in the discourse.
4.4 Evaluation
Despite the limitations discussed in the preceding sections, it turns out
that there are, as yet, no significant nlg applications that would provide
a thorough testing ground for the wide range of algorithms that have been
proposed; for most currently practical applications of language generation,
relatively simple techniques will suffice. As a consequence, the bulk of the
algorithms we have mentioned are explored by their designers in relatively
simple toy domains. However, there would clearly be some advantage in
being able to compare the performance of different algorithms against some
standard data set of desirable referring expressions. This raises a number
of questions about how such a corpus might be constructed, and how the
performance of any given algorithm might be measured; some of these issues
are discussed further in (van Deemter et al. 2006) and (Viethen & Dale
2006).
5 Conclusions
REFERENCES
Appelt, Douglas E. 1985. Planning English Sentences. Cambridge: Cambridge
University Press.
Creaney, Norman. 2002. “Generating Descriptions Containing Quantifiers: Ag-
gregation and Search”. Information Sharing: Reference and Presupposi-
tion in Language Generation and Interpretation ed. by Kees van Deemter
& Rodger Kibble (= CSLI Lecture Notes, 143), 265-294. Stanford: CSLI
Publications.
Dale, Robert. 1992. Generating Referring Expressions: Constructing Descrip-
tions in a Domain of Objects and Processes. Cambridge, Mass.: MIT Press.
Dale, Robert & Nicholas Haddock. 1991. “Content Determination in the Gener-
ation of Referring Expressions”. Computational Intelligence 7:4.252-265.
Abstract
This paper reports on a hybrid architecture for computational
anaphora resolution (Car) of German that combines a rule-based
pre-filtering component with a memory-based resolution module (us-
ing the Tilburg Memory Based Learner – TiMBL). The Car exper-
iments performed on the TüBa-D/Z treebank data corroborate the
importance of modelling aspects of discourse structure for robust,
data-driven anaphora resolution. The best result with an F-measure
of 0.734 achieved by these experiments outperforms the results re-
ported by Schiehlen (2004), the only other study of German Car
that is based on newspaper treebank data.
1 Introduction
While early work on Car was carried out almost exclusively in a rule-based
paradigm, there have been numerous studies during the last ten years that
have demonstrated that machine-learning and statistical approaches to Car
can offer competitive results to rule-based approaches. In particular, this
more recent work has shown that the hand-tuned weights for anaphora
3 Data
The present research focuses on German and utilizes the TüBa-D/Z (Hin-
richs et al. 2004), a large treebank of German newspaper text that has
been manually annotated with constituent structure and grammatical rela-
tions such as subject, direct object, indirect object and modifier. These types
of syntactic information have proven crucial in previous Car algorithms.
More recently, the TüBa-D/Z annotations have been further enriched to
also include anaphoric relations (Hinrichs et al. 2004), thereby making the
treebank suitable for research on Car. German constitutes an interest-
ing point of comparison to English since German exhibits a much richer
inflectional morphology and a relatively free word order at the phrase level.
4 Experiments
cases. For technical reasons, the numeric values are prefixed by a dash in
order for TiMBL to treat them as discrete rather than continuous values.
The first vector in Table 2 displays the pairing of the pronoun with the
NP der neue Vorsitzende der Gewerkschaft Erziehung und Wissenschaft,
the first NP in the text. This NP is the subject (On) of its clause. The
value for this grammatical function is −1 since the NP occurs in the clause
immediately preceding the pronoun. The second vector pairs the two pre-
ceding NPs with the pronoun er. Since the NP Ulli Thöne is in predicative
position (Pred) and occurs in the same clause as the subject NP der neue
Vorsitzende der Gewerkschaft Erziehung und Wissenschaft, the value for
these two grammatical functions On and Pred is −1. Thus, the intended
semantics of the features for each grammatical function is to encode the dis-
tance of the last occurrence of a member of the same coreference class with
that particular grammatical function.2 Using mention counts of grammat-
ical functions instead of distances did not significantly change the results;
these counts were therefore omitted from the experiments.
The sample vectors in Table 2 illustrate the incremental encoding of in-
stances. The initial vector encodes only the relation between the pronoun
and the antecedent first mentioned in the text. Each subsequent instance
adds one more member of the same coreference class. This incremental
encoding follows the strategy of Kennedy & Boguraev (1996) and reflects
a dynamic modelling of the discourse history. The last item in the vec-
tor indicates class membership. In the memory-based encoding used in the
experiments, anaphora resolution is turned into a binary classification prob-
lem. If an anaphoric relation holds between an anaphor and an antecedent,
then this is encoded as a positive instance, i.e., as a vector ending in yes.
If no anaphoric relation holds between a pronoun and an NP, then this is
encoded as a negative instance, i.e., as a vector ending in no.
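
A schematic version of this instance construction (our sketch; the actual
experiments use eighteen features, while only a handful of grammatical
functions and invented field names are shown here):

    FUNCTIONS = ["On", "Pred", "Oa", "Od"]   # grammatical functions (subset)

    def make_instance(antecedents, is_coreferent):
        # antecedents: the members of one coreference class seen so far, each
        # a dict with a grammatical function and a clause distance to the
        # pronoun; instances are built incrementally, one per added member.
        vec = {}
        for func in FUNCTIONS:
            dists = [m["dist"] for m in antecedents if m["func"] == func]
            # dash-prefixed so that TiMBL treats the numbers as discrete
            vec[func] = "-%d" % min(dists) if dists else "none"
        vec["class"] = "yes" if is_coreferent else "no"   # binary target
        return vec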
2 A similar encoding is also used by Preiss (2002a).
5 Evaluation
To assess the difficulty of the pronoun resolution task for the TüBa-D/Z
corpus, we established as a baseline a simple heuristic that picks the closest
preceding subject as the antecedent. This baseline is summarized in Table 3
together with results of the experiments described in the previous section.
For each experiment ten-fold cross-validation was performed, using 90% of
the corpus for training and 10% for testing.
negative evidence alike. Perhaps the most surprising result is the fact that
distance between anaphor and antecedent is given the second lowest weight
among all eighteen features. This sharply contrasts with the intuition of-
ten cited in hand-crafted approaches that the distance between anaphor
and antecedent is a very important feature for an adequate resolution algo-
rithm. The reason why distance receives such a low weight might well have
to do with the fact that this feature becomes almost redundant when used
together with the other distance-based features for grammatical functions.
The empirical findings concerning feature weights summarized in Ta-
ble 4 underscore the limitation of hand-crafted approaches that are based
on the analysts’ intuitions about the task domain. In many cases, the rel-
ative weights of features assigned by data-driven approaches will coincide
with the weights assigned by human analysts and fine-tuned by trial and
error. However, in some cases, feature weightings obtained automatically by
data-driven methods will be more objective and diverge considerably from
manual methods, as the weight assigned by TiMBL to the feature distance
illustrates.
The best results were obtained by running TiMBL with the following
parameters: modified value distance metric (Mvdm), no feature weighting,
k = 3, and inverse distance weighting for class voting. The optimizing effect
of the parameters is not entirely sur-
prising.4 The Mvdm metric determines the similarities of feature values by
computing the difference of the conditional distribution of the target classes
for these values. For informative features, δ(v1 , v2 ) will on average be large,
while for less informative features δ will tend to be small. Daelemans et
al. (2005) report that for Nlp tasks Mvdm should be combined with values
of k larger than one. The present task confirms this result by achieving
optimal results for a value of k = 3.
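
For reference, the Mvdm distance between two values v1 and v2 of a feature
is standardly defined as

    δ(v1, v2) = Σi |P(Ci | v1) − P(Ci | v2)|,

where the Ci range over the target classes (here, yes and no).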
The only previous study of German Car that is based on newspaper tree-
bank data is that of Schiehlen (2004). Schiehlen compares an impressive
collection of published algorithms, ranging from reimplementations of rule-
based algorithms to reimplementations of machine-learning and statistical
approaches. The best results of testing on the Negra corpus were achieved
with an F-measure of 0.711 by a decision-tree classifier, using C4.5 and a
pre-filtering module similar to the one used here. The best result with an
F-measure of 0.734 achieved by the memory-based classifier and the Xip-
based pre-filtering component outperforms Schiehlen’s results, although a
direct comparison is not possible due to the different data sets.
4 See Hoste et al. (2002) for the optimizing effect of Mvdm in the word sense disam-
biguation task.
REFERENCES
Aı̈t-Mokhtar, Salah, Jean-Pierre Chanod & Claude Roux. 2002. “Robustness
Beyond Shallowness: Incremental Deep Parsing”. Natural Language Engi-
neering 8:2-3.121-144.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot & Antal van den Bosch. 2005.
TiMBL: Tilburg Memory Based Learner – version 5.1–Reference Guide. Tech-
nical Report ILK 01-04. Induction of Linguistic Knowledge, Computational
Linguistics, Tilburg University.
Ge, Niyu, John Hale & Eugene Charniak. 1998. “A Statistical Approach to
Anaphora Resolution”. Sixth Workshop on Very Large Corpora, 161-170.
Montreal, Canada.
Hinrichs, Erhard, Sandra Kübler, Karin Naumann, Heike Telljohann & Julia
Trushkina. 2004. “Recent Developments in Linguistic Annotations of the
TüBa-D/Z Treebank”. Third Workshop on Treebanks and Linguistic Theo-
ries, 51-62. Tübingen, Germany.
Hoste, Veronique, Iris Hendrickx, Walter Daelemans & Antal van den Bosch.
2002. “Parameter Optimization for Machine-Learning of Word Sense Disam-
biguation”. Natural Language Engineering 8:4.311-325.
Kennedy, Christopher & Branimir Boguraev. 1996. “Anaphora for Everyone:
Pronominal Anaphora Resolution Without a Parser”. 16th International
Conference on Computational Linguistics, 113-118. Copenhagen, Denmark.
Kouchnir, Beata. 2003. A Machine Learning Approach to German Pronoun
Resolution. Master’s thesis. School of Informatics, University of Edinburgh.
Lappin, Shalom & Herbert Leass. 1994. “An Algorithm for Pronominal Anaphora
Resolution”. Computational Linguistics 20:4.535-561.
Mitkov, Ruslan. 2002. Anaphora Resolution. Amsterdam: John Benjamins.
Müller, Christoph, Stefan Rapp & Michael Strube. 2002. “Applying co-training
to reference resolution”. 40th Annual Meeting of the Association for Com-
putational Linguistics (ACL’02 ), 352-359. Philadelphia, Penn.
Preiss, Judita. 2002a. “Anaphora Resolution With Memory-Based Learning”.
5th UK Special Interest Group for Computational Linguistics (CLUK5 ), 1-8.
Preiss, Judita. 2002b. “A Comparison of Probabilistic and Non-Probabilistic
Anaphora Resolution Algorithms”. Student Workshop at the 40th Annual
Meeting of the Association for Computational Linguistics (ACL’02 ), 42-47.
Philadelphia, Penn.
Schiehlen, Michael. 2004. “Optimizing Algorithms for Pronoun Resolution”.
20th International Conference on Computational Linguistics (COLING-2004 ),
515-521. Geneva, Switzerland.
Strube, Michael & Udo Hahn. 1999. “Functional Centering - Grounding Ref-
erential Coherence in Information Structure”. Computational Linguistics
25:3.309-344.
Efficient Spam Analysis for Weblogs through
URL Segmentation
Nicolas Nicolov & Franco Salvetti
Abstract
This paper shows that in the context of spam classification of we-
blogs, segmenting the urls beyond the standard punctuation is help-
ful. Many spam urls contain phrases in which the words are glued
together in order to avoid spam filtering techniques based on uni-
grams and punctuation segmentation. An approach for segmenting
tokens into words forming a phrase is proposed and validated. The
accuracy of the implemented system is better than that of humans
(78% vs. 76%).
1 Introduction
The blogosphere, the part of the web consisting of personal electronic journals (weblogs), currently encompasses 107.7 million pages [2007-10-02] and doubles in size every 5.5 months (Technorati 2006). The large
amount of information contained in the blogosphere has been proven valu-
able for applications such as market intelligence, trend discovery, and opin-
ion tracking (Hurst 2005). Unfortunately in the last several years the blo-
gosphere has been heavily polluted with spam. For certain subsets of the
blogosphere the spam content can be as high as 70-80%. Spam weblogs
(called splogs) are weblogs used for the purposes of promoting affiliated
websites. Splogs skew statistics. Thus, methods for filtering out splogs on
large collections of weblogs are needed. Sophisticated content-based meth-
ods or methods based on link analysis (Gyöngyi et al. 2004, Andras et al.
2005) can be slow. We propose a fast, lightweight and accurate method
based merely on analysis of the url of the web page itself, without considering (or downloading) the content of the page (Kan 2004, Kan & Thi
2005).
For qualitative analysis of the contents of the blogosphere it is acceptable
to eliminate data from analysis as long as the remaining part is representa-
tive and spam-free. For example, in determining the public opinion about
a new product it is sufficient to determine the ratio of the people that like
the product vs. the people that dislike it rather than the actual number.
Therefore, considering the vast amount of weblogs available, it is acceptable
for a spam classifier to lower recall in order to improve precision, especially
2 Engineering of splogs
“Splogs are weblog sites which the author uses only for promoting affiliated
websites. The purpose is to increase the PageRank of the affiliated sites,
get ad impressions from visitors, and/or use the blog as a link outlet to
get new sites indexed. Content is often nonsense or text stolen from other
websites with an unusually high number of links to sites associated with the
splog creator which are often disreputable or otherwise useless websites.”
(Wikipedia 2006:Splog).
Spammers indicate the semantic content of the webpage directly in the
url. A Uniform Resource Locator (url), or web address, is a standardized
address name layout for a resource (a document, image, etc.) on the internet
(or elsewhere). The currently used url syntax is specified in the internet
standard RFC 1738¹:
resource type://hostname.domain:port/file path name#anchor
While the resource type and domain are a small set, the hostname,
file path name and anchor present a perfect opportunity to indicate seman-
tic content by using descriptive phrases—often noun groups (non-recursive
noun phrases) like http://www.adultvideompegs.blogspot.com.² There are dif-
ferent varieties of sites promoted by spam:
1. commercial products (especially electronics);
2. vacations;
3. mortgages;
4. adult-related.
Users don’t want to see spam web pages in their search results and market
intelligence applications are affected when data contains spam. To address
this, search engines provide filtering.
Simple approaches to filtering use keyword spotting (or unigram models): if the webpage or its url contains certain tokens, the page is eliminated from the results (or ranked lower). For spotting keywords, a url
is often tokenized on punctuation symbols. To avoid being identified as a
splog one of the creative techniques that spammers use is to glue words
¹ See also RFC 3986 on Uniform Resource Identifiers (URIs).
² Non-spam sites also communicate concepts through the urls: http://showusyourcharacter.com.
together into longer tokens which will not match the filter keywords (e.g.,
http://businessopportunitymoneyworkathome.coolblogstuff.com).
Another approach to dealing with spam is having a list of spam websites
(SURBL 2006). Such approaches based on blacklists are now less effective
because the available free bloghost tools for creating blogs have made it
very easy for spammers to automatically create new splogs which are not
on the blacklist.
The spam classifier uses a segmenter which aggressively splits the url
(Section 3), and then the token sequence is used in a supervised learning
framework for classification (Section 4).
3 URL segmentation
Fig. 1: Forward maximum match: a matched word (e.g., abc) is consumed at the initial position, and scanning continues with the remaining characters at the new position.

Two worked segmentation traces: dietsthatwork is analysed step by step into diets • that • work (backtracking over the shorter candidate matches die and diet), and iminheaven into i • m • in • heaven.
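The segmentation procedure itself is only sketched in the figure; below is a minimal greedy rendering of forward maximum matching, assuming a lexicon of known words. The published system additionally backtracks over competing matches, as the traces above show, so this simplification is illustrative only.

```python
def forward_max_match(s, lexicon, max_len=20):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon word starting there; fall back to one character."""
    tokens, pos = [], 0
    while pos < len(s):
        match = None
        for end in range(min(len(s), pos + max_len), pos, -1):
            if s[pos:end] in lexicon:
                match = s[pos:end]
                break
        if match is None:          # no known word: emit a single character
            match = s[pos]
        tokens.append(match)
        pos += len(match)
    return tokens

lexicon = {"diets", "diet", "that", "work", "i", "m", "in", "heaven"}
print(forward_max_match("dietsthatwork", lexicon))   # ['diets', 'that', 'work']
print(forward_max_match("iminheaven", lexicon))      # ['i', 'm', 'in', 'heaven']
```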
feature abbr
L known word Lw
L known word (morph. extended) Lmw
R known word Rw
L known suffix Ls
L maximal known suffix Lms
R known prefix Rp
Lw & Rw LwRw
Ls & Rp LsRp
L & R char unigrams 1-1
L & R char bigrams 2-2
L & R char trigrams 3-3
L bigram & R trigram (char) 2-3
L trigram & R bigram (char) 3-2
Table 1: Segmentation features for MaxEnt; L = ‘to the left there is a
. . . ’; R = ‘to the right there is a . . . ’
3.4 Contextual
Finally we considered an approach where diverse information about the
segmentation can be used by the system in a unified way. We employed a
Maximum Entropy framework and at a position between two characters in
the string we considered the following features (contextual predicates) again
motivated by error analysis of the previous systems as well as linguistic
intuitions (cf. Table 1).
feature   example
Lw        complex|andconfused
Lmw       weekly |tipsandtricks
Rw        speed|racer
Ls        wellness|formulas
Lms       institutionalized|models
Rp        quick|antiaging
LwRw      keydata|recovery
LsRp      dating|unleashed
Table 2: Examples of the word and affix features
Table 2 shows examples of the word and affix features. The ‘|’ symbol indi-
cates the current position. The underlined substrings refer to the triggers
for the corresponding features. ‘Lmw’ refers to there being a morphological
variant (extension) of a known word to the left of the current position, e.g.,
weekly |…; a maximal suffix is a suffix which is not a prefix of another
suffix (ing is a suffix but it is not maximal because of the composite suf-
fix ings). Note how the character n-grams subsume the symmetric sliding
window approaches above in Section 3.2.
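As an illustration of how such contextual predicates can be computed, here is a sketch that emits a subset of the Table 1 features at a candidate split position. The feature names follow the table, while the lexicon, suffix, and prefix lists are illustrative stand-ins for the resources induced from the weblog pages.

```python
def split_features(s, i, lexicon, suffixes, prefixes):
    """Contextual predicates at a candidate split between s[:i] and s[i:],
    mirroring a subset of Table 1 (resource contents are illustrative)."""
    left, right = s[:i], s[i:]
    feats = []
    if any(left.endswith(w) for w in lexicon):
        feats.append("Lw")                  # known word ends at the left
    if any(right.startswith(w) for w in lexicon):
        feats.append("Rw")                  # known word starts at the right
    if any(left.endswith(suf) for suf in suffixes):
        feats.append("Ls")                  # known suffix to the left
    if any(right.startswith(p) for p in prefixes):
        feats.append("Rp")                  # known prefix to the right
    if "Lw" in feats and "Rw" in feats:
        feats.append("LwRw")
    feats.append("2-3=" + s[max(0, i - 2):i] + "|" + s[i:i + 3])  # char n-grams
    return feats

print(split_features("dietsthatwork", 5,
                     {"diets", "that", "work"}, {"s", "ing"}, {"anti", "un"}))
# ['Lw', 'Rw', 'Ls', 'LwRw', '2-3=ts|tha']
```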
4 URL classification
$$\hat{c} \;=\; \arg\max_{c \in C} P(c \mid T) \;=\; \arg\max_{c \in C} \frac{P(c)\,P(T \mid c)}{P(T)} \;=\; \arg\max_{c \in C} P(c)\,P(T \mid c) \;=\; \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(t_i \mid c)$$

The token probabilities are estimated with add-one smoothing:

$$P^{*}(t_i \mid c) = \frac{\mathit{count}(t_i, c) + 1}{N_c + V_c}$$
where count(ti, c) is the number of times the token ti occurred in the
collection for category c; Nc is the total number of tokens in the collection
for category c; Vc is the number of distinct tokens in the collection for
category c.
We have also explored simple voting techniques for determining the class
ĉ:
$$\hat{c} = \begin{cases} \text{spam}, & \text{if } \operatorname{sgn} \sum_{i=1}^{n} \operatorname{sgn}\bigl(P(t_i \mid \text{spam}) - P(t_i \mid \text{good})\bigr) = 1 \\ \text{good}, & \text{otherwise} \end{cases}$$
Because we are interested in having control over the precision of the classifier
we introduce a score meant to be used for deciding whether to label a url
as unknown.
$$\mathit{score}(T) = \frac{P(\text{spam} \mid T) - P(\text{good} \mid T)}{P(\text{spam} \mid T) + P(\text{good} \mid T)}$$
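Putting the pieces together, the following is a minimal sketch of the smoothed Naive Bayes decision combined with the normalised score used to emit an unknown label. The toy training data and the threshold value are illustrative, not those of the deployed classifier.

```python
import math
from collections import Counter

def train(data):
    """data: list of (token_list, label). Collect per-class token counts,
    totals, vocabulary sizes, and log-priors for add-one smoothing."""
    counts, docs = {}, Counter()
    for tokens, c in data:
        docs[c] += 1
        counts.setdefault(c, Counter()).update(tokens)
    n = sum(docs.values())
    return {c: {"logprior": math.log(docs[c] / n),
                "counts": cnt,
                "N": sum(cnt.values()),   # total tokens in class c
                "V": len(cnt)}            # distinct tokens in class c
            for c, cnt in counts.items()}

def logprob(tokens, m):
    """log P(c) + sum_i log P*(t_i|c) with add-one smoothing."""
    return m["logprior"] + sum(
        math.log((m["counts"][t] + 1) / (m["N"] + m["V"])) for t in tokens)

def classify(tokens, model, threshold=0.1):
    """Label spam/good, or 'unknown' when the normalised score
    (P(spam|T) - P(good|T)) / (P(spam|T) + P(good|T)) is near zero."""
    p = {c: math.exp(logprob(tokens, m)) for c, m in model.items()}
    score = (p["spam"] - p["good"]) / (p["spam"] + p["good"] or 1e-300)
    if abs(score) < threshold:
        return "unknown"
    return "spam" if score > 0 else "good"

model = train([(["adult", "video", "mpegs"], "spam"),
               (["travel", "photos"], "good")])
print(classify(["adult", "video"], model))  # spam
```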
# of splits   # spam URLs   # good URLs
1                   2,235         2,274
2                     868           459
3                     223            46
4                      77             7
5                       2             1
6                       4             1
8                       3             -
Total               3,412         2,788
Table 3: Number of splittings in a URL
First we discuss the segmenter. 10,000 spam and 10,000 good weblog urls
and their corresponding webpages were used for the experiments. The
20,000 weblog HTML pages are used to induce the segmenter. The first experiment was aimed at finding out how common segmentation is as a phenomenon. The segmenter was run on the actual training urls. The number of urls that are additionally segmented beyond the segmentation on punctuation is reported in Table 3.
The multiple segmentations need not all occur on the same token in the
url after initial segmentation on punctuation.
The various segmenters were then run on a separate test set of 1,000 urls for which the ground truth for the segmentation was marked. For the MaxEnt model we used a feature count cutoff of 7; this affects mostly the lexical and character n-gram features. The results are in Table 4. The spam classifier was then run on the test set. The results are shown in Table 5.
accuracy 78%
prec. spam 82%
rec. spam 71%
f-measure spam 76%
prec. good 74%
rec. good 84%
f-measure good 79%
Table 5: Classification results
The performance of humans on this task was also evaluated. Eight indi-
viduals performed the spam classification just looking at the unsegmented
urls. The scores for the human annotators are given in Table 6. The aver-
age accuracy of the humans (76%) is slightly lower than that of the system
(78%).
From an information retrieval perspective, if only 50.9% of the urls are retrieved (labelled as either spam or good, with the rest labelled as unknown), then 93.3% of the spam/good decisions are correct. This is relevant for cases where a url spam filter is placed in a cascade, followed by, for example, a content-based spam classifier.
6 Discussion
The system performs better with the aggressive segmentation because it has been forced to smooth on fewer occasions. For instance, given
                 Mean      σ
accuracy          76%   6.71
precision spam    83%   7.57
recall spam       65%   6.35
f-measure spam    73%   7.57
precision good    71%   6.35
recall good       87%   6.39
f-measure good    78%   6.08
Table 6: Results for the human annotators; σ is the standard deviation
7 Future work
8 Conclusions
REFERENCES
Andras, T. S., A. Benczúr, Károly Csalogány & M. Uher. 2005. “SpamRank —
Fully Automatic Link Spam Detection”. 1st Int. Workshop on Adversarial
Information Retrieval on the Web, WWW-2005.
Gao, J., M. Li & C.-N. Huang. 2005. “Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach”. Computational Linguistics 31:4.531-574.
Gyöngyi, Z., H. Garcia-Molina & J. Pedersen. 2004. “Combating Web Spam
with TrustRank”. 30th Int. Conference on Very Large Data Bases (VLDB),
271-279. Toronto, Canada.
Hurst, Matthew. 2005. “Deriving Marketing Intelligence from Online Discus-
sion”. 11th ACM SIGKDD Int. Conference on Knowledge Discovery in Data
mining (KDD’05 ), 419-428. Chicago, Illinois, U.S.A.
Kan, M.-Y. 2004. “Web Page Classification without the Web Page”. 13th Int.
World Wide Web Conference (WWW’2004 ), New York. poster.
Kan, M.-Y. & H. O. N. Thi. 2005. “Fast Webpage Classification Using URL Fea-
tures”. Conference on Information and Knowledge Management (CIKM’05 ),
325-326, Bremen, Germany.
Kolari, P., T. Finin & A. Joshi. 2006. “SVMs for the Blogosphere: Blog Identification and Splog Detection”. Computational Approaches to Analyzing Weblogs, number SS-06-03, ed. by N. Nicolov, F. Salvetti, M. Liberman & J. H. Martin, 92-99. Menlo Park, Calif.: AAAI Press.
Mishne, Gilad, D. Carmel & R. Lempel. 2005. “Blocking Blog Spam with Language Model Disagreement”. 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb’05), at WWW2005, ed. by B. D. Davison, 1-6. Chiba, Japan.
Shih, L. K. & D. Karger. 2004. “Using URLs and Table Layout for Web Classi-
fication Tasks”. 13th Int. World Wide Web Conference (WWW’2004 ), New
York.
SURBL. 2006. “Surbl – Spam Uri Realtime Blocklists”. http://www.surbl.org.
Technorati. 2006. “State of the Blogosphere February 2006 part 1: On
Blogosphere Growth”. http://technorati.com/weblog/2006/02/81.html.
Wikipedia. 2006. “Splog (spam blog)”. http://en.wikipedia.org/wiki/Splog.
Document Classification Using Semantic Networks
with an Adaptive Similarity Measure
Filip Ginter, Sampo Pyysalo & Tapio Salakoski
Turku Centre for Computer Science (TUCS )
University of Turku, Finland
Abstract
We consider supervised document classification where a semantic net-
work is used to augment document features with their hypernyms.
A novel document representation is introduced in which the contri-
bution of the hypernyms to document similarity is determined by
semantic network edge weights. We argue that the optimal edge
weights are not a static property of the semantic network, but should
rather be adapted to the given classification task. To determine the
optimal weights, we introduce an efficient gradient descent method
driven by the misclassifications of the k-nearest neighbors (kNN)
classifier. The method iteratively adjusts the weights, increasing or
decreasing the similarity of documents depending on their classes.
We thoroughly evaluate the method using the kNN classifier and
show it to statistically significantly outperform the bag-of-words rep-
resentation as well as the more advanced hypernym density represen-
tation of Scott & Matwin (1998).
1 Introduction
Fig. 1: Hyponymy relationships in a semantic network (traveller dominates air traveller and space traveller; space traveller in turn dominates astronaut and cosmonaut)
from direct terms to more general terms, decreasing according to the data-
dependent strengths of connections between hyponyms and hypernyms.
We have previously introduced a data-driven method for determining hy-
pernym relevance for document classification, where relevance was limited
to the two cases ‘fully relevant’ and ‘irrelevant’ (Ginter, Pyysalo, Boberg,
Järvinen & Salakoski 2004). In this paper, we present a method that applies
a finer-grained concept of relevance and shows a more substantial perfor-
mance advantage.
2 Document representation
$$a_t(d) = \begin{cases} 1 & \text{if } t \in T(d), \\ \dfrac{1}{|\mathit{Base}_t(d)|} \displaystyle\sum_{t' \in \mathit{Base}_t(d)} w_{t't}\, a_{t'}(d) & \text{if } t \in \mathit{Aff}(d) \setminus T(d), \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
Finally, each document d is represented by an activation vector.
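A recursive rendering of Equation (1) may help: the sketch below computes the activation of a term from the weighted activations of its hyponyms, using the Figure 1 network. The simple cycle guard, the illustrative edge weights, and the treatment of Base_t as the full hyponym list are simplifying assumptions.

```python
def activation(term, direct_terms, base, weights):
    """Activation a_t(d) from Equation (1): direct terms get activation 1;
    a hypernym averages the weighted activations of its hyponyms Base_t;
    all other terms get 0. base[t] lists the hyponyms of t, and
    weights[(t1, t2)] is the edge weight w_{t1 t2} from hyponym t1 to t2."""
    def a(t, seen=frozenset()):
        if t in direct_terms:
            return 1.0
        hypos = [h for h in base.get(t, []) if h not in seen]
        if not hypos:
            return 0.0
        return sum(weights.get((h, t), 0.0) * a(h, seen | {t})
                   for h in hypos) / len(hypos)
    return a(term)

# Toy network from Fig. 1; the edge weights are illustrative.
base = {"traveller": ["air traveller", "space traveller"],
        "space traveller": ["astronaut", "cosmonaut"]}
weights = {("space traveller", "traveller"): 0.8,
           ("air traveller", "traveller"): 0.8,
           ("astronaut", "space traveller"): 0.5,
           ("cosmonaut", "space traveller"): 0.5}
print(activation("traveller", {"astronaut"}, base, weights))  # 0.1
```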
4 Evaluation
w ← 1̄
until stopping criterion satisfied do:
    w′ ← 0̄
    for each document di ∈ D:
        classify di using D \ {di} as training set
        if di was misclassified then:
            for each dj ∈ N(di, k, D \ {di}):
                if class(di) = class(dj) then δ ← +1 else δ ← −1
                w′ ← w′ + δ · ∇w sim(di, dj)
    w ← w + η · w′
    for each weight wk in w:
        wk ← max{0, min{1, wk}}
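In Python-like form, the adaptation loop can be sketched as follows, with the kNN classifier and the gradient of sim(di, dj) (derived in Appendix 1) left as placeholder callables; the learning rate and round count are illustrative.

```python
def adapt_weights(w, data, knn, grad_sim, eta=0.05, rounds=20):
    """Sketch of the kNN-driven weight adaptation. data is a list of
    (document, label) pairs; knn(d, rest, w) returns the predicted label
    and the k nearest (document, label) neighbours; grad_sim(di, dj, w)
    returns the gradient of sim(di, dj) w.r.t. each edge weight."""
    for _ in range(rounds):  # stand-in for the stopping criterion
        delta_w = dict.fromkeys(w, 0.0)
        for i, (di, yi) in enumerate(data):
            rest = data[:i] + data[i + 1:]          # leave-one-out
            predicted, neighbours = knn(di, rest, w)
            if predicted != yi:                     # misclassified document
                for dj, yj in neighbours:
                    sign = 1.0 if yj == yi else -1.0
                    for edge, g in grad_sim(di, dj, w).items():
                        delta_w[edge] += sign * g
        for edge in w:                              # update, then clip to [0, 1]
            w[edge] = max(0.0, min(1.0, w[edge] + eta * delta_w[edge]))
    return w
```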
2000 randomly selected articles from the journal and 2000 randomly se-
lected articles that have appeared elsewhere. Additionally, we form for
each dataset seven down-sampled training sets, the largest consisting of
1000 positive and 1000 negative examples (the other 2000 being used for
testing). Each task is then a binary classification problem where the doc-
uments must be classified either as originating from the journal or not.
Since the journals are usually focused on a subdomain, these classification
problems model document classification by topic.
We evaluate the proposed document representation with and without
the adaptive component. In the fixed representation, the semantic network
weights are all set to one constant value w_fix, 0 ≤ w_fix ≤ 1, determined
from the data. In the adaptive representation, the weights are computed
using the algorithm introduced in Section 3, with a stopping criterion where
iteration ends when the average performance increase on the training set
over the last three rounds drops below 0.05%.
We compare the performance of the fixed and adaptive representations
against two baselines, the commonly used bag-of-words (BoW) represen-
tation and a modification of the hypernym density (HD) representation of
Scott & Matwin (1998) in which each document di is represented by a mul-
tiset consisting of all direct terms of di , together with their hypernyms up
to a distance h from any of the direct terms. We found that in our case
coercing the multiset into a set results in an improvement of performance,
and thus we apply this step in our evaluation. Further, infrequent terms are
not discarded and the HD normalization step is performed by the classifiers.
The main evaluation is performed using the kNN classifier. The param-
eters of the various methods — k for BoW, k, h for HD, k, w_fix for fixed,
and k, η for adaptive — are selected by cross-validated grid search on the
training set. We also perform a limited evaluation using Support Vector
Absolute difference [percentage units] DOCUMENT CLASSIFICATION USING SEMANTIC NETWORKS 143
5 8
7 Adaptive vs. BoW
3
6
8 HD vs. BoW
8 Fixed vs. BoW
5
2 3 5 HD vs. Fixed
0 3 2 5 4
4 5
3
1 3 5
2 3
1 3
3 5 5
2
1 3
1 4
0 3
1
0 1 1 0 2
0 0
0
Machines (SVM) (Vapnik 1998). For this evaluation, only the SVM reg-
ularization parameter C is separately selected, while other parameters are
set to their kNN optimum values.
We measure the performance using average 5 × 2 cross-validated accu-
racy. To assess the statistical significance for individual datasets, we use the
robust 5 × 2 cross-validation test of Alpaydin (1999). To assess the overall
significance across all datasets, we use the two-tailed paired t-test.
Average performance differences when using kNN are plotted in Figure 4
and average absolute performance figures are given in Table 1a. The adap-
tive method statistically significantly outperforms all others for all except
the smallest training set size. Further, the relative decrease in error rate
grows almost monotonically, indicating that the adaptive method works
better given more data. As the documents were assigned on average only
10 MeSH terms and the MeSH ontology contains almost 23000 nodes, reli-
able optimization of the edge weights is expected to be difficult with very
small training sets. Nevertheless, the adaptive method works remarkably
well with as few as 62 training examples.
When applied to SVMs, the fixed and HD representations outperform
the adaptive method (Figure 1b). The difference is statistically significant
for most training set sizes larger than 62. Clearly, the adaptive method
does not optimize a criterion beneficial for SVM classification, and hence
modification of the adaptive strategy is required to increase applicability
REFERENCES
Agirre, Eneko & German Rigau. 1996. “Word Sense Disambiguation Using
Conceptual Density”. Proceedings of the 16th Conference on Computational
Linguistics, Copenhagen, Denmark (COLING’96 ), 16-22. San Francisco:
Morgan Kaufmann.
Alpaydin, Ethem. 1999. “Combined 5 × 2 cv F -Test for Comparing Supervised
Classification Learning Algorithms”. Neural Computation 11:8.1885-1892.
Ginter, Filip, Sampo Pyysalo, Jorma Boberg, Jouni Järvinen & Tapio Salakoski.
2004. “Ontology-Based Feature Transformations: A Data-Driven Approach”.
Proceedings of the 4th International Conference on Advances in Natural Lan-
guage Processing, Alicante, Spain (EsTAL’04 ), ed. by José L. Vicedo et al.
(= Lecture Notes in Computer Science, 3230 ), 279-290. Heidelberg: Springer.
Gómez-Hidalgo, José M. & Manuel de Buenaga Rodrı́guez. 1997. “Integrat-
ing a Lexical Database and a Training Collection for Text Categorization”.
Proceedings of the ACL/EACL Workshop on Automatic Information Extrac-
tion and Building of Lexical Semantic Resources, Madrid, Spain, ed. by Piek
Vossen et al., 39-44. Association for Computational Linguistics.
Jiang, Jay J. & David W. Conrath. 1997. “Semantic Similarity Based on Corpus
Statistics and Lexical Taxonomy”. Proceedings of the International Con-
ference on Research in Computational Linguistics, Tainan, Taiwan (RO-
CLING’97 ), 19-33.
Patwardhan, Siddharth, Satanjeev Banerjee & Ted Pedersen. 2003. “Using Mea-
sures of Semantic Relatedness for Word Sense Disambiguation”. Proceed-
ings of the Fourth International Conference on Intelligent Text Processing
and Computational Linguistics, Mexico City, Mexico (CICLing’03 ), ed. by
Alexander F. Gelbukh, 241-257. Heidelberg: Springer.
Resnik, Philip. 1995. “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada (IJCAI’95), ed. by Christopher Mellish, 448-453. San Francisco: Morgan Kaufmann.
Scott, Sam & Stan Matwin. 1998. “Text Classification Using WordNet Hypernyms”. Proceedings of the COLING-ACL Workshop on Use of WordNet in Natural Language Processing Systems, Montreal, Canada, ed. by Sanda Harabagiu, 38-44. Association for Computational Linguistics.
Vapnik, Vladimir. 1998. Statistical Learning Theory. New York: Wiley.
Appendix 1: Derivation of the formula for $\partial\,\mathrm{sim}(d_i, d_j) / \partial w_{rs}$

This appendix details the derivation of the formula to compute the value of $\partial\,\mathrm{sim}(d_i, d_j) / \partial w_{rs}$. The partial derivative is solved and the final formula is obtained jointly from Equations 3, 7, and 8. Substituting from (6) into (3) gives

$$\frac{\partial\,\mathrm{sim}(d_i, d_j)}{\partial w_{rs}} = \frac{\sum_{t \in T} \frac{\partial a_t(d_i)}{\partial w_{rs}} \,\|a(d_i)\|\, \hat{a}_t(d_j) \;-\; \sum_{t \in T} \hat{a}_t(d_j)\, \hat{a}_t(d_i) \sum_{u \in T} a_u(d_i) \frac{\partial a_u(d_i)}{\partial w_{rs}}}{\|a(d_i)\|^2}$$

$$= \frac{\sum_{t \in T} \frac{\partial a_t(d_i)}{\partial w_{rs}} \bigl(\|a(d_i)\|\, \hat{a}_t(d_j) - a_t(d_i)\, \mathrm{sim}(d_i, d_j)\bigr)}{\|a(d_i)\|^2} = \frac{\sum_{t \in T} \frac{\partial a_t(d_i)}{\partial w_{rs}} \bigl(\hat{a}_t(d_j) - \hat{a}_t(d_i)\, \mathrm{sim}(d_i, d_j)\bigr)}{\|a(d_i)\|} \qquad (7)$$

If $t \notin \mathit{Aff}(d_i) \setminus T(d_i)$ then $a_t(d_i)$ is by (1) constant and consequently $\partial a_t(d_i) / \partial w_{rs} = 0$. For $t \in \mathit{Aff}(d_i) \setminus T(d_i)$,

$$\frac{\partial a_t(d_i)}{\partial w_{rs}} \overset{(1)}{=} \frac{1}{|\mathit{Base}_t(d_i)|} \sum_{t' \in \mathit{Base}_t(d_i)} \begin{cases} a_r(d_i) & \text{if } (t', t) = (r, s) \\ w_{t't}\, \dfrac{\partial a_{t'}(d_i)}{\partial w_{rs}} & \text{otherwise.} \end{cases} \qquad (8)$$

The recursion in (8) ends when $(t', t) = (r, s)$. Finally, substituting from (8) into (7) and then into (3) completes the derivation of $\partial\,\mathrm{sim}(d_i, d_j) / \partial w_{rs}$.
Text Summarization for Improved Text Classification
Rada Mihalcea & Samer Hassan
University of North Texas
Abstract
This paper explores the possible benefits of the interaction between
text summarization and text classification. Through experiments
performed on standard data sets, we show that techniques for extrac-
tive summarization can be effectively combined with classification
methods, resulting in improved performance in a text categorization
task. Moreover, comparative results suggest that the synergy be-
tween text summarization and text classification can be regarded as
a new task-oriented evaluation testbed for automatic summarization.
1 Introduction
that these results have implications not only on the problem of text clas-
sification, where the method proposed provides the means to improve the
categorization accuracy with error reductions of up to 19.3%, but also on
the problem of text summarization, by suggesting a new application-based
evaluation of tools for automatic summarization.
[Figure: system architecture. Training documents and new documents are passed through extractive summarization; the resulting summaries feed a learning component that builds the text classifier, which assigns a category.]
2.3 Data
For the classification experiments, we use the Reuters-21578 and WebKB data sets, two of the most widely used test collections for text classification. For Reuters-21578, we use the standard ModApte data split (Apte et
al. 1994), obtained by eliminating unlabeled documents, and selecting only
categories that have at least one document in the training set and test set
(Yang & Liu 1999). For WebKB, we perform a four-fold cross validation
after removing the other category, using the 3 : 1 training/test split auto-
matically created with the scripts provided with the collection, which sepa-
rates the data into training on three of the universities plus a miscellaneous
collection, and testing on a fourth held-out university. Both collections are
further post-processed by removing all documents with fewer than 10 sentences, resulting in final data sets of 1277 training documents, 436 test
documents, 60 categories for Reuters-21578, and 350 training documents,
101 test documents, 6 categories for WebKB. This last step is motivated by
the goal of our experiments: we want to determine the impact of various
degrees of summarization on text categorization, and this is not possible
with very short documents.
Reuters-21578 WebKB
Summary length Rocchio Naı̈ve Bayes Rocchio Naı̈ve Bayes
1 sentence 70.18% 72.94% 60.40% 61.39%
2 sentences 72.48% 73.85% 62.38% 71.29%
3 sentences 73.17% 75.46% 60.40% 74.26%
4 sentences 72.94% 75.46% 63.37% 76.24%∗
5 sentences 75.23% 75.92% 66.35%∗ 79.24%∗
6 sentences 77.00%∗ 78.38%∗ 65.37%∗ 77.24%∗
7 sentences 76.38% ∗ 77.61%∗ 65.34%∗ 76.24%∗
8 sentences 76.38%∗ 77.38%∗ 64.36% 76.21%∗
9 sentences 75.23% 77.61%∗ 65.34%∗ 75.25%
10 sentences 75.23% 77.52%∗ 65.35%∗ 75.25%
full document 74.91% 75.46% 63.36% 74.25%
Table 1: Classification results for different summary lengths
3 Experimental results
Classification experiments are run using each of the two learning algorithms,
with documents in the collection represented by extracts of various lengths.
The classification performance is measured for each experiment, and com-
pared against a traditional classifier that performs text categorization using
the original full-length documents.
Table 1 shows the classification results obtained using the Reuters-21578
and the WebKB test collections. Note that these results refer to data subsets
somewhat more difficult than the original test collections, and thus they are
not directly comparable to results previously reported in the literature. For
instance, the average number of documents per category in the Reuters-
21578 subset used in our experiments is only 25, compared to the average
of 50 documents per category available in the original data set.³ Similarly,
the WebKB subset has an average number of 75 documents per category,
³ Using the same Naı̈ve Bayes and Rocchio classifier implementations on the entire Reuters-21578 data set results in a classification accuracy of 77.42% for Naı̈ve Bayes and 79.70% for Rocchio, figures that are closely comparable to results previously reported on this data set.
[Plots of classification accuracy (%).]
4 Related work
5 Conclusions
REFERENCES
Apte, Chidanand, Fred Damerau & Sholom M. Weiss. 1994. “Towards Language
Independent Automated Learning of Text Categorisation Models”. Proceed-
ings of the 17th ACM SIGIR Conference on Research and Development in
Information Retrieval, 23-30. Dublin, Ireland.
Brin, Sergey & Lawrence Page. 1998. “The Anatomy of a Large-Scale Hy-
pertextual Web Search Engine”. Computer Networks and ISDN Systems
30:1-7.107-117.
Erkan, Gunes & Dragomir Radev. 2004. “LexPageRank: Prestige in Multi-Document Text Summarization”. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 365-371. Barcelona, Spain.
Hulth, Anette. 2003. “Improved Automatic Keyword Extraction Given More
Linguistic Knowledge”. Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP 2003 ), 216-223. Sapporo, Japan.
Joachims, Thorsten. 1997. “A Probabilistic Analysis of the Rocchio Algorithm
with TFIDF for Text Categorization”. Proceedings of the 14th International
Conference on Machine Learning (ICML 1997 ), 143-151. Nashville, Ten-
nessee.
Kamvar, Sepandar, Taher Haveliwala, Christopher Manning & Gene Golub. 2003.
“Extrapolation Methods for Accelerating PageRank Computations”. Pro-
ceedings of the 12th International World Wide Web Conference, 261-270.
Budapest, Hungary.
Kleinberg, Jon. 1999. “Authoritative Sources in a Hyperlinked Environment”.
Journal of the ACM 46:5.604-632.
Ko, Youngjoong, Jinwoo Park & Jungyun Seo. 2002. “Automatic Text Catego-
rization Using the Importance of Sentences”. Proceedings of the 19th Inter-
national Conference on Computational Linguistics, 1-7. Taipei, Taiwan.
Kolcz, Aleksander, Vidya Prabakarmurthi & Jugal Kalita. 2001. “Summarization as Feature Selection for Text Categorization”. Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM’01), 365-370. Atlanta, Georgia.
Lin, Chin-Yew & Eduard Hovy. 2003. “Automatic Evaluation of Summaries
Using n-gram Co-occurrence Statistics”. Proceedings of the Human Language
Technology Conference, 71-78. Edmonton, Canada.
McCallum, Andrew & Kamal Nigam. 1998. “A Comparison of Event Models
for Naive Bayes Text Classification”. Proceedings of AAAI-98 Workshop on
Learning for Text Categorization, 41-48. Madison, Wisconsin.
Mihalcea, Rada & Paul Tarau. 2004. “TextRank – Bringing Order into Texts”.
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2004 ), 404-411. Barcelona, Spain.
Mihalcea, Rada & Paul Tarau. 2005. “An Algorithm for Language Independent
Single and Multiple Document Summarization”. Proceedings of the Interna-
tional Joint Conference on Natural Language Processing, 19-24. Jeju Island,
Korea.
Ng, Hwee Tou, Wei Booh Goh & Kok Leong Low. 1997. “Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization”. Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 67-73. Philadelphia, Pennsylvania.
Pang, Bo & Lillian Lee. 2004. “A Sentimental Education: Sentiment Analysis
Using Subjectivity Summarization Based on Minimum Cuts”. Proceedings of
the 42nd Meeting of the Association for Computational Linguistics, 271-278.
Barcelona, Spain.
Rocchio, Joseph J. 1971. “Relevance Feedback in Information Retrieval”. The SMART Retrieval System: Experiments in Automatic Document Processing ed. by Gerard Salton. Englewood Cliffs, New Jersey: Prentice Hall.
Yang, Yiming & Xin Liu. 1999. “A Reexamination of Text Categorization Meth-
ods”. Proceedings of the 22nd ACM SIGIR Conference on Research and
Development in Information Retrieval, 42-49. Berkeley, California.
Yang, Yiming & Jan O. Pedersen. 1997. “A Comparative Study on Feature Se-
lection in Text Categorization”. Proceedings of the 14th International Con-
ference on Machine Learning, 412-420. Nashville, Tennessee.
Exploiting Linguistic Cues to Classify
Rhetorical Relations
Caroline Sporleder* & Alex Lascarides**
*Tilburg University & **University of Edinburgh
Abstract
We propose a method for automatically identifying rhetorical re-
lations. We use supervised machine learning but exploit discourse
markers to automatically label training data. Our models draw on a
variety of linguistic cues to distinguish between relations. We show
that these feature-rich models outperform the previously suggested
bigram models by more than 20%, at least for small training sets.
Our approach is therefore better suited to deal with relations for
which it is difficult to automatically label a lot of training data be-
cause they are rarely signalled by unambiguous discourse markers.
1 Introduction
Clauses in a text link to each other via rhetorical relations such as con-
trast or result (Mann & Thompson 1987). For example, the sentences
in (1) are linked by result. Many NLP applications, such as question-
answering and text summarisation, would benefit from a method which
automatically identifies such relations. While rhetorical relations are some-
times signalled by discourse markers (also known as discourse ‘connectives’)
such as since or consequently, these are often ambiguous. For example, since
can indicate either a temporal or an explanation relation (examples (2a)
and (2b), respectively). Discourse markers are also often missing (as in (1)).
Hence, it is not possible to rely on discourse markers alone.
(1) A train hit a car on a level crossing. It derailed.
(2) a. She has worked in retail since she moved to Britain.
b. I don’t believe he’s here since his car isn’t parked outside.
We present a machine learning method which uses a variety of linguistic and
textual features (word stems, part-of-speech tags, tense information) to de-
termine the rhetorical relation between two adjacent text spans (sentences
or clauses) in the absence of an unambiguous discourse marker. To avoid
manual annotation of training data, we train on automatically labelled ex-
amples, building on earlier work by Marcu & Echihabi (2002), who extracted
examples from large text corpora and used unambiguous discourse mark-
ers to label them with the correct rhetorical relation. Before training, the
markers were removed to enable the classifier to learn from other features
and thus identify relations in unmarked examples. This approach works be-
cause there is often a certain amount of redundancy between the discourse
marker and the general linguistic context. For example, the two clauses in
example (3a) are in a contrast relation signalled by but. However, this
relation can also be inferred if no discourse marker is present (see (3b)).
(3) a. She doesn’t make bookings but she makes itinerary recommendations.
b. She doesn’t make bookings; she makes itinerary recommendations.
Hobbs et al. (1993) and Asher & Lascarides (2003) propose a logical ap-
proach to inferring relations, which in this case would rely on the linguistic
cues of a negation in the first span, syntactic parallelism of the two spans,
and the fact that they both have the same subject. We explore whether
such cues can also be exploited as features in a statistical model for recog-
nising rhetorical relations and whether such a feature-rich model leads to
a better performance than Marcu & Echihabi’s (2002) word co-occurrence
model.
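A minimal sketch of this marker-based labelling step follows: an unambiguous discourse marker selects the relation, and the marker is removed from the extracted spans so the classifier must learn from the remaining context. The two patterns and the marker-to-relation map are illustrative, not the full extraction-pattern set used by the authors.

```python
import re

# Illustrative marker -> extraction pattern map; the real system uses a
# larger inventory of unambiguous discourse markers and relations.
PATTERNS = {
    "contrast":    re.compile(r"^(?P<s1>[^,]+), but (?P<s2>.+)$", re.I),
    "explanation": re.compile(r"^(?P<s1>.+) because (?P<s2>.+)$", re.I),
}

def auto_label(sentence):
    """Return (relation, span1, span2) when an unambiguous marker matches;
    the marker itself is dropped so that a classifier must learn the
    relation from the remaining linguistic context."""
    for relation, pattern in PATTERNS.items():
        m = pattern.match(sentence)
        if m:
            return relation, m.group("s1").strip(), m.group("s2").strip()
    return None

print(auto_label("She doesn't make bookings, but she makes itinerary recommendations."))
```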
2 Related research
3 Our approach
3.2 Data
We used three corpora, mainly from the news domain, to extract our data
set: the British National Corpus (BNC, 100 million words), the North Amer-
ican News Text Corpus (350 million words) and the English Gigaword Cor-
pus (1.7 billion words). We preprocessed the corpora by removing duplicate
texts and speech transcripts, and running a sentence splitter (Reynar & Ratnaparkhi 1997) on the latter two corpora.
We applied the extraction patterns to the raw text to find potential
examples. These were then parsed with the RASP parser (Carroll & Briscoe
2002) and the parse trees were processed to (i) identify the two spans and
(ii) filter out false positives. For instance, (9) was extracted as an example
of summary based on the apparent presence of the discourse marker in
short. However, the parser correctly identified this string as part of the
prepositional phrase in short order and the example was discarded. We
extracted both intra- and inter-sentential relations (see (4) and (7) above,
respectively). However, we limited the length of the spans to one sentence,
as we specifically wanted to focus on relations between small units of text.
(9) In short order I was to fly with ‘Deemy’ on Friday morning.
There are three potential sources of errors: (i) the two spans are not
related, (ii) they are related but the wrong relation is hypothesised and
(iii) the hypothesised span boundaries are wrong. To assess the quality of
the extracted data, we manually inspected 100 randomly selected examples
(20 per relation). We found 11 errors overall: three of type (i) (no relation)
and eight of type (iii) (wrong boundary). No wrongly predicted relation
was found. The overall precision was thus 89%.
The number of extracted training examples ranged from 1,732 for continuation to around 50,000 for contrast. On the whole, our data set is much smaller than the one used by Marcu & Echihabi (2002), which contained around 10 million examples for six relations.
4 Experiments
of the four relations. For each example, we included the two preceding
and following sentences as context. We then asked three subjects who were
trained in discourse annotation to classify each of the 50 examples as one
of the four relations or as none. All subjects were aware that discourse
markers had been removed from the examples but did not know the location
of the removed discourse marker. We evaluated the annotations against the
gold standard. The average accuracy was 71.25%, the average, pairwise
Kappa coefficient (Siegel & Castellan 1988) was .61. While the agreement
is far from perfect, it is relatively high for a discourse annotation task.
Hence it seems that the task of predicting the correct relation for sentences
from which the discourse marker has been removed is feasible for humans.
                Avg. Accuracy
Relation        random   bigrams      BT
contrast         20.00     33.11   43.64
explanation      20.00     75.39   64.45
result           20.00     16.21   47.86
summary          20.00     19.34   48.44
continuation     20.00     25.48   83.35
all              20.00     33.96   57.55
Table 1: Results for BoosTexter (BT) and two baselines (10-fold cross-validation)
ous marker (e.g., and) or no marker at all. For such relations, our approach
seems a better choice than the model proposed by Marcu & Echihabi (2002).
Lexical features (stems, words, lemmas) seem the most useful. Table 3
shows some of the words chosen by BoosTexter as particularly predictive of
a given relation. Most of the choices seem fairly intuitive. For instance, an
explanation relation is often signalled by tentatively qualifying adverbs
(e.g., perhaps), while summary and continuation relations frequently
contain pronouns and contrast can be signalled by words such as other
or still etc. Table 2 also suggests that the lexical items in the left span are
more important than those in the right. For example, left stems is 10% more accurate than right stems. This makes sense from a processing perspective: if the relation is already signalled in the left span, the sentence will be easier to process than if the signalling is delayed until the right span is read.
Relation Predictive Words
contrast        other, still, not, ...
explanation     perhaps, probably, mainly, ...
result          undoubtedly, so, indeed, ...
summary         their, this, yet, ...
continuation    you, it, there, ...
Table 3: Words chosen as cues for a relation
5 Conclusion
We have presented a machine learning method for automatically classify-
ing discourse relations in the absence of unambiguous discourse markers,
using a model which combines a wide variety of linguistic cues. The model
was trained on automatically labelled data and tested on five relations. We
found that this feature-rich model significantly outperformed the simpler
bigram model proposed by Marcu & Echihabi (2002), at least on our relatively small training set. Hence, our method is particularly suitable for
relations which are rarely signalled by (unambiguous) discourse markers
(e.g., continuation). In such cases, it is difficult to obtain sufficiently
large training sets that a bigram model will perform well, even if the data
is extracted from very large corpora.
So far we have only tested our method on automatically labelled data,
i.e., examples from which an originally present, unambiguous discourse
marker had been removed, not on examples which occur naturally with-
out such a marker. However, the latter are exactly the types of examples at
which our method is aimed. In future research, we thus intend to create a
small, manually labelled test corpus of examples which are not unambigu-
ously marked and test our method on it to determine whether our results
carry over to that data type. The RST Discourse Treebank (Carlson et al.
2002) could be used as a starting point for the corpus creation.
Acknowledgements. The research in this paper was supported by epsrc
grant number GR/R40036/01. We would like to thank Ben Hutchinson and
Mirella Lapata for interesting discussions, participating in our labelling experi-
ment and letting us use some of their Perl scripts. We are also grateful to the
reviewers for their helpful comments.
REFERENCES
Asher, Nicholas & Alex Lascarides. 2003. Logics of Conversation. Cambridge:
Cambridge University Press.
Tree Edit Distance for Textual Entailment
Milen Kouylekov & Bernardo Magnini
Abstract
This paper addresses Textual Entailment (i.e., recognizing that the
meaning of a text entails the meaning of another text) using a Tree
Edit Distance algorithm between the syntactic trees of the two texts.
It also compares the contribution of lexical resources for recognizing
textual entailment in an experiment carried on over the pascal-rte
dataset.
1 Introduction
The problem of language variability (i.e., the fact that the same information can be expressed with different words and syntactic constructs) has been attracting a lot of interest over the years and poses significant challenges for systems aimed at natural language understanding. The example
below shows that recognizing the equivalence of the statements came in
power, was prime-minister and stepped in as prime-minister is a challenging
problem.
• Ivan Kostov came in power in 1997.
• Ivan Kostov was prime-minister of Bulgaria from 1997 to 2001.
• Ivan Kostov stepped in as prime-minister 6 months after the December
1996 riots in Bulgaria.
While the language variability problem is well known in Computational
Linguistics, a general unifying framework has been proposed only recently
in (Dagan & Glickman 2004). In this approach, language variability is
addressed by defining the notion of entailment as a relation that holds
between two language expressions (i.e., a text t and a hypothesis h) if the meaning of h, as interpreted in the context of t, can be inferred from the meaning of t. The entailment relation is directional, as the meaning of one expression can entail the meaning of the other, while the opposite may not hold.
The Recognizing Textual Entailment (rte) task takes as input a T-H pair and consists in automatically determining whether an entailment relation holds between T and H or not. The task, potentially, covers almost all the phenomena in language variability: entailment can be due to lexical variations, as shown in example (1), to syntactic variation (example 2), to semantic inferences (example 3), or to complex combinations of all such phenomena.
3 System architecture
$$\mathit{score}(T, H) = \frac{ed(T, H)}{ed(\emptyset, H)} \qquad (1)$$

where ed(T, H) is the function that calculates the edit distance cost and ed(∅, H) is the cost of inserting the entire tree H. A similar approach is presented in (Monz & de Rijke 2001), where the entailment score of two documents d and d′ is calculated by comparing the sum of the weights (idf) of the terms that appear in both documents to the sum of the weights of all terms in d′.
We used a threshold t such that if score(T, H) < t then T entails H,
otherwise no entailment relation holds for the pair. To set the threshold
we have used both the positive and negative examples of the training set
provided by the pascal-rte dataset (see Section 5.1 for details).
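In outline, the decision procedure can be sketched as follows; the tree interface, the edit-distance routine (e.g., a Zhang & Shasha (1990) implementation), the per-node insertion cost (Equation 3), and the threshold value are all placeholders, not the authors' exact implementation.

```python
def entailment_score(t_tree, h_tree, edit_distance, insert_cost):
    """score(T, H) = ed(T, H) / ed(empty, H), as in Equation (1): the edit
    cost is normalised by the cost of inserting the entire tree H."""
    ed_empty_h = sum(insert_cost(node) for node in h_tree.nodes())
    return edit_distance(t_tree, h_tree) / ed_empty_h

def entails(t_tree, h_tree, edit_distance, insert_cost, threshold=0.6):
    """T entails H when the normalised cost falls below the threshold t,
    which is tuned on the training portion of the dataset (the default
    value here is illustrative)."""
    return entailment_score(t_tree, h_tree, edit_distance, insert_cost) < threshold
```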
$$\mathit{idf}(w) = \log \frac{N}{N_w} \qquad (3)$$
The most frequent words (e.g., stop words) have a zero cost of insertion.
$$Cost[Subs(w_1, w_2)] = Ins(w_2) \cdot (1 - Ent(w_1, w_2)) \qquad (4)$$
where Ins(w2 ) is calculated using (3) and Ent(w1 , w2 ) can be approximated
with a variety of relatedness functions between w1 and w2 .
There are two crucial issues for the definition of an effective function for lexical entailment: first, a database of entailment rules with enough coverage is necessary; second, we have to estimate a quantitative measure for such rules.
We experimented with the use of a dependency-based thesaurus available at http://www.cs.ualberta.ca/~lindek/downloads.htm. For each word, the thesaurus lists up to 200 most similar words and their similarities, calculated on a parsed corpus using frequency counts of the dependency triples. A complete review of the method, including a comparison with different approaches, is presented in (Lin 1998). Dependency triples consist of a head, a dependency type and a modifier. They can be viewed as features for the head and the modifiers in the triples when calculating similarity.
The cost of a substitution is calculated by the following formula:
where w1 is the word from t that is being replaced by the word w2 from h
and simth (w1 , w2 ) is the similarity between w1 and w2 in the thesaurus mul-
tiplied by the similarity between the corresponding relations. The similarity
between relations is stored in a database of relation similarities obtained by
comparing dependency relations from a parsed local corpus. The similar-
ities have values from 1 (very similar) to 0 (not similar). If there is no
similarity, the cost of substitution is equal to the cost of inserting the word
w2.
4.1 Dataset
For the experiments we have used the pascal-rte dataset (Dagan et al.
2005). The dataset has been collected by human annotators and is composed of 1367 text (T) - hypothesis (H) pairs split into positive and negative examples (a 50%-50% split). The examples came from seven different application scenarios (among them Information Retrieval, Question Answering, Machine Translation, Information Extraction, Reading Comprehension, Paraphrase Acquisition) where textual entailment is supposed to play a crucial role. Typically, T consists of one sentence while H is often a shorter sentence. The dataset has been split in a training (576 pairs) and a test
(800 pairs) part.
4.2 Experiments
The following configurations of the system have been tested.
4.3 Results
For each system tested we report a table with the following data:
• #attempted. The number of deletions, insertions and substitutions that the algorithm attempted (i.e., operations that are included in the best sequence of editing transformations).
• %success. The proportion of deletions, insertions and substitutions that the algorithm successfully applied, over the total attempted.
• accuracy. The proportion of T-H pairs correctly classified by the system over the total number of pairs.
• cws. Confidence Weighted Score: the T-H pairs are sorted by decreasing confidence, and the score averages, over all ranks i, the fraction of correct decisions among the top i pairs.
Table 1 shows the results obtained by the systems we tested.
Table 1: Results
The hypothesis that the deletion operation should have zero cost was confirmed by the results of system 2.
The similarity database used in system 3 increased the number of successful substitutions made by the algorithm from 25% to 27%. It also increased the performance of the system in both accuracy and cws. The impact of the similarity database on the results is small because of the low similarity between the dependency trees of H and T.
5 Discussion
The approach we have presented can be used as a framework for testing re-
sources for lexical entailment. The experiments show that using a similarity
database with the edit distance algorithm can be of benefit for recognizing
textual entailment.
In order to test the real performance of resources of lexical entailment rules, a set of pairs from the RTE dataset which require lexical entailment rules must be selected.
REFERENCES
Dagan, Ido & Oren Glickman. 2004. “Generic Applied Modeling of Language
Variability”. Proceedings of PASCAL Workshop on Learning Methods for
Text Understanding and Mining. Grenoble, France.
Dagan, Ido, Oren Glickman & Bernardo Magnini. 2005. “The PASCAL Recog-
nizing Textual Entailment Challenge”. Proceedings of PASCAL Workshop
on Recognizing Textual Entailment, 1-8. Southampton, U.K.
Fellbaum, Christiane. 1998. WordNet, an Electronic Lexical Database. Cam-
bridge, Mass.: MIT Press.
Lin, Dekang. 1998. “Dependency-based evaluation of MINIPAR”. Proceedings
of the Workshop on Evaluation of Parsing Systems (LREC-1998 ). Granada,
Spain.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity”. Pro-
ceedings of 15th International Conference on Machine Learning, 296-304.
Madison, Wisconsin.
Lin, Dekang & Patrick Pantel. 2001. “Discovery of Inference Rules for Question Answering”. Natural Language Engineering 7:4.343-360.
Monz, Christof & Maarten de Rijke. 2001. “Light-Weight Entailment Check-
ing for Computational Semantics”. Proceedings of The third workshop on
inference in computational semantics (ICoS-3 ), 59-72. Siena, Italy.
Punyakanok, Vasin, Dan Roth & Wen-tau Yih. 2004. “Mapping Dependencies
Trees: An Application to Question Answering”. Proceedings of AI & Math
2004. Fort Lauderdale, Florida.
Ratnaparkhi, Adwait. 1996. “A Maximum Entropy Part-Of-Speech Tagger”.
Proceeding of the Empirical Methods in Natural Language Processing Con-
ference, 491-497. Philadelphia, Pennsylvania.
Szpektor, Idan, Hristo Tanev, Ido Dagan & Bonaventura Coppola. 2004. “Scaling
Web-based Acquisition of Entailment Relations”. Proceedings of Conference
on Empirical Methods in Natural Language Processing (EMNLP-04 ), 41-48.
Barcelona, Spain.
Zhang, Kaizhong & Dennis Shasha. 1990. “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems”. SIAM Journal on Computing 18:6.1245-1262.
A Genetic Algorithm for Optimising Information
Retrieval with Linguistic Features in Question
Answering
Jörg Tiedemann
University of Groningen
Abstract
In this chapter we describe a genetic algorithm applied to the optimi-
sation of information retrieval (ir) within a Dutch question answer-
ing (qa) system. In our approach, information retrieval is used for
passage retrieval to filter out relevant text passages for subsequent
answer extraction modules in the qa system. We focus on the in-
tegration of linguistic features extracted from syntactic analyses of
the texts in the retrieval component. For this we apply output of
the Dutch dependency parser Alpino together with an off-the-shelf
ir engine. Query parameters are optimised using a genetic algorithm
to improve the overall performance of the qa system.
1 Introduction
Fig. 1: A dependency tree produced by Alpino: Het embargo tegen Irak werd
ingesteld na de inval in Koeweit in 1990. (The embargo against Iraq has
been declared after the invasion of Kuwait in 1990.)
The index described above is filled with the data from the machine anno-
tated clef corpus. Now it is possible to send complex queries to the ir
engine using the powerful query language supported by Lucene. There are
several features that we explore in our experiments:
Field restrictions: This is the basic parameter needed to search in the
multi-layer index. We have to specify the field we like to query with
the given keywords. For example, we can look into the text layer with
a query such as “text:(Irak Verenigde Naties)”
Keyword weighting: Each keyword in a query can be weighted with a
so-called “boost factor”. For example, we can weight the named entity
(ne) “Irak” with a weight of 2: “ne:(Irak^2 Verenigde_Naties)”
Required keywords: Lucene’s query language allows one to mark key-
words to be required in matching paragraphs using the character ’+’.
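These devices can be combined mechanically when generating queries. Below is a sketch that composes a Lucene query string from per-layer keywords and a settings map, where each keyword type carries either a boost weight or the required marker; the layer names are illustrative.

```python
def build_query(keywords_by_type, settings):
    """Compose a Lucene query string over the multi-layer index.
    settings maps a keyword type (index field) either to a boost
    weight or to the required marker '+'; field names are illustrative."""
    clauses = []
    for ktype, words in keywords_by_type.items():
        setting = settings.get(ktype)
        if setting is None:          # keyword type not used by this setting
            continue
        if setting == "+":
            terms = " ".join("+" + w for w in words)
        else:
            terms = " ".join(f"{w}^{setting}" for w in words)
        clauses.append(f"{ktype}:({terms})")
    return " ".join(clauses)

print(build_query({"text": ["Irak", "Verenigde_Naties"], "ne": ["Irak"]},
                  {"text": 1, "ne": 2}))
# text:(Irak^1 Verenigde_Naties^1) ne:(Irak^2)
```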
In our approach, we use ir as a black box filled with our data. We can
modify the input parameters and inspect the output produced but we do not
know exactly how the internal ranking is influenced by the given parameters.
However, we want to perform a systematic search for optimised queries to
improve the performance of the qa system. Here, we can use a randomised
optimisation process to iteratively search through possible settings with the
goal to find the optimal parameter setup. Genetic algorithms have proven to
be useful for such a task, running in a manner of a “focused trial-and-error”
beam search through a pre-defined hypothesis space. The goal in genetic
algorithms is to let a population of individuals find a way to better fitness in
their environment using evolutionary principles. Useful genes (=settings)
are kept and combined in the population from generation to generation.
Random changes are introduced using mutation.
Here, individuals refer to qa processes trying to answer a given set of
questions. Each individual differs in the settings used for ir. The parame-
ters are simply the keyword types used in the ir queries as defined in the
previous section. Here, each of these keyword types may have exactly one
weight or a required marker. Fitness is determined by the quality of the
result when running qa with the individual’s ir settings. We use mean
reciprocal ranks of the top five answers to measure this quality. One impor-
tant principle in genetic algorithms is “crossover/reproduction”, i.e. means
to produce new children for coming generations. Here, we define crossover
as the creation of a new qa process using the ir settings of two parents from
the current population. The settings of the two parents are simply merged.
In cases of overlaps we compute the arithmetic mean of parameters (i.e.
weights). Again, required markers overwrite weights on identical keyword
types. Parents are selected randomly without any preference mechanism.
Another important principle in genetic algorithms is “natural selection”.
This is usually done in a probabilistic way. However, we simplify this func-
tion to a top N search for the individuals with the highest fitness scores in
the current population for reasons we will explain below.
One of the beauties of genetic algorithms is the possibility to run par-
allel processes in order to explore the space of possible solutions. This is
especially useful in our case because question answering requires a lot of
processing power and time. It is crucial that we can run through a large
number of settings in order to find a (sub-)optimal solution. For our exper-
iments, we use a Beowulf Linux cluster to run up to 50 processes in parallel
(that is what we define to be the maximum number of children to “grow
up” at a time). We keep the size of the population small with only 25 in-
dividuals from which new children are created. This means that we define
our selection process to be a top 25 search among “living individuals”. This
makes it possible to apply the selection procedure each time a child reports
its fitness score. Hence, we do not have to wait for an entire new generation
which would be necessary when applying probabilistic selection.
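The main GA operations described above can be sketched as follows; the mutation rate, the population sizes, and the fitness callable (the MRR of a full QA run in the actual setup) are illustrative placeholders.

```python
import random

def crossover(p1, p2):
    """Merge two parents' IR settings: union of the keyword types; on
    overlap a required marker '+' wins, otherwise weights are averaged."""
    child = dict(p1)
    for ktype, value in p2.items():
        if ktype not in child:
            child[ktype] = value
        elif "+" in (value, child[ktype]):
            child[ktype] = "+"
        else:
            child[ktype] = (child[ktype] + value) / 2.0
    return child

def mutate(setting, rate=0.1):
    """Randomly perturb weights to keep exploring the search space."""
    return {k: (v if v == "+" or random.random() > rate
                else max(0.0, v + random.gauss(0.0, 0.5)))
            for k, v in setting.items()}

def evolve(population, fitness, top_n=25, children=50):
    """One step: breed children from the current population, score them
    (fitness would be the MRR of a QA run), and keep the best top_n."""
    offspring = [mutate(crossover(*random.sample(population, 2)))
                 for _ in range(children)]
    return sorted(population + offspring, key=fitness, reverse=True)[:top_n]
```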
4 Experiments
For our experiments we used data from the clef competition on Dutch qa. We used questions from the last three years, 2003-2005, for training and evaluation. These question sets have been annotated with their correct
answers and supporting documents from the clef corpus in which the an-
swer can be found (if they were found by one of the participating teams).
We used only those questions which have at least one answer and we use
simple string matching to determine if our system produces correct answers
(i.e. we used answer strings instead of document IDs for evaluation). We
randomly selected 250 questions for evaluation and used the remaining 522
questions for training.
The genetic algorithm has been initialised with single keyword type settings
using default settings for weights (=1) and no required markers. We limited
ourselves to a small set of possible constraints to select keywords of different
types. We included constraints on a sub-set of word classes (names, nouns,
adjectives and verbs) and a sub-set of dependency relations (direct object,
modifier, apposition and subject). Each combination of constraints has been
considered and applied to each of the index layers to form the set of initial
settings. However, as explained earlier, we pruned this set by discarding
keyword types according to the conditions defined in section 2.
After running the initial settings the genetic algorithm decides how to
combine them and how to develop further. Table 2 shows the results after
various numbers of settings tested.
number of settings training evaluation
300 0.419 0.413
450 0.426 0.421
600 0.429 0.429
750 0.436 0.436
900 0.436 0.436
1000 0.441 0.452
1200 0.447 0.455
1500 0.448 0.442
1700 0.449 0.442
baseline (plain text keywords) 0.407 0.412
Table 2: Improvements of QA performance by optimising IR query
parameters
We obtained scores above the baseline already after 300 individuals tested (initial settings included). Here, the baseline is the best performing initial setting, namely plain-text word keywords, which is similar to standard ir (Dutch stemming and stop word removal has been applied). In the top
part of Figure 2, the training curve is plotted for the 1700 settings. The
bottom part shows the corresponding scores on unseen evaluation data.
The thin black lines refer to the scores of individual settings tested in the
optimisation process. The solid bold lines refer to the top scores obtained
using the optimised query settings.
The plots illustrate the typical behaviour of genetic algorithms. The
random modifications cause the oscillation of the fitness scores plotted with
thin black lines. The algorithm picks up the advantageous features and
promotes them in the competitive selection process. This gives the gradual
improvements in terms of fitness of the top individual tested. The final
scores are well above the baseline (about 10% improvement in MRR). The
largest improvements can be observed in the beginning of the optimisation
process.
[Plots omitted: answer string MRR against the number of settings tested (0–1700); thin lines show individual settings, bold lines the top scores; training baseline 0.407, evaluation baseline 0.412.]
Fig. 2: Parameter optimisation with a genetic algorithm. Training (top)
and evaluation (bottom).
Lexico-Syntactic Subsumption for Textual Entailment
Vasile Rus & Arthur Graesser
The University of Memphis
Abstract
We present here a graph-based approach to the task of textual en-
tailment between a Text and Hypothesis. The approach takes into
account the full lexico-syntactic context of both the Text and Hy-
pothesis and relies heavily on the concept of subsumption. It starts
with mapping the Text and Hypothesis into graph-structures where
nodes represent concepts and edges represent lexico-syntactic rela-
tions among concepts. Based on a subsumption score between the
Text-graph and Hypothesis-graph an entailment decision is made. A
novel feature of our approach is the handling of negation. The results
obtained from a standard entailment test data set are better than
results reported by systems that use similar knowledge resources.
We also found that a tf-idf approach performs close to chance for
sentence-like Texts and Hypotheses.
1 Introduction
One such system, AutoTutor (Graesser, Lu, Jackson, Mitchell, Ventura, Olney
& Louwerse 2004; Graesser, VanLehn, Rose, Jordan & Harter 2001), is under
investigation. AutoTutor is an its that works by having a conversation with the
learner. AutoTutor appears as an animated agent that acts as a dialog
partner with the learner. The animated agent delivers AutoTutor’s dialog
moves with synthesized speech, intonation, facial expressions, and gestures.
Students are encouraged to articulate lengthy answers that exhibit deep
reasoning, rather than to recite small bits of shallow knowledge. It is in the
deep knowledge and reasoning part where we explore the potential help of
entailment.
The rest of the paper is structured as follows. Section 2 outlines re-
lated work. Section 3 describes our approach in detail. Section 4 presents
the experiments we performed, results and a comparative view of similar
approaches. The Conclusions section wraps up the paper.
2 Related work
The task of textual entailment has recently been treated, in one form or an-
other, by research groups ranging from information retrieval to language
processing, leading to a large spectrum of approaches, from tf-idf (term fre-
quency - inverse document frequency) to knowledge-intensive ones.
In one of the earliest explicit treatments of entailment, Monz & de Rijke
(2001) proposed a weighted bag of words approach. They argued that tra-
ditional inference systems based on first order logic are limited to yes/no
judgements when it comes to entailment tasks whereas their approach de-
livered “graded outcomes”. They established entailment relations among
larger pieces of text (4 sentences on average) than the proposed rte setup
where the text size is a sentence (seldom two) or part of a sentence (phrase).
We replicated their approach, for comparison purposes, and report its per-
formance on the rte test data.
Pazienza and colleagues (Pazienza, Pennacchiotti & Zanzotto 2005) use
a syntactic graph distance approach for the task of textual entailment. The
set of dependencies, or grammatical relations, they use is different from
ours. Further, it is not clear if they handle remote dependencies or not.
Among the drawbacks of their work, as compared to ours, is the lack of
negation handling.
Recently, Dagan & Glickman (2004) presented a probabilistic approach
to textual entailment based on lexico-syntactic structures. They use a
knowledge base with entailment patterns and a set of inference rules. The
patterns are composed of a pattern structure (entailing template → entailed
template) and a quantity that tells the probability that a text which entails
the entailing template also entails the entailed template. This is a good
example of a knowledge-intensive approach.
LEXICO-SYNTACTIC SUBSUMPTION FOR TEXTUAL ENTAILMENT 189
3 Approach
Our solution for recognizing textual entailment is based on the idea of sub-
sumption. In general, an object X subsumes an object Y if X is more general
than or identical to Y, or alternatively we say Y is more specific than X.
The same idea applies to more complex objects, such as structures of in-
terrelated objects. Applied to textual entailment, subsumption translates
into the following: hypothesis H is entailed by text T if and only if T
subsumes H. We use graph representations for the text T and hypothesis H
to implement the idea of subsumption.
Fig. 1: Example of dependency graph for test pair #1981 from RTE (edges
are in grayscale to better visualize the correspondence)
The two text fragments involved in a textual entailment decision are initially
mapped into a graph representation that has its roots in the dependency-
graph formalisms of Hays (1964) and Mel’cuk (1998). The mapping process
has three phases: preprocessing, dependency graph generation, and final
graph generation.
[Figure omitted: dependency structure over the sentence “The bombers had not managed to enter the embassy compounds.”]
The overall entailment score is computed as a weighted sum of the node and
relation matching scores, and is adjusted after considering negation relations:
$$\mathit{entscore}(T, H) = \left( \alpha \times \frac{\sum_{V_h \in HV} \max_{V_t \in TV} \mathit{match}(V_h, V_t)}{|V_h|} + \beta \times \frac{\sum_{E_h \in HE} \max_{E_t \in TE} \mathit{synt\_match}(E_h, E_t)}{|E_h|} + \gamma \right) \times \frac{1 + (-1)^{\#neg\_rel}}{2} \qquad (1)$$
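Read literally, equation (1) averages, for every hypothesis node (edge), its best match against the text's nodes (edges), and zeroes the score when the number of negation relations is odd. A minimal sketch of this computation, with match and synt_match as hypothetical stand-ins for the paper's WordNet-based node and edge matching:

def ent_score(T, H, match, synt_match,
              alpha=0.5, beta=0.5, gamma=0.0, num_neg_rel=0):
    """Sketch of Equation (1). T and H are (vertices, edges) pairs; the
    vertex and edge lists are assumed non-empty."""
    TV, TE = T
    HV, HE = H
    node_term = sum(max(match(vh, vt) for vt in TV) for vh in HV) / len(HV)
    edge_term = sum(max(synt_match(eh, et) for et in TE) for eh in HE) / len(HE)
    # The last factor is 1 when #neg_rel is even and 0 when it is odd.
    neg_factor = (1 + (-1) ** num_neg_rel) / 2
    return (alpha * node_term + beta * edge_term + gamma) * neg_factor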
n is the number of the pairs in the test set, and i ranges over the pairs.
The Confidence-Weighted Score varies from 0 (no correct judgements at
all) to 1 (perfect score), and rewards the systems’ ability to assign a higher
confidence score to the correct judgements than to the wrong ones. Precision
was also proposed and represents the percentage of positive cases that were
correctly solved. It is not used as much, and we only rarely report it in what follows.
4.2 Results
In this section we first provide some insights on several baseline approaches
and then show how our lexico-syntactic approach performs on the rte data.
It is arguable what the baseline for entailment is. In Dagan, Glickman
& Magnini (2005), the first suggested baseline is the method of blindly
and consistently guessing true or false for all test pairs. Since the test
data was balanced between false and true outcomes, this blind baseline
would provide an average accuracy of 0.50. Randomly predicting true or
false is another blind method; compared with it, a run is better than chance
at the 0.05 level if cws > 0.540 (accuracy > 0.535), and at the 0.01 level if
cws > 0.558 (accuracy > 0.546).
Every score below a certain threshold leads to a false entailment and every-
thing above leads to true entailment. We obtained the optimal threshold
from different runs with different thresholds (0.1, 0.2, ..., 0.9) on the devel-
opment data. The results for the test data presented in the second row in
Table 1 are from the run with the optimal threshold.
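A small sketch of this tuning step (illustrative only; dev_scores and dev_labels are hypothetical names for the development-set entailment scores and gold judgements):

def tune_threshold(dev_scores, dev_labels):
    """Pick the decision threshold (0.1, 0.2, ..., 0.9) that maximises
    accuracy on the development data."""
    best_thr, best_acc = 0.5, -1.0
    for i in range(1, 10):
        thr = i / 10
        preds = [s > thr for s in dev_scores]
        acc = sum(p == g for p, g in zip(preds, dev_labels)) / len(dev_labels)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr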
The third row in the table shows the results on test data obtained with the
proposed graph-based method. Initially we used linear regression to esti-
mate the values of the parameters but then switched to a balanced weighting
(α = β = 0.5, γ = 0) which provided better results on development data.
Depending on the value of the overall score three levels of confidence were
assigned: 1, 0.75, 0.5. For instance, an overall score of 0 led to false en-
tailment with maximum confidence of 1. The results of this lexico-syntactic
approach on test data are significant at the 0.01 level.
The bottom rows in Table 1 replicate, for comparison purposes, the
results of systems that participated in the rte Challenge (Dagan, Glickman
& Magnini 2005). We picked the best results (some systems report results
for more than one run) for runs that use resources similar to ours: word
overlap, WordNet, and syntactic matching.
5 Conclusions
REFERENCES
Dagan, Ido, Oren Glickman & Bernardo Magnini. 2005. “The PASCAL Recog-
nising Textual Entailment Challenge”. Proceedings of the Recognizing Textual
Entailment Challenge Workshop (RTE ), 1-8. Southampton, U.K.
Graesser, Arthur, Kurt VanLehn, Carolyn Rose, Pamela Jordan & Derek Har-
ter. 2001. “Intelligent Tutoring Systems with Conversational Dialogue”. AI
Magazine 22:4.39-51.
Graesser, Arthur, Shulan Lu, Tanner Jackson, Heather Mitchell, Mathew Ven-
tura, Andrew Olney & Max Louwerse. 2004. “Autotutor: A Tutor with
Dialogue in Natural Language”. Behavioral Research Methods, Instruments,
and Computers, 36:2.180-193.
Hays, David. 1964. “Dependency Theory: A Formalism and Some Observations”.
Language 40:4.511-525.
Magerman, David. 1994. Natural Language Parsing as Statistical Pattern Recog-
nition. Unpublished PhD thesis, Stanford University, February 1994.
Mel’cuk, Igor. 1998. Dependency Syntax: Theory and Practice. Albany, NY:
State University of New York Press.
Miller, George. 1995. “WordNet: A Lexical Database for English”. Communi-
cations of the ACM (November) 38:11.39-41.
Moldovan, Dan & Vasile Rus. 2001. “Logic Form Transformation of WordNet
and its Applicability to Question Answering”. Proceedings of the 39th Annual
Meeting of the Association for Computational Linguistics (ACL-2001 ), 394-
401. Toulouse, France.
Monz, Christof & Maarten de Rijke. 2001. “Light-Weight Entailment Checking
for Computational Semantics”. Proceedings of Inference in Computational
Semantics (ICoS-3 ), ed. by P. Blackburn & M. Kohlhase, 59-72. Schloss
Dagstuhl, Germany.
Pazienza, Maria, Marco Pennacchiotti & Fabio Zanzotto. 2005. “Textual En-
tailment as Syntactic Graph Distance: A Rule Based and SVM Based Ap-
proach”. Proceedings of the Recognizing Textual Entailment Challenge Work-
shop (RTE ), 25-29. Southampton, U.K.
Rus, Vasile & Kirtan Desai. 2005. “Assigning Function Tags with a Simple
Model”. Proceedings of the Conference on Intelligent Text Processing and
Computational Linguistics (CICLing), 112-115. Mexico City, Mexico.
Rus, Vasile, Arthur Graesser & Kirtan Desai. 2005. “Lexico-Syntactic Sub-
sumption for Textual Entailment”. Proceedings of the International Conference
on Recent Advances in Natural Language Processing (RANLP 2005 ), 444-451.
Borovets, Bulgaria.
Skiena, S. S. 1998. The Algorithm Design Manual. New York: Springer-Verlag.
A Knowledge-based Approach to
Text-to-Text Similarity
Courtney Corley, Andras Csomai & Rada Mihalcea
Dept. of Computer Science, University of North Texas
Abstract
In this paper, we present a knowledge-based method for measuring
the semantic similarity of texts. Through experiments performed
on two different applications: (1) paraphrase and entailment iden-
tification, and (2) word sense similarity, we show that this method
outperforms the traditional text similarity metrics based on lexical
matching.
1 Introduction
Measures of text similarity have been used for a long time in applications
in natural language processing and related areas. The typical approach to
finding the similarity between two text segments is to use a simple lexical
matching method, and produce a similarity score based on the number of
lexical units that occur in both input segments. Improvements to this sim-
ple method have considered stemming, stop-word removal, part-of-speech
tagging, longest subsequence matching, as well as various weighting and
normalization factors (Salton & Buckley 1997). While successful to a certain de-
gree, these lexical similarity methods fail to identify the semantic similarity
of texts. For instance, there is an obvious similarity between the text seg-
ments I own a dog and I have an animal, but most of the current similarity
metrics will fail in identifying a connection between these texts.
In this paper, we explore a knowledge-based method for measuring the
semantic similarity of texts. While there are several methods previously
proposed for finding the semantic similarity of words, to our knowledge the
application of these word-oriented methods to text similarity has not yet
been explored. We introduce an algorithm that combines the word-to-word
similarity metrics into a text-to-text semantic similarity metric, and we
show that this method outperforms the simpler lexical matching similarity
approach, as measured in two different applications: (1) paraphrase and
entailment identification, and (2) word sense similarity.
Given two input text segments, we want to automatically derive a score that
indicates their similarity at the semantic level, thus going beyond the simple
lexical matching. One metric we consider is that of Wu & Palmer (Wu &
Palmer 1994), which measures the depth of the two concepts in the WordNet
taxonomy and the depth of their least common subsumer (LCS), and combines
these figures into a similarity score:

$$\mathit{Sim}_{wup} = \frac{2 \times depth(\mathit{LCS})}{depth(concept_1) + depth(concept_2)} \qquad (2)$$
Finally, the last similarity metric we consider is Jiang & Conrath (Jiang
& Conrath 1997), which returns a score determined by:

$$\mathit{Sim}_{jnc} = \frac{1}{IC(concept_1) + IC(concept_2) - 2 \times IC(\mathit{LCS})} \qquad (5)$$
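Both of these word-to-word measures are available in off-the-shelf toolkits; as one illustration (an assumption about tooling, not the implementation used by the authors), NLTK exposes them over WordNet:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Requires nltk.download('wordnet') and nltk.download('wordnet_ic').
# Information content from the Brown corpus, needed by Jiang & Conrath.
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.wup_similarity(cat))            # Wu & Palmer, as in Equation (2)
print(dog.jcn_similarity(cat, brown_ic))  # Jiang & Conrath, as in Equation (5)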
The resulting text-level score, which has a value between 0 and 1, is a measure
of the directional similarity, in this case computed with respect to $T_i$. The scores from both
directions can be combined into a bidirectional similarity using a simple
product function:
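A minimal sketch of this combination (sim_directional is a hypothetical name for the directional scorer described above):

def sim_bidirectional(t1, t2, sim_directional):
    """Combine the two directional scores with a simple product.
    sim_directional(a, b) is assumed to return the similarity of the
    segment pair computed with respect to a, with values in [0, 1]."""
    return sim_directional(t1, t2) * sim_directional(t2, t1)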
[2] The reason behind this decision is the fact that most of the semantic similarity measures apply only to nouns and verbs, and there are only one or two relatedness metrics that can be applied to adjectives and adverbs.
[3] All similarity scores have a value between 0 and 1. The similarity threshold can also be set to a value larger than 0, which would result in tighter measures of similarity.
The evaluation counts the correctly identified true or false classifications in the test data set. We also
measure precision, recall and F-measure, calculated with respect to the true
values in each of the test data sets. Tables 1 and 2 show the results obtained
in paraphrase and entailment recognition. We also evaluate a metric that
combines all the similarity measures, including the lexical similarity, using
a simple average, with results indicated in the Combined row.
A further table reports the results obtained with the semantic similarity metrics, as well as the two baselines.
Note that all evaluation measures take values between 0 and 1, with smaller
entropy values and higher purity values representing clusters of higher qual-
ity (a perfect match with the gold standard would be represented by a set
of clusters with entropy of 0 and purity of 1).
All the semantic similarity measures lead to sense clusters that are better
than those obtained with the lexical similarity measure. The best results are
obtained with the Lin metric, for an overall entropy of 0.163 and a purity of
0.835, on par with the Combined measure that combines all the
individual metrics. It is worth noting that the random baseline establishes
a rather high lower bound: 0.248 entropy and 0.761 purity. Considering
these values as the origin of a 0–100% evaluation scale, the improvements
brought by the best semantic similarity measure (Lin) relative to the lexical
matching baseline translate into 11.6% and 6.7% improvement for entropy
and purity respectively. These figures are competitive with previously pub-
lished sense clustering results (Chklovski & Mihalcea 2003).
The metric combination does improve over the random and vectorial similarity baselines. Once
again, the combination of similarity metrics gives the highest accuracy, mea-
sured at 58.3%, with a slight improvement observed in the supervised set-
ting, where the highest accuracy was measured at 58.9%. Both these figures
are competitive with the best results achieved during the Pascal entail-
ment evaluation (Dagan et al. 2005).
For the word sense similarity application, the clusters obtained using the
semantic similarity metrics have higher purity and lower entropy as com-
pared to those generated based on the simpler lexical matching measure.
The best clustering solution is obtained with the Lin similarity metric, for
an entropy of 0.163 and a purity of 0.835, which represent a clear improve-
ment with respect to the lexical matching baseline, taking also into account
the competitive lower bound obtained through random clustering.
Although our method relies on a bag-of-words approach, as it turns
out the use of measures of semantic similarity improves significantly over
the traditional lexical matching metrics [5]. We are nonetheless aware that
a bag-of-words approach ignores many of the important relationships in sen-
tence structure, such as dependencies between words, or roles played by the
various arguments in the sentence. Future work will consider the investi-
gation of more sophisticated representations of sentence structure, such as
first order predicate logic or semantic parse trees, which should allow for
the implementation of more effective measures of text semantic similarity.
REFERENCES
Budanitsky, Alexander & Graeme Hirst. 2001. “Semantic Distance in WordNet:
An Experimental, Application-oriented Evaluation of Five Measures”. Pro-
ceedings of the Workshop on WordNet and Other Lexical Resources, 29-34.
Pittsburgh, Pennsylvania.
Chklovski, Timothy & Rada Mihalcea. 2003. “Exploiting Agreement and Dis-
agreement of Human Annotators for Word Sense Disambiguation”. Proceed-
ings of the International Conference on Recent Advances in Natural Language
Processing (RANLP ), 98-104. Borovets, Bulgaria.
Corley, Courtney, Andras Csomai & Rada Mihalcea. 2005. “Text Semantic Sim-
ilarity, with Applications”. Proceedings of the International Conference on
Recent Advances in Natural Language Processing (RANLP ), 173-180.
Borovets, Bulgaria.
Dagan, Ido, Oren Glickman & Bernardo Magnini. 2005. “The PASCAL Recog-
nising Textual Entailment Challenge”. Proceedings of the PASCAL Work-
shop, 1-8. Southampton, United Kingdom.
[5] The improvement of the combined semantic similarity metric over the simpler lexical similarity measure was found to be statistically significant in all experiments, using a paired t-test (p < 0.001).
Dolan, William B., Chris Quirk & Chris Brockett. 2004. “Unsupervised Con-
struction of Large Paraphrase Corpora: Exploiting Massively Parallel News
Sources”. Proceedings of the 20th International Conference on Computational
Linguistics (COLING), 250-356. Geneva, Switzerland.
Jain, Anil K. & Richard C. Dubes. 1988. Algorithms for Clustering Data. En-
glewood Cliffs, New Jersey: Prentice Hall.
Jiang, Jay J. & David W. Conrath. 1997. “Semantic Similarity Based on Corpus
Statistics and Lexical Taxonomy”. Proceedings of the International Confer-
ence on Research in Computational Linguistics, 19-33. Taiwan.
Leacock, Claudia & Martin Chodorow. 1998. “Combining Local Context and
WordNet Sense Similarity for Word Sense Identification”. WordNet, An
Electronic Lexical Database ed. by Christiane Fellbaum, 265-283. Cambridge,
Mass.: MIT Press.
Lesk, Michael E. 1986. “Automatic Sense Disambiguation Using Machine Read-
able Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone”. Pro-
ceedings of the Special Interest Group on the Design of Communication Con-
ference (SIGDOC ), 24-26. Toronto, Canada.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity”. Pro-
ceedings of the 15th International Conference on Machine Learning (ICML),
296-304. Madison, Wisconsin.
Miller, George A. 1995. “Wordnet: A Lexical Database”. Communications of the
ACM 38:11.39-41.
Palmer, Martha & Hoa Trang Dang. Forthcoming. “Making Fine-Grained and
Coarse-Grained Sense Distinctions, Manually and Automatically”. To appear
in Natural Language Engineering.
Patwardhan, Siddharth, Satanjeev Banerjee & Ted Pedersen. 2003. “Using Mea-
sures of Semantic Relatedness for Word Sense Disambiguation”. Proceedings
of the Fourth International Conference on Intelligent Text Processing and
Computational Linguistics (CICLING), 241-257. Mexico City, Mexico.
Resnik, Philip. 1995. “Using Information Content to Evaluate Semantic Similar-
ity in a Taxonomy”. Proceedings of the 14th International Joint Conference
on Artificial Intelligence, 448-453. Montreal, Canada.
Salton, Gerald & Chris Buckley. 1997. “Term Weighting Approaches In Au-
tomatic Text Retrieval”. Readings in Information Retrieval, 323-328. San
Francisco: Morgan Kaufmann.
Sparck-Jones, Karen. 1972. “A Statistical Interpretation of Term Specificity and
its Application in Retrieval”. Journal of Documentation 28:1.11-21.
Wu, Zhibiao & Martha Palmer. 1994. “Verb Semantics and Lexical Selection”.
Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics (ACL), 133-148. Las Cruces, New Mexico.
Zhao, Ying & George Karypis. 2001. “Criterion Functions for Document Cluster-
ing: Experiments and Analysis”. Technical Report (TR 01-40). Minneapolis,
Minnesota: Department of Computer Science, Univ. of Minnesota.
A Simple WWW-based Method
for Semantic Word Class Acquisition
Keiji Shinzato∗ & Kentaro Torisawa∗∗
∗Kyoto University & ∗∗Japan Advanced Institute of Science and Technology
Abstract
This chapter describes a simple method to obtain semantic word
classes from html documents. We previously showed that itemiza-
tions in html documents can contain semantically coherent word
classes. However, not all the itemizations are semantically coherent.
Our goal is to provide a simple method to extract only semantically
coherent itemizations from html documents. Our new method can
perform this task by obtaining hit counts from an existing search
engine 2n times for an itemization consisting of n items.
1 Introduction
There are many natural language processing tasks in which semantically co-
herent word classes can play an important role, and many automatic meth-
ods for obtaining word classes (or semantic similarities) have been proposed
(Church & Hanks 1989, Hindle 1990, Lin 1998, Pantel & Lin 2002). Most
of these methods rely on particular types of word co-occurrence frequencies,
such as noun-verb co-occurrences, obtained from parsed corpora.
On the other hand, we previously showed that itemizations in html
documents on the World Wide Web (www), such as that in Figure 1, can
contain semantically coherent word classes (Shinzato & Torisawa 2004). We
say a class of words is coherent if it contains only words that are semanti-
cally similar to each other and that have a common hypernym other than
trivial hypernyms such as things or objects. The expressions in Figure 1
have common non-trivial hypernyms such as “Record shops”, and consti-
tute a semantically coherent word class. Since one can find a huge number
of html itemizations throughout the www, we can expect a huge number
of semantic classes to be available. However, itemizations do not always
contain semantically coherent word classes. Many itemizations are used
just for formatting html documents properly and contain semantically in-
coherent items. We therefore need an automatic method to filter out such
inappropriate itemizations.
In this chapter, we present a simple filtering method to obtain only
semantically coherent word classes from itemizations in html documents.
The method calculates the strength of association between items in a word class
208 KEIJI SHINZATO & KENTARO TORISAWA
extracted from html itemizations using hit counts obtained from a search
engine, and tries to exclude semantically incoherent word classes according
to this strength. Our new method is simple in the sense that all it requires
are hit counts from a search engine, mutual information regarding the hit
counts, and an implementation of Support Vector Machines (Vapnik 1995).
The method is also efficient and lightweight in the sense that it can perform
this task just by obtaining hit counts at most 2n times for a word class
consisting of n items. There is no need to download and analyze a large
number of html documents.
We have tested the effectiveness of our method through experiments
using html documents and human subjects. A problem is that it is difficult
to set a rigorous evaluation criterion for semantic word classes. We try to
solve this problem using a manually tailored thesaurus.
In this chapter, Section 2 reviews previous work and Section 3 describes
our proposed method. Our experimental results, obtained using Japanese
html documents, are presented in Section 4.
2 Previous work
A key difference from our new algorithm, though, is that hram requires a large
amount of time and is not appropriate as a filter to obtain a large number
of word classes in a short time period. It needs to download a considerable
number of texts, parse at least part of the texts, and count the occurrence
frequencies of many words. Our aim is to skip such heavyweight processes
and provide a lightweight filter.
As another type of method for obtaining semantically coherent word
classes, there are many methods to automatically generate semantically co-
herent word classes from normal texts. Most of these collect the contexts in
which an expression appears and calculate similarities between the contexts
for the expressions to constitute a word class. In Lin’s work (Lin 1998),
for instance, the contexts for an expression are represented as a set of syn-
tactic dependency relations in which the expression appears. Many others
have taken similar approaches and here we cite just a few examples (Hindle
1990, Pantel & Lin 2002). The difference between these methods and our
work is that they assume rather complex contexts such as dependency rela-
tions. We do not assume such complex contexts obtained by using parsers
or parsed corpora. We only need to obtain the numbers of documents that
include expressions in the same itemization using a search engine. This
type of (co-occurrence) frequency (or statistical values computed from
such frequencies) is known to be useful for detecting semantic relat-
edness between two expressions (Church & Hanks 1989). The relatedness
is not limited to semantic similarities, though, which we need to compute
to obtain semantic word classes. For instance, Church found that while
“doctor” and “nurse” have a large mutual information value, “doctor” and
“bill” also have a large value. We do not think “doctor” and “bill” are sim-
ilar, though they are somehow related. Our trick for excluding such word
pairs from our semantic class is to use itemizations in html documents. We
assume that expression pairs having semantic relatedness other than strong
semantic similarities are unlikely to be included in the same itemization.
For instance, “doctor” and “bill” will not appear in the same itemization.
3 Proposed method
The procedure was designed based on the following two assumptions. Step
1 corresponds to Assumption A, and Step 2 to Assumption B.
Assumption A: At least some iess are semantically coherent.
Assumption B: Expressions in a semantically coherent ies are likely to
co-occur in the same document.
To judge the semantic coherence among the expressions in an ies, according
to Assumption B, we estimate strength of co-occurrences between pairs of
expressions. As this strength, we use simple document frequencies and pair-
wise mutual information (Church & Hanks 1989). Our algorithm computes
these values for pairs of expressions in the same ies. An important point
is that our algorithm does not compute the values for all the possible pairs
in an ies, although the most straightforward implementation of Assumption
B would be an algorithm that does so. Instead, our algorithm randomly
generates only n pairs of expressions from an ies consisting of n expressions
and then calculates document frequency and mutual information for each
pair. The parameters required to compute the values are obtained from an
existing search engine. The number of queries to be given to the engine is
just 2n for an ies consisting of n expressions. Note that if we computed the
strength for all possible pairs, we would need to issue n(n−1)/2 + n queries to the
search engine. Our method is thus much more efficient than this exhaustive
algorithm in terms of the number of queries. The details of Steps 1 and 2
are described below.
$$I(e_1, e_2) = \log_2 \frac{docs(e_1, e_2)/N}{docs(e_1)/N \times docs(e_2)/N}$$

where docs(e) is the number of documents including an expression e and
docs(e_i, e_j) is the number of documents including the two expressions e_i and e_j.
We estimate docs(e) and docs(e_i, e_j) using a search engine, which is “goo”
(http://www.goo.ne.jp/) in our experiments. N is the total number of
documents, and we used 4.2 × 10^9 as N according to “goo.” Note that we
used −10^9 as the logarithm of 0 in calculating the mutual information values.
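A compact sketch of this computation (search_hits is a hypothetical stand-in for the search-engine hit-count interface; caching the n single-expression counts keeps the total at 2n queries per ies of n items):

import math
import random

N = 4.2e9  # total number of documents, per the "goo" search engine

def mutual_information(e1, e2, search_hits):
    """Pairwise mutual information from document hit counts; search_hits(q)
    is assumed to return the number of documents matching query q."""
    joint = search_hits(f'"{e1}" "{e2}"')
    if joint == 0:
        return -1e9  # the paper's stand-in for log2(0)
    return math.log2((joint / N) /
                     ((search_hits(e1) / N) * (search_hits(e2) / N)))

def sampled_pairs(ies):
    """Randomly generate only n pairs from an ies of n expressions, so that
    at most 2n search queries are needed instead of n(n-1)/2 + n."""
    return [tuple(random.sample(ies, 2)) for _ in ies]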
Consider the following iess:
A {Tower Records, HMV, Virgin Megastores, RECOfan}
B {Gift Certificates, International, New Releases, Top Sellers, Today’s Deals}
We think that Set A is semantically coherent while Set B is semantically
incoherent. The pairwise mutual information values computed for each ies
1 Sum of the I(ei, ej) values of all pairs.
2 Average of the I(ei, ej) values of all pairs.
3 Largest I(ei, ej) of a pair in P.
4 2nd largest I(ei, ej) of a pair in P.
5 Smallest I(ei, ej) of a pair in P.
6 2nd smallest I(ei, ej) of a pair in P.
7 Sum of the docs(ei, ej) values of all pairs.
8 Average of the docs(ei, ej) values of all pairs.
9 Largest docs(ei, ej) of a pair in P.
10 2nd largest docs(ei, ej) of a pair in P.
11 Smallest docs(ei, ej) of a pair in P.
12 2nd smallest docs(ei, ej) of a pair in P.
13 Number of pairs whose docs(ei, ej) is 0.
14 Number of items in an IES.
15 Number of items whose docs(e) is 0.
16 Sum of the hit counts for all items in an IES.
17 Average hit count for all items in an IES.
18 Largest hit count for an item in an IES.
19 2nd largest hit count for an item in an IES.
20 Smallest hit count for an item in an IES.
21 2nd smallest hit count for an item in an IES.
P: a set of pairs of two randomly selected items in an IES.
Table 2: Features describing an IES (IDs and descriptions)
are listed in Table 1. These values for the pairs in Set A are all positive
and larger than those for the pairs in Set B. This roughly means that every
pair in Set A co-occurs much more frequently than would be expected from
the frequencies of each item in the pair alone, assuming independence of the
occurrences of the items. In addition, the differences between the actual
co-occurrence frequencies and the expected ones are larger in Set A than
in Set B. We expect to be able to select semantically coherent iess by
looking at such differences in mutual information values.
We used the features listed in Table 2. Most of the features are
mutual information values and hit counts for expression pairs, but they also include
hit counts for single expressions and some other items which we expect to be
useful in our task. Note that we used only the largest, the second largest,
the smallest, and the second smallest mutual information values, co-occurrence
frequencies, and hit counts as features (features with ids 3 to 6, 9 to 12, and
18 to 21). Since we restricted the iess given to our method to those
that have more than three expressions, the feature values are always defined.
Finally, our method ranks the iess according to output values of the svm
(i.e., values of the decision function of the svm) and produces only the top
M iess as final outputs. We assume that the value of the decision function
indicates the likelihood that a given ies is semantically coherent; more
precisely, the larger the value of the decision function for a given ies, the
more likely the ies is to be semantically coherent. By producing only the
iess with large decision function values as final outputs, we expect to obtain
the semantically coherent iess with relatively high precision.
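A sketch of this ranking step, using scikit-learn's SVC as a stand-in for TinySVM (the paper uses TinySVM's ANOVA kernel of degree 2, which scikit-learn does not provide; the RBF kernel below is purely illustrative):

from sklearn.svm import SVC

def rank_iess(train_features, train_labels, test_features, test_iess, M):
    """Train on Coherent/Incoherent labels, rank test IESs by the SVM
    decision function, and output only the top M."""
    clf = SVC(kernel='rbf')  # stand-in; not the paper's ANOVA kernel
    clf.fit(train_features, train_labels)
    scores = clf.decision_function(test_features)
    ranked = sorted(zip(scores, test_iess), key=lambda p: p[0], reverse=True)
    return [ies for _, ies in ranked[:M]]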
4 Experiments
The iess were extracted as described in Section 3.1. We randomly picked 800
sets from the extracted iess as our test set; it contained 5,227 expressions
in total. As our training set for the svms, we randomly selected 400 sets,
which included 2,541 expressions.
The training set was annotated with Coherent/Incoherent labels by the au-
thors according to an evaluation scheme for iess, as described in a later
section. Note that the test set and the training set were chosen so that the
two sets do not have any common expressions. We chose TinySVM as an
implementation of svms. As the kernel function, we used the anova kernel
of degree 2 provided in TinySVM. This choice was made according to the
observations obtained in experiments using the training set. Other types
of kernel provided in TinySVM did not converge during the training or did
not indicate high performance on the training set.
[Figure 2 omitted: two graphs, (A) and (B), plotting precision (%) of the acquired word classes (y-axis, 30–100) against the number of word classes (x-axis, 0–200), for Proposed Method, Proposed Method (STRICT), and hram.]
The thesaurus consists of word classes, which are labeled by a Japanese expression naming the class, and the classes
are organized into a hierarchical structure. We tried to make a list of trivial
hypernyms from the thesaurus according to the following steps. First, we
extracted 245 nouns included in the classes that are located in the top five
levels in the hierarchy, and then manually checked whether the extracted
nouns should be regarded as trivial hypernyms. As a result, we could obtain
164 trivial hypernyms, such as the Japanese words for “individual” and “phenomena.”
We then removed the trivial hypernyms from the set of general nouns in
the thesaurus. As a result, we could obtain the list of 92,002 nouns, and
assumed that it is a list of non-trivial (possible) hypernyms.
We asked four human subjects to evaluate the acquired iess. The sub-
jects were asked whether, for each ies, they could come up with a common
hypernym from the obtained non-trivial hypernym list. More precisely, the
subjects were asked to give a common hypernym of a given ies to our evaluation tool.
The subjects could proceed to the evaluation of the next ies only when the
tool found the given hypernym in the non-trivial hypernym list, or when the
subjects told the tool that they could not find any non-trivial hypernym.
In this way, we prevented the subjects from choosing trivial hypernyms.
We evaluated the word classes produced from the outputs of both methods. The performances are shown in
Figure 2. Graph (A) shows the precisions of the methods when we assume
that semantically coherent iess are only those accepted by three
or more of the four human subjects, while graph (B) indicates the
precisions of the methods when semantically coherent iess are only those
that all four subjects accepted. In both graphs, the y-axis indicates
the precision of the acquired word classes, while the x-axis indicates the
number of classes. “Proposed Method” refers to the precisions achieved by
our method and “hram” indicates the precisions obtained by hram. The
evaluation of both methods were done according to CRITERION defined
before. “Proposed Method (STRICT)” is the results of our method when
we used CRITERION (STRICT).
From both graphs, we can see that our method outperforms hram.
If we look at the results evaluated according to CRITERION, the precision
of our method in graph (A) reached 88% for the top 100 classes, which
correspond to 12.5% of all the given iess in the test set. For the top 200 classes
(25% of all the iess in the test set), the precision was about 80%. The kappa
statistic for measuring the inter-rater agreement was 0.69 for our method.
For hram, the statistic was 0.78. These values indicate that our subjects
had good agreement (Landis & Koch 1977). Table 3 shows examples of the
iess obtained by our method.
5 Conclusions
Our method ranks itemizations collected from the www according to the
decision function of svms, and produces highly ranked itemizations as out-
puts. In our experiments, when the top 10% of collected itemizations were
produced, at least three of the four human subjects regarded about 80%
of the itemizations as sets of semantically similar expressions for which the
subjects could come up with non-trivial common hypernyms.
REFERENCES
Church, Kenneth & Patrick Hanks. 1989. “Word Association Norms, Mutual
Information, and Lexicography”. Proceedings of the 27th Annual Meeting of
the Association for Computational Linguistics (ACL’89), 26-29 June 1989,
Vancouver, BC, Canada, 76-83. San Francisco: Morgan Kaufmann.
Hindle, Donald. 1990. “Noun Classification from Predicate-argument Struc-
tures”. Proceedings of the 28th Annual Meeting of the Association for Com-
putational Linguistics (ACL’90), 6-9 June 1990, Pittsburgh, PA, U.S.A., 268-
275. San Francisco: Morgan Kaufmann.
Ikehara, Satoru, Miyazaki Masahiro, Shirai Satoshi, Yokoo Akio, Nakaiwa Hi-
romi, Ogura Kentaro, Ooyama Yoshihumi & Hayashi Yoshihiko. 1997. Ni-
hongo Goi Taikei – A Japanese Lexicon. Tokyo: Iwanami Syoten.
Landis, Richard & Gary Koch. 1977. “The Measurement of Observer Agreement
for Categorical Data”. Biometrics 33:1.159-174.
Lin, Dekang. 1998. “Automatic Retrieval and Clustering of Similar Words”.
Proceedings of the 36th Annual Meeting of the Association for Computational
Linguistics and 17th International Conference on Computational Linguistics
(COLING-ACL’98 ), 768-774. San Francisco: Morgan Kaufmann.
Pantel, Patrick & Dekang Lin. 2002. “Discovering Word Senses from Text”.
ACM Special Interest Group on Knowledge Discovery and Data Mining
(SIGKDD’02 ), 613-619. New York: ACM.
Shinzato, Keiji & Kentaro Torisawa. 2004. “Acquiring Hyponymy Relations from
Web Documents”. Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics (HLT-
NAACL’04 ), 73-80. Boston, Mass.
Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. Berlin:
Springer.
Automatic Building of Wordnets
Eduard Barbu∗ & Verginica Barbu Mititelu∗∗
∗Graphitech Italy
∗∗Romanian Academy, Research Institute for Artificial Intelligence
Abstract
This paper dealing with the automatic building of wordnets starts by
stating the assumptions behind such an enterprise and then presents a
two-phase methodology for automatically building a target wordnet
strictly aligned with an already available wordnet (source wordnet).
In the first phase the synsets for the target language are automati-
cally generated and mapped onto the source language synsets using
a series of heuristics. In the second phase the salient relations that
can be automatically imported are identified and the procedure for
their import is explained. The idea behind all our heuristics will be
stated, the heuristics employed will be presented and their success
evaluated against a case study: automatically building a Romanian
wordnet using the Princeton WordNet.
1 Introduction
heuristics, and also the criteria we used in selecting the source language test
synsets. Finally, we state the problem to be solved in a more formal way,
we present the idea behind all heuristics, the heuristics themselves and their
success evaluated against a case study (automatically building a Romanian
wordnet using pwn 2.0).
2 Assumptions
In specifying how much languages can differ with respect to the conceptual
space they reflect, we follow Sowa (1992), who considers three distinct
dimensions:
• accidental: the two languages have different notations for the same concept
(for example, the Romanian word măr and the English word apple lexicalize
the same concept);
• systematic: this dimension defines the relation between the grammar of a
language and its conceptual structures; it has little import for our problem;
• cultural: the conceptual space expressed by a language is determined by
environmental, cultural and other factors; it could be the case, for example,
that the concepts defining the legal systems of different countries are not
mutually compatible, so when one builds a wordnet starting from a source
wordnet one should ask which parts (if any) can be safely transferred into
the target language, that is, which parts share the same conceptual space.
The assumption that we make use of is that the differences between the
two languages (source and target) are merely accidental: they have different
lexicalizations for the same concepts. As the conceptual space is already
expressed by the source wordnet structure using a language notation, our
task is to find the concepts’ notations in the target language.
When the source wordnet is not perfect (the real situation), a draw-
back of the automatic mapping approach is that all the mistakes existing
in the source wordnet are transferred into the target wordnet. Another pos-
sible problem appears because the previously made assumption about the
sameness of the conceptual space is not always true.
In this section we introduce the notations used in the paper and we outline
the guiding idea of all heuristics we used:
1. By TL we denote the target lexicon. In our experiment TL will contain
Romanian words (nouns).
2. By SL we denote the source lexicon. In our case SL will contain English
words (nouns).
3. WT and WS are the wordnets for the target language and the source
language, respectively.
4. $w_j^k$ denotes the k-th sense of the word $w_j$.
5. BD is a bilingual dictionary which acts as a bridge between SL and
TL .
If we ignore the information given by the definitions associated with word
senses, then, formally a sense of a word in the pwn is distinguished from
other word senses only by the set of relations it contacts in the semantic
network. These relations define the position of a word in the semantic
network.
The idea of our heuristics can be summed up in three points: (i) take
advantage of, and increase, the number of relations in the source wordnet to obtain
a unique position for each word sense; (ii) try to derive useful relations
between the words in the target language; (iii) in the mapping stage of the
procedure, exploit the structures built at points (i) and (ii). We have
developed a set of four heuristics.
The synonymy heuristic exploits the fact that synonymy enforces equiva-
lence classes on word senses.
Let $EnSyn = \{ew_{j_{11}}^{i_{11}}, ew_{j_{12}}^{i_{12}}, \ldots, ew_{j_{1n}}^{i_{1n}}\}$ (where $ew_{j_{11}}, ew_{j_{12}}, \ldots, ew_{j_{1n}}$ are the
words in the synset and the superscripts denote their sense numbers) be a SL
synset with length(EnSyn) > 1. The length of a synset is equal to the
number of distinct words (words that are not variants) in the synset. So
we disregard synsets such as {artefact, artifact}. For achieving this we
computed the well-known Levenshtein distance between the words in the
synset.
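The variant check can be sketched as follows (a minimal illustration; treating words within edit distance 1 as variants is an assumption of the sketch, not a threshold stated by the authors):

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def synset_length(synset, max_dist=1):
    """Count words that are not mere spelling variants of one another,
    e.g. {artefact, artifact} counts as a single distinct word."""
    kept = []
    for w in synset:
        if all(levenshtein(w, k) > max_dist for k in kept):
            kept.append(w)
    return len(kept)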
Taking the actual RoWN as a gold standard we can evaluate the results
of our heuristics by comparing the obtained synsets with those in the RoWN.
We distinguish five possible cases:
• The synsets are equal (this case will be labelled as Identical – id).
• The generated synset has all literals of the correct synset and some
more (Over-generation – ovg).
• The generated synset and the golden one have some literals in common
and some different (Overlap – ovp).
• The generated synset literals form a proper subset of the golden synset
(Under-generation – ug).
• The generated synset has no literals in common with the correct one
(Disjoint – dj).
The cases ovg, ovp and dj will be counted as errors. The other two cases,
namely id and ug, will be counted as successes. The evaluation of the
synonymy heuristic is given in Table 1.
The hyperonymy heuristic draws from the fact that, in the case of nouns,
the hyperonymy relation can be interpreted as an is-a relation (for versions
of pwn lower than 2.1 this is not entirely true: the hyperonym relation can
be interpreted as is-a or instance-of). It is also based on two related
observations: a hyperonym and its hyponyms carry some common infor-
mation, and the information common to the hyperonym and the hyponym
increases as one goes down the hierarchy.
Let $EnSyn_1 = \{ew_{j_{11}}^{i_{11}}, \ldots, ew_{j_{1t}}^{i_{1t}}\}$ and $EnSyn_2 = \{ew_{j_{21}}^{i_{21}}, \ldots, ew_{j_{2s}}^{i_{2s}}\}$ be
two SL synsets such that $EnSyn_1$ HYP $EnSyn_2$, meaning that $EnSyn_1$ is a
hyperonym of $EnSyn_2$.
The results of the hyperonymy heuristic are presented in Table 2. The low
recall is due to the fact that we did not find many common translations
between hyperonyms and their hyponyms.
The domain heuristic takes advantage of an external relation imposed over pwn.
At IRST pwn was augmented with a set of Domain Labels, the resulting
resource being called WordNet Domains (Magnini & Cavaglia 2000). The
domain labels are hierarchically organized and each synset received one or
more domain labels. The idea of using domains is helpful for distinguishing
word senses (different word senses of a word are assigned to different do-
mains). The best case is when each sense of a word has been assigned to
a distinct domain. But even if the same domain labels are assigned to two
or more senses of a word, in most cases we can assume that this is a strong
indication of a fine-grained distinction. It is very probable that the dis-
tinction is preserved in the target language by the same word. We labelled
every word in the BD dictionary with its domain label. For English words
the domain is automatically generated from the English synset labels. For
labelling Romanian words we used two methods:
1. We downloaded a collection of documents from web directories such
that the categories of the downloaded documents match the categories
used in the Wordnet Domain. As a set of features we selected the
nouns that provide more information. For this we used the well-known
χ² statistic, which checks whether there is a relationship between a term t
occurring in a document and the document belonging to a category c (a
sketch of the computation is given after this list):

$$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$
where:
• A is the number of times t and c co-occur;
• B is the number of times t occurs without c;
• C is the number of times c occurs without t;
• D is the number of times neither c nor t occurs;
• N is the total number of documents.
For each category we computed the score between that category and
the noun terms of our documents. Then, for choosing the terms that
discriminate well for a certain category we used the formula below
(where m denotes the number of categories):
$$\chi^2_{max}(t) = \max_{i=1}^{m} \chi^2(t, c_i)$$
2. We took advantage of the fact that some words have already been
assigned subject codes in various dictionaries. We performed a manual
mapping of these codes onto the Domain Labels used at IRST. The
Romanian words that could not be assigned domain information
were assigned the default factotum domain.
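As promised above, a minimal sketch of the χ² computation (illustrative only; the counts A–D follow the definitions in the list):

def chi_square(A, B, C, D):
    """Chi-square between a term t and a category c, from document counts:
    A = docs with both t and c, B = t without c, C = c without t, D = neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_square_max(per_category_counts):
    """per_category_counts: one (A, B, C, D) tuple per category c_i.
    Returns the maximum chi-square of the term over all m categories."""
    return max(chi_square(*counts) for counts in per_category_counts)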
A BD entry augmented with domain information lists, in square brackets,
the domains that pertain to each word.
For each synset in the SL we generated all the translations of its literals
in TL . Then the TL synset is built using only those TL literals whose domain
“matches” the SL synset domain (that is: it is the same as the domain of
SL , subsumes the domain of SL in the IRST domain labels hierarchy or is
subsumed by the domain of SL in the IRST domain labels hierarchy). The
results of this heuristic are given in Table 3.
The monolingual dictionary heuristic takes advantage of the fact that the
source synsets have a gloss associated and also that target words that are
9 Combining results
domain heuristic and the ambiguity of the members of the source synset
from which it was generated is at most equal to 2. Table 5 shows the
combined results of our heuristics.
As one can observe there, for 106 synsets in pwn 2.0 no Romanian
equivalent synsets could be found. The procedure also produced 635 synsets
that are smaller than the corresponding synsets in the RoWN.
10 Import of relations
After building the target synsets an investigation of the nature of the re-
lations that structure the source wordnet should be made for establishing
which of them can be safely transferred into the target wordnet. As one
would expect, the conceptual relations can be safely transferred because these
relations hold between concepts. The only lexical relation that holds be-
tween nouns and that was subject to scrutiny was the antonymy relation.
We concluded that this relation can also be safely imported. The importing
algorithm works as described below.
If two source synsets S1 and S2 are linked by a semantic relation R
in WS and if T1 and T2 are the corresponding aligned synsets in the WT ,
then they will be linked by the relation R. If in WS there are intervening
synsets between S1 and S2 , then we will set the relation R between the
corresponding TL synsets only if R is declared transitive (R+, an unlimited
number of compositions, e.g., hypernym) or partially transitive (Rk,
with k a user-specified maximum number of compositions larger than the
number of intervening synsets between S1 and S2). For instance, we defined
all the holonymy relations as partially transitive (k = 3).
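A minimal sketch of this import step under the stated rules (the tuple and dictionary representations are illustrative assumptions, not the authors' data structures):

def import_relations(ws_relations, alignment, transitivity):
    """ws_relations: (S1, S2, R, hops) tuples, where hops counts intervening
    synsets between S1 and S2 in the source wordnet; alignment maps source
    synsets to target synsets; transitivity maps R to None (not transitive),
    'R+' (unlimited compositions), or an integer k (partially transitive)."""
    imported = []
    for s1, s2, rel, hops in ws_relations:
        t1, t2 = alignment.get(s1), alignment.get(s2)
        if t1 is None or t2 is None:
            continue  # one of the synsets has no target equivalent
        trans = transitivity.get(rel)
        # Direct links are always imported; links over intervening synsets
        # only if the relation is transitive enough to span them.
        if hops == 0 or trans == 'R+' or (isinstance(trans, int) and trans > hops):
            imported.append((t1, t2, rel))
    return imported

# For example: transitivity = {'hypernym': 'R+', 'holonym': 3}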
11 Conclusions
REFERENCES
Atserias, Jordi, Salvador Climent, Xavier Farreres, German Rigau & Horacio
Rodriguez. 1997. “Combining Multiple Methods for Automatic Construction
of Multilingual WordNets”. Proceedings of the International Conference on
Recent Advances in Natural Language Processing, 143-150. Tzigov Chark,
Bulgaria.
Carroll, John B., ed. 1964. Language, Thought, and Reality: selected writings of
Benjamin Lee Whorf. Cambridge, Mass.: MIT.
Dicţionarul explicativ al limbii române. 1996. Bucureşti, Romania: Univers
Enciclopedic.
Fellbaum, Christiane, ed. 1998. WordNet: An Electronical Lexical Database.
Cambridge, Mass.: MIT.
Kilgarriff, Adam. 1997. “I Don’t Believe in Word Senses”. Computers and the
Humanities 31:2.91-113.
Lee, Changki, Geunbae Lee & Seo Jung Yun. 2000. “Automatic WordNet Map-
ping Using Word Sense Disambiguation”. Proceedings of the Joint SIGDAT
Conference on Empirical Methods in Natural Language Processing and Very
Large Corpora (EMNLP/VLC 2000 ), 142-147. Hong Kong.
Magnini, Bernardo & Gabriela Cavaglia. 2000. “Integrating Subject Field Codes
into WordNet”. Proceedings of the Second International Conference on Lan-
guage Resources and Evaluation (LREC-2000 ) ed. by Maria Gavrilidou et
al., 1413-1418. Athens, Greece.
Sowa, John F. 1992. “Logical Structure in the Lexicon”. Lexical Semantics and
Commonsense Reasoning ed. by J. Pustejovsky & S. Bergler, 39-60. Berlin:
Springer-Verlag.
Tufiş, Dan, ed. 2004. The BalkaNet Project (= Special Issue of Romanian Journal
of Information Science and Technology, 7:1/2). Bucureşti, Romania: Editura
Academiei Române.
Vossen, Piek, ed. 1998. A Multilingual Database with Lexical Semantic Networks.
Dordrecht: Kluwer.
Lexical Transfer Selection
Using Annotated Parallel Corpora
Stelios Piperidis, Panagiotis Dimitrakis & Irene Balta
Institute for Language and Speech Processing
Abstract
This paper addresses the problem of bilingual lexicon extraction and
lexical transfer selection, in the framework of computer-aided and
machine translation. The proposed method relies on parallel cor-
pora, annotated at part of speech and lemma level. We first extract
a bilingual lexicon using unsupervised statistical techniques. For
each word with more than one translation candidate we build con-
text vectors, in order to aid the selection of the contextually correct
translation equivalent. The method achieves an overall precision of
ca. 85% while the maximum recall reaches 75%.
1 Background
The emergence of parallel corpora has evoked the appearance of many meth-
ods that attempt to deal with different aspects of computational linguistics
(Véronis 2000). Of special significance in the fields of lexicography, termi-
nology, computer-aided translation (cat) and machine translation (mt) is
the impact of ‘bitexts’: pairs of texts in two languages, where each text
is a translation of the other (Melamed 1997). Such texts provide evidence
of use that is directly deployable in statistical methods, and they enhance
the automatic elicitation of otherwise sparse linguistic resources.
Recent developments in cat and mt have moved towards the use of par-
allel corpora, aiming at two primary objectives: (i) to overcome the sparse-
ness of the necessary resources and (ii) to avoid the burden of producing
them manually. Furthermore, parallel corpora have proven rather useful
for automatic dictionary extraction, which offers the advantage of a lex-
icon capturing the corpus-specific translational equivalences (Brown 1997,
Piperidis et al. 2000). Extending this approach, we investigate the possi-
bility of using bilingual corpora to extract translational correspondences
coupled with information about word senses and contextual use.
In particular, we focus on polysemous words with multiple translational
equivalences.
The relation between word and word usage, as compared to the relation
of word and word sense has been thoroughly addressed (Gale et al. 1992a,
Yarowsky 1993, Kilgarriff 1997). We argue that through the exploitation
of parallel corpora and without other external linguistic resources, we can
228 STELIOS PIPERIDIS, PANAGIOTIS DIMITRAKIS & IRENE BALTA
adequately resolve the task of target word selection in cat and mt. Along
this line, it has been argued that the accumulative information added by a
second language could be very important in lexical ambiguity resolution in
the first language (Dagan et al. 1991), while tools have been implemented
for translation prediction, by using context information extracted from a
parallel corpus (Tiedemann 2001).
In addition, research on word sense disambiguation (wsd) is signifi-
cant in the design of a methodology for automatic lexical transfer selection.
Although monolingual wsd and translation are perceived as different prob-
lems (Gale et al. 1992d), we examine whether certain conclusions drawn
during the process of wsd could be useful for translation prediction.
The role of context, as the only means to identify the meaning of a
polysemous word (Ide & Véronis 1998), is of primary importance in various
statistical approaches. Brown et al. (1991) and Gale et al. (1992b, 1992d)
use both the context of a polysemous word and the information extracted
from bilingual aligned texts in order to assign the correct sense to the
word. Yarowsky (1995) explores the significance of context in creating
clusters of senses, while Schütze (1998) addresses the sub-problem of word
sense discrimination through the context-based creation of three cascading
types of vectors.
We examine the impact of context upon translation equivalent selection,
through an “inverted” word sense discrimination experiment. Given the
possible translational candidates, which are extracted from the statistical
lexicon, we investigate the discriminant capacity of the context vector, which
we build separately for each of the senses of the polysemous word.
2 Proposed method
The basic idea underlying the proposed method is the use of context vectors,
for each of the word usages of a polysemous word, in order to resolve the
problem of lexical transfer selection. The method consists of three stages,
shown in Figure 1:
• Bilingual Lexicon Extraction
• Context Vectors Creation
• Lexical Transfer Selection
Fig. 1: Stages of the method. The parallel corpus is tokenized, POS-tagged
and lemmatized; sentence alignment yields an aligned, annotated corpus,
from which the lexicon, the context vectors and the lexical transfer
selection are derived.
We filter the pos-tagged corpus and retain only nouns, adjectives and verbs. The
corpus-specific lexicon is extracted using unsupervised statistical methods,
based on two basic principles:
• No language-pair-specific assumptions are made about the correspon-
dences between grammatical categories. In this way, all possible
correspondence combinations are produced, as they may occur during
translation from the source language to the target language.
• For each aligned sentence-pair, each word of the target sentence is a
candidate translation for each word of the aligned source sentence.
Following the above principles, we compute the absolute frequency of each
word and the frequency of each word pair within aligned sentence pairs.
Using these frequencies, we extract lexical equivalences based on the
following criteria:
1. The frequency of the word pair must be greater than threshold Thr1.
2. Each of the conditional probabilities P(Wt|Ws) and P(Ws|Wt) has to
be greater than threshold Thr2.
3. The product P(Wt|Ws) · P(Ws|Wt) must be greater than threshold
Thr3. This product is the score of the translation.
After experimentation and examination of the results, {Thr1, Thr2, Thr3}
were set to {5, 0.25, 0.15}. Experiments with Thr1=1 and Thr1=10 re-
vealed that the former resulted in high recall with significantly low preci-
sion, i.e., a fairly large but low-quality lexicon, while the latter resulted in
relatively high precision with radically reduced recall, i.e., a fairly small
high-quality lexicon. We empirically fixed Thr1 in the middle of the
experimentation range. Thr2 was set to 0.25 to account for a maximum of
4 possible translations, which a polysemous word could have according to
the corpus data. Thr3 was set to 0.15 to account for the lower bound of
the product of the conditional probabilities P(Wt|Ws) and P(Ws|Wt), taking
into consideration the thresholds of the individual conditional probabilities,
i.e., we empirically set the lower bounds of the two conditional probabilities
to {0.25, 0.6}.
The extracted bilingual lexical equivalences account for: words with
one translation (90% of total) and words with multiple translations (10%).
Words with multiple equivalents are further distinguished into: (i) words
that have multiple translational equivalents which differ in sense, (ii) words
that have multiple synonymous translations, and (iii) words that have
multiple translations which are in fact wrong, due to statistical errors of
the method. In the following, we focus on polysemous words with multiple
translational equivalents which cannot mutually replace each other in the
same context.
Inside this window, we look for words Wx which belong to the selected
grammatical categories. Each such Wx is added to the CV, along with the
number of times this word has appeared in the context of WpTi. We follow
a similar procedure for each word Wu.
Step 3: The final formation of the CVs is based on the following equations:
$$N_{W_x W_p^{T_i}} \geq k \qquad (2)$$

$$N_{W_x W_u} \geq k \qquad (4)$$

$$\mathrm{ECVScore}_x^{W_p^{T_i}} = \frac{P(W_x \mid W_p^{T_i})}{2^{1-d_x}} \qquad (8)$$

$$\mathrm{ECVScore}_x^{W_u} = \frac{P(W_x \mid W_u)}{2^{1-d_x}}, \qquad (9)$$
where P(Wx|WpTi) and P(Wx|Wu) are defined in (3) and (5), and dx is the
depth at which Wx was found. In case of multiple appearances of Wx in the
ECV, we choose the one at the lowest depth, as it is the most significant
in defining the sense of Wp.
Step 4: The final lexical transfer selection procedure examines each ECV
of WpTi separately. We compare the words inside the ±n window of word
Wp in the sentence under examination with those included in the ECV.
For each matched word we compute the appropriate score, using (8) and
(9). By adding the scores of the matched words, we assign to each possible
translational equivalent Ti a total score, depending on the associated ECV.
Finally, for the lexical transfer selection, we choose the word sense WpTi
with the highest score. If the scores are equal, the algorithm does not
choose randomly but can output both candidates as translations. A feedback
mechanism could be foreseen to minimize these cases, if appropriate, in a
subsequent transfer selection round.
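A compact sketch of the scoring and selection steps follows. The ECV data structure ({Ti: {word: (conditional probability, depth)}}) and the window handling are our illustrative assumptions; the score follows equations (8)-(9) exactly as printed above:

    def ecv_score(p_cond, depth):
        # Equations (8)-(9) as printed: ECVScore = P(Wx | .) / 2^(1 - d_x),
        # where d_x is the depth at which Wx entered the ECV.
        return p_cond / 2 ** (1 - depth)

    def select_transfer(sentence, p_index, ecvs, n=5):
        # sentence: list of lemmas; p_index: position of the polysemous word Wp.
        # ecvs: {Ti: {word: (p_cond, depth)}}, one extended context vector per
        # translation candidate Ti.
        window = (sentence[max(0, p_index - n):p_index]
                  + sentence[p_index + 1:p_index + 1 + n])
        totals = {}
        for ti, ecv in ecvs.items():
            score = 0.0
            for w in window:
                if w in ecv:
                    p_cond, depth = ecv[w]
                    score += ecv_score(p_cond, depth)
            totals[ti] = score
        if not totals:
            return []
        best = max(totals.values())
        # Ties are not broken at random: all top-scoring candidates come back.
        return [ti for ti, s in totals.items() if s == best]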
3 Results
The corpus used was the intera parallel corpus (Gavrilidou et al. 2004),
consisting of official eu documents in English and Greek from five different
domains: education, environment, health, law and tourism. The corpus
comprises 100,000 aligned sentences, containing on average 830,000 tokens
of the selected grammatical categories (nouns, verbs and adjectives) in either
language. The corresponding lemmas number 20,000. The complete bilingual
lexicon comprises 5,280 records (where multiple translational equivalences
of a word are counted as one record).
Evaluation was focused on the ability to resolve truly ambiguous words,
leaving aside words with synonymous translations, or erroneous transla-
tional candidates. For this purpose, a set of ambiguous words in English,
the contextually correct translation equivalent of which is “univocal” in
Greek, were manually selected. The selected set was {active, floor, seal,
settlement, solution, square, vision}.
For the above words, we extracted the sentences in the parallel corpus
that contain them. We adopted the 10-fold cross-validation technique
for evaluation, computing the average results over the 10 iterations of the
algorithm. The possible answers given by the algorithm were:
As expected, the wider the window, the more likely that the system gives an
answer, although the precision decreases. Especially for a window size n=15,
which in most cases in the given corpus contains all the tokens in a sentence,
we notice that the system’s performance declines disproportionately. This
is due to multiple erroneous statistical co-occurrences that are semantically
irrelevant in such wide windows.
In the second experiment, the percentage of answered cases and recall
increased compared to the first experiment, while precision slightly
decreased. The results also indicate that a smaller window and a higher
absolute-appearance threshold k lead to fewer answers but higher accuracy.
To evaluate the performance of the method taking into consideration the
special characteristics of our corpus, we computed a “baseline” performance
(Gale et al. 1992c). We assign to each polysemous word found in the
test set the most frequent of its possible senses. The estimated baseline
REFERENCES
Brown, Peter F., Stephen Della Pietra, Vincent J. Della Pietra & Robert L.
Mercer. 1991. “Word-Sense Disambiguation Using Statistical Methods”.
Proceedings of the 29th Annual Meeting of the Association for Computational
Linguistics (ACL’91 ), 264-270. Berkeley, Calif.
Brown, Ralf D. 1997. “Automated Dictionary Extraction for ‘Knowledge-Free’
Example-Based Translation”. Proceedings of the 7th International Confer-
ence on Theoretical and Methodological Issues in Machine Translation
(TMI-97 ), 111-118. Santa Fe, New Mexico.
Dagan, Ido, Alon Itai & Ulrike Schwall. 1991. “Two Languages Are More Infor-
mative Than One”. Proceedings of the 29th Annual Meeting of the Association
for Computational Linguistics (ACL’91 ), 130-137. Berkeley, Calif.
Gale, William A. & Kenneth W. Church. 1991. “A Program for Aligning Sentences
in Parallel Corpora”. Proceedings of the 29th Annual Meeting of the Association
for Computational Linguistics (ACL’91 ), 177-184. Berkeley, Calif.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992a. “One Sense per
Discourse”. Proceedings of the Workshop on Speech and Natural Language,
233-237. Harriman, New York.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992b. “Using Bilin-
gual Materials to Develop Word Sense Disambiguation Methods”. Proceed-
ings of the 4th International Conference on Theoretical and Methodological
Issues in Machine Translation (TMI-92 ), 101-112. Montreal, Canada.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992c. “Estimating
Upper and Lower Bounds on the Performance of Word-Sense Disambigua-
tion Programs”. Proceedings of the 30th Annual Meeting of the Association for
Computational Linguistics (ACL’92 ), 249-256. Newark, Delaware.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992d. “A Method for
Disambiguating Word Senses in a Large Corpus”. Common Methodologies
in Humanities Computing and Computational Linguistics, (= Special issue
of Computers and the Humanities, December 1992) 26:5-6.415-439.
Gavrilidou, M., P. Labropoulou, E. Desipri, V. Giouli, V. Antonopoulos &
S. Piperidis. 2004. “Building Parallel Corpora for eContent Professionals”.
COLING Workshop on Multilingual Linguistic Resources (MLR2004 ), 90-93.
Geneva, Switzerland.
Ide, Nancy & Jean Véronis. 1998. “Introduction to the Special Issue on Word
Sense Disambiguation: The State of the Art”. Computational Linguistics
24:1.2-40.
Kilgarriff, Adam. 1997. “I don’t Believe in Word Senses”. Computers and the
Humanities 31:2.91-113.
Melamed, I. Dan. 1997. “A Word-to-Word Model of Translational Equivalence”.
Proceedings of the 35th Annual Meeting of the Association for Computational
Linguistics (ACL’97 ), 490-497. Madrid, Spain.
Piperidis, Stelios, Harris Papageorgiou & Sotiris Boutsis. 2000. “From Sentences
to Words and Clauses”. Parallel Text Processing ed. by Jean Véronis, 117-
138. Dordrecht, The Netherlands: Kluwer Academic.
Schütze, Hinrich. 1998. “Automatic Word Sense Discrimination”. Word Sense
Disambiguation (= Special issue of Computational Linguistics, March 1998)
24:1.97-123.
Tiedemann, Jörg. 2001. “Predicting Translations in Context”. Proceedings of the
International Conference on Recent Advances in Natural Language Processing
(RANLP-2001 ), 240-244. Tzigov Chark, Bulgaria.
Véronis, Jean. 2000. “From the Rosetta Stone to the Information Society: A
Survey of Parallel Text Processing”. Parallel Text Processing ed. by Jean
Véronis, 1-25. Dordrecht, The Netherlands: Kluwer Academic.
Yarowsky, David. 1993. “One Sense Per Collocation”. Proceedings of the Work-
shop on Human Language Technology, 266-271. Princeton, New Jersey.
Yarowsky, David. 1995. “Unsupervised Word Sense Disambiguation Rivaling Su-
pervised Methods”. Proceedings of the 33rd Annual Meeting of the Association
for Computational Linguistics (ACL’95 ), 189-196. Cambridge, Mass.
Multi-Perspective Evaluation of the FAME
Speech-to-Speech Translation System for Catalan,
English and Spanish
Victoria Arranz∗, Elisabet Comelles∗∗ & David Farwell∗∗
∗ ELDA - Evaluation and Language Resources Distribution Agency
∗∗ TALP Research Centre, Universitat Politècnica de Catalunya &
Institució Catalana de Recerca i Estudis Avançats
Abstract
This paper describes the final evaluation of the FAME interlingua-
based speech-to-speech translation system for Catalan, English and
Spanish. It is an extension of the already existing NESPOLE! system
that translates between English, French, German and Italian.
However, the FAME modules have now been integrated in an Open
Agent Architecture platform. We describe three types of evalua-
tion (task-oriented, performance-based and of user satisfaction) and
present their results when applied to our system. We also compare
the results of the system with those obtained by a stochastic trans-
lator developed independently within the FAME project.1
1 Introduction
2 System architecture
3 Evaluation
Evaluation was done on real users of the SST system, in order to:
• examine the performance of the system in as real a situation as pos-
sible, as if it were to be used by a real tourist trying to book accom-
modation in Barcelona,
• study the influence of using ASR in translation,
• compare the performance of a statistical approach and an interlingual
approach in a restricted semantic domain and for a task of this kind²,
• investigate the relevance of certain standard evaluation methods used
in statistical translation when applied to interlingual translation.
            Statistical system                        Rule-based (interlingua) system
Lang Pair   # sent.   mWER    BLEU        Lang Pair   # sent.   mWER    BLEU
CAT2ENG     119       74.66   0.1218      CAT2ENG     119       78.98   0.1456
ENG2CAT     117       77.84   0.1573      ENG2CAT     117       81.19   0.2036
SPA2ENG     84        61.10   0.1934      SPA2ENG     84        60.93   0.3462
ENG2SPA     125       80.95   0.1052      ENG2SPA     125       86.71   0.1214
The results are consistent with respect to the relative performance of the dif-
ferent systems in terms of language pairs. The Spanish-to-English systems,
both statistical and rule-based, performed best. The English-to-Spanish sys-
tems, both statistical and rule-based, performed the worst. The Catalan-to-
English and English-to-Catalan systems performed somewhere in between
with the latter slightly outperforming the former.
As for the relative performance of the statistical systems as opposed to
the rule-based systems, the results are entirely contradictory. The mWER
scores of the statistical system are consistently better than those of the
rule-based systems (apart from the Spanish-to-English case where the two
systems essentially performed equally). On the other hand, the BLEU scores
of the rule-based system are consistently better than those of the statistical
systems. It is unclear how this happened, although, since the BLEU metric
rewards overlapping strings of words (as opposed to simply matching words),
it is likely that the rule-based systems produced a greater number of
correct multiword sub-strings than the statistical systems did. In any case,
were it not for the low performance of all the systems and the very limited
size of the test corpus, this would be a very telling result with regard to the
validity of the evaluation metrics.
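The point about overlapping strings can be illustrated with the clipped n-gram precision at the core of BLEU. This is a minimal sketch; the two hypothesis sentences are invented examples, not system output:

    from collections import Counter

    def ngram_precision(hyp, ref, n):
        """Clipped n-gram precision, the core quantity behind BLEU."""
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        return matched / max(1, sum(hyp_ngrams.values()))

    ref = "please book a double room for two nights".split()
    a = "please book a single room for two nights".split()  # one word wrong
    b = "book please a room double for nights two".split()  # same words, reordered

    # b matches every reference word and a matches 7 of 8, so word-level
    # scores are close; bigram precision separates them sharply (~0.71 vs 0.0).
    print(ngram_precision(a, ref, 2), ngram_precision(b, ref, 2))

A system producing well-ordered multiword fragments can thus score well on BLEU while a word-matching metric sees little difference.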
were asked 10 questions about the naturalness and length of the dialogue,
the behaviour of the system, the user’s success in getting what he wanted,
the difficulty of correcting errors, whether the user would use this system
again, etc. The average response was 3.4 points out of 5. An informal
inspection of the results per questionnaire indicates that the reaction of
the users as a whole was consistent and weakly positive (taking 3.0 as
the midpoint).
4 Conclusions
REFERENCES
Gavaldà, Marsal. 2000. “SOUP: A Parser for Real-world Spontaneous Speech”.
Proceedings of the 6th International Workshop on Parsing Technologies
(IWPT-2000 ), XX-YY. Trento, Italy.
Holzapfel, Hartwig, I. Rogina, M. Wölfel & T. Kluge. 2003. “FAME Deliverable
D3.1: Testbed Software, Middleware and Communication Architecture”.
Lavie, Alon, F. Metze, R. Cattoni & E. Constantini. 2002. “A Multi-Perspective
Evaluation of the NESPOLE! Speech-to-Speech Translation System”. Pro-
ceedings of Speech-to-Speech Translation Workshop at the 40th Annual Meet-
ing of the Association of Computational Linguistics (ACL-2002 ), XX-YY.
Philadelphia, PA.
Levin, Lori, D. Gates, D. Wallace, K. Peterson, A. Lavie, F. Pianesi, E. Pianta,
R. Cattoni & N. Mana. 2002. “Balancing Expressiveness and Simplicity in
an Interlingua for Task based Dialogue”. Proceedings of Speech-to-Speech
Translation Workshop at the 40th Annual Meeting of the Association of Com-
putational Linguistics (ACL-2002 ), XX-YY. Philadelphia, Penn.
Metze, Florian, J. McDonough, J. Soltau, C. Langley, A. Lavie, L. Levin, T. Schultz,
A. Waibel, L. Cattoni, G. Lazzari, N. Mana, F. Pianesi & E. Pianta. 2002.
“The NESPOLE! Speech-to-Speech Translation System”. Proceedings of the
Human Language Technology Conference (HLT-2002 ), XX-YY. San Diego,
Calif.
Searle, John. 1969. Speech Acts: An Essay in the Philosophy of Language.
Cambridge, U.K.: Cambridge University Press.
Schultz, Tanja, D. Alexander, A.W. Black, K. Peterson, S. Suebvisai & A. Waibel.
2004. “A Thai Speech Translation System for Medical Dialogs”. Proceedings
of the Human Language Technology Conference HLT/NAACL-2004, XX-YY.
Boston, Mass.
Tomita, Masaru & E.H. Nyberg. 1988. “Generation Kit and Transformation
Kit, Version 3.2, User’s Manual”. Technical Report (CMU-CMT-88-MEMO).
Center for Machine Translation, Carnegie Mellon University, Pittsburgh, PA.
Woszczyna, Monika, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina,
C. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Aoki-Waibel, A. Waibel &
W. Ward. 1993. “Recent Advances in JANUS: A Speech Translation Sys-
tem”. Proceedings of Eurospeech 1993. Berlin.
Parallel Corpora for Medium Density Languages
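Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy,
László Németh & Viktor Trón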
Abstract
The choice of natural language technology appropriate for a given
language is greatly impacted by ‘density’ (availability of digitally
stored material). More than half of the world speaks medium den-
sity languages, yet many of the methods appropriate for high or
low density languages yield suboptimal results when applied to the
medium density case. In this paper we describe a general method-
ology for rapidly collecting, building, and aligning parallel corpora
for medium density languages, illustrating our main points on the
case of Hungarian, Romanian, and Slovenian. We also describe and
evaluate the hybrid sentence alignment method we are using.
1 Introduction
There are only a dozen large languages with a hundred million speakers or
more, accounting for about 40% of the world population, and there are over
5,000 small languages with less than half a million speakers, accounting
for about 4% (Grimes 2003). In this paper we discuss some ideas about
how to build parallel corpora for the five hundred or so medium density
languages that lie between these two extremes based on our experience
building a 50M word sentence-aligned Hungarian-English parallel corpus.
Throughout the paper we illustrate our strategy mainly on Hungarian (14m
speakers), also mentioning Romanian (26m speakers), and Slovenian (2m
speakers), but we emphasize that the key factor leading to the success of our
method, a vigorous culture of native language use and (digital) literacy, is
by no means restricted to Central European languages. Needless to say,
the density of a language (the availability of digitally stored material) is
predicted only imperfectly by the population of speakers: major Prakrit or
Han dialects, with tens, sometimes hundreds of millions of speakers, are low
density, while minor populations, such as the Inuktitut, can attain high
levels of digital literacy given the political will and a conscious Hansard-
building effort (Martin et al. 2003). With this caveat, population (or better,
GDP) is a very good approximation for density, on a par with web size.
Starting with Resnik (1998), mining the web for parallel corpora has emerged
as a major technique, and between English and another high density lan-
guage, such as Chinese, the results are very encouraging (Chen & Nie 2000,
Resnik & Smith 2003). However, when no highly bilingual domain (like .hk
for Chinese or .ca for French) exists, or when the other language is much
lower density, the actual number of automatically detectable parallel pages
is considerably smaller: for example, Resnik and Smith find less than 2,000
English-Arabic parallel pages for a total of 2.3m words.
For medium density languages parallel web pages turn out to be a sur-
prisingly minor source of parallel texts. Even in cases where the popula-
tion and the economy are sizeable, and a significant monolingual corpus can
be collected by crawling, mechanically detectable parallel or bilingual web
pages exist only in surprisingly small numbers. For example, a 1.5 billion
word corpus of Hungarian (Halácsy et al. 2004), with 3.5 million unique
pages, yielded only 270,000 words (535 pages), and a 200m word corpus of
Slovenian (202,000 pages) yielded only 13,000 words (42 pages) using URL
parallelism as the primary matching criterion as in PTMiner (Chen & Nie
2000).
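URL parallelism of this kind can be sketched in a few lines. The language markers and the regular expression below are illustrative assumptions in the spirit of PTMiner-style matching, not the tool's actual implementation:

    import re

    # Hypothetical language markers; a real crawl needs site-specific lists.
    MARKERS = [("en", "hu"), ("eng", "hun"), ("english", "hungarian")]

    def candidate_pairs(urls):
        """Yield (english_url, hungarian_url) pairs whose URLs differ only in
        a language marker -- URL parallelism as the primary matching criterion."""
        urls = set(urls)
        for url in urls:
            for en, hu in MARKERS:
                # Substitute the marker where it stands as a separate token.
                cand = re.sub(rf"(?<![a-z]){en}(?![a-z])", hu, url)
                if cand != url and cand in urls:
                    yield url, cand

For instance, this pairs http://example.hu/en/about with http://example.hu/hu/about; pages whose URLs encode no such marker are invisible to the method, which is one reason the yield is so small.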
Web pages are undoubtedly valuable for a diversity of styles and contents
that is greater than what could be expected from any single source, but a few
hundred web pages alone fall short of a sensible parallel corpus. Therefore,
one needs to resort to other sources, many of them impossible to find by
mechanical URL comparison, and often not even accessible without going
through dedicated query interfaces.
Literary texts. The Hungarian National Library maintains a large
public domain digital archive Magyar Elektronikus Könyvtár ’Hungarian
Electronic Library’ mek.oszk.hu/indexeng.phtml with many classical texts.
Comparison with the Project Gutenberg archives at www.gutenberg.org
yielded well over a hundred parallel texts by authors ranging from Jane
Austen to Tolstoy. Equally importantly, many works still under copyright
were provided by their publishers under the standard research exemption
clause. While we cannot publish most of these texts in either language, we
can publish the aligned sentence pairs alphabetically sorted. This “shuffling”
somewhat limits usability inasmuch as higher than sentence-level text layout
becomes inaccessible, but at the same time makes it prohibitively hard to
reconstruct the original texts and contravene the copyright. Since shuffling
nips copyright issues in the bud, it simplifies the complex task of dissemi-
nating aligned corpora considerably.
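The shuffling step itself is trivial; a minimal sketch:

    def shuffle_bitext(pairs):
        """pairs: list of (source_sentence, target_sentence) tuples.
        Alphabetical sorting keeps every aligned pair intact but destroys
        document-level order, making reconstruction of the original texts
        prohibitively hard while preserving the pairs' value as training data."""
        return sorted(pairs)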
Religious texts. The entire Bible has been translated to over 400
languages and dialects, and many religious texts from the Bhagavad Gita to
the Book of Mormon enjoy nearly as broad currency. The Catholic Church
makes a special effort to have papal edicts translated to other languages
from the original Latin (see www.vatican.va/archive).
International Law. From the Geneva Convention to the Universal
Declaration of Human Rights (www.unhchr.ch/udhr) many important le-
gal documents have been translated to hundreds of languages and dialects.
Those working on the languages of the European Union have long availed
themselves of the CELEX database.
Movie captioning. Large mega-productions are often dubbed, but
smaller releases will generally have only captioning, often available for re-
search purposes. For cult movies there is also a vigorous subgenre of amateur
translations by movie buffs.
Software internationalization. Multilingual software documentation
is increasingly becoming available, particularly for open source packages
such as KDE, Gnome, OpenOffice, Mozilla, the GNU tools, etc. (Tiedemann
& Nygaard 2004).
Bilingual magazines. Both frequent flyer magazines and national
business magazines are often published with English articles in parallel.
Many magazines from Scientific American to National Geographic have edi-
tions in other languages, and in many countries there exist magazines with
complete mirror translations (for instance, Diplomacy and Trade Magazine
publishes every article both in Hungarian and English).
Annual reports, corporate home pages. Large companies will often
publish their annual reports in English as well. These are usually more
strictly parallel than the rest of their web pages.
There is no denying that the identification of such resources, negoti-
ating for their release, downloading, format conversion, and character-set
normalization remain labor-intensive steps, with good opportunities for au-
tomation only at the final stages. But such an effort leverages exactly the
strengths of medium density languages: the existence of a joint cultural
heritage both secular and religious, of national institutions dedicated to the
preservation and fostering of culture, of multinational movements (particu-
larly open source) and multinational corporations with a notable national
presence, and of a rising tide of global business and cultural practices. Al-
together, the effort pays off by yielding a corpus that is two to three orders
of magnitude larger, and covers a much wider range of jargons, styles, and
genres, than what could be expected from parallel web pages alone. Table 1
summarizes the different types of texts and their sizes in our Hungarian-
English corpus.
such as long surplus chapters. Once the similarity matrix is obtained for the
relevant sentence pairs, the optimal alignment trail is selected by dynamic
programming, going through the matrix with various penalties assigned to
skipping and coalescing sentences. The score of skipping is a fixed param-
eter, learnt on our training corpus, while the score of coalescing is the sum
of the minimum of the two token-based scores and the length-based score
of the concatenation of the two sentences. For performance reasons, the
dynamic programming algorithm does not take into account the possibility
of more than two sentences matching one sentence. After the optimal align-
ment path is found, a postprocessing step iteratively coalesces a neighboring
pair of one-to-many and zero-to-one segments wherever the resulting new
segment has a better character-length ratio than the starting one. With
this method, any one-to-many segments can be discovered.
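The dynamic-programming search can be sketched as follows. This is a minimal illustration, implementing only 1-1 matches and skips with a fixed penalty; the coalescing moves and the postprocessing described above are omitted for brevity, and all names are ours:

    def align(sim, skip_penalty):
        # sim[i][j]: similarity of source sentence i and target sentence j.
        n = len(sim)
        m = len(sim[0]) if n else 0
        NEG = float("-inf")
        best = [[NEG] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if best[i][j] == NEG:
                    continue
                steps = []
                if i < n and j < m:
                    steps.append((i + 1, j + 1, sim[i][j]))   # match 1-1
                if i < n:
                    steps.append((i + 1, j, -skip_penalty))   # skip source sentence
                if j < m:
                    steps.append((i, j + 1, -skip_penalty))   # skip target sentence
                for ni, nj, gain in steps:
                    if best[i][j] + gain > best[ni][nj]:
                        best[ni][nj] = best[i][j] + gain
                        back[ni][nj] = (i, j)
        ladder, cell = [], (n, m)
        while cell is not None:            # trace back the rungs (i, j)
            ladder.append(cell)
            cell = back[cell[0]][cell[1]]
        return ladder[::-1]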
The hybrid algorithm presented above remains completely meaningful
even in the total absence of a dictionary. In this case, the crude translation
will be just the source language text, and sentence-level similarity falls back
to surface identity of words. After this first phase, a simple dictionary can
be bootstrapped from the initial alignment. From this alignment, the second
phase of the algorithm collects one-to-one alignments with a score above a
fixed threshold. Based only on the one-to-one segments, cooccurrences of
every source-target token pair are calculated. These, when normalized with
the maximum of the two tokens’ frequencies, yield an association measure.
Word pairs with association higher than 0.5 are used as a dictionary.
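A sketch of this bootstrapping step under the definitions above; the set-based counting per sentence pair is our assumption:

    from collections import Counter

    def bootstrap_dictionary(one_to_one_pairs, threshold=0.5):
        # one_to_one_pairs: high-confidence 1-1 aligned
        # (source_tokens, target_tokens) pairs from the first phase.
        f_src, f_tgt, cooc = Counter(), Counter(), Counter()
        for src, tgt in one_to_one_pairs:
            src, tgt = set(src), set(tgt)
            f_src.update(src)
            f_tgt.update(tgt)
            for s in src:
                for t in tgt:
                    cooc[s, t] += 1
        dictionary = {}
        for (s, t), c in cooc.items():
            # Cooccurrence count normalized by the larger of the two frequencies.
            association = c / max(f_src[s], f_tgt[t])
            if association > threshold:
                dictionary[s, t] = association
        return dictionary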
Our algorithm is similar in spirit to that of Moore (2002) in that they
both combine the length-based method with some kind of translation-based
similarity. In what follows we discuss how Moore’s algorithm differs from
ours.
Moore’s algorithm has three phases. First, an initial alignment is com-
puted based only on sentence length similarity. Next, an IBM ‘Model I’
translation model (Brown et al. 1993) is trained on a set of likely matching
sentence pairs based on the first phase. Finally, similarity is calculated us-
ing this translation model, combined with sentence length similarity. The
output alignment is calculated using this complex similarity score. Compu-
tation of similarity using Model I is rather slow, so only alignments close
to the initially found alignment are considered, thus restricting the search
space drastically.
Our simpler method using a dictionary-based crude translation model
instead of a full IBM translation model has the very important advantage
that it can exploit a bilingual lexicon, if one is available, and tune it ac-
cording to frequencies in the target corpus or even enhance it with extra
local dictionary bootstrapped from an initial phase. Moore’s method offers
no such way to tune a preexisting language model. This limitation is a real
one when the corpus, unlike the news and Hansard corpora more familiar
to those working on high density languages, is composed of very short and
heterogeneous pieces. In such cases, as in web corpora, movie captions, or
heterogeneous legal texts, average-based models are actually not close to
any specific text, so Moore’s workaround of building language models based
on 10,000 sentence subcorpora has little traction.
On top of this, our translation similarity score is very fast to calculate,
so the dictionary-based method can be used already in the first phase where
a much bigger search space can be traversed. If the lexicon resource is good
enough for the text, this first phase already gives excellent alignment results.
Maximizing alignment recall in the presence of noisy sentence segmen-
tation is an important issue, particularly as language density generally cor-
relates with the sophistication of NLP tools, and thus lower density implies
poorer sentence boundary detection. From this perspective, the focus of
Moore’s algorithm on one-to-one alignments is less than optimal, since ex-
cluding one-to-many and many-to-many alignments may result in losing
substantial amounts of aligned material if the two languages have different
sentence structuring conventions.
While speed is often considered a mundane issue, hunalign, written in
C++, is at least an order of magnitude faster than Moore’s implementation
(written in Perl), and the increase in speed can be leveraged in many ways
during the building of a parallel corpus with tens of thousands of documents.
First, rapid alignment allows for more efficient filtering of texts with low
confidence alignments, which usually point to faulty parallelism such as
mixed order of chapters (as we encountered in Arabian Nights and many
other anthologies), missing appendices, extensive extra editorial headers
(typical of Project Gutenberg), comments, different prefaces in the source
texts etc. Once detected automatically, most cases of faulty parallelism
can be repaired and the texts realigned. Second, debugging and fine-tuning
lower-level text processing steps (such as the sentence segmentation and
tokenization steps) may require several runs of alignment in order to monitor
the impact of certain changes on the quality of alignment. This makes speed
an important issue. Interestingly, runtime complexity of Moore’s program
seems to be very sensitive to the faults in parallelism. Adding a 300 word
surplus preface to one side of 1984 but not the other slows down this program
by a factor of five, while it has no detectable impact on hunalign.
Finally, Moore’s aligner, while open source and clearly licensed for re-
search, is not free software. In particular, parallel corpora aligned with it
can not be made freely available for commercial purposes. Since we wanted
to make sure that our corpus is available for any purpose, including com-
mercial use, Moore’s aligner program was not a viable choice.
4 Evaluation
In this section we describe our attempts to assess the quality of our parallel
corpus by evaluating the performance of the sentence aligner on texts for
which manually produced alignment is available. We also compare our
algorithm to Moore’s (2002) method.
Evaluation shows hunalign has very high performance: generally it
aligns incorrectly at most a handful of sentences. As measured by Moore’s
method of counting only one-to-one sentence-pairs, precision and recall
figures in the high nineties are common. But these figures are overly op-
timistic because they hide one-to-many and many-to-many errors, which
actually outnumber the one-to-one errors. In 1984, for example, 285 of the
6732 English sentences (about 4.3%) do not map onto a unique Hungarian
sentence, and 716 (10.6%) do not map onto a unique Romanian sentence;
similar proportions are found in other alignments, both manual and automatic.
To take these errors into account, we used a slightly different figure of
merit, defined as follows. The alignment trail of a text can be represented
by a ladder, i.e. an array of pairs of sentence boundaries: rung (i, j) is
present in the ladder iff the first i sentences on the left correspond to the
first j sentences on the right. Precision and recall values are calculated by
comparing the predicted and actual rungs of the ladder: we will refer to
this as the ‘complete rung count’ as opposed to the ‘one-to-one count’. In
general, complete rung figures of merit tend to be lower than one-to-one
figures of merit, since the task of getting them right is more ambitious: it
is precisely around the one-to-many and many-to-one segments of the text
that the alignment algorithms tend to stumble.
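Under this representation, the complete-rung figures of merit reduce to a set comparison; a minimal sketch, with rungs given as (i, j) pairs:

    def rung_scores(predicted, gold):
        """Complete-rung precision/recall: a ladder is a set of (i, j)
        sentence-boundary pairs; compare predicted rungs against the
        manual alignment."""
        predicted, gold = set(predicted), set(gold)
        hits = len(predicted & gold)
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(gold) if gold else 0.0
        return precision, recall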
condition        precision   recall
id                 34.30     34.56
id+swr             74.57     75.24
len                97.58     97.55
len+id             97.65     97.42
len+id+swr         97.93     97.80
dic                97.30     97.08
len+dic-stem       98.86     98.88
len+dic            99.34     99.34
len+boot           99.12     99.18
Table 2 presents precision and recall figures based on all the rungs of the
entire ladder against the manual alignment of the Hungarian version of
Orwell’s 1984 (Dimitrova et al. 1998).
If length-based scoring is switched off and we only run the first phase with-
out a dictionary, the system reduces to a purely identity based method
we denote by id. This will still often produce positive results since proper
nouns and numerals will “translate” to themselves. With no other steps
taken, on 1984 id yields 34.30% precision at 34.56% recall. By the simple
expedient of stopword removal, swr, the numbers improve dramatically, to
74.57% precision at 75.24% recall. This is due to the existence of short
strings which happen to have very high frequency in both languages (the
two predominant false cognates in the Hungarian-English case were a ‘the’
and is ‘too’).
Using the length-based heuristic len instead of the identity heuristic is
better, yielding 97.58% precision at 97.55% recall. Combining this with the
identity method does not yield significant improvement (97.65% precision
at 97.42% recall). If, on top of this, we also perform stopword removal, both
precision (97.93%) and recall (97.80%) improve.
Given the availability of a large Hungarian-English dictionary by A.
Vonyó, we also established a baseline for a version of the algorithm that
makes use of this resource. Since the aligner does not deal with multiword
tokens, entries such as Nemzeti Bank ‘National Bank’ are eliminated, re-
ducing the dictionary to about 120k records. In order to harmonize the
dictionary entries with the lemmas of the stemmer, the dictionary is also
stemmed with the same tool as the texts. Using this dictionary (denoted by
dic in Table 2) without the length-based correction results in slightly worse
performance than identity and length combined with stop word removal.
If the translation-method with the Vonyó dictionary is combined with
the length-based method (len+dic), we obtain the highest scores 99.34%
precision at 99.34% recall on rungs (99.41% precision and 99.40% recall on
one-to-one sentence-pairs). In order to test the impact of stemming we let
the algorithm run on the non-stemmed text with a non-stemmed dictionary
(len+dic-stem). This established that stemming has indeed a substantial
beneficial effect, although without it we still get better results than any of
the non-hybrid cases.
Given that the dictionary-free length-based alignment is comparable
to the one obtained with a large dictionary, it is natural to ask how the
algorithm would perform with a bootstrapped dictionary as described in
Section 3. With no initial dictionary but using this automatically boot-
strapped dictionary in the second alignment pass, the algorithm yielded
results (len+boot), which are, for all intents and purposes, just as good as
the ones obtained from combining the length-based method with our large
existing bilingual dictionary (len+dic). This is shown in the last two lines
of Table 2. Since this method is so successful, we implemented it as a mode
of operation of hunalign.
5 Conclusion
In the past ten years, much has been written on bringing modern language
technology to bear on low density languages. At the same time, the bulk
of commercial research and product development, understandably, concen-
trated on high density languages. To a surprising extent this left the medium
density languages, spoken by over half of humanity, underresearched. In this
paper we attempted to address this issue by proposing a methodology that
does not shy away from manual labor as far as the data collection step
is concerned. Harvesting web pages and automatically detecting parallels
turns out to yield only a meager slice of the available data: in the case
of Hungarian, less than 1%. Instead, we proposed several other sources
of parallel texts based on our experience with creating a 50 million word
Hungarian–English parallel corpus.
Once the data is collected and formatted manually, the subsequent steps
can be almost entirely automated. Here we have demonstrated that our
hybrid alignment technique is capable of efficiently generating very high
quality sentence alignments with excellent recall figures, which helps to get
the maximum out of small corpora. Even in the absence of any language
resources, alignment quality is very high, but if stemmers or bilingual dic-
tionaries are available, our aligner can take advantage of them.
REFERENCES
Brown, Peter F., Jennifer Lai & Robert Mercer. 1991. “Aligning Sentences
in Parallel Corpora”. 29th Meeting of the Association for Computational
Linguistics (ACL’91 ), 169-176. Berkeley: University of California.
Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra & Robert L.
Mercer. 1993. “The Mathematics of Statistical Machine Translation: Pa-
rameter Estimation”. Computational Linguistics 19:2.263-311.
Chen, Jiang & Jian-Yun Nie. 2000. “Automatic Construction of Parallel English-
Chinese Corpus for Cross-Language Information Retrieval”. 6th Conf. on
Applied Natural Language Processing, 21-28. San Francisco, Calif.
¹ Although paragraph identification itself contains a lot of errors, improvement may be
due to the fact that paragraphs, however faulty, are consistent in terms of alignment.
The details of this and the question of exploiting higher-level layout information are left
for future research.
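The Role of Data in NLP: Dataset Profiling
Anne De Roeck
The Open University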
Abstract
There is a rich literature documenting the fact that text and docu-
ment properties affect the success of techniques in data driven Nat-
ural Language Processing and Information Retrieval. At the same
time, there is little by way of a systematic investigation of the precise
role played by data. This paper sets out the rationale for achieving
a better understanding of the role of data, and points out some un-
desirable methodological, epistemological and practical consequences
of not doing so. Some rough sparseness measures are used to show
that standard datasets differ in important ways. A case is made for
exploring the role of data by compiling dataset profiles. These would
contain measures tailored to highlight inherent bias in a collection,
reflecting its fitness to support a technique in the context of a task.
1 Data matters
It is an inescapable fact of life in Natural Language Processing (nlp) and
Information Retrieval (ir) that, given some task, the performance of a tech-
nique will depend on the properties of the data on which it is deployed. A
substantial literature documents this three-way dependency between task,
data and technique performance, often highlighting it in the context of ex-
perimental evaluations. It is quite easy to find mention of text or document
properties that are believed to have an effect. For instance, breadth of do-
main coverage is a factor. “Deep” language understanding techniques are
reported to work well in narrow domains with a great deal of inherent struc-
ture (Copestake & Spärck Jones 1990), but they entail large processing costs
and so are less suited to handling large volumes of text, or for deployment
in interactive systems (Zaenen & Uzkoreit 1996). They are also too brittle
for general querying of broad, unstructured sources, such as Web search,
where “shallow” statistically-based ir techniques work well (Spärck Jones
1999). Explicit structure markers have an effect. For interactive search, hy-
perlinks do not significantly improve recall and precision in diverse domains,
such as the trec test data (Savoy & Pickard 1999, Hawking et al. 1999),
but they do in narrow domains and for searching intranets (Kruschwitz
2001). Stemming is a well established technique that does not improve
effectiveness of retrieval in general (Harman 1991), except for morphologi-
cally complex languages (Popovic & Willett 1992), and for short documents
(Krovetz 1993). Document (and hence query) length is a factor in its own
right. Short keyword-based queries behave differently from long structured
queries (Pickens & Croft 2000), and general textbooks (Jurafsky & Martin
2000) state quite categorically that keyword-based retrieval works better on
long texts.
Whilst the role of data is very much acknowledged, a precise charac-
terisation of salient data features is not usually pursued. For instance,
whilst document length is a known factor, it remains unclear exactly what
is meant by a short document in any given set of experiments. Similarly, lit-
tle is known about how to recognise a “diverse” domain in a collection. On
the whole, papers setting out experimental results do not routinely contain
information on the datasets on which these were obtained.
As a research community with a large experimental agenda, the absence
of detailed characterisation of the role of data should worry us. The
physicist Richard Feynman described a cornerstone of his notion of
scientific integrity:
“. . . a principle of scientific thought that corresponds to a kind of utter
honesty–a kind of leaning over backwards. For example, if you’re doing
an experiment, you should report everything that you think might make
it invalid–not only what you think is right about it: other causes that
could possibly explain your results; and things you thought of that you’ve
eliminated by some other experiment, and how they worked–to make sure
the other fellow can tell they have been eliminated. [...] In summary, the
idea is to give all of the information to help others to judge the value of
your contribution; not just the information that leads to judgement in one
particular direction or another.” (Feynman 1992:341)
In data driven nlp and ir, text and document properties are clearly ac-
knowledged as one of these “other causes”, and consequently it would seem
necessary to engage with charting in what way they contribute to experi-
mental outcomes. However, determining such properties and reporting on
them is a non-trivial task, because there is no agreed common framework for
approaching it. Without such a framework for approaching the systematic
understanding of the role of data, we are faced with three serious, unde-
sirable consequences. The first is methodological in nature, and concerns
replicability: unless it is understood which aspects of the data give rise to
which effects, experimental outcomes cannot be replicated reliably, except
in a trivial sense by running the same experiment on the same data. The
second is epistemological: it is impossible to generalise from empirical find-
ings in the absence of a transparent framework for describing systematic,
testable relationships between data characteristics and technique perfor-
mance given some task. The third is practical: when faced with a new set
of data and some task, it is impossible to know how the dataset relates to
2 Sparseness
Given enough data, even simple frequency-based measures can show how
datasets differ significantly along dimensions we know will affect the be-
haviour of techniques and applications. Sparseness, for instance, is a well
known problem in nlp and ir. Type to token ratios (ttr) can be used as
’off-the-cuff’ sparseness indicators. Whilst they are coarse-grained, they are
cheap to calculate, by dividing the total number of words in a portion of
text by the number of terms. On the assumption that each occurrence of a
term provides evidence of its use, ttrs show, for some sample of running
text, how much evidence (in words) is present on average for every term.
Thus, sparser data have lower ttrs. Another way of interpreting the ratio
is as a very rough indicator of how far apart (in terms of running words)
evidence of new terms is spaced.
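As an illustration, the ratio can be computed in a few lines. The file name below is a placeholder, and real profiling would of course use proper tokenization rather than whitespace splitting:

    def type_token_ratio(tokens):
        # "TTR" as used here: running words divided by distinct terms, i.e.,
        # the average amount of evidence (in words) available per term.
        return len(tokens) / len(set(tokens))

    # Hypothetical usage on a one-million-word sample of a corpus file:
    tokens = open("corpus.txt", encoding="utf-8").read().split()[:1_000_000]
    print(f"TTR over {len(tokens):,} words: {type_token_ratio(tokens):.2f}")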
When run over standard collections, such as tipster, ttrs reveal some
interesting differences. For ease of reference, Table 1 gives a brief description
of all the datasets that will be used in this paper, including the tipster
ones. Table 2 sets out some ttrs for 7 of those datasets, calculated over text
samples of 100, 200, 800, 20,000 and 1 million words long. As expected, the
ttrs suggest that sparseness ratios drop with sample length: large datasets
are less likely to be sparse. However, they also show that there are significant
differences between datasets in the tipster collection. For instance, at one
million words, the U.S. Patents corpus (pat) will on average supply 62 words
worth of evidence for every term (or, in an alternative reading of the ttr,
evidence of a new term will crop up, on average, about 62 words apart).
A one million word section of the San Jose Mercury corpus (sjm) is much
sparser, with only 26 words worth of evidence for every term. This suggests
that techniques that are sensitive to sparseness might do less well when
confronted with one million words of sjm text than they might on the same
amount of pat text. Importantly, the ttrs in Table 2 further suggest there
are significant differences between languages. Comparing newspaper text,
the ratio for one million words of Arabic is 8.25, which appears dramatically
worse than that of text drawn from the San Jose Mercury (26.38). Looking
at more balanced corpora, the ratios for the Bengali corpus are much sparser
than the overall tipster ratios.
Table 1: Datasets used in this paper
Data Set   Contents   Collection size
TIPSTER Diverse English language reference dataset; has sub-collections.
AP TIPSTER. Copyrighted AP Newswire stories from 1989. 114,438,101
DOE TIPSTER. Short abstracts from the US Department of Energy. 26,882,774
FR TIPSTER. US government reports from Federal Register (1989). 62,805,175
PAT TIPSTER. US Patent Documents for the years 1983-1991. 32,151,785
SJM TIPSTER. Copyrighted stories from 1991 San Jose Mercury News. 39,546,073
WSJ TIPSTER. Stories from Wall Street Journal 1987-89 41,560,108
ZF TIPSTER. Computer Select disks 1989/90, Ziff-Davis Publishing 115,956,732
OU Dataset drawn from Open University Intranet. 39,807,404
Arabic Arabic newspaper articles drawn from Al-Hayat newspaper. 18,639,264
Bengali Modern balanced Bengali corpus - Institute of Indian Languages 3,052,522
detection has used very frequent function word distribution, which further
strengthens their position as a suitable starting point for the exploration of
dataset profiling measures for genre-sensitive tasks and techniques.
5 Conclusion
Data matters, and there are significant differences between text and doc-
ument collections that impact on the behaviour of techniques. There are
many compelling reasons to investigate the role of dataset characteristics
on experimental results in nlp and ir. In order to do so, a framework is
required to guide systematic exploration of the influence of data character-
istics. One way to approach such a framework is to develop measures that
reflect inherent bias in a collection, by highlighting its fitness to support a
technique in the context of a task. Such measures can be combined into
dataset profiles which can be published. This would make benchmarking
possible and has the advantage of adding a level of detail to the interpreta-
tion of existing results for standard reference collections, introducing oppor-
tunities for aggregating across existing studies. It is important to select mea-
sures that are informative, and sufficiently fine-grained to bring out salient
characteristics. Early attempts at investigating collection properties were
hampered by lack of data and the limitations of computational resources.
These limitations are far less critical today, and apart from vast amounts of
data, it is now possible to expand the battery of standard frequency-based
measures with others that are computationally more demanding.
Acknowledgements. I became acutely aware of the need to get a firmer
handle on the role of data when I started working on natural language interfaces
to intranets and other extreme datasets, such as the Yellow Pages, with Udo
Kruschwitz, Nick Webb and John Carson. The notion of dataset profiling arose
directly from the need to develop reliable datasets to drive experiments on Arabic,
at a time when there were none. This I first explored with Waleed Al-Fares
and Abduelbaset Goweder. The idea of term distribution measures as a profiling
technique for data arose during work with Nikos Nanas and Victoria Uren, on user
profiles in multi-topic information filtering. Profiling with homogeneity measures,
and the use of fine-grained Bayesian models of term distribution builds on work
with Avik Sarkar and Paul Garthwaite, who deserve very special thanks, as does
Marian Petre for many helpful comments. I am indebted to all of these colleagues.
REFERENCES
Copestake, Anne & Karen Spärck Jones. 1990. “Natural Language Interfaces to
Databases”. Knowledge Engineering Review 5:4.225-249.
De Roeck, Anne, Avik Sarkar & Paul Garthwaite. 2004. “Frequent Term Dis-
tribution Measures for Dataset Profiling”. 4th International Conference of
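Function Words Do Not Distribute Homogeneously
Anne De Roeck, Avik Sarkar & Paul H. Garthwaite
The Open University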
Abstract
We have known for some time that content words have “bursty” dis-
tributions in text. In contrast, much of the literature assumes that
function words are uninformative and they distribute homogeneously.
We describe two sets of experiments showing that assumptions of
homogeneity do not hold, even for the distribution of extremely fre-
quent function words. In the first experiment, we investigate the
behaviour of very frequent function words in the tipster collection
by postulating a “homogeneity assumption”, which we then defeat in
a series of experiments based on the χ2 test. Results show that it is
statistically unreasonable to assume homogeneous term distributions
within a corpus. We also found that document collections are not
neutral with respect to the property of homogeneity, even for very
frequent function words. In the second set of experiments, we model
the gaps between successive occurrences of a particular term using
a mixture of exponential distributions. Where the “homogeneity as-
sumption” holds these gaps should be uniformly distributed across
the entire corpus. Using the model we demonstrate that gaps are not
uniformly distributed, and even very frequent terms occur in bursts.
1 Introduction
first is based on the “bag of words” assumption that very frequent terms
are uniformly distributed and the gaps between successive occurrences of a
particular term are generated from a single exponential distribution. This
is in contrast to the second alternative, which assumes that terms occur in
bursts and the gaps between successive occurrences of a term are generated
from a mixture of two exponential distributions, one reflecting the rate of
occurrence of the term in the corpus and the other reflecting the rate of
re-occurrence after it has occurred recently (Sarkar et al. 2005a).
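The second model can be sketched as follows. We treat λ1 and λ2 as mean gap lengths, one plausible reading consistent with the magnitudes reported in the tables below; the authors' exact parameterization and fitting procedure may differ:

    import numpy as np
    from scipy.optimize import minimize

    def gaps_for(term, tokens):
        """Gaps (in tokens) between successive occurrences of a term."""
        pos = [i for i, w in enumerate(tokens) if w == term]
        return np.array([b - a for a, b in zip(pos, pos[1:])], dtype=float)

    def neg_log_likelihood(params, gaps):
        # Mixture of two exponentials with mixing weight p and mean gaps
        # m1 (base rate of occurrence) and m2 (re-occurrence after recent use).
        p, m1, m2 = params
        pdf = (p / m1) * np.exp(-gaps / m1) + ((1 - p) / m2) * np.exp(-gaps / m2)
        return -np.sum(np.log(pdf + 1e-300))

    def fit_mixture(gaps):
        x0 = [0.5, gaps.mean(), gaps.mean() / 10]      # crude initialization
        bounds = [(1e-3, 1 - 1e-3), (1e-3, None), (1e-3, None)]
        res = minimize(neg_log_likelihood, x0, args=(gaps,),
                       bounds=bounds, method="L-BFGS-B")
        return res.x   # estimates of (p, lambda1, lambda2)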
2 Experimental framework
2.3 Data
We choose the tipster collection for our experiments because the dataset
is of good quality, it is well understood, and it contains a range of different
genres (Table 1). We assembled some basic profiling data on these datasets.
In Table 1, we list type to token ratios at 10 million words for each dataset.
These ratios are calculated by dividing the number of words by the number
of unique terms. They give a rough appreciation of the breadth of coverage,
to the extent where breath of terminology can reflect this. The value is
an indication of the average number of “old” words between occurrences of
“new” words in running text.
Table 1: Contents of each of the datasets along with top terms, Average
Document Length (ADL) and Type-to-Token Ratios (TTR)
We inspected the 100 most frequent terms from each of the datasets by
hand. Very frequent terms are not always function words, and lists were
sensitive to the domain of some datasets.¹ The 10 most frequent terms
(Table 1) nonetheless showed a high degree of overlap, and were clearly
function words. Focusing on these ten terms in experiments across the
different collections should yield information on the behaviour of a small
collection of very frequent function words.
3 Experimental results
¹ For example, section in position 19 in fr, software in position 21 in zf, and invention
in position 26 in pat.
           docDiv                          halfdocDiv
           N most frequent terms           N most frequent terms
DataSet    10      20      50      100     10      20      50      500
ap         2.11    1.58    2.58    2.29    1.77    1.47    1.27    1.17
           0.12    0.21    0       0       0.09    0.12    0.06    0.02
doe        1.17    1.45    1.76    1.98    0.73    0.93    1.04    1.06
           0.46    0.16    0.03    0       0.66    0.53    0.37    0.20
fr         54.52   41.72   72.09   66.79   7.91    9.55    11.64   8.85
           0       0       0       0       0       0       0       0
pat        21.07   29.32   62.49   55.35   20.36   15.57   11.89   7.69
           0       0       0       0       0       0       0       0
sjm        3.60    2.77    3.23    2.98    1.32    1.57    1.47    1.33
           0.12    0       0       0       0.38    0.39    0.11    0
wsj        2.36    2.66    2.36    2.32    1.56    1.62    1.30    1.24
           0.178   0       0       0       0.28    0.25    0.26    0.02
zf         11.95   8.13    6.91    6.58    1.95    1.86    1.61    1.56
           0       0       0       0       0.13    0.12    0.02    0
               the                       of                        said
DataSet   p̂     λ̂1      λ̂2       p̂     λ̂1      λ̂2       p̂     λ̂1        λ̂2
ap        0.59  16.58   16.11    0.65  38.37   36.44    0.04  696.38    69.01
doe       0.29  20.49   12.72    0.62  21.10   19.72    0.67  61349.69  12224.94
fr        0.01  194.89  13.47    0.02  106.25  24.01    0.84  26385.22  392.62
pat       0.03  58.96   10.61    0.03  73.42   21.82    0.06  2167.32   13.10
sjm       0.02  168.52  17.80    0.04  205.38  39.45    0.16  2499.38   92.42
wsj       0.70  17.46   17.00    0.42  36.91   35.39    0.12  1608.49   72.62
zf        0.10  67.80   18.39    0.01  262.47  46.51    0.42  8810.57   177.21
For the term the, λ̂1 and λ̂2 are very similar in the ap and wsj datasets
and p̂ is close to 0 in the fr, pat and sjm datasets, so in these datasets
the may distribute homogeneously. In the doe and zf datasets, however, p̂
is near neither 0 nor 1, and λ̂1 and λ̂2 differ markedly, so the does not
appear to distribute homogeneously in these two datasets. Similarly, the
term of has very similar values of λ̂1 and λ̂2 for the ap, doe and wsj
datasets, and p̂ is close to 0 for the fr, pat, sjm and zf datasets. The
datasets provide little evidence against homogeneity based either on the
values of λ̂1 and λ̂2 or on p̂. In contrast, the term said
shows evidence of homogeneity only for the ap dataset, for which the value
of p̂ is close to 0.
To investigate the homogeneity assumption for other common terms, we
calculate the ratio between the two estimates, λ̂1/λ̂2, and study how close
this ratio is to 1. A λ̂1/λ̂2 ratio of 1 indicates that the two exponential
distributions have equal means, and hence reduce to a single exponential
distribution. A large deviation of the λ̂1/λ̂2 ratio from 1 reveals the
presence of two very distinct exponential distributions and provides
evidence against the homogeneity assumption of the term’s distribution in
the corpus, provided the value of p̂ is not close to either 0 or 1. If the
value is very close to 0 or 1 (a difference of less than 0.05), we argue
that one of the exponential distributions has a negligible effect and there
is little evidence against the term being homogeneously distributed.
Table 4 provides the λ̂1/λ̂2 ratio and the values of p̂ for the most
frequent terms of each of the datasets.
and the values of pe for the most frequent terms of each of the datasets.
In the table ratios of λe1 /λe2 that are less than 1.2 are given in bold-face
type, as are values of pe that are below 0.05 or above 0.95. Combinations are
underlined when either of these is in bold. For terms with underlined values,
the model does suggests the assumption of homogeneity is not violated, but
the assumption seems poor for the terms that are not underlined in the
table. It can be observed from Table 4 that only the term of show signs
of being homogeneously distributed across all the datasets either based on
the λe1 /λe2 ratio or values of pe being close to 0. The terms and, are, the and
to also seem to be homogeneously distributed across many of the datasets.
The other 12 terms in Table 4 only appear to be homogeneously distributed
in at most 2 of the 7 datasets. Said is an interesting term in the table. It has
Term AP DOE FR PAT SJM WSJ ZF
a 1.9 (0.17) 3.6 (0.46) 3.8 (0.23) 3.3 (0.15) 2.3 (0.10) 2.0 (0.14) 2.5 (0.10)
and 1.1 (0.30) 1.1 (0.53) 2.7 (0.11) 3.6 (0.02) 3.6 (0.07) 1.1 (0.14) 1.1 (0.10)
are 1.1 (0.69) 1.1 (0.32) 3.0 (0.07) 5.0 (0.33) 10.4 (0.01) 1.1 (0.69) 1.2 (0.47)
as 31.6 (0.93) 31.5 (0.93) 3.9 (0.45) 4.5 (0.24) 40.8 (0.90) 65.1 (0.90) 56.4 (0.91)
be 3.0 (0.73) 1.3 (0.49) 6.1 (0.13) 5.7 (0.27) 3.5 (0.64) 2.1 (0.33) 2.2 (0.27)
for 2.0 (0.29) 3.2 (0.54) 4.4 (0.05) 4.1 (0.26) 2.6 (0.55) 1.9 (0.31) 15.9 (0.01)
in 2.2 (0.13) 2.5 (0.17) 7.1 (0.02) 4.0 (0.05) 2.9 (0.10) 1.9 (0.22) 2.8 (0.08)
is 2.9 (0.58) 4.6 (0.35) 3.8 (0.19) 5.3 (0.07) 4.0 (0.34) 2.4 (0.34) 5.7 (0.02)
of 1.1 (0.65) 1.1 (0.62) 4.4 (0.02) 3.4 (0.03) 5.2 (0.04) 1.1 (0.42) 5.6 (0.01)
on 1.9 (0.31) 5.7 (0.72) 4.7 (0.21) 5.9 (0.25) 2.6 (0.46) 1.9 (0.55) 2.5 (0.22)
or 41.5 (0.95) 3.7 (0.48) 6.8 (0.36) 9.9 (0.28) 8.6 (0.81) 4.6 (0.71) 3.6 (0.78)
said 10.1 (0.04) 5.0 (0.67) 67.2 (0.84) 165.3 (0.06) 27.0 (0.16) 22.1 (0.12) 49.7 (0.42)
san 112.4 (0.92) 21.1 (0.74) 579.3 (0.93) 855.8 (0.93) 14.4 (0.46) 149.6 (0.96) 90.6 (0.80)
that 2.7 (0.16) 1.2 (0.59) 4.8 (0.15) 4.7 (0.21) 3.4 (0.20) 2.5 (0.11) 4.3 (0.04)
the 1.0 (0.59) 1.6 (0.29) 14.5 (0.01) 5.5 (0.03) 9.5 (0.02) 1.0 (0.70) 3.7 (0.10)
to 3.2 (0.10) 1.2 (0.56) 12.5 (0.01) 3.8 (0.05) 6.8 (0.02) 1.1 (0.41) 3.2 (0.04)
with 2.6 (0.40) 2.4 (0.28) 3.5 (0.23) 3.6 (0.24) 2.7 (0.64) 1.3 (0.45) 2.5 (0.12)
Table 4: Values of the λ̂₁/λ̂₂ ratio and p̂ (in parentheses) for the frequent terms
very high values of the λ̂₁/λ̂₂ ratio, and the values vary over a huge range. Also, the value of p̂ for said is close to 0 for the ap dataset. This is because the term said depends heavily on a document's content and style,
and these characteristics can be explored and studied by modeling the gaps. It is this property that allows the model to be used in characterizing genre and stylistic features (Sarkar et al. 2005b). The term san is an outlier in the list, as it is not a function word. But it featured in the list of top 10 terms in the sjm (stories from the San Jose Mercury News) collection, being a very widely used term in that collection. As expected based on the model, it is a very rare term in the other collections, as indicated by large values of λ̂₁, and its bursty nature is indicated by small values of λ̂₂, leading to large values of the λ̂₁/λ̂₂ ratio. In contrast, the λ̂₁/λ̂₂ ratio for the sjm collection is relatively small compared to the values for the other collections, demonstrating that san is a non-informative term in sjm, relative to the other collections.
The term as is also quite interesting, as it exhibits large values of the λ̂₁/λ̂₂ ratio in all the collections other than fr and pat. pat has comparatively large values of the λ̂₁/λ̂₂ ratio for most of the other terms, indicating that a term's distribution depends on the content, style and structure of the document and collection. Under such circumstances, is it appropriate to apply a generic "stop-word" list to any collection of documents? Based on the above experiments, we believe that the answer is "no".
4 Conclusion
REFERENCES
Argamon, Shlomo & Shlomo Levitan. 2005. "Measuring the Usefulness of Function Words for Authorship Attribution". Proceedings of the 2005 Conference of the Association for Computing in the Humanities & Literary and Linguistic Computing. Canada.
Cavaglià, Gabriela. 2002. "Measuring Corpus Homogeneity Using a Range of Measures for Inter-document Distance". 3rd International Conference on Language Resources and Evaluation (LREC), 426-431. Spain.
Church, Ken. 2000. "Empirical Estimates of Adaptation: The Chance of Two Noriegas Is Closer to p/2 than p²". 18th International Conference on Computational Linguistics (COLING-2000), 173-179. Saarbrücken, Germany.
De Roeck, Anne, Avik Sarkar & Paul H. Garthwaite. 2004. "Defeating the Homogeneity Assumption". Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (JADT), 282-294. Louvain-la-Neuve, Belgium.
Dunning, Ted. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence". Computational Linguistics 19:1.61-74.
Franz, Alexander. 1997. "Independence Assumptions Considered Harmful". Proceedings of the European Chapter of the ACL (EACL'97), 182-189. Madrid, Spain.
Gelman, Andrew, John B. Carlin, Hal S. Stern & Donald B. Rubin. 1995. Bayesian Data Analysis. London: Chapman and Hall.
Katz, Slava M. 1996. "Distribution of Content Words and Phrases in Text and Language Modelling". Natural Language Engineering 2:1.15-60.
Kilgarriff, Adam. 1997. "Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora". Proceedings of the ACL-SIGDAT Workshop on Very Large Corpora, 231-245. Hong Kong.
Rayson, Paul & Roger Garside. 2000. "Comparing Corpora Using Frequency Profiling". Proceedings of the Workshop on Comparing Corpora, 1-6. Hong Kong.
Rose, Tony & Nick Haddock. 1997. "The Effects of Corpus Size and Homogeneity on Language Model Quality". Proceedings of the ACL-SIGDAT Workshop on Very Large Corpora, 178-191. Beijing.
Sarkar, Avik, Paul H. Garthwaite & Anne De Roeck. 2005a. "A Bayesian Mixture Model for Term Re-occurrence and Burstiness". Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 48-55. Ann Arbor, Michigan.
Sarkar, Avik, Anne De Roeck & Paul H. Garthwaite. 2005b. "Term Re-occurrence Measures for Analyzing Style". Proceedings of the SIGIR Workshop on Textual Stylistics in Information Access. Salvador, Brazil.
Stamatatos, Efstathios, Nikos Fakotakis & George Kokkinakis. 2000. "Text Genre Detection Using Common Word Frequencies". 18th International Conference on Computational Linguistics (COLING-2000), 808-814. Saarbrücken, Germany.
Wilbur, W. John & Karl Sirotkin. 1992. "The Automatic Identification of Stop Words". Journal of Information Science 18:1.45-55.
Yang, Yiming & John Wilbur. 1996. "Using Corpus Statistics to Remove Redundant Words in Text Categorization". Journal of the American Society for Information Science 47:5.357-369.
Exploiting Parallel Texts to Produce a Multilingual
Sense Tagged Corpus for Word Sense Disambiguation

Lucia Specia, Maria G.V. Nunes & Mark Stevenson

Abstract
We describe an approach to the automatic creation of a sense tagged
corpus intended to train a word sense disambiguation (WSD) sys-
tem for English-Portuguese machine translation. The approach uses
parallel corpora, translation dictionaries and a set of straightforward
heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs, the approach achieved an average precision of 94%, compared with 58% when a state-of-the-art statistical alignment tool
was used. The resulting corpus consists of 113,802 tagged sentences
and can be readily used to train a supervised machine learning algo-
rithm to build WSD models for use in machine translation systems.
1 Introduction
In general, the lists produced contain some verbs and words of other parts of speech which are not possible translations of the verb according to our dictionaries (in italics in Table 3), together with a null translation probability, that is, the probability of the verb not being translated. Moreover, they do not include all the possible translations, since many of them may not occur in the corpus, or may occur only with a very low frequency.
Given these assumptions, we defined a set of heuristics to find the most adequate translation for each occurrence of the verb in an English unit (EU) within the Portuguese unit (PU) (see Figure 1, and the code sketch after the list):
1. Identify and annotate inseparable phrasal verbs in the EU.
2. Identify and annotate, in the remaining EUs, separable phrasal verbs.
We assume the remaining EUs do not contain any phrasal verb.
[Figure 1: flowchart of the heuristics. For each occurrence of verb Vj in an English unit (EUi), test whether it is phrasal, look up the corresponding dictionary (phrasal or individual) and seek the translation among the lemmas of the Portuguese unit (PUi), resorting to position and probability information when more than one translation is found]
3. Search for all possible translations of the verb in the verb lemmas of
the PU, consulting specific dictionaries for inseparable, separable, or
non-phrasal verbs. Three possible situations arise:
(a) No translation is found — go to step 4.
(b) Only one translation is found — go to step 5.
(c) Two or more translations are found — go to step 6.
4. If the occurrence is a non-phrasal verb, finalize the process (no adequate translation was found). Otherwise, check whether the verb can also be used as a non-phrasal verb. If it can, go back to step 3, now looking for possible translations of the verb in the dictionary of non-phrasal verbs. If it cannot be used as a non-phrasal verb, finalize the process (no adequate translation was found).
5. Annotate the EU with the only possible translation.
6. Identify the absolute positions of the verb in the EU and of each possible translation in the PU, assigning a position weight (PosW)
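The selection procedure admits a compact implementation. The following is a minimal sketch under stated assumptions: the dictionary layout (phrasal verbs keyed as "verb particle" strings), the crude phrasal-verb detector and the positional tie-breaking in the final step are illustrative choices, not the authors' code.

```python
def detect_phrasal(verb, eu_tokens, insep_dict, sep_dict):
    """Rough phrasal-verb check: return ('insep'|'sep'|None, lookup key).
    Inseparable verbs take the particle immediately after the verb;
    separable ones allow a short gap (3-token window, an assumption)."""
    i = eu_tokens.index(verb)
    if i + 1 < len(eu_tokens) and f"{verb} {eu_tokens[i + 1]}" in insep_dict:
        return "insep", f"{verb} {eu_tokens[i + 1]}"
    for particle in eu_tokens[i + 1:i + 4]:
        if f"{verb} {particle}" in sep_dict:
            return "sep", f"{verb} {particle}"
    return None, verb

def find_translation(verb, eu_tokens, pu_lemmas,
                     insep_dict, sep_dict, plain_dict):
    """Choose a translation of `verb` among the PU verb lemmas, or None."""
    kind, key = detect_phrasal(verb, eu_tokens, insep_dict, sep_dict)  # steps 1-2
    table = {"insep": insep_dict, "sep": sep_dict, None: plain_dict}[kind]

    # Step 3: dictionary translations that actually occur in the PU.
    candidates = [t for t in table.get(key, []) if t in pu_lemmas]

    # Step 4: a phrasal occurrence may still translate as a plain verb.
    if not candidates and kind is not None:
        candidates = [t for t in plain_dict.get(verb, []) if t in pu_lemmas]

    if not candidates:
        return None              # no adequate translation found
    if len(candidates) == 1:
        return candidates[0]     # step 5: annotate with the sole candidate

    # Step 6 (only sketched here): prefer the candidate whose relative
    # position in the PU is closest to the verb's position in the EU.
    rel = eu_tokens.index(verb) / len(eu_tokens)
    return min(candidates,
               key=lambda t: abs(pu_lemmas.index(t) / len(pu_lemmas) - rel))

# Hypothetical usage:
# find_translation("look", ["I", "will", "look", "at", "it"],
#                  ["eu", "ir", "olhar", "para", "ele"],
#                  {}, {}, {"look": ["olhar", "parecer"]})  # -> "olhar"
```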
Our approach determined a translation for 55% of all verb occurrences (113,802 units) in the corpora shown in Table 2. Similar identification percentages were observed across verbs and corpora. The lack of identification for the remaining occurrences was due to three main reasons: (a) we do not consider multi-word translations; (b) errors from the tools used in the pre-processing steps, especially POS tagging errors; and (c) modified translations, including cases of omission and addition of words.
Since our intention is to use this corpus to train a WSD model, we give preference to precision of the sense tagging over wide coverage. In order to estimate this precision, we randomly selected and manually analyzed 30 tagged EUs from each corpus for each verb, amounting to 1,500 units. The resulting precisions are shown in Table 4.
On average, our approach was able to identify the correct senses of 94% of the analyzed units. It achieved a very high average precision (99%) for the less ambiguous verbs (the last three in Table 3). Of the seven highly ambiguous verbs, to look and to give have fewer possible senses than the rest, and for them the system also achieved a very high average precision (96%). For the remaining five verbs, the system achieved an average precision of 90.3%. Most of the tagging errors were related to characteristics of the corpora. However, some errors were also due to limitations of our heuristics, as illustrated by the distribution of error sources for each corpus in Table 5.
Corpus          Idiom/slang   Modified translation   Tagger error   Heuristics
Europarl        6%            66%                    8%             20%
Compara         8%            71%                    0              21%
Messages        0             100%                   0              0
Bible           6%            74%                    10%            10%
Miscellaneous   10%           69%                    16%            5%
Table 5: Tagging error sources
Most of the errors were due to modified translations, including omissions and paraphrases (such as active voice sentences being translated by different verbs in a passive voice). In fact, with the exception of the technical corpora (particularly Messages), the translations were far from literal. In those cases, as in the case of idioms or slang usages, the actual translation was not in the sentence, or was written using words that were not in the dictionary, but the system found another possible translation, corresponding to a different verb. Tagger errors refer to the incorrect tagging of the verbs with some other POS; in these cases the system likewise picked out other possible translations in the PU. Errors due to the choices made by our heuristics are also related to the other mentioned errors. For example, considering the position
4 Conclusions
Detecting Dangerous Coordination Ambiguities
Using Word Distribution
Francis Chantree*, Alistair Willis*, Adam Kilgarriff** & Anne de Roeck*
*The Open University, **Lexical Computing Ltd
Abstract
In this paper we present heuristics for resolving coordination ambi-
guities. We test the hypothesis that the most likely reading of a coor-
dination can be predicted using word distribution information from a
generic corpus. Our heuristics are based upon the relative frequency
of the coordination in the corpus, the distributional similarity of the
coordinated words, and the collocation frequency between the coor-
dinated words and their modifiers. These heuristics have varying but
useful predictive power. They also take into account our view that
many ambiguities cannot be effectively disambiguated, since human
perceptions vary widely.
1 Introduction
Throughout this paper, the examples have been taken from requirements
engineering documents. Gause and Weinberg (1989) recognise requirements
as a domain in which misunderstood ambiguities may lead to serious and
potentially costly problems.
2 Methodology
‘Central coordinators’, such as and and or, are the most common cause of coordination ambiguity, and account for approximately 3% of the words in the BNC. We investigate single coordination constructions using these conjunctions (and and or) and incorporating two conjuncts and a modifier, as in the phrase:
old boots and shoes,
where old is the modifier and boots and shoes are the two conjuncts. We describe the case where old applies to both boots and shoes as ‘coordination-first’, and the case where old applies only to boots as ‘coordination-last’.
We investigate the hypothesis that the preferred reading of a coordi-
nation can be predicted by using three heuristics based upon word dis-
tributions in a general corpus. The first we call the Coordination-Matches
heuristic, which predicts a coordination-first reading if the two conjuncts are
frequently coordinated. The second we call the Distributional-Similarity
heuristic, which predicts a coordination-first reading if the two conjuncts
have strong ‘distributional similarity’. The third we call the Collocation-
Frequency heuristic, which predicts a coordination-last reading if the mod-
ifier is collocated with the first conjunct more often than with the second.
We represent the conjuncts by their head words in all three types of analysis.
In our example, we find that shoes is coordinated with boots relatively
frequently in the corpus. boots and shoes are shown to have strong distribu-
tional similarity, suggesting that boots and shoes is a syntactic unit. Both
these factors predict a coordination-first reading. Thirdly, the ‘collocation
frequency’ of old and boots is not significantly greater than that of old and
shoes and so a coordination-last reading is not predicted. Therefore, all the
heuristics predict a coordination-first reading for this phrase.
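Schematically, the three heuristics reduce to threshold tests over corpus statistics. The sketch below is illustrative only: the authors obtain coordination rankings, distributional similarities and collocation counts from the Sketch Engine, whereas the inputs and the similarity and collocation thresholds here are assumed placeholders (the top-25 ranking cut-off is the one reported later in the paper). The final lines also apply the combination of heuristics discussed in the evaluation below.

```python
def coordination_matches(best_rank, cutoff=25):
    """Coordination-Matches: coordination-first if one conjunct appears
    high in the other's list of coordinated words (better of two ranks)."""
    return best_rank <= cutoff

def distributional_similarity(sim, threshold=0.3):
    """Distributional-Similarity: coordination-first if the conjuncts are
    distributionally similar (the 0.3 threshold is an assumption)."""
    return sim >= threshold

def collocation_frequency(mod_with_first, mod_with_second, ratio=2.0):
    """Collocation-Frequency: coordination-last if the modifier collocates
    with the first conjunct markedly more often (ratio is an assumption)."""
    return mod_with_first >= ratio * max(mod_with_second, 1)

# "old boots and shoes" (illustrative counts): boots/shoes rank highly in
# each other's coordination lists, and "old" is not markedly more frequent
# with "boots" than with "shoes", so coordination-first is predicted.
first = coordination_matches(best_rank=5)                            # True
last = collocation_frequency(mod_with_first=12, mod_with_second=10)  # False
print("coordination-first" if first and not last else "coordination-last")
```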
In order to test this hypothesis, we require a set of sentences and phrases
containing coordination ambiguities, and a judgement of the preferred read-
ing of the coordinations. The success of the heuristics is measured by how
accurately they are able to replicate human judgements. We obtained the
sentences and phrases from a corpus of requirements documents, manually
identifying those that contain potentially ambiguous coordinating conjunc-
tions. Table 1 lists the sentences by part of speech of the head word of the
conjuncts; Table 2 lists them by part of speech of the external modifier.
3 Related research
each coordination in our dataset using the Sketch Engine, which provides
lists of words that are conjoined with and or or. Each head word is looked
up in turn. The ranking of the match of the second head word with the
first head word may not be the same as the ranking of the match of the
first head word with the second head word. This is due to differences in the
overall frequencies of the two words. We use the higher of the two rankings.
We find that considering only the top 25 rankings is a suitable cut-off. An
ambiguity threshold of 60% is found to be the optimum for all ten folds in
the cross-validation exercise. For the example from our dataset:
Security and Privacy Requirements,
the higher of the two rankings of Security and Privacy is 9. This is in the top
25 rankings so the heuristic yields a positive result. The survey judgements
were: 12 coordination-first, 1 coordination-last and 4 ambiguous, giving a certainty of 12/17 ≈ 70.6%. As this is over the ambiguity threshold of 60%,
the heuristic always yields a true positive result on this sentence.
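The certainty measure and the threshold test are simple to state precisely. A small sketch, using the judgement counts from the example above:

```python
def certainty(first, last, ambiguous):
    """Share of human judgements agreeing with the majority reading;
    'ambiguous' votes count against certainty via the denominator."""
    return max(first, last) / (first + last + ambiguous)

# "Security and Privacy Requirements": 12 coordination-first,
# 1 coordination-last, 4 ambiguous.
c = certainty(12, 1, 4)        # ~0.706
threshold = 0.60               # optimum found in cross-validation
print(f"certainty = {c:.1%}; disambiguated: {c > threshold}")
```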
Averaging over all ten folds, this heuristic achieves 43.6% precision, 64.3% recall and 44.0% f-measure. The baselines are low, however, given the relatively high ambiguity threshold, so the heuristic scores 20.0 precision and 19.4 f-measure percentage points above the baselines.
We combine the two most successful heuristics, shown in the last line of Ta-
ble 3, by saying a coordination-first reading is predicted if the coordination-
matches heuristic gives a positive result and the collocation-frequency heuris-
tic gives a negative one. The left hand graph of Figure 1 shows the precision,
recall and f-measure for this fourth heuristic, at different ambiguity thresh-
olds. As can be seen, high precision and f-measure can be achieved with low
ambiguity thresholds, but at these thresholds even highly ambiguous coor-
dinations are judged to be either coordination-first or -last. The right hand
graph of Figure 1 shows performance as percentage points above the base-
lines. Here the fourth heuristic performs best, and is more appropriately
used, when the ambiguity threshold is set at 60%.
Instead of using the optimal ambiguity threshold, users of our technique
can choose whatever threshold they consider appropriate, considering how
critical they believe ambiguity to be in their work. Figure 2 shows the
proportions of ambiguous and non-ambiguous interpretations at different
ambiguity thresholds. None of the coordinations are judged to be ambiguous
with an ambiguity threshold of zero — which is a dangerous situation —
whereas at an ambiguity threshold of 90% almost everything is considered
ambiguous.
REFERENCES
Agarwal, Rajeev & Lois Boggess. 1992. "A Simple but Useful Approach to Conjunct Identification". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 15-21. Newark, Delaware.
Berry, Daniel, Erik Kamsties & Michael Krieger. 2003. From Contract Drafting to Software Specification: Linguistic Sources of Ambiguity. A Handbook. http://se.uwaterloo.ca/~dberry/handbook/ambiguityHandbook.pdf
Gause, Donald C. & Gerald M. Weinberg. 1989. Exploring Requirements: Quality
Before Design. New York: Dorset House.
Goldberg, Miriam. 1999. "An Unsupervised Model for Statistically Determining Coordinate Phrase Attachment". Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 610-614. College Park, Maryland.
Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery.
Boston, Mass.: Kluwer Academic.
Kilgarriff, Adam. 2003. "Thesauruses for Natural Language Processing". Proceedings of Natural Language Processing and Knowledge Engineering (NLP-KE) ed. by Chengqing Zong, 5-13. Beijing, China.
Kilgarriff, Adam, Pavel Rychlý, Pavel Smrž & David Tugwell. 2004. "The Sketch Engine". 11th European Association for Lexicography International Congress (EURALEX 2004), 105-116. Lorient, France.
Lin, Dekang. 1998. “Automatic Retrieval and Clustering of Similar Words”. Pro-
ceedings of the 17th International Conference on Computational Linguistics,
768-774. Montreal, Canada.
McLauchlan, Mark. 2004. "Thesauruses for Prepositional Phrase Attachment". Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL) ed. by Hwee Tou Ng & Ellen Riloff, 73-80. Boston, Mass.
Okumura, Akitoshi & Kazunori Muraki. 1994. “Symmetric Pattern Matching
Analysis for English Coordinate Structures”. Proceedings of the 4th Confer-
ence on Applied Natural Language Processing, 41-46. Stuttgart, Germany.
Ratnaparkhi, Adwait. 1998. “Unsupervised Statistical Models for Prepositional
Phrase Attachment”. Proceedings of the 17th International Conference on
Computational Linguistics, 1079-1085. Montreal, Canada.
Resnik, Philip. 1999. “Semantic Similarity in a Taxonomy: An Information-
Based Measure and its Application to Problems of Ambiguity in Natural
Language”. Journal of Artificial Intelligence Research 11:95-130.
van Rijsbergen, C. J. 1979. Information Retrieval. London, U.K.: Butterworths.
Weiss, Sholom M. & Casimir A. Kulikowski. 1991. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets, Machine
Learning, and Expert Systems. San Francisco, Calif.: Morgan Kaufmann.
List and Addresses of Contributors
INDEX OF SUBJECTS AND TERMS

G
generalized latent semantic analysis (GLSA) 45

L
language density 247, 253, 256, 257
E. F. K. Koerner, editor
Zentrum für Allgemeine Sprachwissenschaft, Typologie und Universalienforschung, Berlin
ek.koerner@rz.hu-berlin.de
Current Issues in Linguistic Theory (CILT) is a theory-oriented series which welcomes contributions from scholars who have significant proposals to make towards the advancement of our understanding of language, its structure, functioning and development. CILT has been established in order to provide a forum for the presentation and discussion of linguistic opinions of scholars who do not necessarily accept the prevailing mode of thought in linguistic science. It offers an outlet for meaningful contributions to the current linguistic debate, and furnishes the diversity of opinion which a healthy discipline must have.
A complete list of titles in this series can be found on the publishers’ website, www.benjamins.com
293 Detges, Ulrich and Richard Waltereit (eds.): The Paradox of Grammatical Change. Perspectives from Romance. v, 254 pp. Expected February 2008
292 Nicolov, Nicolas, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.): Recent Advances in Natural Language Processing IV. Selected papers from RANLP 2005. 2007. xii, 307 pp.
291 Baauw, Sergio, Frank Drijkoningen and Manuela Pinto (eds.): Romance Languages and Linguistic Theory 2005. Selected papers from ‘Going Romance’, Utrecht, 8–10 December 2005. 2007. viii, 338 pp.
290 Mughazy, Mustafa A. (ed.): Perspectives on Arabic Linguistics XX. Papers from the twentieth annual symposium on Arabic linguistics, Kalamazoo, Michigan, March 2006. xii, 247 pp. Expected December 2007
289 Benmamoun, Elabbas (ed.): Perspectives on Arabic Linguistics XIX. Papers from the nineteenth annual symposium on Arabic Linguistics, Urbana, Illinois, April 2005. xiv, 274 pp. + index. Expected December 2007
288 Toivonen, Ida and Diane Nelson (eds.): Saami Linguistics. 2007. viii, 321 pp.
287 Camacho, José, Nydia Flores-Ferrán, Liliana Sánchez, Viviane Déprez and María José Cabrera (eds.): Romance Linguistics 2006. Selected papers from the 36th Linguistic Symposium on Romance Languages (LSRL), New Brunswick, March-April 2006. 2007. viii, 340 pp.
286 Weijer, Jeroen van de and Erik Jan van der Torre (eds.): Voicing in Dutch. (De)voicing – phonology, phonetics, and psycholinguistics. 2007. x, 186 pp.
285 Sackmann, Robin (ed.): Explorations in Integrational Linguistics. Four essays on German, French, and Guaraní. ix, 217 pp. Expected January 2008
284 Salmons, Joseph C. and Shannon Dubenion-Smith (eds.): Historical Linguistics 2005. Selected papers from the 17th International Conference on Historical Linguistics, Madison, Wisconsin, 31 July - 5 August 2005. 2007. viii, 413 pp.
283 Lenker, Ursula and Anneli Meurman-Solin (eds.): Connectives in the History of English. 2007. viii, 318 pp.
282 Prieto, Pilar, Joan Mascaró and Maria-Josep Solé (eds.): Segmental and prosodic issues in Romance phonology. 2007. xvi, 262 pp.
281 Vermeerbergen, Myriam, Lorraine Leeson and Onno Crasborn (eds.): Simultaneity in Signed Languages. Form and function. 2007. viii, 360 pp. (incl. CD-Rom).
280 Hewson, John and Vit Bubenik: From Case to Adposition. The development of configurational syntax in Indo-European languages. 2006. xxx, 420 pp.
279 Nedergaard Thomsen, Ole (ed.): Competing Models of Linguistic Change. Evolution and beyond. 2006. vi, 344 pp.
278 Doetjes, Jenny and Paz González (eds.): Romance Languages and Linguistic Theory 2004. Selected papers from ‘Going Romance’, Leiden, 9–11 December 2004. 2006. viii, 320 pp.
277 Helasvuo, Marja-Liisa and Lyle Campbell (eds.): Grammar from the Human Perspective. Case, space and person in Finnish. 2006. x, 280 pp.
276 Montreuil, Jean-Pierre Y. (ed.): New Perspectives on Romance Linguistics. Vol. II: Phonetics, Phonology and Dialectology. Selected papers from the 35th Linguistic Symposium on Romance Languages (LSRL), Austin, Texas, February 2005. 2006. x, 213 pp.
275 Nishida, Chiyo and Jean-Pierre Y. Montreuil (eds.): New Perspectives on Romance Linguistics. Vol. I: Morphology, Syntax, Semantics, and Pragmatics. Selected papers from the 35th Linguistic Symposium on Romance Languages (LSRL), Austin, Texas, February 2005. 2006. xiv, 288 pp.
274 Gess, Randall S. and Deborah Arteaga (eds.): Historical Romance Linguistics. Retrospective and perspectives. 2006. viii, 393 pp.