The Sensem Project Syntactico Semantic A PDF

Recent AdvAnces in nAtuRAl lAnguAge PRocessing
iv
AMSTERDAM STUDIES IN THE THEORY AND
HISTORY OF LINGUISTIC SCIENCE
general editor
e.F.K. KoeRneR
(Zentrum für Allgemeine sprachwissenschat, typologie
und universalienforschung, Berlin)
series iv – cuRRent issues in linguistic tHeoRY
Advisory Editorial Board
lyle campbell (salt lake city); sheila embleton (toronto)

Brian d. Joseph (columbus, ohio); John e. Joseph (edinburgh)
Manfred Krika (Berlin); e. Wyn Roberts (vancouver, B.c.)
Joseph c. salmons (Madison, Wis.); Hans-Jürgen sasse (Köln)
volume 292
nicolas nicolov, Kalina Bontcheva, galia Angelova

and Ruslan Mitkov (eds.)
Recent Advances in Natural Language Processing IV.

Selected papers from RANLP 2005
Recent AdvAnces in
nAtuRAl lAnguAge
PRocessing iv
selected PAPeRs FRoM RAnlP 2005
edited by
nicolAs nicolov
Umbria, Inc.
KAlinA BontcHevA
University of Sheield
gAliA AngelovA
Bulgarian Academy of Sciences
RuslAn MitKov
University of Wolverhampton
JoHn BenJAMins PuBlisHing coMPAnY

AMsteRdAM/PHilAdelPHiA
4- he paper used in this publication meets the minimum requirements of American national
standard for information sciences — Permanence of Paper for Printed library Materials,
Ansi Z39.48-1984.
Library of Congress Cataloging-in-Publication Data

RAnlP 2005 (2005 : Borovets, Bulgaria)
Recent advances in natural language processing iv : selected papers from RAnlP 2005 / edited by
nicolas nicolov ... [et al.].
p. cm. -- (Amsterdam studies in the theory and history of linguistic science. series iv, current
issues in linguistic theory, issn 0304-0763 ; v. 292)
includes bibliographical references and index.
1. computational linguistics--congresses. i. nicolov, nicolas. ii. title.
P98.R36 2005
410.285--dc22 2007043413
isBn 978 90 272 4807 7 (Hb; alk. paper)
© 2007 – John Benjamins B.v.
no part of this book may be reproduced in any form, by print, photoprint, microilm, or any other means,
without written permission from the publisher.
John Benjamins Publishing co. • P.o.Box 36224 • 1020 Me Amsterdam • he netherlands
John Benjamins north America • P.o.Box 27519 • Philadelphia PA 19118-0519 • usA
CONTENTS
Editors’ Foreword ix
I. COMPUTATION FOR LINGUISTICS

John Nerbonne
Linguistic challenges for computationalists 1
II. INFORMATION EXTRACTION & INDEXING

Ralph Grishman
NLP: An information extraction perspective 17
Florian Seydoux & Jean-Cédric Chappelier

Semantic indexing using minimum redundancy cut in ontologies 25
Niraj Aswani, Valentin Tablan, Kalina Bontcheva & Hamish

Cunningham
Indexing and querying linguistic metadata and document content 35
Irina Matveeva, Gina-Anne Levow, Ayman Farahat & Christiaan Royer

Term representation with Generalized Latent Semantic Analysis 45
III. PARSING
Ming-Wei Chang, Quang Do & Dan Roth
Multilingual dependency parsing: A pipeline approach 55
Sandra Kübler
How does treebank annotation influence parsing? Or how
not to compare apples and oranges 79
Laura Alonso, Joan Antoni Capilla, Irene Castellón, Ana Fernández-

Montraveta & Gloria Vázquez
The SenSem project: Syntactico-semantic annotation of sentences
in Spanish 89
vi CONTENTS
IV. ANAPHORA & REFERRING EXPRESSIONS

Robert Dale
Generating referring expressions: Past, present and future 99
Erhard W. Hinrichs, Katja Filippova & Holger Wunsch

A data-driven approach to pronominal anaphora resolution for
German 115
V. CLASSIFICATION
Nicolas Nicolov & Franco Salvetti
Efficient spam analysis for weblogs through URL segmentation 125
Filip Ginter, Sampo Pyysalo & Tapio Salakoski

Document classification using semantic networks
with an adaptive similarity measure 137
Rada Mihalcea & Samer Hassan

Text summarization for improved text classification 147
Caroline Sporleder & Alex Lascarides

Exploiting linguistic cues to classify rhetorical relations 157
VI. TEXTUAL ENTAILMENT &

QUESTION ANSWERING
Milen Kouylekov & Bernardo Magnini
Tree edit distance for textual entailment 167
Jörg Tiedemann
A genetic algorithm for optimising information retrieval
with linguistic features in question answering 177
Vasile Rus & Arthur Graesser

Lexico-syntactic subsumption for textual entailment 187
Courtney Corley, Andras Csomai & Rada Mihalcea

A knowledge-based approach to text-to-text similarity 197
CONTENTS vii
VII. ONTOLOGIES
Keiji Shinzato & Kentaro Torisawa
A simple WWW-based method for semantic word class acquisition 207
Eduard Barbu & Verginica Barbu Mititelu

Automatic building of Wordnets 217
VIII. MACHINE TRANSLATION

Stelios Piperidis, Panagiotis Dimitrakis & Irene Balta
Lexical transfer selection using annotated parallel corpora 227
Victoria Arranz, Elisabet Comelles & David Farwell

Multi-perspective evaluation of the FAME speech-to-speech
translation system for Catalan, English and Spanish 237
Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy,

László Németh & Viktor Trón
Parallel corpora for medium density languages 247
IX. CORPORA
Anne De Roeck
The role of data in NLP: The case for dataset profiling 259
Anne De Roeck, Avik Sarkar & Paul H. Garthwaite

Even very frequent function words do not distribute
homogeneously 267
Lucia Specia, Maria das Graças V. Nunes & Mark Stevenson

Exploiting parallel texts to produce a multilingual sense tagged
corpus for word sense disambiguation 277
Francis Chantree, Alistair Willis, Adam Kilgarriff & Anne de Roeck

Detecting dangerous coordination ambiguities using word
distribution 287
List and Addresses of Contributors 297
Index of Subjects and Terms 303

Editors’ Foreword
This volume brings together revised versions of a selection of papers pre-
sented at the Second International Conference on “Recent Advances in Nat-
ural Language Processing” (RANLP’05) held in Borovets, Bulgaria, 21–23
September 2005. The aim of the conference was to give researchers the
opportunity to present new results in Natural Language Processing (NLP)
based on modern theories and methodologies.
The conference was preceded by three days of tutorials (18–20 September
2005). Invited lecturers were:
Jan Hajic, Zdenka Uresova (Charles University, Prague)
The Prague Dependency Treebank and Valency Annotation
Bernardo Magnini (IRST, Trento)
Open Domain Question Answering: Techniques, Systems and
Evaluation
Rada Mihalcea (University of North Texas)
Graph-based Algorithms for Information Retrieval and Natural
Language Processing
Dragomir Radev (University of Michigan)
Information Retrieval
John Tait (University of Sunderland )
Image Retrieval
Michael Zock (CNRS )
Natural Language Generation: Snaphot of a fast evolving discipline
The conference received 160 submissions from more than 20 countries.
Whilst we were delighted to have so many contributions, restrictions on the
number of papers which could be presented in three days forced us to be
more selective than we would have liked (the acceptance ratio for regular
papers at RANLP’05 was 13.75% — significantly lower than for the previous
conference). From the papers presented at RANLP’05 we have selected the
best for this book, in the hope that they reflect the most significant and
promising trends (and successful results) in NLP.
Keynote speakers who gave invited talks were:
• Ido Dagan (Bar-Ilan University, Israel)
• Robert Dale (Macquarie University, Australia)
• Ralph Grishman (New York University, U.S.A.)
• Makoto Nagao (NICT, Tokyo, Japan)
• John Nerbonne (University of Groningen, The Netherlands)
• Anne de Roeck (Open University, U.K.)
x EDITORS’ FOREWORD
The book covers a wide variety of nlp topics: datasets, annotation, tree-
banks, parallel corpora, information extraction, parsing, word sense disam-
biguation, translation, indexing, ontologies, question answering, document
similarity, spam analysis, document classification, anaphora resolution, re-
ferring expressions generation, textual entailment, latent semantic analysis,
summarization, rhetorical relations, etc. To help the reader find his/her way
we have added an index which contains major NLP terms used throughout
the volume. We have also included a list and addresses of all contributors.
We would like to thank all members of the Program Committee. With-
out them the conference, although well organised, would not have had an
impact on the development of NLP. Together they have ensured that the
best papers were included in the final proceedings and have provided in-
valuable comments for the authors, so that the papers are ‘state of the art’.
The following is a list of those who participated in the selection process and
to whom a public acknowledgement is due:
Eneko Agirre (Univ. of the Basque Country, Donostia)
Elisabeth Andre (University of Augsburg)
Galia Angelova (Bulgarian Academy of Sciences)
Amit Bagga (Ask Jeeves, Piscataway)
Branimir Boguraev (IBM, T. J. Watson Research Center)
Kalina Bontcheva (University of Sheffield)
António Branco (University of Lisbon)
Eugene Charniak (Brown University, Providence)
Dan Cristea (University of Iaşi)
Hamish Cunningham (University of Sheffield)
Walter Daelemans (University of Antwerp)
Ido Dagan (Bar Ilan Univ. & FocusEngine, Tel Aviv)
Robert Dale (Macquarie University)
Thierry Declerck (DFKI GmbH, Saarbrücken)
Gaël Dias (Beira Interior University, Covilhã)
Rob Gaizauskas (University of Sheffield)
Alexander Gelbukh (Nat. Polytechnic Inst., Mexico)
Ralph Grishman (New York University)
Walther von Hahn (University of Hamburg)
Jan Hajic (Charles University, Prague)
Johann Haller (University of Saarland)
Catalina Hallett (Open University, Milton Keynes)
Graeme Hirst (University of Toronto)
Eduard Hovy (ISI, University of Southern California)
Nikiforos Karamanis (University of Wolverhampton)
Martin Kay (Stanford University)
Dorothy Kenny (Dublin City University)
Alma Kharrat (Microsoft Natural Language Group)
EDITORS’ FOREWORD xi
Sandra Kübler (Indiana University, Bloomington)

Manfred Kudlek (University of Hamburg)
Lori Lamel (LIMSI/CNRS, Orsay)
Shalom Lappin (King’s College, London)
Yves Lepage (ATR, Japan)
Anke Luedeling (Humboldt University, Berlin)
Bente Maegaard (CST, University of Copenhagen)
Bernardo Magnini (ITC-IRST, Trento)
Nuno Mamede (INESC, Lisbon)
Inderjeet Mani (Georgetown University)
Carlos Martin-Vide (Univ. Rovira i Virgili, Tarragona)
Tony McEnery (Lancaster University)
Wolfgang Menzel (University of Hamburg)
Rada Mihalcea (University of North Texas, Denton)
Ruslan Mitkov (University of Wolverhampton)
Andres Montoyo (University of Alicante)
Rafael Muñoz-Guillena (University of Alicante)
Masaki Murata (NICT, Japan)
Makoto Nagao (NICT, Japan)
Preslav Nakov (University of California at Berkeley)
Ani Nenkova (Columbia University)
John Nerbonne (University of Groningen)
Nicolas Nicolov (Umbria, Inc., Boulder)
Kemal Oflazer (Sabanci University, Istanbul)
Constantin Orăsan (University of Wolverhampton)
Chris Paice (Lancaster University)
Manuel Palomar (University of Alicante)
Victor Pekar (University of Wolverhampton)
Fabio Pianesi (ITC-IRST, Trento)
Stelios Piperidis (ILSP, Athens)
John Prager (IBM, T. J. Watson Research Center)
Gábor Prószèky (MorphoLogic, Budapest)
Stephen Pulman (Oxford University)
James Pustejovsky (Brandeis University)
José Francisco Quesada Moreno (University of Seville)
Dragomir Radev (University of Michigan)
Fuji Ren (University of Tokushima)
Ellen Riloff (University of Utah)
Anne de Roeck (Open University, Milton Keynes)
Antonio Rubio (University of Granada)
Franco Salvetti (Umbria, Inc., Boulder)
Frédérique Segond (Xerox Research Centre, Grenoble)
Kiril Simov (Bulgarian Academy of Sciences)
Richard Sproat (U. of Illinois at Urbana-Champaign)
Steffen Staab (University of Koblenz-Landau)
Keh-Yih Su (Behavior Design Corporation, Hsinchu)
xii EDITORS’ FOREWORD
John Tait (University of Sunderland)

Kristina Toutanova (Stanford University)
Isabel Trancoso (INEC, Lisbon)
Jun’ichi Tsujii (University of Tokyo)
Hans Uszkoreit (University of Saarland, Saarbrücken)
Karin Verspoor (Los Alamos National Laboratory)
Piek Vossen (Irion Technologies, Delft)
Yorick Wilks (Sheffield University)
Michael Zock (LIF/CNRS, Marseille)
The conference was made possible through the generous financial support
of the European Commission through project grant MLCF-CT-2004-013233
for a Marie Curie large Conference. Special thanks go to Galia Angelova
(Bulgarian Academy of Sciences) and Nikolai Nikolov (Incoma, Ltd.) and
for the efficient local organisation.
We believe that this book will be of interest to researchers, lecturers
and graduate students interested in Natural Language Processing and, more
specifically, to those who work in Computational Linguistics, Corpus Lin-
guistics, Machine Translation.
We would like to acknowledge the unstinting help received from our series
editor, E.F.K. Koerner, and from Ms Anke de Looper of John Benjamins
in Amsterdam. We have built upon our experience from the work on the
previous RANLP volumes. Both, E.F.K. Koerner and Anke de Looper have
continued to be ever so professional.
Nicolas Nicolov produced the typesetting code for the book, utilising
the TEX system with the LATEX 2ε package. Umbria Inc. provided compu-
tational support.
August 2007 Nicolas Nicolov

Kalina Bontcheva
Galia Angelova
Ruslan Mitkov
Linguistic Challenges for Computationalists
John Nerbonne
University of Groningen
Abstract
Even now techniques are in common use in computational linguistics
which could lead to important advances in pure linguistics, especially
language acquisition and the study of language variation, if they were
applied with intelligence and persistence. Reliable techniques for as-
saying similarities and differences among linguistic varieties are useful
not only in dialectology and sociolinguistics, but could also be valu-
able in studies of first and second language learning and in the study
of language contact. These techniques would be even more valuable
if they indicated relative degrees of similarity, but also the source of
deviation (contamination). Given the current tendency in linguistics
to wish to confront the data of language use more directly, techniques
are needed which can handle large amounts of noisy data and extract
reliable measures from them. The current focus in Computational
Linguistics on useful applications is a very good thing, but some
further attention to the linguistic use of computational techniques
would be very rewarding.1
1 Introduction
The goal of this paper is to urge computational linguists to explore issues in

other branches of linguistics more broadly. Computational linguistics (CL)
has developed an impressive array of analytical techniques, especially in the
past decade and a half, techniques which are capable of assaying linguistic
structure of various levels from fairly raw textual data. The goal will be
to note how these techniques might be applied to illuminate other issues of
broad interest in linguistics. My thesis is that CL techniques are capable
now of valuable use in many areas of non-computational linguists. Naturally
we will keep working at improvement in our computational processing, but
the present essay emphasizes applications of CL techniques to areas of lin-
guistics outside the normal CL remit which do not require massive technical
improvements in order to be used.
1
We are grateful to the Netherlands Organization for Scientific Research, NWO, for
support (project “Determinants of Dialect Variation, 360-70-120, P.I. J. Nerbonne). It
was stimulating to discuss the general issue of engineering work feeding back into pure
science with Stuart Schieber, who organized a course with Michael Collins of MIT at
the Linguistics Institute of the Linguistic Society of America at MIT, Summer 2005
contrasting science and engineering in CL.
2 JOHN NERBONNE
The idea my plea is based on—that there are opportunities for computa-
tional contributions to “pure” linguistics—is not absolutely new, of course,
as many computational linguists have been involved in issues of pure linguis-
tics as well, including especially grammatical theory. And we will naturally
attempt to identify such work as we become more concrete (below). We aim
to spark discussion by identifying less discussed areas where computational
forays appear promising, and in fact, we will not dwell on grammatical
theory at all.
It is best to add some caveats. First, the sort of appeal we aim at can only
be successful if it is sketched with some concrete detail. If we attempted
to argue the usefulness of computational techniques to general linguistic
theory very abstractly, virtually everyone would react, “Fine, but how can
we contribute more concretely?” But we can only provide more concrete
detail on a very limited number of subjects. Of course, we are limited by
our knowledge of these subjects as well, but the first caveat is that this
little essay cannot be exhaustive, only suggestive. We should be delighted
to hear promptly of several further areas of application for computational
techniques we omit here.
Second, the exhortation to explore issues in other branches of linguistics
more broadly takes the form of an examination of selected issues in non-
computational linguistics together with suggestions on how computational
techniques might shed added light on them. Since the survey is to be
brief, the suggestions about solutions—or perhaps, merely perspectives—
of necessity will also be brief. In particular, they will be no more than
suggestions, and will make no pretense at demonstrating anything at all.
Third, we might be misconstrued as urging you to ignore useful, money-
making applications in favor of dedicating yourselves to the higher goal of
collaborating in the search for scientific truth. But both the history of CL
and the usual modern attitude of scientists toward applications convinces
me that the application-oriented side of CL is very important and eminently
worthwhile. Perhaps you should indeed turn a deaf ear to the seductions of
filthy mammon and consecrate yourself to a life of (pure) science, but this
is a matter between you and your clergyman (or analyst). You have not
gotten this advice here.
Fourth, and finally, we might be viewed as advocating different sorts
of applications, namely the application of techniques from one linguistic
subfield (CL) to another (dialectology, etc.). In this sense modern genetics
applies techniques from chemistry to biological molecules to determine the
physical basis of inheritance, anthropology applies techniques from nuclear
chemistry (carbon dating) to date human artefacts, and astronomy applies
techniques from optics (glass) and electromagnetism (radio astronomy) to
map the heavens. In all of these case is the primary motivation is scien-
LINGUISTIC CHALLENGES 3
tific curiosity, not utilitarian, and this view is indeed parallel to the step
advocated here.
Martin Kay received the Lifetime Acheivement Award of the Associa-
tion for Computational Linguistics in 2005, and his acceptance address to
the 2005 meeting ended in cri de coeur to computational linguists to keep
studies of language central: “Language [...] remains a valid and respectable
object of study, and I earnestly hope that [computational linguists] will con-
tinue to pursue it.” (Kay 2005:438). Perhaps the present piece can serve
to suggest less likely areas where CL might contribute to progress in the
scientific understanding of language.
2 Computational linguistics
Computational Linguistics (CL) is often characterized as having a theo-

retical and an application-oriented, or engineering side (Joshi 1999, Kay
2002). The theoretical side of CL is concerned with processes involving
language and their abstract computational characterization, including pro-
cesses such as analyzing (parsing), and producing (generating) language,
but also storing, compressing, indexing, searching, sorting, learning and ac-
cessing language. The computational characterization of these processes
involves investigating algorithms for their accuracy and time and space re-
quirements, finding appropriate data structures, and naturally testing these
ideas, where possible, against concrete implementations.
The application-oriented, or engineering side of the field concerns itself
with creating useful computational systems which involve language manip-
ulation in some way, e.g., lexicography tools; speech understanding (in col-
laboration with speech recognition); machine translation, including transla-
tion aids such as translation memories, multilingual alignment, and special-
ized lexicon construction; speech synthesis, especially intonation; term ex-
traction, information retrieval, document summarization, data (text) min-
ing, and question answering; telephone information systems and natural
language interfaces; automatic dictionary and thesaurus access, grammar
checking, including spell-checking; document management, authoring (es-
pecially in multi-author systems), and conformance to specifications in so-
called “controlled language systems”; foreign language aids (such as access
to bilingual dictionaries), foreign language tutoring systems, and communi-
cation aids (for the handicapped). See Cole et al. (1996) for further discus-
sion of these, and other areas of application for language technology.
We have been overly compulsive about listing the engineering activities
not only to remind the reader how extensive these are, but also to empha-
size that the breadth of these activities would be unthinkable if it were not
for a rich “infrastructure” of language technology tools which the field is
4 JOHN NERBONNE
constantly creating. For the most part the techniques we urge you to apply
more broadly have been developed in order to build better and more var-
ied applications, as this has been the great motor in the recent dynamics
of computational linguistics. But some of the techniques have also been
useful in theoretical computational linguistics, and the distinction will play
no role here. In fact, perhaps the simplest view is to acknowledge that ap-
plications and theory make use of common technology, a sort of technical
infrastructure, and to emphasize the opportunities this provides.
3 Dialectology
We shall examine dialectology first because it is an area we have directly

worked in, and for which we therefore need to rely less on speculation
about the potential benefits of a computational approach. Given the greater
amount of direct experience with this work, we may use it to distill some of
the characteristics we need to seek in other areas in which computational
techniques might be promising.
Dialectology studies the patterns of variation in a language and espe-
cially its geographic conditioning (Chambers 1998). In London people say
[w6t] for ‘water’, with a voiceless [t] and no trace of final [r], in New York
most people say [wAR], with a “tapped” [t], and in Boston [wAR]. These dif-
"
ferences are systematic, but not exceptionless, and they appear to involve
potentially every level of linguistic structure, pronunciation, morphology,
lexicon, syntax, and discourse. Because differences appear to involve excep-
tions, it is advantageous to process a great deal of material and to apply
statistical techniques to the analysis. Fortunately, dialectologists have been
assiduous in collecting and archiving a great deal of data, especially involv-
ing pronunciation and lexical differences.
Once we have agreed that we need to subject a great deal of data to
systematic analysis, we have a fortiori accepted the need for automating
the analysis, and since it is linguistic material, it would be strange if this
did not lead us to computational linguistics. In fact edit distance, well-
known to computational linguists by its wide variety of applications, may be
applied fairly directly to the phonetic transcripts of dialect pronunciations
(Nerbonne, Heeringa & Kleiweg 1999). The application of edit distance
to pronunciation transcripts yields, for each pair of words, at each pair of
field work sites, a numerical characterization of the difference. Because
pronunciation differences are characterized numerically, we thereby initiate
a numerical analysis of data that dialectologists had normally regarded as
categorical—with all the advantages which normally accrue to numerical
data analysis.
Nerbonne (2003) discusses at greater length the computational issues in an-

alyzing, presenting and evaluating dialectological analyses, including those
which go beyond pronunciation. These issues include the use of lemmatiz-
ers or stemmers to clean up word-form data for lexical analysis, raising the
edit distance from strings to sets of strings in order treat data collections
with alternative forms, and the proper treatment of frequency in detec-
tion of linguistic proximity. Opportunities for the application of standard
CL techniques in computational linguistics abound. Heeringa (2004) sum-
marizes current thinking on measuring dialectal pronunciation differences,
including the thorny issue of evaluating the quality of results. Figure 1 illus-
trates the results of applying these techniques to Bulgarian data (Osenova,
Heeringa & Nerbonne, forthcoming).
Fig. 1: In this line map the average Levenshtein distances between 490
Bulgarian dialects are shown for 36 words. Darker lines join varieties with
more similar pronunciations; lighter lines indicate more dissimilar ones
It is important to report here, as well, that specialists in dialectology—and

not only computational linguists—are enthusiastic about the deployment
of computational tools. A common remark by dialectologists is that that
the new techniques allow a more comprehensive inclusion of all available
data, effectively answering earlier complaints that analyses of dialect areas
6 JOHN NERBONNE
and/or dialect continua relied too extensively on the analysts’ choice of

material. William Kretzschmar leads the American Linguistic Atlas Projects
(LAP), and has collaborated in various analyses and workshops (Nerbonne
& Kretzschmar 2003). He has inter alia included a pointer to CL work
on the home page of the LAP site he maintains at and he is presently
collaborating on a project to publish a second volume of papers focused on
computational techniques (Nerbonne & Kretzschmar 2006).
Finally, let us note that the computational step may introduce such gen-
uinely novel opportunities that we find ourselves in a position to ask ques-
tions which simply lay beyond earlier methodology. Given our numerical
perspective on dialect difference, we may e.g., ask, via a regression analysis,
how much of the aggregate varietal difference is explained by geography, or
whether travel time is a superior characterization of the geography relevant
to linguistic variation (Gooskens 2004), or whether larger settlements tend
to share linguistic variants more than smaller ones—something one might
expect if variation diffused via social contact (Heeringa & Nerbonne 2002).
As a further example, we may examine the degree to which the cultural
patterns exemplified in dialectal variation correlate with the distribution of
genetic variation (Manni, Heeringa & Nerbonne 2006). The introduction of
CL techniques enables us to ask more abstract questions in a way we can
still link to concrete linguistic analysis.
This work also suggests many related paths of exploration. For exam-
ple, even if a distance measure allows the mapping of the dialectological
landscape well, it seems ill-equipped to assay one extreme result of dialect
differentiation, i.e., the failure of comprehensibility. The reason for this
failure is the fact the comprehensibility is not symmetrical, while linguistic
distance by definition is: it may reliably be the case that speakers of one
variety understand the speaker of another better than vice versa. For exam-
ple, Dutch speakers find it easier to understand Afrikaans than vice versa
(Gooskens & van Bezooijen 2006). If this is due to language differences, it
calls for the development of an asymmetrical measure of the relative diffi-
culty of mapping from one language to another, or something similar.2
The computational work has been successful in dialectology because
there were large reservoirs of linguistics data to which analyses could be
applied, i.e., dialect atlases, because distinguishing properties resisted sim-
ple categorical characterization, and naturally because there were promising
computational techniques for getting at the crucial phenomena.
As we turn to other areas, we shall ask ourselves whether we are likely to
satisfy these desiderata. When even one is missing, the result can be disap-
2
Nathan Vaillette, Univ. of Massachusetts has explored this problem using relative
entropy in unpublished work.
pointing. For example, sociolinguistics has largely succeeded dialectology in

attracting scholarly interest. The linguistic issues are not wildly different—
different social groups use different language varieties, and these may differ
in all the ways in which geographical varieties do (pronunciation, lexicon,
etc.). It would be straightforward and interesting to apply the techniques
sketch above to linguistic varieties associated with different social groups.
But there is no tradition in sociolinguistics like that of the dialect atlas, i.e.,
collecting speech samples from a large set of sociolects. So the opportunity
does not present itself.
4 Diachronic linguistics
Diachronic linguistics investigates how languages change, and, most spec-

tacularly, how a single language many evolve into many related ones. It
regularly attracts a good deal of scholarly attention (Gray & Atkinson,
2003; Eska & Ringe, 2004; Dunn, Terrill, Reesink & Levinson, 2005) as
computational biologists have applied their techniques for tracking genetic
evolution to linguistic data. Although the scholarship is at times forbid-
ding in its expectations about philological expertise, the problem appears
to allow neat enough formulations so that one may be optimistic about
computational investigations.
Essentially, we are given a set of cognate words in several putatively
related languages, and we construct hypotheses about the most recent com-
mon ancestor—the protolanguage—as well as a simple set of sound changes
leading from the protolanguage to the individual descendants. For exam-
ple, we note that the word for father has an initial /f/ in Germanic (English
father ), /p/ in Romance, Greek and Indic (French père, Greek patera, and
Hindi pitā), and no initial consonant in some Celtic languages (Irish athair ).
This suggests that we postulate a /p/ in the protolanguage and changes
from /p/ to /f/ for Germanic and /p/ to ∅ for the relevant Celtic varieties.
But we gain confidence in these postulates only when the same rules are
shown to operate on other forms, i.e., when the correspondences recur (as
the p/f/∅ definitely does). It is surprising that CL should turn over to the
biologists such a well-structured problem in linguistic computation.3
Our community has contributed to this area, especially Brett Kessler,
who investigated how to test when sound correspondences exceed chance
levels (Kessler 2001), and Grzegorz Kondrak, who modified the edit dis-
tance algorithm mentioned above, in order to identify cognates, align them,
and on that basis postulate recurrent sound combinations (Kondrak 2002).
3
See also Benedetto et al. (2002) for attempt to reconstruct linguistic history using
relative entropy, but especially Goodman (2002) for criticism of Benedetto.
8 JOHN NERBONNE
But these studies deserve follow-ups, tests on new data, and extensions to
other problems. Among many remaining problems we note that it would
be valuable to detect borrowed words, which should not figure in cognate
lists, but which suggest interesting influence; to operationalize the notion
of semantic relatedness relevant to cognate recognition; to quantify how
regular sound change is; or to investigate the level of morphology, which is
regarded as especially probative in historical reconstruction. But we em-
phasize that there are likely to be interesting opportunities for contributions
with respect to detail as well, perhaps in the construction of instruments
to examine data more insightfully, to measure hypothesized aspects, or to
quantify the empirical base on which historical hypotheses are made.
5 Language acquisition
Studies of children’s acquisition of language are interesting to all sorts of in-

quiries because language is a defining characteristic of us as humans. They
occupy an important position in linguistics due to the linguistic argument
that innate, specifically linguistic mechanisms must be postulated to ac-
count for acquisition (Pinker 1994:ch. 4). The innate organizing principles
of language are postulated to be part of human genetic constitution, and
therefore the source of universal properties which all languages share. At
the same time psychologists have shown that some acquisition is mediated
by sensitivity to statistical trends in data (Saffran, Aslin & Newport 1999).
And children naturally need minimally to learn which of all the languages
they are genetically predisposed toward is the one in use locally. Finally, CL
has explored machine learning techniques extensively over the past decade
(Manning & Schütze 1999). Surely CL is positioned to contribute crucially
to this scientific discussion with interesting implemented models of specific
phenomena, and in particular with models aimed at broader coverage or so
one would think.
On the other hand, machine learning techniques do not translate to com-
putational models of acquisition very directly, at least not as normally used
by CL, namely to optimize performance on technical tasks that may have
no interesting parallel in a child’s acquisition of language, e.g., the task of
recognizing named entities, persons, places and organizations. In addition,
even idealized simulations of acquisition might wish to impose restrictions
on the sort of mechanisms to be used, e.g., that they may apply incremen-
tally, and on the input data, e.g., that it reflect children’s experience.
Fortunately, these differences in tasks, mechanisms and input data may
be overcome, and CL has not been inactive in examining language acqui-
sition. Brent (1997) is an early collection of articles on computational ap-
proaches to language acquisition, including especially Brent’s own work ap-
plying minimal description length to the problem of segmenting the speech

stream into words, and using only phonotactic and distributional informa-
tion (Brent 1999, Brent 1999a). There have been a number of other studies
focusing on phonotactics (Nerbonne & Konstantopoulos 2004, Nerbonne &
Stoianov 2004), the acquisition of morphophonemic rules (Gildea & Jurafsky
1996, Albright & Hayes 2003), morphology (Goldsmith 2001), and syntax
(Niyogi & Berwick 1996). Albright and Hayes’s work is especially worth
recommending to a CL audience as it is clear and explicit about linguistic
concerns in modeling acquisition computationally.
Most relevant to the sort of CL contribution I have in mind is the se-
ries of workshops organized by William Gregory Sakas of CUNY, Psycho-
Computational Models of Human Language Acquisition. The first took place
in 2004 in Geneva in coordination with COLING and the second in 2005 in
Ann Arbor in coordination with the ACL’s special interest group on natural
language learning It is clear from the proceedings of these workshops that
new syntheses of linguistic, psychological and computational perspectives
enjoy a good deal of interest (Yang 2004).
It is also clear that there is an enormous interest in further questions
about segmentation, alignment, constituency, local and long-distance rela-
tions, modification, and ill-formed input in addition to the usual questions
about the generality of solutions wit respect to various language types.
Finally, it is worth emphasizing here more than elsewhere that contribu-
tions need not take the form of simulations of human learning (even if this
is the case for most of the studies cited). There is great potential interest
in characterizing easy vs. difficult material, in what happens when second
and third languages are learned (contamination), and in how languages are
lost. In addition to simulation, we should also be thinking of how to opera-
tionalize measures of language proficiency that could use speech as directly
as possible. At the moment, extremely crude measures such as mean length
of utterance (MLU) and type/token ratio enjoy great popularity, but one
suspects that this is due more to their ease of computation than to their
reflection of linguistic sophistication. Ideally we should like to automate
our detection of the mastery of various linguistic structures, rules and ex-
ceptions. That is clearly a long way off in its full generality, but perhaps
realizable in some instances with standard techniques.
6 Language contact
Language contact study is an active branch of linguistics focused on rec-

ognizing and analyzing the ways in which languages borrow from one an-
other (Thomason & Kaufmann 1988, van Coetsem 1988). It is growing in
popularity, perhaps due to increases in mobility and the realization that
10 JOHN NERBONNE
multilingual speakers often, albeit unconsciously, impose the structures of

one language on another. Mufwene (2001) urges us to view extreme con-
tact effects such as koinéization, creolization and pidginization as various
degrees to which language mixtures may develop (instead of as the results of
very different processes, as earlier scholarship had held). Language contact
study is, moreover, linked to second-language acquisition in an obvious way:
if second-language speakers habitually impose elements of their native lan-
guage onto another, then those element are good candidates for long-term
borrowing whenever these languages are in contact.
It might seem as if we could use the same tools for the study of contact
effects that we developed for dialectology. After all, if one variety of a lan-
guage adopts elements of another, it should become more similar. Indeed
given the sort of data in dialect atlases, one can perform these analyses
and determine the convergence of some varieties toward a putative source
of contamination, at least the convergence with respect to other varieties
(Heeringa, Nerbonne, Niebaum & Nieuweboer 2000, Gooskens & Heeringa
2004). Furthermore, one could examine the role of geography in this con-
vergence.
But language contact data collections are not usually designed as dialect
atlases, with a number of distinct collection sites, and a controlled set of
linguistic variables to be assayed. Recently, we obtained data of a rather
different sort, and set ourselves the task of developing computational tools
for its analysis.4 Watson collected recordings of Finnish emigrants to Aus-
tralia in the mid 1990’s (Watson 1996), and this group could be divided into
adult emigrants and child emigrants, using puberty (16 years old) as the di-
viding line. The challenge was the development of a technique to determine
whether there were significant changes in the syntax of the two groups.
Following an obvious tack from CL, we settled on using n-grams of part-
of-speech tags (POS tags) assigned by the TnT tagger (Brants 2000) as a
probe to determine syntactic similarity. In order not to be swamped by fine
distinctions we used trigrams of a small tag set (50 tags). Up to this point we
were rediscovering an idea others had introduced (Aarts & Granger 1998).
To compare one corpus with another, we measured the difference in the
two vectors of trigram frequencies using cosine (inter alia). To determine
whether the difference is statistically significant, we applied permutation-
based statistics, roughly resampling the union of the two data sets (using
some complicated normalizations) and checking the degree of difference. A
difference is significant at the level p < p′ iff it is among the most extreme
p′ fraction of the resampled data.
4
What follows is an informal synopsis of work in progress being conducted with Wybo
Wiersma of Groningen and Lisa Lena Opas-Hänninen, Timo Lauttamus and Pekka
Hirvonen of Oulu University.
Because the technique is still under development, we cannot yet report much
more. The differences are indeed statistically significant, which, in itself, is
not surprising. The corpora are quite raw, however, so that the differences
we are finding to-date are dominated by hesitation noises and errors in tag-
ging. The promise is in the technique. If we have succeeded in developing
an automated measure of syntactic difference, we have opportunities for
application to a host of further questions about syntactic differences, e.g.,
about where these differences are detectable, and where not; about the time
course of contamination effects (do second-language learners keep improv-
ing, or is there a ceiling effect?); and about the role of the source language in
the degree of contamination. Some crucial computational questions would
remain, however, concerning detecting the source of contamination.
7 Other areas
As noted in the introduction, this brief survey has tried to develop a few
ideas in order to convince you that there are promising lines of inquiry
for computationalists who would seek to contribute to a broader range of
linguistic subfields. We suspect that there are many other areas, as well.
We have deliberately omitted grammatical theory from the list of poten-
tial near-term adopters of computational techniques. There are two reasons
for eschewing a sub-focus on grammar here, the first being the fact that the
potential relevance of computational work to grammatical theory has been
recognized for a long time, as grammar has been cited since the earliest days
of CL as a likely beneficiary of closer engagement (Kay 2002). But second,
even as computational grammar studies uncover new means of contributing
to the study of pure grammar (van Noord 2004), it seems to be a minority
of (non-computational) grammarians who recognize the value of computa-
tional work. Many researchers have explored this avenue, but the situation
has stabilized to one in which computational work is pursued vigorously
by small specialized groups (Head-Driven Phrase Structure Grammar and
Lexical-Functional Grammar), and largely ignored by most non-mainstream
grammarians. We deplore this situation as do others (Pollard 1993), but it
unfortunately appears to be quite stable.
In addition to the areas discussed above, it is easily imaginable that
CL techniques could play an interesting role in a number of other linguis-
tic subareas. As databases of linguistic typology become more detailed and
more comprehensive, they should become attractive targets for data-mining
techniques Psycholinguistic studies of processing are promising because they
provide a good deal of empirical data. We shall be content with a single ex-
ample. Moscoso del Prado Martı́n (2003) reviews a large number of studies
relating the difficulty of processing complex word forms, i.e., those involv-
12 JOHN NERBONNE
ing inflectional and/or derivational structure to the “family size” of a word

form, i.e. how many other word forms are related to it. He is able to show
that a simple characterization of family size and frequency due to informa-
tion theory correlates highly with processing difficulty.
8 Conclusions
We have urged computational linguists to consider how much they might

contribute to curiosity-driven research into language, i.e., linguistic theory,
focusing on examples in dialectology, diachronic linguistics, language ac-
quisition and language contact. We have suggested that there are many
avenues to pursue for those with a broader interest in language, and also
that the tools and training one receives in developing language technology
will be of direct use. We have not suggested that contributions in pure
science are any easier or harder to make, and the experience has been gen-
eral that the dynamics involved in pursuing non-applied goals are every bit
as demanding, and every bit at provocative: a successful effort invariably
suggests new questions and new avenues to explore.
We have been careful to avoid deprecating application-research and, at
the risk of repetition, restate that the development of useful applications is
a most valuable aspect of current CL. We encourage colleagues to think of
both channels of activity rather than to force a choice of one over the other.
If we are right that most of the interesting techniques for exploring is-
sues in non-computational linguistics have arisen through the development
of techniques for engineering activities, then we may have another case
where applied science furthers the progress of pure science (Burke 1985).
In making this remark, we are reneging on the promise in Section 2 not to
concern ourselves with whether a particular technique originated in theo-
retical vs. applied CL, but given the preponderance of applied work in CL,
it would be surprising if it were not true in many instance that techniques
from engineering were being conscripted for work in theory.
The use of a stemmer to extract lexical differences from lists of word
forms in dialectology (Nerbonne & Kleiweg 2003) is an example of the sort
of contribution where a technique developed only for application purposes
could be put to a purely scientific use, that of detecting lexical overlap
across a dialect continuum. The Porter stemmer which was used for this
purpose is not to be confused with a genuine lemmatizer, which is interesting
both linguistically and practically. But it usually reduces word forms to the
same stem when they in fact are elements of the same inflectional paradigm.
It was developed for use in information retrieval (Porter 1980), not for the
purpose of exploring linguistic structure or its processing, but its use in

dialectology has no ambitions toward practical application.
This would appear to be a genuine case of an engineering technique
serving a purpose in curiosity-driven research. To the extent CL is involved
in other pure science (beyond CL proper), this sort of cross-fertilization
must be standard. Only time will tell whether it will remain true of future
computational forays into pure linguistics.
REFERENCES
Aarts, Jan & Sylviane Granger. 1998. “Tag Sequences in Learner Corpora: A
Key to Interlanguage Grammar and Discourse”. Learner English on Com-
puter ed. by Sylviane Granger, 132-141. London: Longman.
Albright, Adam & Bruce Hayes. 2003. “Rules vs. Analogy in English Past
Tenses: A Computational/Experimental Study”. Cognition 90:119-161.
Benedetto, Dario, Emanuele Caglioti & Vittorio Loreto. 2002. “Language Trees
and Zipping”. Physical Review Letters 88:4.048702.
Brants, Thorsten. 2000. “TnT — A Statistical Part of Speech Tagger”. 6th
Applied Natural Language Processing Conference, 224-231. Seattle, Wash-
ington.
Brent, Michael, ed. 1997. Computational Approaches to Language Acquisition.
Cambridge, Mass.: MIT Press.
Brent, Michael. 1999. “An Efficient, Probabilistically Sound Algorithm for Seg-
mentation and Word Discovery”. Machine Learning Journal 34:71-106.
Brent, Michael. 1999. “Speech Segmentation and Word Discovery: A Computa-
tional Perspective”. Trends in Cognitive Science 3:294-301.
Burke, James. 1985. The Day the Universe Changed. Boston, Mass.: Little,
Brown & Co.
Chambers, J.K. & Peter Trudgill. 1998. Dialectology. Cambridge, U.K.: Cam-
bridge University Press.
Cole, Ronald A., Joseph Mariani, Hans Uszkoreit, Annie Zaenen & Victor Zue.
1996. Survey of the State of the Art in Human Language Technology.
National Science Foundation (NSF) and European Commission (EC),
www.cse.ogi.edu/CSLU/HLTsurvey/
Dunn, Michael, Angela Terrill, Geert Reesink, Robert A. Foley & Stephen C.
Levinson. 2005. “Structural Phylogenetics and the Reconstruction of An-
cient Language History”. Science 309:5743.2072-2075.
Eska, Joseph F. & Don Ringe. 2004. “Recent Work in Computational Linguistic
Phylogeny”. Language 80:3.569-582.
Gildea, Daniel & Daniel Jurafsky. 1996. “Learning Bias and Phonological Rule
Induction”. Computational Linguistics 22:4.497-530.
14 JOHN NERBONNE
Goldsmith, John. 2001. “Unsupervised Learning of the Morphology of a Natural

Language”. Computational Linguistics 27:2.153-198.
Goodman, Joshua. 2002. “Extended comment on ‘Language trees and zipping’ ”.
Condensed Matter Archive, February 21, 2002. arXiv:cond-mat/0202383.
Gooskens, Charlotte & Wilbert Heeringa. 2004. “The Position of Frisian in the
Germanic Language Area”. On the Boundaries of Phonology and Phonetics,
ed. by Dicky Gilbers, Maartje Schreuder & Nienke Knevel, 61-88. Groningen:
CLCG.
Gooskens, Charlotte & Renée van Bezooijen. 2006. “Mutual Comprehensibility
of Written Afrikaans and Dutch: Symmetrical or Asymmetrical?” Literary
and Linguistic Computing 21:4:543-557.
Gooskens, Charlotte. 2004. “Norwegian Dialect Distances Geographically Ex-
plained”. Language Variation in Europe. Papers from the Second Interna-
tional Conference on Language Variation in Europe (ICLAVE 2 ), ed. by
B.-L. Gunnarson et al. 195-206. Uppsala, Sweden: Uppsala University.
Gray, Russell D. & Quentin D. Atkinson. 2003. “Language-Tree Divergence
Times Support the Anatolian Theory of Indo-European Origin”. Nature
426:435-439.
Heeringa, Wilbert & John Nerbonne. 2002. “Dialect Areas and Dialect Con-
tinua”. Language Variation and Change 13:375-400.
Heeringa, Wilbert. 2004. Measuring Dialect Pronunciation Differences Us-
ing Levenshtein Distance. Ph.D. dissertation, Rijksuniversiteit Groningen,
Groningen.
Heeringa, Wilbert, John Nerbonne, Hermann Niebaum & Rogier Nieuweboer.
2000. “Measuring Dutch-German Contact in and around Bentheim”. Lan-
guages in Contact, ed. by Dicky Gilbers, John Nerbonne & Jos Schaeken,
145-156. Amsterdam-Atlanta: Rodopi.
Joshi, Aravind K. 1999. “Computational Linguistics”. The MIT Encyclopedia of
the Cognitive Sciences, ed. by Robert A. Wilson & Frank C. Keil, 162-164.
Kay, Martin. 2002. “Introduction”. Handbook of Computational Linguistics, ed.
by R.Mitkov, xvii–xx. Oxford, U.K.: Oxford University Press.
Kay, Martin. 2005. “A Life of Language”. Computational Linguistics 31:4.425-
438.
Kessler, Brett. 2001. The Significance of Word Lists. Stanford, Calif.: CSLI
Press.
Kondrak, Grzegorz. 2002. Algorithms for Language Reconstruction. Ph.D. dis-
sertation, Univ. of Toronto, Canada.
Manning, Chris & Hinrich Schütze. 1999. Foundations of Statistical Natural
Language Processing. Cambridge, Mass.: MIT Press.
Moscoso del Prado Martı́n, Fermin. 2003. Paradigmatic Structures in Morpholog-

ical Processing: Computational and Cross-Linguistic Experiemental Studies.
Ph.D. dissertation, Radboud Univ. Nijmegen, The Netherlands.
Mufwene, Salikoko. 2001. The Ecology of Language Evolution. Cambridge, U.K.:
Cambridge University Press.
Nerbonne, John & Peter Kleiweg. “Lexical variation in LAMSAS”. Computers
and the Humanities (= Special Issue on Computational Methods in Dialec-
tometry) ed. by John Nerbonne & William Kretzschmar, Jr., 37:3.339-357.
Nerbonne, John & Stasinos Konstantopoulos. 2004. “Phonotactics in Induc-
tive Logic Programming”. Intelligent Information Processing and Web Min-
ing (IIPWM’04 ) ed. by Mieczyslaw A. Klopotek, Slawomir T. Wierzchon &
Krzysztof Trojanowski, 493-502, Springer: Berlin.
Nerbonne, John & William Kretzschmar, eds. 2003. Computational Methods in
Dialectometry (= Special Issue of Computers and the Humanities, 37:3).
Nerbonne, John & William Kretzschmar, eds. 2006. Progress in Dialectometry
(= Special Issue of Literary and Linguistic Computing, 21).
Nerbonne, John & Ivilin Stoianov. 2004. “Learning Phonotactics with Simple
Processors”. On the Boundaries of Phonology and Phonetics (CLCG) ed. by
Dicky Gilbers, Maartje Schreuder & Nienke Knevel, 89-121. Groningen.
Nerbonne, John. 2003. “Linguistic Variation and Computation”. 10th Meeting
of the European Chapter of the Association for Computational Linguistics,
vol. 10, 3-10.
Nerbonne, John, Wilbert Heeringa & Peter Kleiweg. 1999. “Edit Distance and
Dialect Proximity”. Time Warps, String Edits and Macromolecules: The
Theory and Practice of Sequence Comparison ed. by David Sankoff & Joseph
Kruskal, v-xv. Stanford, Calif.: CSLI.
Niyogi, Partha & Robert C. Berwick. 1996. “A Language Learning Model for
Finite Parmeter Spaces”. Cognition 61:161-193.
Osenova, Petya, Wilbert Heeringa & John Nerbonne. Forthcoming. “A Quanti-
tative Analysis of Bulgarian Dialect Pronunciation”. To appear in Zeitschrift
für Slavische Philologie.
Pinker, Steven. 1994. The Language Instinct. New York: W. Morrow & Co.
Pollard, Carl. 1993. “On Formal Grammars and Empirical Linguistics”. ES-
COL’93: 10th Eastern States Conference on Linguistics ed. by Andreas
Kathol & M. Bernstein. Columbus, Ohio: Ohio State University.
Porter, Martin F. 1980. “An Algorithm for Suffix Stripping”. Program 14:3.130-
137.
Saffran, Jenny R., E.K. Johnson, Richard N. Aslin & Elissa L. Newport. “Statis-
tical Learning of Tonal Sequences by Human Infants and Adults”. Cognition
70:27-52.
Thomason, Sarah & Terrence Kaufmann. 1988. Language Contact, Creolization,
and Genetic Linguistics. Berkeley, Calif.: University of California Press.
16 JOHN NERBONNE
van Coetsem, Frans. 1988. Loan Phonology and the Two Transfer Types in
Language Contact. Dordrecht: Foris Publications.
van Noord, Gertjan. 2004. “Error Mining for Wide-Coverage Grammar Engi-
neering”. 42nd Meeting of the Association for Computational Linguistics,
446-453. Barcelona, Spain.
Watson, Greg. 1996. “The Finnish-Australian English Corpus”. ICAME Jour-
nal: Computers in English Linguistics 20:41-70.
Yang, Charles. 2004. “Universal Grammar, Statistics or Both?”. Trends in
Cognitive Science 8:10.451-456.
NLP: An Information Extraction Perspective
Ralph Grishman
New York University
Abstract
This chapter examines some current issues in natural language pro-
cessing from the vantage point of information extraction (ie), and so
gives some perspective on what is needed to make ie more success-
ful. By ie we mean the identification of important types of relations
and events in unstructured text. Ie provides a nice reference point
because it can benefit from a wide range of technologies, including
techniques for analyzing sentence structure, for identifying syntac-
tic and semantic paraphrases, and for discovering the information
structure of domains.
1 The challenges of information extraction
Information Extraction (ie) provides an interesting perspective from which

to view many of the methods and challenges of natural language processing
(nlp). On the one hand, some extraction tasks can be successfully addressed
using simple nlp tools, such as regular expressions and basic probabilistic
sequential models. On the other hand, ie in its full generality involves
understanding the structure of the information conveyed by a collection of
documents, and then distilling this information from the documents. This
will require a deep analysis of the language using a rich palette of nlp tools,
including some still to be developed. We will use this survey to present a
small portion of this palette, including tools developed by our group and
others.
Ie is a domain-specific task; the important types of objects and events
for one domain (e.g., people, companies, products, money and securities in
the financial news) can be quite different from those for another domain
(e.g., genes and proteins in genomics). Developing ie for a new domain can
be broken down into three tasks:
1. determining what the important types of facts are for the domain;
2. for each type of fact, determining the various ways in which it is
expressed linguistically;
3. identifying instances of these expressions in text.
While there is a fuzzy boundary between these tasks, this division will
provide a basis for organizing this overview. Steps 1 and 2 are done in
advance, and often by hand, while step 3 is part of every ie system, so we
will start with this last step and then work our way backwards.
18 RALPH GRISHMAN
2 Identifying instances of a linguistic expression
In simplest terms, ie can be realized as a pattern matching process. We

are looking for linguistic expressions which convey a particular relation-
ship or fact involving one or more arguments and modifiers. We find such
expressions by creating patterns with constants and constrained variables
(positions in the pattern which match the arguments). If the constants and
variables match individual words in sequence, we will end up with patterns
which are too specific; for example, extracting reports of deaths by looking
for “Mr.” followed by a capitalized token followed by “died”. To achieve
any degree of generality (and hence reasonable coverage), matching must
occur at a structural level; for example, that a noun phrase referring to a
person appears as the subject of some inflected form of “die”. So the cru-
cial problem here, at the heart of many nlp applications, is the accurate
identification of the structure of sentences and entire discourses.
2.1 Stages of analysis and joint inference

This structure exists on many levels: the structure of names; the grammati-
cal structure of sentences; and coreference structure across a discourse (and
even across multiple discourses). Each of these is important to ie – to fig-
uring out the participants in an event. And each of these has been studied
separately and quite intensively over the past decade. Annotated corpora
have been prepared for each of these levels of structure, and a wide range
of models and machine learning methods have been applied to construct
analyzers (particularly for name and grammatical structure). Except for
coreference analysis, the result of these efforts have in general been quite
satisfactory levels of performance – on the order of 90% accuracy for names
and for grammatical constituents.1
In a typical system, these analyzers are applied sequentially to prepro-
cess a text for extraction. Unfortunately, the analysis errors of the individual
stages not only add up, they compound: an error in an early stage will often
lead to further errors as analysis progresses. The net result is that overall
analysis performance, and hence extraction performance, is still not very
good. For the muc [Message Understanding Conference] evaluations in the
1990’s, recall on the event task rarely broke the 60% ‘ceiling’ (Hirschman
1998), and it’s not clear if we are doing much better today.
One limitation is the reliance on relatively local features in the early
stages of analysis. Most ne (named entity) analyzers, for example, are
based on simple models that look only one or two tokens ahead and behind.
1
When tested on texts similar in genre and time period to those on which the analyzer
was trained.
NLP: AN INFORMATION EXTRACTION PERSPECTIVE 19
This fails to capture such basic tendencies as the increased likelihood of a

name that was mentioned once in a document being mentioned again. To
account for this, some systems employ a name cache or, more elaborately,
features based on the context of other instances of the same string (Chieu &
Ng 2002) — in effect, trying to do simple coreference within the name tagger.
However, preferences which depend on more complex syntactic structures —
for instance, that names appearing as the subjects of selected verbs are likely
to be person names — remain difficult to capture because the structures
are simply not available at this stage of analysis.
A more general approach harnesses the richer representations of the
later stages to aid the performance of earlier ones. Such an approach may
generate multiple hypotheses in one analysis stage and then rescore them
using information from subsequent stages. In general, we rely on the idea
that the discourse is coherent — that in a properly-analyzed discourse,
there will be many connections between entities. For example, we expect
that a correct name tagging will license more coreference relations as well
as more semantic relations (such as ‘X is located in Y’, ‘X works for Y’,
etc.). By evaluating the result of these later stages of analysis for each
hypothesized set of name tags, for example, a system can use these later
stages to improve name tagging. Ji and Grishman (2005) generated N-best
NE hypotheses and rescored them after coreference and semantic relation
identification; they obtained a significant improvement in Chinese NE per-
formance. Roth and Yi (2004) built separate models for name classification
and for semantic relation identification, and then used a linear program-
ming model to capture the interactions between names and relations and to
maximize the total probability (the product of name and relation probabil-
ities).2 They obtained significant improvements in both name classification
and relation detection. We can expect this ‘global optimization’ approach
will be extended in the future to integrate a wider range of analysis levels
and provide further performance improvements, possibly even incorporating
cross-document information.
2.2 Stages of analysis: Alternative evidence

While such approaches should reduce analysis error, we need to consider
how to deal with the error that remains. ‘Deeper’ representations can in
principle do a better job in supporting ie (by identifying the common fea-
tures of variant syntactic forms), but they will generally involve greater
error. This is a dilemma which has faced ie developers for a decade. It has
led many groups to rely on partial parsing which, while less informative,
2
They assumed the extent of the names in the text was given.
20 RALPH GRISHMAN
is more accurate than full parsing.3 However, machine learning methods

which can handle large numbers of features have allowed recent systems
to integrate information from multiple levels of representation in predict-
ing the existence of ie relations and events (Kambhatla 2004). Zhao et al.
(2004), Zhao & Grishman (2005) have shown how using kernel methods
to combine information from n-grams, chunks, and grammatical relations
can improve extraction performance over using a single level of representa-
tion. In some cases where there is an error in the deep analysis, a correct
extraction decision will still be made based on the shallow features.
3 Finding linguistic expressions of an event or relation
The methods just described will give us a better chance of identifying in-
stances of a particular linguistic expression, but we are still faced with the
problem of finding the myriad linguistic expressions of an event — all (or
most of) the paraphrases of a given expression of an event. A direct ap-
proach is to annotate all the examples of an event in a large corpus, and
then collect and distill them either by hand or using some linguistic rep-
resentation and machine learning method. However, good coverage may
require a really large corpus, which can be quite expensive. Could we do
better?
3.1 Syntactic paraphrase

We need first of all to differentiate syntactic and semantic paraphrase. Syn-
tactic paraphrases are applicable over broad (grammatical) classes of words
— relations between active, passive, and relative clauses, for example, as
well as complement alternations and nominalizations. Many of these can
be addressed by using a deeper syntactic representation that captures the
commonality among such different expressions. In particular, a predicate-
argument representation, such as is being encoded for English in PropBank
(Kingsbury & Palmer 2002) and NomBank (Meyers et al. 2004), would
collapse many of these syntactic paraphrases.
3.2 Semantic paraphrase

What remains are the much more varied and numerous semantic para-
phrases. There are dozens of ways of saying that a company hired someone,
or that two people met. Lexical-semantic resources (such as WordNet) pro-
vide some assistance (Stevenson & Greenwood 2005), but they are largely
3
Many groups made this choice prior to the recent improvements in treebank-trained
parsers, but the choice is still not clear-cut.
limited to single-word paraphrases and so cover only a portion of the myr-

iad expressions required for an ie task. To complement these manually-
prepared resources, efforts have been underway for the past few years to
learn paraphrase relations from corpora. The basic idea is to identify pairs
of expressions A R B and A S B which involve the same arguments (A, B)
and most likely convey the same information; then R and S stand a good
chance of being paraphrases. One source of such pairs are two translations
of the same text (Barzilay & McKeown 2001). If we can sentence-align the
texts, the corresponding sentences are likely to carry the same information.
Another source are comparable news articles — articles from the same day
about the same news topic (Shinyama et al. 2002). The opening sentences
of such articles, in particular, are likely to contain phrases which convey
the same information. The likelihood is even greater if we focus on phrases
which are both relevant to the same topic (see the next section). Finally,
frequency can build confidence: if we have several pairs of individuals A1 ,
B1 ; A2 , B2 ; ... which appear in both context R (A1 R B1 ; A2 R B2 ) and
context S (A1 S B1 ; A2 S B2 ), then R and S stand a good chance of being
paraphrases. This general approach has been used to find paraphrases for
individual relations (Brin 1998, Agichtein & Gravano 2000, Lin & Pantel
2001) and to collect the primary paraphrase relations of a domain (Sekine
2005).
4 Discovering what’s important
Finally, there may be situations where we don’t have specific event or rela-
tion types in mind – where we simply want to identify and extract the ‘im-
portant’ events and relations for a particular domain or topic. Riloff (1996)
introduced the basic idea of dividing a document collection into relevant (on
topic) and irrelevant (off topic) documents, and selecting constructs which
occur much more frequently in the relevant documents. Her approach relied
on a relevance-tagged corpus. This idea was extended by (Yangarber et al.
2000) to bootstrap the discovery process from a small ‘seed’ set of patterns
which define a topic. Sudo generalized the form of the discovered patterns
(Sudo et al. 2003) and created a system which started from a narrative de-
scription of a topic and used this description to retrieve relevant documents
(Sudo et al. 2001).
These methods have been used to collect the linguistic expressions for a
specific set of event types, and they are effective when these events form a
coherent ‘topic’ – when they co-occur in documents. Because these methods
are based on the distribution of constructs in documents, they may gather
together related but non-synonymous forms like ‘hire’, ‘fire’, and ‘resign’,
or ‘buy’ and ‘sell’. However, by coupling these methods with paraphrase
22 RALPH GRISHMAN
discovery, it should be possible to both gather relevant expressions and

group those representing the same event types (Shinyama et al. 2002). By
doing so, it should be possible in the future to automatically identify the
key semantic structures of a topic or domain, and to create an ie system to
extract this information (Sekine 2006).
Acknowledgements. This research was supported by the Defense Advanced

Research Projects Agency under Grant N66001-04-1-8920 and Contract HR0011-
06-C-0023, and by the National Science Foundation under Grant 03-25657. This
paper does not necessarily reflect the position or the policy of the U.S. Govern-
ment.
REFERENCES
Agichtein, Eugene & Luis Gravano. 2000. “Snowball: Extracting Relations from
Large Plain-Text Collections”. 5th ACM Int. Conf. on Digital Libraries,
85-94. San Antonio, Texas.
Barzilay, Regina & Kathleen R. McKeown. 2001. “Extracting Paraphrases from
a Parallel Corpus”. ACL/EACL 2001, 50-57. Toulouse, France.
Brin, Sergei. 1998. “Extracting Patterns and Relations from the World Wide
Web”. World Wide Web and Databases International Workshop (= LNCS
1590), 172-183. Heidelberg: Springer.
Chieu, Hai Leong & Hwee Tou Ng. 2002. “Named Entity Recognition: A Max-
imum Entropy Approach Using Global Information”. 19th Int. Conf. on
Computational Linguistics (COLING-2002 ), 190-196. Taipei.
Hirschman, Lynette. 1998. “Language Understanding Evaluations: Lessons
Learned from MUC and ATIS”. 1st Int. Conf. on Language Resources and
Evaluation (LREC-1998 ), 117-122. Granada, Spain.
Ji, Heng & Ralph Grishman. 2005. “Improving Name Tagging by Reference
Resolution and Relation Detection”. 43rd Annual Meeting of the Association
for Computational Linguistics (ACL’2005 ), 411-418. Ann Arbor, Michigan.
Kambhatla, Nanda. 2004. “Combining Lexical, Syntactic and Semantic Features
with Maximum Entropy Models for Extracting Relations”. Companion Vol-
ume to the Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL’2004 ), 178-181. Barcelona, Spain.
Kingsbury, Paul & Martha Palmer. 2002. “From TreeBank to PropBank”. 3rd
Int. Conf. on Language Resources (LREC-2002 ), 1989-1993. Las Palmas,
Spain.
Lin, Dekang & Patrick Pantel. 2001. “Discovery of Inference Rules for Question
Answering”. Natural Language Engineering 7:4.343-360.
Meyers, Adam, Ruth Reeves, Catherine Macleod, Rachel Szekely, Veronika Zielin-
ska, Brian Young & Ralph Grishman. 2004. “Annotating Noun Argument
Structure for Nombank”. 4th Int. Conf. on Language Resources and Evalu-
ation (LREC-2004 ), 803-806. Lisbon, Portugal.
Riloff, Ellen. 1996. “Automatically Generating Extraction Patterns from Un-
tagged Text”. 13th National Conference on Artificial Intelligence, 1044-1049.
Portland, Oregon.
Roth, Dan & Wen-tau Yih. 2004. “A Linear Programming Formulation for
Global Inference in Natural Language Tasks”. Conf. on Computational Nat-
ural Language Learning (CoNLL’2004 ), 1-8. Boston, Mass.
Sekine, Satoshi. 2005. “Automatic Paraphrase Discovery Based on Context and
Keywords between NE Pairs”. 3rd Int. Workshop on Paraphrasing, Jeju
Island, South Korea.
Sekine, Satoshi. 2006. “On-demand Information Extraction”. 21st Int. Conf. on
Computational Linguistics and 44th Annual Meeting of the Association for
Computational Linguistics (COLING-ACL’2006 ), Sydney, Australia.
Shinyama, Yusuke, Satoshi Sekine, Kiyoshi Sudo & Ralph Grishman. 2002.
“Automatic Paraphrase Acquisition from News Articles”. Human Language
Technology Conference (HLT-2002 ), San Diego, Calif.
Stevenson, Mark & Mark Greenwood. 2005. “A Semantic Approach to IE Pattern
Induction”. 43rd Annual Meeting of Association for Computational Linguis-
tics, 379-386. Ann Arbor, Michigan.
Sudo, Kiyoshi, Satoshi Sekine & Ralph Grishman. 2001. “Automatic Pattern
Acquisition for Japanese Information Extraction”. Human Language Tech-
nology Conference (HLT-2001 ), 51-58. San Diego, Calif.
Sudo, Kiyoshi , Satoshi Sekine & Ralph Grishman. 2003. “An Improved Ex-
traction Pattern Representation Model for Automatic IE Pattern Acquisi-
tion”. 41st Annual Meeting of the Association for Computational Linguistics
(ACL’2003 ), 224-231. Sapporo, Japan.
Yangarber, Roman, Ralph Grishman, Pasi Tapanainen & Silja Huttunen. 2000.
“Automatic Acquisition of Domain Knowledge for Information Extraction”.
18th Int. Conf. on Computational Linguistics (COLING-2000 ), 940-946.
Saarbrücken, Germany.
Zhao, Shubin, Adam Meyers & Ralph Grishman. 2004. “Discriminative Slot
Detection Using Kernel Methods”. 20th Int. Conf. on Computational Lin-
guistics (COLING-2004 ), 757-763. Geneva.
Zhao, Shubin & Ralph Grishman. 2005. “Extracting Relations with Integrated
Information Using Kernel Methods”. 43rd Annual Meeting of the Association
for Computational Linguistics, 419-428. Ann Arbor, Michigan.
Semantic Indexing using Minimum Redundancy Cut
in Ontologies
Florian Seydoux & Jean-Cédric Chappelier

Ecole Polytechnique Fédérale de Lausanne (EPFL)
Abstract
This paper presents a new method that improves semantic indexing
while reducing the number of indexing terms. Indexing terms are
determined using a minimum redundancy cut in a hierarchy of con-
ceptual hypernyms provided by an ontology (e.g., WordNet, EDR).
The results of some information retrieval experiments carried out on
several standard document collections using the WordNet and EDR
ontologies are presented, illustrating the benefit of the method.
1 Introduction
Three fields are mainly reported in the literature about the use of seman-
tic knowledge for Information Retrieval: query expansion (Voorhees 1994,
Moldovan & Mihalcea 2000), Word Sense Disambiguation (Ide & Véronis
1998, Wilks & Stevenson 1998, Besançon et al. 2001) and semantic index-
ing. This contribution relates to the latest, the main idea of which is to
use word senses rather than, or in addition to, the words (usually lem-
mas or stems) for indexing document, in order to improve both recall (by
handling synonymy) and precision (by handling homonymy and polysemy).
However, the experiments reported in the litterature lead to contradicting
results: some claim that it degrades the performance (Salton 1968, Harman
1988, Voorhees 1993, Voorhees 1998); whereas for others the gain seems sig-
nificant (Richardson & Smeaton 1995, Smeaton & Quigley 1996, Gonzalo
et al. 1998a & 1998b, Mihalcea & Moldovan 2000).
Although it seems desirable for IR systems to take a maximum of seman-
tic information into account, the resulting expansion of the data processed
may not develop its full potential. Indeed, the growth of the number of
index terms not only increases the processing time but could also reduce
the precision, as discriminating documents by using a very large number of
index terms is a hard task.
This problem is not new, and various techniques aiming at reducing the
size of the indexing set already exist: filtering by stoplist, part of speech
tags, frequencies, or through statistical techniques as in LSI (Deerwester
et al. 1990) or PLSI (Hofmann 1999). However, most of these techniques are
not adapted to the case where an explicit semantic information is available,
26 F. SEYDOUX & J.-C. CHAPPELIER
wi,j,k r
(a)
c1 ... c 2 c3
(b) MRC c2
c3 c1
(c) wi
wj wk
Fig. 1: Several indexing scheme: (a) usual indexing with words, stems or
lemmas; (b) synset (or hypernyms synsets) indexing: each indexing term is
replaced by its (hypernyms) synset; this, in principle, reduces the size of
the indexing set since all the indexing terms that are shared by the same
hypernym are regrouped in one single indexing feature; (c) Minimum
Redundancy Cut (MRC) indexing: each indexing term is replaced by its
dominating concept chosen with MRC. This furthermore reduces the size
of the indexing set since all the indexing terms that are subsumed by the
same concept in the MRC are regrouped in one single indexing feature.
for example in the form of a thesaurus or an ontology (i.e., with some
underlying formal – not statistical – structure).
The focus of the work presented here is to use external1 structured se-
mantic resources such as an ontology in order to limit the semantic indexing
set. This work, which is a continuation of (Seydoux & Chappelier 2005),
relates, but from a different point of view, with experiments described in
(Gonzalo et al. 1998b, Whaley 1999, Mihalcea & Moldovan 2000), which
uses the synsets (or hypernyms synsets (Mihalcea & Moldovan 2000)) of
WordNet as indexing terms. We follow the onto-matching technique de-
scribed in (Kiryakov & Simov 1999), but here selecting the indexing terms
using the “Minimum Redundancy Cut” (MRC) (MRC, see Figure 1), an
information theory based criterion applied to the inclusive ’is-a’ relation
(hypernyms) provided by the WordNet (Fellbaum 1998, Miller 1995) and
EDR taxonomies (Miyoshi et al. 1996).
2 Minimum redundancy cut in an ontology
The choice of the appropriate hypernym (a “concept” in the ontology) to be

used for representing a word is not easy: be it too general, the performance
of the system will degrade (lack of precision); be it too specific, the indexing
set will not reduce enough, preserving some distinction between words with
1
By “external ”, we mean “not directly related to the document collection itself ”.
SEMANTIC INDEXING USING MRC 27
close senses (lack of recall). To select the appropriate level of conceptual

indexing, we consider cuts in the ontology.
Let N = {ni } represent the set of nodes and W the set of words in the
ontology. A cut Γ in the directed acyclic graph (DAG) representing the
ontology, is defined as a minimal subset2 of N which covers W. Each node
in the cut then represents every leaf node it dominates.
A probabilized cut M = (Γ, P ) is a couple consisting of a cut Γ and a
probability distribution P on Γ.
The main problem is to find a computable strategy to select an optimal
cut. We propose to use an information theory based criterion, that selects
a cut for which the redundancy is minimal.
From now on, let’s consider the probabilized cut M = (Γ, Pf ), where
Pf is defined using the relative frequencies of the words in the collection:
Pf (ni ) = f (ni )/|D|, f (ni ) being the number of occurrences of the node ni
in a document collection D of size |D|. To compute f (ni ), we consider that
an occurrence of ni happens when any of the hyponym words of ni occurs.
The redundancy R(M) of a probabilized cut M = (Γ, P ) is defined
(Shannon 1948) as R(M) = 1 − H(M)/ log |M|, P where |M| denotes the
number of nodes in the cut Γ and H(M) = − n∈Γ P (n) · log P (n).
Minimizing the redundancy is thus equivalent to maximizing the ratio
between the entropy H(M) of the cut and its maximum possible value
(log |M|), i.e., balancing as much as possible the probabilities of the nodes
in the cut.
Notice that R does not necessarily have a unique minimum, but the
ontology may rather have several equally minimal cuts. In practice, this
can easily be overcome, considering for instance any of the minimal cuts, or
those having a minimal number of nodes, or the minimum average depth of
the nodes, etc.
In order to identify global MRC, the whole set of possible cuts has to
be considered. We thus decided to give up global optimality for the sake of
tractability and focussed rather on more efficient heuristics.
The proposed algorithm consists, starting from the leaves, in iteratively
modifying a given cut by systematically replacing a node by its parent or
its children that minimizes the redundancy. For each node ni in the current
cut, we consider on one hand ni ↓ the (set of) children of ni , and on the
other hand ni ↑ the (set of) parents of ni . Due to the DAG structure, this
replacement can involve other nodes in the cut. In fact, when replacing ni
by ni ↓ , those nodes which are already covered by other nodes in the cut
must be excluded, i.e., we must consider ni ↓ \ (Γ \ {ni })⇓ instead of ni ↓
2
“minimal subset” means that no node can be removed from the set without decreasing
it’s coverage.
ni
Γ ni
Γ"
Γ"
ni Γ
m
∋
(Γ \ {ni })
m (ni )
(a) (b)
Fig. 2: (a) Lower search: node ni is replaced by ni ↓ without the nodes
already covered by other nodes in Γ (e.g., m), i.e., (Γ \ {ni })⇓ .
(b) Upper search: node ni is replaced by ni ↑ , and all nodes covered by ni ↑
⇓
(i.e., (ni ↑ ) ) are removed from Γ.
(see fig. 2a), where n⇓ stands for the transitive closure of n↓ ; and similarly
for ni ↑ (see fig. 2b).
Then, the cut with minimal redundancy among these new considered
cuts and the current one is kept, and the search continue as long as better
cuts are found. The full algorithm3 is given hereafter (Algorithm 1).
This algorithm converges towards a local minimum redundancy cut close
to the leaves. Note that this algorithm can be stopped any time, if required,
since it always works on a complete cut.
3 Experiments
We carried out several experiments with standard English document col-

lections of the Smart system,and ontologies generated from the MySQL
port of WordNet 4 and the English part of EDR Electronic Dictionary.
WordNet gathers information about approximatively 200, 000 “words”
(≈ 150, 000 different lexical strings including compounds) of type noun,
verb, adjective and adverb; organized into ≈ 115, 000 synsets, with ≈ 100, 000
hypernyms relations between them.
EDR gathers information about approximatively 420, 000 “words” (≈
240, 000 different lexical strings, including compounds and idiomatic ex-
pressions) of differents type (whithout restriction on POS); organized into
≈ 490, 000 concepts, with ≈ 500, 000 super/sub relations between them; we
gather here the two differents ontologies provided by EDR (a very large scale
general ontology and a smallest one specialized on information science).
All the experiments were acheived following the same procedure:
3
In practice, several optimizations can be made, which do not conceptually change the
algorithm and are thus not presented here for the sake of clarity.
4
WordNet v.2.0, by Android Technologies.
Algorithm 1 MRC local search algorithm
Requires: a hierarchy N (the leaves of which are W)

Ensures: a cut Γ with (local) minimal redundancy
Γ ←W {current cut, start from the leaves}

repeat
Γ′ ← ∅ {best new cut}
Γ′′ ← ∅ {tested candidate}
continue ← false {search-loop control flag}
for all ni ∈ Γ do
{Evaluate the children’s
cut:}
Γ ← (Γ \ {ni }) ∪ ni \ (Γ \ {ni })⇓
′′ ↓

Γ′ ← Argmin R(Γ′ ), R(Γ′′ )
{Evaluate each parent’s cut:}
for all nj ∈ ni ↑ do
Γ′′ ← (Γ ∪ {nj}) \ nj ⇓
Γ′ ← Argmin R(Γ′ ), R(Γ′′ )
if R(Γ′ ) < R(Γ) then
Γ ← Γ′ {keep the best cut}
continue ← true {the search goes on}
{some watchdog or timer can be put here}
until continue is false
return Γ
with R(∅) = R({c}) = 1, by convention.
1. First, textual information from documents and queries are tokenized

and lemmatized;tokens are then filtered, according to their POS tag
(only nouns, verbs, adverbs and adjectives are kept).
2. Then, for every tokens in a document, corresponding entries (leaves)
in the ontology are looked for, first using the lexical string, then with
the lemma if necessary. Tokens with no correspondence in the con-
sidered ontology are indexed in the usual way. The coverage rate of
the collections by the ontology was about 90% for both WordNet and
EDR.
3. Then the hierarchy of concepts related to the tokens found in the
ontology is expanded by selecting all possible senses for WordNet (re-
lying on the mutual reinforcement induced by collocations to have a
...
Γ A B C
g f e d c b a
documents (and queries) queries only
Fig. 3: Possibles configuration when indexing the queries’ terms. Word ‘a’
(not in the ontology) is always indexed by himself, while ‘d’ is always
indexed by ‘C’ and ‘e’, ‘f ’ and ‘g’ are always indexed by ‘B’. In the 0up
strategy, ‘b’ and ‘c’ are not indexed; in the 1up strategy, ‘c’ is indexed by
‘C’, and in the 2up strategy, ‘b’ is indexed as well by ‘B’ and ‘C’.
sort of disambiguation), and only the most frequent sense for EDR 5 .
4. A MRC cut is then computed with the algorithm previously presented.
5. The index of documents and queries are then computed; each token
is substituted by the concepts from the cut which subordinate it. As
the cut was computed only with nodes covering words contained in
the documents (not the queries), tokens covered by the ontology but
not by concepts in the selected cut can occur in the queries. We thus
evaluated the three following strategies (see Figure 3):
“0up” the first strategy simply consists in ignoring these tokens (in
Figure 3: ’b’ and ’c’ are not indexed);
“1up” in a more sophisticated strategy, we look if the related concepts
(or synsets) subordinate a part of the cut; in this case, the term
is indexed by the subordinate part of the cut, otherwise it is
ignored (in Figure 3: ’b’ is ignored but ’c’ is indexed by ’C’);
“2up” the most sophisticated strategy evaluated here consists in also
↑
looking for the hypernyms of the related concepts, i.e., (t↑ ) (in
Figure 3: ’c’ is always indexed by ’C’, but ’b’ is indexed as well,
by ’B’ and ’C’).
Tokens not covered by the ontology are indexed in the standard way
(in the figure, the token ’a’ is always indexed by itself).
5
This technique gives better results rather than keeping all possible senses in EDR (Sey-
doux & Chappelier 2005); this kind of WSD was not possible for WordNet, because
the most frequent sense information was not present in the WordNet version used.
6. Finally, search and evaluation are performed, using the vector-space

Smart information retrieval system (Salton 1971).
Table 1 gathers the 11-pt precision and the 30-doc recall for the experiments
carried out.
4 Discussion and conclusion

Three main conclusions can be drawn out of these experiments:
1. Using adapted additional semantic information can enhance the in-
dexing of documents, and thus the performance of a IR system. The results
of semantic (ontology-based) indexing (columns (c)) are better than the
baseline system for three of the collections, but clearly worse on the MED
collection.
This can be explained by the specificity of the vocabulary of these col-
lections, and their adequacy with the semantic resource. ADI and CACM
have an important technical vocabulary, but well covered by the EDR on-
tology6 . CACM documents are extremely small, often restricted to a single
title with few words. Conversely, the vocabulary of TIME is very general
and the length of each document is large. The CISI collection present docu-
ments of average size, but with a significant number of dates, proper names,
etc., for which the POS filtering seems to have annoying consequences (the
very low performance is clearly due to an initial loss of information). Fi-
nally, the MED collection has an extremely specific vocabulary, for which
the used ontologies were not adapted.
2. Excepted for the TIME collection, the results on EDR are globally
better than those obtained on WordNet. This is probably caused by the
(kind of) “WSD” technique used with EDR. Keeping all possible senses
with EDR gives almost the same results as with WordNet, excepted for the
TIME collection, for which the WordNet ontology seems really better. A
semantic disambiguation, even rudimentary, appears thus to be necessary.
The expected mutual reinforcement of collocations as a kind of “natural”
disambiguation does actually not occur. The reason is probably that the
hypernyms relations used do not constitute thematic links (for example, the
thematically related terms ‘doctor’, ‘drug’, ‘hospital’, ‘nurse’, etc. are not
linked together with the used ontologies).
Anyway, this result enfavours the use of a proper WSD procedure for
further improving the results.
3. The different strategies for dealing with tokens’ terms not covered
by the indexing set (the cut) lead to different outcomes, depending on the
6
In this special case however, the kind of WSD used here with EDR is not adapted (es-
pecially for ADI). Better results were obtained with EDR without WSD – see (Seydoux
& Chappelier 2005).
(a) (b) (c)-0up (c)-1up (c)-2up

ADI collection (82 documents, 35 queries)
[Documents from Information Science]
EDR, most frequent precision 0.2497 0.2939 0.2924 0.2933 0.2844
concept, tf recall 0.5996 0.7141 0.6901 0.6996 0.6806
concept, tf.idf recall 0.6984 0.7217 0.7081 0.7176 0.7007
precision 0.2497 0.2671 0.2593 0.2562 0.2567
WordNet, all synsets, tf
recall 0.5996 0.6361 0.6146 0.6064 0.6064
WordNet, all synsets, precision 0.3578 0.3564 0.3547 0.3450 0.3353
tf.idf recall 0.6984 0.6649 0.6767 0.6494 0.6422
TIME collection (423 documents, 83 queries)
[General world news articles from the magazine Time (1963)]
precision 0.3288 0.2579 0.2348 0.3449 0.3451
recall 0.7755 0.6263 0.6122 0.7760 0.7880
tf.idf recall 0.8901 0.8421 0.9196 0.9161 0.9281
MED collection (1033 documents, 30 queries)
[Collection of abstract from a medical journal]
precision 0.3623 0.2750 0.1914 0.3174 0.3212
recall 0.4574 0.3803 0.2894 0.4216 0.4241
tf.idf recall 0.5547 0.5120 0.4267 0.5263 0.5261
CISI collection (1460 documents, 112 queries)
[Articles from Information science (Library science)]
precision 0.0687 0.0588 0.0449 0.0738 0.0739
recall 0.1239 0.0926 0.0745 0.1226 0.1224
tf.idf recall 0.2318 0.1979 0.1364 0.2204 0.2192
CACM collection (3204 documents, 64 queries)
[Collection of titles and abstracts from a Computer science journal]
precision 0.1555 0.1628 0.1101 0.1637 0.1625
recall 0.3082 0.2736 0.2019 0.2993 0.2940
tf.idf recall 0.4534 0.3758 0.3366 0.4390 0.4354
Table 1: Evaluation of several indexing scheme (see Figures 1 and 3) on

several collections: (a): words only; (b): direct hypernyms; (c)-0up,
(c)-1up and (c)-2up : hypernyms from the MRC.
used ontology. As with EDR, the simple ’0up’ strategy gives better results,
this strategy is clearly the worst with WordNet, while the ’2up’ seems to
give the best results. This can perhaps be explained by the structure of the
ontologies: all the synsets of WordNet directly cover several leaves, while for
EDR, many of the concepts are just “pure concept”, only related to others
concepts.
As future work, it is forseen to confront the results presented here with
similar experiments using better WSD procedure (especially for WordNet),
and also using different weighting scheme for the evaluation (as for example
the “lnu-weighting”, more reliable for handling noisy data and dealing with
long documents). It would be also interesting to generalize this technique
in order to take in account all relationships between terms given by the
ontology, in addition to the thesaurus of hypernyms. Finally, we want to
emphasize that the technique presented here does not limit to IR but could
also be apply to other NLP domains, such as document clustering (Hotho
et al. 2003) and text summarization.
REFERENCES
Besançon, R., J.-C. Chappelier, M. Rajman & A. Rozenknop. 2001. “Improving
Text Representations through Probabilistic Integration of Synonymy Rela-
tions”. Xth Int. Symp. on Applied Stochastic Models and Data Analysis
(ASMDA’2001 ), vol. 1, 200-205. Compiègne, France.
Deerwester, S.C., S.T. Dumais, T.K. Landauer, G.W. Furnas & R.A. Harshman.
1990. “Indexing by Latent Semantic Analysis”. Journal of the American
Society of Information Science 41:6.391-407.
Fellbaum, Christiane, ed. 1998. WordNet, An Electronic Lexical Database. Cam-
bridge, Mass.: MIT Press.
Gonzalo, J., F. Verdejo, I. Chugur & J. Cigarran. 1998. “Indexing with WordNet
Synsets Can Improve Text Retrieval”. COLING/ACL Workshop on Usage
of WordNet for Natural Language Processing, 38-44. Montreal, Canada.
Gonzalo, J., F. Verdejo, C. Peters & N. Calzolari. 1998. “Applying EuroWordNet
to Multilingual Text Retrieval”. Journal of Computers and the Humanities
32:2-3.185-207.
Harman, D. 1988. “Towards Interactive Query Expansion”. 11th Conf. on Re-
search and Development in Information Retrieval, 321-331. Grenoble.
Hofmann, T. 1999. “Probabilistic Latent Semantic Indexing”. 22nd Int. Conf.
on Research and Development in Information Retrieval, 50-57. Berkeley.
Hotho, A., S. Staab & G. Stumme. 2003. “WordNet Improves Text Document
Clustering”. SIGIR 2003 Semantic Web Workshop. Toronto, Canada.
Ide, Nancy & J. Véronis. 1998. “Word Sense Disambiguation: The State of the
Art”. Computational Linguistics 24:1.1-40.
Kiryakov, A.K. & K.Iv. Simov. 1999. “Ontologically Supported Semantic Match-
ing”. Nordic Conf. on Computational Linguistics. Trondheim, Norway.
Mihalcea, R. & D. Moldovan. 2000. “Semantic Indexing Using WordNet Senses”.
ACL Workshop on IR & NLP. Hong Kong.
Miller, G.A. 1995. “Wordnet: A Lexical Database for English”. Communications
of the ACM 38:11.39-41.
Miyoshi, H., K. Sugiyama, M. Kobayashi & T. Ogino. 1996. “An Overview of
the EDR Electronic Dictionary and the Current Status of Its Utilization”.
16th Int. Conf. on Computational Linguistics (COLING-1996 ), 1090-1093.
Copenhagen.
Moldovan, D.I. & R. Mihalcea. 2000. “Using WordNet and Lexical Operators to
Improve Internet Searches”. IEEE Internet Computing 4:1.34-43.
Richardson, R. & A.F. Smeaton. 1995. “Using WordNet in a Knowledge-based
Approach to Information Retrieval”. Technical Report CA-0395. Dublin,
Ireland: Dublin City University.
Salton, G. 1968. Automatic Information Organization and Retrieval. New York:
McGraw-Hill Book Company.
Salton, G. 1971. The SMART Retrieval System - Experiments in Automatic
Document Processing. Englewood Cliffs, NJ: Prentice Hall Inc.
Seydoux, F. & J.-C. Chappelier. 2005. “Hypernyms Ontologies for Semantic
Indexing”. Methodologies and Evaluation of Lexical Cohesion Techniques in
Real-world Applications (electra’05), 49-55. Salvador, Brazil.
Shannon, C.E. 1948. “A Mathematical Theory of Communication”. The Bell
System Technical Journal 27.379-423.
Smeaton, A.F. & I. Quigley. 1996. “Experiments on Using Semantic Distances
between Words in Image Caption Retrieval”. 19th Int. ACM-SIGIR Conf.
on Research and Development in Information Retrieval, 174-180. Zurich.
Voorhees, E.M. 1993. “Using WordNet to Disambiguate Word Senses for Text
Retrieval”. 16th Annual Int. ACM-SIGIR Conf. on Research and Develop-
ment in Information Retrieval, 171-80. Pittsburgh, Penn.
Voorhees, E.M. 1994. “Query Expansion Using Lexical-Semantic Relations”.
17th Int. ACM-SIGIR Conf. on Research and Development in Information
Retrieval, 61-69. Dublin, Ireland.
Voorhees, E.M. 1998. “Using WordNet for Text Retrieval”. WordNet, An Elec-
tronic Lexical Database 285-303. Cambridge, Mass.: MIT Press.
Whaley, J.M. 1999. “An Application of Word Sense Disambiguation to Infor-
mation Retrieval”. Technical Report PCS-TR99-352. Hanover, New Haven:
Dartmouth College, Computer Science Dept.
Wilks, Yorick & Mark Stevenson. 1998. “Word Sense Disambiguation Using
Optimised Combinations of Knowledge Sources”. 17th Int. Conf. on Com-
putational Linguistics (COLING-1998 ), 1398-1402. Montreal, Canada.
Indexing and Querying Linguistic Metadata and
Document Content
Niraj Aswani, Valentin Tablan, Kalina Bontcheva &
Hamish Cunningham
University of Sheffield
Abstract
The need for efficient corpus indexing and querying arises frequently
both in machine learning-based and human-engineered natural lan-
guage processing systems. This paper presents the annic system,
which can index documents not only by content, but also by their
linguististic annotations and features. It also enables users to for-
mulate versatile queries mixing keywords and linguistic information.
The result consists of the matching texts in the corpus, displayed
within the context of linguistic annotations (not just text, as is cus-
tomary for kwic systems). The data is displayed in a graphical
user interface, which facilitates its exploration and the discovery of
new patterns, which can in turn be tested by launching new annic
queries.1 .
1 Introduction
The need for efficient corpus indexing and querying arises frequently both
in machine learning-based and human-engineered natural language process-
ing systems. A number of query systems have been proposed already
and Mason (1998:20-27), Bird (2000b:1699-1706) and Gaizauskas (2003:
227-236) are amongst the most recent ones. In this paper, we present a full-
featured annotation indexing and retrieval search engine, called
annic (ANNotations-In-Context), which has been developed as part of
gate (General Architecture for Text Engineering) (Cunningham 2002:168-
175).
Whilst systems such as McKelvie & Mikheev (1998), Gaizauskas (2003:
227-236) and Cassidy (2002) are targeted towards specific types of docu-
ments, Bird (2000b:1699-1706) and Mason (1998:20-27) are general purpose
systems. Annic falls in between these two types, because it can index docu-
ments in any format supported by the gate system. These existing systems
were taken as a starting point, but annic goes beyond their capabilities in
a number of important ways. New features address issues such as exten-
sive indexing of linguistic information associated with document content,
1
This work was partially supported by an ahrb grant etcsl and an eu grant sekt
36 ASWANI, TABLAN, BONTCHEVA & CUNNINGHAM
independent of document format. It also allows indexing and extraction

of information from overlapping annotations and features. Its advanced
graphical user interface provides a graphical view of annotation mark-ups
over the text along with an ability to build new queries interactively. In
addition, annic can be used as a first step in rule development for nlp sys-
tems as it enables the discovery and testing of patterns in corpora. Section
2 introduces the gate text processing platform which is the basis of this
work. Following this, we provide details of the annic implementation and
the changes made in Lucene.
2 GATE
Gate is a large-scale infrastructure for natural language processing applica-

tions (Cunningham 2002:168-175). Lingustic data associated with language
resources such as documents and corpora is encoded in the form of annota-
tions. Gate supports a variety of formats including xml, rtf, html, sgml,
email and plain text. In all cases, when a document is created/opened in
gate, the format is analysed and converted into a single unified model of
annotation. The annotation format is a modified form of the tipster for-
mat (Grisham 1997) which has been made largely compatible with the Atlas
format (Bird 2000a:807-814). The annotations associated with each docu-
ment are a structure central to gate, because they encode the language
data read and produced by each processing module. Each annotation has
a start and an end offset and a set of features associated with it. Each
feature has a “name” and a “value”, which holds the descriptive or analyti-
cal information such as part-of-speech tags, syntactic analysis, co-reference
information etc.
Jape, Java Annotation Patterns Engine, is part of the gate system. It
is an engine based on regular expression pattern/action rules over annota-
tions. Jape is a version of cpsl (Common Pattern Specification Language).
This engine executes the jape grammar phases - each phase consists of a
set of pattern/action rules. The left-hand-side (lhs) of the rule represents
an annotation pattern and the right-hand-side (rhs) describes the action to
be taken when pattern found in the document. Jape executes these rules
in a sequential manner and applies the rhs action to generate new anno-
tations over the matched regular expression pattern. Rule prioritisation (if
activated) prevents multiple assignments of annotations to the same text
string.
This paper demonstrates how annic indexes gate processed documents
with their annotations and features and enables users to formulate versatile
queries using jape patterns.
INDEXING & QUERYING METADATA & CONTENT 37
3 ANNIC
Annic is built on top of the Apache Lucene2 a high performance full-
featured search engine implemented in Java which supports indexing and
search of large document collections. Our choice of ir engine is due to
the customisability of Lucene. After few changes in the behaviour of the
key components of Lucene, we were able to make Lucene adaptable to our
requirements.
3.1 Lucene token generation

For each word in a document, Lucene creates a separate token. Every
Lucene token has its own position in the token stream. This position remains
relative to its previous token and is stored as a position increment factor.
Since Lucene only indexes the text attribute of a Lucene token, to meet our
requirements, i.e., to index the linguistic information and metadata, Lucene
was modified to index also the type attribute. Type attribute holds a string
assigned by lexical analyzer that defines the lexical or syntactic class of
the Lucene token. Every annotation in gate has a corresponding features
associated with it. We create a separate Lucene token for every feature in
the document. Consider the following example:
The word ‘‘Bill’’ is annotated as:
AnnotationType:Token Features=POS:NNP,String:Bill
AnnotationType:Person
Sr. No. Token Text Token Type Pos. Incr. Description

1 Token * 1 Annotation Type “Token”
2 NNP Token.pos 0 “pos” feature with value “NNP”
3 Bill Token.string 0 “string” feature with value “Bill”
4 Person * 0 Annotation Type “Person”
Table 1: Token stream for the word “Bill”
Table 3.1 explains the token stream that contains tokens for the above
annotations. The annotation type itself is stored as a separate Lucene token
with its attribute type “*” and text as the value of “annotation type”. This
allows users to search for a particular annotation type. In order not to
confuse features of one annotation with others, feature names are qualified
with their respective annotation type names. Where there exist multiple
annotations over the same piece of text, only the position of the very first
feature of the very first annotation is set to 1 and it is set to 0 for the
rest of the annotations and their features. This enables users to query over
overlapping annotations and features.
2
http://lucene.apache.org
It is possible for two annotations to share the same offsets. If two anno-
tations share both offsets, such annotations are kept on the same position
in the token stream, and otherwise one after another. But, what if anno-
tations overlap each other (i.e., they share only one of the start and end
offsets)? In this case, though annotations do not appear one after another,
they are stored one after another. This may lead to incorrect results being
returned and therefore the results are verified in order to filter out invalid
overlapping patterns.
Before indexing gate documents with Lucene, we convert them into
the Lucene format and refer to them as gate Lucene documents. For each
document, the token stream is stored in a separate file as a Java serializable
object in the index directory. Later, it is retrieved in order to fetch left and
right contexts of the found pattern.
3.2 Gate query parser

Jape patterns support various query formats. Below we give few examples
of jape patterns. Actual patterns can also be a combination of one or more
of the following pattern clauses:
1. “String”
2. {AnnotationType}
3. {AnnotationType == “String”}
4. {AnnotationType.feature == “feature value”}
5. {AnnotationType1, AnnotationType2.feature == “featureValue”}
6. {AnnotationType1.feature == “featureValue”, AnnotationType2.feature == “featureValue”}
Order of the annotations specified in annic query is very important. In

Lucene, document must contain the specified keywords, no matter in which
order they exist. Order is important only for the phrase queries. Since the
default implementation of Lucene indexer indexes only the text attribute of
Lucene Token, it does not allow searching over the type attribute. Lucene
query parser does not support position increments in queries. For example
if one wants to search for annotations of type Location and Person referring
to the same piece of text, Lucene does not support this. On the other hand,
the respective JAPE pattern would be {Location, Person}.
Query Term
JAPE Pattern Token Term Text Term Type
“String” String Token.string
{annotType} annotType “*”
{annotType.featureType == “value”} value annotType.featureType
Table 2: JAPE pattern tokens and their respective query terms
We introduce our own query parser (annic query parser) which accepts
jape queries and converts them into Lucene queries. Lucene query parser,
before accessing index, converts each keyword into an instance of Term

class and compares them with the terms in index. Table 3.2 demonstrates
how jape pattern tokens are converted into query terms. In order to use
predefined Lucene queries (i.e., Boolean and Phrase queries), jape patterns
with “OR” operator are normalized into the “AND normalized form” and
all such patterns are ORed together to form a Boolean query.
Lucene Phrase query considers its each token as a separate term and
sets its position to the previous terms position + 1. This behaviour leads
to a problem in the context of jape queries. For example, user issues the
following query:
{Location, Person.gender = ‘‘male’’}
This should search for the text that is annotated as Location and Person,
where the Person annotation must contain a feature called “gender” with
value “male”. In this case, the annic query parser creates two separate
terms (Location and Person.gender = “male”). In order to make both terms
referring to the same location, positions of these terms must remain same. If
the position of first term is n, Lucenes phrase query implementation makes
the position of second term to n+1. This results into a pattern where the
first annotation is Location and is followed by the annotation Person.gender
= “male”. To overcome this problem, one solution is to pass customized
term positions along with terms to the phrase query. Given a term and its
position respective to its previous term, Lucene searches within its index
to find the term only at the given position. Thus, instead of searching
the second term at the n+1 position, Lucene seeks a term that occurs at
n position. This disables automatic increment in term’s position and also
allows searching for the overlapping annotation.
But even after this arrangment, there exists one major overlapping prob-
lem. For example for the text “Mr. Tim-Berners Lee told ...”, where the
text “Mr.” is annotated as “Title”, “Tim-Berners” as “FirstName”, “Lee”
as “Surname”, “Mr. Tim-Berners Lee” as “Person” and finally “told” as
“Token” with the part-of-speech tag “verb”. For these annotations, the
tokens “Title” and “Person” will be placed at the same position in the to-
ken stream, while “FirstName”, “Surname” and “Verb” will be placed one
after another after the “Title” and the “Person” annotations. This results
into incorrect results when the query is : {Person} {Token.string == “told”}. When
searching this pattern in the token stream, “Person” is not followed by the
Token string “told”, instead “Person” is followed by the annotation “First-
Name”, which is followed the annotation “Surname” and which is followed
by the “told”. To solve this problem, after converting the jape query into
the Lucene query terms, we issue the query that contains only the initial
terms which refer to the same location. For example, instead of querying
with {Person}{Token.string == “told”} , we query index with {Person} . As a result
this query returns all positions from the token stream where the annota-
tion is “Person”. We compare the rest terms (i.e., “Token.string == “told” ) by
fetching terms after the “Person” annotation and by comparing query terms
with them.
Klene operators: Annic supports two operators, + and *, to specify
the number of times a particular annotation or a sub pattern should appear
in the main query pattern. Here, ({A})+n means one and up to n occur-
rences of annotation {A} and ({A})*n means zero or up to n occurrences
of annotation {A}.
4 ANNIC user interface
Annic provides an advanced user interface at the presentation layer that

allows users to index a large collection of documents (i.e., corpus), search
indices and analyze the found patterns along with their left and right con-
texts concordances. At indexing time, the user can specify the corpus to be
indexed, the annotation type that acts as document tokens, annotation set
which contains the annotations to index, features and annotation types not
to include in index and finally the location of index on the local or network
file system. At search time, the user specifies the maximum number of pat-
terns to retrieve as results, number of tokens to show in the left and right
contexts and finally the jape pattern query.
4.1 ANNIC viewer
Fig. 1: annic Viewer
Figure 1 gives a snapshot of the annic search window. The bottom section
in the window contains the patterns along with their left and right context
concordances and the section at top shows graphical visualization of an-

notations. Annic shows each pattern in a separate row and provides tool
tip that shows the query that the selected pattern refers to. Along with
its left and right context texts, it also lists the name of documents that
the patterns come from. When the focus changes from one pattern to an-
other, graphical visualization of annotations (gva, above the pattern table)
changes its current focus to the selected pattern. Here, users have an option
of visualising annotations and their features for the selected pattern. The
figure shows the highlighted spans of annotations for the selected pattern.
Annotation types and features can also be selected from the drop-down
combo box and their spans can also be highlighted into the gva. When
users choose to highlight the features of annotations (e.g., Token.cat), gva
shows the highlighted spans containing values of those features. Whereas
when users choose to highlight the annotation with feature “all”, annic
adds a blank span in gva and shows all its features in a popup window
when mouse enters the span region. A new query can also be generated
and executed from the annic gui. When clicked on any of the highlighted
spans of the annotations, the respective query clause is placed in the New
Query text box. Clicking on Execute issues a new query and refreshes the
gui output. Annic also provides an option to export results in xml or
html files with options of all patterns and selected patterns.
QP Patterns
1 {Token.string==“Microsoft”} | “Microsoft Corp”
2 {Person} {Token.cat==“IN”} {Token.cat==“DT”})*1 {Organization}
3 ({Token.kind==number})+4
({Token.string =“/”} | {Token.string==“-”}) ({Token.kind==number})+2
({Token.string==“/”} | {Token.string==“-”}) ({Token.kind==number})+2
4 {Title} ({Token.orth==upperInitial} | {Token.orth==allCaps}) ({FirstPerson})*1
5 ({Token.cat==“DT”})*1 {Location} {Token.cat==“CC”}
({Token.cat==“DT”})*1 {Location} {Organization}{Token.cat==“IN”}
({Token.cat==“DT”})*1 {Location}
QP=Query Pattern
Table 3: ANNIC queries
5 Applications of ANNIC
Annic is used as a tool aiding the development of jape rules. Language

engineers use their intuition when writing jape rules trying to strike the
ideal balance between specificity and coverage. This requires them to make
a series of informed guesses which are then validated by testing the resulting
ruleset over a corpus. Annic can replace the guesswork in this process with
actual live analysys of the corpus. Each pattern intended as part of a jape
rule can be easily tested directly on the corpus and have its specificity and
coverage assesed almost instantaneously.
Annic can be used also for corpus analysys. It allows querying the in-
formation contained in a corpus in more flexible ways than simple full-text
search. Consider a corpus containing news stories that has been processed
with a standard named entity recognition system like annie3 . A query like
{Organization} ({Token})*3 ({Token.string==’up’}|{Token.string==’down’}) ({Money} | {Percent})
would return mentions of share movements like “BT shared ended up 36p”
or “Marconi was down 15%”. Locating this type of useful text snippets
would be very difficult and time consuming if the only tool available were
text search. Annic can also be useful in helping scholars to analyse linguis-
tic corpora. Sumerologists, for instance, could use it to find all places in the
etcsl corpus 4 where a particular pair of lemmas occur in sequence.
6 Performance results
In order to evaluate the performance of annic, we experimented on three

different corpora (large, medium, and small), processed with gate: 10% of
the bnc (British National Corpus)(374 documents,1443.84MB), hse (Health
and Security Experiments)(192 documents,896MB), and finally the news
corpus (446 documents, 39.4MB).
We tested the performance with several types of queries: string only
queries, combinations of strings and linguistic data, and patterns with quan-
tified Klene operators. Table 3 lists some of the different types of queries
which were issued over the indexed corpuses. Table 4 gives the statistics of
output of these queries. It provides different statistics including the time
taken by annic to retrieve the results and the number of patterns retrieved.
BNC 10% HSE NEWS

QP ST P ST P ST P
1 11.276 112 0.5 0 1.252 3
2 5.23 6 7.0 6 0.432 2
3 50.139 264 110.738 39 6.652 36
4 39.029 238 120.054 180 12.37 1038
5 62.971 81 126.823 124 5.508 281
6 6.191 10 11.875 5 0.692 11
QP=Query Pattern,ST=Search Time,P=Patterns
Table 4: ANNIC query results
3
Gate is distributed with an ie system called annie, A Nearly-New IE system.
4
http://www-etcsl.orient.ox.ac.uk/
7 Related work
McKelvie & Mikheev (1998) describe a suite of programs, lt index, that

supports indexing of large sgml documents. It indexes elements by their
position in the document structure and by their textual content. Annic
is more generic, because it can cope with a wider range of formats, while
covering the same functionality.
Cue (Corpus Universal Examiner) system (Mason 1998:20-27) splits the
corpus data into different data streams (e.g., actual words, pos informa-
tion), which are stored along with their positioning information in the in-
dex. Unlike cue, annic maintains a fixed structured data format (Term
string, Term type, position) within indices and converts all annotations and
their features into this consistent format.
Gaizauskas (2003:227-236) describe a system, xara that indexes any
well-formed xml document. It combines an indexer, a server and a win-
dows client. Indexer requires information like how element content is to be
tokenized and how tokens are to be mapped to index terms etc. Annic
supports not only xml but many other types of documents supported by
gate. Similar to xara, in annic as well, the decision of how documents
be tokenized is left on a user (e.g., gate supplies tokenisers for several
languages).
In order to investigate new models for semi structured data that are
appropriate to xml, Buneman (1998) describes a query language that is be-
yond any xml query languages. They describe extraction rules that consist
of expressions along the tree and are expressed using the html Extraction
Languages (hel). Their query language comes with navigation operators,
regular expressions and conditions to retrieve information even from the
nested structures. Annic query parser works on top of the gate annota-
tions and features and supports search over overlapping annotations and
features. Its advanced user interface allows users to visualize the nested
structure of the annotations with their features highlighted.
Kazai (2004:72-79) discuss the overlapping problem in content-oriented
xml retrieval. They discuss the INitiative for the Evaluation of xml Re-
trieval (inex) system, which discusses the matrices to evaluate the xml
retrieval results. Their argument is that if in an xml document, a sub ele-
ment satisfies a content-oriented query, parent element would also satisfies
the same query. Thus, instead of including only a subcomponent in the
result, inex also includes the parent component. In annic, the overlap-
ping problem, as discussed in Kazai (2004:72-79), does not exist due to two
reasons. 1) Annotations in gate documents are stored as an annotation
graph. Thus comparing the structure of xml documents where elements
contain texts, in gate documents annotations are created over the text.
2) Annic queries are very specific about the annotation types, i.e., query
itself describes the annotation type in which the string should be searched.
If user does not specify annotation type, annic does it automatically to
search strings with the gate token annotation type.
REFERENCES
Bird, S., P. Buneman & W. Tan. 2000. “Towards a Query Language for An-
notation Graphs”. Proceedings of the Second International Conference on
Language Resources and Evaluation, 807-814. Athens.
Bird, S., D. Day, J. Garofolo, J. Henderson, C. Laprun & M. Liberman. 2000b.
“ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation”.
Proceedings of the Second International Conference on Language Resources
and Evaluation, 1699-1706, Athens.
Buneman, P., A. Deutsch, W. Fan, H. Liefke, A. Sahuguet & W.C. Tan. 1998.
“Beyond XML Query Languages”. Proceedings of the Query Language Work-
shop (QL 1998 ). http://www.w3.org/TandS/QL/QL98/pp/penn.ps [Source
checked in April 2006]
Cassidy, S. 2002. “Xquery as an Annotation Query Language: A Use Case
Analysis”. Proceedings of 3rd Language Resources and Evaluation Conference
(LREC’2002 ), Gran Canaria, Spain.
Cunningham, H., D. Maynard, K. Bontcheva & V. Tablan. 2002. “GATE: A
Framework and Graphical Development Environment for Robust NLP Tools
and Applications”. Proceedings of the 40th Anniversary Meeting of the As-
sociation for Computational Linguistics (ACL 2002 ), 168-175. Philadelphia.
Gaizauskas, R., L. Burnard, P. Clough & S. Piao. 2003. “Using the XARA XML-
Aware Corpus Query Tool to Investigate the METER Corpus”. Proceedings
of the Corpus Linguistics 2003 Conference, 227-236. Lancaster, U.K.
Grishman, Ralph. 1997. “TIPSTER Architecture Design Document Version 2.3”.
Technical report, The Defense Advanced Research Projects Agency (DARPA).
http://www.itl.nist.gov/div894/894.02/related projects/tipster/
[Source checked in April 2006]
Kazai, G., M. Lalmas & A. Vries. 2004. “The Overlapping Problem in Content-
Oriented XML Retrieval Evaluation”. Proceedings of the 27th International
conference on Research and Development in Information Retrieval, 72-79.
Sheffield, U.K.
Mason, O. 1998. “The CUE Corpus Access Tool”. Workshop on Distributing
and Accessing Linguistic Resources, 20-27, Granada, Spain.
http://www.dcs.shef.ac.uk/~hamish/dalr/ [Source checked in February
2006].
McKelvie, D. & A. Mikheev. 1998. “Indexing SGML files using LT NSL”. LT
Index documentation http://www.ltg.ed.ac.uk/ [Source checked in April
2006]
Term Representation
with Generalized Latent Semantic Analysis
Irina Matveeva∗ , Gina-Anne Levow∗ , Ayman Farahat∗∗ &
Christiaan Royer∗∗
∗ ∗∗
University of Chicago, Palo Alto Research Center
Abstract
Representation of term-document relations is very important for doc-
ument indexing and retrieval. In this paper, we present Generalized
Latent Semantic Analysis as a framework for computing semantically
motivated term and document vectors. Our focus on term vectors
is motivated by the recent success of co-occurrence based measures
of semantic similarity obtained from very large corpora. Our exper-
iments demonstrate that glsa term vectors efficiently capture se-
mantic relations between terms and outperform related approaches
on the synonymy test.
1 Introduction
Document indexing and representation of term-document relations are cru-

cial for document classification, clustering and retrieval (Deerwester et al.
1990, Ponte & Croft 1998, Salton & McGill 1983). Since many classification
and categorization algorithms require a vector space representation for the
data, it is often important to have a document representation within the
vector space model approach (Salton & McGill 1983). In the traditional
bag-of-words document representation (Salton & McGill 1983), words rep-
resent orthogonal dimensions which makes an unrealistic assumption about
their independence within documents.
Since the document vectors are constructed in a very high dimensional
vocabulary space, there has been a considerable interest in low dimensional
document representations. Latent Semantic Analysis (lsa) (Deerwester
et al. 1990) is one of the best known dimensionality reduction algorithms
that deals with synonymy and polysemy. The dimensions of the resulting
vector space are interpreted as latent semantic concepts.
In this paper, we introduce Generalized Latent Semantic Analysis (glsa)
as a framework for computing semantically motivated term and document
vectors. Glsa focuses on computing term vectors in the space of semantic
concepts; document vectors are computed as linear combinations of term
vectors. Thus, unlike lsa and other related approaches glsa is not based
on bag-of-words document vectors. Instead, we begin with semantically mo-
tivated pair-wise term similarities to compute a representation for terms.
46 MATVEEVA, LEVOW, FARAHAT & ROYER
This shift from the dual document-term representation to term representa-

tion has the following motivation:
• Terms offer a greater flexibility in exploring similarity relations than
documents. The availability of large document collections such as the
Web offers a great resource for statistical approaches. Co-occurrence
based measures of semantic similarity between terms improve perfor-
mance on semantic proximity tests, e.g., (Terra & Clarke 2003).
• Content bearing words are often combined into semantic classes which
contain synonyms and semantically related words. It seems natural
to represent terms as low dimensional vectors in the space of semantic
concepts.
In this paper, we use a large document collection to extract point-wise mu-
tual information, and the singular value decomposition as a dimensionality
reduction method and compute term vectors. Our experiments show that
the glsa term representation outperforms related approaches on term-based
tasks such as the synonymy test.
The rest of the paper is organized as follows. Section 2 contains the
outline of the glsa algorithm. Section 4 presents our experiments, followed
by conclusion in section 5.
2 Generalized Latent Semantic Analysis
2.1 GLSA framework

The glsa algorithm has the following setup. We assume that we have a
document collection C with vocabulary V and a large corpus W .
1. Construct the weighted term-document matrix D based on C;
2. For the vocabulary words in V , obtain a matrix of pair-wise similari-
ties, S, using the large corpus W ;
3. Obtain the matrix U T of a low dimensional vector space representation
of terms that preserves the similarities in S, U T ∈ Rk×|V | ;
4. Compute document vectors by taking linear combinations of term
vectors D̂ = U T D.
Cosine similarity between term and document vectors is often used as a
measure of semantic association. Therefore, we would like to obtain term
vectors so that their pair-wise cosine similarities preserve the semantic sim-
ilarity between the corresponding vocabulary terms. The glsa approach
can combine any kind of similarity measure on the space of terms with
any suitable method of dimensionality reduction. The traditional term-
document matrix is used in the last step to provide the weights in the linear
combination of term vectors.
TERM REPRESENTATION WITH GLSA 47
2.2 Low-dimensional representation

2.2.1 Singular value decomposition
The singular value decomposition (svd) is applied to the matrix S con-
taining pair-wise similarities between the vocabulary terms. svd of S is
S = UΣV T , where U and V are column orthogonal matrices containing the
left and right singular vectors of S, respectively. Σ is a diagonal matrix
with the singular values sorted in decreasing order.
Eckart and Young, see Golub & Reinsch (1971), have shown that the ma-
trix Sk = Uk Σk VkT obtained by setting all but the first k diagonal elements
in Σ to zero is the best element-wise approximation of S:
Sk = argminX ||S − X||2F ,
where X is a matrix of rank k. The P minimum is taken with respect to the
Frobenius norm, where ||A||2F = ij A2ij .
Since S is a real symmetric matrix, V = U, so that S = UΣU T . If
all singular values are non-negative, S can be represented as a product of
two matrices, S = Û UˆT , with Û = UΣ1/2 . This means that the entries
of S, which in the glsa case represent pair-wise term similarities, are best
preserved by inner products between the appropriately scaled left singular
vectors of S.
Lsa is a special case within the glsa framework. It uses svd to preserve
the pair-wise term similarities computed as inner products between the term
vectors in the space of bag-of-words documents, see Bartell et al. (1992) for
details.
2.2.2 PMI as measure of semantic association

The matrix of semantic associations between the pairs of vocabulary terms
can be obtained using different collection-based term association measures,
such as point-wise mutual information, likelihood ratio, χ2 test etc. (Man-
ning & Schutze 1999). In this paper we use point-wise mutual information
(pmi) because it has been successfully applied to collocation discovery and
semantic proximity tests such as the synonymy test and taxonomy induction
(Manning & Schutze 1999, Terra & Clarke 2003, Widdows 2003).
Pmi between the random variables representing two words, w1 and w2 ,
is computed as
P (W1 = 1, W2 = 1)
pmi(w1 , w2 ) = log .
P (W1 = 1)P (W2 = 1)
The similarity matrix S with pair-wise pmi scores may not be positive semi-
definite. Since such matrices work well in practice (Cox & Cox 2001) one
Fig. 1: Precision with GLSA, PMI and count over different window sizes,
for the TOEFL (left), TS1 (middle) and TS2 (right) tests
common approach is to use only the eigenvectors corresponding to the pos-

itive eigenvalues (Cox & Cox 2001). This is the approach which we use in
our experiments.
3 Related approaches
Iterative Residual Rescaling (Ando 2000) tries to put more weight on docu-
ments from underrepresented clusters of documents to improve the perfor-
mance of lsa on heterogeneous collections. Locality Preserving Indexing
(lpi) (He et al. 2004) is a graph-based dimensionality reduction algorithm
which preserves the similarities only locally. The Word Space Model (ws)
for word sense disambiguation developed by Schutze (1998) and Latent Re-
lational Analysis (lra) (Turney 2001) are another special cases of glsa
which compute term vectors directly.
These approaches are limited to only one measure of semantic associ-
ation. They use weighted co-occurrence statistics in the document space
(lsa, lpi), in the space of the most frequent informative terms (ws) and in
the space of context patterns (lra) to compute input similarities.
4 Experiments
The goal of the experimental evaluation of the glsa term vectors was to
demonstrate that the glsa vector space representation for terms captures
their semantic relations. We used the synonymy and term pairs tests for
the evaluation.
To collect the co-occurrence information for the matrix of pair-wise term
similarities S, we used a subset of the English Gigaword collection (ldc),
containing New York Times articles labeled as “story”. We had 1,119,364
documents with 771,451 terms. We used the Lemur toolkit1 to tokenize and
index all document collections used in our experiments; we used stemming
and a list of stop words.
The similarities matrix S was constructed using the pmi scores. The
results with likelihood ratio and χ2 were below those for pmi and we omit
them here. We used the pmi matrix S in combination with svd to compute
glsa term vectors. Unless stated otherwise, for the glsa method we report
the best performance over different numbers of embedding dimensions. We
used the plapack package2 to perform the svd.
4.1 Synonymy test

The synonymy test represents a list of words and for each of them, there
are 4 candidate words. The task is to determine which of these candidate
words is a synonym to the word in question. This test was first used to
demonstrate the effectiveness of lsa term vectors (Landauer & Dumais
1997). More recently, the pmi-ir approach developed by Turney was shown
to outperform lsa on this task (Terra & Clarke 2003, Turney 2001).
We evaluated the glsa term vectors for the synonymy test and compared
the results to Terra & Clarke (2003). We used the toefl test containing 80
synonymy questions. We also used the preparation tests called TS1 and TS2.
Since glsa in its present formulation cannot handle multi-word expressions,
we had to modify the TS1 and TS2 tests slightly. We removed all test
questions that contained multi-word expressions. From 50 TS1 questions
we used 46 and from 60 TS2 questions we used 49. Thus, we would like
to stress that the comparison of our results on TS1 and TS2 to the results
reported in Terra & Clarke (2003) is only suggestive. We used the TS1 and
TS2 tests without context. The only difference in the experimental setting
for the toefl test between our experiments and the experiments in Terra
& Clarke (2003) is in the document collections that were used to obtain the
co-occurrence information.
4.1.1 GLSA setting

To have a richer vocabulary space, we added 2000 most frequent words from
the English Gigaword collection to the vocabularies of the toefl, TS1 and
TS2 tests and computed glsa term vectors for the extended vocabularies.
We selected the term t∗ whose term vector had the highest cosine similarity
to the question term vector t~q as the synonym. We computed precision
scores as the ratio of correctly guessed synonyms.
1
http://www.lemurproject.org/
2
http://www.cs.utexas.edu/users/plapack/
The co-occurrence counts can be obtained using either term co-occurrence

within the same document or within a sliding window of a fixed size. In our
experiments we used the window-based approach which was shown to give
better results (Schutze 1998, Terra & Clarke 2003). Since the performance
of co-occurrence based measures is sensitive to the window size, we report
the results for different window sizes.
Fig. 2: Precision at different numbers of GLSA dimensions with the best

window size
4.1.2 Results on the synonymy test

Figure 1 shows the precision using different window sizes. The baselines are
to choose the candidate with the highest co-occurrence count or pmi score.
For all three data sets, glsa significantly outperforms pmi scores computed
on the same collection. The results that we obtained using just the pmi score
are below those reported in Terra & Clarke (2003). One explanation for this
discrepancy is the size and the composition of the document collections used
for the co-occurrence statistics. The English Gigaword collection that we
used is smaller and, more importantly, less heterogeneous than the web
based collection in Terra & Clarke (2003). Nonetheless, on the toefl data
set glsa achieves the best precision of 0.86, which is much better than our
pmi baseline as well as the highest precision of 0.81 reported in Terra &
Clarke (2003). GLSA achieves the same maximum precision as in Terra &
Clarke (2003) for TS1 (0.73) and a much higher precision on TS2 (0.82 vs.
0.75 in Terra & Clarke (2003)).
Figure 2 shows the precision for the glsa terms only, using different
number of dimensions. svd-based approaches usually perform best with
300-400 resulting dimensions (Deerwester et al. 1990). The variation of
precision at different numbers of glsa embedding dimensions within the
100-600 range is somewhat high for TS1 but much smoother for the toefl
and TS2 tests.
top 100 top 1000 top 100 top 1000

k pmi glsa pmi glsa k pmi glsa pmi glsa
1 0.0 1.0 0.0 1.0 1 0.27 0.67 0.08 0.43
5 0.0 1.0 0.0 1.0 5 0.40 0.48 0.8 0.40
10 0.0 1.0 0.0 0.8 10 0.35 0.37 0.1 0.37
50 0.32 0.88 0.12 0.8 50 0.14 0.13 0.16 0.20
100 0.24 0.76 0.1 0.8 100 0.08 0.08 0.16 0.18
Table 1: Precision for the term Table 2: Average precision for
pairs test at the top k most the term pairs test at the top k
similar pairs nearest words
4.2 Term pairs test

Some of the terms on the synonymy test are infrequent (e.g., “wig”) or
not informative (e.g., “unlikely”). We used the following test to evaluate
how the cosine similarity between glsa vectors captures similarity between
content words.
We computed glsa term vectors for the vocabulary of the 20 news
groups document collection. Using the Rainbow software3 we obtained the
top N words with the highest mutual information with the class label. We
also obtained the probabilities that each of these words has with respect to
each of the news groups. We assigned the group in which the word has the
highest probability as the word’s label. Some of the top words and their
labels can be seen in Table 3. Although the way we assigned labels may
not strictly correspond to the semantic relations between words, this table
shows that for this particular collection and for informative words (e.g.,
“bike”, “team”) they do make sense.
We computed pair-wise similarities between the top N words using the
cosine between the glsa vectors representing these words and also used just
the pmi scores. Then we looked at the pairs of terms with the highest sim-
ilarities. Since for this test we selected content bearing words, the intuition
is that most similar words should be semantically related and are likely to
appear in documents belonging to the same news group. Therefore, they
should have the same label.
In the synonymy task the comparisons are made between the pmi scores
of a few carefully selected terms that are synonymy candidates for the same
word. While pmi-ir performs quite well on the synonymy task, it is in
general difficult to compare pmi scores across different pairs of words. Our
experiments illustrate that glsa significantly outperforms the pmi scores
on this test. We used N = {100, 1000} top words by the MI with the class
3
http://www-2.cs.cmu.edu/ mccallum/bow/rainbow/
word nn=1 nn=2 nn=3 Prec

god (18) jesus (18) bible (18) heaven (18) 1
bike (15) motorcycle (15) rider (15) biker (15) 1
team (17) coach (17) league (20) game (17) 0.6
car (7) driver (1) auto (7) ford (7) 0.6
windows (1) microsoft (1) os (3) nt (1) 0.4
dod (15) agency (10) military (13) nsa (10) 0
article (15) publish (13) fax (4) contact (5) 0
Table 3: Precision at the 5 nearest terms for some of the top 100 words by
mutual information with the class label. The table also shows the first 3
nearest neighbors. The word’s label is given in the brackets (1=os.windows;
3=hardware; 4=graphics; 5=forsale; 7=autos; 10=crypt;
13=middle-east;15=motorcycles; 17=hokey; 18=religion-christian;
20=baseball)
label. The top 100 are highly discriminative with respect to the news group
label whereas the top 1000 words contain many frequent words. Our results
show that glsa is much less sensitive to this than pmi.
First, we sorted all pairs of words by similarity and computed precision
at the k most similar pairs as the ratio of word pairs that have the same
label. Table 1 shows that glsa significantly outperforms the pmi score.
pmi has very poor performance, since here the comparison is done across
different pairs of words.
The second set of scores was computed for each word as precision at
the top k nearest terms, similar to precision at the first k retrieved docu-
ments used in information retrieval. We report the average precision values
for different values of k in Table 2. Glsa achieves higher precision than
pmi. Glsa performance has a smooth shape peaking at around 200-300 di-
mension which is in line with results for other svd-based approaches. The
dependency on the number of dimensions was the same for the top 1000
words.
In Table 3 we show results for some of the words. Glsa representation
achieves very good results for terms that are good indicators of particular
news groups, such as “god” or “bike”. For frequent words, and words which
have multiple senses, such as “windows” or “article”, the precision is lower.
The pair “car”, “driver” is semantically related for one sense of the word
“driver”, but the word “driver” is assigned to the group “windows-os” with
a different sense.
5 Conclusion and future work
Our experiments have shown that the cosine similarity between the glsa
term vectors corresponds well to the semantic similarity between pairs of
terms. Interesting questions for future work are connected to the computa-
tional issues. As other methods based on a matrix decomposition, glsa is
limited in the size of vocabulary that it can handle efficiently. Since terms
can be divided into content-bearing and function words, glsa computa-
tions only have to include content-bearing words. Since the glsa document
vectors are constructed as linear combinations of term vectors, the inner
products between the term vectors are implicitly used when the similarity
between the document vectors is computed. Another interesting extension
is therefore to incorporate the inner products between glsa term vectors
into the language modelling framework and evaluate the impact of the glsa
representation on the information retrieval task.
We have presented the glsa framework for computing semantically mo-
tivated term and document vectors. This framework allows us to take ad-
vantage of the availability of large document collections and recent research
of corpus-based term similarity measures and combine them with dimen-
sionality reduction algorithms. Using the combination of point-wise mutual
information and singular value decomposition we have obtained term vec-
tors that outperform the state-of-the-art approaches on the synonymy test
and show a clear advantage over the pmi-ir approach on the term pairs test.
Acknowledgements. We are very grateful to Paolo Bientinesi for his ex-

tensive help with adopting the plapack package to our problem. The toefl
questions were kindly provided by Thomas K. Landauer, Department of Psychol-
ogy, University of Colorado. This research has been funded in part by contract
#MDA904-03-C-0404 to Stuart K. Card and Peter Pirolli from the Advanced
Research and Development Activity, Novel Intelligence from Massive Data pro-
gram.
REFERENCES
Ando, R.K. 2000. “Latent Semantic Space: Iterative Scaling Improves Preci-
sion of Inter-Document Similarity Measurement”. Proceedings of the 23th
annual international ACM SIGIR conference on Research and development
in information retrieval (SIGIR’00 ), 216-223. Athens, Greece.
Bartell, B.T., G.W. Cottrell, & R.K. Belew. 1992. “Latent semantic indexing is
an optimal special case of multidimensional scaling”. Proceedings of the 15th
annual international ACM SIGIR conference on Research and development
in information retrieval (SIGIR’92 ), 161-167. Copenhagen, Denmark.
Cox, T.F. & M.A. Cox. 2001. Multidimensional Scaling. London: CRC/Chapman
and Hall.
Deerwester, S.C., S.T. Dumais, T.K. Landauer, G.W. Furnas & R.A. Harshman.
1990. “Indexing by Latent Semantic Analysis”. Journal of the American
Society of Information Science 41:6.391-407.
Golub, G. & C. Reinsch. 1971. Handbook for Matrix Computation II, Linear
Algebra. New York: Springer-Verlag
He, X., D. Cai, H. Liu & W. Ma. 2004. “Locality preserving indexing for doc-
ument representation”. Proceedings of the 27th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval
(SIGIR’04 ), 96-103. Sheffield, U.K.
Landauer, T.K., S.T. Dumais. 1997. “A Solution to Platos Problem: The Latent
Semantic Analysis Theory of the Acquisition, Induction, and Representation
of Knowledge”. Psychological Review 104:2.211-240.
Manning, C. & H. Schutze. 1999. Foundations of Statistical Natural Language
Processing. Cambridge, Mass.: MIT Press.
Ponte, J.M. & W.B. Croft. 1998. “A language modeling approach to information
retrieval”. Proceedings of the 21th Annual International ACM SIGIR Con-
ference on Research and Development in Information Retrieval (SIGIR’98 ),
275-281. Melbourne, Australia.
Salton, G. & M.J. McGill. 1983. Introduction to Modern Information Retrieval.
New York: McGraw-Hill.
Schutze, H. 1998. “Automatic Word Sense Discrimination”. Computational Lin-
guistics 24:21.97-124
Terra, E.L. & C.L. Clarke. 2003. “Frequency Estimates for Statistical Word
Similarity Measures”. Proceedings of the Joint Human Language Technology
Conference and the Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics (HLT-NAACL’2003 ), vol. 1, 165-172.
Edmonton, Canada.
Turney, Peter D. 2001. “Mining the Web for Synonyms: PMI-IR versus LSA
on TOEFL”. Machine Learning: ECML 2001 — Proceedings of the 12th
European Conference on Machine Learning, Freiburg, Germany ed. by Luc
de Raedt & Peter Flach (= Lecture Notes in Artificial Intelligence, 2167),
491-502. Heidelberg: Spinger.
Widdows, D. 2003. “A Mathematical Model for Context and Word-Meaning”.
Proceedings of the 4th International and Interdisciplinary Conference on Mod-
eling and Using Context, 23-25. Stanford, Calif.
Multilingual Dependency Parsing:
A Pipeline Approach
Ming-Wei Chang, Quang Do & Dan Roth
University of Illinois at Urbana-Champaign
Abstract
This paper develops a general framework for machine learning based
dependency parsing based on a pipeline approach, where a task is
decomposed into several sequential stages. To overcome the error
accumulation problem of pipeline models, we propose two natural
principles for pipeline frameworks: (i) make local decisions as reli-
able as possible, and (ii) reduce the number of sequential decisions
made. We develop an algorithm that provably satisfies these princi-
ples and show that the proposed principles support several algorith-
mic choices that improve the dependency parsing accuracy signifi-
cantly. We present state of the art experimental results for English
and several other languages.1
1 Introduction
A pipeline process over the decisions of learned classifiers is a common com-

putational strategy in natural language processing. In this model a task is
decomposed into several stages that are solved sequentially, where the com-
putation in the ith stage typically depends on the outcome of computations
done in previous stages. For example, a semantic role labeling program
(Punyakanok et al. 2005) may start by using a part-of-speech tagger, then
apply a shallow parser to chunk the sentence into phrases, identify predi-
cates and arguments and then classify them to types. In fact, any left to
right processing of an English sentence may be viewed as a pipeline com-
putation as it processes a token and, potentially, makes use of this result
when processing the token to the right.
The pipeline model is a standard model of computation in natural lan-
guage processing for good reasons. It is based on the assumption that some
decisions may be easier to make and that the outcome of earlier decision
may sometimes be useful when making further decisions. Nevertheless, it is
clear that it results in error accumulation and suffers from its inability to
correct mistakes made in previous stages. Researchers have recently started
to address some of the disadvantages of this model. Roth & Yih (2004)
suggests a model in which global constraints are taken into account in a
1
This paper extends and unifies our previous works (Chang et al. 2006a, 2006b).
56 MING-WEI CHANG, QUANG DO & DAN ROTH
later stage to fix mistakes due to the pipeline. Punyakanok et al. (2005)
and Marciniak & Strube (2005) also address some aspects of this problem.
However, these solutions rely on the fact that all decisions are made with
respect to the same input; specifically, all classifiers considered use the same
examples as their input. In addition, the pipelines they study are shallow.
This paper develops a general framework for decisions in pipeline models
which addresses these difficulties. Specifically, we are interested in deep
pipelines—a large number of predictions that are being chained.
A pipeline process is one in which decisions made in the ith stage (1) de-
pend on earlier decisions and (2) feed on input that depends on earlier de-
cisions. The latter issue is especially important at evaluation time since, at
training time, a gold standard data set might be used to avoid this issue.
We develop and study the framework in the context of a bottom up ap-
proach to dependency parsing. We suggest that two principles should guide
the pipeline algorithm development:
(i) Make local decisions as reliable as possible.
(ii) Reduce the number of decisions made.
Using these as guidelines we devise an algorithm for dependency parsing,
prove that it satisfies these principles, and show experimentally that this
improves the accuracy of the resulting tree.
Specifically, our approach is based on a shift-reduced parsing as in (Ya-
mada & Matsumoto 2003). Our general framework provides insights that
allow us to improve their algorithm, and to principally justify some of the
algorithmic decisions. Specifically, the first principle suggests to improve
the reliability of the local predictions, which we do by improving the set of
actions taken by the parsing algorithm, and by using a look-ahead search.
The second principle is used to justify the control policy of the parsing algo-
rithm, namely, which edges to consider at different stages in the algorithm.
We prove that our control policy is optimal in some sense and, overall,
that the decisions we made guided by these principles lead to a significant
improvement in the accuracy of the resulting parse tree.
1.1 Dependency parsing and pipeline models

Dependency trees provide a syntactic representation that encodes functional
relationships between words; it is relatively independent of the grammar
theory and can be used to represent the structure of sentences in differ-
ent languages. Dependency structures are more efficient to parse (Eisner
1996) and are believed to be easier to learn, yet they still capture much of
the predicate-argument information needed in applications (Haghighi et al.
2005), which is one reason for the recent interest in learning these structures
(Lin 1994, Eisner 1996, Yamada & Matsumoto 2003, Nivre & Scholz 2004,
Mcdonald et al. 2005).
MULTILINGUAL DEPENDENCY PARSING 57
Eisner’s work—O(n3 ) parsing time generative algorithm—embarked the in-

terest in this area. His model, however, seems to be limited when dealing
with complex and long sentences. Mcdonald et al. (2005) build on this work,
and use a global discriminative training approach to improve the edges’
scores, along with Eisner’s algorithm, to yield an expected improvement.
A different approach was studied by Yamada & Matsumoto (2003), that
develop a bottom-up approach and learn the parsing decisions between con-
secutive words in the sentence. Local actions are used to generate a depen-
dency tree using a shift-reduce parsing approach (Aho et al. 1986). This
is a true pipeline approach in that the classifiers are trained on individual
decisions rather than on the overall quality of the parser, and chained to
yield the global structure. Conceptually similar work was done in other
successful parsers, e.g., (Ratnaparkhi 1997). Clearly, this approach suffers
from the limitations of pipeline processing, such as accumulation of er-
rors, but nevertheless, yields very competitive parsing results. A somewhat
similar approach was used in (Nivre & Scholz 2004) to develop a hybrid
bottom-up/top-down approach; there, the edges are also labeled with se-
mantic types, yielding lower accuracy than the works mentioned above.
The overall goal of dependency parsing (DP) learning is to infer a tree
structure. An example of a dependency tree is shown in Figure 1.
John gives me a beautiful puppy .
Fig. 1: An example of a dependency tree
A common way to do that is to predict with respect to each potential

edge (i, j) in the tree, and then choose a global structure that (1) is a tree
and that (2) maximizes some score. The first provides a set of structural
constraints on the resulting structure, and the second provides a principled
way to choose among (possibly many) resulting trees. In the context of DPs,
this “edge based factorization method” was proposed by (Eisner 1996). In
other contexts, this is similar to the approach of (Roth & Yih 2004) in that
scoring each edge depends only on the raw data observed and not on the
classifications of other edges, and that global considerations can be used to
overwrite the local (edge-based) decisions.
The key in a pipeline model is that making a decision with respect to the
edge (i, j) may gain from taking into account decisions already made with
respect to neighboring edges. However, given that these decisions are noisy,
there is a need to devise policies for reducing the number of predictions in
order to make the parser more robust. This is exemplified in (Yamada &
Matsumoto 2003)—a bottom-up approach, that is most related to the work

presented here. Their model is a “traditional” pipeline model—a classifier
suggests a decision that, once taken, determines the next action to be taken
(as well as the input the next action observes).
For many languages such as English, Chinese and Japanese (with a few
exceptions), DPs without edge crossings, called projective dependency trees,
are sufficient to analyze most sentences. Even for non-projective languages,
usually the proportion of non-projective edges is low (Buchholz & Marsi
2006). As is common in most earlier work on DP, the work described here
is also concerned mainly with projective trees. However, we also discuss
an extension that deals with non-projective trees by converting them into
projective trees (Nivre & Nilsson 2005).
In the rest of this paper, we propose and justify a framework for improv-
ing pipeline processing based on the principles mentioned above: (i) make
local decisions as reliably as possible, and (ii) reduce the number of deci-
sions made. We use the proposed principles to examine the (Yamada &
Matsumoto 2003) parsing algorithm and show that the framework suggests
modifying some of the decisions made there and, consequently, results in
better overall dependency trees. After introducing the task and our general
approach, Section 2 formally defines the model we consider, the pipeline
dependency parsing approach and its properties. Our experimental results
on English are presented in Section 3. In Section 4 we discuss a few exten-
sions to the model, and experimental results on other languages. Section 5
concludes and discusses some future directions.
2 Dependency parsing as a pipeline model
This section describes our dependency parsing algorithm and justifies its
advantages by viewing and analyzing it as a pipeline model.
Definition 1 For words x, y in a sentence T , we introduce the following

notation:
x → y : x is the direct parent of y.
x →∗ y: x is an ancestor of y.
x ↔ y : x → y or y → x.
x < y : x is to the left of y in T .
We now introduce the concept of a projective language (Nivre 2003):
Definition 2 (Projective language)

∀a, b, c ∈ T, a ↔ b and a < c < b imply that a →∗ c or b →∗ c.
2.1 A pipeline dependency parsing algorithm

Our parsing algorithm is a modified shift-reduce parser (Aho et al. 1986)
that applies a set of actions (described below) in a left to right manner on
consecutive pairs of words (a, b) (a < b) in a sentence. A machine learning
algorithm is used to determine what actions (namely, parsing decisions)
to take between two candidate words in the sentence, and a tree is thus
constructed in a bottom up manner. The basic actions used in this model,
as in (Yamada & Matsumoto 2003), are:
Shift: there is no relation between a and b, or the action is deferred because
the child word of a and b still has some unlinked children. For example,
if a is the parent of b and b has a child to the right of b, we must defer
the action.
Right: b is the parent of a (and a is thus eliminated from further consider-
ation).
Left: a is the parent of b (and b is thus eliminated from further considera-
tion).
L R S
X
a b a b a b
Fig. 2: This figure demonstrates the basic parsing actions:

Left, Right and Shift
These three actions are illustrated in Figure 2. It is important to note that

one word (the child word) will be eliminated from T after performing Left
or Right. We remove the child word because we already found its parent.
This is a true pipeline approach in that (1) the classifiers are trained on
individual decisions rather than on the overall quality of the parse tree, and
are chained to yield a global structure; and consequently, decisions made
with respect to a pair of words affect what pair of words is considered next
by the algorithm.
In order to complete the description of the algorithm we need to describe
which edge to consider once an action is taken. We describe it via the notion
of the focus point:
Definition 3 (Focus point) When the algorithm considers the pair (a, b),
a < b, we call the word a the current ‘focus point’.
Next we describe several policies for determining the focus point of the algo-
rithm following an action. We note that, with a few exceptions, determining
the focus point does not affect the correctness of the algorithm. It is easy
to show that for (almost) any focus point chosen, if the correct action is
selected for the corresponding word pair, the algorithm will eventually yield
the correct tree (but may require multiple cycles through the sentence). In
practice, the actions selected are noisy, and a wasteful focus point policy
will result in a large number of actions, and thus in error accumulation. To
minimize the number of actions taken, we want to find a good focus point
placement policy.
We always let the focus point move one word to the right after S. After
L or R there are several natural placement policies to consider for the focus
point:
Start Over: Move focus point to the first word in T .
Stay: Move focus point to be the next word to the right. That is, for
T = (a, b, c), and the focus point is a, an L action results in a as the
focus point (and the pair considers (a, c), while R action results in the
focus being b (and the pair considered (b, c)).
Step Back: The focus point moves to the previous word (on the left). That
is, for T = (a, b, c), and the focus point is b, in both cases, a will be the
focus point (and the pair considered are (a, c) in case of a R action,
and (a, b) in case of a L).
In practice, different placement policies have a significant effect on the num-
ber of pairs considered by the algorithm and, therefore, on the final accu-
racy2 .
The following analysis justifies the Step Back policy. We claim that
if Step Back is used, the algorithm will not waste any action. Thus, it
achieves the goal of minimizing the number of actions taken in a pipeline
algorithm. Notice that using this policy, when L is taken, the pair (a, b) is
reconsidered, but with new information, since now it is known that c is the
child of b. Although this seems wasteful, we will show in Lemma 1 that this
movement is necessary in order to reduce the number of actions.
As mentioned above, each of these policies yields the correct tree. Ta-
ble 1 provides an experimental comparison of the three policies in terms of
the number of actions required to build a tree. As a function of the focus
point placement policy, and using the correct action for each pair considered,
it shows the number of word pairs considered in the process of generating all
the trees for section 23 of Penn Treebank. It is clear from Table 1 that the
policies result in very different number of actions and that Step Back is
the best choice. Note that since the actions are the gold-standard actions,
2
Note that Yamada & Matsumoto 2003) mention that they move the focus point back
after R, but do not state what they do after executing L actions, and why. Yamada
(p.c. 2006) indicates that they also moved the focus point back after L.
Policy #Shift #Left #Right

Start over 156545 26351 27918
Stay 117819 26351 27918
Step back 43374 26351 27918
Table 1: The number of actions required to build all the trees for the
sentences in section 23 of Penn Treebank (Marcus et al. 1993 ) as a
function of the focus point placement policy. The statistics are taken with
the correct (gold-standard ) actions
the policy affects only the number of S actions used, and not the L and
R actions, which are a direct function of the correct tree. The number of
required actions in the test stage (when the actions taken are based on a
learned classifier) shows the same trend and consequently the Step Back
also gives the best dependency accuracy. The parsing algorithm with this
policy is given Algorithm 2. Note that a good policy should always try to
consider a child before considering its parent to save the number of deci-
sions. As we will show later, the Step Back has several nice properties to
be considered as a good policy.
2.2 Correctness and pipeline properties
We can prove several properties of our algorithm. First we show that the
algorithm builds the dependency tree in only one pass over the sentence.
Then, we show that the algorithm does not waste actions in the sense that it
never considers a word pair twice in the same situation. Consequently, this
shows that under the assumption of a perfect action predictor, our algorithm
makes the smallest possible number of actions, among all algorithms that
build a tree sequentially in one pass. Finally, we show that the algorithm
only requires linear number of actions. Note that this may not be true if
the action classifier is not perfect, and one can contrive examples in which
an algorithm that makes several passes on a sentence can actually make
fewer actions than a single pass algorithm. In practice, however, as our
experimental data shows, this is unlikely.
The following analysis is made under the assumption that the action
classifiers are perfect. While this assumption does not hold in practice, it
makes the following analysis tractable, and still gives a very good insight
into the practical properties of our algorithm, given that, as we show the
action classifier is quite accurate, even if not perfect.
For example, in Lemma 1, we show that the algorithm requires only
one round to parse a sentence under this assumption. In practice, in the
evaluation stage, we allow the parsing algorithm to run multiple rounds.
That is, when the focus point reaches the last word in T , we will set the
Algorithm 2 Dependency parsing with Step Back policy. getFeatures

extracts the features describing the word pair currently considered; getAc-
tion determines the appropriate action for the pair; assignParent assigns a
parent for the child word based on the action; and deleteWord deletes the
child word in T at the focus once the action is taken. Note that if other
policies are used, the algorithm needs to set f ocus = 0 after f ocus = T if
the tree is not completed.
t: a word token
focus: index into the sentence
T = {t1 , t2 , . . . , t|T | }: sentence
f ocus = 1
while f ocus < |T | do
~v = getF eatures(tf ocus , tf ocus+1)
α = getAction(tf ocus , tf ocus+1, ~v )
if α = L or α = R then
assignP arent(tf ocus , tf ocus+1, α)
deleteW ord(T, f ocus, α)
// performing Step Back here
if f ocus 6= 1 then f ocus = f ocus − 1
else
if f ocus 6= 1 then f ocus = f ocus − 1
focus point at the beginning of the sentence if the tree is not complete. We
found that over 99% of the sentences can be parsed in a single round and
that, the average number of rounds needed to parse a sentence (average
taken over all sentences in the test corpus) is 1.01. Therefore, we believe
that the following analysis, done, under the assumption of golden actions
can still be quite useful.
Lemma 1 For projective languages, the dependency parsing algorithm that

uses the Step Back policy completes the tree when it reaches the end of the
sentence for the first time. That is, when the focus point is at the last word
of T , the algorithm completes the tree.
Before providing our argument, it is important to recall that one child word
will be eliminated from T after performing Left or Right. The Step Back
policy and this elimination procedure are the key points in this lemma.
Note that in Algorithm 2, we use the focus point to choose the word
pair. Since we move the focus point in the word vector T , we can only
consider “consecutive” pairs in T . Note that the elimination after Left or
Right will generate new “consecutive” word pairs in T .
In order to prove the lemma, we need the following definition. We call a

pair of words (a, b) a free pair if and only if there is a relation between a
and b and the algorithm can perform either a L or a R actions on that
pair when it is considered. There are two requirement so that we can say
that the algorithm can perform either L or R. First, the word pair must
be consecutive in the currently considered T . Second, the action can be
perform right now (we do not need to defer it). Formally,
Definition 4 (Free pair) A pair (a, b) considered by the algorithm is a

free pair, if it satisfies all the following conditions:
1. a ↔ b
2. a, b are consecutive in T (but not necessarily in the original sentence).
3. If a → b, no other word in T is the child of b. If b → a, no other word
in T is the child of a.
Proof: It is easy to see that there is at least one free pair in T , with |T | > 2.
The reason is that if no such pair exists, there must be three words {a, b, c}
s.t. a ↔ b, a < c < b and ¬(a → c ∨ b → c). However, this violates the
properties of a projective language.
Now assume {c, a, b} are three consecutive words in T . We claim that
when using Step Back, the focus point is always to the left of all free pairs
in T . This is clearly true when the algorithm starts. Assume that (a, b)
is the first free pair in T and let c be just to the left of a and b. Then,
the algorithm will not make a L or R action before the focus point meets
(a, b), and will make one of these actions then. It’s possible that (c, a ∨ b)
becomes a free pair after removing a or b in T so we need to move the focus
point back. However, we also know that there is no free pair to the left of
c. Therefore, during the algorithm, the focus point will always remain to
the left of all free pairs. So, when we reach the end of the sentence, every
free pair in the sentence has been taken care of, and the sentence has been
completely parsed. ✷
Lemma 2 All actions made by a dependency parsing algorithm that uses

the Step Back policy are necessary. Specifically, a pair (a, b) will never be
considered again unless there is an additional child linked to a or b.
Proof: Note that if R or L is taken, either a or b will become a child word
and be eliminated from further considerations by the algorithm. Therefore,
if the action taken on (a, b) is R or L, this pair of words will never be
considered again.
If S is taken, it is possible to consider the word pair (a, b) again. From
Lemma 1, the algorithm will complete the tree in one round and a pair (a, b)
will be considered again only if the focus point “steps back”. Therefore, we
consider two cases here. First, if the next action followed S is L, b will get a
new child and the algorithm will step back and consider (a, b) again. Since
b gets new information, it is reasonable to check (a, b) again to see if the
algorithm can recover the deferred action or not. Second, if the next action
followed S is R, b will be eliminated from T and (a, b) will not be considered
again. ✷
The next lemma claims a stronger property: the algorithm requires only
O(n) actions, where n is the length of the input sentence.
Lemma 3 For a sentence T of size n, the number of actions used by a

dependency parsing algorithm with the Step Back policy is bounded by 3n.
Proof: In order to calculate the number of actions, we consider the number
of word pairs processed by the algorithms, taking into account word pairs
that may be processed multiple times. At the beginning of the algorithm,
the algorithm only has n − 1 word pairs since it can only consider two
consecutive words. Every time the algorithm performs an action other than
S, some words will be eliminated and new word pairs will be generated.
Furthermore, since a word gets new information (a new child), some word
pairs may be considered again.
From Lemma 2, we know that a word pair will be considered again only
if one of the words has a new child connected to it. Therefore, we only need
to calculate the number of newly generated word pairs and the number of
word pairs needed to be reconsidered.
Assume, w.l.o.g, that there are four words (a, b, c, d) and the focus point
is at b, that is, the algorithm is considering (b, c). If the algorithm performs
R, b becomes c’s child and (a, c) becomes a new word pair. We also need
to consider (c, d) (again, if we considered it before) since now c gets more
information. Similarly, if L is performed, we need to reconsider (a, b) for the
new information and examine the new word pair (b, d). In conclusion, after
performing every L or R action we may need to (re)consider two additional
word pairs.
Let m be the total number of word pairs which are needed to be consid-
ered. The previous discussion shows that we have:
m ≤ n − 1 + 2l + 2r
where l and r are the number of L and R actions taken, respectively. Note
that l + r ≤ n − 1 since the sentence has only n words. Therefore, m ≤ 3n
and the total number of actions is bounded by 3n. ✷
2.3 Improving the parsing action set

So far we considered the “standard” set of three actions. While the L and
R actions have a clear meaning, the S action covers several cases. When
we use machine learning techniques to learn when to perform each action,
this may result in an inaccurate decision with respect to this action. In
order to improve the accuracy of the action predictors, we suggest a new
(hierarchical) set of actions: Shift, Left, Right, WaitLeft, WaitRight. We
believe that predicting these is easier due to finer granularity – the S action
is broken into sub-actions in a natural way. The new actions have the
following semantics:
WaitLeft: a < b. a is the parent of b, but it’s possible that b is a parent of
other nodes. Action is deferred. Notice that if we mistakenly perform
Left instead, the child of b can not find its parents later.
WaitRight: a < b. b is the parent of a, but it’s possible that a is a parent
of other nodes. Similar to WL, action is deferred.
Consequently, we also change the meaning of the action S to take effect
only if there is no relationship between a and b3 . The new set of actions
is shown to better support our parsing algorithm, when tested on different
placement policies. When WaitLeft or WaitRight is performed, the focus
will move to the next word, as in S.
It is interesting to note that WaitRight is not needed in projective lan-
guages if Step Back is used. This gives another strong justification to use
Step Back, since the action classification becomes more accurate—a more
natural class of actions, with a smaller number of candidate actions. We
show this property formally below.
Lemma 4 If we apply the algorithm on projective languages, the WaitRight
action is not necessary if the algorithm uses Step Back policy.
Proof: Assume w.l.o.g. that the target sentence is (a, b, c) with b as the
focus word, and that the algorithm needs to perform WaitRight. That it, c
is the parent of b and b is the parent of other words (for example, a). This
implies that there is a free pair to the left of the focus point, contradicting
Lemma 1. Note that Lemma 1 still holds since WaitRight and WaitLeft
behave as Shift. Therefore, the algorithm never needs to use the WaitRight
action. ✷
2.4 Improving the actions’ accuracy: A pipeline model with look ahead
Once the parsing algorithm, along with the focus point policy, is determined,
it only remains to determine a way to choose an action. We do this using
3
Interestingly, Yamada & Matsumoto (2003) mention the possibility of an additional
single Wait action, but it is not added it to the model considered there.
a machine learning algorithm. Given an annotated corpus we use of the

parsing algorithm to determine the action needed for each consecutive pair;
this is used to train a classifier to predict one of the possible actions. The
details of the classifier and the feature used are given in Section 3.
When the learned model is evaluated on new data, the sentence is pro-
cessed left to right and the parsing algorithm, along with the action clas-
sifier, are used to produce the dependency tree. The parsing process is
somewhat more involved, since the action classifier is not used as is, but
rather via a look ahead inference step described next.
The advantage of a pipeline model is that it can use more information,
based on the outcomes of previous predictions. As discussed earlier, this
may result in error accumulation. The importance of having a reliable
action predictor in a pipeline model motivates the following approach. We
devise a look ahead algorithm and use this policy as a way to determine
the predicted action more accurately. This approach can be used in any
pipeline model but we illustrate it below in the context of our dependency
parser.
The following example illustrates a situation in which an early mistake in
predicting an action causes a chain reaction and results in further mistakes.
This stresses the importance of correct early decisions, and motivates our
look ahead policy.
W X Y Z
W Y Z
Fig. 3: Top: correct dependency relations between w, x, y and z. Bottom:

if the algorithm mistakenly decides that x is a child of w before deciding
that y and z are x’s children, we cannot find the correct parent for y and z
Let (w, x, y, z) be a sentence of four words, and assume that the correct
dependency relations are as shown in the top part of Figure 3. If the sys-
tem mistakenly predicts that x is a child of w before y and z becomes x’s
children, we can only consider the relationship between w and y in the next
stage. Consequently, we will never find the correct parent for y and z. The
previous prediction error propagates and impacts future predictions. On
the other hand, if the algorithm makes a correct prediction, in the next
Algorithm 3 Look ahead algorithm. y represents an action sequence. The

function search considers all possible action sequences with |depth| actions
and returns the sequence with the highest score.
Algo predictAction(Model, Depth, State)
x = getNextFeature(State)
y = search(x, Depth, Model, State)
lab = y[1]
State = update(State, lab)
return lab
Algo search(x, Depth, Model, State)

maxScore = −∞
F = {y | kyk = Depth}
for y in F do
s = 0, TmpState = State
for i = 1 . . . Depth do
x = getNextFeature(TmpState)
s = s+ score(y[i], x, Model )
TmpState = update(TmpState, y[i])
if s > maxScore then
ŷ = y
maxScore = s
return ŷ
stage, we do not need to consider w and y. As shown, getting useful rather

than misleading information in a pipeline model, requires correct early pre-
dictions. Therefore, it is advised to utilize some inference framework that
may help minimizing the error accumulation problem.
In order to improve the accuracy of the action prediction, we might
want to examine all possible combinations of action sequences and choose
the one that maximizes some score. It is clearly intractable to find the
global optimal prediction sequences in a pipeline model of the depth we
consider. Therefore, we use a look ahead strategy, implemented via a local
search framework, which uses additional information but is still tractable.
The look-ahead search algorithm is presented in Algorithm 3. The al-
gorithm accepts three parameters, Model, Depth and State:
Model : is our learning model—the classifier that is used to predict the ac-
tion. We assume a classifier that can give a confidence in its prediction
(see Section 3).
Depth: The parameter Depth determines the depth of the search procedure.
State: encodes the configuration of the environment (in the context of the
dependency parsing this includes the sentence, the focus point and the
current parent and children for each word). Note that State changes
when a prediction is made and that the features extracted for the
action classifier also depend on State.
The search algorithm will perform a search of length Depth. Additive scoring
is used to score the sequence, and the first action in this sequence is selected
and performed. Then, the State is updated, the new features for the action
classifiers are computed and search is called again.
One interesting property of this framework is that it allows the use of
future information in addition to past information. The pipeline model
naturally allows access to all the past information. Since the algorithm uses
a look ahead policy, it also uses future predictions. The significance of this
becomes clear in Section 3. There are several parameters, in addition to
Depth that can be used to improve the efficiency of the framework. For
example, given that the action predictor is a multi-class classifier, we do
not need to consider all future possibilities in order to determine the current
action. In our experiments, we only consider two actions with highest score
at each level (which was shown to produce almost the same accuracy as
considering all four actions). Note that our look ahead search is a local
search so the depth is usually small. Furthermore, we can keep a cache of
the search paths of the look-ahead search for the i-th action. These paths
can be reused in the look-ahead search for the (i + 1)-th action. Therefore,
our look ahead search is not very expensive.
3 Experimental study
In this section, we describe a set of experiments done to investigate the
properties of our algorithm and evaluate it. All the experiments here were
done on English. Results on other languages are described in Section 4.
We use the standard corpus for this task, the Penn Treebank (Marcus
et al. 1993). The training set consists of sections 02 to 21 and the test set
is section 23. The pos tags for the evaluation data sets were provided by
the tagger of (Toutanova et al. 2003) (which has an accuracy of 97.2% on
section 23 of the Penn Treebank).
3.1 Learning algorithm

As our learning algorithm we use a regularized variation of the perceptron
update rule, as incorporated in SNoW (Roth 1998, Carlson et al. 1999)4, a
multi-class classifier that is tailored for large scale learning tasks and has
4
The package can be downloaded from http://L2R.cs.uiuc.edu/∼cogcomp/software.php
been used successfully in a large number of NLP tasks, e.g., (Punyakanok

et al. 2005). SNoW uses softmax over the raw activation values as its con-
fidence measure, which can be shown to produce a reliable approximation
of the labels’ conditional probabilities. The softmax score can be written
as follows:
eactlab
score(x, lab) = pi = P acti
1≤i≤n e
where acti means the normalized activation value for i-th class of feature
vector x. The activation value is directly come from our machine learning
model.
3.2 Features
For each word pair (w1 , w2 ) we use the words, their pos tags and also these
features of the children of w1 and w2 . We also include the lexicon and pos
tags of 2 words before w1 and 4 words after w2 (as in (Yamada & Mat-
sumoto 2003)). The key additional feature we use, relative to (Yamada &
Matsumoto 2003), is that we include the previous predicted action as a fea-
ture. We also add conjunctions of above features to ensure expressiveness
of the model. Yamada & Matsumoto (2003) make use of polynomial ker-
nels of degree 2 which is equivalent to using even more conjunctive features
(Cumby & Roth 2003). Overall, the average number of active features in
an example is about 50.
3.3 Evaluation methodology

We use the same evaluation metrics as in (Mcdonald et al. 2005). Depen-
dency accuracy is the proportion of non-root words that are assigned the
correct head. Sentence accuracy indicates the fraction of sentences that
have a complete correct analysis. We also measure the root accuracy and
leaf accuracy, as in (Yamada & Matsumoto 2003). When evaluating the
result, we exclude the punctuation marks as done in (Mcdonald et al. 2005)
and (Yamada & Matsumoto 2003). The punctuation include comma, colon,
dot, quotation, and double quotation. All other symbols are scoring tokens.
3.4 Results
We first present the results of several of the experiments that were in-
tended to help us analyze and understand some of the design decisions
in our pipeline algorithm.
To see the effect of the additional action, we present in Table 2 a com-
parison between a system that does not have the WaitLeft action (similar
to the Yamada & Matsumoto’s (2003) approach) with one that does. For
Action set
Evaluation metric w/o WaitLeft w/ WaitLeft
Dependencies 90.27 90.53
Root 90.73 90.76
Sentence 39.28 39.74
Leaf 93.87 93.94
Table 2: The significance of the action WaitLeft
a fair comparison, in both cases, we do not use the look ahead procedure.
Note that, as stated above, the action WaitRight is never needed for our
parsing algorithm. It is clear that adding WaitLeft increases the accuracy
significantly.
Evaluation metric Depth=1 Depth=2 Depth=3 Depth=4
Dependencies 90.53 90.67 90.69 90.79
Root 90.76 91.51 92.05 92.26
Sentence 39.74 40.23 40.52 40.68
Leaf 93.94 93.96 93.94 93.95
Table 3: The effect of different Depth settings. Increasing the Depth
usually improves the accuracy
Table 3 investigates the effect of the look ahead, and presents results with
different Depth parameters (Depth= 1 means “no search”), showing a con-
sistent trend of improvement.
Table 4 breaks down the results as a function of the sentence length;
it is especially noticeable that the system also performs very well for long
sentences, another indication for its global performance robustness.
Sentence length
Evaluation metric <11 11-20 21-30 31-40 > 40
Dependencies 93.4 92.4 90.4 90.4 89.7
Root 96.7 93.7 91.8 89.8 87.9
Sentence 85.2 56.1 32.5 16.8 8.7
Leaf 94.6 94.7 93.4 94.0 93.3
Table 4: The effect of sentence length (Depth = 4). Note that our
algorithm performs very well on long sentences
Table 5 shows the results with three settings of the pos tagger. The best
result is, naturally, when we use the gold standard also in testing. However,
it is worthwhile noticing that it is better to train with the same pos tagger
available in testing, even if its performance is somewhat lower.
Finally, Table 6 compares the performances of several of the state of the
art dependency parsing systems with ours. When comparing with other
dependency parsing systems it is especially worth noticing that our system
gives significantly better accuracy on completely parsed sentences.
Sources of tags (train-test)

Evaluation metric gold−pos pos−pos gold−gold
Dependencies 90.7 90.8 92.0
Root 92.0 92.3 93.9
Sentence 40.8 40.7 43.6
Leaf 93.8 94.0 95.0
Table 5: Comparing different sources of POS tagging in a pipeline model.
We use “gold” if true tags are applied and we use “pos” if tags are
generated by a trained tagger (Depth= 4 for all experiments in this table).
It is better to use the POS used in the evaluation also when training
Dependency parsing systems
Evaluation metric Y&M03 N&S04 M&C&P05 Current
Dependencies 90.3 87.3 90.9 90.8
Root 91.6 84.3 94.2 92.3
Sentence 38.4 30.4 37.5 40.7
Leaf 93.5 n/a n/a 94.0
Table 6: Comparison between the current work and other dependency
parsers. Our system performs especially well at the sentence level
4 Extensions: Non-projective trees and edge labels

Some languages such as Czech are very different from English. In this
section, we address two major difficulties one needs to deal with in order
to extend the approach we described here to other languages. First, the
dependency graph in some languages may have multiple roots. In this case,
we need to refer to the dependency edges of a sentence as a dependency
graph. Second, some languages are non-projective . We discuss below how
to overcome these two difficulties within the approach described in this
work. Fortunately, the multi-root problem is not difficult as it might seem.
In fact, the algorithm described earlier can handle multiple roots problem
if the edges are projective. Assume the dependency graph of a sentence
has multiple roots and that the language is projective. Then, we can view
the sentence as a union of several small sentences with no dependencies
among them, where each of them has a single-rooted dependency tree. The
algorithm can generate multiple trees by performing Shift between root
words.
In order to deal with crossing edges we can use one of two possible
approaches. The first one is to introduce some additional actions to handle
the long distance relationships. The new actions, along with S, R and
L, could handle most of the non-projective trees efficiently (Attardi 2006).
Another approach is to convert non-projective trees into projective ones
(Nivre & Nilsson 2005). In this article, we adopt the second method, as
described below. Any projective dependency tree can be mapped into a

projective tree by using the Lift operation, which is defined as follows:
Lift(wj → wk ) = parent(wj ) → wk , where a → b means that a is the
parent of b, and parent is a function which returns the parent word of the
given word.
The procedure is as follows. First, the mapping algorithm examines if
there is a crossing edge in the current tree. If there is a crossing edge, it
will perform repeated Lift operations and replace this edge until the tree
becomes projective. Let E be the dependency graph of a sentence. The
algorithm for converting E into a projective graph is as follows (Nivre &
Nilsson 2005):
1. If E is projective, exit. Otherwise,
2. Choose the smallest non-projective edge e, and
3. Let e′ be the new edge after performing Lift on e.
4. Let E ← E − {e} ∪ {e′ }. Goto 1.
After we convert every training non-projective sentence in the training data,
we train the model as before, on the modified data. In the evaluation phase,
we directly apply the trained model to predict the dependency, resulting in
projective graphs. In this paper, we do not try to recover the information
lost during the mapping procedure.
4.1 Dealing with dependency labels

It is often desirable to predict also the dependency label (type) correspond-
ing to the edges in the parse tree. Dependency labels can be very informative
and include, in English, labels such as the “object” of a verb or the “sub-
ject” of a verb. In this paper, we evaluated a simple two-stage framework
to predict the dependency labels.
We view labeling the type of the dependencies as a post-task after the
phase of predicting the head for each token in the sentence. Thus, it is
a multi-class classification task. The number of the dependency types for
each language can be found in (Buchholz & Marsi 2006). In the phase of
learning dependency types, the parent of each token, which was labeled in
the first phase, will be used among the features. The predicted actions can
therefore help us to make accurate predictions of the dependency types.
4.2 Experimenting with multilingual dependency parsing
In this section, we report the result of applying our system, along with
the simple extensions described above in conll-x shared task (Buchholz &
Marsi 2006). The target dataset consists of data in 12 languages (English is
not included), listed in Table 7. The resources provided for the 12 languages
are described in (Hajič et al. 2004, Chen et al. 2003, Böhmová et al. 2003,
Kromann 2003, van der Beek et al. 2002, Brants et al. 2002, Kawata &
Bartels 2000, Afonso et al. 2002, Džeroski et al. 2006, Civit Torruella &
Martı́ Antonı́n 2002, Nilsson et al. 2005, Oflazer et al. 2003, Atalay et
al. 2003). We apply the techniques earlier in this section to extend our
algorithm to on non-projective languages and to the prediction of both
head words and dependency labels.
Language UAS LAS

Ours AV SD Ours AV SD
Arabic 76.09 73.48 4.94 60.92 59.94 6.53
Chinese 89.60 84.85 5.99 85.05 78.32 8.82
Czech 81.78 77.01 6.70 72.88 67.17 8.93
Danish 86.85 84.52 8.97 80.60 78.31 11.34
Dutch 76.25 75.07 5.78 72.91 70.73 6.66
German 86.90 82.60 6.73 84.17 78.58 7.51
Japanese 90.77 89.05 5.20 89.07 85.86 7.09
Portuguese 88.60 86.46 4.17 83.99 80.63 5.83
Slovene 80.32 76.53 4.67 69.52 65.16 6.78
Spanish 83.09 77.76 7.81 79.72 73.52 8.41
Swedish 89.05 84.21 5.45 82.31 76.44 6.46
Turkish 73.15 69.35 5.51 60.51 55.95 7.71
Table 7: Our results are compared with the average scores.
UAS=Unlabeled Attachment Score, LAS=Labeled Attachment Score,
AV=Average score, and SD=standard deviation
We emphasize that no special language enhancement has been applied for
any language in this experiment. We use the same features and the same
parameters for all languages. In the dataset used, some languages have
additional morphological information. The information is stored in the col-
umn feat. We incorporated the information of feat for the languages
when it is available. The evaluation presented here is done using the script
provided by conll-x and is thus somewhat different from the evaluation
used in Section 3. Two kinds of evaluation metrics are used here:
number of tokens with correct head and correct labels
LAS =
number of tokens
and
number of tokens with correct head
UAS =
number of tokens
Note that uas is the proportion of tokens with correct head. The only
difference between uas and dependency accuracy is that uas does consider
root words.
4.3 Multilingual results

Table 7 shows our results on uas and las. Our results are compared with
the average scores (av) and the standard deviations (sd) of all the systems
that took part in the shared task of conll-x. Our average uas for 12
languages is 83.54% with the standard deviation 6.01; and 76.80% with the
standard deviation 9.43 for average las. Comparing our results to other
systems that participated in the tasks (Buchholz & Marsi 2006), our system
performs better than the average of systems participated in conll-x for
every language. Perhaps surprising given that no tuning to any specific
langauge was made.
Method UAS LAS

w/o WaitLeft 87.95 81.48
w/ WaitLeft 88.47 81.82
w/ WaitLeft + search 89.07 82.31
Table 8: An algorithmic analysis on a Swedish dataset (Depth = 3)
To provide some more understanding in the context of other languages, we
show next some specific cases. Table 8 shows the results on the Swedish
dataset. We compare three systems. The first system used only the three
standard actions, without look-ahead search. The second system used the
improved action set. The best system used improved action set and look-
ahead search. Thus, we observe the same trends as we observed when
running these experiments on the English dataset.
5 Conclusions and further work
We have addressed the problem of using learned classifiers in a pipeline

fashion, where a task is decomposed into several stages and stage classifiers
are used sequentially, with each stage using the outcome of previous stages
as its input. This is a common computational strategy in natural language
processing and is known to suffer from error accumulation and an inability
to correct mistakes in previous stages.
We abstracted two natural principles, one which calls for making the
local classifiers used in the computation more reliable and a second, which
suggests to devise the pipeline algorithm in such a way that minimizes the
number of decisions (actions) made.
In this work we studied this framework in the context of designing a
bottom up dependency parsing. Not only we manage to use this framework
to justify several design decisions, but we also showed experimentally that
following these principles results in improving the accuracy of the inferred
trees relative to existing models.
Interestingly, we can show that the trees produced by our algorithm are
relatively good even for long sentences, and that our algorithm is doing
especially well when evaluated globally, at a sentence level, where our results
are significantly better than those of existing approaches—perhaps showing
that the design goals were achieved.
We also present the primarily results for the multilingual datasets pro-
vided in the conll-x shared task. This shows that dependency parsers
based on a pipeline approach can handle many different languages. Re-
cently, (McDonald & Pereira 2006) proposed using “second-order” feature
information when applying the Einser’s algorithm and a global reranking
algorithm. The proposed method outperform (Mcdonald et al. 2005) and
the approach proposed in this paper. Although there is more room for com-
parison, we believe the the key advantage there is simply using a better
feature set. Overall, the bottom line from this work, as well as from the
conll-x shared task is that a pipeline approach is competitive with a global
reranking approach (Buchholz & Marsi 2006).
Beyond enhancing our current results by introducing more and better
features, we intend to consider incorporating more information sources (e.g.,
shallow parsing results) into the pipeline process, to investigate different
search algorithms in the pipeline framework and to study approaches to
predicts dependencies and their type as a single task.
Acknowledgements. We are grateful to Ryan McDonald and the organizers
of conll-x for providing the annotated data set and to Vasin Punyakanok for
useful comments and suggestions. This research was supported by the Advanced
Research and Development Activity (arda)’s Advanced Question Answering for
Intelligence (aquaint) Program and a doi grant under the Reflex program.
REFERENCES
Abeillé, A. ed. 2003. Treebanks: Building and Using Parsed Corpora. (= Text,
Speech and Language Technology, 20). Dordrecht: Kluwer Academic.
Afonso, S., E. Bick, R. Haber & D. Santos. 2002. “Floresta sintá(c)tica: A tree-
bank for Portuguese”. 3rd Int. Conf. on Language Resources and Evaluation
(LREC ), 1698-1703. Las Palmas, Canary Islands, Spain.
Atalay, N. B., K. Oflazer & B.Say. 2003. “The Annotation Process in the Turk-
ish Treebank”. 4th Int. Workshop on Linguistically Interpreteted Corpora
(LINC ). Budapest, Hungary.
Aho, A.V., R. Sethi & J. D. Ullman. 1986. Compilers: Principles, Techniques,
and Tools. Reading, Mass.: Addison-Wesley.
Attardi, Giuseppe. 2006. “Experiments with a Multilanguage Non-projective De-
pendency Parser”. 10th Conf. on Computational Natural Language Learning
(CoNLL-X ), 166-170. New York.
Brants, S., S. Dipper, S. Hansen, W. Lezius & G. Smith. 2002. “The TIGER Tree-
bank”. 1st Workshop on Treebanks and Linguistic Theories (TLT ). Sozopol,
Bulgaria.
Böhmová, A., J. Hajič, E. Hajičová & B. Hladká. “The PDT: A 3-level Annota-
tion Scenario”. In (Abeillé 2003), ch. 7.
Buchholz, Sabine & Erwin Marsi. 2006. “Conll-X Shared Task on Multilingual
Dependency Parsing”. 10th Conference on Computational Natural Language
Learning (CoNLL-X ), 149-164. New York.
Carlson, A., C. Cumby, J. Rosen & D. Roth. 1999. “The SNoW learning architec-
ture”. Technical Report UIUCDCS-R-99-2101. Urbana, Illinois: University
of Illinois at Urbana-Champaign (UIUC) Computer Science Department.
Chang, M.-W., Q. Do & D. Roth. 2006a. “A Pipeline Framework for Dependency
Parsing”. Annual Meeting of the Association for Computational Linguistics
(ACL), 65-72. Sydney, Australia.
Chang, M.-W., Q. Do & D. Roth. 2006b. “A Pipeline Model for Bottom-up De-
pendency Parsing”. 10th Conf. on Computational Natural Language Learning
(CoNLL-X ), 186-190. New York City.
Chen, K., C. Luo, M. Chang, F. Chen, C. Chen, C. Huang & Z. Gao. 2003. “Sinica
Treebank: Design Criteria, Representational Issues and Implementation”. In
(Abeillé 2003), ch. 13, 231-248.
M. Civit Torruella & Ma A. Martı́ Antonı́n. 2002. “Design Principles for a
Spanish Treebank”. 1st Workshop on Treebanks and Linguistic Theories
(TLT ). Sozopol, Bulgaria.
Cumby, C. & D.Roth. 2003. “On Kernel Methods for Relational Learning”. Int.
Conf. on Machine Learning (ICML), 107-114. Washington, DC, USA..
Džeroski, S., T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky & A. Žele. 2006.
“Towards a Slovene Dependency Treebank”. 5th Int. Conf. on Language
Resources and Evaluation (LREC ).Geneva, Switzerland.
Eisner, J. 1996. “Three New Probabilistic Models for Dependency Parsing: An
Exploration”. Int. Conf. on Computational Linguistics (COLING), 340-345.
Copenhagen, Denmark.
Haghighi, A., A. Ng & C. Manning. 2006. “Robust Textual Inference via Graph
Matching”. Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing, 387-394. Vancouver,
Canada.
Hajič, J., O. Smrž, P. Zemánek, J. Šnaidauf & E. Beška. 2004. “Prague Arabic
Dependency Treebank: Development in Data and Tools”. NEMLAR Int.
Conf. on Arabic Language Resources and Tools, 110-117. Cairo, Egypt.
Kawata, Y. & J. Bartels. 2000. “Stylebook for the Japanese Treebank in VERB-
MOBIL”. Verbmobil-Report 240. Tübingen, Germany: Seminar für Sprach-
wissenschaft, Universität Tübingen.
Kromann, Matthias Trautner. 2003. “The Danish Dependency Treebank and the
DTAG Treebank Tool”. 2nd Workshop on Treebanks and Linguistic Theories
(TLT ) ed. by Joakim Nivre & Erhard Hinrichs, 217-220. Växjö, Sweden.
Lin, D. 1994. “Principar—An Efficient, Broad-coverage, Principle-based Parser”.
Int. Conf. on Computational Linguistics (COLING), 482-488. Kyoto, Japan.
McDonald, R., K. Crammer & F. Pereira. 2005. “Online Large-margin Training
of Dependency Parsers”. Annual Meeting of the Association for Computa-
tional Linguistics (ACL’05 ), 91-98. Ann Arbor, Michigan.
McDonald, R. & F. Pereira. 2006. “Online Learning of Approximate Dependency
Parsing Algorithms”. 11th Conf. of the European Chapter of the Association
for Computational Linguistics (EACL’06 ), 81-88. Trento, Italy.
Marciniak, T. & M. Strube. 2005. “Beyond the Pipeline: Discrete Optimiza-
tion in NLP”. 9th Conference on Computational Natural Language Learning
(CoNLL-2005 ), 136-143. Ann Arbor, Michigan.
Marcus, M. P., B. Santorini & M. Marcinkiewicz. 1993. “Building a Large Anno-
tated Corpus of English: The Penn Treebank”. Computational Linguistics
19:2.313-330.
Nilsson, J., J. Hall & J. Nivre. 2005. “MAMBA Meets TIGER: Reconstructing
a Swedish Treebank from Antiquity”. 15th Nordic Conference of Computa-
tional Linguistics (NODALIDA) Special Session on Treebanks. Copenhagen
Studies in Language 32, 119-132. Copenhagen, Denmark.
Nivre, Joakim. 2003. “An Efficient Algorithm for Projective Dependency Pars-
ing”. Int. Workshop on Parsing Technology (IWPT ), 149-160. Nancy,
France.
Nivre, J. & J. Nilsson. 2005. “Pseudo-projective Dependency Parsing”. 43rd
Annual Meeting of the Association for Computational Linguistics (ACL),
99-106. Ann Arbor, Michigan.
Nivre, J. & M. Scholz. 2004. “Deterministic Dependency Parsing of English
Text”. 20th Int. Conference on Computational Linguistics (COLING-04 ),
64-70. Geneva, Switzerland.
Oflazer, K., B. Say, D. Zeynep Hakkani-Tür & G. Tür. 2003. “Building a Turkish
Treebank”. In (Abeillé 2003), ch. 15.
Punyakanok, V., D. Roth & W. Yih. 2005. “The Necessity of Syntactic Parsing
for Semantic Role Labeling”. Int. Joint Conference on Artificial Intelligence
(IJCAI ), 1117-1123. Edinburgh, Scotland, UK.
Ratnaparkhi, Adwait. 1997. “A Linear Observed Time Statistical Parser Based
on Maximum Entropy Models”. 2nd Conf. on Empirical Methods in Natural
Language Processing (EMNLP ), 1-10. Somerset, New Jersey.
Roth, Dan. 1998. “Learning to Resolve Natural Language Ambiguities: A Uni-
fied Approach”. National Conference on Artificial Intelligence (AAAI ), 806-
813. Montréal, Québec, Canada.
Roth, D. & W. Yih. 2004. “A Linear Programming Formulation for Global Infer-
ence in Natural Language Tasks”. Annual Conf. on Computational Natural
Language Learning (CoNLL) ed. by Hwee Tou Ng & Ellen Riloff, 1-8. Boston,
Mass.
Toutanova, K., D. Klein & C. Manning. 2003. “Feature-rich Part-of-speech Tag-
ging with a Cyclic Dependency Network”. Human Language Technology Con-
ference (HLT-NAACL), 173-180. Edmonton, Canada.
van der Beek, L., G. Bouma, R. Malouf & G. van Noord. 2002. “The Alpino De-
pendency Treebank”. Computational Linguistics in the Netherlands (CLIN ).
Groningen, The Netherlands.
Yamada, H. & Y.Matsumoto. 2003. “Statistical Dependency Analysis with Sup-
port Vector Machines”. Int. Workshop on Parsing Technologies (IWPT ),
195-206. Nancy, France.
How Does Treebank Annotation Influence Parsing?
Or How Not to Compare Apples and Oranges
Sandra Kübler
SfS-CL, University of Tübingen
Abstract
In this paper, we will investigate the influence which different deci-
sions in the annotation schemes of treebanks have on parsing. The
investigation uses the comparison of similar treebanks of German,
Negra and TüBa-D/Z, which are subsequently modified to allow a
comparison of the differences. The results show that some modifica-
tion have a negative effect while others have a positive influence.
1 Introduction
In the last decade, the Penn treebank (Marcus et al. 1994) has become
the standard data set for evaluating pcfg parsers. The fact that most
parsers are solely evaluated on this specific data set leaves the question
unanswered to what degree these results depend on the annotation scheme
of the treebank. This point becomes more urgent in the light of more recent
publications on parsing the Penn or the Negra treebank such as (Dubey
& Keller 2003; Klein & Manning 2003), which show that parsing results
can be improved if the parser is optimized for the input data. Klein &
Manning (2003), e.g., gain more than 1 point in F-score when they mark
each phrase that immediately dominates a verb, thus showing shortcom-
ings in the annotation scheme. These findings raise the question whether
such shortcomings in the annotation can be avoided during the design of
the annotation scheme of a treebank. The question, however, can only be
answered if we know which design decisions are favorable for pcfg parsing.
In this chapter, we will investigate how different decisions in the anno-
tation scheme influence parsing results. In order to answer this question,
however, a method needs to be developed which allows the comparison of
different annotation decisions without comparing unequal categories.
For a comparison of different annotation schemes, one ideally needs one
treebank with two different sets of (manual) annotations. An automatic
conversion from one annotation scheme with flatter structures to another
with deeper structures is impossible because of the high probability that
systematic errors are introduced by the conversion. If no such doubly anno-
tated text is available, two treebanks which are based on the same language
and the same text genre and which are annotated with different annotation
80 SANDRA KÜBLER
schemes can be used. In the absence of more detailed methods of compar-

ison, testing the effect of modifying individual annotation decisions gives
insight into the factors that influence parsing results. However, the annota-
tion schemes must be similar enough to enable a comparison. For English,
no such treebanks based on the same text type are available.
Only recently, a new pair of treebanks for German has become avail-
able, the Negra (Skut et al. 1998) and the TüBa-D/Z (Telljohann et al.
2004) treebanks. Both treebanks are based on newspaper text, both use
the stts pos tagset (Thielen & Schiller 1994), and both use an annota-
tion scheme based on constituent structure augmented with grammatical
functions. However, they differ in other respects, which makes them ideally
suited for an investigation on how decisions in the design of an annotation
scheme influence parsing accuracy.
In section 2, we will describe the treebanks used in this investigation in
more detail, section 3 describes the experimental setup and the method of
comparison, and section 4 discusses the results of the comparison.
S
505
HD OC SB
NP
VP 504
503 MO NK NK MNR
MO MO HD
PP PP PP
500 501 502
AC NK AC NK NK AC NK NK
Im Rathaus−Foyer wird neben dem Fund auch die Forschungsgeschichte zum Hochheimer Spiegel präsentiert .
0 1 2 3 4 5 6 7 8 9 10 11 12 13
APPRART NN VAFIN APPR ART NN ADV ART NN APPRART ADJA NN VVPP $.
Fig. 1: A sample tree from the Negra treebank
2 The Negra and the TüBa-D/Z treebanks
Both treebanks use German newspapers as their data source: the Frank-
furter Rundschau newspaper for Negra and the ‘die tageszeitung’ (taz) news-
paper for TüBa-D/Z. Negra comprises 20 000 sentences, TüBa-D/Z 15 000
sentences. Both treebanks use an annotation framework that is based on
phrase structure grammar and that is enhanced by a level of predicate-
argument structure. Despite all these similarities, the treebank annotations
differ in four important aspects: 1) Negra does not allow unary branching
(nodes with one daughter) while TüBa-D/Z does; 2) in Negra, phrases re-
ceive a flat annotation while TüBa-D/Z uses phrase internal structure; 3)
Negra uses crossing branches to represent long-distance relationships while
TüBa-D/Z uses a projective tree structure combined with functional labels;
4) Negra encodes grammatical functions in a combination of structural and
functional labeling while TüBa-D/Z uses a combination of topological fields
(Höhle 1986) and functional labels, which results in a flatter structure on
HOW DOES TREEBANK ANNOTATION INFLUENCE PARSING? 81
SIMPX
521
− − − − −
NF
520
OA−MOD
VF R−SIMPX
518 519
ON − − −
NX MF
516 517
HD − V−MOD PRED
PX LK MF C ADVX EN−ADD VC
509 510 511 512 513 514 515
− HD HD OA ON − HD − HD
NCX NCX VXFIN NCX VC NCX ADVX NCX VXFIN

500 501 502 503 504 505 506 507 508
− HD − HD HD − HD VPT HD HD HD HD
Der Autokonvoi mit den Probenbesuchern fährt eine Straße entlang , die noch heute Lagerstraße heißt .
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ART NN APPR ART NN VVFIN ART NN PTKVZ $, PRELS ADV ADV NN VVFIN $.
Fig. 2: A sample tree from the TüBa-D/Z treebank
the clause level. The two treebanks also use different notions of gram-
matical functions: TüBa-D/Z defines 36 grammatical functions covering
head and non-head information, as well as subcategorization for comple-
ments and modifiers. Negra utilizes 48 grammatical functions. Apart from
commonly accepted grammatical functions, such as SB (subject) or OA
(accusative object), Negra grammatical functions also comprise a more ex-
tended notion, e.g. RE (repeated element) or RC (relative clause). Because
of the differences in the definition of grammatical functions, a comparison
of grammatical functions is only possible in a task-based evaluation within
an application that uses these grammatical functions as input.
Figure 1 shows a typical tree from the Negra treebank. The syntactic
categories are shown in circular nodes, the grammatical functions as edge
labels in square boxes. The PP “Im Rathaus-Foyer” (in the foyer of the
town hall) and the NP “auch die Forschungsgeschichte zum Hochheimer
Spiegel” (also the research history of the Hochheimer Spiegel) do not contain
internal structure, the noun kernel elements are marked via the functional
labels NK. The fronted PP is grouped under the VP, resulting in crossing
branches. Figure 2 shows a typical example from TüBa-D/Z. Here, the
complex NP “Der Autokonvoi mit den Probenbesuchern” (the car convoy
with the visitors of the rehearsal) contains an NP and the PP with internal
NP, both NPs being explicitly annotated (as NX). The tree also contains
several unary nodes, e.g. the verb phrases “fährt” (goes) and “heißt” (is
called) or the street name “Lagerstraße”. The main ordering principle on
the clause level are the topological fields, long-distance relationships such
as the relation between the noun phrase “eine Straße” (a street) and the
extraposed relative clause “die heute noch Lagerstraße heißt” (which is still
called Lagerstraße) are marked via functional labeling; OA-MOD specifies
that this clause modifies the accusative object OA.
82 SANDRA KÜBLER
3 Comparing treebanks for parsing
Both treebanks do not completely adhere to the requirements of a cfg:

Apart from Negra’s crossing branches, which occur in 30% of the sentences,
both treebanks contain sentences that consist of more than one tree. For
all sentences, a virtual root node that groups all trees is inserted, and par-
enthetical trees are attached to the surrounding tree. In order to resolve
Negra’s crossing branches, a script was used that is provided by Thorsten
Brants. The script isolates crossing constituents and attaches the non-head
constituents higher up in the trees. After the conversion, both PP modifiers
of the VP in Figure 1 are reattached as daughters of S in order to resolve
the crossing branches. Unfortunately, the modified tree does not contain
any information on the scope of the modifiers, which has previously been
shown by the low attachment in the VP.
For the experiments, the statistical left-corner parser LoPar (Schmid
2000) was used. Since the experiments are designed to show differences
in parsing quality depending on the annotation decisions, the parser was
used without EM training or lexicalization of the grammar, and the parser
was given the gold pos tags for the test sentences. For all the experiments
reported here, only sentences with a length of maximally 40 words were
used. These sentences were randomly split into 90% training data and 10%
test data. This split was kept across experiments in order to enable error
analysis.
For each experiment, two different types of tests were performed: For one
type, the data contained only syntactic constituents, i.e., the grammatical
functions, which are shown as square boxes in the trees, were omitted.
Thus, the rule describing the complex NP and its daughters in Figure 1 is
represented as “NP → ADV ART NN PP”. These tests are reported below
as “labeled precision” and “labeled recall”. In the second type of tests,
the syntactic categories were augmented by their grammatical function.
Thus, the same rule extracted from the tree in Figure 1 now contains the
grammatical function for each node: “NP-SB → ADV-MO ART-NK NN-
NK PP-MNR”. These tests are reported below as “function labeled”.
3.1 First results and a methodology for comparing treebanks

The results of the experiments on the original treebanks after preprocessing
are shown in Table 1. As reported above, Negra contains crossing branches
in 30% of the sentences, which had to be resolved in preprocessing. The
results show that the F-score for TüBa-D/Z is significantly higher than for
Negra trees. In contrast, the number of crossing brackets is lower for Negra.
The Negra results raise the question whether the low crossing brackets rate
Negra TüBa-D/Z
crossing brackets 1.07 2.27
labeled recall 70.09% 84.15%
labeled precision 67.82% 87.94%
labeled F-score 68.94 86.00
crossing brackets 1.04 1.93
function labeled recall 52.75% 73.65%
function labeled precision 51.85% 76.13%
function labeled F-score 52.30 74.87
Table 1: The results of comparing Negra and TüBa-D/Z

in Negra is only due to the low number of constituents in the trees. The
percentage of nodes per words shows that while Negra trees contain on
average 0.88 nodes per word, TüBa-D/Z trees contain 2.38 nodes per word.
These results raise the question whether the deeper structures in TüBa-D/Z
can be parsed reliably but may not be useful for further processing.
SIMPX
507
− − −
MF
506
PRED
S
VF LK ADJX
501 503 504 505
SB HD MO MO ON HD − HD
PP NX VXFIN ADJX
500 500 501 502
AC NK − HD HD HD
" Ich stehe ständig unter Strom . " Das Ergebnis ist völlig offen .
0 1 2 3 4 5 6 0 1 2 3 4 5 6 7
$( PPER VVFIN ADJD APPR NN $. $( ART NN VAFIN ADJD ADJD $.
Fig. 3: Negra (left) and TüBa-D/Z (right) trees with one word phrases
These inconclusive results necessitate a more detailed evaluation. A logical

next step towards a more conclusive evaluation would be a comparison of
single constituents and grammatical functions. Such a comparison, however,
is only meaningful if both annotation schemes describe the same phenom-
ena with the same categories. Unfortunately, for Negra and TüBa-D/Z, this
assumption often does not hold. The most obvious area in which the two
treebanks differ is the treatment of unary nodes: while TüBa-D/Z annotates
such constituents, Negra does not allow unary branching. From the exam-
ples in Figure 3, it becomes clear that the differences are widespread and
concern, amongst others, verb phrases, adverbial phrases, and prepositional
phrases. Due to these significant differences, a comparison of single con-
stituents is not meaningful since one would compare, for example, all NPs
in TüBa-D/Z to more complex NPs (with two words or more) in Negra.
84 SANDRA KÜBLER
SIMPX
526
ON HD OA VPT OA−MOD
NX R−SIMPX
522 524
− HD − ON − HD
PX NX MF
521 525 523
− − HD − HD MOD MOD PRED
Der Autokonvoi mit den Probenbesuchern fährt eine Straße entlang , die noch heute Lagerstraße heißt .
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ART NN APPR ART NN VVFIN ART NN PTKVZ $, PRELS ADV ADV NN VVFIN $.
Fig. 4: The sentence from Figure 2 flattened and without unary branches
In the absence of more detailed methods of comparison, testing the effect of

modifying individual annotation decisions gives insight into the factors that
influence parsing results. As mentioned above, Negra and TüBa-D/Z differ
in three major points (the fourth difference, crossing branches in Negra,
is already addressed in preprocessing): flatter phrases and no unary nodes
in Negra, and flatter structures on the clause level in TüBa-D/Z. In order
to test the individual decisions, the opposite treebank is modified to also
follow the respective decision. Consequently, the following modifications of
the treebanks were executed:
1. To test the influence of not annotating unary nodes (such as in Ne-
gra), all nodes with only one daughter were removed from TüBa-D/Z,
preserving the grammatical functions. In the following section, this
version is called Tü NU.
2. To test the influence of Negra’s flat phrase structure, phrases in TüBa-
D/Z were flattened. This version is called Tü flat.
3. Finally, both modifications, removing unary nodes and flattening phrases,
were applied to TüBa-D/Z. The resulting tree for the sentence in Fig-
ure 2 is shown in Figure 4. This version is called Tü cmb.
4. In order to test the influence of the flatter TüBa-D/Z structure on the
clause level, topological fields were introduced into the Negra anno-
tations. The topological fields were automatically extracted from the
Negra corpus by the DFKI Saarbrücken. The conversion algorithm
had to apply heuristics, which leads to a small number of errors in the
field annotation.
The original annotation of Negra had to be modified when the topo-
logical fields were introduced. In many cases, the topological fields
cross phrasal boundaries: These phrasal nodes were removed. The
resulting tree for the sentence in Figure 1 is shown in Figure 5. This
version of Negra is called Ne field.
For testing the effects of the modifications on parser performance, the same
data split as in the baseline experiments was used. The results of these
experiments are shown in Table 2.
S
509
−− −− −− −−
MF
508
MO SB
VF NP RK
505 506 507
MO MO NK NK MNR OC
PP LK PP PP VP
500 501 502 503 504
AC NK HD AC NK NK AC NK NK HD
Im Rathaus−Foyer wird neben dem Fund auch die Forschungsgeschichte zum Hochheimer Spiegel präsentiert .
0 1 2 3 4 5 6 7 8 9 10 11 12 13
APPRART NN VAFIN APPR ART NN ADV ART NN APPRART ADJA NN VVPP $.
Fig. 5: The sentence from Figure 1 with fields
4 Discussion of the results of the comparison
Table 2 gives the results for the evaluation of the two types of tests: the
upper half of the table gives results for parsing with syntactic node labels
only while the lower half of the table gives results for parsing syntactic
categories and grammatical functions. The results show that all transfor-
mations approximate the other treebank in annotation as well as in parser
performance.
Negra Ne field TüBa Tü NU Tü flat Tü cmb

cross. br. 1.07 1.30 2.27 1.87 1.09 1.15
lab. recall 70.09% 75.21% 84.15% 77.41% 85.63% 77.43%
lab. precision 67.82% 77.17% 87.94% 81.52% 86.24% 76.44%
lab. F-score 68.94 76.18 86.00 79.41 85.93 76.93
% sent. n. parsed 0.55% 0.05% 0.48% 1.91% 0.62% 2.26%
cross. br. 1.04 1.21 1.93 2.17 1.07 1.29
func. lab. recall 52.75% 69.85% 73.65% 62.11% 73.80% 53.63%
func. lab. prec. 51.85% 69.53% 76.13% 65.43% 74.66% 58.87%
func. lab. F-score 52.30 69.19 74.87 63.73 74.23 56.13
% sent. n. parsed 12.59% 2.17% 1.03% 9.98% 3.55% 18.87%
nodes/words 0.88 1.38 2.38 1.33 2.00 1.06
Table 2: The results of comparing the modified versions of Negra and

TüBa-D/Z
4.1 Modification of Negra
The modification of Negra, which introduces topological fields to flatten
the clause structure, leads to an improved F-score but also to more crossing
brackets. A first hypothesis would be that the improvement is due to the
reliable recognition of the new field nodes only. This hypothesis can be re-
jected by an evaluation of the parsing results for single syntactic categories.
This evaluation shows that the introduction of topological fields gives high
F-scores for the major fields, but it also improves both precision and recall
for adverbial phrases, noun phrases, prepositional phrases, and almost all
86 SANDRA KÜBLER
types of coordinated phrases. For adjectival phrases, precision improves

from 55.95% to 64.46% - but at the same time, recall degrades from 56.38%
to 50.97%. In contrast, the F-score for verb phrases deteriorates. This is
probably due to the fact that only verb phrases which do not cross field
boundaries are left in the annotation.
One reason for the improvement in the overall F-score is the change
in the number of rules for a specific syntactic category. A look at the
rules extracted from the training corpus shows a dramatic drop in numbers:
for adjectival phrases, the number decreases from more than 3900 rules
containing AP to approximately 3400 – even though new rules were added
for the treatment of topological fields.
4.2 Modification of TüBa-D/Z

Each modification of TüBa-D/Z results in a loss in F-score, but also in an
improvement concerning crossing brackets. While flattening phrase struc-
ture only leads to minor changes, deleting unary nodes has a detrimen-
tal effect: the F-score decreases from 86.00 to 79.41 when parsing syntactic
constituents, and from 74.83 to 63.73 when parsing syntactic constituents
including grammatical functions. Deleting unary nodes in the experiments
with grammatical functions increased the number of sentences that could
not be parsed by a factor of almost 10. These sentences would have re-
quired additional rules not present in the training sentences. This leads to
the question whether the deterioration is only due to the high number of
sentences which were assigned no parse. However, an evaluation of only
those sentences that did receive a parse shows only slightly better results
in recall (obviously, precision remains the same): 68.18% for parsed sen-
tences as compared to 62.11% for all sentences. This result, however, may
also be caused by missing rules, which is corroborated by a look at the
rules extracted from the test sentences: Approximately 24.0% of the rules
needed for correctly parsing the test sentences in the modification without
unary nodes are not present in the training set, as compared to 18.2% in
the original version of the TüBa-D/Z treebank.
A closer look at the different constituents shows that the syntactic cat-
egories that are affected most by the deletion of unary nodes are noun
phrases, finite verb phrases, adjectival phrases, adverbial phrases, and infini-
tival verb phrases. All those categories suffer losses in the F-score between
1.81% (for infinitival verb phrases) and 57.28% (for adverbial phrases).
Since both precision and recall are similarly affected, this means that the
parser does not only annotate spurious phrases but also misses phrases
which should be annotated.
Flattening phrases in TüBa-D/Z has a negative effect on precision but it

causes a slight increase in recall. The latter effect is a consequence of the bias
of the pcfg parser, which prefers small trees. A comparison of the average
number of nodes per word in a sentence shows that for all models, the
parsed trees contain significantly fewer nodes than the gold standard trees.
For the original TüBa-D/Z grammar including grammatical functions, the
parsed tree contains 54.6% of the nodes in the gold standard; in the flattened
version, the ratio is 58.6% (and for Negra, it is 62.5%).
The category that profits most from this modification is the category of
named entities; other syntactic categories that profit from a flattening of
the trees are prepositional phrases and relative clauses. All of them need
information from deeper levels of the tree to be recognized correctly.
The combination of both modifications in TüBa-D/Z, flat phrase
structure and deleted unary nodes, leads to a dramatic loss in the F-score
for functional parsing as compared to the experiment in which only the
unary nodes were deleted. A look at the unlabeled F-scores shows that
this loss is not only due to incorrect labels for constituents, it also affects
the recognition of phrase boundaries: the unlabeled F-score degrades from
91.34 for the original version of Tüba-D/Z, to 81.06 for the version without
unary nodes, and to 71.65 for the combination of both modifications.
We have presented a method for comparing different annotation schemes

and their influence on pcfg parsing. It is impossible to compare the per-
formance of a parser on single syntactic categories since even rather similar
annotation schemes apply different definitions for different phrase types. As
a consequence, the comparison must be based on modifications within one
annotation scheme to make it more similar to the other. The experiments
presented here show that annotating unary nodes and structured phrases
improve parsing results. On the clause level, however, a flatter structure
incorporating topological fields is helpful for German.
The experiments presented here were conducted with a standard pcfg
parser. The next logical step is to extend the comparison to different prob-
abilistic parsers with different probability models and different biases. The
parser by Klein & Manning (2003) uses extensions of the probability model
as well as lexicalization, both of which were very successful for English.
It is, however, unclear what the effect of these extensions is on German
data. Especially with regard to lexicalization, the situation for German is
unclear: While lexicalization improves results for English (cf. e.g. (Collins
1999)), experiments for German (Dubey & Keller 2003) show a detrimental
effect of lexicalization for the Negra data. Thus, a comparison of treebank
88 SANDRA KÜBLER
annotation schemes based on lexicalization only makes sense if a method

of lexicalization can be found for both annotation schemes that does not
overly decrease performance.
Acknowledgments. We are grateful to the DFKI Saarbrücken for providing

the topological field annotation for Negra. We would like to thank Wolfgang
Maier, Julia Trushkina, Holger Wunsch, and Tylman Ule for scripts to convert
and evaluate the data, and Erhard Hinrichs, Tylman Ule, and Yannick Versley
for fruitful discussions.
REFERENCES
Collins, Michael. 1999. Head-Driven Statistical Models for Natural Language
Parsing, Ph.D. dissertation. University of Pennsylvania. Philadelphia.
Dubey, Amit & Frank Keller. 2003. “Probabilistic Parsing for German using
Sister-Head Dependencies”. 41st Annual Meeting of the Association for Com-
putational Linguistics (ACL’2003 ), 96-103. Sapporo, Japan.
Höhle, Tilman. 1986. “Der Begriff ‘Mittelfeld’, Anmerkungen über die Theo-
rie der topologischen Felder”. Akten des Siebten Internationalen Germanis-
tenkongresses 1985, 329-340. Göttingen, Germany.
Klein, Dan & Christopher Manning. 2003. “Accurate Unlexicalized Parsing”.
Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics (ACL’2003 ), 423-430. Sapporo, Japan.
Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann
Bies, Mark Ferguson, Karen Katz & Britta Schasberger. 1994. “The Penn
Treebank: Annotating Predicate Argument Structure”. Proceedings of the
Human Language Technology Workshop (HLT’94 ), 114-119. Plainsboro,
New Jersey.
Schmid, Helmut. 2000. LoPar: Design and Implementation. Technical report.
Universität Stuttgart, Germany.
Skut, Wojciech, Thorsten Brants, Brigitte Krenn & Hans Uszkoreit. 1998. “A
Linguistically Interpreted Corpus of German Newspaper Texts”. ESSLLI
Workshop on Recent Advances in Corpus Annotation. Saarbrücken, Ger-
many.
Telljohann, Heike, Erhard Hinrichs & Sandra Kübler. 2004. “The TüBa-D/Z
Treebank: Annotating German with a Context-Free Backbone”. Proceed-
ings of of the Fourth International Conference on Language Resources and
Evaluation (LREC’2004 ), 2229-2235. Lisbon, Portugal.
Thielen, Christine & Anne Schiller. 1994. “Ein kleines und erweitertes Tagset
fürs Deutsche”. Lexikon & Text ed. by H. Feldweg & E. Hinrichs, 215-226.
Tübingen: Niemeyer.
The SenSem Project: Syntactico-Semantic
Annotation of Sentences in Spanish
Laura Alonso∗ , Joan Antoni Capilla∗∗ , Irene Castellón∗∗∗ ,

Ana Fernández-Montraveta∗∗∗∗ & Gloria Vázquez∗∗
∗
FaMAF, Univ. Nacional de Córdoba
∗∗
Dept. of English and Linguistics, Univ. de Lleida
∗∗∗
Dept. of General Linguistics, Univ. de Barcelona
∗∗∗∗
Dept. of English and German Philology, UAB
Abstract
This paper presents SenSem, a project that aims to systematize the
behavior of verbs in Spanish at the lexical, syntactic and semantic
level. As part of the project, two resources are being built: a corpus
where sentences are associated to their syntactico-semantic interpre-
tation and a lexicon where each verb sense is linked to the corre-
sponding annotated examples in the corpus. We also discuss some
tendencies that can be observed in the current state of development.
1 Introduction
The SenSem project aims to build a databank of Spanish verbs based on

a lexicon that links each verb sense to a significant number of manually
analyzed corpus examples. This databank will reflect the syntactic and
semantic behavior of Spanish verbs in naturally occurring text.
We analyze the 250 verbs that occur most frequently in Spanish. An-
notation is carried out at three different levels: the verb as a lexical item,
the constituents of the sentence and the sentence as a whole. The annota-
tion process includes verb sense disambiguation, syntactic structure analysis
(syntagmatic categories, including the annotation of the phrasal heads, and
syntactic functions), interpretation of semantic roles and analysis of senten-
tial semantics. It is precisely this last area of investigation which sets our
project apart from others currently being carried out with Spanish (Subi-
rats & Petruck 2003, Garcı́a De Miguel & Comesaña 2004).
Abstracting from the analysis of a significant number of examples, the
prototypical behavior of verb senses will be systematized and encoded in
a lexicon. The description of verb senses will focus on their properties
at the syntactico-semantic interface, and will include information like the
list of frames in which a verb can possibly occur. In addition, selectional
restrictions will be inferred from words marked as heads of the constituents.
Finally, the usage of prepositions will be studied.
90 ALONSO, CAPILLA, CASTELLÓN, FERNÁNDEZ, VÁZQUEZ
The conjunction of all this information will provide a very fine-grained de-
scription of the syntactico-semantic interface at sentence level, useful for ap-
plications that require an understanding of sentences beyond shallow pars-
ing. In the fields of automatic understanding, semantic representation and
automatic learning systems, a resource of this type is very valuable.
In the rest of the paper we will describe the corpus annotation process
in more detail and examples will be provided. Section 2 offers a general
overview of other projects similar to SenSem. In section 3, the levels of
annotation are discussed, and the process of annotation is described in sec-
tion 4. We then proceed to present the results obtained to date and the
current state of annotation, and we put forward some tentative conclusions
obtained from the results of the annotation thus far.
2 Related work
As shown by Levin (1993) and others (Jones et al. 1994, Jones 1995, Kipper
et al. 2000, Saint-Dizier 1999, Vázquez et al. 2000), syntax and semantics are
highly interrelated. By describing the way linguistic layers inter-relate, we
can provide better verb descriptions since generalizations from the lexicon
that previously belonged to the grammar level of linguistic description can
be established (lexicalist approach).
Within the area of Computational Linguistics, it is common to deal
with both fields independently (Grishman et al. 1994, Corley et al. 2001).
In other cases, the relationship established between syntactic and semantic
components is not fully exploited and only basic correlations are established
(Dorr et al. 1998, McCarthy 2000). We believe this approach is interesting
even though it does not take full advantage of the existing link between
syntax and semantics.
Furthermore, we think that in order to coherently characterize the syn-
tactico-semantic interface, it is necessary to start by describing linguistic
data from real language. Thus, a corpus annotated at syntactic and seman-
tic levels plays a crucial role in acquiring this information appropriately.
In recent years, a number of projects related to the syntactico-semantic
annotation of corpora have been carried out. The length of the present
paper does not allow us to consider them all here, but we will mention a
few of the most significant ones.
FrameNet (Johnson & Fillmore 2000) is a lexicographic resource that
describes approximately 2,000 items, including verbs, nouns and adjectives
that belong to diverse semantic domains (communication, cognition, per-
ception, movement, space, time, transaction, etc.). Each lexical entry has
examples from the British National Corpus, manually annotated with ar-
gument structure and, in some cases, also adjuncts.
THE SENSEM PROJECT 91
PropBank (Kingsbury & Palmer 2002, Kingsbury et al. 2002) is a project

based on the manual semantic annotation of a subset of the Penn Treebank
II (a corpus which is syntactically annotated). This project aims to identify
predicate-argument relations. In contrast with FrameNet, the sentences to
be annotated have not been pre-selected.
Both FrameNet and Propbank work with the use of corpora, although
their objectives are a bit different. In FrameNet, a corpus is used to find
evidence about linguistic behavior and to associate examples to lexical en-
tries, whereas in Propbank, the objective is to enrich a corpus that has been
already annotated at a syntactic level so that it can be exploited in more
ambitious nlp applications.
For Spanish, only a few initiatives address the syntactico-semantic analy-
sis of corpus. The DataBase ”Base de Datos Sintácticos del Español Actual”
(Muñiz et al. 2003) provides the syntactic analysis of 160.000 sentences ex-
tracted from part of the arthus corpus of contemporary texts. Syntactic
positions are currently being labeled with semantic roles (Garcı́a de Miguel
& Comesaña 2004).
FrameNet-Spanish (Subirats & Petruck 2003) is the application of the
FrameNet methodology for Spanish. Its target is to develop semantic frames
and lexical entries for this language. Each verb sense is associated to its
possible combinations of participants, syntactic functions and phrase types,
as attested in the corpus.
The SenSem project provides a different approach to the description of
verb behavior. In contrast with FrameNet, its aim is not to provide examples
for a pre-existing lexicon, but to shape the lexicon with the corpus examples
annotated. Another difference from the FrameNet approach is that the
semantic roles we use are far more general, they are not related to syntactic
functions, and are less class-dependent.
Finally, to the best of our knowledge, no large-scale corpus annotation
initiative associates semantics to sentence such as their aspectual interpre-
tations or types of causativity.
3 Levels of annotation
As mentioned previously, we are describing verb behavior so only con-

stituents directly related to the verb will be analyzed. Elements beyond
the scope of the verb (i.e., extra-sentential elements such as logical linkers,
some adverbs, etc.) are disregarded, as in the following example:
(1) ... El presidente, que ayer inició una visita oficial a la capital
francesa, hizo estas declaraciones ...
... The president, who began an official visit to the French capital
yesterday, stated ...
Were we annotating the verb iniciar –begin– we would ignore the partici-
pants of the main sentence and only take into account the elements within
the clause. If we were annotating the verb hacer –make– we would anno-
tate the subject to include the relative clause as a whole, without further
analysis, and with the word “president” as the head of the whole structure.
3.1 Sentential semantics level

At this level, different aspects of sentential semantics are accounted for.
With regard to aspectual information, a distinction is made among three
types of meaning, event, process or state, as in the following examples:
(2) event: ... El diálogo acabará hoy ...
... The conversations will finish today ...
(3) process: ... cuando le preguntaron de qué habı́a vivido hasta
aquel momento ...
... when he was asked what he had been living on until then ...
(4) state: ... El gasto de personal se acerca a los 2.990 millones ...
... Personnel expenses come close to 2.990 million ...
Apart from aspectual information, we also annotate sentential level mean-
ings using labels like anticausative, antiagentive, impersonal, reflexive, re-
ciprocal or habitual. This feature is useful to account for the variation in
the syntactic realizations of the argument structures of each verb sense.
For example, the following sentences with the verb abrir –open– present an
agentive and antiagentive interpretation, respectively:
(5) agentive: El alcalde de Calafell [...] abrirá un expediente...
The mayor of Calafell [...] will open administrative proceedings ...
(6) antiagentive: el vertedero de Tivissa no se abrirá sin consenso.
Tivissa’s rubbish dump will not be opened without a consensus.
3.2 Lexical level

At the lexical level, each example of a verb is assigned a sense. We have
developed a verb lexicon in which the possible senses for a verb are defined,
together with its prototypical event structure and thematic grid, a list of
synonyms and antonyms and related synsets in WordNet (Fellbaum 1998).
Various lexicographic sources have been taken as references to build
the inventory of senses for each verb, mainly the Diccionario de la Real
Academia de la Lengua Española and the Diccionario Salamanca de la
Lengua Española. Less frequent meanings are discarded, together with
archaic and restricted uses. This inventory of senses for each verb is only
preliminary, and can be modified whenever the examples found in the corpus
indicate the existence of a distinct sense which has not been considered.
3.3 Constituent level

Finally, at the constituent level, each participant in the clause is tagged with
its constituent type (e.g., noun phrase, completive, prepositional phrase)
and syntactic function (e.g., subject, direct object, prepositional object).
Arguments and adjuncts are also distinguished. Arguments are defined
as those participants that are part of the verb’s lexical semantics. Argu-
ments are assigned a semantic role describing their relation with the verb
(e.g., agent, theme, initiator, . . . ).
The head of the phrase is also signaled in order to acquire selectional
restrictions for that verb sense. Negative adverbs and pronouns are also
marked. They have been considered relevant in that it may alter some
other information in different levels.
4 Annotation process
The SenSem corpus will describe the 250 most frequent verbs Spanish. For
each verb, 100 examples are extracted randomly from 13 million words of
corpora obtained from the electronic version of the Spanish newspaper El
Periódico de Catalunya, disregarding uses of the verb as an auxiliary and
idioms.
The manual annotation of examples is carried out via a graphical inter-
face, seen in Figure 1, where the three levels are clearly distinguished.
The interface displays one sentence at a time. First, when a verb sense is
selected from the list of possible senses, and its prototypical event structure
and semantic roles are displayed. Then, constituents are identified and
analyzed by selecting the words that belong to it; the head of the arguments
and its possible metaphorical usage are also signaled. Finally, annotators
specify any applicable semantics at clause level (e.g., anticausative, reflexive,
stative, etc.).
At the constituent level, the semantic role chosen for each phrase is
often predictive of the other labels of that phrase: agents tend to be noun
phrases with subject function, themes tend to be noun phrases with subject
or object function (if they occur in passives, antiagentives, anticausatives
or statives), etc. These associations are exploited to semi-automate the
annotation process: once a role is selected, the category and function most
frequently associated with it and its role as a verb argument are pre-selected
so that the annotator only has to validate the information.
The final corpus will be available to the linguistic community by means
of a web-based interface (http://grial.uab.es/projectes/sensem).
Fig. 1: Screenshot of the annotation interface
5 Preliminary results of annotation
At this stage of the project1 , 77 verbs have already been annotated, which
implies that the corpus at this moment is made up of 7,700 sentences
(199,190 words). A total of 900 sentences out of these 7,700 have already
been validated, which means that a corpus of approximately 25,000 words
has already undergone the complete annotation process.
5.1 Data analysis

In this section we describe the information about verb behavior that can
be extracted from the corpus in its present state. We have found that, out
of the 199,190 words that have already been annotated, 182,303 are part of
phrases which are an argument of the verb and 16,887 are adjuncts.
With regard to aspectuality, there is a clear predominance of events
1
This paper was finished in the summer of 2005.
(74.26% of the sentences) over processes (20.67%) and states (8.96%). This
skewed distribution of clause types, with a clear predominance of events,
may be exclusive to the journalistic genre. We have yet to investigate its
distribution in other genres.
As concerns syntactic functions, the most frequent category is direct
object, covering 40% of the labels, with a significant difference in subjects
(23%). This is not surprising if we take into account that Spanish is a pro-
drop language. Prepositional objects (13%) and indirect objects (2%) are
less frequent.
As for semantic roles, themes are predominant, as would be expected
given that the most common syntactic function is that of direct object, and
that there is a high presence of antiagentive, anticausative and passive con-
structions. Within the different types of the semantic role theme, unaffected
themes (moved objects) appear most frequently.
5.2 Inter-annotator agreement

In order to measure inter-annotator agreement, four sentences of 59 verbs
have been annotated by 4 different judges so that divergences in criteria
could be found. These common sentences were used in the preliminary
phase with the aim of both training the annotators and detecting points of
disagreement among them. This comparison has helped us refine and settle
the annotation guidelines and facilitate the revision of the corpus.
In order to detect these problematic issues, we calculated inter-annotator
agreement for all levels of annotation. An overview of the most represen-
tative values for annotator agreement can be seen in Table 1. We found
pair wise proportions of overall agreement and the kappa coefficient (Cohen
1960), which gives an indication of stability and reproducibility of human
judgments in corpus annotation.
As a general remark, agreement is comparable to what is reported in
similar projects. For example, Kingsbury et al. (2002) report agreement
between 60% and 100% for predicate-argument tagging within Propbank,
noting that agreement tends to increase as annotators are more trained. In
SenSem, the level of annotation that is comparable to predicate-argument
relations, semantic role annotation, is clearly within this 60%-100% range.
It is noteworthy that the values obtained for the kappa coefficient are
rather low. After a close inspection, we found that these low values of
kappa are mainly due to the fact that the annotation guidelines were still
not well-established, and that annotators were still under training.
In contrast, the tagging of semantic roles presents perfect agreement for
very infrequent roles (indirect cause, instrument, location). More frequent
roles show a higher level of disagreement: initiators are significantly less
category agr. kappa category agr. kappa

eventual semantics argumentality
event 66% .11 argument 82% .54
state 90% .33 adjunct 64% .46
process 76% .06
semantic role
initiator 70% .37 agent 84% .81
cause 91% .89 experiencer 97% .92
theme 68% .43 affected theme 74% .55
non-affected theme 70% .34 goal 79% .70
syntactic function
agentive complement 100% 1.00 subject 87% .83
direct object 80% .63 indirect object 77% .79
prepositional object 1 67% .65 prepositional object 2 66% .28
prepositional object 3 78% .24 Circumstantial 62% .42
Predicative 76% .16
syntactic category
noun phrase 78% .67 prepositional phrase 72% .53
adjectival phrase 88% .69 negative adverbial 100% 1.00
adverbial phrase 77% .54 adverbial clause 68% .66
gerund clause 72% .65 relative clause 82% .16
completive clause 95% .93 direct speech 96% .95
infinitive clause 94% .98 prep. completive clause 96% .44
prep. infinitive clause 81% .57 personal pronoun 97% .81
relative pronoun 98% .96 other pronouns 94% .82
Table 1: Inter-annotator agreement for a selection of annotated categories
clearly perceived than agents or causes (note differences in k agreement).

It is also clear that fine-grained distinctions are more difficult to perceive
than coarse-grained ones, as exemplified by low agreement between subdis-
tinctions within the superclass of theme.
Among syntactic functions, the agentive complement of passives presents
perfect agreement. Agreement is also high for subjects, direct and indirect
objects, but the distinction between different kinds of prepositional objects
and circumstantial complements is not clearly perceived.
Finally, agreement is rather high for some syntactic categories: pro-
nouns, adverbs of negation, adjectival complements, completive clauses, in-
finitive clauses and direct speech present k > .7 and ratios of agreement over
90%. However, major categories present a rather high ratio of disagreement,
as well as those categories that are mostly considered adjuncts.
After the study of inter-annotator agreement, the guidelines for anno-
tation have been settled (Vázquez et al. 2005). These guidelines serve as
a reference for annotators, and we believe they will increase the overall
consistency of the resulting corpus.
6 Conclusions and future work
The linguistic resource we have presented constitutes an important source of

linguistic information useful in several natural language processing areas as
well as in linguistic research. The fact that the corpus has been annotated
at several levels increases its value and its versatility.
The project is in its second year of development, with still a year and
a half to go. During this time we intend to continue with the annotation
process and to develop a lexical database that will reflect the information
found in the corpus. We are aware that the guidelines established in the
annotation process are going to bias, to a certain extent, the resulting re-
sources, but nevertheless we believe that both corpus and lexicon are of
interest for the nlp community.
All tools developed in the project and the corpus and lexicon themselves
will be available to the scientific community.
REFERENCES
Carletta, J., A.Isard, S.Isard, J.C.Kowtko, G.Doherty-Sneddon & A.H.Anderson.
1996. “HCRC Dialogue Structure Coding Manual”. HCRC Technical report
HCRC/TR-82. Edinburgh, U.K.: Univ. of Edinburgh.
Cohen, J. 1960. “A Coefficient of Agreement for Nominal Scales”. Educational
& Psychological Measure, vol. 20, 37-46.
Corley, S., M. Corley, F. Keller, M. W. Crocker, S. Trewin. 2001. “Finding
Syntactic Structure in Unparsed Corpora” Computer and the Humanities
vol. 35, 81-94.
Dorr, B., M. A. Martı́, I. Castellón. 1998. “Spanish EuroWordNet and LCS-
Based Interlingual MT”. Proceeding of the ELRA Congress. Granada, Spain.
http://citeseer.ist.psu.edu/dorr97spanish.html
Fellbaum, C. 1998. “A Semantic Network of English Verbs”, WordNet: An Elec-
tronic Lexical Database ed. by C. Fellbaum, 69-104. Cambridge, Mass.: MIT
Press.
Garcia de Miguel, J.M. & S. Comesaña. 2004. “Verbs of Cognition in Span-
ish: Constructional Schemas and Reference Points”, Linguagem, Cultura
e Cogniao: Estudos de Linguı́stica Cognitiva ed. by A. Silva, A. Torres &
M. Gonçalves, 399-420. Coimbra: Almedina.
Grishman, Ralph, C. Macleod & A. Meyers. 1994. “Comlex Syntax: Build-
ing a Computational Lexicon”. International Conference on Computational
Linguistics (COLING-1994 ), 268-272. Kyoto, Japan.

Johnson, C. & C.J. Fillmore. 2000. “The FrameNet Tagset for Frame-Semantic
and Syntactic Coding of Predicate-Argument Structure”. Proceedings of the
North American Chapter of the Association for Computational Linguistics
(NAACL 2000 ), 56-62. Seattle, Washington.
Jones, D., ed. 1994. “Verb Classes and Alternations in Bangla, German, English
and Korean”. Memo n. 1517. MIT, Artificial Intelligence Laboratory.
Jones, D. 1995. “Predicting Semantics from Syntactic Cues—Evaluating Levin’s
English Verb Classes and Alternations”. UMIACS TR-95-121, Univ. of Mary-
land.
P. Kingsbury, P. & M. Palmer. 2002. “From Treebank to Propbank”. 3rd Conf.
on Language Resources and Evaluation (LREC-2002 ). Las Palmas, Spain.
http://citeseer.ist.psu.edu/kingsbury02from.html
Kingsbury, P., M. Palmer & M. Marcus. 2002. “Adding Semantic Annotation to
the Penn TreeBank”. Human Language Technology Conference. San Diego,
Calif. http://citeseer.ist.psu.edu/kingsbury02adding.html
Kipper, K., H.T. Dang & M. Palmer. 2000. “Class-based Construction of a Verb
Lexicon”. 17th National Conference on Artificial Intelligence (AAAI-2000 ),
691-696. Austin, Texas.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investi-
gation. Chicago, Illinois: The University of Chicago Press.
McCarthy, D. 2000. “Using Semantic Preferences to Identify Verbal Participation
in Role Switching Alternations”. Proceedings of the North American Chapter
of the Association for Computational Linguistics (NAACL 2000 ), 256-263.
Seattle, Washington.
Muñiz, E., M. Rebolledo, G. Rojo, M.P. Santalla & S. Sotelo. 2003. “Description
and Exploitation of BDS: a Syntactic Database about Verb Government in
Spanish”. Int. Conf. on Recent Advances in Natural Language Processing
(RANLP-2003 ) ed. by G. Angelova, K. Bontcheva, R. Mitkov & N. Nicolov,
297-303. Borovets, Bulgaria.
Saint-Dizier, P. 1999. “Alternations and Verb Semantic Classes for French: Anal-
ysis and Class formation”. Predicative Forms in Natural Languages and
Lexical Knowledge Bases ed. by P. Saint-Dizier, 139-170. Dordrecht, The
Netherlands: Kluwer.
Subirats-Rüggeberg, C. & M.R.L. Petruck. 2003. “Surprise: Spanish FrameNet!”.
Workshop on Frame Semantics, International Congress of Linguists. Prague.
http://framenet.icsi.berkeley.edu/papers/SFNsurprise.pdf
Vázquez, G., A. Fernández & M. A. Martı́. 2000. Clasificación verbal. Alternan-
cias de diátesis. Lleida: Universitat de Lleida.
Vázquez, G., A. Fernández & L. Alonso. 2005. “Guidelines for the Syntactico-
Semantic Annotation of a Corpus in Spanish”. Int. Conf. on Recent Ad-
vances in Natural Language Processing (RANLP-2005 ) ed. by G. Angelova,
K. Bontcheva, R. Mitkov & N. Nicolov, 603-607. Borovets, Bulgaria.
Generating Referring Expressions:
Past, Present and Future
Robert Dale
Macquarie University
Abstract
The generation of referring expressions—noun phrases which speak-
ers use to identify intended referents for their hearers—is the most
well-established and agreed-upon subtask in natural language gen-
eration. This paper surveys the developments in the field that have
brought us to this point, and characterises the state of the art in the
area; we also point to unsolved problems and new directions waiting
to be pursued.
1 Introduction
Natural language generation (nlg) is concerned with getting machines to

produce natural language output, either written or spoken, from some un-
derlying non-linguistic input, which might be information contained in a
database or a knowledge base of some kind.1 As a research area, nlg is
important for both practical and theoretical reasons.
• On the practical side, nlg is an important component of a number of
potentially valuable applications. For example, we would like embod-
ied conversational agents to use language appropriately on the basis
of context, overcoming the limitations of canned output that we see
in today’s dialogue applications; and there is great scope for infor-
mation personalisation, where nlg permits the production of massive
numbers of individually-tailored emails, documents or web pages that
take into account the preferences or characteristics of the recipient.
In areas like these, nlg offers the scope for solutons that would be
infeasible if we had to rely on human authoring.
• On the theoretical side, nlg permits—and even promotes—a focus
on aspects of language that are generally ignored by work in natural
language analysis: it allows us to explore the decision processes that
lie behind subtle, and sometimes not so subtle, variations in output.
So, for example, why does a speaker produce a passive sentence rather
than the corresponding active declarative sentence, or some other syn-
tactic structure that appears to convey the same information? What
1
In more recent years, there has been an increasing amount of work in what we might
call text-to-text generation, where the input and output are both natural language
texts; these approaches are not directly relevant to the topic discussed here.
100 ROBERT DALE
determines whether, in a conditional expression, we express the an-

tecedent before the consequent or vice versa? Why give a five minute
exposition on a topic rather than a thirty minute one, and how are the
higher-level structures and the overall content of the possible variants
determined? These are the kinds of questions that nlg attempts to
answer, and which work in nlu tends to ignore: where nlu seeks to
remove the subtle nuances and variations in language use in order to
determine some core ‘canonical’ representation of meaning, nlg seeks
to explain the variety in expression that is contingent on contextual
factors.
A component task that is implicated in almost any natural language gen-
eration system is that of referring expression generation. Any nlg system
has to talk about things, and to do so it has to make decisions as to how to
refer to those things. For many entities in the world—people, companies,
cities, ships, newspapers, movies, and so on—this is a relatively straight-
forward process: these entities have proper names, and in such cases the
requirement is for a decision-making process that can choose between the
use of the entity’s full proper name, an abbreviated form if one is available,
or a pronoun.
But most entities in the world do not have proper names, and so if we
want to refer to these entities, we have to do so in terms of properties they
possess; here the range of possibilities is significantly greater, and the task
correspondingly more complex. It is this process of determining how to
identify an intended referent for a hearer that we call referring expression
generation.
The present paper aims to do three things. First, in Section 2, we
provide a characterisation of the task of referring expression generation,
situating it within the larger task of natural language generation as a whole.
Then, in Section 3, we provide a brief survey of the work that has been
done in the area to date, emphasising how most recent work has adopted
a characterisation of the problem that is fairly widely agreed upon. In
Section 4, we attempt to identify outstanding issues in the field and point
to possible future research directions. The paper ends with some conclusions
in Section 5.
2 What’s involved in referring expression generation

2.1 The architecture of a natural language generator
Under a fairly common view of the architecture of the natural language
generation process (see, for example, (Reiter & Dale 2000)), the generation
of an output text, whether one sentence long or many paragraphs long,
consists of three distinct stages:
GENERATING REFERRING EXPRESSIONS 101
Text planning is concerned with taking as input some high level commu-
nicative goal—for example, my desire to tell you what I did today—
and deriving from this some structured plan for what has to be said
to achieve this goal. Typically this plan is considered to be tree-like in
nature, corresponding to the hierarchical decomposition of the orig-
inal goal into subgoals, subgoals of those subgoals, and so on. So,
for example, my plan to tell you about my activities today might be
decomposed into the three subgoals of telling you what I did in the
morning, what I did in the afternoon, and what I did in the evening;
and each of these subgoals might in turn consist of lower-level goals,
and so on recursively until some level of atomicity, perhaps corre-
sponding to utterances that can be realised as single sentences or
clauses, is reached.
We might expect this tree structure to be labelled to indicate the
rhetorical relations that hold between the constituent elements of the
plan. So, for example, my goal to tell you about a particular event
that occurred might be preceded by a goal to explain some relevant
background information in order to allow you to understand the sig-
nificance of the event, and the relationship between these two goals
in the plan might be labelled as being one of Background. The termi-
nal nodes or leaves of this plan then correspond to the propositions
to be conveyed in the resulting output, each of these propositions
being a semantic chunk that might correspond to a major clause in
the language. These proposition-like elements are often referred to as
messages.
Micro-planning then uses these messages to build a sequence of sentence
plans that determine the content to be realised in each sentence of
the output. This may involve combining messages to build complex
sentences, a process sometimes referred to as aggregation. So, for
example, two propositions that might independently be realised as
the sentences I went to the zoo and Mary went to the zoo can instead
be combined to produce a more complex proposition that might be
realised as the sentence Both Mary and I went to the zoo.
More importantly for our purposes, the micro-planning stage is
generally considered to be the place where we determine how the
entities mentioned in the messages should be referred to. Up until
this point the entities are represented by symbolic identifiers, such as
person001, person002 and zoo167; in our present example, referring
expression generation determines that person001, the speaker, should
be identified by a first person singular pronominal form, person002
should be referred to by her given name, and zoo167 should be re-
ferred to simply in terms of the basic type of the object in question,
with no additional modifiers.
102 ROBERT DALE
Linguistic realisation is the final stage of the process; this maps these
sentence plans, which are still in the form of semantic constructs,
into the appropriate lexico-syntactic material of the target natural
language to produce the final output surface forms. It is at this point
that the words I, Mary and zoo are finally selected.
Within this view of the architecture of natural language generation, our
interest here, then, is in a component process in the micro-planning stage:
the system has determined that it wants to refer to a given entity, indi-
cating this by incorporating the internal symbol for that entity in one or
more messages, and the task of referring expression generation is then to
determine the semantic content of a referring expression that identifies this
entity for the hearer.2
2.2 Generating referring expressions

To make this concrete, suppose we are faced with the need to realize a
proposition like that in Example (1):
(1) owns(j, j1)
Here, j is a symbolic identifier corresponding to some individual whose name
we will assume is John, and j1 is the symbol corresponding to some partic-
ular entity that John owns. The task of the referring expression generator
is to determine how to identify j and j1 for the hearer.
We will ignore the reference to John and assume that we can straightfor-
wardly decide to use his name in order to refer to him. Our current focus of
interest is j1, where in the absence of a proper name that is known to both
the speaker and hearer, we have to reason about the properties we know to
be true of the entity in order to determine how to identify it for our hearer.3
In order to carry out this task, we need to have access to at least three
kinds of knowledge:
Knowledge of the domain: We require some representation of the prop-
erties of the entities in the domain, and of the relationships between
those entities.
2
Other architectural models that carve up the process of natural language generation
in different ways are possible. In almost all such architectures, however, there is a
process whose focus is the mapping from internal symbols to the content of linguistic
expressions that identify their referents.
3
Even this is, of course, a simplification: certainly we can use properties that do not
hold of the entity in question in order to describe it, just so long as the hearer believes
those properties are true of the entity in question. So, when checking into a rather
conservative hotel in a small village, I might successfully refer to my partner as my
wife, even if we are not in fact married. Successful reference requires that I achieve my
intention of singling out for you my intended referent; whether I achieve this by using
properties that are true of the intended referent is less important.
Knowledge of the user: We need to have some representation of the

hearer’s knowledge, so that we can avoid making use of terms that
are alien to him or her. For example, there is little point in using a
highly technical term for a piece of machinery if this term is unfa-
miliar to the hearer. Similarly, unless there is no alternative, there is
little point in referring to some property of the entity that is not easily
accessible to the hearer, such as a serial number on the underside of
a very heavy object.
Knowledge of the discourse: We need to have some representation of
the preceding discourse context so we know what entities have already
been referred to, and what has been said about them. This informs
our use of pronominal forms and of reduced versions of previously used
descriptions.
The discourse context is a major factor in determining the appropriate form
of reference to use on any subsequent reference to an entity, as is demon-
strated by the following examples:
(2) a. John owns a red jumper.
b. He wears it on Sundays.
(3) a. John owns a red jumper and a blue one.
b. He wears the red one on Sundays.
(4) a. John owns a red jumper and a red cardigan.
b. He wears the jumper on Sundays.
In each of these three examples, the proposition expressed in the second sen-
tence is the same, and might be represented as something like
wears(j,j1,on(day0)). However, as the examples demonstrate, the means
used to refer to the jumper worn by John depend on the entities that have
been mentioned in the previous discourse context. In the first example, a
straighforward pronominal reference suffices, since there are no other likely
antecedents for a neuter pronoun in the current context; in the second ex-
ample, there are two jumpers in the context, and so we have to distinguish
which of the two we mean by using a property that is unique to one; and
in the third example, our context contains two entities of different types,
so this can be used as the distinguishing characteristic in singling out our
intended referent. In the simple cases demonstrated here, this seems quite
straightforward and amenable to computational implementation given ap-
propriate representational mechanisms for the various forms of information
involved; but few cases are quite as simple as these.
The approaches taken to this problem in the last 20 years have adopted a
remarkably consistent characterisation of what the problem involves, and in
this regard, the generation of referring expressions is rather unique within
the larger portfolio of research questions that make up natural language
generation. This shared conception of the problem makes it possible to
104 ROBERT DALE
compare and contrast solutions, at least to a much greater degree than is

the case for many other problems. This has resulted in a focus within the
literature on the development of precisely-stated algorithms that explore
different facets of the problem.
2.3 A formal characterisation

The bulk of research into the generation of referring expressions takes as
given the following assumptions.
1. We have a context that consists of a collection of entities, identified
by symbolic identifiers, such as e1 through en .
2. One of these entities is our intended referent.
3. The entities in the context are characterized in terms of a set of at-
tributes and the values that the entities have for these attributes. So,
for example, our knowledge base might represent the fact that entity
e1 has the value pen for the attribute type, and the value red for the
attribute color, where the symbols used to designate attribute values
correspond to conceptual or semantic elements rather than surface
lexical items. We will use the notation hAttribute, Valuei for attribute–
value pairs; for example, hcolor, redi indicates the attribute of color
with the value red.
4. In a typical context where we want to refer to our intended referent ei ,
there will be other entities from which the intended referent must be
distinguished; these are generally referred to as potential distractors.
The goal of referring expression generation can then be stated thus:
We want to find some collection of attributes and their values
that distinguish the intended referent from all the potential dis-
tractors in the context.
If such a collection of attributes and their values can be found, then it
serves as a distinguishing description. Formally, we can characterize this as
follows.
Let r be the intended referent, and C be the set of distractors;
then, a set L of attribute–value pairs will represent a distin-
guishing description if the following two conditions hold:
• C1: Every attribute–value pair in L applies to r: that is,
every element of L specifies an attribute and its value that
r possesses.
• C2: For every member c of C, there is at least one element
l of L that does not apply to c: that is, there is an l in L
that specifies an attribute value that c does not possess. l
is said to rule out c.
We can think of the generation of referring expressions as being governed by

three principles, referred to by Dale (1992) as the principles of adequacy, ef-
ficiency and sensitivity; these are Gricean-like conversational maxims (Grice
1975) framed from the point of view of the specific task of generating refer-
ring expressions.
The first two of these principles are primarily concerned with saying
neither too much nor too little. The principle of adequacy requires that a
referring expression should contain enough information to allow the hearer
to identify the referent, and the principle of efficiency requires that the
referring expression should not contain unnecessary information (since, in
line with Grice’s observations, the hearer will wonder why this information
has been provided). The principle of sensitivity, however, has a different
concern: it specifies that the referring expression constructed should be
sensitive to the needs and abilities of the hearer or reader. Accordingly,
the definition of a distinguishing description specified above should really
include a third component:
• C3: The hearer knows or can easily perceive that conditions C1 and
C2 hold.
In other words, the hearer must realize that the distinguishing description
matches the intended referent and none of the potential distractors, and
ideally this realization should not require a large perceptual or cognitive
effort on the hearer’s part.
3 A history of work in the area
3.1 Early work

As noted earlier, the generation of referring expressions is a component task
in almost all nlg systems. Discussion of relevant issues that arise in the
broader context of nlg system construction can be found in the seminal
early language generation research of Davey (1979), McDonald (1980) and
McKeown (1985). However, in each of these cases, the generation of refer-
ring expressions was only one of many tasks the authors set out to address,
and so only some aspects of the problem were considered.
The first major work to look specifically at the problem of referring ex-
pression generation was that of Appelt (1985). This work contains many
insights into the range of problems that a thoroughgoing solution to the
problem needs to consider, placing the task firmly within the framework of
speech acts (Searle 1969). Building on other work that has attempted to
model speech acts by means of AI-style planning operators, Appelt devel-
oped a plan-based approach to constructing references, whereby the speaker
106 ROBERT DALE
reasons about the effects on the hearer obtained by using particular elements
of content in a referring expression.
3.2 Producing minimal distinguishing descriptions

The earlier works referred to above clearly established the need to view
the generation of referring expressions as a specific task to be addressed
within a generation system; Appelt’s work in particular demonstrated that
the problem is a non-trivial one. However, none of these works provided a
formally well-specified algorithm for carrying out the task.
The principle of efficiency mentioned above suggests that, other things
being equal, we desire an algorithm that will deliver a referring expression
that mentions as few attributes as possible: a minimal distinguishing de-
scription. However, from the point of view of computational complexity,
the problem of constructing a minimal distinguishing description is equiv-
alent to finding the minimal size set cover. This problem is known to be
NP-Hard (Garey & Johnson 1979), and any such algorithm is probably com-
putationally impractical when entities have any more than a small number
of properties that might be used in their descriptions. Much of the work in
providing algorithms for referring expression generation has therefore been
concerned to find heuristic workarounds that avoid these computational
complexity issues.
There is insufficient space here to consider each of the algorithms in the
literature in detail; however, to provide a flavor of what such algorithms
contain, we show here the first of the fully specified algorithms to appear,
as published in (Dale 1992); this has served as a starting point for many of
the subsequently proposed algorithms. The algorithm, which is essentially a
variant of Johnson’s greedy heuristic for minimal set cover (Johnson 1974),
is as shown in Figure 1.
To see how this works, suppose we have a context that contains the
following entities with the properties indicated:
• e1 : htype, peni, hcolor, redi, hbrand, Staedtleri, hsize, largei
• e2 : htype, peni, hcolor, bluei, hbrand, Staedtleri, hsize, smalli
• e3 : htype, peni, hcolor, redi, hbrand, Stabiloi, hsize, largei
• e4 : htype, peni, hcolor, greeni, hbrand, Staedtleri, hsize, largei
Suppose that our intended referent is e1 . Then, the algorithm first chooses
the property of being red, since this rules out more distractors than any
other property possessed by e1 : it rules out both e2 and e4 , whereas the brand
and size attributes only rule out one distractor each, and the type attribute
has no impact at all. We still then have to add additional information to
rule out e3 : here, size does not help, but adding the brand does what is
Let L be the set of properties to be realized in our description; let P be

the set of properties known to be true of our intended referent r, where a
property is a combination of an attribute and a value for that attribute;
and let C be the set of potential distractors in the current context, referred
to here as the contrast set. The initial conditions are thus as follows:
• C = {hall distractorsi};
• P = {hall properties true of ri};
• L = {}
In order to describe the intended referent r with respect to the contrast set
C, we do the following.
1. Check Success:
if |C| = 0 then return L as a distinguishing description
elseif P = ∅ then fail
else goto Step 2.
2. Choose Property:
for each pi ∈ P do: Ci ← C ∩ {x|pi (x)}
Chosen property is pj , where Cj is the smallest set.
goto Step 3.
3. Extend Description (wrt the chosen pj ):
L ← L ∪ {pj }
C ← Cj
P ← P − {pj }
goto Step 1.
Fig. 1: An algorithm for the Greedy Heuristic
needed, since e1 and e3 have different brands. So, for e1 , the attributes color
and brand provide us with a distinguishing description.
On the other hand, if the intended referent was e2 , the color or size alone
would be sufficient; and if the intended referent was e3 , the brand alone
would be sufficient. Verifying that the algorithm does indeed provide these
results is left as an exercise for the reader.
In reality, since noun phrases always (or almost always) contain head
nouns, the head noun indicating the entity’s type is also incorporated into
the resulting description; for completeness, the algorithm in Figure 1 would
have to be augmented to ensure this. In the case of e1 above, the resulting
description might then be realised as the red Stabilo pen.
The algorithm we have just presented is intended to be quite general and
domain-independent; wherever we can characterize the problem in terms
of the basic assumptions laid out earlier, this algorithm can be used to
108 ROBERT DALE
determine the content of context-dependent referring expressions. However,

the algorithm also suffers from some limitations:
• As stated, it may satisfy the principles of adequacy and efficiency,
but it does not pay heed to the principle of sensitivity: there is no
guarantee that the properties selected will even be perceptible to the
hearer. In the scenario above, for example, if the labels that indicate
the brands of the pens are face down on the table, and the pens have no
other brand-distinguishing characteristics, then the brand information
is of no real value for the hearer.
• The algorithm only makes use of one-place predicates (such as the fact
that our whiteboard pen is red), but we can of course also make use
of relational properties (such as the fact that the pen is on the table
rather than on the floor) when describing entities.
Since this early algorithm, a wide range of alternatives have been proposed,
either to address these particular problems or to extend the coverage of the
approach in other ways.
3.3 More efficient algorithms

Although the algorithm above avoids the computational complexity of a
literal interpretation of the principle of efficiency, it is still a relatively ex-
pensive way of computing a description. A second algorithm for referring
expression generation that has had widespread influence is known as the
Incremental Algorithm (Dale and Reiter 1995). This algorithm has three
distinctive properties.
• First, the algorithm sacrifices the goal of finding a minimal distin-
guishing description in the interests of efficiency and tractability. It
does this by considering the attributes that might be used in a descrip-
tion in a predefined order. So, for example, in describing a physical
object, we might first consider its color and then its size as appro-
priate properties to use in its description. This has the consequence
that on occasion the algorithm may include a property whose discrim-
inatory power is made redundant by some subsequently incorporated
property; however, Dale and Reiter argue that this is not necessarily a
bad thing, since human-generated referring expressions also often con-
tain informational redundancy (see, for example, (Pechmann 1989)).
It remains to be seen whether the algorithm provided by Dale and
Reiter provides the same kinds of redundancies as those produced by
humans.
• Second, while the Incremental Algorithm contains a domain-
independent, and therefore quite generally applicable, algorithm for
building up the content of a referring expression, it does so by refer-

ence to a predefined list of properties whose content is dependent on
the domain. This provides a convenient balance between generality
and domain-specificity: in order to use the algorithm in another do-
main, one simply provides an ordering over the attributes available in
that domain.
• Third, when the algorithm considers whether a given property would
be useful to include in a description, it carries out a check to determine
whether the property in question is one that the hearer would be able
to make use of in identifying the referent. In so doing, it also explicitly
aims to address the principle of sensitivity.
The Incremental Algorithm has subsequently served as the basis for a num-
ber of other algorithms in the literature.
3.4 Referring to entities using relations

Another deficiency we noted above with respect to the greedy heuristic algo-
rithm was that it could only make use of one-place predicates in determining
which attributes of an entity to use in its description. The reason for this
limitation is to avoid adding significant complexity to the algorithm: if we
want to build referring expressions of the form the red pen next to the coffee
cup, note that the entity we are relating the intended referent to (in this
example, the coffee cup) is itself an entity for which we have to construct
a referring expression. Once the generator decides that some other entity
has to be introduced in order to identify the intended referent, a recursive
invocation of the process of referring expression generation is required. As
noted by Novak (1988), there is clearly the scope here for an infinite regress,
as in the red pen next to the coffee cup next to the red pen next to the coffee
cup next to . . . . In informal terms it would appear to be straightforward
to deal with such cases appropriately, but capturing the required behavior
algorithmically is a little more difficult.
As a solution, Dale and Haddock (1991) proposed an algorithm based on
constraint satisfaction. This effectively builds up collections of properties
for each of the entities to be described in parallel, and determines when the
properties (or more precisely, the relations) used to describe one entity also
support the identification of another entity.
More recently, Krahmer et al. (2003) have proposed a solution based
on viewing the domain of entities, properties and relations as a labeled
graph, with the problem of referring expression generation then being one of
constructing an appropriate subgraph. This provides a very elegant solution
with well-understood mathematical properties.
110 ROBERT DALE
3.5 Logical extensions: Sets, booleans and quantifiers

So far, we have pointed to algorithms that permit the use of both one-
place predicates and relations in the construction of referring expressions.
Although this provides for a wide range of possibilities, it still does not
encompass all the devices that we as humans use in constructing referring
expressions. Notably, the algorithms discussed so far assume that the in-
tended referent is a singular individual, and that it is to be picked out by
the logical conjunction of the properties in the description.
There are, of course, other possibilities here. For example, we may
sometimes need to refer to sets of entities, as in Please pass me all the pens.
Sometimes it may be convenient to refer to entities not just in terms of
conjunctions of properties, but also via other Boolean possibilities, such as
the use of negation and disjunction. To return to our earlier example, it
might in some circumstances be most appropriate for me to say Please pass
me the red pen that hasn’t run out of ink; and in cases where we want to
refer to a set of entities, there may be no property that is shared by all the
intended referents, requiring an expression like the red pen and the blue pen.
These ideas are explored in detail by van Deemter (2002), who presents
generalizations and extensions of the Incremental Algorithm described above
that cater for these possibilities. Creaney (2002) offers an algorithm that
allows the incorporation of logical quantifiers, such as all and some, into
descriptions.
4 Outstanding issues
In the foregoing we have surveyed, albeit somewhat briefly, the major as-
pects of referring expression generation that have been the focus of attention
in the development of algorithms over the last 15 years. There are many
other algorithms in the literature beyond those presented here, but the
problems addressed are essentially the same. In this section, we turn to
aspects of the problem that have not yet received the same level of detailed
exploration.
4.1 Other forms of anaphoric reference

We have focused here on the construction of definite descriptions: refer-
ences to entities which are in the common ground, either by virtue of their
presence in the environment, or because they have already been introduced
into the discourse. Definite descriptions are not the only form of anaphoric
reference: as we noted at the outset of this paper, we can, of course, also use
pronouns such as it and one-anaphoric expressions such as the red one to
identify intended referents. Although the literature does contain some work
on generating these forms of reference, they have not received the degree of
attention that has been lavished on definite descriptions, and in particular
there has been little work that has seriously attempted to integrate the var-
ious forms of reference in a unified algorithmic framework. This is perhaps
most surprising in the case of pronominal reference, given that this is seen
as such a central problem in natural language understanding.
Other forms of definite reference have also been neglected: in particular,
there is no substantive work that explores the generation of associative
anaphoric expressions, such as the use of a reference like the cap in a context
where the hearer can infer that this must be a first mention of the cap of a
particular pen that is already salient in the discourse.
4.2 Initial reference

An even more striking gap in the literature is any serious treatment of initial
reference, as opposed to subsequent or anaphoric reference to entities which
are already in the common ground. Subsequent references to entities have
a tendency to reuse information used in earlier references; so, for example, I
might introduce an entity as that Stabilo pen on the table and subsequently
refer to it as the Stabilo pen. The generation literature not only tends to
ignore this influence of the initial form of reference on subsequent referring
expressions, but is in general quite silent on the problem of how initial
references can be constructed. There are a great many ways in which I might
refer to a specific entity when introducing it into the discourse, and when we
move beyond simple examples of collections of similar objects, as are often
used in the literature, it becomes clear that the form of initial reference used
has a lot to do with the purpose of the discourse and other aspects of the
discourse context. Thus, my decision as to whether to introduce someone
into our conversation as a man I met at the bus stop, a mountain climber
I met last night, or an interesting guy I met, will depend on what I have
planned in terms of content for the rest of the discourse. The complexities
of planning and reasoning required to explain how this selection process
works are far beyond our current understanding.
4.3 The pragmatics of reference

Following on from the previous point, relatively little work has looked at the
broader context of reference, and in particular the fact that, when we con-
struct a referring expression, we may be doing many things simultaneously
and attempting to achieve a variety of purposes.
Building on the earlier explorations of Appelt (1985), the work of Kro-
nfeld (1990) is a notable exception here; but we lack a worked-out com-
putational theory of how reference functions within the wider context of
112 ROBERT DALE
language use, and in particular we have no formally well-specified explana-

tions of the variety of purposes of reference. The bulk of the work in the
field has focused on referring expressions as linguistic devices to distinguish
intended referents from other entities in a given context, but this is a rather
narrow and low-level view. We also use reference to maintain topic, to shift
topic, to introduce contrasting entities, and to navigate a hearer’s focus of
attention in various ways, amongst a range of other higher level purposes.
These are all relatively unanalyzed notions from a formal perspective, and
the relationships between them need to be worked out more clearly before
we can say that we have a proper computational theory of the use of refer-
ence and the part it plays in language. As is often the case, approaching this
problem from the perspective of natural language generation may provide
insights that can also enrich our understanding of natural language analysis.
4.4 Evaluation
Despite the limitations discussed in the preceding sections, it turns out
that there are, as yet, no significant nlg applications that would provide
a thorough testing ground for the wide range of algorithms that have been
proposed; for most currently practical applications of language generation,
relatively simple techniques will suffice. As a consequence, the bulk of the
algorithms we have mentioned are explored by their designers in relatively
simple toy domains. However, there would clearly be some advantage in
being able to compare the performance of different algorithms against some
standard data set of desirable referring expressions. This raises a number
of questions about how such a corpus might be constructed, and how the
performance of any given algorithm might be measured; some of these issues
are discussed further in (van Deemter et al. 2006) and (Viethen & Dale
2006).
5 Conclusions
As we have tried to show in the preceeding, a significant body of work in the

generation of referring expressions has been developed over the last 20 years,
almost all of which has the same core conception of the task. This makes it
possible for new work to build on existing work, and provides an excellent
basis for comparison of algorithms, although it is not yet the case that such
comparisons have been carried out in an empirically driven fashion using
large data sets of human-generated referring expressions.4
4
A significant exception here is the work reported in Albert Gatt’s forthcoming Ph.D.
(Gatt 2007).
An intriguing question is why it is that a consensus view of the problem of

referring expression generation has developed, where this has not happened
for many other subtasks in nlg. For example, there is much less agreement
on the exact nature of the tasks of text planning, document structuring,
aggregation, and linguistic realisation. Sometimes we find divergence in
terms of the nature of the processing that falls within a particular compo-
nent’s responsibilities; but in many cases the disagreements arise in terms
of the nature of the inputs and the outputs of the process. There are,
indeed, other conceptions of the process of referring expression generation
that we might consider; why, for example, do we draw a distinction be-
tween the determination of the semantic content of a referring expression
and the mapping of this semantic content into a surface linguistic form?
Why do we take the input to the process to be an internal symbol, rather
than some partially developed semantic construct whose content has been
derived to meet other communicative needs? It is likely that existing work
in the area can be extended, without too much effort, to be used in these
alternative architectural models; from this perspective, the current model
of referring expression generation is a useful simplification that allows for
the development of more complex solutions.
Ultimately, however, our algorithms will be truly validated when we
have functioning systems that are required to make decisions about forms
of reference in rich semantic contexts. The applications we alluded to at the
beginning of this paper—embodied conversational agents who take part in
complex dialogues, and personalisation technology for delivering emails and
web pages—will provide these requirements; within the next five to ten years
we can expect to see the presently theoretically-driven results of research in
referring expression generation put to the test in real applications.
REFERENCES
Appelt, Douglas E. 1985. Planning English Sentences. Cambridge: Cambridge
University Press.
Creaney, Norman. 2002. “Generating Descriptions Containing Quantifiers: Ag-
gregation and Search”. Information Sharing: Reference and Presupposi-
tion in Language Generation and Interpretation ed. by Kees van Deemter
& Rodger Kibble (= CSLI Lecture Notes, 143), 265-294. Stanford: CSLI
Publications.
Dale, Robert. 1992. Generating Referring Expressions: Constructing Descrip-
tions in a Domain of Objects and Processes. Cambridge, Mass.: MIT Press.
Dale, Robert & Nicholas Haddock. 1991. “Content Determination in the Gener-
ation of Referring Expressions”. Computational Intelligence 7:4.252-265.
114 ROBERT DALE
Dale, Robert & Ehud Reiter. 1995. “Computational Interpretations of the

Gricean Maxims in the Generation of Referring Expressions”. Cognitive Sci-
ence 19:2.233-263.
Davey, Anthony C. 1979. Discourse Production. Edinburgh: Edinburgh Univer-
sity Press.
Garey W. & D. Johnson. 1979. Computers and Intractability: A Guide to the
Theory of NP-Completeness. San Francisco: W. H. Freeman.
Gatt, Albert. 2007. Generating Coherent References to Multiple Entities. Ph.D.
Dissertation, University of Aberdeen, Aberdeen, Scotland.
Grice, H. P. 1975. “Logic and Conversation”. Syntax and Semantics Volume 3:
Speech Acts ed. by P. Cole and J. Morgan, 43-58. New York: Academic
Press.
Johnson, D. S. 1974. “Approximation Algorithms for Combinatorial Problems”,
Journal of Computer and Systems Sciences, vol. 9, 256-278.
Krahmer, Emiel, Sebastiaan van Erk & André Verleg. 2003. “Graph-based Gen-
eration of Referring Expressions”. Computational Linguistics 29:1.53-72.
Kronfeld, Amichai. 1990. Reference and Computation. Cambridge: Cambridge
University Press.
McDonald, David D. 1980. Natural Language Production as a Process of Decision
Making under Constraints. Ph.D. dissertation, MIT, Cambridge, Mass.
McKeown, Kathleen R. 1985. Text Generation: Using Discourse Strategies and
Focus Constraints to Generate Natural Language Text. Cambridge: Cam-
bridge University Press.
Novak, Hans-Joachim. 1988. “Generating Referring Phrases in a Dynamic Envi-
ronment”. Advances in Natural Language Generation vol. 2 ed. by Michael
Zock and Gerard Sabah, 76-85. London: Pinter.
Pechmann, Thomas. 1989. “Incremental Speech Production and Referential
Overspecification”. Linguistics 27:1.89-110.
Reiter, Ehud & Robert Dale. 2000. Building Natural Language Generation Sys-
tems. Cambridge: Cambridge University Press.
Searle, J. R. 1969. Speech Acts: An Essay in the Philosophy of Language. Cam-
bridge: Cambridge University Press.
van Deemter, Kees. 2002. “Generating referring expressions: Boolean extensions
of the Incremental Algorithm”. Computational Linguistics 28:1.37-52.
van Deemter, Kees, Ielka van der Sluis & Albert Gatt. 2006. “Building a Se-
mantically Transparent Corpus for the Generation of Referring Expressions”.
Proceedings of the International Conference on Natural Language Generation
15-16 July, 130-132. Sydney, Australia.
Viethen, Jette & Robert Dale. 2006. “Algorithms for Generating Referring Ex-
pressions: Do They Do What People Do?”. Proceedings of the International
Conference on Natural Language Generation 15-16 July, 63-70. Sydney, Aus-
tralia.
A Data-driven Approach to Pronominal Anaphora
Resolution for German
Erhard W. Hinrichs, Katja Filippova & Holger Wunsch

SfS-CL, University of Tübingen
Abstract
This paper reports on a hybrid architecture for computational
anaphora resolution (Car) of German that combines a rule-based
pre-filtering component with a memory-based resolution module (us-
ing the Tilburg Memory Based Learner – TiMBL). The Car exper-
iments performed on the TüBa-D/Z treebank data corroborate the
importance of modelling aspects of discourse structure for robust,
data-driven anaphora resolution. The best result with an F-measure
of 0.734 achieved by these experiments outperforms the results re-
ported by Schiehlen (2004), the only other study of German Car
that is based on newspaper treebank data.
1 Introduction
The present study focuses exclusively on the resolution of pronominal

anaphora with NP antecedents for German, where the term pronoun is used
as a cover term for 3rd person reflexive, possessive, and personal pronouns.
The purpose of this paper is threefold: (i) to apply the machine learning
paradigm of memory-based learning to the task of Car for German, (ii) to
provide a series of experiments that corroborate the importance of modelling
aspects of discourse structure for robust, data-driven anaphora resolution
and that induce more fine-grained information from the data than previous
approaches, (iii) to apply Car to a corpus of German newspaper texts,
yielding competitive results for a genre that is known to be considerably
more difficult than the Heidelberg corpus of tourist information texts (see
Kouchnir (2003) for more discussion on this issue).
2 Previous research on Computational Anaphora Resolution
While early work on Car was carried out almost exclusively in a rule-based
paradigm, there have been numerous studies during the last ten years that
have demonstrated that machine-learning and statistical approaches to Car
can offer competitive results to rule-based approaches. In particular, this
more recent work has shown that the hand-tuned weights for anaphora
116 ERHARD W. HINRICHS, KATJA FILIPPOVA & HOLGER WUNSCH
resolution introduced by Lappin & Leass (1994), by Kennedy & Bogu-

raev (1996), and Mitkov (2002) can be successfully simulated by data-driven
methods (Preiss 2002b).
While there is a rich diversity of methods that have been applied to
Car, there is also a striking convergence of grammatical features that
are used as linguistic knowledge across different algorithms. Most ap-
proaches base their resolution algorithm on some combination of distance
between pronouns and potential antecedents, grammatical agreement be-
tween pronouns and antecedents, constituent structure information, gram-
matical function assignment for potential antecedents, and the type of NP
involved (e.g. whether it is definite or indefinite). The combined effect of
these features is to establish a notion of discourse salience that can help
rank potential antecedents. An important aspect of discourse salience is
its dynamic character since there seems to be a strong correlation be-
tween salience and discourse recency. This aspect of salience was first
captured by Lappin & Leass (1994) and by Kennedy & Boguraev (1996)
through the use of a decay function that decreases the score of a poten-
tial antecedent each time a new sentence is processed. In data-driven ap-
proaches this decay function is simulated by the distance measure between
pronoun and antecedent.
Previous studies of Car have focused on English and have been based
on text corpora of fairly modest size, however see Ge et al. (1998) for an
exception. The only previous studies for German have been presented by
Strube & Hahn (1999), based on centering theory, Müller et al. (2002),
using co-training, and by Kouchnir (2003), who applies boosting. Schiehlen
(2004) provides an overview of adapting Car algorithms to German that
were originally developed for English. While memory-based learning (Mbl)
has been successfully applied to a wide variety of Nlp tasks, there has been
only one previous study of Car using Mbl (Preiss 2002a).
3 Data
The present research focuses on German and utilizes the TüBa-D/Z (Hin-
richs et al. 2004), a large treebank of German newspaper text that has
been manually annotated with constituent structure and grammatical rela-
tions such as subject, direct object, indirect object and modifier. These types
of syntactic information have proven crucial in previous Car algorithms.
More recently, the TüBa-D/Z annotations have been further enriched to
also include anaphoric relations (Hinrichs et al. 2004), thereby making the
treebank suitable for research on Car. German constitutes an interest-
ing point of comparison to English since German exhibits a much richer
inflectional morphology and a relatively free word order at the phrase level.
PRONOMINAL ANAPHORA RESOLUTION FOR GERMAN 117
The sample sentences in (1) illustrate the annotation of referentially depen-

dent relations in the TüBa-D/Z anaphora corpus.
(1) [1 Der neue Vorsitzende der Gewerkschaft Erziehung und
Wissenschaft] heißt [2 Ulli Thöne]. [3 Er] wurde gestern mit 217
von 355 Stimmen gewählt.
‘The new chairman of the union of educators and scholars is called
Ulli Thöne. He was elected yesterday with 217 of 355 votes’.
In (1) a coreference relation exists between the noun phrases [1] and [2],
and an anaphoric relation between the noun phrase [2] and the personal
pronoun [3]. Since noun phrases [1] and [2] are coreferential, there exists
an implicit anaphoric relation between NP [1] and NP [3], with all three
NPs belonging to the same coreference chain. In keeping with the Muc-6
annotation standard1 , the anaphoric relation of a pronoun is established
only to its most recently mentioned antecedent.
The TüBa-D/Z currently consists of 766 newspaper texts with a total
of 15260 sentences and an average number of 19.46 sentences per text. The
TüBa-D/Z contains 7606 reflexive and personal pronouns, 2195 possessive
pronouns, and 99585 markables (i.e. potential antecedent NPs). The num-
ber of pronouns in the TüBa-D/Z corpus is considerably larger than in the
hand-annotated portion of the German Negra newspaper corpus (2198
possessive pronouns, 3115 personal pronouns) utilized in Schiehlen (2004)
and substantially larger than the German Heidelberg tourism information
corpus (36924 tokens, 2179 anaphoric NPs) used by Müller et al. (2002) and
by Kouchnir (2003).
4 Experiments
The experiments are based on a hybrid architecture that combines a rule-

based pre-filtering module with a memory-based resolution algorithm.
The purpose of the pre-filtering module, which has been implemented
in the Xerox Incremental Deep Parsing System (Xip) (Aı̈t-Mokhtar et
al. 2002), is to retain only those NPs as potential antecedents that match
a given pronoun in number and gender. Due to the richness of inflectional
endings in German, this pre-processing step is crucial for cutting down the
size of the search space of possible antecedents. Without Xip pre-filtering,
the TüBa-D/Z corpus yields a total of 1,412,784 of anaphor/candidate-
antecedent pairs. This number represents all possible ways of pairing a
pronoun with an antecedent NP in each of the 766 texts of the TüBa-D/Z
corpus. After pre-filtering this number is reduced to appr. 190,000 pairs.
1
See www.cs.nyu.edu/cs/faculty/grishman/COtask21.book 1.html.
pronoun/antecedent cataphoric parallel clause-mate distance

On Od Oa Pred
discourse history Mod Opp Fopp App
Title Conj Hd Other
pronoun reflexive possessive
Table 1: Feature set
The memory-based resolution module utilizes the Tilburg Memory Based

Learner (TiMBL), version 5.1 (Daelemans et al. 2005). Unless otherwise
specified, the experiments use the default settings of TiMBL.
4.1 Feature set

In the experiments, the TiMBL learner was presented with the set of fea-
tures summarized in Table 1. The features on line 1 all refer to relational
properties of the pronoun and potential antecedents. The feature parallel
encodes whether the anaphor and the potential antecedent have the same
grammatical function. The features on line 3 refer to the pronoun alone
and encode whether it is possessive or reflexive. The features on line 2
are designed to model the discourse history in terms of the grammatical
functions of NPs that are in the same coreference class as the candidate
antecedent. The grammatical functions are those provided by the syntactic
annotation of the TüBa-D/Z treebank: On (for: subject), Oa (for: direct
object), Od (for: dative object), Pred (for: predicative complement), Mod
(for: modifier ), etc.
The main purpose of the experiments reported here was to systemati-
cally study the impact that information about discourse context has on the
performance of data-driven approaches to Car. To this end, we designed
two experiments that differ from each other in the amount of information
about the coreference chains that are encoded in the training data.
4.2 Knowledge-rich encoding of instances – Experiment I

In Experiment I, complete information about coreference chains is used for
training. In example (1) the three bracketed NPs form one coreference
chain since the first two NPs are coreferent and the pronoun is anaphoric
to both. Accordingly, for example (1), two positive instances are created as
shown in Table 2. Binary features are encoded as yes/no. Features related
to distances are given numeric values from 1 to 30, with a special value of
31 reserved for the value undefined. Inspection of the data showed that a
context window of size 30 contains the antecedent in more than 99% of all
cataphoric parallel clause mate distance On Od Oa Pred Mod

no no no -1 -1 -31 -31 -31 -31
no no no -1 -1 -31 -31 -1 -31
Opp Fopp App Title Conj Hd Other refl poss class
-31 -31 -31 -31 -31 -31 -31 no no yes
-31 -31 -31 -31 -31 -31 -31 no no yes
Table 2: Positive knowledge-rich sample instances
cases. For technical reasons, the numeric values are prefixed by a dash in
order for TiMBL to treat them as discrete rather than continuous values.
The first vector in Table 2 displays the pairing of the pronoun with the
NP der neue Vorsitzende der Gewerkschaft Erziehung und Wissenschaft,
the first NP in the text. This NP is the subject (On) of its clause. The
value for this grammatical function is −1 since the NP occurs in the clause
immediately preceding the pronoun. The second vector pairs the two pre-
ceding NPs with the pronoun er. Since the NP Ulli Thöne is in predicative
position (Pred) and occurs in the same clause as the subject NP der neue
Vorsitzende der Gewerkschaft Erziehung und Wissenschaft, the value for
these two grammatical functions On and Pred is −1. Thus, the intended
semantics of the features for each grammatical function is to encode the dis-
tance of the last occurrence of a member of the same coreference class with
that particular grammatical function.2 Using mention counts of grammat-
ical functions instead of distances did not significantly change the results,
and were therefore omitted from the experiments.
The sample vectors in Table 2 illustrate the incremental encoding of in-
stances. The initial vector encodes only the relation between the pronoun
and the antecedent first mentioned in the text. Each subsequent instance
adds one more member of the same coreference class. This incremental
encoding follows the strategy of Kennedy & Boguraev (1996) and reflects
a dynamic modelling of the discourse history. The last item in the vec-
tor indicates class membership. In the memory-based encoding used in the
experiments, anaphora resolution is turned into a binary classification prob-
lem. If an anaphoric relation holds between an anaphor and an antecedent,
then this is encoded as a positive instance, i.e., as a vector ending in yes.
If no anaphoric relation holds between a pronoun and an NP, then this
encoded as a negative instance, i.e., as a vector ending in no.
2
A similar encoding is also used by Preiss (2002a).
4.3 Knowledge-poor encoding of instances – Experiment II

Experiment II uses a more knowledge-poor encoding of the data and pairs
each pronoun only with the most recent antecedent in the same coreference
class, thereby losing both information inherent in the entire coreference class
and at the same time truncating the discourse history. Using example (1)
once more as an illustration, two positive instances are created. The first
vector is the same as in Experiment I. The second vector retains value −1
only for Pred, the grammatical function of the candidate itself. The value
of On is now undefined (−31).
4.4 Two variants

For each of the two experiments described above, two variants were con-
ducted. In one version, the evaluation focused on the closest antecedent to
calculate the result for recall, precision and F-measure.3 In a second vari-
ant, the most confident antecedent was chosen. The confidence measure was
calculated by the function conf(t, ck ) := Pndk di for classes c1 . . . cn , and class
i=1
distributions d1 . . . dn (where di is the number of neighbors that classified
the test instance t as belonging to class ci ).
5 Evaluation
To assess the difficulty of the pronoun resolution task for the TüBa-D/Z
corpus, we established as a baseline a simple heuristic that picks the closest
preceding subject as the antecedent. This baseline is summarized in Table 3
together with results of the experiments described in the previous section.
For each experiment ten-fold cross-validation was performed, using 90% of
the corpus for training and 10% for testing.
5.1 Results of Experiments I and II

Both experiments significantly outperform the baseline approach in
F-measure. The findings summarized in Table 3 corroborate the impor-
tance of modelling the discourse history for pronoun resolution since the
results of Experiment I are consistently better than those of Experiment
II. The present paper does not have to rely on the hand-coding of a decay
function. Rather, it induces the relevant aspects of the discourse history
directly from the instance base used by the memory-based learner.
3
Throughout this paper the term F-measure implies the parameter setting of β = 1.
av. precision av. recall av. F-measure

Baseline 0.500 0.647 0.564
Experiment I
closest antecedent 0.826 0.640 0.721
most conf. antecedent 0.801 0.621 0.700
Experiment II
most conf. antecedent 0.786 0.606 0.684
Table 3: Summary of results
6 most informative features: clause-mate,parallel,possessive,Fopp,On,Od

3 least informative features: Title, distance,Conj
Table 4: Summary of feature weights based on GainRatio values
It is also noteworthy that in Experiment I the strategy of picking the clos-

est antecedent outperforms the strategy of picking the most confident an-
tecedent chosen by TiMBL.
5.2 Benchmarking feature impact

It is instructive to benchmark the importance of the features used in the
experiments. This can be ascertained from the weights that the gain ratio
measure (as the default feature weighting used by TiMBL) assigns to each
feature. Gain ratio is an entropy-based measure that assigns higher weights
to more informative features. Table 4 displays the top six most informative
features and the three least informative features in decreasing order of infor-
mativeness. The fact that the features clause-mate, parallel, and possessive
are the three most informative features concurs with the importance given
to such features in hand-crafted algorithms for Car. However, the ranking
of some of the features included in Table 4 is rather unexpected. The fact
that the grammatical function Fopp (for: optional PP complement) out-
ranks the grammatical function subject (On) runs counter to hand-coded
salience rankings found in the literature which give the feature subject the
highest weights among all grammatical functions. That the Fopp feature
outranks the function subject is due to the fact that the presence of an op-
tional PP-complement is almost exclusively paired with negative instances.
This finding points to an important advantage of data-driven approaches
over hand-crafted models. While the latter only take into account posi-
tive evidence, data-driven models can profit from considering positive and
av. precision av. recall av. F-measure

Baseline 0.500 0.647 0.564
Experiment I
Table 5: Summary of best results
negative evidence alike. Perhaps the most surprising result is the fact that
distance between anaphor and antecedent is given the second lowest weight
among all eighteen features. This sharply contrasts with the intuition of-
ten cited in hand-crafted approaches that the distance between anaphor
and antecedent is a very important feature for an adequate resolution algo-
rithm. The reason why distance receives such a low weight might well have
to do with the fact that this feature becomes almost redundant when used
together with the other distance-based features for grammatical functions.
The empirical findings concerning feature weights summarized in Ta-
ble 4 underscore the limitation of hand-crafted approaches that are based
on the analysts’ intuitions about the task domain. In many cases, the rel-
ative weights of features assigned by data-driven approaches will coincide
with the weights assigned by human analysts and fine-tuned by trial and
error. However, in some cases, feature weightings obtained automatically by
data-driven methods will be more objective and diverge considerably from
manual methods, as the weight assigned by TiMBL to the feature distance
illustrates.
5.3 Optimization by fine-tuning of TiMBL parameters

It has been frequently observed (e.g. by Hoste et al. (2002)) that the de-
fault settings provided by a classifier often do not yield the optimal results
for a given task. The Car task for German is no exception in this regard.
TiMBL offers a rich suite of parameter settings that can be explored for
optimizing the results obtained by its default settings. Some key parame-
ters concern the choice of feature distance metrics, the value of k for the
number of nearest neighbors that are considered during classification as well
as the choice of voting method among the k-nearest neighbors used in clas-
sification. TiMBL’s default settings provide the feature distance metric of
weighted overlap (with the gain ratio measure for feature weighting), k = 1
as the number of k-nearest neighbors, and majority class voting.
To assess the possibilities of optimizing the results of Experiments I and
II, the best result (Experiment I with closest antecedent) was chosen as a
starting point. The best results, shown in Table 5, were obtained by us-
ing TiMBL with the following parameters: modified value distance metric
(Mvdm), no feature weighting, k = 3, and inverse distance weighting for
class voting. The optimizing effect of the parameters is not entirely sur-
prising.4 The Mvdm metric determines the similarities of feature values by
computing the difference of the conditional distribution of the target classes
for these values. For informative features, δ(v1 , v2 ) will on average be large,
while for less informative features δ will tend to be small. Daelemans et
al. (2005) report that for Nlp tasks Mvdm should be combined with values
of k larger than one. The present task confirms this result by achieving
optimal results for a value of k = 3.
6 Comparison with related work
The only previous study of German Car that is based on newspaper tree-
bank data is that of Schiehlen (2004). Schiehlen compares an impressive
collection of published algorithms, ranging from reimplementations of rule-
based algorithms to reimplementations of machine-learning and statistical
approaches. The best results of testing on the Negra corpus were achieved
with an F-measure of 0.711 by a decision-tree classifier, using C4.5 and a
pre-filtering module similar to the one used here. The best result with an
F-measure of 0.734 achieved by the memory-based classifier and the Xip-
based pre-filtering component outperforms Schiehlen’s results, although a
direct comparison is not possible due to the different data sets.
7 Summary and future work
The current paper presents a hybrid architecture for computational anaphora

resolution (Car) of German that combines a rule-based pre-filtering com-
ponent with a memory-based resolution module (using the Tilburg Memory
Based Learner – TiMBL). The data source is provided by the TüBa-D/Z
treebank of German newspaper text that is annotated with anaphoric rela-
tions. The Car experiments performed on these treebank data corroborate
the importance of modelling aspects of discourse structure for robust, data-
driven anaphora resolution. The best result with an F-measure of 0.734
achieved by the memory-based classifier and the Xip-based pre-filtering
component outperforms Schiehlen’s results, although a direct comparison is
not possible due to the different data sets.
4
See Hoste et al. (2002) for the optimizing effect of Mvdm in the word sense disam-
biguation task.
REFERENCES
Aı̈t-Mokhtar, Salah, Jean-Pierre Chanod & Claude Roux. 2002. “Robustness
Beyond Shallowness: Incremental Deep Parsing”. Natural Language Engi-
neering 8:2-3.121-144.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot & Antal van den Bosch. 2005.
TiMBL: Tilburg Memory Based Learner – version 5.1–Reference Guide. Tech-
nical Report ILK 01-04. Induction of Linguistic Knowledge, Computational
Linguistics, Tilburg University.
Ge, Niyu, John Hale & Eugene Charniak. 1998. “A Statistical Approach to
Anaphora Resolution”. Sixth Workshop on Very Large Corpora, 161-170.
Montreal, Canada.
Hinrichs, Erhard, Sandra Kübler, Karin Naumann, Heike Telljohann & Julia
Trushkina. 2004. “Recent Developments in Linguistic Annotations of the
TüBa-D/Z Treebank”. Third Workshop on Treebanks and Linguistic Theo-
ries, 51-62. Tübingen, Germany.
Hoste, Veronique, Iris Hendrickx, Walter Daelemans & Antal van den Bosch.
2002. “Parameter Optimization for Machine-Learning of Word Sense Disam-
biguation”. Natural Language Engineering 8:4.311-325.
Kennedy, Christopher & Branimir Boguraev. 1996. “Anaphora for Everyone:
Pronominal Anaphora Resolution Without a Parser”. 16th International
Conference on Computational Linguistics, 113-118. Copenhagen, Denmark.
Kouchnir, Beata. 2003. A Machine Learning Approach to German Pronoun
Resolution. Master’s thesis. School of Informatics, University of Edinburgh.
Lappin, Shalom & Herbert Leass. 1994. “An Algorithm for Pronominal Anaphora
Resolution”. Computational Linguistics 20:4.535-561.
Mitkov, Ruslan. 2002. Anaphora Resolution. Amsterdam: John Benjamins.
Müller, Christoph, Stefan Rapp & Michael Strube. 2002. “Applying co-training
to reference resolution”. 40th Annual Meeting of the Association for Com-
putational Linguistics (ACL’02 ), 352-359. Philadelphia, Penn.
Preiss, Judita. 2002a. “Anaphora Resolution With Memory-Based Learning”.
5th UK Special Interest Group for Computational Linguistics (CLUK5 ), 1-8.
Preiss, Judita. 2002b. “A Comparison of Probabilistic and Non-Probabilistic
Anaphora Resolution Algorithms”. Student Workshop at the 40th Annual
Meeting of the Association for Computational Linguistics (ACL’02 ), 42-47.
Philadelphia, Penn.
Schiehlen, Michael. 2004. “Optimizing Algorithms for Pronoun Resolution”.
20th International Conference on Computational Linguistics (COLING-2004 ),
515-521. Geneva, Switzerland.
Strube, Michael & Udo Hahn. 1999. “Functional Centering - Grounding Ref-
erential Coherence in Information Structure”. Computational Linguistics
25:3.309-344.
Efficient Spam Analysis for Weblogs through
URL Segmentation
Nicolas Nicolov & Franco Salvetti

Umbria, Inc. & University of Colorado at Boulder
Abstract
This paper shows that in the context of spam classification of we-
blogs, segmenting the urls beyond the standard punctuation is help-
ful. Many spam urls contain phrases in which the words are glued
together in order to avoid spam filtering techniques based on uni-
grams and punctuation segmentation. An approach for segmenting
tokens into words forming a phrase is proposed and validated. The
accuracy of the implemented system is better than that of humans
(78% vs. 76%).
1 Introduction
The blogosphere, which is the part of the web consisting of personal elec-
tronic journals (weblogs) currently encompasses 107.7 million pages [2007-
10-02] and doubles in size every 5.5 months (Technorati 2006). The large
amount of information contained in the blogosphere has been proven valu-
able for applications such as market intelligence, trend discovery, and opin-
ion tracking (Hurst 2005). Unfortunately in the last several years the blo-
gosphere has been heavily polluted with spam. For certain subsets of the
blogosphere the spam content can be as high as 70-80%. Spam weblogs
(called splogs) are weblogs used for the purposes of promoting affiliated
websites. Splogs skew statistics. Thus, methods for filtering out splogs on
large collections of weblogs are needed. Sophisticated content-based meth-
ods or methods based on link analysis (Gyöngyi et al. 2004, Andras et al.
2005) can be slow. We propose a fast, lightweight and accurate method
merely based on analysis of the url of the web page itself without con-
sidering (nor downloading) the content of the page (Kan 2004, Kan & Thi
2005).
For qualitative analysis of the contents of the blogosphere it is acceptable
to eliminate data from analysis as long as the remaining part is representa-
tive and spam-free. For example, in determining the public opinion about
a new product it is sufficient to determine the ratio of the people that like
the product vs. the people that dislike it rather than the actual number.
Therefore, considering the vasts amount of weblogs available it is acceptable
for a spam classifier to lower recall in order to improve precision, especially
126 NICOLAS NICOLOV & FRANCO SALVETTI
when used as a pre-filter. Our method reaches 93.3% of accuracy in classi-

fying a web page in terms of spam or good if 49.1% of the data are left aside
(labeled as unknown). If all data needs to be classified our method achieves
78% accuracy which is higher than the average accuracy of humans (76%)
on the url spam classification task.
Spammers, in creating splogs (Wikipedia 2006:Splog), aim to increase
the traffic to specific websites. To do so, they frequently communicate
a concept (e.g., a service or a product) through a short, sometimes non-
grammatical phrase embedded in the url of the page (e.g.,
http://adult-video-mpegs.blogspot.com). We want to build a classifier us-
ing machine learning techniques which leverages the language used in these
descriptive urls in order to classify weblogs as spam or good. We built an
initial Language Model (lm) based classifier on the tokens of the urls after
tokenizing on punctuation. We ran the system and got an accuracy of 72.2%
which is close to the accuracy of humans—76% (the baseline is 50% as the
training data is balanced). When we did error analysis on the misclassified
examples we observed that many of the mistakes were on urls that contain
whole phrases (in the sense of sequence of words rather than syntactic con-
stituents) glued together as one token (e.g., www.dailyfreeipod.com). Had
the words in these phrases been segmented the initial system would have
classified the url correctly. We, thus, turned our attention to additional
segmenting of the urls beyond just punctuation and using this ‘aggressive’
segmentation in classification.
Training a segmenter on standard available text collections (e.g., The
Penn TreeBank or The British National Corpus) seemed not the way to
procede because the lexical items used, and the sequence in which words
appear differ from the usage in the urls. Given that we are interested in
unsupervised lightweight approaches for url segmentation one possibility
is to use the urls themselves after segmenting on punctuation and to try to
learn the segmenting (the majority of urls are naturally segmented using
punctuation as we shall see later). We trained a segmenter on the urls,
unfortunately this method did not provide sufficient improvement over the
system that does not use additional segmentation. We hypothesized that
the content of the spam pages corresponding to the urls could be used as a
corpus to learn the segmentation. We crawled the 20K pages corresponding
to the 20K urls labeled as spam and good in the training set, converted
them to text, tokenized and used the token sequences as training data for
the segmenter. This led to a statistically significant improvement of 5.8%
of the accuracy of the spam filter.
The structure of the paper is as follows: in Section 2 we discuss the
notion of splogs in the blogosphere and how it is reflected in urls. We then
describe segmentation in Section 3 (the heart of the paper) and a classi-
SPAM CLASSIFICATION OF WEBLOGS 127
fication approach (Section 4) which uses the segmentation. Experimental

results including the performance of humans are shown in Section 5 and
discussed in Section 6. We highlight future directions in Section 7 and
conclude in Section 8.
2 Engineering of splogs
“Splogs, are weblog sites which the author uses only for promoting affiliated
websites. The purpose is to increase the PageRank of the affiliated sites,
get ad impressions from visitors, and/or use the blog as a link outlet to
get new sites indexed. Content is often nonsense or text stolen from other
websites with an unusually high number of links to sites associated with the
splog creator which are often disreputable or otherwise useless websites.”
(Wikipedia 2006:Splog).
Spammers indicate the semantic content of the webpage directly in the
url. A Uniform Resource Locator (url), or web address, is a standardized
address name layout for a resource (a document, image, etc.) on the internet
(or elsewhere). The currently used url syntax is specified in the internet
standard RFC 17381 :
resource type://hostname.domain:port/file path name#anchor
While the resource type and domain are a small set, the hostname,
file path name and anchor present a perfect opportunity to indicate seman-
tic content by using descriptive phrases—often noun groups (non-recursive
noun phrases) like http://www.adultvideompegs.blogspot.com.2 There are dif-
ferent varieties of sites promoted by spam:
1. commercial products (especially electronics);
2. vacations;
3. mortgages;
4. adult-related.
Users don’t want to see spam web pages in their search results and market
intelligence applications are affected when data contains spam. To address
this, search engines provide filtering.
Simple approaches to filtering use keyword spotting (or unigram models)—
if the webpage or its url contain certain tokens the page is eliminated from
the results (or ranked lower). For spotting keywords in the urls a url
is often tokenized on punctuation symbols. To avoid being identified as a
splog one of the creative techniques that spammers use is to glue words
1
See also RFC 3986 on Uniform Resource Identifiers (URIs).
2
Non-spam sites also communicate concepts through the urls:
http://showusyourcharacter.com.
together into longer tokens which will not match the filter keywords (e.g.,
http://businessopportunitymoneyworkathome.coolblogstuff.com).
Another approach to dealing with spam is having a list of spam websites
(SURBL 2006). Such approaches based on blacklists are now less effective
because the available free bloghost tools for creating blogs have made it
very easy for spammers to automatically create new splogs which are not
on the blacklist.
The spam classifier uses a segmenter which aggressively splits the url
(Section 3), and then the token sequence is used in a supervised learning
framework for classification (Section 4).
3 URL segmentation
The url segmentation problem at a first glance looks deceptively simple:

given a sequence of words without spaces between them identify the words.
But what is a word? One of the approaches considered below will take the
stance that a word is an entry in a dictionary. Yes, in advance we already
know that dictionaries don’t list all words. Another question is even if
somehow all possible ‘words’ which are substrings of the initial string are
identified (some words can overlap with others) how does the system pick
the right sequence of consecutive words.
If an input token has n characters there can be 2n−1 possible segmenta-
tions (there are n−1 positions and each position is a potential break point).
Thus, due to their large number considering all possible segmentations and
picking the best one is not practical.
The segmenter first does the easy job—it tokenizes the urls on puctu-
ation symbols (., -, , /, ?, =). Then the current url tokens are examined
for further possible segmentation. We explore a number of techniques:
1. dictionary-based;
2. symmetric sliding window: 6-gram with single break, and 6-gram sliding
window with a backoff of a symmetric 4-gram (again single break);
3. non-symmetric, multi-break, 7-gram sliding window with 6-, and 5-gram
backoff; and
4. contextual.
For all techniques we first generalize certain characters (e.g., digits).
3.1 Dictionary-based URL segmentation

With the dictionary-based approach the segmenter goes through the current
token from the left to right and in a greedy manner looks for the longest
match with a word in a known set of words. If a match is found the current
position advances after the match; otherwise the current position moves
• abc
|{z} • remaining characters
matched word
↑ ↑
ini new
pos pos
Fig. 1: Forward maximum match
to the next character. Such an algorithm is also called forward maximum

match. Segmenting known words that are pre-defined in a dictionary is
often referred to as ‘word breaking’ (Gao et al. 2005). Two issues with the
above formulation are:
1. No dictionary is exhaustive. In particular, dictionaries don’t include all
morphological variations (especially productive morphological processes);
2. The greedy technique might initially select a longer word while the solution
might involve a shorter form (loved•ays vs. love•days; we use the symbol
‘•’ to indicates a point where a break occurs or is predicted).
To address these issues we consider extending a found match: ing, ed, s,
doubling final consonant, etc. and backtracking to explore shorter matches.
The maximum match idea can be applied backwards from right to left (back-
ward maximum match).
3.2 Symmetric sliding window

The segmenter uses a sliding window of n characters (we used n = 6). Going
from left to right the segmenter decides whether to split after the current
third character. The segmentation decisions are based on counts collected
1. d i e ? t s t hatwork
2. d i e ◦ t s thatwork
3. d i e t ◦ s t i atwork
4. di e t s • t h a twork
5. diets • t h a ? t w o rk
6. diets • t h a ◦ t w ork
7. diets • t h a t • w o r k
Fig. 2: Workings of the symmetric sliding window with back-off

during training. Figure 2 illustrates the initial steps of the processing of

www.dietsthatwork.com when considering the token dietsthatwork. The ‘?’
indicates that the left and right tri-grams have not been encountered in the
training data; the symbol ‘◦’ indicates that the left and right tri-grams are
kept together; and as above ‘•’ indicates a break. For example, in the case
of d i e ? t s t on line 1 in Figure 2 the algorithm considers how many
times we have seen in the training data the n-gram ‘dietst’ vs. ‘die tst’.
If both counts are zero (as on line 1) the 6-gram symmetric sliding window
algorithm simply moves to the next character position. An extension we
have explored backs-off to a more general symmetric context of 4-gram
characters (shown on lines 2 and 6). If the counts for the 4-grams are zero
the default decision is not to split at the current position.
1. i m i n h e a ven
2. i m i n h e aven
3. i m i n h eaven
4. i m i n heaven
5. • i • m • in • h e a v e n
6. • i • m • in • • heaven •
7. i • m • in • heaven
Fig. 3: Workings of the multi-break segmenter
3.3 Non-symmetric, multi-break

Investigating the output of the symmetric window approach we observed
that the contexts were too specific—precision was high but recall was low
(as we will see in Section 5). Hence, we considered the 4-gram back-off.
This substancially improved recall but precision was lowered as the 4-gram
contexts are beginning to be too general. There are non-symmetric con-
texts which are not as general—after the symmetric 6-gram A B C - X Y Z
we considered non-symmetric contexts A B C - X Y and B C - X Y Z . We
generalized this notion of non-symmetric contexts and allowed multiple
breaks to be predicted simultaneously (including at the boundaries). We
also started with longer contexts of 7-grams. This algorithm also addressed
cases of short words as in: www . i • m • in • heaven . com. Figure 3 exem-
plifies the entire processing. In the step between lines 6 and 7 the system
simply removes consecutive breaks and breaks at the boundaries.
feature abbr
L known word Lw
L known word (morph. extended) Lmw
R known word Rw
L known suffix Ls
L maximal known suffix Lms
R known prefix Rp
Lw & Rw LwRw
Ls & Rp LsRp
L & R char unigrams 1-1
L & R char bigrams 2-2
L & R char trigrams 3-3
L bigram & R trigram (char) 2-3
L trigram & R bigram (char) 3-2
Table 1: Segmentation features for MaxEnt; L = ‘to the left there is a
. . . ’; R = ‘to the right there is a . . . ’
3.4 Contextual
Finally we considered an approach where diverse information about the
segmentation can be used by the system in a unified way. We employed a
Maximum Entropy framework and at a position between two characters in
the string we considered the following features (contextual predicates) again
motivated by error analysis of the previous systems as well as linguistic
intuitions (cf. Table 1).
feature example
Lw complex|andconfused
Lmw weekly |tipsandtricks
Rw speed|racer
Ls wellness|formulas
Lms institutionalized|models
Rp quick|antiaging
LwRw keydata|recovery
LsRp dating|unleashed
Table 2: Examples for the MaxEnt features
Table 2 shows examples of the word and affix features. The ‘|’ symbol indi-
cates the current position. The underlined substrings refer to the triggers
for the corresponding features. ‘Lmw’ refers to there being a morphological
variant (extension) of a known word to the left of the current position, e.g.,
weekly |. . . ; a maximal suffix is a suffix which is not a prefix of another
suffix (ing is a suffix but it is not maximal because of the composite suf-
fix ings). Note how the character n-grams subsume the symmetric sliding
window approaches above in Section 3.2.
4 URL classification
For spam classification a Naı̈ve Bayes classifier is used. Given a token

sequence T = ht1 , . . . , tn i, as provided by the segmenter, the class ĉ ∈ C =
{spam, good} is decided as:
P (c) · P (T |c)
ĉ = arg max P (c|T ) = arg max
c∈C c∈C P (T )
= arg max P (c) · P (T |c)
c∈C
n
Y
= arg max P (c) · P (ti |c)
c∈C
i=1
In the last step we assume conditional independence between the features.

We use simple Laplace smoothing (add one) and the individual probabilities
are computed as:
count(ti , c) + 1
P ∗ (ti |c) =
Nc + Vc
where: count(ti , c) is the number of time the token ti occurred in in the
collection for category c; Nc is the total number of tokens in the collection
for category c; Vc is the number of distinct tokens in the collection for
category c.
We have also explored simple voting techniques for determining the class
ĉ:
( Pn
spam, if sgn i=1 sgn (P (ti|spam) − P (ti |good)) = 1
ĉ =
good, otherwise
Because we are interested in having control over the precision of the classifier
we introduce a score meant to be used for deciding whether to label a url
as unknown.
P (spam|T ) − P (good|T )
score(T ) =
P (spam|T ) + P (good|T )
If the score(T ) exceeds a certain threshold τ we label T as spam or good

using the greater probability of P (spam|T ) or P (good|T ). To control the
accuracy of the classifier we can tune τ . For instance, in order to achieve
93.3% of accuracy we set τ = 0.75 which implied a recall of 50.9%.
# of # spam # good
splits URLs URLs
1 2,235 2,274
2 868 459
3 223 46
4 77 7
5 2 1
6 4 1
8 3 -
Total 3,412 2,788
Table 3: Number of splittings in a URL
5 Experiments and results
First we discuss the segmenter. 10,000 spam and 10,000 good weblog urls
and their corresponding webpages were used for the experiments. The
20,000 weblog HTML pages are used to induce the segmenter. The first
experiment was aimed at finding how common was segmentation as a phe-
nomenon. The segmenter was run on the actual training urls. The number
of urls that are additionally segmented besides the segmentation on punc-
tuation are reported in Table 3.
The multiple segmentations need not all occur on the same token in the
url after initial segmentation on punctuation.
The various segmenters were then run on a separate test set of 1,000
urls for which the ground truth for the segmentation was marked. For
the MaxEnt model we used a feature count cutoff of 7—this affects mostly
the lexical and chaacter n-gram features. The results are in Table 4. OOV
System Precision Recall F-measure

Dict (no OOV) 95.79% 97.13% 96.45%
Dict 43.33% 87.86% 58.04%
6-gram 84.31% 48.84% 61.85%
6-gram & 4-gram 70.92% 75.85% 73.30%
Multi-break 82.67% 90.18% 86.26%
Contextual 87.96% 91.07% 89.48%
Table 4: Performance of the segmenter
means ‘out-of-vocabulary’ and refers to the dictionary-based system which

knows all the tokens but uses a greedy search. Figure 4 shows long tokens
which are correctly split.
cash • for • your • house

unlimitted • pet • supllies
jim • and • body • fat
weight • loss • product • info
kick • the • boy • and • run
bringing • back • the • past
food • for • your • speakers
Fig. 4: Correct segmentations
The spam classifier was then run on the test set. The results are shown in
Table 5.
accuracy 78%
prec. spam 82%
rec. spam 71%
f-measure spam 76%
prec. good 74%
rec. good 84%
f-measure good 79%
Table 5: Classification results
The performance of humans on this task was also evaluated. Eight indi-
viduals performed the spam classification just looking at the unsegmented
urls. The scores for the human annotators are given in Table 6. The aver-
age accuracy of the humans (76%) is slightly lower than that of the system
(78%).
From an information retrieval perspective if only 50.9% of the urls
are retrieved (labelled as either spam or good and the rest are labelled
as unknown) then of the spam/good decisions 93.3% are correct. This is
relevant for cases where a url spam filter is in cascade followed by, for
example, content-based spam classifier.
6 Discussion
The system performs better with the aggressive segmentation because the
system has been forced to smooth on fewer occasions. For instance given
Mean σ
accuracy 76% 6.71
precision spam 83% 7.57
recall spam 65% 6.35
f-measure spam 73% 7.57
precision good 71% 6.35
recall good 87% 6.39
f-measure good 78% 6.08
Table 6: Results for the human annotators; µ is the mean; σ is the standrd
deviation
the input url www.ipodipodipod.com in the system which segments solely on

punctuation both the spam and the good model will have to smooth and
the results depend merely on the smoothing technique.
Even if we reached the average scores of humans we expect to be able
to improve the system further as the maximum accuracy among the human
annotators is 90%.
Among the errors of the segmenter the most common were related to
plural nouns (‘girl•s’ vs. ‘girls’) and past tense of verbs (‘dedicate•d’
vs. ‘dedicated’) whereas common mistakes for the url classifier involve
urls that do not contain enough information even for humans to classify
correctly. Kolari et al. (2006) mention classification of spam weblogs based
on urls. Mishne et al. (2005) present work on spam comment identification.
Shih & Karger (2004) use urls for classification.
7 Future work
We are exploring other classifiers for the contextual segmentation model,

including Robust Risk Minimization, Support Vector Machines, etc. For
the segmentation task we are also exploring finite state approaches. As
mentioned above we consider the url spam detector as a filter followed by
content models that consider the text in the weblog, structural characteris-
tics of the HTML, the outward links from a page, etc.
8 Conclusions
We have presented techniques for identifying whether a url indicates spam

or not. We considered an approach where we initially segment the url
and then perform the classification. The technique is simple and efficient,
yet very effective—our system reaches an accuracy of 78% while humans
perform at 76%.
REFERENCES
Andras, T. S., A. Benczúr, Károly Csalogány & M. Uher. 2005. “SpamRank —
Fully Automatic Link Spam Detection”. 1st Int. Workshop on Adversarial
Information Retrieval on the Web, WWW-2005.
Gao, J., M. Li & C.-N. Huang. “Chinese Word Segmentation and Named Entity
Recognition: A Pragmatic Approach”. Computational Linguistics 31:4.531-
574, December 2005.
Gyöngyi, Z., H. Garcia-Molina & J. Pedersen. 2004. “Combating Web Spam
with TrustRank”. 30th Int. Conference on Very Large Data Bases (VLDB),
271-279. Toronto, Canada.
Hurst, Matthew. 2005. “Deriving Marketing Intelligence from Online Discus-
sion”. 11th ACM SIGKDD Int. Conference on Knowledge Discovery in Data
mining (KDD’05 ), 419-428. Chicago, Illinois, U.S.A.
Kan, M.-Y. 2004. “Web Page Classification without the Web Page”. 13th Int.
World Wide Web Conference (WWW’2004 ), New York. poster.
Kan, M.-Y. & H. O. N. Thi. 2005. “Fast Webpage Classification Using URL Fea-
tures”. Conference on Information and Knowledge Management (CIKM’05 ),
325-326, Bremen, Germany.
Kolari, P., T. Finin & A. Joshi. 2006. “SVMs for the Blogosphere: Blog Identi-
fication and Splog Detection”. Computational Approaches to Analyzing We-
blogs, number SS-06-03, pages ed. by N. Nicolov, F. Salvetti, M. Liberman &
J. H. Martin, 92-99. Menlo Park, Calif.: AAAI Press.
Mishne, Gilad, D. Carmel & R. Lempel. 2005. Blocking Blog Spam with
Language Model Disagreement. 1st International Workshop on Adversar-
ial Information Retrieval on the Web (AIRWeb’05), at WWW2005, ed. by
B. D. Davison, 1-6. Chiba, Japan.
Shih, L.K. & D. Krager. 2004. “Using URLs and Table Layout for Web Classi-
fication Tasks”. 13th Int. World Wide Web Conference (WWW’2004 ), New
York.
SURBL. 2006. “Surbl – Spam Uri Realtime Blocklists”. http://www.surbl.org.
Technorati. 2006. “State of the Blogosphere February 2006 part 1: On
Blogosphere Growth”. http://technorati.com/weblog/2006/ 02/81.html.
Wikipedia. 2006. “Splog (spam blog)”. http://en.wikipedia.org/wiki/Splog.
Document Classification Using Semantic Networks
with an Adaptive Similarity Measure
Filip Ginter, Sampo Pyysalo & Tapio Salakoski
Turku Centre for Computer Science (TUCS )
University of Turku, Finland
Abstract
We consider supervised document classification where a semantic net-
work is used to augment document features with their hypernyms.
A novel document representation is introduced in which the contri-
bution of the hypernyms to document similarity is determined by
semantic network edge weights. We argue that the optimal edge
weights are not a static property of the semantic network, but should
rather be adapted to the given classification task. To determine the
optimal weights, we introduce an efficient gradient descent method
driven by the misclassifications of the k-nearest neighbors (kNN)
classifier. The method iteratively adjusts the weights, increasing or
decreasing the similarity of documents depending on their classes.
We thoroughly evaluate the method using the kNN classifier and
show it to statistically significantly outperform the bag-of-words rep-
resentation as well as the more advanced hypernym density represen-
tation of Scott & Matwin (1998).
1 Introduction
Semantic networks have been shown to offer opportunities for improving

the performance of machine learning methods in a variety of classification
tasks. Several semantic similarity measures have been proposed and ap-
plied, in particular to word sense disambiguation-type problems (see e.g.,
Patwardhan, Banerjee & Pedersen (2003) for a recent evaluation). Meth-
ods applying semantic networks to document classification have also been
proposed, although they are not as widely studied. For instance, semantic
networks have been used to augment terms with their synonyms (Gómez-
Hidalgo & deBuenaga Rodrı́guez 1997) and hypernyms (Scott & Matwin
1998), thus incorporating the semantic network information on the level of
features. Here we consider the special case of similarity through the hy-
ponymy/hypernymy relation, which is the focus of most proposed measures
of semantic relatedness. The similarity of terms is typically presented as a
static property that can be directly measured either from the semantic net-
work (Agirre & Rigau 1996), from external unlabeled data (Resnik 1995), or
using a combination of the two (Jiang & Conrath 1997). We have previously
138 FILIP GINTER, SAMPO PYYSALO & TAPIO SALAKOSKI
traveller
air traveller space traveller
astronaut cosmonaut
Fig. 1: Hyponymy relationships in a semantic network
argued that in supervised classification tasks the similarity of terms should

be considered dependent on the task and data (Ginter, Pyysalo, Boberg,
Järvinen & Salakoski 2004). Simply put, terms commonly related to doc-
uments of the same class should be considered similar, while terms related
to documents of different classes should be considered dissimilar to aid the
classification method in distinguishing between the classes.
Consider the fragment of a semantic network shown in Figure 1. Com-
mon measures of semantic similarity would assign high relatedness to the
terms astronaut and cosmonaut as they are immediate hyponyms of the
same term, space traveller. In most classification tasks, considering astro-
naut and cosmonaut essentially synonymous terms would be appropriate.
In a document representation, this can be naturally realized by consider-
ing the term space traveller to be highly relevant to documents containing
either of its two hyponyms. However, we suggest that in a hypothetical
document classification task where the goal is to distinguish between doc-
uments about American and Russian space efforts, space traveller should
not be considered relevant to documents containing either astronaut or cos-
monaut in order to avoid increasing the similarity between documents of
different classes.
We now discuss some desirable properties for a data-dependent seman-
tic document representation and means of realizing them. We assume that
each document has been assigned a set of direct terms from a semantic net-
work (e.g., terms that are mentioned in the document). The representation
should then determine the relevance of each semantic network term to each
document. It is natural to limit this measure of relevance between 0 and 1,
and to assign the value 1 to each direct term. Hypernyms of direct terms
are typically relevant and their relevance values should be allowed to vary
in a data-dependent fashion. We suggest that terms that are neither di-
rect terms nor hypernyms of direct terms in a document are not relevant
to that document and can be assigned the relevance value 0: for example,
if astronaut is the only direct term, there is no reason to assume that ei-
ther cosmonaut or air traveller are relevant. Finally, relevance should not
increase with distance from the direct term: if, for example, astronaut is
the only direct term, traveller should be considered at most as relevant as
space traveller. This implies a representation where relevance propagates
DOCUMENT CLASSIFICATION USING SEMANTIC NETWORKS 139
from direct terms to more general terms, decreasing according to the data-
dependent strengths of connections between hyponyms and hypernyms.
We have previously introduced a data-driven method for determining hy-
pernym relevance for document classification, where relevance was limited
to the two cases ‘fully relevant’ and ‘irrelevant’ (Ginter, Pyysalo, Boberg,
Järvinen & Salakoski 2004). In this paper, we present a method that applies
a finer-grained concept of relevance and shows a more substantial perfor-
mance advantage.
2 Document representation
Let T be a set of possible terms that are organized in a semantic network

according to the hyponymy relation. Let t, t′ ∈ T be terms. We denote by
t′ ≺⋆ t the relation when t′ is a hyponym of t. Further, t′ ≺ t denotes the
relation when t′ is an immediate hyponym of t, that is, the relation encoded
by the semantic network. Hyponymy (≺⋆ ) is the transitive closure of im-
mediate hyponymy (≺). The immediate hyponymy relation is commonly
represented using a directed graph, such as in Figure 1, with an edge from
t′ to t whenever t′ ≺ t. Hyponymy (≺⋆ ) is by definition an asymmetric
relation, and the corresponding directed graph is thus acyclic.
We now define a document representation that implements the intuitions
discussed in Section 1. Let D be a set of documents and let d ∈ D be a
document with the set of direct terms T (d) ⊆ T . As discussed previously,
the document d is represented by the direct terms T (d) and their hypernyms.
We implement this property through the notion of activation at (d) ∈ [0, 1]
of a term t ∈ T with respect to the document d, which represents the
relevance of t to d. For any t ∈ T (d), at (d) is by definition set to 1. The
activation of any other term recursively depends on the activations of its
immediate hyponyms so that the activation of hypernyms of direct terms
typically results in a non-zero value. The activation of the remaining terms
is zero by definition.
A term t ∈ T is affected by a document d if t ∈ T (d) or ∃ t′ ∈ T (d) :
t′ ≺⋆ t. That is, t is affected by d if t is either a direct term of d or a
hypernym of a direct term. The set of terms affected by a document d is
denoted Aff (d). Let further the base of a term t ∈ T with respect to a
document d be Baset (d) = {t′ | t′ ≺ t, t′ ∈ Aff (d)}.
Unless t is a direct term, at (d) is based on the activations of the terms
in Baset (d). For each t′ ∈ Baset (d), the contribution of t′ to the activation
of t is controlled by a weight wt′ t associated with the relationship t′ ≺ t. By
definition, 0 ≤ wt′ t ≤ 1 for all weights. The activation at (d) is computed as
the weighted sum of the activations of the terms in Baset (d). Thus,
r Aff (d) = {e, a, k, b, r, f } , Baser (d) = {a, b} , Basea (d) = {e} ,

1/4 1/2 Baseb (d) = {e, f } , Basef (d) = {k} , Baseg (d) = Baseh (d) =
1
a b c = Basej (d) = Basek (d) = Basel (d) = Based (d) = Basee (d) =
0 1 1 1 = Basec (d) = ∅, ae (d) = aa (d) = ak (d) = 1, ad (d) = ag (d) =
d e f 1
1 1 1/2 1/2 1
= ah (d) = aj (d) = al (d) = ac (d) = 0, af (d) = wkf ak (d) = ,
2
g h j k l web ae (d) + wf b af (d) 3 war aa (d) + wbr ab (d) 1
ab (d) = = , ar (d) = =
2 4 2 2
Fig. 2: The representation of a document with direct terms

T (d) = {e, a, k} (bold circles). Terms affected by d are depicted in gray

1P
 if t ∈ T (d),
wt′ t at′ (d)
t′ ∈Baset (d)
at (d) = |Baset (d)|
if t ∈ Aff (d) \ T (d), (1)


0 otherwise.
Finally, each document d is represented by an activation vector
a(d) = (at1 (d), . . . , atm (d)),
where ti ∈ T . Figure 2 illustrates the concepts introduced so far.
3 Weight update algorithm

a(d)
Let â(d) be the normalized activation vector of d, that is, â(d) = ka(d)k .
The similarity between any two documents di , dj ∈ D is defined as the
dot-product of their normalized activation vectors
X
sim(di, dj ) = â(di ) · â(dj ) = ât (di )ât (dj ) . (2)
t∈T
Given a training set of documents D, a document similarity measure, and a

document d to be classified, the k-nearest neighbors (kNN) classifier com-
putes a set N(d, k, D) ⊆ D \ {d} of k documents most similar to d, also
termed as the k-neighborhood. The class assigned to d is the majority class
among the documents in its k-neighborhood.
The weight update algorithm is based on kNN and implements the fol-
lowing intuition: a misclassification of a document d means that the major-
ity of the documents in N(d, k, D) are of a different class than d. The mis-
classification could therefore be corrected by modifying the k-neighborhood
so that it would contain a majority of documents with the same class as
that of d. This can be achieved by adjusting the semantic network weights

so that the similarity between d and its k-neighbors with a different class
decreases and with the same class increases. As there is only one, global
set of weights, any change affects all the documents and therefore directly
optimizing the similarity of d with its k-neighbors also indirectly affects the
similarity of d with all other documents. Generally, documents with the
same class are “pulled” towards d while documents with another class are
“pushed” away from d. Naturally, this effect is strongest for the k-neighbors
of d, whose similarity with d is optimized directly. Other variations of the
general scheme are possible as well. For example, the k-neighborhoods
could be optimized for all documents rather than only for those that were
misclassified.
Let us consider two documents di , dj ∈ D. The objective is to either
increase or decrease sim(di , dj ) by modifying the semantic network weights.
Let w = (w1 , . . . , wn ) , where n is the total number of weights, be the vector
of all weights in the semantic network in an arbitrary but fixed order. We
then define the weight gradient ∇w(di , dj ) with respect to sim(di , dj ) as

∂ sim(di , dj ) ∂ sim(di , dj )
∇w(di , dj ) = ,..., .
∂w1 ∂wn
Adding the gradient ∇w(di , dj ) to the weight vector w leads to an increase
of sim(di , dj ), while subtracting ∇w(di , dj ) from w leads to a decrease of
∂ sim(di , dj )
sim(di , dj ). The formula to compute the partial derivative ∂ wrs
of
sim(di , dj ) with respect to a weight wrs is specified jointly by Equations 3,
7, and 8 in Appendix 1 which also details the derivation of the formula.
A learning rate constant η ∈ R, η > 0, is introduced to control the
magnitude of the weight adjustment by the gradient. The weight vector w
is then updated according to the rule
w ← w + δη∇w(di , dj ) ,
where δ = +1 (resp. δ = −1) if sim(di , dj ) is to be increased (resp. de-
creased). The complete weight update algorithm is introduced in Figure 3.
4 Evaluation
We evaluate the proposed representation using ten datasets consisting of

articles from the PubMed biomedical literature database, where each article
has been manually assigned a set of relevant terms from the MeSH ontology
(we use the 2005 version).
Each of the ten datasets, corresponding to the randomly selected jour-
nals in Ginter, Pyysalo, Boberg, Järvinen & Salakoski (2004), consists of
w ← 1̄
until stopping criterion satisfied do:
w′ ← 0̄
for each document di ∈ D:
classify di using D \ {di } as training set
if misclassified di then:
for each dj ∈ N (di , k, D \ {di }):
if class (di ) = class (dj ) then δ ← +1 else δ ← −1
w′ ← w′ + δ∇w(di , dj )
w ← w + η · w′
for each weight wk in w:
wk ← max{0, min{1, wk }}
Fig. 3: Pseudocode of the weight update algorithm
2000 randomly selected articles from the journal and 2000 randomly se-
lected articles that have appeared elsewhere. Additionally, we form for
each dataset seven down-sampled training sets, the largest consisting of
1000 positive and 1000 negative examples (the other 2000 being used for
testing). Each task is then a binary classification problem where the doc-
uments must be classified either as originating from the journal or not.
Since the journals are usually focused on a subdomain, these classification
problems model document classification by topic.
We evaluate the proposed document representation with and without
the adaptive component. In the fixed representation, the semantic network
weights are all set to one constant value wf ix , 0 ≤ wf ix ≤ 1, determined
from the data. In the adaptive representation, the weights are computed
using the algorithm introduced in Section 3, with a stopping criterion where
iteration ends when the average performance increase on the training set
over the last three rounds drops below 0.05%.
We compare the performance of the fixed and adaptive representations
against two baselines, the commonly used bag-of-words (BoW) represen-
tation and a modification of the hypernym density (HD) representation of
Scott & Matwin (1998) in which each document di is represented by a mul-
tiset consisting of all direct terms of di , together with their hypernyms up
to a distance h from any of the direct terms. We found that in our case
coercing the multiset into a set results in an improvement of performance,
and thus we apply this step in our evaluation. Further, infrequent terms are
not discarded and the HD normalization step is performed by the classifiers.
The main evaluation is performed using the kNN classifier. The param-
eters of the various methods — k for BoW, k, h for HD, k, wf ix for fixed,
and k, η for adaptive — are selected by cross-validated grid search on the
training set. We also perform a limited evaluation using Support Vector
Absolute difference [percentage units] DOCUMENT CLASSIFICATION USING SEMANTIC NETWORKS 143
5 8
7 Adaptive vs. BoW
3
6
8 HD vs. BoW
8 Fixed vs. BoW
5
Adaptive vs. Fixed

8 5 6 Adaptive vs. HD
4
2 3 5 HD vs. Fixed
0 3 2 5 4
4 5
3
1 3 5
2 3
1 3
3 5 5
2
1 3
1 4
0 3
1
0 1 1 0 2
0 0
0
31 62 125 250 500 1000 2000

Training set size [number of documents]
Fig. 4: Pairwise method differences and their per-dataset and overall

statistical significances for kNN. Results averaged over all datasets. The
number displayed by each difference denotes the number of datasets for
which the difference was statistically significant (p < 0.05, 5 × 2cv test).
Full — as opposed to empty — circle denotes the average difference over
all ten datasets being statistically significant (p < 0.05, t-test)
Machines (SVM) (Vapnik 1998). For this evaluation, only the SVM reg-
ularization parameter C is separately selected, while other parameters are
set to their kNN optimum values.
We measure the performance using average 5 × 2 cross-validated accu-
racy. To assess the statistical significance for individual datasets, we use the
robust 5 × 2 cross-validation test of Alpaydin (1999). To assess the overall
significance across all datasets, we use the two-tailed paired t-test.
Average performance differences when using kNN are plotted in Figure 4
and average absolute performance figures are given in Table 1a. The adap-
tive method statistically significantly outperforms all others for all except
the smallest training set size. Further, the relative decrease in error rate
grows almost monotonically, indicating that the adaptive method works
better given more data. As the documents were assigned on average only
10 MeSH terms and the MeSH ontology contains almost 23000 nodes, reli-
able optimization of the edge weights is expected to be difficult with very
small training sets. Nevertheless, the adaptive method works remarkably
well with as few as 62 training examples.
When applied to SVMs, the fixed and HD representations outperform
the adaptive method (Figure 1b). The difference is statistically significant
for most training set sizes larger than 62. Clearly, the adaptive method
does not optimize a criterion beneficial for SVM classification, and hence
modification of the adaptive strategy is required to increase applicability
BoW HD ∆ Fix. ∆ Ad. ∆ BoW HD ∆ Fix. ∆ Ad. ∆

31 72.5 74.8 8.4 74.4 6.9 75.9 12.4 75.6 79.1 14.3 78.6 12.3 78.1 10.2
62 73.9 77.3 13.0 77.2 12.6 79.6 21.8 78.2 81.6 15.6 81.5 15.1 80.9 12.4
125 75.9 79.4 14.5 79.1 13.3 82.0 25.3 80.9 84.0 16.2 84.3 17.8 83.0 11.0
250 77.8 81.8 18.0 81.4 16.2 84.0 27.9 83.2 85.8 15.5 85.9 16.1 84.7 8.9
500 79.6 83.4 18.6 83.1 17.2 85.4 28.4 85.4 87.5 14.4 87.8 16.4 86.7 8.9
1000 81.0 84.9 20.5 84.5 18.4 86.6 29.5 87.5 88.9 11.2 89.2 13.6 88.4 7.2
2000 82.4 85.9 19.9 85.7 18.8 87.4 28.4 88.9 90.0 9.9 90.3 12.6 89.7 7.2
(a) kNN (b) SVM
Table 1: Accuracy measurements averaged over all ten datasets for each of
the seven training set sizes. The ∆ columns represent the relative error
rate decrease in percents over the BoW baseline
to SVM classification. Nevertheless, the fixed representation outperforms

both the BoW and HD representations for larger training set sizes (the
latter difference is mostly not statistically significant), indicating that the
document representation itself benefits also SVMs.
We have applied semantic networks to develop an adaptive document repre-

sentation. We have evaluated the representation and the algorithm against
the BoW, fixed and HD representations with ten randomly selected datasets
from the PubMed biomedical literature database.
Our results indicate that the proposed adaptive representation can sta-
tistically significantly outperform the baselines over a range of training set
sizes from 62 to 2000, with the relative decrease in error rate ranging be-
tween 20–30% against BoW and 10–14% against the fixed and HD repre-
sentations. A preliminary evaluation with SVMs indicated that while the
semantic network-based document representations give a statistically signif-
icant improvement over the BoW baseline, the gradient descent component
of the adaptive method requires modification.
We conclude that the proposed method can successfully determine term-
document relevance in a data-dependent manner, increasing performance in
supervised document classification tasks. As future work, several aspects of
the method can be studied, such as the setting of the initial weights, the
learning rate, and the stopping criterion. An additional natural extension of
the method is to consider relationships other than hyponymy as activation
paths. Careful analysis of these and other properties may offer further
opportunities for the use of semantic networks in document classification.
REFERENCES
Agirre, Eneko & German Rigau. 1996. “Word Sense Disambiguation Using
Conceptual Density”. Proceedings of the 16th Conference on Computational
Linguistics, Copenhagen, Denmark (COLING’96 ), 16-22. San Francisco:
Morgan Kaufmann.
Alpaydin, Ethem. 1999. “Combined 5 × 2 cv F -Test for Comparing Supervised
Classification Learning Algorithms”. Neural Computation 11:8.1885-1892.
Ginter, Filip, Sampo Pyysalo, Jorma Boberg, Jouni Järvinen & Tapio Salakoski.
2004. “Ontology-Based Feature Transformations: A Data-Driven Approach”.
Proceedings of the 4th International Conference on Advances in Natural Lan-
guage Processing, Alicante, Spain (EsTAL’04 ), ed. by José L. Vicedo et al.
(= Lecture Notes in Computer Science, 3230 ), 279-290. Heidelberg: Springer.
Gómez-Hidalgo, José M. & Manuel de Buenaga Rodrı́guez. 1997. “Integrat-
ing a Lexical Database and a Training Collection for Text Categorization”.
Proceedings of the ACL/EACL Workshop on Automatic Information Extrac-
tion and Building of Lexical Semantic Resources, Madrid, Spain, ed. by Piek
Vossen et al., 39-44. Association for Computational Linguistics.
Jiang, Jay J. & David W. Conrath. 1997. “Semantic Similarity Based on Corpus
Statistics and Lexical Taxonomy”. Proceedings of the International Con-
ference on Research in Computational Linguistics, Tainan, Taiwan (RO-
CLING’97 ), 19-33.
Patwardhan, Siddharth, Satanjeev Banerjee & Ted Pedersen. 2003. “Using Mea-
sures of Semantic Relatedness for Word Sense Disambiguation”. Proceed-
ings of the Fourth International Conference on Intelligent Text Processing
and Computational Linguistics, Mexico City, Mexico (CICLing’03 ), ed. by
Alexander F. Gelbukh, 241-257. Heidelberg: Springer.
Resnik, Philip. “Using Information Content to Evaluate Semantic Similarity in
a Taxonomy”. 1995. Proceedings of the 14th International Joint Conference
on Artificial Intelligence, Montreal, Canada (IJCAI’95 ), ed. by Christopher
Mellish, 448-453. San Francisco: Morgan Kaufmann.
Scott, Sam & Stan Matwin. “Text Classification Using WordNet Hypernyms”.
1998. Proceedings of the COLING-ACL Workshop on Use of WordNet in
Natural Language Processing Systems, Montreal, Canada, ed. by Sanda Hara-
bagiu, 38-44. Association for Computational Linguistics.
Vapnik, Vladimir. 1998. Statistical Learning Theory. New York: Wiley.
∂ sim(di , dj )
Appendix 1 Derivation of the formula for ∂ wrs
This appendix details the derivation of the formula to compute the value of
∂ sim(di , dj )
∂ wrs
. The partial derivative is solved and the final formula is obtained
jointly from Equations 3, 7, and 8.
∂ sim(di , dj ) X ∂ ât (di ) X ∂ ât (dj )

= ât (dj ) + ât (di)
∂ wrs t∈T
∂ wrs t∈T
∂ wrs (3)
def
= Q(di , dj ) + Q(dj , di )
We now solve Q(di , dj ); the formula for Q(dj , di ) follows by symmetry.
a (d )
∂ at (di )
∂ ât (di ) ∂ ka(di )k
t i
ka(di )k − at (di ) ∂ ka(di )k
= = ∂ wrs ∂ wrs (4)

∂ wrs ∂ wrs ka(di )k2
qP
∂ ka(di )k ∂ u∈T [au (di )]
2
1 X ∂ au (di)

= = au (di ) (5)
∂ wrs ∂ wrs ka(di )k u∈T ∂ wrs
Combining (4) and (5) yields
P
∂ at (di )
∂ ât (di ) ∂ wrs
ka(di )k − ât (di ) u∈T au (di ) ∂ ∂auw(di)
=
rs (6)
∂ wrs ka(di)k2
Substituting from (6) into (3) gives Q(di , dj ) =
P ∂ at (di ) P P ∂ au (di )

t∈T ∂ wrs ka(d i )kât (d j ) − t∈T ât (d j )ât (d i ) u∈T au (d i ) ∂ wrs
ka(di )k2
P ∂ at (di ) P P
t∈T ∂ wrs ka(di )kât (dj ) − u∈T au (di ) ∂ ∂auw(di)
t∈T ât (dj )ât (di )
= rs
ka(di)k2
P ∂ at (di ) P
(2) t∈T ∂ wrs ka(di )kât (dj ) − u∈T au (di ) ∂ ∂auw(drsi ) sim(di , dj ) (7)
=
ka(di)k2
P ∂ at (di )
u|t t∈T ∂ wrs
(ka(di )kât (dj ) − at (di ) sim(di, dj ))
=
ka(di)k2
P ∂ at (di )
t∈T ∂ wrs
(ât (dj ) − ât (di ) sim(di , dj ))
=
ka(di )k
/ Aff (di )\T (di) then at (di) is by (1) constant and consequently ∂∂awt (drsi ) =
If t ∈
0. For t ∈ Aff (di) \ T (di ),
(
∂ at (di ) (1) 1 X ar (di ) if (t′ , t) = (r, s)
= ∂ at′ (di ) (8)
∂ wrs |Baset (di )| ′ wt′ t ∂ wrs otherwise.
t ∈Baset (di )
The recursion in (8) ends when (t′ , t) = (r, s). Finally, substituting from
∂ sim(d , d )
(8) into (7) and then into (3) completes the derivation of ∂ wrsi j .
Text Summarization for Improved Text Classification
Rada Mihalcea & Samer Hassan
University of North Texas
Abstract
This paper explores the possible benefits of the interaction between
text summarization and text classification. Through experiments
performed on standard data sets, we show that techniques for extrac-
tive summarization can be effectively combined with classification
methods, resulting in improved performance in a text categorization
task. Moreover, comparative results suggest that the synergy be-
tween text summarization and text classification can be regarded as
a new task-oriented evaluation testbed for automatic summarization.
1 Introduction
Text categorization is a problem typically formulated as a learning task,

where a classifier learns how to distinguish between categories in a given
set, using features automatically extracted from a collection of training
documents. In addition to the learning methodology itself, the accuracy
of the text classifier also depends to a large extent upon the classification
granularity, and on how well separated are the training or test documents
belonging to different categories. For instance, it may be a relatively easy
task to learn how to classify documents in two distinct categories such as
computer science and music, but it may be significantly more difficult to
distinguish between documents pertaining to more closely related topics
such as operating systems and compilers.
Intuitively, if the gap between categories could be increased, the classi-
fication performance would raise accordingly, since the learning task would
be simplified by removing features that represent potential overlap between
categories. This is in fact the effect achieved through feature weighting
and selection (Yang & Pedersen 1997, Ng et al. 1997), which was found
to improve significantly over the case where no weighting or selection is
performed. We propose a new approach for reducing the potential overlap
between documents belonging to different categories, by using a method
that extracts the essence of a text prior to classification. In this way, only
the important sections of a document participate in the learning process,
and thus the performance of the text classification algorithm could be im-
proved. Through experiments on standard test collections, we show that
significant improvements can be achieved on a text categorization task by
classifying extractive summaries, rather than entire documents. We believe
148 RADA MIHALCEA & SAMER HASSAN
that these results have implications not only on the problem of text clas-
sification, where the method proposed provides the means to improve the
categorization accuracy with error reductions of up to 19.3%, but also on
the problem of text summarization, by suggesting a new application-based
evaluation of tools for automatic summarization.
2 Text categorization using extractive summarization
Provided a set of training documents, each document assigned with one or

more categories, the task of text categorization consist of finding the most
probable category for a new unseen document, based on features extracted
from training examples. The classification process is typically performed
using information drawn from entire documents, and this may sometime
result in noisy features. To lessen this effect, we propose to feed the text
classifier with summaries rather than entire texts, with the goal of removing
the less-important, noisy sections of a document prior to classification.
The text classification process is thus modified to integrate an extractive
summarization tool that determines the top N most important sentences
in each document. Starting with a collection of texts, every document is
replaced by its summary, followed by the application of a regular text cat-
egorization algorithm that determines the most likely category for a given
test document. Figure 1 illustrates the classification process based on ex-
tractive summarization.
TRAINING NEW
DOCUMENTS
DOCUMENT
Extractive Summarization
SUMMARIES
SUMMARY
TEXT
Learning
CLASSIFIER
CATEGORY
Fig. 1: Text classification using extractive summarization

TEXT SUMMARIZATION FOR IMPROVED TEXT CLASSIFICATION 149
2.1 Extractive summarization

To summarize documents, we use a graph-based algorithm that generates
extractive summaries by finding the most important sentences in the text, as
described in our previous work (Mihalcea & Tarau 2004, Mihalcea & Tarau
2005). Shortly, the summarization algorithm starts by building a graph that
represents the text, with sentences stored in the nodes, and edges drawn
using a measure of lexical similarity. Next, a random-walk algorithm such
as PageRank (Brin & Page 1998) or hits (Kleinberg 1999) is run on the
graph, and the sentences with the highest score are selected for inclusion in
the summary.
The selection of this particular summarization algorithm is motivated by
several reasons. First, the decision to use an extractive summarization tool,
versus more complex systems that include sentence compression and text
generation, is based on the fact that in our experiments text summarization
is not an end per se, but rather an intermediate step for document classi-
fication. The informativeness of a summary is thus more important than
its coherence, and summarization through sentence extraction is sufficient
for this purpose. Second, through evaluations conducted on standard data
sets, the algorithm was demonstrated to be competitive with the state-of-
the-art in text summarization (Mihalcea & Tarau 2004). Finally, by being
an algorithm that can produce a ranking over sentences in a text, it is well
suited for our experiments where we want to measure the impact on text
classification of summaries of various lengths.
2.2 Algorithms for text classification

There is a large body of algorithms previously tested on text classifica-
tion problems, due also to the fact that text categorization is one of the
testbeds of choice for machine learning algorithms. In the experiments re-
ported here, we compare results obtained with two frequently used text
classifiers—Rocchio and Naı̈ve Bayes, selected for the diversity of their
learning methodologies.
Rocchio. This is an adaptation of the relevance feedback method developed
in information retrieval (Rocchio 1971). It uses standard tf.idf weighted vec-
tors to represent documents, and builds a prototype vector for each category
by summing up the vectors of the training documents in each category. Test
documents are then assigned to the category that has the closest prototype
vector, based on a cosine similarity. Text classification experiments with
different versions of the Rocchio algorithm showed competitive results on
standard benchmarks (Joachims 1997).
Naı̈ve Bayes. The basic idea in a Naı̈ve Bayes text classifier is to estimate
the probability of a category given a document using joint probabilities

of words and documents. Naı̈ve Bayes assumes word independence, which
means that the conditional probability of a word given a category is assumed
to be independent of the conditional probability of other words given the
same category. Despite this simplification, Naı̈ve Bayes classifiers perform
surprisingly well on text classification (Joachims 1997). While there are
several versions of Naı̈ve Bayes classifiers (variations of multinomial and
multivariate Bernoulli), we use the multinomial model (McCallum & Nigam
1998), which was shown to be more effective.
2.3 Data
For the classification experiments, we use the Reuters-21578 1 and WebKB 2
data sets—two of the most widely used test collections for text classifica-
tion. For Reuters-21578, we use the standard ModApte data split (Apte et
al. 1994), obtained by eliminating unlabeled documents, and selecting only
categories that have at least one document in the training set and test set
(Yang & Liu 1999). For WebKB, we perform a four-fold cross validation
after removing the other category, using the 3 : 1 training/test split auto-
matically created with the scripts provided with the collection, which sepa-
rates the data into training on three of the universities plus a miscellaneous
collection, and testing on a fourth held-out university. Both collections are
further post-processed by removing all documents with less than 10 sen-
tences, resulting into final data sets of 1277 training documents, 436 test
documents, 60 categories for Reuters-21578, and 350 training documents,
101 test documents, 6 categories for WebKB. This last step is motivated by
the goal of our experiments: we want to determine the impact of various
degrees of summarization on text categorization, and this is not possible
with very short documents.
2.4 Evaluation metrics

The evaluation is run in terms of accuracy, defined as the number of correct
assignments among the document–category pairs in the test set. For the
WebKB data set, since the classifiers assign exactly one category to each
document, and a document can belong to only one category, this definition
of accuracy coincides with the measures of precision, recall, and F-measure.
For the Reuters-21578 data set, multiple classifications are possible for each
document, and thus the accuracy coincides with the classification precision.
1
http://www.daviddlewis.com/resources/testcollections/reuters21578/
2
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
Reuters-21578 WebKB
Summary length Rocchio Naı̈ve Bayes Rocchio Naı̈ve Bayes
1 sentence 70.18% 72.94% 60.40% 61.39%
2 sentences 72.48% 73.85% 62.38% 71.29%
3 sentences 73.17% 75.46% 60.40% 74.26%
4 sentences 72.94% 75.46% 63.37% 76.24%∗
5 sentences 75.23% 75.92% 66.35%∗ 79.24%∗
6 sentences 77.00%∗ 78.38%∗ 65.37%∗ 77.24%∗
7 sentences 76.38% ∗ 77.61%∗ 65.34%∗ 76.24%∗
8 sentences 76.38%∗ 77.38%∗ 64.36% 76.21%∗
9 sentences 75.23% 77.61%∗ 65.34%∗ 75.25%
10 sentences 75.23% 77.52%∗ 65.35%∗ 75.25%
full document 74.91% 75.01% 63.36% 74.25%
Table 1: Classification results for Reuters-21578 and WebKB, using

Rocchio and Naı̈ve Bayes classifiers, for full-length documents and for
extractive summaries of various lengths. Statistically significant
improvements (p < 0.05, paired t-test) with respect to full-document
classification are also indicated (∗ )
Evaluations figures are reported as micro-average over all the categories in

the test set.
3 Experimental results
Classification experiments are run using each of the two learning algorithms,
with documents in the collection represented by extracts of various lengths.
The classification performance is measured for each experiment, and com-
pared against a traditional classifier that performs text categorization using
the original full-length documents.
Table 1 shows the classification results obtained using the Reuters-21578
and the WebKB test collections. Note that these results refer to data subsets
somewhat more difficult than the original test collections, and thus they are
not directly comparable to results previously reported in the literature. For
instance, the average number of documents per category in the Reuters-
21578 subset used in our experiments is only 25, compared to the average
of 50 documents per category available in the original data set3 . Similarly,
the WebKB subset has an average number of 75 documents per category,
3
Using the same Naı̈ve Bayes and Rocchio classifier implementations on the entire
Reuters-21578 data set results in a classification accuracy of 77.42% for Naı̈ve Bayes
and 79.70% for Rocchio, figures that are closely comparable to results previously re-
ported on this data set.
Rocchio Naive Bayes

80 80
75 75
70 70
Accuracy (%)
Accuracy (%)
65 65
60 60
N most important (graph-based) N most important (graph-based)

N least important (graph-based) N least important (graph-based)
55 first N (position based) 55 first N (position based)
last N (position based) last N (position based)
random N random N
full-length documents full-length documents
50 50
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Summary Length (number of sentences) Summary Length (number of sentences)
Fig. 2: Classification accuracy for Reuters-21578 for extractive summaries

of various lengths
compared to 750 in the original data set.

Figures 2 and 3 plot the classification accuracy on the two data sets
for each learning algorithm, using extractive summaries of various lengths
generated with the graph-based summarization method. For a compara-
tive evaluation, the figures plot results obtained with methods that create
an extract by: (a) selecting the N most important sentences using the
graph-based extractive summarization algorithm; (b) selecting the first N
sentences in the text; (c) randomly selecting N sentences; (d) selecting the
N least important sentences using the low-end of the ranking obtained with
the same summarization algorithm; (e) selecting the last N sentences.
From a text categorization perspective, the results demonstrate that
techniques for text summarization can be effectively combined with text
classification methods to the end of improving categorization performance.
On the Reuters-21578 data set, the classification performance using the Roc-
chio classifier improves from an accuracy of 74.90% to 77.00% obtained for
summaries of 6 sentences. Similar improvements are observed for the Naı̈ve
Bayes classifier, where the highest accuracy of 78.38% is obtained again for
summaries of 6 sentences, and is significantly better than the accuracy of
75.01% for text classification with full-length documents. The impact of
summarization is even more visible on the WebKB data set, where sum-
maries of 5 sentences result in a classification accuracy of 79.24% using a
Naı̈ve Bayes classifier, significantly higher than the accuracy of 74.25% ob-
tained when the classification is performed with entire documents. Similarly,
the categorization accuracy using a Rocchio classifier on the same data set
improves from 63.36% for full-length documents to 66.35% for summaries
of 5 sentences4 . The highest error rate reduction observed during these
4
The improvements observed in all classification settings are statistically significant at
p < 0.05 level (paired t-test).
Rocchio Naive Bayes

80 80
75 75
70 70
65 65
Accuracy (%)
Accuracy (%)
60 60
55 55
50 N most important (graph-based) 50 N most important (graph-based)

N least important (graph-based) N least important (graph-based)
first N (position based) first N (position based)
45 last N (position based) 45 last N (position based)
random N random N
full-length documents full-length documents
40 40
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Summary Length (number of sentences) Summary Length (number of sentences)
Fig. 3: Classification accuracy for WebKB for extractive summaries of

various lengths
experiments was 19.3%, achieved with a Naı̈ve Bayes classifier applied on

5-sentence extractive summaries.
Another aspect worth noting is the fact that these results can have im-
portant implications on the problem of text classification for documents for
which only abstracts are available. As it was previously suggested (Hulth
2003), many documents on the Internet are not available as full-texts, but
only as abstracts. Similarly, documents that are typically stored in printed
form, such as books, journals, or magazines, may have an abstract available
in electronic format, but not the entire text. This means that an even-
tual classification of such documents has to rely on abstract categorization,
rather than traditional full-text categorization. The results obtained in the
experiments reported here suggest that the task of text classification can
be efficiently performed even when only abstracts are available for the doc-
uments to be classified.
Classification efficiency is also significantly improved when the classifi-
cation is based on summaries, rather than full-length documents. A clear
computational improvement was observed even for the relatively small test
collections used in our experiments. For instance, the Naı̈ve Bayes classifier
takes 22 seconds to categorize the full-length documents in the Reuters-
21578 subset on a Pentium IV 3GHz 2GB computer, compared to only 6
seconds when the classification is done using 5-sentence summaries5 .
The results have also important implications on the problem of text
summarization. As seen in Figures 2 and 3, regardless of the classifier, a
performance peak is observed for summaries of 5-7 sentences, which give the
5
One could argue that the summarization-based classification implies an additional
summarization overhead. Note however that the summarization process can be paral-
lelized and needs to be performed only once offline. The resulting summaries can be
then used in multiple classification tasks.
highest increase in classification accuracy. This result can have an interest-

ing interpretation in the context of text summarization, as it indicates the
optimal number of sentences required for “grasping” the content of a text.
From this perspective, the task of text classification can be regarded as an
objective way of defining the “informativeness” of an abstract, and could
be used as an alternative to more subjective human-assessed evaluations of
summary content.
The comparative plots from Figures 2 and 3 also reveal another inter-
esting aspect of the synergy between text summarization and document
classification. Different automatic summarization tools – graph-based ex-
tractive summarization, heuristics that extract sentences based on their
position in the text, or a simple random sentence selection baseline – have
different but consistent impact on the quality of a text classifier, which
suggests that text categorization can be used as an application-oriented
evaluation testbed for automatic summarization. The graph-based sum-
marization method gives better text classification accuracy as compared to
the position-based heuristic, which in turns performs better than the simple
random sentence selection baselines. This comparative evaluation correlates
with rankings of these methods previously reported in the literature (Mi-
halcea & Tarau 2004), (Erkan & Radev 2004), which were based on human
judgments or automatic evaluations such as Rouge (Lin & Hovy 2003).
4 Related work
To our knowledge, the impact of content-based text summarization on the

task of text categorization was not explored in previous studies. Sentence
importance was considered as an additional factor for feature weighting in
work reported in (Ko et al. 2002), where words in a text were weighted dif-
ferently based on the score associated with the sentence they belong to. Ex-
periments with four different text categorization methods have shown that
a weighting scheme based on sentence importance can significantly improve
the classification performance. In (Kolcz et al. 2001), text summarization is
regarded as a feature selection method, and is shown to improve over alter-
native algorithms for feature selections. Finally, another related study is the
polarity analysis through subjective summarization reported in (Pang & Lee
2004), where the main goal was to distinguish between positive and negative
movie reviews by first selecting those sentences likely to be more subjective
according to a min-cut algorithm. The focus of their study was the analysis
of text style (subjective versus objective), rather than classification of text
content. We consider instead the more general text classification problem,
combined with a typical text summarization task, and evaluate the role that
text summaries can play in document categorization.
5 Conclusions
In this paper, we investigated the interaction between document summa-

rization and text categorization, through comparative classification experi-
ments relying on full-length documents or summaries of various lengths. We
showed that techniques for automatic summarization can be used to improve
the performance of a text categorization task, with error rate reductions of
up to 19.3% obtained with a Naı̈ve Bayes or a Rocchio classifier when applied
on short extractive summaries rather than full-length documents. Finally,
we suggested that the interaction between text summarization and docu-
ment categorization can be regarded as an application-oriented evaluation
testbed for tools for automatic summarization, as summaries produced by
different summarization tools were shown to have different impact on the
performance of a text classifier.
REFERENCES
Apte, Chidanand, Fred Damerau & Sholom M. Weiss. 1994. “Towards Language
Independent Automated Learning of Text Categorisation Models”. Proceed-
ings of the 17th ACM SIGIR Conference on Research and Development in
Information Retrieval, 23-30. Dublin, Ireland.
Brin, Sergey & Lawrence Page. 1998. “The Anatomy of a Large-Scale Hy-
pertextual Web Search Engine”. Computer Networks and ISDN Systems
30:1-7.107-117.
Erkan, Gunes & Dragomir Radev. 2004. “Lexpagerank: Prestige in Multi-
Document Text Summarization”. Proceedings of the Conference on Empirical
Methods in Natural Language Processing, 365–371. Barcelona, Spain.
Hulth, Anette. 2003. “Improved Automatic Keyword Extraction Given More
Linguistic Knowledge”. Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP 2003 ), 216-223. Sapporo, Japan.
Joachims, Thorsten. 1997. “A Probabilistic Analysis of the Rocchio Algorithm
with TFIDF for Text Categorization”. Proceedings of the 14th International
Conference on Machine Learning (ICML 1997 ), 143-151. Nashville, Ten-
nessee.
Kamvar, Sepandar, Taher Haveliwala, Christopher Manning & Gene Golub. 2003.
“Extrapolation Methods for Accelerating PageRank Computations”. Pro-
ceedings of the 12th International World Wide Web Conference, 261-270.
Budapest, Hungary.
Kleinberg, Jon. 1999. “Authoritative Sources in a Hyperlinked Environment”.
Journal of the ACM 46:5.604-632.
Ko, Youngjoong, Jinwoo Park & Jungyun Seo. 2002. “Automatic Text Catego-
rization Using the Importance of Sentences”. Proceedings of the 19th Inter-
national Conference on Computational Linguistics, 1-7. Taipei, Taiwan.
Kolcz, Aleksander, Vidya Prabakarmurthi & Jugal Kalita. “Summarization as
Feature Selection for Text Categorization.” Proceedings of the 10th Inter-
national conference on Information and Knowledge Management, 365-370.
Atlanta, Georgia.
Lin, Chin-Yew & Eduard Hovy. 2003. “Automatic Evaluation of Summaries
Using n-gram Co-occurrence Statistics”. Proceedings of the Human Language
Technology Conference, 71-78. Edmonton, Canada.
McCallum, Andrew & Kamal Nigam. 1998. “A Comparison of Event Models
for Naive Bayes Text Classification”. Proceedings of AAAI-98 Workshop on
Learning for Text Categorization, 41-48. Madison, Wisconsin.
Mihalcea, Rada & Paul Tarau. 2004. “TextRank – Bringing Order into Texts”.
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2004 ), 404-411. Barcelona, Spain.
Mihalcea, Rada & Paul Tarau. 2005. “An Algorithm for Language Independent
Single and Multiple Document Summarization”. Proceedings of the Interna-
tional Joint Conference on Natural Language Processing, 19-24. Jeju Island,
Korea.
Ng, Hwee Tou, Goh, Wei Booh & Low, Kok Leong. 1997. “Feature Selection,
Perceptron Learning, and a Usability Case Study for Text Categorization”.
Proceedings of the 20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 67-73. Philadelphia,
Pennsylvania.
Pang, Bo & Lillian Lee. 2004. “A Sentimental Education: Sentiment Analysis
Using Subjectivity Summarization Based on Minimum Cuts”. Proceedings of
the 42nd Meeting of the Association for Computational Linguistics, 271-278.
Barcelona, Spain.
Rocchio, Joseph J. 1971. “Relevance Feedback in Information Retrieval”. Engle-
wood Cliffs, New Jersey: Prentice Hall.
Yang, Yiming & Xin Liu. 1999. “A Reexamination of Text Categorization Meth-
ods”. Proceedings of the 22nd ACM SIGIR Conference on Research and
Development in Information Retrieval, 42-49. Berkeley, California.
Yang, Yiming & Jan O. Pedersen. 1997. “A Comparative Study on Feature Se-
lection in Text Categorization”. Proceedings of the 14th International Con-
ference on Machine Learning, 412-420. Nashville, Tennessee.
Exploiting Linguistic Cues to Classify
Rhetorical Relations
Caroline Sporleder∗ & Alex Lascarides∗∗
∗ ∗∗
Tilburg University & University of Edinburgh
Abstract
We propose a method for automatically identifying rhetorical re-
lations. We use supervised machine learning but exploit discourse
markers to automatically label training data. Our models draw on a
variety of linguistic cues to distinguish between relations. We show
that these feature-rich models outperform the previously suggested
bigram models by more than 20%, at least for small training sets.
Our approach is therefore better suited to deal with relations for
which it is difficult to automatically label a lot of training data be-
cause they are rarely signalled by unambiguous discourse markers.
1 Introduction
Clauses in a text link to each other via rhetorical relations such as con-
trast or result (Mann & Thompson 1987). For example, the sentences
in (1) are linked by result. Many NLP applications, such as question-
answering and text summarisation, would benefit from a method which
automatically identifies such relations. While rhetorical relations are some-
times signalled by discourse markers (also known as discourse ‘connectives’)
such as since or consequently, these are often ambiguous. For example, since
can indicate either a temporal or an explanation relation (examples (2a)
and (2b), respectively). Discourse markers are also often missing (as in (1)).
Hence, it is not possible to rely on discourse markers alone.
(1) A train hit a car on a level crossing. It derailed.
(2) a. She has worked in retail since she moved to Britain.
b. I don’t believe he’s here since his car isn’t parked outside.
We present a machine learning method which uses a variety of linguistic and
textual features (word stems, part-of-speech tags, tense information) to de-
termine the rhetorical relation between two adjacent text spans (sentences
or clauses) in the absence of an unambiguous discourse marker. To avoid
manual annotation of training data, we train on automatically labelled ex-
amples, building on earlier work by Marcu & Echihabi (2002), who extracted
examples from large text corpora and used unambiguous discourse mark-
ers to label them with the correct rhetorical relation. Before training, the
markers were removed to enable the classifier to learn from other features
158 CAROLINE SPORLEDER & ALEX LASCARIDES
and thus identify relations in unmarked examples. This approach works be-
cause there is often a certain amount of redundancy between the discourse
marker and the general linguistic context. For example, the two clauses in
example (3a) are in a contrast relation signalled by but. However, this
relation can also be inferred if no discourse marker is present (see (3b)).
(3) a. She doesn’t make bookings but she makes itinerary recommendations.
b. She doesn’t make bookings; she makes itinerary recommendations.
Hobbs et al. (1993) and Asher & Lascarides (2003) propose a logical ap-
proach to inferring relations, which in this case would rely on the linguistic
cues of a negation in the first span, syntactic parallelism of the two spans,
and the fact that they both have the same subject. We explore whether
such cues can also be exploited as features in a statistical model for recog-
nising rhetorical relations and whether such a feature-rich model leads to
a better performance than Marcu & Echihabi’s (2002) word co-occurrence
model.
2 Related research
Marcu & Echihabi (2002) present a machine learning approach to automat-

ically identify four relations (contrast, cause-explanation-evidence,
condition and elaboration) from Rhetorical Structure Theory (rst)
(Mann & Thompson 1987). Two types of non-relations are also included.
The training data are extracted automatically from a large text corpus
(around 40 million sentences) using manually constructed extraction pat-
terns containing discourse markers which signal one of these relations. For
example, if a sentence begins with but, it is extracted together with the
preceding sentence and labelled as contrast. Examples of non-relations
are created artificially: one type of non-relation corresponds to randomly
picked pairs of non-adjacent spans from the same text, and another type
of non-relation to pairs of spans from different texts. The number of ex-
tracted examples per relation range from 900,000 to 4 million. After the
markers had been removed, several Naive Bayes classifiers were trained to
distinguish between relations on the basis of co-occurrences between pairs
of words. The reported test set accuracy for the six-way classifier is 49.7%.
Lapata & Lascarides (2004) present a method for inferring temporal
connectives. They, too, label data automatically, using connectives such as
while or since. But their aim is to predict the original temporal connective
(which was removed from the test set) rather than the underlying rhetorical
relation. For this, they train simple probabilistic models based on nine types
of linguistically motivated features. They report accuracies of up to 70.7%.
EXPLOITING LINGUISTIC CUES 159
There have also been non-statistical approaches. Corston-Oliver (1998), for

instance, presents a system which takes fully parsed sentences and deter-
mines rhetorical relations by applying heuristics based on linguistic cues
such as clausal status, anaphora and deixis. Le Thanh et al. (2004) use a
similar, heuristics-based approach which first splits sentences into discourse
spans and then determines which relations hold between them.
3 Our approach
3.1 Relations and discourse marker selection

We chose five sdrt (Asher & Lascarides 2003) relations: contrast, re-
sult, explanation, summary and continuation. For each of these,
there are unambiguous discourse markers but these relations also frequently
occur without such a marker; so it is beneficial to be able to determine them
in the absence of an explicit marker. This is in contrast to relations such as
condition, which always require a discourse marker (e.g., if. . . then).
sdrt relations are defined on the basis of truth conditional semantics
and tend to be less fine-grained than those in Rhetorical Structure Theory
(rst) (Mann & Thompson 1987). Let R(a, b) denote the fact that a relation
R connects two spans a and b. For each of the five relations it holds that
R(a, b) is true only if the contents of a and b are true too. In addition, con-
trast(a,b) entails that a and b have parallel syntactic structures that induce
contrasting themes, result(a,b) entails that a causes b, summary(a,b) en-
tails that a and b are semantically equivalent, continuation(a,b) means
that a and b have a contingent, common topic and explanation(a,b) means
that b is an answer to the question why a? (cf. Bromberger 1962).
To identify unambiguous discourse markers for the five sdrt relations,
we undertook an extensive corpus study, using 2,000 randomly chosen ex-
amples (30 per marker), as well as linguistic introspection. The differences
between sdrt and rst mean that some markers which are ambiguous in
rst are unambiguous in sdrt. For example, in other words can signal
either summary or restatement in rst, but sdrt does not distinguish
these since the length of the related spans is irrelevant to sdrt’s semantics.
Overall, we identified 55 unambiguous discourse markers for the five re-
lations. These were used in the manually written extraction patterns. Sen-
tences (4) to (8) below show one automatically extracted example for each
relation (discourse markers are underlined, spans are indicated by square
brackets).
(4) [We can’t win] [but we must keep trying.]
(contrast)
(5) [The ability to operate at these temperatures is advantageous,] [because the

devices need less thermal insulation.]
(explanation)
(6) [By the early eighteenth century in Scotland, the bulk of crops were housed
in ricks,] [the barns were consequently small.]
(result)
(7) [The starfish is an ancient inhabitant of tropical oceans.] [In other words,
the reef grew up in the presence of the starfish.]
(summary)
(8) [First, only a handful of people have spent more than a few weeks in space.]
[Secondly, it has been impractical or impossible to gather data beyond some
blood and tissue samples.]
(continuation)
3.2 Data
We used three corpora, mainly from the news domain, to extract our data
set: the British National Corpus (BNC, 100 million words), the North Amer-
ican News Text Corpus (350 million words) and the English Gigaword Cor-
pus (1.7 billion words). We preprocessed the corpora by removing duplicate
texts and speech transcripts, and running a sentence splitter (Reynar &Rat-
naparkhi 1997) on the latter two corpora.
We applied the extraction patterns to the raw text to find potential
examples. These were then parsed with the RASP parser (Carroll & Briscoe
2002) and the parse trees were processed to (i) identify the two spans and
(ii) filter out false positives. For instance, (9) was extracted as an example
of summary based on the apparent presence of the discourse marker in
short. However, the parser correctly identified this string as part of the
prepositional phrase in short order and the example was discarded. We
extracted both intra- and inter-sentential relations (see (4) and (7) above,
respectively). However, we limited the length of the spans to one sentence,
as we specifically wanted to focus on relations between small units of text.
(9) In short order I was to fly with ‘Deemy’ on Friday morning.
There are three potential sources of errors: (i) the two spans are not
related, (ii) they are related but the wrong relation is hypothesised and
(iii) the hypothesised span boundaries are wrong. To assess the quality of
the extracted data, we manually inspected 100 randomly selected examples
(20 per relation). We found 11 errors overall: three of type (i) (no relation)
and eight of type (iii) (wrong boundary). No wrongly predicted relation
was found. The overall precision was thus 89%.
The number of extracted training examples ranged from 1,732 for con-
tinuation and around 50,000 for contrast. On the whole, our data set
is much smaller than the one used by Marcu & Echihabi (2002), which
contained around 10 million examples for six relations.
3.3 The classifier

Several machine learning schemes can be employed for the task. We used
a boosting algorithm with simple decision rules, as implemented in Boos-
Texter (Schapire & Singer 2000). BoosTexter allows a variety of feature
types, such as nominal, numerical or text-based. For the latter, it applies
n-gram models when forming classification hypotheses. We implemented 72
linguistically motivated features, roughly falling into 9 classes: positional
features, length features, lexical features, part-of-speech features, temporal
features, syntactic features and cohesion features.
We defined three positional features, encoding whether the relation
holds intra- or inter-sententially and whether the example occurs towards
the beginning and end of a paragraph. The motivation for these features
is that the probability of different relations is likely to vary with both their
paragraph position and the position of sentence boundaries relative to span
boundaries. For instance, a summary relation is probably more frequent
at the beginning or end of a paragraph than in the middle of it.
We encoded the length of the two spans to capture possible length
effects. For example, it is possible that the spans are longer on average for
continuation than for contrast.
Lexical information is also likely to provide useful cues (cf. Marcu & Echi-
habi 2002). For example, a high word overlap between spans may be ev-
idence for summary. Furthermore, while we removed the unambiguous
discourse markers from the training examples (as they form the basis for
the labelling), these may be accompanied by ambiguous cues (like still, sig-
nalling contrast or a temporal relation), which we do not remove. Words
like this can and should be exploited by the classifier. Thus we encoded
the lemmas and stems of all words and of the content words only. These
were encoded as text features, allowing BoosTexter to automatically iden-
tify stem or lemma n-gram sequences that may be good cues for a particular
relation. We also encoded the stem and lemma overlap between the spans.
Some part-of-speech sequences may also be more likely for one relation
than for another and were included as a text feature. We also separately
encoded specific information about verb, noun and adjective lemmas (Lap-
ata & Lascarides 2004). Furthermore, we mapped the lemmas to their most
general WordNet (Fellbaum 1998) class (e.g., verb-of-cognition). Finally,
we encoded the overlaps between lemmas and between WordNet classes.
Tense and aspect provide clues about temporal relations among events
and may influence the probabilities of different rhetorical relations. We
used simple heuristics to classify verbal complexes in terms of finiteness,

modality, aspect, voice and negation (Lapata & Lascarides 2004).
Some relations (e.g., summary) may have syntactically less complex
spans than others (e.g., continuation). To estimate syntactic complex-
ity, we encoded the number of NPs, VPs, PPs, ADJPs, and ADVPs in each
span. We also included argument structure information, e.g., whether a
verb has a direct object. Finally, we added information about the subjects,
i.e., their part-of-speech tags, whether they have a negative aspect (e.g.
nobody, nowhere), and the WordNet classes to which they map (see above).
The degree of cohesion between two spans may be another informative
feature. To estimate it we looked at the distribution of pronouns and at the
presence or absence of ellipses (cf. Hutchinson 2004).
4 Experiments
We conducted three experiments. First, we assessed how well humans can

determine rhetorical relations in the absence of discourse markers. This
gives a measure of the difficulty of the task. We then tested our model and
compared its performance to two baseline classifiers. Finally, we looked at
which features are particularly useful for predicting the correct relation.
4.1 Experiment 1: Human agreement

Training on automatically labelled examples will only be successful if there is
a certain amount of redundancy between the unambiguous discourse marker,
which is used to assign the label and then removed, and other linguistic
features of the examples. If discourse markers were only used in cases where
a relation cannot be inferred from the linguistic context alone, any approach
which aims to train a classifier on automatically extracted examples from
which the unambiguous discourse markers have been removed would fail.
The presence of redundancy in some cases is evident from examples like
(3), where contrast can be inferred even when the discourse marker is
removed. However, there may be other cases where this is more difficult. To
assess the difficulty of determining the rhetorical relation when the discourse
marker has been removed, we conducted a small pilot study with human
subjects (cf. Soria & Ferrari 1998). We used our extraction patterns to
automatically extract examples for the four rhetorical relations contrast,
explanation, result and summary (continuation was added after the
pilot study). We then manually checked the extracted examples to filter out
false positives and randomly selected 10 examples per relation from which
we then removed the discourse markers. We also semi-automatically selected
10 examples of adjacent sentences or clauses which were not related by any
of the four relations. For each example, we included the two preceding
and following sentences as context. We then asked three subjects who were
trained in discourse annotation to classify each of the 50 examples as one
of the four relations or as none. All subjects were aware that discourse
markers had been removed from the examples but did not know the location
of the removed discourse marker. We evaluated the annotations against the
gold standard. The average accuracy was 71.25%, the average, pairwise
Kappa coefficient (Siegel & Castellan 1988) was .61. While the agreement
is far from perfect, it is relatively high for a discourse annotation task.
Hence it seems that the task of predicting the correct relation for sentences
from which the discourse marker has been removed is feasible for humans.
4.2 Experiment 2: Comparing the models

The machine learning experiments involved five relations: contrast, ex-
planation, result, summary and continuation. We did not use the
complete set of extracted examples, as this was highly imbalanced (see Sec-
tion 3.2), which can have negative effects on a machine learner (see e.g.
Japkowicz 2000). Instead, we created training and test sets which con-
tained an equal number of examples for each relation (i.e., 1,732 examples
per relation). We used 90% of this data set for training (7,795 examples)
and 10% for testing (865 examples), making sure that the distribution of
the relations was uniform in both data sets, and evaluated BoosTexter’s
performance using 10-fold cross-validation.
For comparison, we also used two baselines. For the first, a relation
was predicted at random. As there are five, equally frequent relations in
the test set, the average accuracy achieved by this strategy will be 20%.
For the second baseline, we implemented a bigram model along the lines
of Marcu & Echihabi (2002). Table 1 shows the average accuracies of the
three classifiers for all relations and also for each individual relation.
The feature-rich BoosTexter model performs notably better than either
of the other two classifiers. It outperforms the random baseline by nearly
40% and the bigram model by more than 20%. This difference is statis-
tically significant (χ2 = 208.12, DoF = 1, p <= 0.01). Furthermore, the
performance gain achieved by our model holds for every relation with the
exception of explanation where the bigram model performs better.
While the bigram model is geared towards large training sets and it is
possible that it outperforms the BoosTexter model if trained on a lot of
data, obtaining large amounts of data is not always feasible, even if train-
ing examples are extracted automatically. Some relations occur relatively
infrequently or are rarely signalled by an unambiguous discourse marker.
This applies to continuation, which is usually signalled by an ambigu-
Avg. Accuracy
Relation random bigrams BT
contrast 20.00 33.11 43.64
explanation 20.00 75.39 64.45
result 20.00 16.21 47.86
summary 20.00 19.34 48.44
continuation 20.00 25.48 83.35
all 20.00 33.96 57.55
Table 1: Results for BoosTexter (BT ) and two baselines (10-fold
cross-validation)
ous marker (e.g., and) or no marker at all. For such relations, our approach
seems a better choice than the model proposed by Marcu & Echihabi (2002).
4.3 Experiment 3: Feature exploration

To determine which features are particularly useful for the task, we con-
ducted a further experiment in which we trained an individual BoosTexter
model for each of the features. These one-feature classifiers were then tested
on an unseen test set (again using 10-fold cross-validation). Table 2 shows
the 10 best performing features and their average accuracies.
Feature Avg. Accuracy

left stems 42.51
left words 41.79
intra/inter 39.18
left pos-tags 34.62
right words 32.82
right stems 32.58
right pos-tags 31.72
left content words 29.78
left noun lemmas 28.30
right span length 28.12
Table 2: The ten best features (10-fold cross-validation)
Lexical features (stems, words, lemmas) seem the most useful. Table 3
shows some of the words chosen by BoosTexter as particularly predictive of
a given relation. Most of the choices seem fairly intuitive. For instance, an
explanation relation is often signalled by tentatively qualifying adverbs
(e.g., perhaps), while summary and continuation relations frequently
contain pronouns and contrast can be signalled by words such as other
or still etc. Table 2 also suggests that the lexical items in the left span are
more important than those in the right. For example, left stems is 10% more
accurate then right stems. This makes sense from a processing perspective:
if the relation is already signalled in the left span the sentence will be easier
to process than if the signalling is delayed until the right span is read.
Relation Predictive Words
contrast other, still, not, . . .
explanation perhaps, probably, mainly, . . .
result undoubtedly, so, indeed, . . .
summary their, this, yet . . .
continuation you, it, there . . .
Table 3: Words chosen as cues for a relation
5 Conclusion
We have presented a machine learning method for automatically classify-
ing discourse relations in the absence of unambiguous discourse markers,
using a model which combines a wide variety of linguistic cues. The model
was trained on automatically labelled data and tested on five relations. We
found that this feature-rich model significantly outperformed the simpler
bigram model proposed by Marcu & Echihabi (2002), at least on our, rel-
atively small training set. Hence, our method is particularly suitable for
relations which are rarely signalled by (unambiguous) discourse markers
(e.g., continuation). In such cases, it is difficult to obtain sufficiently
large training sets that a bigram model will perform well, even if the data
is extracted from very large corpora.
So far we have only tested our method on automatically labelled data,
i.e., examples from which an originally present, unambiguous discourse
marker had been removed, not on examples which occur naturally with-
out such a marker. However, the latter are exactly the types of examples at
which our method is aimed. In future research, we thus intend to create a
small, manually labelled test corpus of examples which are not unambigu-
ously marked and test our method on it to determine whether our results
carry over to that data type. The RST Discourse Treebank (Carlson et al.
2002) could be used as a starting point for the corpus creation.
Acknowledgements. The research in this paper was supported by epsrc
grant number GR/R40036/01. We would like to thank Ben Hutchinson and
Mirella Lapata for interesting discussions, participating in our labelling experi-
ment and letting us use some of their Perl scripts. We are also grateful to the
reviewers for their helpful comments.
REFERENCES
Asher, Nicholas & Alex Lascarides. 2003. Logics of Conversation. Cambridge:
Cambridge University Press.
Bromberger, Sylvain. 1962. “An Approach to Explanation”. Analytical Philoso-

phy ed. by Ronald J. Butler, 75-105. Oxford: Oxford University Press.
Carlson, Lynn, Daniel Marcu & Mary Ellen Okurowski. 2002. RST Discourse
Treebank. Linguistic Data Consortium.
Carroll, John & Edward Briscoe. 2002. “High Precision Extraction of Grammati-
cal Relations”. 19th Int. Conf. on Computational Linguistics (COLING-02 ),
134-140. Taipei, Taiwan.
Corston-Oliver, Simon H. 1998. “Identifying the Linguistic Correlates of Rhetor-
ical Relations”. Workshop on Discourse Relations and Discourse Markers,
8-14. Montreal, Canada.
Fellbaum, Christiane Fellbaum, ed. 1998. WordNet: An Electronic Database.
Hobbs, Jerry R., Mark Stickel, Douglas Appelt & Paul Martin. 1993. “Interpre-
tation as Abduction”. Artificial Intelligence 63:1-2.69-142.
Hutchinson, Ben. 2004. “Acquiring the Meaning of Discourse Markers”. 42nd
Annual Meeting of the Association for Computational Linguistics (ACL’2004),
685-692. Barcelona, Spain.
Japkowicz, Nathalie. 2000. “The Class Imbalance Problem: Significance and
Strategies”. Int. Conf. on Artificial Intelligence (IC-AI-00), 111-117. Las
Vegas, Nevada.
Lapata, Mirella & Alex Lascarides. 2004. “Inferring Sentence-internal Temporal
Relations”. Meeting of the North American Chapter of the Assoc. for Comp.
Linguistics (NAACL-04 ), 153-160. Boston, Mass.
Le Thanh, Huong, Geetha Abeysinghe & Christian Huyck. 2004. “Generating
Discourse Structures for Written Text”. 20th International Conference on
Computational Linguistics (COLING-04 ), 329-335. Geneva, Switzerland.
Mann, William C. & Sandra A. Thompson. 1987. Rhetorical Structure Theory: A
Theory of Text Organization. Technical Report ISI/RS-87-190. Los Angeles,
Calif.: ISI.
Marcu, Daniel & Abdessamad Echihabi. 2002. “An Unsupervised Approach to
Recognizing Discourse Relations”. 40th Annual Meeting of the Association
for Computational Linguistics (ACL-02 ), 368-375. Philadelphia, Penn.
Reynar, Jeffrey C. & Adwait Ratnaparkhi. 1997. “A Maximum Entropy Ap-
proach to Identifying Sentence Boundaries”. 5th Applied Natural Language
Processing Conference (ANLP-97 ), 16-19. Washington, DC.
Schapire, Robert E. & Yoram Singer. 2000. “BoosTexter: A Boosting-based
System for Text Categorization”. Machine Learning 39:2-3.135-168.
Siegel, Sidney & N. John Castellan. 1988. Nonparametric Statistics for the
Behavioral Sciences. New York: McGraw-Hill.
Soria, Claudia & Giacomo Ferrari. 1998. “Lexical Marking of Discourse Rela-
tions – Some Experimental Findings”. Workshop on Discourse Relations and
Discourse Markers, 36-42. Montreal, Canada.
Tree Edit Distance for Textual Entailment
Milen Kouylekov∗,∗∗ & Bernardo Magnini∗∗
∗
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
∗∗
University of Trento
Abstract
This paper addresses Textual Entailment (i.e., recognizing that the
meaning of a text entails the meaning of another text) using a Tree
Edit Distance algorithm between the syntactic trees of the two texts.
It also compares the contribution of lexical resources for recognizing
textual entailment in an experiment carried on over the pascal-rte
dataset.
1 Introduction
The problem of language variability (i.e., the fact that the same information
can be expressed with different words and syntactic constructs) has been
attracting a lot of interest during the years and it poses significant issues
in front of systems aimed at natural language understanding. The example
below shows that recognizing the equivalence of the statements came in
power, was prime-minister and stepped in as prime-minister is a challenging
problem.
• Ivan Kostov came in power in 1997.
• Ivan Kostov was prime-minister of Bulgaria from 1997 to 2001.
• Ivan Kostov stepped in as prime-minister 6 months after the December
1996 riots in Bulgaria.
While the language variability problem is well known in Computational
Linguistics, a general unifying framework has been proposed only recently
in (Dagan & Glickman 2004). In this approach, language variability is
addressed by defining the notion of entailment as a relation that holds
between two language expressions (i.e., a text t and an hypothesis h) if the
meaning of h, as interpreted in the context of t, can be inferred from the
meaning of t. The entailment relation is directional, as the meaning of one
expression can entail the meaning of the other, while the opposite may not.
The Recognizing Textual Entailment (rte) task takes as input a T H
pair and consists in automatically determining whether an entailment re-
lation holds between T and H or not. The task, potentially, covers almost
all the phenomena in language variability: entailment can be due to lexical
variations, as it is shown in example (1), to syntactic variation (example 2),
to semantic inferences (example 3) or to complex combinations of all such
168 MILEN KOUYLEKOV & BERNARDO MAGNINI
levels. As a consequence of the complexity of the task, one of the crucial

aspects for any rte system is the amount of knowledge required for filling
the gap between T and H. Examples:
1. T – Euro-Scandinavian media cheer Denmark v Sweden draw. H –
Denmark and Sweden tie.
2. T – Jennifer Hawkins is the 21-year-old beauty queen from Australia.
H – Jennifer Hawkins is Australia’s 21-year-old beauty queen.
3. T – The nomadic Raiders moved to LA in 1982 and won their third
Super Bowl a year later. H – The nomadic Raiders won the Super
Bowl in 1982.
In example 1 we need to know that T lexically H ; in example 2 we need
to understand that the syntactic structure are equivalent; in 3 we need to
reason about temporal entities.
The aim of this paper is to provide a clear and homogeneous framework
for the evaluation of lexical resources for the rte task. The framework is
based on the intuition that the probability of an entailment relation between
t and h is related to the ability to show that the whole content of h can
be mapped into the content of t. The more straightforward the mapping
can be established, the more probable is the entailment relation. Since a
mapping can be described as the sequence of editing operations needed to
transform t into h, where each edit operation has a cost associated with it,
we assign an entailment relation if the overall cost of the transformation is
below a certain threshold, empirically estimated on the training data.
Within the tree edit distance (ted) framework, the complexity of rte
is put on the availability of entailment rules and on the definition of cost
functions for the three editing operations. In this paper we investigate the
role of resources for entailment rules for the definition of cost functions.
We have experimented the ted approach with a non annotated document
collection and a similarity relation estimated over a corpus of dependency
trees. Experiments, carried on the pascal-rte dataset, provide significant
insight for future research on rte.
2 Tree edit distance on dependency trees
We adopted a tree edit distance algorithm applied to the syntactic repre-

sentations (i.e., dependency trees) of both t and h. A similar use of tree
edit distance has been presented by (Punyakanok et al. 2004) for a Ques-
tion Answering system, showing that the technique outperforms a simple
bag-of-word approach. While the cost function presented in (Punyakanok
et al. 2004) is quite simple, for the rte challenge we tried to elaborate more
complex and task specific measures.
TREE EDIT DISTANCE FOR TEXTUAL ENTAILMENT 169
According to our approach, t entails h if there exists a sequence of transfor-

mations applied to t such that we can obtain h with an overall cost below a
certain threshold. The underlying assumption is that pairs between which
an entailment relation holds have a low cost of transformation. The kind of
transformations we can apply (i.e., deletion, insertion and substitution) are
determined by a set of predefined entailment rules, which also determine a
cost for each editing operation.
We have implemented the tree edit distance algorithm described in
(Zhang & Shasha 1990) and applied to the dependency trees derived from
t and h. Edit operations are defined at the level of single nodes of the
dependency tree (i.e., transformations on subtrees are not allowed in the
current implementation). Since the Zhang & Shasha (1990) algorithm does
not consider labels on edges, while dependency trees provide them, each
dependency relation R from a node A to a node B has been re-written as
a complex label B-R concatenating the name of the destination node and
the name of the relation. All nodes except the root of the tree are relabeled
in such way. The algorithm is directional: we aim to find the better (i.e.,
less costly) sequence of edit operation that transform t (the source) into
h (the target). According to the constraints described above, the following
transformations are allowed:
• Insertion: insert a node from the dependency tree of h into the

dependency tree of t. When a node is inserted it is attached with the
dependency relation of the source label.
• Deletion: delete a node N from the dependency tree of t. When N
is deleted all its children are attached to the parent of N. It is not
required to explicitly delete the children of N as they are going to be
either deleted or substituted on a following step.
• Substitution: change the label of a node N1 in the source tree into a
label of a node N2 of the target tree. Substitution is allowed only if the
two nodes share the same part-of-speech. In case of substitution the
relation attached to the substituted node is changed with the relation
of the new node.
3 System architecture
The system is composed by the following modules, showed in Figure 1:

(i) a text processing module, for the preprocessing of the input t/h pair;
(ii) a matching module, which performs the mapping between t and h;
(iii) a cost module, which computes the cost of the edit operations.
Fig. 1: System architecture
3.1 Text processing module

The text processing module creates a syntactic representation of a t/h pair
and relies on a sentence splitter and a syntactic parser. For sentence split-
ting we used MXTerm (Ratnaparkhi 1996), a Maximum entropy sentence
splitter. For parsing we used Minipar, a principle-based English parser (Lin
1998) which has high processing speed and good precision.
3.2 Matching module

The matching module finds the best sequence of edit operations between the
dependency trees obtained from t and h. It implements the edit distance
algorithm described in Section 2. The entailment score of a given pair is
calculated in the following way:
ed(T, H)
score(T, H) = (1)
ed(, H)
where ed(T, H) is the function that calculates the edit distance cost and
ed(, H) is the cost of inserting the entire tree h. A similar approach is
presented in (Monz & de Rijke 2001), where the entailment score of two
′
documents d and d is calculated by comparing the sum of the weights (idf)
of the terms that appear in both documents to the sum of the weights of
′
all terms in d .
We used a threshold t such that if score(T, H) < t then t entails h,
otherwise no entailment relation holds for the pair. To set the threshold
we have used both the positive and negative examples of the training set
provided by the pascal-rte dataset (see Section 5.1 for details).
3.3 Cost module

The matching module makes requests to the cost module in order to receive
the cost of single edit operations needed to transform t into h. We have
different cost strategies for the three edit operations.
Insertion. The intuition underlying insertion is that its cost is proportional

to the relevance of the word w to be inserted (i.e., inserting an informative
word has an higher cost than inserting a less informative word). More
precisely:
Cost[Ins(w)] = Rel(w) (2)

where Rel(w), in the current version of the system, is computed on a doc-
ument collection as the inverse document frequency (idf ) of w, a measure
commonly used in Information Retrieval. If N is the number of documents
in a text collection and Nw is the number of documents of the collection
that contain w then the idf of w is given by the formula:
N
idf (w) = log (3)
Nw
The most frequent words (e.g., stop words) have a zero cost of insertion.
Substitution. The cost of substituting a word w1 with a word w2 can

be estimated considering the semantic entailment between the words. The
more the two words are entailed, the less the cost of substituting one word
with the other. We used the following formula:
Cost[Subs(w1 , w2 )] = (4)
Ins(w2 ) ∗ (1 − Ent(w1 , w2 ))
where Ins(w2 ) is calculated using (3) and Ent(w1 , w2 ) can be approximated
with a variety of relatedness functions between w1 and w2 .
There are two crucial issues for the definition of an effective function for
lexical entailment: first, it is necessary a database of entailment rules with
enough coverage; second, we have to estimate a quantitative measure for
such rules.
We experimented the use of a dependency based thesaurus available at
http://www.cs.ualberta.ca/ lindek/downloads.htm. For each word, the
~
thesaurus lists up to 200 most similar words and their similarities calcu-
lated on a parsed corpus using frequency counts of the dependency triples.
A complete review of the method including a comparison with different
approaches is presented in (Lin 1996). Dependency triples consists of the
head, a dependency type and a modifier. They can be viewed as features
for the head and the modifiers in the triples when calculating similarity.
The cost of a substitution is calculated by the following formula:
Entsim (w1 , w2 ) = simth (w1 , w2 ) (5)
where w1 is the word from t that is being replaced by the word w2 from h
and simth (w1 , w2 ) is the similarity between w1 and w2 in the thesaurus mul-
tiplied by the similarity between the corresponding relations. The similarity
between relations is stored in a database of relation similarities obtained by
comparing dependency relations from a parsed local corpus. The similar-
ities have values from 1 (very similar) to 0 (not similar). If there is no
similarity, the cost of substitution is equal to the cost of inserting the word
w2.
Deletion. In the pascal-rte dataset t is typically shorter than h. As a

consequence, we expect that much more deletions are necessary to transform
t into h than insertions or substitutions. Given this bias toward deletion,
in the current version of the system we set the cost of deletion to 0. This ex-
pectation has been empirically confirmed (see results of system 2 in Section
5.3).
We carried out a number of experiments in order to estimate the contribu-

tion of different combinations of the available resources. In this section we
report about the dataset, the experiments and the results we have obtained.
4.1 Dataset
For the experiments we have used the pascal-rte dataset (Dagan et al.
2005). The dataset1 has been collected by human annotators and it is com-
posed of 1367 text (T ) - hypothesis (H ) pairs split into positive and negative
examples (a 50%-50% split). The examples came from seven different appli-
cation scenario (i.e., Information Retrieval, Question Answering, Machine
Translation, Information Extraction, Reading Comprehension, Paraphrase
Acquisition) where textual entailment is supposed to play a crucial role.
Typically, T consists of one sentence while H was often made of a shorter
sentence. The dataset has been split in a training (576 pairs) and a test
(800 pairs) part.
4.2 Experiments
The following configurations of the system have been experimented.
System 1: Tree Edit Distance Baseline. In this configuration, consid-

ered as a baseline for the Tree Edit Distance approach, the cost of the three
edit operations are set as follows:
Deletion: always 0
Insertion: the idf of the word to be inserted
Substitution: 0 if w1 = w2 , infinite in all the other cases.
In this configuration the system just needs a non-annotated corpus for
estimating the idf of the word to be inserted. Deletion is 0 because we
expect much more deletion that insertions, due to the fact that T is longer
than H, implying that much more deletions than insertions are necessary.
System 2: Deletion as idf. In this configuration we wanted to check the

impact of assigning a cost to the deletion operation.
Deletion: the idf of the word to be deleted
Substitution: same as system 1.
System 3: Similarity Database. This is the same as system 1., but we

estimate the cost of substitutions using the similarity database described
in Section 3.3. We expect a broad coverage with respect to the previous
system.
Deletion: always 0
Substitution: same as system 2, plus similarity rules.
1
The dataset is available for download at http://www.pascal-network.org/
Challenges/RTE/Datasets .
4.3 Results
For each system we have tested we report a table with the following data:
• #attempted. The number of deletions, insertion and substitutions
that the algorithm successfully attempted (i.e., operations that are
included in the best sequence of editing transformations).
• %success. The proportion of deletions, insertion and substitutions
that the algorithm successfully attempted over the total attempted.
• accuracy. The proportion of T-H pairs correctly classified by the
system over the total number of pairs.
• cws. Confidence Weighted score, The proportion of T-H pairs cor-
rectly classified by the system over the total number of pairs.
Table 1 shows the results obtained by the systems we experimented.
System 1 System 2 System 3

#deletions 20325 19984 20101
#insertions 7927 7586 7703
#substitutions 2686 3027 2910
%deletions 0.88 0.87 0.87
%insertions 0.75 0.71 0.73
%substitutions 0.25 0.29 0.27
accuracy 0.560 0.481 0.566
cws 0.615 0.453 0.624
Table 1: Results
The hypothesis about the 0 cost of the deletion operation was confirmed by
the results of system 2.
The similarity database used in system 3 increased the number of the
successful substitutions made by the algorithm from 25% to 27%. It also
increased performance of the system for both accuracy and cws. The impact
of the similarity database on the result is small, because of the low similarity
between the dependency trees of h and t.
5 Discussion
The approach we have presented can be used as a framework for testing re-
sources for lexical entailment. The experiments show that using a similarity
database with the edit distance algorithm can be of benefit for recognizing
textual entailment.
In order to test the real performance of resources of lexical entailment
rules a set of pair from RTE dataset which require lexical entailment rules
must be selected.
The performance of the system is correlated with the performance of the

used resources. A wrong substitution with a low cost can significantly influ-
ence the optimal cost of the tree mapping. In order to obtain good results
we should consider for substation only pairs with high entailment score (in
our experiment similarity).
The tree edit distance algorithm is designed to work with substitution
on the level of tree nodes while our analysis of the pascal-rte dataset show
that sub-tree substitutions are more suitable for the task. Other resources
of entailment rules (e.g., paraphrases in (Lin & Pantel 2001), entailment
patters as acquired in (Szpektor et al. 2004)) could significantly wide the
application of entailment rules and, consequently, improve performances.
We estimated that for about 40% of the the true positive pairs the sys-
tem could have used entailment rules found in entailment and paraphrasing
resources. As an example, the pair 565:
t - Soprano’s Square: Milan, Italy, home of the famed La
Scala opera house, honored soprano Maria Callas on Wednesday
when it renamed a new square after the diva.
h - La Scala opera house is located in Milan, Italy.
could be successfully solved using a paraphrase pattern such as Y home
of X <=> X is located in Y, which can be found in (Lin & Pantel 2001).
However, in order to use this kind of entailment rules , it would be necessary
to extend the “single node” implementation of tree edit distance to address
editing operations among sub-trees. A system with an algorithm capable of
calculating the cost of substitution on the level of subtrees can be used as
a framework for testing paraphrase and entailment acquisition systems.
A drawback of the tree edit distance approach is that it is not able to
observe the whole tree, but only the subtree of the processed node. For
example, the cost of the insertion of a subtree in h could be smaller if the
same subtree is deleted from t in prior or later stage. A context sensitive
extension of the insertion and deletion module will increase the performance
of the system.
We have presented an approach for recognizing textual entailment based on

tree edit distance applied to the dependency trees of T and H. We have
also demonstrated that using lexical similarity resources can increase the
performance of a system based on such algorithm.
In the future we plan to incorporate more resources for lexical entailment
rules. We would like compare the performance obtained with the similarity
database to WordNet (Fellbaum 1998), a lexical database which includes
lexical and semantic relations among word senses. Moreover, in order to

use entailment and paraphrasing resources, we plan to extend the tree edit
distance algorithm with sub-tree substitutions.
REFERENCES
Dagan, Ido & Oren Glickman. 2004. “Generic Applied Modeling of Language
Variability”. Proceedings of PASCAL Workshop on Learning Methods for
Text Understanding and Mining. Grenoble, France.
Dagan, Ido, Oren Glickman & Bernardo Magnini. 2005. “The PASCAL Recog-
nizing Textual Entailment Challenge”. Proceedings of PASCAL Workshop
on Recognizing Textual Entailment, 1-8. Southampton, U.K.
Fellbaum, Christiane. 1998. WordNet, an Electronic Lexical Database. Cam-
bridge, Mass.: MIT Press.
Lin, Dekang. 1998. “Dependency-based evaluation of MINIPAR”. Proceedings
of the Workshop on Evaluation of Parsing Systems (LREC-1998 ). Granada,
Spain.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity”. Pro-
ceedings of 15th International Conference on Machine Learning, 296-304.
Madison, Wisconsin.
Lin, Dekang & Patrik Pantel. 2001. “Discovery of inference rules for Question
Answering”. Natural Language Engineering 7.4:343-360.
Monz, Christof & Maarten de Rijke. 2001. “Light-Weight Entailment Check-
ing for Computational Semantics”. Proceedings of The third workshop on
inference in computational semantics (ICoS-3 ), 59-72. Siena, Italy.
Punyakanok, Vasin, Dan Roth & Wen-tau Yih. 2004. “Mapping Dependencies
Trees: An Application to Question Answering”. Proceedings of AI & Math
2004. Fort Lauderdale, Florida.
Ratnaparkhi, Adwait. 1996. “A Maximum Entropy Part-Of-Speech Tagger”.
Proceeding of the Empirical Methods in Natural Language Processing Con-
ference, 491-497. Philadelphia, Pennsylvania.
Szpektor, Idan, Hristo Tanev, Ido Dagan & Bonaventura Coppola. 2004. “Scaling
Web-based Acquisition of Entailment Relations”. Proceedings of Conference
on Empirical Methods in Natural Language Processing (EMNLP-04 ), 41-48.
Barcelona, Spain.
Zhang, Kaizhong & Dennis Shasha. 1990. “Simple fast algorithms for the editing
distance betweentrees and related problems”. SIAM Journal on Computing
18:6.1245-1262.
A Genetic Algorithm for Optimising Information
Retrieval with Linguistic Features in Question
Answering
Jörg Tiedemann
University of Groningen
Abstract
In this chapter we describe a genetic algorithm applied to the optimi-
sation of information retrieval (ir) within a Dutch question answer-
ing (qa) system. In our approach, information retrieval is used for
passage retrieval to filter out relevant text passages for subsequent
answer extraction modules in the qa system. We focus on the in-
tegration of linguistic features extracted from syntactic analyses of
the texts in the retrieval component. For this we apply output of
the Dutch dependency parser Alpino together with an off-the-shelf
ir engine. Query parameters are optimised using a genetic algorithm
to improve the overall performance of the qa system.
1 Introduction
Standard information retrieval is commonly used in question answering sys-

tems in order to reduce the search space for the system to find answers to
given natural questions in unrestricted texts. Consequently, ir is one of
the bottlenecks in such a system because it is bound to fail if ir does not
provide relevant passages containing answers. The simplest approach to ir
in qa uses words appearing in the question as keywords for standard plain
text retrieval. However this is a strong limitation disregarding other infor-
mation included in a natural language question. Recently, many researchers
investigate possibilities of integrating linguistic knowledge into ir especially
to facilitate question answering (Strzalkowski 1996, Katz & Lin 2003, Neu-
mann & Sacaleanu 2004, Moldovan et al. 02, Prager et al. 2000, Cui et
al. 2001). One reason for the current effort in this field is the increased
availability of robust nlp tools with large coverage and improved efficiency.
However, results achieved so far are rather modest and simple techniques
such as stemming are still the most effective ones. It seems to be important
to select appropriate features in relevant cases in order to succeed in a task
like information retrieval (Katz & Lin 2003). Therefore, we will concentrate
on the optimisation of feature selection and keyword weighting.
The remaining part of this chapter is organised as follows: The next
section briefly describes the ir component in our qa system. Thereafter,
we will discuss the genetic algorithm used for optimisation. Then we will
178 JÖRG TIEDEMANN
show experimental results on open-domain question answering followed by

a summary and some conclusions.
2 Information retrieval with linguistic features
In our qa system we use off-the-shelf ir engines for passage retrieval. We

implemented an interface to several open-source ir packages and in this
paper we will apply the well-known Lucene system from the Apache Jakarta
project (Jakarta 2004). We work on Dutch qa and, therefore, we also apply
simple extensions of Lucene for Dutch such as a stemmer and a stop word
filter. Our work is focused on open-domain question answering using data
from the cross-language evaluation forum clef.
As mentioned earlier, we like to integrate linguistic knowledge into the
retrieval of relevant passages. For this, we have parsed the entire clef
corpus for Dutch with its approx. 1,000,000 paragraphs containing more
than 4,000,000 sentences. We applied the wide-coverage dependency parser
Alpino (Bouma et al. 2001) for this task, which produces among the depen-
dency relations also part-of-speech labels, named entity labels, and linguistic
root forms. An example of the output of Alpino is shown in Figure 1. From
the syntactic analyses we extract several combinations of features for each
paragraph and store them in different fields in the ir index. Henceforth, we
will refer to these fields as “index layers”. Table 1 summarises the layers de-
fined in our index. Feature combinations such as dependent-head bigrams
for each word in each paragraph ...

text plain text tokens
root linguistic root forms
RootPOS root + POS tag
RootRel root + relation (to its head)
RootHead root (dependent) + root (head)
RootRelHead dependent + relation + head
for selected words in each paragraph ...
compound compounds
ne named entities
neLOC location names
nePER person names
neORG organisation names
neTypes labels of named entities
Table 1: Index layers
are simply concatenated using a special delimiter symbol to form single

tokens that can be queried.
A GENETIC ALGORITHM FOR OPTIMISING IR IN QA 179
top
smain
su hd
vc
1 verb
ppart
np word4
det hd hd
mod obj1 mod
det noun verb
pp 1 pp
het0 embargo1 stel in5
hd obj1 hd
obj1
prep name prep
np
tegen2 Irak3 na6
det hd
mod mod
det noun
pp pp
de7 inval8
hd obj1 hd obj1
prep name prep noun
in9 Koeweit10 in11 199012
Fig. 1: A dependency tree produced by Alpino: Het embargo tegen Irak werd
ingesteld na de inval in Koeweit in 1990. (The embargo against Iraq has
been declared after the invasion of Kuwait in 1990.)
The index described above is filled with the data from the machine anno-
tated clef corpus. Now it is possible to send complex queries to the ir
engine using the powerful query language supported by Lucene. There are
several features that we explore in our experiments:
Field restrictions: This is the basic parameter needed to search in the
multi-layer index. We have to specify the field we like to query with
the given keywords. For example, we can look into the text layer with
a query such as “text:(Irak Verenigde Naties)”
Keyword weighting: Each keyword in a query can be weighted with a
so-called “boost factor”. For example, we can weight the named entity
(ne) “Irak” with a weight of 2: “ne:(Irak^2 Verenigde_Naties)”
Required keywords: Lucene’s query language allows one to mark key-
words to be required in matching paragraphs using the character ’+’.
For example, to require the named entity “Verenigde Naties” we can

write “ne:(Irak^2 +Verenigde_Naties)”.
Questions are parsed by Alpino producing similar annotations as done for
the sentences in the clef corpus. Now, the main idea is to match features
extracted from the analysed question with paragraphs in the index. As
we argued earlier, it is important to select appropriate features (in terms
of keywords) in order to improve retrieval performance. Hence, we do not
want to simply use all possible feature combinations matching one of the
fields in the index. Rather, we want to perform a fine-grained selection
of features to combine the most appropriate keywords in a query to the ir
engine. For this we can not only select one or more index layers to be queried
but also restrict the set of selected keywords by additional constraints. For
example, we can restrict the query to only include plain text tokens that
have been labelled as nouns. Any kind of restriction can be used to define a
subset of keywords, which then can be combined with other subsets. Now,
we can also set boost factors for certain selections of keywords to weight
them differently than other keywords. Furthermore, we can mark keyword
selections as required.
In our experiments we use constraints on part-of-speech labels and rela-
tion types (dependency relations of words to their head words). Of course,
both types of constraints can also be combined. In this way we have a large
variety of keyword types (index layer + constraints) that can be weighted
in various ways. We use some simple preference heuristics to keep queries
at a reasonable size. For instance, we define that weights of more “specific”
keywords (with stronger constraints) overwrite weights of less specific ones.
We also define that restrictions on relation names are more specific than
POS restrictions and required markers overwrite weights for keywords with
identical specificity. We also use some heuristics to prune queries. For a
given set of questions, we do the following: Keyword types are discarded if
1. they do not produce any keywords for all questions, i.e. no keywords
are selected using the given constraints for any question
2. they do not add any new keyword nor they change any parameter
(weight/required marker) of any query (i.e. all keywords selected
are already contained in the existing queries with exactly the same
weights)
3. they do not produce any ir output for all questions
3 A Genetic Algorithm for query optimisation
In our approach, we use ir as a black box filled with our data. We can
modify the input parameters and inspect the output produced but we do not
know exactly how the internal ranking is influenced by the given parameters.
However, we want to perform a systematic search for optimised queries to
improve the performance of the qa system. Here, we can use a randomised
optimisation process to iteratively search through possible settings with the
goal to find the optimal parameter setup. Genetic algorithms have proven to
be useful for such a task, running in a manner of a “focused trial-and-error”
beam search through a pre-defined hypothesis space. The goal in genetic
algorithms is to let a population of individuals find a way to better fitness in
their environment using evolutionary principles. Useful genes (=settings)
are kept and combined in the population from generation to generation.
Random changes are introduced using mutation.
Here, individuals refer to qa processes trying to answer a given set of
questions. Each individual differs in the settings used for ir. The parame-
ters are simply the keyword types used in the ir queries as defined in the
previous section. Here, each of these keyword types may have exactly one
weight or a required marker. Fitness is determined by the quality of the
result when running qa with the individual’s ir settings. We use mean
reciprocal ranks of the top five answers to measure this quality. One impor-
tant principle in genetic algorithms is “crossover/reproduction”, i.e. means
to produce new children for coming generations. Here, we define crossover
as the creation of a new qa process using the ir settings of two parents from
the current population. The settings of the two parents are simply merged.
In cases of overlaps we compute the arithmetic mean of parameters (i.e.
weights). Again, required markers overwrite weights on identical keyword
types. Parents are selected randomly without any preference mechanism.
Another important principle in genetic algorithms is “natural selection”.
This is usually done in a probabilistic way. However, we simplify this func-
tion to a top N search for the individuals with the highest fitness scores in
the current population for reasons we will explain below.
One of the beauties of genetic algorithms is the possibility to run par-
allel processes in order to explore the space of possible solutions. This is
especially useful in our case because question answering requires a lot of
processing power and time. It is crucial that we can run through a large
number of settings in order to find a (sub-)optimal solution. For our exper-
iments, we use a Beowulf Linux cluster to run up to 50 processes in parallel
(that is what we define to be the maximum number of children to “grow
up” at a time). We keep the size of the population small with only 25 in-
dividuals from which new children are created. This means that we define
our selection process to be a top 25 search among “living individuals”. This
makes it possible to apply the selection procedure each time a child reports
its fitness score. Hence, we do not have to wait for an entire new generation
which would be necessary when applying probabilistic selection.
We believe that this procedure of gradually changing the population de-

scribes a more natural development than jumping from one generation to
an entirely new one from time to time. The disadvantage of top N selection
is that it is deterministic and, therefore, the optimisation process might
get stuck in local maxima more easily. This, however is compensated by
probabilistic mutation operations with rather high likelihoods as explained
below.
As a side effect of the top N search, children that develop faster will
also have a larger chance to survive (at least until stronger individuals ap-
pear pushing them out of the population). This may reduce the effects of
over-fitting because we prefer simple queries to complex ones which usually
require more processing time. However, complex ones are not discarded and
may overrule simpler ones (and their ancestors) in the long run.
Mutation operations are defined to include additional randomness in the
optimisation process. Standard approaches apply mappings of parameters
to gene sequences. However, we define mutation operations directly on the
parameters given. We use the following operations on new indivudals:
• a new keyword type is added with a chance of 0.2;
• a keyword type is removed with a chance of 0.1;
• a keyword weight is modified by a random value between −5 and 5
with a chance of 0.1 (+ the weight has to remain a positive value);
• a keyword type is marked as required with a chance of 0.01.
As mentioned earlier, we have chosen to set rather high likelihoods to the
mutation operations to compensate for the deterministic selection process
and also to account for the small populations size. We have not experi-
mented with different or even variable settings which might be an interesting
investigation in future work.
4 Experiments
For our experiments we used date from the clef competition on Dutch
qa. We used questions from the last three years, 2003 – 2005 for training
and evaluation. These question sets have been annotated with their correct
answers and supporting documents from the clef corpus in which the an-
swer can be found (if they were found by one of the participating teams).
We used only those questions which have at least one answer and we use
simple string matching to determine if our system produces correct answers
(i.e. we used answer strings instead of document IDs for evaluation). We
randomly selected 250 questions for evaluation and used the remaining 522
questions for training.
The genetic algorithm has been initialised with single keyword type settings
using default settings for weights (=1) and no required markers. We limited
ourselves to a small set of possible constraints to select keywords of different
types. We included constraints on a sub-set of word classes (names, nouns,
adjectives and verbs) and a sub-set of dependency relations (direct object,
modifier, apposition and subject). Each combination of constraints has been
considered and applied to each of the index layers to form the set of initial
settings. However, as explained earlier, we pruned this set by discarding
keyword types according to the conditions defined in section 2.
After running the initial settings the genetic algorithm decides how to
combine them and how to develop further. Table 2 shows the results after
various numbers of settings tested.
number of settings training evaluation
300 0.419 0.413
450 0.426 0.421
600 0.429 0.429
750 0.436 0.436
900 0.436 0.436
1000 0.441 0.452
1200 0.447 0.455
1500 0.448 0.442
1700 0.449 0.442
baseline (plain text keywords) 0.407 0.412
Table 2: Improvements of QA performance by optimising IR query
parameters
We obtained scores above the baseline already after 300 indivuduals tested
(initial settings included). Here, the baseline is the best performing initial
setting, namely plain text word keywords which is similar to standard ir
(Dutch stemming and stop word removal has been applied). In the top
part of Figure 2, the training curve is plotted for the 1700 settings. The
bottom part shows the corresponding scores on unseen evaluation data.
The thin black lines refer to the scores of individual settings tested in the
optimisation process. The solid bold lines refer to the top scores obtained
using the optimised query settings.
The plots illustrate the typical behaviour of genetic algorithms. The
random modifications cause the oscillation of the fitness scores plotted with
thin black lines. The algorithm picks up the advantageous features and
promotes them in the competitive selection process. This gives the gradual
improvements in terms of fitness of the top individual tested. The final
scores are well above the base line (about 10% improvement in MRR). The
largest improvements can be observed in the beginning of the optimisation
0.46
0.44
0.42
answer string MRR
0.4
0.38
0.36
0.34
0.32
baseline (0.407)
optimized (training)
0.3
0 200 400 600 800 1000 1200 1400 1600
number of settings
0.46
0.44
0.42
answer string MRR
0.4
0.38
0.36
0.34
0.32
baseline (0.412)
optimized (evaluation)
0.3
0 200 400 600 800 1000 1200 1400 1600
number of settings
Fig. 2: Parameter optimisation with a genetic algorithm. Training (top)
and evaluation (bottom).
process which is very common in machine learning approaches. The training

curve starts to level out after about 1300 settings.
The two plots in Figure 2 also illustrate the relation between the impact
of ir variation on training data and on unseen evaluation data. There is
a strong correlation between the scores obtained when applying the var-
ious ir settings to the two different data sets. The general tendency of
evaluation scores is similar to the training curve with step-wise improve-
ments throughout the optimisation process even though the development of
the evaluation score is not monotonic. The optimisation process seems to

succeed in picking up preferable features for improving qa in general.
Besides the small drops of evaluation scores in the beginning of the op-
timisation process we can also observe a tendency of decreasing values after
about 1500 settings which is probably due to over-fitting. This suggests
some form of early stopping or a stop condition based on validation data.
Note also that the scores on evaluation data using optimised settings do
not always reach the top scores seen among all individuals tested on this
data set. The highest score observed on evaluation data actually amounts
to 0.469 which is clearly above the 0.455 obtained when using settings op-
timised on the training data. However, the differences are small and the
optimised settings produce scores high above the baseline on both, training
and evaluation data.

In this chapter we described the information retrieval component of our
open-domain question answering system. We integrated linguistic features
produced by a wide-coverage parser for Dutch, Alpino, in the ir index to
improve the retrieval of relevant paragraphs. These features are stored
in several index layers that can be queried by the qa system in various
ways. We also used word class labels and syntactic relations as constraints
in the selection of keywords from analysed natural language questions to
build the ir queries. We demonstrated a genetic algorithm for optimising
query parameters such as keyword weights and selection constraints to take
advantage of the enriched ir index. Query parameters are trained on a
set of questions annotated with answers taken from the clef competitions
on Dutch question answering. We showed that the performance of the
qa system could be improved by about 10% on unseen evaluation data
measured with the mean reciprocal rank of retrieved answers.
In future work we would like to test further combinations of features
and constraints to improve the performance even further. We would also
like to experiment with different settings of the genetic algorithm in order
to investigate the impact of these parameters on the optimisation process.
Acknowledgements. This research was carried out as part of the research

program for Interactive Multimedia Information Extraction, imix, financed by
nwo, the Dutch Organisation for Scientific Research.
REFERENCES
Bouma, Gosse, Gertjan van Noord & Robert Malouf. 2001. “Alpino: Wide
coverage computational analysis of Dutch”. Computational Linguistics in
the Netherlands (CLIN 2000 ), 45-59. Tilburg, The Netherlands: Rodopi.
Cui, Hang, Renxu Sun, Keya Li, Min-Yen Kan & Tat-Seng Chua. 2005. “Ques-
tion Answering Passage Retrieval Using Dependency Relations”. Proceedings
of the 28th Annual International ACM SIGIR Conference on Research and
Development of Information Retrieval (SIGIR 2005 ), 400-407. Salvador,
Brazil.
Apache Jakarta. 2004. Apache Lucene - A High-performance, Full-featured Text
Search Engine Library. http://lucene.apache.org/java/docs/index.html
[Source checked in April 2006].
Katz, Boris & Jimmy Lin. 2003. “Selectively using relations to improve precision
in question answering”. Proceedings of the EACL-2003 Workshop on Natural
Language Processing for Question Answering, 43-50. Budapest, Hungary.
Moldovan, Dan, Sanda Harabagiu, Roxana Girju, Paul Morarescu, Finley Laca-
tusu, Adrian Novischi, Adriana Badulescu & Orest Bolohan. 2002. “LCC
Tools for Question Answering”. Proceedings of the 11th Text Retrieval Con-
ference, TREC, 388-397. Gaithersburg, MD, USA.
Neumann, Günter and Bogdan Sacaleanu. 2004. “Experiments on Robust NL
Question Interpretation and Multi-Layered Document Annotation for a Cross-
Language Question/Answering System”. Multilingual Information Access for
Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation
Forum, CLEF, Lecture Notes in Computer Science, Volume 3491, Springer
Berlin / Heidelberg, 411-422. Bath, UK.
Prager, John, Eric Brown, Anni Cohen, Dragomir Radev & Valerie Samn. 2000.
“Question-Answering by Predictive Annotation”. Proceedings of the 23rd
Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, 184-191. Athens, Greece.
Strzalkowski, Tomek, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang
Lin, José Pérez-Carballo, Troy Straszheim, Jin Wang & Jon Wilding. 1997.
“Natural Language Information Retrieval: TREC-5 Report”. Proceedings of
the 5th Text Retrieval Conference, TREC, 291-314. Gaithersburg, MD, USA.
Lexico-Syntactic Subsumption for Textual Entailment
Vasile Rus & Arthur Graesser
The University of Memphis
Abstract
We present here a graph-based approach to the task of textual en-
tailment between a Text and Hypothesis. The approach takes into
account the full lexico-syntactic context of both the Text and Hy-
pothesis and relies heavily on the concept of subsumption. It starts
with mapping the Text and Hypothesis into graph-structures where
nodes represent concepts and edges represent lexico-syntactic rela-
tions among concepts. Based on a subsumption score between the
Text-graph and Hypothesis-graph an entailment decision is made. A
novel feature of our approach is the handling of negation. The results
obtained from a standard entailment test data set are better than
results reported by systems that use similar knowledge resources.
We also found that a tf-idf approach performs close to chance for
sentence-like Texts and Hypotheses.
1 Introduction
Recognizing Textual Entailment (rte) (Dagan, Glickman & Magnini 2005)

is the task of deciding, given two text fragments, whether the meaning of
one text is entailed (can be strongly inferred) from another text (Dagan,
Glickman & Magnini 2005). We say that T (the entailing text) entails H
(the entailed hypothesis). The task is relevant to a large number of applica-
tions, including machine translation, question answering, and information
retrieval.
In this paper we present a novel approach to rte that uses minimal
knowledge resources. The approach relies on lexico-syntactic information
and synonymy specified in a thesaurus. The results obtained are better
than state-of-the-art solutions that use same array of resources.
In our approach each T-H pair is first mapped into two graphs, one for H
and one for T, with nodes representing main concepts and edges indicating
dependencies among concepts as encoded in H and T, respectively. An
entailment score, entail(T, H), is then computed that quantifies the degree
to which the T-graph subsumes the H-graph. The score is so defined to be
non-reflexive, i.e. entail(T, H) 6= entail(H, T ). We evaluate the approach
on the standard rte Challenge data (Dagan, Glickman & Magnini 2005).
The application of this work to Intelligent Tutoring Systems (its), namely
AutoTutor (Graesser, Lu, Jackson, Mitchell, Ventura, Olney & Louwerse
188 VASILE RUS & ARTHUR C. GRAESSER
2004; Graesser, VanLehn, Rose, Jordan & Harter 2001), is under investi-
gation. AutoTutor is an its that works by having a conversation with the
learner. AutoTutor appears as an animated agent that acts as a dialog
partner with the learner. The animated agent delivers AutoTutor’s dialog
moves with synthesized speech, intonation, facial expressions, and gestures.
Students are encouraged to articulate lengthy answers that exhibit deep
reasoning, rather than to recite small bits of shallow knowledge. It is in the
deep knowledge and reasoning part where we explore the potential help of
entailment.
The rest of the paper is structured as follows. Section 2 outlines re-
lated work. Section 3 describes our approach in detail. Section 4 presents
the experiments we performed, results and a comparative view of similar
approaches. The Conclusions section wraps up the paper.
2 Related work
The task of textual entailment has been recently treated, in one form or an-
other, by research groups ranging from informational retrieval to language
processing leading to a large spectrum of approaches from tf-idf (term fre-
quency - inverse document frequency) to knowledge-intensive.
In one of the earliest explicit treatments of entailment Monz & de Rijke
(2001) proposed a weighted bag of words approach. They argued that tra-
ditional inference systems based on first order logic are limited to yes/no
judgements when it comes to entailment tasks whereas their approach de-
livered “graded outcomes”. They established entailment relations among
larger pieces of text (4 sentences on average) than the proposed rte setup
where the text size is a sentence (seldom two) or part of a sentence (phrase).
We replicated their approach, for comparison purposes, and report its per-
formance on the rte test data.
Pazienza and colleagues Pazienza, Pennacchiotti & Zanzotto (2005) use
a syntactic graph distance approach for the task of textual entailment. The
set of dependencies, or grammatical relations, they use is different from
ours. Further, it is not clear if they handle remote dependencies or not.
Among the drawbacks of their work, as compared to ours, is lack of negation
handling.
Recently, Dagan & Glickman (2004) presented a probabilistic approach
to textual entailment based on lexico-syntactic structures. They use a
knowledge base with entailment patterns and a set of inference rules. The
patterns are composed of a pattern structure (entailing template → entailed
template) and a quantity that tells the probability that a text which entails
the entailing template also entails the entailed template. This is a good
example of a knowledge-intensive approach.
LEXICO-SYNTACTIC SUBSUMPTION FOR TEXTUAL ENTAILMENT 189
A closely related effort, although not labeled as entailment (probably be-

cause it was ahead of the rte era), is presented in Moldovan & Rus (2001).
They show how to use unification and matching to address the answer cor-
rectness problem. Answer correctness can be viewed as entailment: Is a
candidate answer entailing the ideal answer to the question? Initially, the
question is paired with an answer from a list of candidate answers. The
resulting pair is mapped into a first-order logic representation and a unifi-
cation process between the question and the answer follows. As a back-off
step, for the case when no full unification is possible, the answer with high-
est unification score is ranked at the top. The task they describe is different
than the rte task because a list of candidate answers to rank are available.
The granularity of candidate answers and questions is similar to the rte
data.
3 Approach
Our solution for recognizing textual entailment is based on the idea of sub-
sumption. In general, an object X subsumes an object Y if X is more general
than or identical to Y, or alternatively we say Y is more specific than X.
The same idea applies to more complex objects, such as structures of in-
terrelated objects. Applied to textual entailment, subsumption translates
into the following: hypothesis H is entailed from text T if and only if T
subsumes H. We use graph representations for the text T and hypothesis H
to implement the idea of subsumption.
manage d−obj
to−comp
subj mod mod
compounds
bombers not to
det p−comp det nn
mod
the enter
subj the embassy
Fig. 1: Example of dependency graph for test pair #1981 from RTE (edges
are in grayscale to better visualize the correspondence)
The two text fragments involved in a textual entailment decision are initially
mapped into a graph representation that has its roots in the dependency-
graph formalisms of Hays (1964) and Mel’cuk (1998). The mapping process
has three phases: preprocessing, dependency graph generation, and final
graph generation.
In the preprocessing phase we do tokenization, lemmatization, part-of-

speech tagging and parsing. The preprocessing continues with a step in
which parse trees are transformed in a way that helps the graph generation
process in the next phase. For example, auxiliaries and passive voice are
eliminated but their important information is kept: voices are marked as
additional labels to the verb tags, while aspect information (derived from
modals such as may) is recorded as an extra marker of the node gener-
ated for a particular verb. An important step, part of the preprocessing
phase, indentifies major concepts in the input: named entities, collocations,
postmodifiers, existentials, etc. A dictionary of collocations, compiled from
WordNet (Miller 1995), and a simple algorithm help us detect collocations.
subj obj
mod s−comp nn
The bombers had not managed to enter the embassy compounds.
subj mod
obj
subj nn
The bombers entered the embassy compounds.

Fig. 2: Example of graph representation for test pair #1981 from RTE
(Edges are in grayscale to better visualize the correspondence)
The mapping from text to the graph-representation is based on information

from parse trees. We use Charniak’s (2000) parser to obtain parse trees and
head-detection rules (Magerman 1994) to obtain the head of each phrase. A
dependency tree is generated by linking the head of each phrase to its mod-
ifiers in a straightforward mapping step. The problem with the dependency
tree is that it only encodes local dependencies (head-modifiers). Remote
dependencies, as the subject relation between bombers and entered in Fig-
ure 1, are not marked in such dependency trees. An extra step transforms
the previous dependency tree into a dependency graph in which remote
dependencies are explicitly marked. Furthermore, the dependency graph
is transformed into a final graph in which direct relations among content
words are coded (Figure 2). For instance, a mod dependency between a noun
and its attached preposition is replaced by a direct dependency between the
prepositional head and prepositional object. The remote dependencies are
obtained using a naive-bayes functional tagger (Rus & Desai 2005).
As soon as the graph representation is obtained a graph matching operation

is initialized. A node in the Hypothesis is paired with a node in Text.
If no direct matching (called identity match) is possible we use synonyms
from WordNet. Nodes in the Text graph that have corresponding entries in
WordNet (i.e. nodes for content words) are mapped to all possible senses
in the database and the matching continues with comparing all the words
in a given synset to the to-be-matched Hypothesis node. Possible multiples
matchings are solved using criteria elaborated in (Rus, Graesser & Desai
2005).
3.1 Graph matching and subsumption

A graph G = (V, E) consists of a set of nodes or vertices V and a set of
edges E. Isomorphism in graph theory is the problem of testing whether
two graphs are really the same (Skiena 1998). Several variations of graph
isomorphism exist in practice of which the subsumption or containment
problem best fits our task. Is graph H contained in (not identical) to graph
T? Graph subsumption consists of finding a mapping from vertices in H
to T such as edges among nodes in H hold among mapped edges in T. In
our case the problem can be further relaxed: attempt a subsumption and if
not possible back-off to a partial subsumption. The important aspect is to
quantify the degree of subsumption of H by T.
We use graphs to model the linguistic information embedded in a sen-
tence: nodes represent concepts and edges represent syntactic relations
among concepts. Furthermore, we map the textual entailment problem
into a graph isomorphism problem: the Text entails the Hypothesis if the
hypothesis graph is contained or subsumed by the text graph.
The subsumption algorithm for textual entailment has three major steps:
(1) find an isomorphism between HV (set of vertices of the Hypothesis
graph) and TV , (2) check whether the labeled edges in H, EH , have cor-
respondents in ET , and (3) compute score. Step 1 is more than a simple
word-matching method since if a node in H doesn’t have a direct correspon-
dent in T a thesaurus is used to find all possible synonyms for nodes in T.
Nodes in H have different priorities: head words are matched first followed
by modifiers. Modifiers that indicate negation are handled separately from
the bare lexico-syntactic subsumption since if H is subsumed at large by T,
and T is not negated but H is, or vice versa (see example in Figure 2), the
overall score should be dropped with high confidence to indicate no entail-
ment. Step 2 takes each relation in H and checks its presence in T. It is
augmented with relation equivalences among appositions, possessives, and
linking verbs (be, have). In the last step, a normalized score for node and
edge mapping is computed. The score for an entailment pair is the weighted
sum of the node and relation matching scores. The overall score is adjusted
after considering negation relations.
P
Vh ∈HV maxVt ∈TV match(Vh , Vt )
entscore(T, H) = (α × +
|Vh |
P
Eh ∈HE maxEt ∈TE synt match(Eh , Et )
β× + γ) ×
|Eh |
(1 + (−1)#neg rel )
(1)
2
We looked at two broad types of negation: explicit and implicit. In this

work, we treated only explicit negation. Explicit negation is indicated by
particles such as: no, not, neither ... nor and their shortened forms ’nt. Im-
plicit negation is present in text via deeper lexico-semantic relations among
different linguistic expressions. The most obvious example is the antonymy
relation among lemmas which can be retrieved from WordNet. Negation is
regarded as a feature of both Text and Hypothesis and it is accounted for in
the score after the entailment decision for the Text-Hypothesis pair without
negation is made. If one of the text fragments is negated, the decision is
reversed; if both are negated the decision is retained (double-negation). In
Equation 1 the term #neg rel represents the number of negation relations
in T and H.
3.2 The scoring

The formula to compute the overall score is provided by Equation 1. The
weights of lexical and syntactic matching are given by parameters α and β,
respectively. The last term of the equation represents the effect of negation
on the entailment decision. An odd number of negation relations between
T and H, denoted #neg rel, would lead to an entailment score of 0 while an
even number will not change the bare lexico-semantic score. The choice of
α, β and γ can have a great impact on the overall score. The Experiments
and Results section talks about how to estimate those parameters. From the
way the score is defined, it is obvious that entscore(H, T ) 6= entscore(T, H).
Before we present results on several baselines and of our lexico-syntactic

graph-based approach let us look at the experimental setup as defined by
the rte Challenge.
4.1 Experimental setup

The dataset of text-hypothesis pairs was collected by human annotators and
it is described in (Dagan, Glickman & Magnini 2005). It is a collection of
seven subsets, which correspond to success and failure examples in differ-
ent applications: Question Answering, Information Retrieval, Comparable
Documents, Reading Comprehension, Paraphrase Acquisition, Information
Extraction, and Machine Translation. The annotators selected from each
application area both positive entailment examples (judged as true), where
T does entail H, as well as negative examples (false), where entailment does
not hold, for a 50%-50% split.
The evaluation is fully automatic. The judgements returned by a sys-
tem are compared to those manually assigned by the human annotators
(the gold standard). Two performance measures were proposed: accuracy
and Confidence-Weighted Score. The percentage of matching judgements
provides the accuracy of the run, i.e. the fraction of correct responses. The
Confidence-Weighted Score (CWS, also known as average precision) is com-
puted by sorting the judgements of the test examples by their confidence (in
decreasing order from the most certain to the least certain) and calculating
the following:
n
1 X # − correct − up − to − pair − i
∗ (2)
n i=1 i
n is the number of the pairs in the test set, and i ranges over the pairs.
The Confidence-Weighted Score varies from 0 (no correct judgements at
all) to 1 (perfect score), and rewards the systems’ ability to assign a higher
confidence score to the correct judgements than to the wrong ones. Precision
was also proposed and represents the percentage of positive cases that were
correctly solved. It is not use as much and we only rarely reported next.
4.2 Results
In this section we first provide some insights on several baseline approaches
and then show how our lexico-syntactic approach works in the rte data.
It is arguable what the baseline for entailment is. In Dagan, Glickman
& Magnini (2005), the first suggested baseline is the method of blindly
and consistently guessing true or false for all test pairs. Since the test
data was balanced between false and true outcomes, this blind baseline
would provide an average accuracy of 0.50. Randomly predicting true
or false is another blind method that leads to a run being better than
chance for (cws>0.540)/(accuracy>0.535) at the 0.05 level or for a run
with (cws>0.558)/(accuracy>0.546) at the 0.01 level.
We experimented here with more informed baselines. The first baseline

we used is the lexical overlap: tokenize, lemmatize (using wnstemm in the
WordNet wn library), ignore punctuation and compute the degree of lexical
overlap between H and T. We normalized the result by dividing the lexical
overlap by the total number of words in H. Then, if the normalized score
is greater than 0.5, we assign a true value meaning T entails H, otherwise
we assign false. The normalized score also plays the role of confidence
score necessary to compute the CWS metric. The results (first row in Table
1) for CWS and accuracy are close to chance, a possible suggestion that
the test corpus is balanced in terms of lexical overlap. The precision (only
accounting for positive entailment cases) of 0.6111 on this lexical baseline
method may indicate that higher lexical matching may be a good indica-
tor of positive entailment. The second informed baseline is the approach
presented in Monz & de Rijke (2001), mentioned earlier. We decided to
apply it to the rte data to compare a pure word-level statistical method
to our method and also to see to what extent tf-idf fits rte-like data. rte
uses sentence-like Hs and Ts as opposed to paragraphs in Monz & de Ri-
jke (2001). A larger context, with more words in both H and T, can favor
a word-level statistical method. Briefly, tf-idf uses idf (inverted document
frequency) as a measure of word importance, or weight, in a document. The
idf weights are derived from the development data and then an entailment
score is computed according to the equation below.
P
t ∈(t∩h) idfk
entscore(t, h) = Pk (3)
tk ∈h idfk
Every score below a certain threshold leads to a false entailment and every-
thing above leads to true entailment. We obtained the optimal threshold
from different runs with different thresholds (0.1, 0.2, ..., 0.9) on the devel-
opment data. The results for the test data presented in the second row in
Table 1 are from the run with the optimal threshold.
system cws accuracy

lexical baseline 0.543 0.538
idf-baseline 0.497 0.505
graph-based 0.604 0.554
Zanzotto (Rome-Milan) 0.557 0.524
Punyakanok 0.569 0.561
Andreevskaia 0.519 0.515
Jijkoun 0.553 0.536
Table 1: Performance and comparison of different approaches on RTE test

data
The third row in the table shows the results on test data obtained with the
proposed graph-based method. Initially we used linear regression to esti-
mate the values of the parameters but then switched to a balanced weighting
(α = β = 0.5, γ = 0) which provided better results on development data.
Depending on the value of the overall score three levels of confidence were
assigned: 1, 0.75, 0.5. For instance, an overall score of 0 led to false en-
tailment with maximum confidence of 1. The results of this lexico-syntactic
approach on test data are significant at 0.01 level.
The bottom rows in Table 1 replicate, for comparison purposes, the
results of systems that participated in the rte Challenge (Dagan, Glickman
& Magnini 2005). We picked the best results (some systems report results
for more than one run) for runs that use similar resources to us: word
overlap, WordNet, and syntactic matching.
5 Conclusions
We presented in this paper a lexico-syntactic approach to textual entail-

ment. A novel feature of our approach is the handling of negation. As
compared to a tf-idf approach it performs significantly better. It also shows
better results than more sophisticated systems that use the same array of
resources. A tf-idf scheme is not particularly suitable for the rte-like entail-
ment task due to data sparseness and the need to perform deeper language
processing to capture finer nuances of language.
Acknowledgements. This research was partially funded by The University

of Memphis and AutoTutor project. The research on AutoTutor was supported
by the National Science Foundation (REC 106965, ITR 0325428) and the DoD
Multidisciplinary University Research Initiative (MURI) administered by ONR
under grant N00014-00-1-0600. Any opinions, findings, and conclusions or rec-
ommendations expressed in this article are those of the authors and do not nec-
essarily reflect the views of The University of Memphis, DoD, ONR or NSF. We
are also grateful to three anonymous reviewers for their valuable comments.
REFERENCES
Charniak, Eugene. 2000. “A Maximum-Entropy-Inspired Parser”. Proceedings

of North American Chapter of Association for Computational Linguistics
(NAACL-2000 ), 132-139. Seattle, Washington.
Dagan, Ido & Oren Glickman. 2004. “Probabilistic Textual Entailment: Generic
Applied Modeling of Language Variability”. Proceedings of Learning Methods
for Text Understanding and Mining. Grenoble, France.
nising Textual Entailment Challenge”. Proceedings of the Recognizing Textual
Entaiment Challenge Workshop (RTE ), 1-8. Southampton, U.K.
Graesser, Arthur, Kurt VanLehn, Carolyn Rose, Pamela Jordan & Derek Har-
ter. 2001. “Intelligent Tutoring Systems with Conversational Dialogue”. AI
Magazine 22:4.39-51.
Graesser, Arthur, Shulan Lu, Tanner Jackson, Heather Mitchell, Mathew Ven-
tura, Andrew Olney & Max Louwerse. 2004. “Autotutor: A Tutor with
Dialogue in Natural Language”. Behavioral Research Methods, Instruments,
and Computers, 36:2.180-193.
Hays, David. 1964. “Dependency Theory: A Formalism and Some Observations”.
Language 40:4.511-525.
Magerman, David. 1994. Natural Language Parsing as Statistical Pattern Recog-
nition. Unpublished PhD thesis, Stanford University, February 1994.
Mel’cuk, Igor. 1998. Dependency Syntax: Theory and Practice. Albany, NY:
State University of New York Press.
Miller, George. 1995. “WordNet: A Lexical Database for English”. Communi-
cations of the ACM (November) 38:11.39-41.
Moldovan, Dan & Vasile Rus. 2001. “Logic Form Transformation of WordNet
and its Applicability to Question Answering”. Proceedings of the 39th Annual
Meeting of the Association for Computational Linguistics (ACL-2001 ), 394-
401. Toulouse, France.
Monz, Christof & Maarten de Rijke. 2001. “Light-Weight Entailment Checking
for Computational Semantics”. Proceedings of Inference in Computational
Semantics (ICoS-3 ), ed. by P. Blackburn & M. Kohlhase, 59-72. Schloss
Dagstuhl, Germany.
Pazienza, Maria, Marco Pennacchiotti & Fabio Zanzotto. 2005. “Textual En-
tailment as Syntactic Graph Distance: A Rule Based and SVM Based Ap-
proach”. Proceedings of the Recognizing Textual Entailment Challenge Work-
shop (RTE ), 25-29. Southampton, U.K.
Rus, Vasile & Kirtan Desai. 2005. “Assigning Function Tags with a Simple
Model”. Proceedings of Conference on Intelligent Text Processing and Com-
putational Linguistics (RTE ), 112-115. Mexico City, Mexico.
Rus, Vasile, Arthur Graesser & Kirtan Desai. 2005. “Lexico-Syntactic Sub-
sumption for Textual Entailment”. Proceedings of International COnference
on Recent Advances in Natural Language Processing Conference (RANLP
2005 ), 444-451. Borovets, Bulgaria.
Skiena, S. S. 1998. The Algorithm Design Manual. New York: Springer-Verlag.
A Knowledge-based Approach to
Text-to-Text Similarity
Courtney Corley, Andras Csomai & Rada Mihalcea
Dept. of Computer Science, University of North Texas
Abstract
In this paper, we present a knowledge-based method for measuring
the semantic similarity of texts. Through experiments performed
on two different applications: (1) paraphrase and entailment iden-
tification, and (2) word sense similarity, we show that this method
outperforms the traditional text similarity metrics based on lexical
matching.
1 Introduction
Measures of text similarity have been used for a long time in applications
in natural language processing and related areas. The typical approach to
finding the similarity between two text segments is to use a simple lexical
matching method, and produce a similarity score based on the number of
lexical units that occur in both input segments. Improvements to this sim-
ple method have considered stemming, stop-word removal, part-of-speech
tagging, longest subsequence matching, as well as various weighting and
normalization factors (Salton et al. 1997). While successful to a certain de-
gree, these lexical similarity methods fail to identify the semantic similarity
of texts. For instance, there is an obvious similarity between the text seg-
ments I own a dog and I have an animal, but most of the current similarity
metrics will fail in identifying a connection between these texts.
In this paper, we explore a knowledge-based method for measuring the
semantic similarity of texts. While there are several methods previously
proposed for finding the semantic similarity of words, to our knowledge the
application of these word-oriented methods to text similarity has not been
yet explored. We introduce an algorithm that combines the word-to-word
similarity metrics into a text-to-text semantic similarity metric, and we
show that this method outperforms the simpler lexical matching similarity
approach, as measured in two different applications: (1) paraphrase and
entailment identification, and (2) word sense similarity.
2 Measuring text semantic similarity
Given two input text segments, we want to automatically derive a score that
indicates their similarity at the semantic level, thus going beyond the simple
198 C. CORLEY, A. CSOMAI & R. MIHALCEA
lexical matching methods traditionally used for this task. Although we

acknowledge the fact that a comprehensive metric of text semantic similarity
should take into account the relations between words, as well as the role
played by the various entities involved in the interactions described by each
of the two texts, we take a first rough cut at this problem and attempt to
model the semantic similarity of texts as a function of the semantic similarity
of the component words. We do this by combining metrics of word-to-word
similarity and language models into a formula that is a potentially good
indicator of the semantic similarity of the two input texts.
2.1 Semantic similarity of words

There is a relatively large number of word-to-word similarity metrics that
were previously proposed in the literature, ranging from distance-oriented
measures computed on semantic networks, to metrics based on models of
distributional similarity learned from large text collections. From these,
we chose to focus our attention on six different metrics, selected mainly
for their observed performance in natural language processing applications,
e.g., malapropism detection (Budanitsky & Hirst 2001) and word sense dis-
ambiguation (Patwardhan et al. 2003), and for their relatively high compu-
tational efficiency.
We conduct our evaluation using the following WordNet-based word
similarity metrics: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin,
and Jiang & Conrath. Note that all these metrics are defined between
concepts, rather than words, but they can be easily turned into a word-to-
word similarity metric by selecting for any given pair of words those two
meanings that lead to the highest concept-to-concept similarity1 .
The Leacock & Chodorow (Leacock & Chodorow 1998) similarity is de-
termined as:
length
Sim lch = − log (1)
2∗D
where length is the length of the shortest path between two concepts using
node-counting, and D is the maximum depth of the taxonomy.
The Lesk similarity of two concepts is defined as a function of the overlap
between the corresponding definitions, as provided by a dictionary. It is
based on an algorithm proposed in (Lesk 1986) as a solution for word sense
disambiguation.
The Wu and Palmer (Wu & Palmer 1994) similarity metric measures the
depth of the two concepts in the WordNet taxonomy, and the depth of the
1
We use the WordNet-based implementation of these metrics, as available in the Word-
Net::Similarity package (Patwardhan 2003).
TEXT SEMANTIC SIMILARITY 199
least common subsumer (LCS), and combines these figures into a similarity
score:
2 ∗ depth(LCS)
Simwup = (2)
depth(concept1 ) + depth(concept2 )
The measure introduced by Resnik (Resnik 1995) returns the information

content (IC) of the LCS of two concepts:
Simres = IC(LCS) (3)
where IC is defined as IC(c) = − log P (c) and P (c) is the probability of

encountering an instance of concept c in a large corpus.
The next measure we use in our experiments is the metric introduced by
Lin (Lin 1998), which builds on Resnik’s measure of similarity, and adds a
normalization factor consisting of the information content of the two input
concepts:
2 ∗ IC(LCS)
Sim lin = (4)
IC(concept1 ) + IC(concept2 )
Finally, the last similarity metric we consider is Jiang & Conrath (Jiang
& Conrath 1997), which returns a score determined by:
1
Simjnc = (5)
IC(concept1 ) + IC(concept2 ) − 2 ∗ IC(LCS)
2.2 Language models

In addition to the semantic similarity of words, we also want to take into
account the specificity of words, so that we can give a higher weight to a
semantic matching identified between two very specific words (e.g., collie
and sheepdog), and give less importance to the similarity score measured
between generic concepts (e.g., go and be). While the specificity of words is
already measured to some extent by their depth in the semantic hierarchy,
we are reinforcing this factor with a corpus-based measure of word speci-
ficity, based on distributional information learned from large corpora. We
determine the specificity of a word using the inverse document frequency in-
troduced in (Sparck-Jones 1972), defined as the total number of documents
in the corpus divided by the total number of documents that include that
word. In the experiments reported here we use the British National Corpus
to derive the document frequency counts, but other corpora can be used to
the same effect.
2.3 Semantic similarity of texts

We define a directional measure of similarity, which indicates the semantic
similarity of a text segment Ti with respect to a text segment Tj . This
definition provides us with the flexibility we need to handle applications

where the directional knowledge is useful (e.g., entailment), and at the same
time it gives us the means to handle bidirectional similarity through a simple
combination of two unidirectional metrics.
For a given pair of text segments, we start by creating sets of open-class
words, with a separate set created for nouns, verbs, adjectives, adverbs, and
cardinals. Next, we try to determine pairs of similar words across the sets
corresponding to the same open-class in the two text segments. For nouns
and verbs, we use a measure of semantic similarity based on WordNet, while
for the other word classes we apply lexical matching2 .
For each noun (verb) in the set of nouns (verbs) belonging to one of
the text segments, we try to identify the noun (verb) in the other text
segment that has the highest semantic similarity (maxSim), according to
one of the six measures of similarity described in Section 2.1. Note that all
possible word senses are considered as potential candidates when seeking a
potential match for the word semantic similarity measure. If this similarity
measure results in a score greater than 0, then the word is added to the set of
similar words for the corresponding word class W Spos 3 . The remaining word
classes: adjectives, adverbs, and cardinals, are checked for lexical matching
and included in the corresponding word class set if a match is found.
The similarity between the input text segments Ti and Tj is then deter-
mined using a scoring function that combines the word-to-word similarities
and the word specificity:
P P
( (maxSim(wk ) ∗ idfwk ))
pos wk ∈{W Spos }
sim(Ti , Tj )Ti = P (6)
idfwk
wk ∈{Tipos }
This score, which has a value between 0 and 1, is a measure of the directional
similarity, in this case computed with respect to Ti . The scores from both
directions can be combined into a bidirectional similarity using a simple
product function:
sim(Ti , Tj ) = sim(Ti , Tj )Ti × sim(Ti , Tj )Tj (7)
2
The reason behind this decision is the fact that most of the semantic similarity measures
apply only to nouns and verbs, and there are only one or two relatedness metrics that
can be applied to adjectives and adverbs.
3
All similarity scores have a value between 0 and 1. The similarity threshold can be
also set to a value larger than 0, which would result in tighter measures of similarity.
3 Application 1: Paraphrase and entailment recognition
To test the effectiveness of the text semantic similarity measures, we use

them to automatically identify if two text segments are paraphrases of each
other. We use the Microsoft paraphrase corpus (Dolan et al. 2004), con-
sisting of 4,076 training and 1,725 test pairs, and determine the number of
correctly identified paraphrase pairs in the corpus using the text semantic
similarity measure as the only indicator of paraphrasing. The paraphrase
pairs in this corpus consist of two text segments labeled with a unique iden-
tifier, which were automatically collected from thousands of news sources
on the Web over a period of 18 months. The pairs were manually annotated
by human judges who determined if the two sentences in a pair were seman-
tically equivalent. The agreement between the human judges who labeled
the candidate paraphrase pairs in this data set was measured at approxi-
mately 83%, which can be considered as an upperbound for an automatic
paraphrase recognition task performed on this data set.
In addition, we also evaluate the measure using the Pascal corpus
(Dagan et al. 2005), consisting of 1,380 test–hypothesis pairs with a di-
rectional entailment (580 development pairs and 800 test pairs). The text
segment pairs in this data set are assigned with a unique identifier and a
true or f alse label, indicating if the test sentence entails the hypothesis or
not. Again, an agreement of about 80% was observed between the human
judges who annotated this data set.
For each of the two data sets, we conduct an unsupervised evaluation,
where the decision on what constitutes a paraphrase (entailment) is made
using a constant similarity threshold of 0.25 across all experiments. A su-
pervised evaluation leading to an additional small improvement is reported
in (Corley et al. 2005), with the optimal threshold and weights determined
through learning on training data.
We evaluate the text similarity metric built on top of the various word-
to-word metrics introduced in Section 2.1. For comparison, we also compute
two baselines: (1) A random baseline created by randomly choosing a true or
false value for each text pair; and (2) A vectorial similarity baseline, using
a cosine similarity measure as traditionally used in information retrieval,
with tf.idf weighting.
For paraphrase identification, we use the bidirectional similarity mea-
sure, and determine the similarity with respect to each of the two text
segments in turn, and then combine them into a bidirectional similarity
metric. For entailment identification, since this is a directional relation, we
only measure the semantic similarity with respect to the hypothesis (the
text that is entailed).
We evaluate the results in terms of accuracy, representing the number of
Metric Acc. Prec. Rec. F

Semantic similarity (knowledge-based)
J&C 0.693 0.722 0.871 0.790
L&C 0.695 0.724 0.870 0.790
Lesk 0.693 0.724 0.866 0.789
Lin 0.693 0.716 0.887 0.792
W&P 0.690 0.702 0.921 0.800
Resnik 0.690 0.690 0.964 0.804
Combined 0.700 0.719 0.893 0.796
Baselines
Vectorial 0.654 0.716 0.795 0.753
Random 0.513 0.683 0.500 0.578
Table 1: Text semantic similarity for paraphrase identification
Metric Acc. Prec. Rec. F

Semantic similarity (knowledge-based)
J&C 0.573 0.543 0.908 0.680
L&C 0.569 0.543 0.870 0.669
Lesk 0.568 0.542 0.875 0.669
Resnik 0.565 0.541 0.850 0.662
Lin 0.563 0.538 0.878 0.667
W&P 0.558 0.534 0.895 0.669
Combined 0.583 0.561 0.755 0.644
Baselines
Vectorial 0.528 0.525 0.588 0.555
Random 0.486 0.486 0.493 0.489
Table 2: Text semantic similarity for entailment identification
correctly identified true or false classifications in the test data set. We also
measure precision, recall and F-measure, calculated with respect to the true
values in each of the test data sets. Tables 1 and 2 show the results obtained
in paraphrase and entailment recognition. We also evaluate a metric that
combines all the similarity measures, including the lexical similarity, using
a simple average, with results indicated in the Combined row.
4 Application 2: Word sense similarity
As a second evaluation testbed for the measures of semantic similarity, we

considered another application, namely the unsupervised clustering of the
WordNet senses (Miller 1995). The goal of this application is to reduce
the often criticized fine granularity of the WordNet sense inventory (Palmer
& Dang 2006), by merging the senses of a given word based on the simi-
larities of the corresponding glossary definitions. Comparative evaluations
are performed by comparing the quality of the sense clusters obtained with
different measures of definition semantic similarity, versus simpler methods

based on lexical similarity.
The sense clustering process proceeds as follows. First, we create a
similarity matrix for all the possible senses of a word, based on the pairwise
similarities of the corresponding glossary definitions. Next, an unsupervised
agglomerative average link clustering algorithm (Jain & Dubes 1988) is
used to find the sense clusters (groupings). The clustering algorithm starts
with the set of all word senses, and considers every sense as an individual
cluster. Then, at every iteration, it merges the two most similar clusters
and recalculates the similarity matrix. Finally, once the clustering process
stops, the resulting set of sense clusters can be used as a coarse-grained
sense inventory for the given word.
An important aspect of any agglomerative clustering algorithm is the
stopping criterion that interrupts the clustering process, thus preventing
the construction of a single large cluster. In the current implementation,
the clustering stops when the similarity of the two most similar synsets falls
below the median of all pairwise similarities (excluding self similarities), as
measured in the initial state. A similar criterion was successfully used in
other algorithms for sense clustering, e.g., (Chklovski & Mihalcea 2003).
To evaluate the sense clustering algorithm, we use a set of 42 highly
ambiguous verbs from WordNet, and their corresponding manually cre-
ated sense clusters. This data set was constructed by trained linguists and
lexicographers at University of Pennsylvania, as part of a sense clustering
project4 . The quality of the generated sense clusters is measured using two
metrics considered standard in clustering evaluation, namely purity and
entropy (Zhao & Karypis 2001), computed for the automatically generated
sense groupings relative to the gold standard clusters.
The entropy shows how the various gold-standard clusters are distributed
within each automatically generated cluster. The purity of an automatically
generated cluster is defined as the fraction represented by the largest cluster
of senses from the gold standard assigned to that cluster.
To evaluate the measures of text similarity, we perform comparative
evaluations of sense clustering solutions using semantic similarity metrics
computed between glossary definitions. We use the measures introduced in
Section 2.1, plus an additional Combined measure, which computes the
arithmetic average of the individual metrics. The baseline once again is
represented by the traditional measure of lexical similarity. We also calcu-
late a random baseline, where the definition similarities are determined as
random numbers between 0 and 1.
Table 3 shows the purity and entropy values of the sense clusters ob-
4
Martha Palmer, personal communication.
Metric Entropy Purity

Sense Clustering
Lin 0.163 0.835
Resnik 0.176 0.827
W&P 0.174 0.824
J&C 0.186 0.823
Lesk 0.186 0.819
L&C 0.186 0.816
Combined 0.163 0.833
Baselines
Lexical Similarity 0.191 0.818
Random 0.248 0.761
Table 3: Sense clustering results using gloss semantic similarities
tained with the semantic similarity metrics, as well as the two baselines.
Note that all evaluation measures take values between 0 and 1, with smaller
entropy values and higher purity values representing clusters of higher qual-
ity (a perfect match with the gold standard would be represented by a set
of clusters with entropy of 0 and purity of 1).
All the semantic similarity measures lead to sense clusters that are better
than those obtained with the lexical similarity measure. The best results are
obtained with the Lin metric, for an overall entropy of 0.163 and a purity of
0.835, at par with the Combined measure that combines together all the
individual metrics. It is worth noting that the random baseline establishes
a rather high lower bound: 0.248 entropy and 0.761 purity. Considering
these values as the origin of a 0–100% evaluation scale, the improvements
brought by the best semantic similarity measure (Lin) relative to the lexical
matching baseline translate into 11.6% and 6.7% improvement for entropy
and purity respectively. These figures are competitive with previously pub-
lished sense clustering results (Chklovski & Mihalcea 2003).
5 Discussion and conclusions
For the task of paraphrase recognition, incorporating semantic informa-

tion into the text similarity measure increases the likelihood of recognition
significantly over the random baseline and over the vectorial similarity base-
line. In the unsupervised setting, the best performance is achieved using
a method that combines several similarity metrics into one, for an overall
accuracy of 70.0%. When learning is used to find the optimal combination
of metrics and optimal threshold, the highest accuracy of 71.5% is obtained
by combining the semantic similarity metrics and the lexical similarity.
For the entailment data set, although we do not explicitly check for en-
tailment, the directional similarity computed for textual entailment recogni-
tion does improve over the random and vectorial similarity baselines. Once
again, the combination of similarity metrics gives the highest accuracy, mea-
sured at 58.3%, with a slight improvement observed in the supervised set-
ting, where the highest accuracy was measured at 58.9%. Both these figures
are competitive with the best results achieved during the Pascal entail-
ment evaluation (Dagan et al. 2005).
For the word sense similarity application, the clusters obtained using the
semantic similarity metrics have higher purity and lower entropy as com-
pared to those generated based on the simpler lexical matching measure.
The best clustering solution is obtained with the Lin similarity metric, for
an entropy of 0.163 and a purity of 0.835, which represent a clear improve-
ment with respect to the lexical matching baseline, taking also into account
the competitive lower bound obtained through random clustering.
Although our method relies on a bag-of-words approach, as it turns
out the use of measures of semantic similarity improves significantly over
the traditional lexical matching metrics5 . We are nonetheless aware that
a bag-of-words approach ignores many of important relationships in sen-
tence structure, such as dependencies between words, or roles played by the
various arguments in the sentence. Future work will consider the investi-
gation of more sophisticated representations of sentence structure, such as
first order predicate logic or semantic parse trees, which should allow for
the implementation of more effective measures of text semantic similarity.
REFERENCES
Budanitsky, Alexander & Graeme Hirst. 2001. “Semantic Distance in WordNet:
An Experimental, Application-oriented Evaluation of Five Measures”. Pro-
ceedings of the Workshop on WordNet and Other Lexical Resources, 29-34.
Pittsburgh, Pennsylvania.
Chklovski, Timothy & Rada Mihalcea. 2003. “Exploiting Agreement and Dis-
agreement of Human Annotators for Word Sense Disambiguation”. Proceed-
ings of the International Conference on Recent Advances in Natural Language
Processing (RANLP ), 98-104. Borovets, Bulgaria.
Corley, Courtney, Andras Csomai & Rada Mihalcea. 2005 “Text Semantic Sim-
ilarity, with Applications”. Proceedings of the International Conference on
Recent Advances in Natural Language Processing (RANLP ), 173-180.
Borovets, Bulgaria.
nising Textual Entailment Challenge”. Proceedings of the PASCAL Work-
shop, 1-8. Southampton, United Kingdom.
5
The improvement of the combined semantic similarity metric over the simpler lexical
similarity measure was found to be statistically significant in all experiments, using a
paired t-test (p < 0.001).
Dolan, William B., Chris Quirk & Chris Brockett. 2004. “Unsupervised Con-
struction of Large Paraphrase Corpora: Exploiting Massively Parallel News
Sources”. Proceedings of the 20th International Conference on Computational
Linguistics (COLING), 250-356 Geneva, Switzerland.
Jain, Anil K.,& Richard C. Dubes. 1998. Algorithms for Clustering Data. En-
glewood, New Jersey: Prentice Hall.
Jiang, Jay J. & David W. Conrath. 1997. “Semantic Similarity Based on Corpus
Statistics and Lexical Taxonomy”. Proceedings of the International Confer-
ence on Research in Computational Linguistics, 19-33. Taiwan.
Leacock, Claudia & Martin Chodorow. 1998. “Combining Local Context and
WordNet Sense Similarity for Word Sense Identification”. WordNet, An
Electronic Lexical Database ed. by Christiane Fellbaum, 265-283. Cambridge,
Mass.: MIT Press.
Lesk, Michael E. 1986. “Automatic Sense Disambiguation Using Machine Read-
able Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone”. Pro-
ceedings of the Special Interest Group on the Design of Communication Con-
ference (SIGDOC ), 24-26. Toronto, Canada.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity”. Pro-
ceedings of the 15th International Conference on Machine Learning (ICML),
296-304. Madison, Wisconsin.
Miller, George A. 1995. “Wordnet: A Lexical Database”. Communication of the
ACM 38:11.39-41.
Palmer, Martha & Hoa Trang Dang. Forthcoming. “Making Fine-Grained and
Coarse-Grained Sense Distinctions, Manually and Automatically”. To appear
in Natural Language Engineering.
Patwardhan, Siddharth, Satanjeev Banerjee & Ted Pedersen. 2003. “Using Mea-
sures of Semantic Relatedness for Word Sense Disambiguation”. Proceedings
of the Fourth International Conference on Intelligent Text Processing and
Computational Linguistics (CICLING), 241-257. Mexico City, Mexico.
Resnik, Philip. 1995. “Using Information Content to Evaluate Semantic Similar-
ity in a Taxonomy”. Proceedings of the 14th International Joint Conference
on Artificial Intelligence, 448-453. Montreal, Canada.
Salton, Gerald & Chris Buckley. 1997. “Term Weighting Approaches In Au-
tomatic Text Retrieval”. Readings in Information Retrieval, 323-328. San
Francisco: Morgan Kaufmann.
Sparck-Jones, Karen. 1972. “A Statistical Interpretation of Term Specificity and
its Application in Retrieval”. Journal of Documentation 28:1.11-21.
Wu, Zhibiao & Martha Palmer. 1994. “Verb Semantics and Lexical Selection”.
Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics (ACL), 133-148. Las Cruces, New Mexico.
Zhao, Ying & George Karypis. 2001. “Criterion Functions for Document Cluster-
ing: Experiments and Analysis”. Technical Report (TR 01-40). Minneapolis,
Minnesota: Department of Computer Science, Univ. of Minnesota.
A Simple WWW-based Method
for Semantic Word Class Acquisition
Keiji Shinzato∗ & Kentaro Torisawa∗∗
∗
Kyoto University &
∗∗
Japan Advanced Institute of Science and Technology
Abstract
This chapter describes a simple method to obtain semantic word
classes from html documents. We previously showed that itemiza-
tions in html documents can contain semantically coherent word
classes. However, not all the itemizations are semantically coherent.
Our goal is to provide a simple method to extract only semantically
coherent itemizations from html documents. Our new method can
perform this task by obtaining hit counts from an existing search
engine 2n times for an itemization consisting of n items.
1 Introduction
There are many natural language processing tasks in which semantically co-
herent word classes can play an important role, and many automatic meth-
ods for obtaining word classes (or semantic similarities) have been proposed
(Church & Hanks 1989, Hindle 1990, Lin 1998, Pantel & Lin 2002). Most
of these methods rely on particular types of word co-occurrence frequencies,
such as noun-verb co-occurrences, obtained from parsed corpora.
On the other hand, we previously showed that itemizations in html
documents on the World Wide Web (www), such as that in Figure 1, can
contain semantically coherent word classes (Shinzato & Torisawa 2004). We
say a class of words is coherent if it contains only words that are semanti-
cally similar to each other and that have a common hypernym other than
trivial hypernyms such as things or objects. The expressions in Figure 1
have common non-trivial hypernyms such as “Record shops”, and consti-
tute a semantically coherent word class. Since one can find a huge number
of html itemizations throughout the www, we can expect a huge number
of semantic classes to be available. However, itemizations do not always
contain semantically coherent word classes. Many itemizations are used
just for formatting html documents properly and contain semantically in-
coherent items. We therefore need an automatic method to filter out such
inappropriate itemizations.
In this chapter, we present a simple filtering method to obtain only
semantically coherent word classes from itemizations in html documents.
The method calculates strength of association between items in a word class
208 KEIJI SHINZATO & KENTARO TORISAWA
<UL><LI>Favorite Record Shops</LI>

<UL><LI>Tower Records</LI>
<LI>HMV</LI>
<LI>Virgin Megastores</LI>
<LI>RECOfan</LI>
</UL></UL>
Fig. 1: Sample code of an itemization layout
extracted from html itemizations using hit counts obtained from a search
engine, and tries to exclude semantically incoherent word classes according
to this strength. Our new method is simple in the sense that all it requires
are hit counts from a search engine, mutual information regarding the hit
counts, and an implementation of Support Vector Machines (Vapnik 1995).
The method is also efficient and lightweight in the sense that it can perform
this task just by obtaining hit counts at most 2n times for a word class
consisting of n items. There is no need to download and analyze a large
number of html documents.
We have tested the effectiveness of our method through experiments
using html documents and human subjects. A problem is that it is difficult
to set a rigorous evaluation criterion for semantic word classes. We try to
solve this problem using a manually tailored thesaurus.
In this chapter, Section 2 reviews previous work and Section 3 describes
our proposed method. Our experimental results, obtained using Japanese
html documents, are presented in Section 4.
2 Previous work
An alternative to our filtering method for word classes is our hyponymy

relation acquisition method (Shinzato & Torisawa 2004). We abbreviate
this method as hram (Hyponymy Relation Acquisition Method). Hram
first extracts html itemizations and downloads documents including the
expressions in the itemizations. The method then finds words that fre-
quently appear in the downloaded documents and appear less frequently
in general texts. Basically, it selects one among such words according to a
score and other heuristics, and produces the word as a common hypernym
for the expressions in the itemization. In general, if the score value pro-
duced with a hypernym is large enough, the resulting hypernym is likely to
be a proper hypernym and the itemization tends to include a semantically
coherent word class. Because of this property, we can regard the whole
procedure as an alternative to our method. More precisely, if we select the
itemizations for which hram produced a hypernym with a high score, then
the selected itemizations tend to be semantically coherent word classes. The
WWW-BASED SEMANTIC WORD CLASS ACQUISITION 209
difference from our new algorithm, though, is that hram requires a large
amount of time and is not appropriate as a filter to obtain a large number
of word classes in a short time period. It needs to download a considerable
number of texts, parse at least part of the texts, and count the occurrence
frequencies of many words. Our aim is to skip such heavyweight processes
and provide a lightweight filter.
As another type of method for obtaining semantically coherent word
classes, there are many methods to automatically generate semantically co-
herent word classes from normal texts. Most of these collect the contexts in
which an expression appears and calculate similarities between the contexts
for the expressions to constitute a word class. In Lin’s work (Lin 1998),
for instance, the contexts for an expression are represented as a set of syn-
tactic dependency relations in which the expression appears. Many others
have taken similar approaches and here we cite just a few examples (Hindle
1990, Pantel & Lin 2002). The difference between these methods and our
work is that they assume rather complex contexts such as dependency rela-
tions. We do not assume such complex contexts obtained by using parsers
or parsed corpora. We only need to obtain the numbers of documents that
include expressions in the same itemization using a search engine. This
type of (co-occurrence) frequencies (or some statistical values computed
from the frequencies) are known to be useful for detecting semantic relat-
edness between two expressions (Church & Hanks 1989). The relatedness
is not limited to semantic similarities, though, which we need to compute
to obtain semantic word classes. For instance, Church found that while
“doctor” and “nurse” have a large mutual information value, “doctor” and
“bill” also have a large value. We do not think “doctor” and “bill” are sim-
ilar, though they are somehow related. Our trick for excluding such word
pairs from our semantic class is to use itemizations in html documents. We
assume that expression pairs having semantic relatedness other than strong
semantic similarities are unlikely to be included in the same itemization.
For instance, “doctor” and “bill” will not appear in the same itemization.
3 Proposed method
Our goal is to extract semantically coherent classes consisting of single words

or multi-word expressions from html documents. In the following, we refer
to single words and multi-word expressions simply as expressions.
Our procedure consists of two steps:
Step 1 Extract sets of expressions from itemizations in html documents.
We call each obtained set an Itemized Expression Set (IES).
Step 2 Select only semantically coherent classes from the iess obtained in
Step 1 using the document frequencies and mutual information.
The procedure was designed based on the following two assumptions. Step
1 corresponds to Assumption A, and Step 2 to Assumption B.
Assumption A: At least, some iess are semantically coherent.
Assumption B: Expressions in a semantically coherent ies are likely to
co-occur in the same document.
To judge the semantic coherence among the expressions in an ies, according
to Assumption B, we estimate strength of co-occurrences between pairs of
expressions. As this strength, we use simple document frequencies and pair-
wise mutual information (Church & Hanks 1989). Our algorithm computes
these values for pairs of expressions in the same ies. An important point
is that our algorithm does not compute the values for all the possible pairs
in an ies, although a most straightforward implementation of Assumption
B should be the algorithm that does so. Instead, our algorithm randomly
generates only n pairs of expressions from an ies consisting of n expressions
and then calculates document frequency and mutual information for each
pair. The parameters required to compute the values are obtained from an
existing search engine. The number of queries to be given to the engine is
just 2n for an ies consisting of n expressions. Note that if we compute the
strength for all possible pairs, we need to throw n(n−1)/2+n queries to the
search engine. Our method is thus much more efficient than this exhaustive
algorithm in terms of the number of queries. The details of Steps 1 and 2
are described below.
3.1 Step 1: Extract IESs

The objective of Step 1 is to extract iess from itemizations in html doc-
uments. We follow the approach described in (Shinzato & Torisawa 2004).
First, we associate each expression in an html document with a path which
specifies both the html tags enclosing the expression and the order of the
tags. Consider the html document in Figure 1. The expression “Favorite
Record Shops” is enclosed by the tags <LI>,</LI> and <UL>,</UL>. If we
sort these tags according to their nesting order, we obtain a path (UL, LI)
and this path specifies the information regarding the place of the expres-
sion. We write h(UL, LI), Favorite Record Shopsi if (UL, LI) is a path for
the expression “Favorite Record Shops.” We can then obtain the following
paths for the expressions from the document.
h(UL, LI), Favorite Record Shopsi,h(UL, UL, LI), Tower Recordsi,
h(UL, UL, LI), HMVi,h(UL, UL, LI), Virgin Megastoresi,
h(UL, UL, LI), RECOfani
Our method extracts the set of expressions associated with the same path
as an ies. In the above example, we can obtain the ies {Tower Records,
HMV, Virgin Megastores, RECOfan}.
set ei docs(ei ) ej docs(ej ) docs(ei , ej ) I(ei , ej )

Tower Records 2.01 × 105 RECOfan 4.87 × 102 9.90 × 101 12.05
A Virgin Megastores 9.87 × 103 Tower Records 2.01 × 105 9.21 × 102 10.93
RECOfan 4.87 × 102 HMV 9.71 × 105 2.52 × 102 10.13
HMV 9.71 × 105 Virgin Megastores 9.87 × 103 1.29 × 103 9.14
International 7.88 × 107 Today’s Deals 6.40 × 105 4.52 × 105 5.23
Gift Certificates 2.64 × 106 New Releases 1.12 × 106 1.91 × 104 4.76
B Sell Your Stuff 1.35 × 105 Gift Certificates 2.64 × 106 4.21 × 102 2.31
Top Sellers 3.09 × 106 Sell Your Stuff 1.35 × 105 2.81 × 102 1.50
Today’s Deals 6.40 × 105 Top Sellers 3.09 × 106 3.48 × 102 −0.44
New Releases 1.12 × 106 Top Sellers 3.09 × 106 3.39 × 102 −1.42
Total number of documents N = 4.2 × 109 .
Table 1: Examples of pairwise mutual information
3.2 Step 2: Select semantically coherent IESs

Our procedure next filters out semantically incoherent iess from ones ob-
tained in Step 1. We use document frequencies and pairwise mutual in-
formation for this purpose. These values are given to Support Vector
Machines (svms) (Vapnik 1995) as features for selecting semantically co-
herent iess. To obtain the features given to the svm, we first generate n
pairs of two expressions in an ies consisting of n expressions. More pre-
cisely, for each expression in the ies, we randomly pick another expression
from the set to generate the pairs. For the ies {Tower Records, HMV,
Virgin Megastores, RECOfan}, we generate, for instance, the following set
of expression pairs from the ies.
{hTower Records, HMVi, hHMV, Virgin Megastoresi,
hVirgin Megastores, RECOfani,hRECOfan, Tower Recordsi}
Next, we estimate pairwise mutual information for each pair. We defined
pairwise mutual information, I(e1 , e2 ), between expressions e1 and e2 as
d ocs(e1 , e2 )/N
I(e1 , e2 ) = log2
d ocs(e1 )/N × d ocs(e2 )/N
where d ocs(e) is the number of documents including an expression e and
d ocs(ei , ej ) is the number of documents including two expressions ei and ej .
We estimate d ocs(e) and d ocs(ei , ej ) using a search engine, which is “goo”
(http://www.goo.ne.jp/) in our experiments. N is the total number of
documents and we used 4.2 × 109 as N according to “goo.” Note that we
used −109 as a logarithm of 0 in calculating the mutual information values.
Consider the following iess:
A {Tower Records, HMV, Virgin Megastores, RECOfan}
B {Gift Certificates, International, New Releases, Top Sellers, Today’s Deals}
We think that Set A is semantically coherent while Set B is semantically
incoherent. The pairwise mutual information values computed for each ies
ID Descriptions ID Descriptions
1 Sum of the I(ei , ej ) values of all pairs. 11 Smallest docs(ei , ej ) of a pair in P.
2 Average of the I(ei , ej ) values of all pairs. 12 2nd smallest docs(ei , ej ) of a pair in P.
3 Largest I(ei , ej ) of a pair in P. 13 Number of pairs whose docs(ei , ej ) is 0.
4 2nd largest I(ei , ej ) of a pair in P. 14 Number of items in an IES.
5 Smallest I(ei , ej ) of a pair in P. 15 Number of items whose docs(e) is 0.
6 2nd smallest I(ei , ej ) of a pair in P. 16 Sum of the hit count for all items in an IES.
7 Sum of the docs(ei , ej ) values of all pairs. 17 Average hit count for all items in an IES.
Average of the docs(ei , ej ) values of all 18 Largest hit count for an item in an IES.
8 pairs. 19 2nd largest hit count for an item in an IES.
9 Largest docs(ei , ej ) of a pair in P. 20 Smallest hit count for an item in an IES.
10 2nd largest docs(ei , ej ) of a pair in P. 21 2nd smallest hit count for an item in an IES.
P : A set of pairs of two randomly selected items in an IES.
Table 2: Features used in our procedure
are listed in Table 1. These values for the pairs in Set A are all positive
and larger than those for the pairs in Set B. This roughly means that every
pair in Set A co-occurs much more frequently than expected only from the
frequencies of each item in the pair with assuming independence of the
occurrences of the items. In addition, the differences between the actual
co-occurrence frequencies and the expected ones are larger in Set A than
those in Set B. We expect to be able to select semantically coherent iess by
looking at such differences in mutual information values.
We used the features listed in Table 2. The major part of the features are
mutual information and hit counts for expression pairs, but they also include
hit counts for single expressions and some other items which we expect to be
useful in our task. Note that we used only the largest, the second largest,
the smallest, and the second smallest mutual information, co-occurrence
frequencies, and hit counts as features (features with id 3 to 6, 9 to 12 and
18 to 21.) Since we restricted iess given to our method only to the ones
that have more than three expressions, the feature values are always defined.
Finally, our method ranks the iess according to output values of the svm
(i.e., values of the decision function of the svm) and produces only the top
M iess as final outputs. We assume that a value of the decision function
indicates the likeliness that a given ies is semantically coherent. More
precisely, we made an assumption that the larger a value of the decision
function for a given ies is, the more likely the ies is to be semantically
coherent. Then, we expect that by producing only the iess having large
decision function values as final outputs, we can obtain the semantically
coherent iess with a relatively high precision.
4 Experiments
We downloaded 1.0 × 106 Japanese html documents (10.5gb with html

tags), and extracted 132, 874 iess through the method described in Sec-
tion 3.1. We randomly picked 800 sets from the extracted iess as our test
set. It contained 5, 227 expressions in total. As our training set for the
svms, we randomly selected 400 sets, which included 2, 541 expressions.
The training set was annotated with Coherent/Incoherent labels by the au-
thors according to an evaluation scheme for iess, as described in a later
section. Note that the test set and the training set were chosen so that the
two sets do not have any common expressions. We chose TinySVM 1 as an
implementation of svms. As the kernel function, we used the anova kernel
of degree 2 provided in TinySVM. This choice was made according to the
observations obtained in experiments using the training set. Other types
of kernel provided in TinySVM did not converge during the training or did
not indicate high performance on the training set.
4.1 Evaluation scheme

In our experiments, we evaluated the iess produced by our method accord-
ing to the following criterion.
CRITERION If we can come up with a common hypernym 2 for 70%
of the expressions in a given ies, we regard the ies as a semantically
coherent class. However, when we can think of only the words referring
to an extremely wide range of objects, such as things and objects, as
hypernyms, the class is not coherent.
We also prepared a stricter version of this criterion, which asks the subjects
to come up with a common hypernym covering all the expressions in an ies.
We call this criterion CRITERION (STRICT).
In the following, we call the words that refer to an extremely wide range
of objects, such as things and objects, trivial hypernyms. The trivial hyper-
nyms are problematic since expressions that are not similar to each other
may have a trivial hypernym as its common hypernym. For instance, con-
sider a set of expressions {tank, desk, human, idea}. It may be possible to
regard “objects” as a common hypernym of them, but it is difficult to regard
the set as a semantically coherent class. This means that it is not sufficient
to judge the semantic coherence of expressions according to only whether
one can think of their common hypernym, and it is necessary to judge if we
can come up with a non-trivial common hypernyms of the expressions.
Then, the problem is how we can make a list of non-trivial (possible)
hypernyms. We used the Nihongo Goi Taikei thesaurus (Ikehara et al. 1997)
to solve this problem. The thesaurus contains 2,710 semantic classes, each of
1
Available from http://chasen.org/~taku/software/TinySVM/
2
In this study, class-instance relations are also regarded as hypernym-hyponym rela-
tions. Then, for instance, we can think of common hypernyms of proper nouns.
100 100
90 90
80 80
Precision [%]
Precision [%]
70 70
60 60
50 50
Proposed Method Proposed Method
40 Proposed Method (STRICT) 40 Proposed Method (STRICT)
HRAM HRAM
30 30
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200
# of word classes # of word classes
(A) Three subjects (B) Four subjects

Fig. 2: Comparison with HRAM
which are labeled by a Japanese expression naming the class, and the classes
are organized into a hierarchical structure. We tried to make a list of trivial
hypernyms from the thesaurus according to the following steps. First, we
extracted 245 nouns included in the classes that are located in the top five
levels in the hierarchy, and then manually checked whether the extracted
nouns should be regarded as trivial hypernyms. As a result, we could obtain
164 trivial hypernyms such as “ (individual)” and “ (phenomena).”
We then removed the trivial hypernyms from the set of general nouns in
the thesaurus. As a result, we could obtain the list of 92,002 nouns, and
assumed that it is a list of non-trivial (possible) hypernyms.
We asked four human subjects to evaluate the acquired iess. The sub-
jects were asked if they can come up for each ies with a common hypernym
in the obtained non-trivial hypernym list. More precisely, the subjects were
asked to give a common hypernym of a given ies to our evaluation tool.
The subjects could proceed to the evaluation of the next ies only when the
tool finds the given hypernym in the non-trivial hypernym list, or when the
subjects tell the tool that they could not find any non-trivial hypernyms.
By this, we could prevent the subjects from choosing trivial hypernyms.
4.2 Experimental results

We compared the performances of our method with those of hram, which
can be seen as an alternative to our method as pointed out in Section
2. As mentioned, the outputs of our method are sorted according to the
decision function values given by the svm, while the outputs of hram are
also sorted by its original score. In this experiment, we gave 800 iess in
our test set to our method and hram, and we then picked up top 200 iess
Semantically coherent IESs

Japanese actresses (Sumire Haruno), (Emi Sakuragi), (Mai Ayana),
(Ai Morotori), (Yuka Amano), (Reika Oban), (Shizuka Minami)
Football teams in Italy (Piacenza), (Juventus), (SS Lazio),
(Chievo), (AS Roma), (Modena), (AC Milan), (Torino)
Semantically incoherent IESs
Events or items related to wedding ceremonies (Cakes), (Printed matter),
(Beverages), (Service charges), (Presents), (Sound and lighting),
(Cuisine), (Celebration)
Captions of private pictures listed in someone’s webpage (The Parthenon),
(Guards for a presidential mansion), (Sta. Athens),
(In front of a parliament building)
Table 3: Examples of acquired IESs
from the outputs of the both methods. The performances are shown in
Figure 2. Graph (A) shows the precisions of the methods when we assume
that semantically coherent iess are only the iess that were accepted by three
or more human subjects in the four subjects, while graph (B) indicates the
precisions of the methods when semantically coherent iess are only the iess
that all the four subjects accepted. In the both graphs, the y-axis indicates
the precision of the acquired word classes, while the x-axis indicates the
number of classes. “Proposed Method” refers to the precisions achieved by
our method and “hram” indicates the precisions obtained by hram. The
evaluation of both methods were done according to CRITERION defined
before. “Proposed Method (STRICT)” is the results of our method when
we used CRITERION (STRICT).
From the both graphs, we can see that our method outperforms hram.
If we look at the results evaluated according to CRITERION, the precision
of our method in graph (A) reached 88% for the top 100 classes which was
12.5% of all the given iess in the test set. For the top 200 classes (25%
of all the iess in the test set), the precision was about 80%. The kappa
statistic for measuring the inter-rater agreement was 0.69 for our method.
For hram, the statistic was 0.78. These values indicate that our subjects
had good agreement (Landis & Koch 1977). Table 3 shows examples of the
iess obtained by our method.
5 Conclusions
We have proposed a method to extract semantic word classes from itemiza-

tions in html documents. Its major characteristics are that (i) it can be
implemented easily using svms and an existing commercial search engine,
and (ii) it can perform its task using only hit counts obtained by a small
number of queries given to the search engine. The method was evaluated
by using four human subjects.
Our method ranks itemizations collected from the www according to the
decision function of svms, and produces highly ranked itemizations as out-
puts. In our experiments, when the top 10% of collected itemizations were
produced, at least, three of the four human subjects regarded about 80%
of the itemizations as sets of semantically similar expressions for which the
subjects could come up with non-trivial common hypernyms.
Acknowledgments. This work is supported by Grant-in-Aid for Scientific

Research (15680005), by Promotion Subsidy for Science and Technology, Ministry
of Education, Culture, Sports, Science and Technology and by Special Coordina-
tion Funds for Promoting Science and Technology, Fostering Talent in Emergent
Research Fields.
REFERENCES
Church, Kenneth & Patrick Hanks. 1989. “Word Association Norms, Mutual
Information, and Lexicography”. Proceedings of the 27th Annual Meeting of
the Association for Computational Linguistics (ACL’89), 26-29 June, 1989,
Vancouver, BC, Canada., 76-83. San Francisco: Morgan Kaufmann.
Hindle, Donald. 1990. “Noun Classification from Predicate-argument Struc-
tures”. Proceedings of the 28th Annual Meeting of the Association for Com-
putational Linguistics (ACL’90), 6-9 June 1990, Pittsburgh, PA, U.S.A., 268-
275. San Francisco: Morgan Kaufmann.
Ikehara, Satoru, Miyazaki Masahiro, Shirai Satoshi, Yokoo Akio, Nakaiwa Hi-
romi, Ogura Kentaro, Ooyama Yoshihumi & Hayashi Yoshihiko. 1997. Ni-
hongo Goi Taikei – A Japanese Lexicon. Tokyo: Iwanami Syoten.
Landis, Richard. & Gary Koch. 1977. “The Measurement of Observer Agreement
for Categorical Data”. Biometrics, 33:1.159-174.
Lin, Dekang. 1998. “Automatic Retrieval and Clustering of Similar Words”.
Proceedings of the 36th Annual Meeting of the Association for Computational
Linguistics and 17th International Conference on Computational Linguistics
(COLING-ACL’98 ), 768-774. San Francisco: Morgan Kaufmann.
Pantel, Patrick & Dekang Lin. 2002. “Discovering Word Senses from Text”.
ACM Special Interest Group on Knowledge Discovery and Data Mining
(SIGKDD’02 ), 613-619. New York: ACM.
Shinzato, Keiji & Kentaro Torisawa. 2004. “Acquiring Hyponymy Relations from
Web Documents”. Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics (HLT-
NAACL’04 ), 73-80. Boston, Mass.
Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. Berlin:
Springer.
Automatic Building of Wordnets
Eduard Barbu∗ & Verginica Barbu Mititelu∗∗
∗
Graphitech Italy
∗∗
Romanian Academy, Research Institute for Artificial Intelligence
Abstract
This paper dealing with the automatic building of wordnets starts by
stating the assumptions behind such enterprise and then presents a
two-phase methodology for automatically building a target wordnet
strictly aligned with an already available wordnet (source wordnet).
In the first phase the synsets for the target language are automati-
cally generated and mapped onto the source language synsets using
a series of heuristics. In the second phase the salient relations that
can be automatically imported are identified and the procedure for
their import is explained. The idea behind all our heuristics will be
stated, the heuristics employed will be presented and their success
evaluated against a case study: automatically building a Romanian
wordnet using the Princeton WordNet.
1 Introduction
The importance of a wordnet for nlp applications can hardly be overes-

timated. The Princeton WordNet (pwn) (Fellbaum 1998) is now a ma-
ture lexical ontology which has demonstrated its efficiency in a variety
of tasks (word sense disambiguation, machine translation, information re-
trieval, etc.). Inspired by the success of pwn, researchers started developing
wordnets for many languages taking pwn as a model. Furthermore, in both
EuroWordNet (Vossen 1998) and BalkaNet (Tufiş 2004) projects the synsets
from different versions of pwn (1.5 and 2.0) were used as ili repositories.The
created wordnets were linked by means of interlingual relations through this
ili repository. The rapid progress in building a new wordnet and its linking
with an already tested wordnet (usually pwn) is hindered by the amount
of time and effort needed for developing such a resource.
This paper will discuss the problem of automatic building of wordnets. It
starts by stating the assumptions behind such enterprise and then presents
a methodology that can be used for automatically building wordnets strictly
aligned (that is, using only eq synonym relation) with an already available
wordnet. We have started our experiment with the study of nouns.
The paper has the following organization. Firstly we state the implicit
assumptions in building a wordnet strictly aligned with other wordnets.
Then we shortly describe the resources that one needs in order to apply the
218 EDUARD BARBU & VERGINICA BARBU MITITELU
heuristics, and also the criteria we used in selecting the source language test
synsets. Finally, we state the problem to be solved in a more formal way,
we present the idea behind all heuristics, the heuristics themselves and their
success evaluated against a case study (automatically building a Romanian
wordnet using pwn 2.0).
2 Assumptions
The assumptions that we considered necessary for automatically building a

target wordnet using a source wordnet are the following: (i) There are word
senses that can be clearly identified. This premise was extensively ques-
tioned among others by (Kilgarriff 1997) who thinks that word senses have
no real ontological status, but they exist only relative to a task. We briefly
present Kilgarriff’s argument and a short refutation of it. To put it in a
nutshell, Kilgarriff’s argument is this: according to the well known slogan
of Quine “No entity without identity”, if word senses have a real ontological
status then they should have identity criteria. Till now, as Kilgarriff shows,
no one was able to provide an adequate identity criterion for word senses.
So, the argument continues, they do not exist. However, his argument is
unsound because Kilgarriff unconditionally accepts Quine identity criterion.
Quine identity criterion was discussed for example by (Strawson 1997) who
argued that any interesting reading of the above mentioned criterion fails.
We could not provide identity criteria for colour, traits, intellectual qual-
ities, etc. However, we will most probably assert that they exist. (ii) A
rejection of the strong reading of Sapir-Whorf (Caroll 1964) hypothesis (the
principle of linguistic relativity). Simply stated, the principle of linguistic
relativity says that language shapes our thought. There are two variants
of this principle: strong determinism and weak determinism. According to
the strong determinism language and thought are identical. This hypothesis
has today few followers if any, and the evidence against it comes from vari-
ous sources among which the possibility of translation in another language.
However, the weak version of the hypothesis is largely accepted. One can
view the reality and our organization of it by analogy with the spectrum
of colours, which is a continuum in which we place arbitrary boundaries
(white, green, black, etc.). Different languages will “cut” differently this
continuous spectrum. For example, Russian and Spanish have no words for
the English concept blue. This weak version of the principle of linguistic
relativity warns us, however, that a specific source wordnet could not be
used for automatically building any target wordnet. We further discuss this
bellow. (iii) The acceptance of the conceptualization made by the source
wordnet, i.e. of the way in which the source wordnet “sees” the reality by
identifying the main concepts to be expressed and their relationships. For
AUTOMATIC BUILDING OF WORDNETS 219
specifying how much languages can differ with respect to the conceptual
space they reflect we will follow (Sowa 1992) who considers three distinct
dimensions: accidental (the two languages have different notations for the
same concept; for example, the Romanian word măr and the English word
apple lexicalize the same concept), systematic (the systematic dimension
defines the relation between the grammar of a language and its conceptual
structures; it has little import for our problem), cultural (the conceptual
space expressed by a language is determined by environmental, cultural fac-
tors, etc.; it could be the case, for example, that concepts that define the
legal systems of different countries are not mutually compatible; so, when
one builds a wordnet starting from a source wordnet he/she should ask
himself/herself what the parts (if any) that could be safely transferred in
the target language are; more precise what the parts that share the same
conceptual space are).
The assumption that we make use of is that the differences between the
two languages (source and target) are merely accidental: they have different
lexicalizations for the same concepts. As the conceptual space is already
expressed by the source wordnet structure using a language notation, our
task is to find the concepts notations in the target language.
When the source wordnet is not perfect (the real situation), then a draw-
back of the automatic mapping approach is that all the mistakes existent
in the source wordnet are transferred in the target wordnet. Another pos-
sible problem appears because the previously made assumption about the
sameness of the conceptual space is not always true.
3 Selection of concepts and resources used
When we selected the set of synsets to be implemented in Romanian we

followed two criteria.
The first criterion states that the selected set should be structured in
the source wordnet (i.e. every selected synset should be linked by at least
one semantic relation with other selected synsets). If we want to obtain
a wordnet in the target language and not just some isolated synsets, this
criterion is self-imposing.
The second criterion is related to the evaluation stage. To properly
evaluate the built wordnet, it should be compared with a “golden standard”.
The golden standard that we use will be the Romanian Wordnet (RoWN)
developed in the BalkaNet project.
For fulfilling both criteria we chose a subset of noun concepts from the
RoWN that has the property that its projection on pwn 2.0 is closed under
the hyperonym and the meronym relations. The projection of this subset
on pwn 2.0 comprises 9716 synsets that contain 19624 literals.
For the purpose of automatic mapping of this subset we used an in-house

dictionary built from many sources. We made sure that the translations of
the above mentioned set is as complete as possible. The second resource
used is the Romanian Explanatory Dictionary (EXPD 1996) whose entries
are numbered to reflect the dependencies between different senses of the
same word.
4 Notation introduction and the idea of heuristics
In this section we introduce the notations used in the paper and we outline
the guiding idea of all heuristics we used:
1. By TL we denote the target lexicon. In our experiment TL will contain
Romanian words (nouns).
2. By SL we denote the source lexicon. In our case SL will contain English
words (nouns).
3. WT and WS are the wordnets for the target language and the source
language, respectively.
4. wjk denotes the k th sense of the word wj .
5. BD is a bilingual dictionary which acts as a bridge between SL and
TL .
If we ignore the information given by the definitions associated with word
senses, then, formally a sense of a word in the pwn is distinguished from
other word senses only by the set of relations it contacts in the semantic
network. These relations define the position of a word in the semantic
network.
The idea of our heuristics could be summed up in three points: (i) Take
profit of and increase the number of relations in the source wordnet to obtain
a unique position for each word sense. (ii) Try to derive useful relations
between the words in the target language. (iii) In the mapping stage of the
procedure take profit of the structures built at points (i) and (ii). We have
developed a set of four heuristics.
5 The synonymy heuristic rule
The synonymy heuristic exploits the fact that synonymy enforces equiva-
lence classes on word senses.
Let EnSyn= {ewji11
11
, ewji12
12
, . . . , ewji1n
1n
} (where ewj11 , ewj12 , ewj1n are the
words in synset and the superscripts denote their sense numbers) be a SL
synset and length (EnSyn) > 1. The length of a synset is equal to the
number of distinct words (words that are not variants) in the synset. So
we disregard synsets such as {artefact, artifact}. For achieving this we
computed the well known Levenshtein distance between the words in the
synset.
Taking the actual RoWN as a gold standard we can evaluate the results
of our heuristics by comparing the obtained synsets with those in the RoWN.
We distinguish five possible cases:
• The synsets are equal (this case will be labelled as Identical – id).
• The generated synset has all literals of the correct synset and some
more (Over-generation – ovg).
• The generated synset and the golden one have some literals in common
and some different (Overlap – ovp).
• The generated synset literals form a proper subset of the golden synset
(Under-generation – ug).
• The generated synset have no literals in common with the correct one
(Disjoint – dj).
The cases ovg, ovp and dj will be counted as errors. The other two cases,
namely id and ug, will be counted as successes. The evaluation of the
synonymy heuristic is given in Table 1.
Error types Correct

NMS R OVG OVP DJ UG ID P
8493 87 210 0 0 300 7983 98
Table 1: The results of the synonymy heuristic
The number of mapped synsets (nms) represents the number of synsets

mapped by the heuristic. The recall (r) column represents the recall of the
heuristic. The P column represents the precision of the heuristic. The high
recall and precision proves the quality of the first part of the dictionary we
used. The only type of error we encountered is ovg.
6 The hyperonymy heuristic rule
The hyperonymy heuristic draws from the fact that, in the case of nouns,
the hyperonymy relation can be interpreted as an is-a relation (for versions
of pwn lower than 2.1 this is not entirely true: the hyperonym relation can
be interpreted as is-a or instance-of). It is also based on two related
observations: a hyperonym and his hyponyms carry some common infor-
mation and the information common to the hyperonym and the hyponym
will increase as you go down the hierarchy.
Let EnSyn1 = {ewji11 11
, . . . , ewji1t
1t
} and EnSyn2 = {ewji21
21
, . . . , ewji2s
2s
} be
two SL synsets such that EnSyn1 HYP EnSyn2 , meaning that EnSyn1 is a
hyperonym of EnSyn2 . Then we generate the translation lists of the words

in the synsets. The intersection is computed as:
TL EnSyn1 = M (ewj11 ) ∩ M (ewj12 ) ∩ . . . ∩ M (ewj1t )
TL EnSyn2 = M (ewj21 ) ∩ M (ewj22 ) ∩ . . . ∩ M (ewj2s )

The generated synset in the target language will be computed as:
TL Synset = TL EnSyn1 ∩ TL EnSyn2
The results of the hyperonymy heuristic are presented in Table 2. The low
recall is due to the fact that we did not find many common translations
between hyperonyms and their hyponyms.
Error types Correct

1028 10 213 0 150 230 435 65
Table 2: The results of the hyperonymy heuristic
7 The domain heuristic
The domain heuristic takes profit of an external relation imposed over pwn.
At IRST pwn was augmented with a set of Domain Labels, the resulting
resource being called WordNet Domains (Magnini & Cavaglia 2000). The
domain labels are hierarchically organized and each synset received one or
more domain labels. The idea of using domains is helpful for distinguishing
word senses (different word senses of a word are assigned to different do-
mains). The best case is when each sense of a word has been assigned to
a distinct domain. But even if the same domain labels are assigned to two
or more senses of a word, in most cases we can assume that this is a strong
indication of a fine-grained distinction. It is very probable that the dis-
tinction is preserved in the target language by the same word. We labelled
every word in the BD dictionary with its domain label. For English words
the domain is automatically generated from the English synset labels. For
labelling Romanian words we used two methods:
1. We downloaded a collection of documents from web directories such
that the categories of the downloaded documents match the categories
used in the Wordnet Domain. As a set of features we selected the
nouns that provide more information. For this we used the well known
χ2 statistic. χ2 statistic checks if there is a relationship between being
in a certain group and having a characteristic that we want to study.

In our case we want to measure the dependency between a term t and
a category c. The formula for χ2 is:
N × (AD − CB)2
χ2 (t, c) =
(A + C) × (B + D) × (A + B) × (C + D)
where:
• A is the number of times t and c co-occur;
• B is the number of times t occur without c;
• C is the number of times c occurs without t;
• D is the number of times neither c nor t occurs;
• N is the total number of documents.
For each category we computed the score between that category and
the noun terms of our documents. Then, for choosing the terms that
discriminate well for a certain category we used the formula below
(where m denotes the number of categories):
m
χ2 max (t) = max χ2 (t, ci )
i=1
2. We took advantage of the fact that some words have already been
assigned subject codes in various dictionaries. We performed a manual
mapping of these codes onto the Domain Labels used at IRST. The
Romanian words that could not be associated domain information
were associated with the default factotum domain.
The following entry is a BD entry augmented with domain information:
M (ew1 [D1 , . . .]) = rw1 [D1 , D2 . . .] , rw2 [D1 , D3 . . .] , . . . , rwi [D2 , D4 . . .]
In the square brackets the domains that pertain to each word are listed.
For each synset in the SL we generated all the translations of its literals
in TL . Then the TL synset is built using only those TL literals whose domain
“matches” the SL synset domain (that is: it is the same as the domain of
SL , subsumes the domain of SL in the IRST domain labels hierarchy or is
subsumed by the domain of SL in the IRST domain labels hierarchy). The
results of this heuristic are given in Table 3.
8 The monolingual dictionary heuristic rule
The monolingual dictionary heuristic takes advantage of the fact that the
source synsets have a gloss associated and also that target words that are
Error types Correct

7520 77 689 0 0 0 6831 91
Table 3: The results of the domain heuristic
translations of source words have glosses associated in EXPD. As features

of pwn and EXPD glosses we have choosen the set of nouns.
The target definitions were automatically translated using the bilingual
dictionary. All possible source definitions were generated by translating
each lemmatized noun word in the TL definition. Thus, if a TL definition of
one TL word is represented by the following vector [rw1, rw2 , . . . , rwp ], then
the number of SL vectors generated will be: N = nd × tw1 × tw2 × . . . × twp ,
where nd is the number of definitions the target word has in the monolingual
dictionary (EXPD), and twk with k = 1 . . . p is the number of translations
that the noun wk has in the bilingual dictionary. Then we simply count
the number of common nouns between the pwn glosses and each translated
EXPD gloss and take the word whose gloss maximizes this number to be
included in the TL synset. Notice that by using this heuristic rule we can
automatically add a gloss to the TL synsets.
As one can see in Table 4, the number of incomplete synsets is high.
The low recall is due to the low agreement between Romanian and English
glosses.
Error types Correct

3527 36 25 0 78 547 2877 97
Table 4: The results of the monolingual dictionary heuristic
9 Combining results
For choosing the final synsets we devised a set of meta-rules by evaluating

the pros and cons of each heuristic rule. For example, given the high quality
bilingual dictionary the probability that the synonymy heuristic will fail is
very low. So the synsets obtained using it will automatically be selected.
A synset obtained using the other heuristics will be selected and, moreover,
will replace a synset obtained using the synonymy heuristic, only if it is
obtained independently using domain and hyperonymy heuristics, or by
using domain and monolingual dictionary heuristics. If a synset is not
selected by the above meta-rules it will be selected only if it is obtained by
domain heuristic and the ambiguity of the members of the source synset
from which it was generated is at most equal to 2. Table 5 shows the
combined results of our heuristics.
As one can observe there, for 106 synsets in pwn 2.0 the Romanian
equivalent synsets could not be found. There also resulted 635 synsets that
are smaller than the synsets in the RoWN.
Error types Correct

9610 98 615 0 250 635 8110 91
Table 5: The combined results of the heuristics
10 Import of relations
After building the target synsets an investigation of the nature of the re-
lations that structure the source wordnet should be made for establishing
which of them can be safely transferred in the target wordnet. As one
expects, the conceptual relations can be safely transferred because these
relations hold between concepts. The only lexical relation that holds be-
tween nouns and that was subject to scrutiny was the antonymy relation.
We concluded that this relation can also be safely imported. The importing
algorithm works as described below.
If two source synsets S1 and S2 are linked by a semantic relation R
in WS and if T1 and T2 are the corresponding aligned synsets in the WT ,
then they will be linked by the relation R. If in WS there are intervening
synsets between S1 and S2 , then we will set the relation R between the
corresponding TL synsets only if R is declared as transitive (R+, unlimited
number of compositions, e.g. hypernym) or partially transitive relation (Rk
with k a user-specialized maximum number of compositions, larger than the
number of intervening synsets between S1 and S2 ). For instance, we defined
all the holonymy relations as partially transitive (k=3).
11 Conclusions
Other experiments of automatically building wordnets that we are aware

of are (Atserias et al., 1997) and (Lee et al., 2000). They combine sev-
eral methods, using monolingual and bilingual dictionaries for obtaining a
Spanish Wordnet and, respectively, a Korean one starting from pwn 1.5.
However, our approach is characterized by the fact that it gives an ac-
curate evaluation of the results by automatically comparing them with a
manually built wordnet. We also explicitly state the assumptions of this

automatic approach. Our approach is the first to use an external resource
(WordNet Domains) in the process of automatically building a wordnet.
We obtained a version of RoWN that contains 9610 synsets and 11969
relations with 91% precision. The success of our procedure was influenced
both by the data set we used and by the quality of the bilingual dictionary.
Some heuristics developed here may be applied for the automatic con-
struction of synsets of other parts of speech. Extending our experiment to
adjectives and verbs would be of great interest in our opinion.
REFERENCES
Atserias, Jordi, Salvador Climent, Xavier Farreres, German Rigau & Horacio
Rodriguez. 1997. “Combining Multiple Methods for Automatic Construction
of Multilingual WordNets”. Proceedings of the International Conference on
Recent Advances in Natural Language Processing, 143-150. Tzigov Chark,
Bulgaria.
Carroll, John B., ed. 1964. Language, Thought, and Reality: selected writings of
Benjamin Lee Whorf. Cambridge, Mass.: MIT.
Dicţionarul explicativ al limbii române. 1996. Bucureşti, Romania: Univers
Enciclopedic.
Fellbaum, Christiane, ed. 1998. WordNet: An Electronical Lexical Database.
Cambridge, Mass.: MIT.
Kilgarriff, Adam. 1997. “I Don’t Believe in Word Senses”. Computers and the
Humanities 31:2.91-113.
Lee, Changki, Geunbae Lee & Seo Jung Yun. 2000. “Automatic Wordnet map-
ping Using Word Sense Disambiguation”. Proceedings of the Joint SIGDAT
Conference on Empirical Methods in Natural Language Processing and Very
Large Corpora (EMNLP/VLC 2000 ), 142-147. Hong Kong.
Magnini, Bernardo & Gabriela Cavaglia. 2000. “Integrating Subject Field Codes
into WordNet”. Proceedings of the Second International Conference on Lan-
guage Resources and Evaluation (LREC-2000 ) ed. by Maria Gavrilidou et
al., 1413-1418. Athens, Greece.
Sowa, John F. 1992. “Logical Structure in the Lexicon” Lexical Semantics and
Commonsense Reasoning ed. by J. Pustejovsky & S. Bergler, 39-60. Berlin:
Springer-Verlag.
Tufiş, Dan, ed. 2004. The BalkaNet Project (= Special Issue of Romanian Journal
of Information Science and Technology, 7:1/2). Bucureşti, Romania: Editura
Academiei Române.
Vossen, Piek, ed. 1998. A Multilingual Database with Lexical Semantic Networks.
Dordrecht: Kluwer.
Lexical Transfer Selection
Using Annotated Parallel Corpora
Stelios Piperidis, Panagiotis Dimitrakis & Irene Balta
Institute for Language and Speech Processing
Abstract
This paper addresses the problem of bilingual lexicon extraction and
lexical transfer selection, in the framework of computer-aided and
machine translation. The proposed method relies on parallel cor-
pora, annotated at part of speech and lemma level. We first extract
a bilingual lexicon using unsupervised statistical techniques. For
each word with more than one translation candidates we build con-
text vectors, in order to aid the selection of the contextually correct
translation equivalent. The method achieves an overall precision of
ca. 85% while the maximum recall reaches 75%.
1 Background
The emergence of parallel corpora has evoked the appearance of many meth-
ods that attempt to deal with different aspects of computational linguistics
(Véronis 2000). Of special significance in the fields of lexicography, termi-
nology, computer-aided translation (cat) and machine translation (mt) is
the impact of ‘bitexts’; a pair of texts in two languages, where each text
is a translation of the other (Melamed 1997). Such texts are necessary for
providing evidences of use, directly deployable in statistical methods and en-
hance the automatic elicitation of the otherwise sparse linguistic resources.
Recent developments in cat and mt have moved towards the use of par-
allel corpora, aiming at two primary objectives: (i) to overcome the sparse-
ness of the necessary resources and (ii) to avoid the burden of producing
them manually. Furthermore, parallel corpora have proven rather useful
for automatic dictionary extraction, which offers the advantages of a lex-
icon capturing the corpus specific translational equivalences (Brown 1997,
Piperidis et al. 2000). Extending this approach, we investigate the possi-
bility to use bilingual corpora in order to extract translational correspon-
dences coupled with information about the word senses and contextual use.
In particular, we focus on polysemous words with multiple translational
equivalences.
The relation between word and word usage, as compared to the relation
of word and word sense has been thoroughly addressed (Gale et al. 1992a,
Yarowsky 1993, Kilgarriff 1997). We argue that through the exploitation
of parallel corpora and without other external linguistic resources, we can
228 STELIOS PIPERIDIS, PANAGIOTIS DIMITRAKIS & IRENE BALTA
adequately resolve the task of target word selection in cat and mt. Along
this line, it has been argued that the accumulative information added by a
second language could be very important in lexical ambiguity resolution in
the first language (Dagan et al. 1991), while tools have been implemented
for translation prediction, by using context information extracted from a
parallel corpus (Tiedemann 2001).
In addition, research on word sense disambiguation (wsd) is signifi-
cant in the design of a methodology for automatic lexical transfer selection.
Although monolingual wsd and translation are perceived as different prob-
lems (Gale et al. 1992d), we examine whether certain conclusions, which
are extracted during the process of wsd, could be useful for translation
prediction.
The role of context, as the only means to identify the meaning of a
polysemous word (Ide & Véronis 1998), is of primary importance in various
statistical approaches. Brown in (Brown et al. 1991) and Gale in (Gale
et al. 1992b, Gale et al. 1992d), use both the context of a polysemous
word and the information extracted from bilingual aligned texts, in order
to assign the correct sense to the word. Yarowsky explores the significance
of context in creating clusters of senses (Yarowsky 1995), while Schütze
in (Schütze 1998), addresses the sub-problem of word sense discrimination
through the context-based creation of three cascading types of vectors.
We examine the impact of context upon translation equivalent selection,
through an “inverted” word sense discrimination experiment. Given the
possible translational candidates, which are extracted from the statistical
lexicon, we investigate the discriminant capacity of the context vector, which
we build separately for each of the senses of the polysemous word.
2 Proposed method
The basic idea underlying the proposed method is the use of context vectors,
for each of the word usages of a polysemous word, in order to resolve the
problem of lexical transfer selection. The method consists of three stages,
shown in Figure 1:
• Bilingual Lexicon Extraction
• Context Vectors Creation
• Lexical Transfer Selection
2.1 Lexicon building

Parallel corpora are first sentence-aligned, using a Gale & Church-like algo-
rithm (Gale et al. 1991) and annotated on both language sides for part-of-
speech and lemma. Focusing on the semantic load bearing words, we filter
LEXICAL TRANSFER SELECTION WITH PARALLEL CORPORA 229
Parallel / Aligned /
Parallel Corpus Tokenization Annotated Corpus
Sentence Lexical
Alignment Context
Lexicon Transfer
Vectors
Selection
POS Tagging
Lemmatization
Fig. 1: Lexicon Building and Lexical Transfer Selection Architecture
the pos tagged corpus and retain only nouns, adjectives and verbs. The
corpus-specific lexicon is extracted using unsupervised statistical methods,
based on two basic principles:
• No language-pair specific assumptions are made about the corre-
spondences between grammatical categories. In this way, all possible
correspondence combinations are produced, as this is possible during
the translation process from a source language to the target language.
• For each aligned sentence-pair, each word of the target sentence is a
candidate translation for each word of the aligned source sentence.
Following the above principles we compute: the absolute frequency of each
word and the frequency of each word pair, in a sentence pair. Using these
frequencies, we extract lexical equivalences, based on the following criteria:
1. The frequency of the word pair must be greater than threshold T hr1 .
2. Each of the conditional probabilities P (Wt |Ws ) and P (Ws |Wt ) has to
be greater than threshold T hr2 .
3. The product P (Wt |Ws ) · P (Ws |Wt ) must be greater than threshold
T hr3 . This product is indeed the score of the translation.
After experimentation and examination of the results, {T hr1 , T hr2 , T hr3 }
were set to {5, 0.25, 0.15}. Experiments with T hr1 =1 and T hr1 =10 re-
vealed that the former resulted in high recall with significantly low preci-
sion, i.e., a fairly large but low quality lexicon, while the latter resulted in
relatively high precision, whilst recall was radically reduced, i.e., a fairly
small high-quality lexicon. We empirically fixed T hr1 in the middle of the
experimentation range. T hr2 was set to 0.25 to account for a maximum of
4 possible translations, which a polysemous word could have, according to
the corpus data. T hr3 was set to 0.15 to account for the lower bound of
the product of conditional probabilities P (Wt |Ws ) and P (Ws |Wt ), taking
into consideration the threshold of the individual conditional probabilities,
i.e., we empirically set the lower bounds of the conditional probabilities to
{0.25, 0.6}.
The extracted bilingual lexical equivalences account for: words with
one translation (90% of total) and words with multiple translations (10%).
Words with multiple equivalents are further distinguished in: (i) words that
have multiple, different in sense, translational equivalents, (ii) words that
have multiple, synonymous translations and (iii) words that have multiple
translations, which are in fact wrong due to statistical errors of the method.
In the following, we focus on polysemous words, with multiple translational
equivalents, which cannot mutually replace each other in the same context.
2.2 Context vector (CV) creation

For each word, in the set of the treated grammatical categories, in the
corpus, a context vector (CV) is created based on: (i) the extracted lexicon,
in order to retrieve the possible translations of a word and (ii) the parallel
aligned sentences to retrieve those words that systematically co-occur with
that word (thus contributing to the definition of its meaning). The process
is described below:
Step 1.1: For “univocal” words, words with only one translation, we as-
sume that the translation is also its “sense”. For the untranslated words
we make no assumption, though they also participate in the created CVs.
Both these categories of words are denoted by Wu .
Step 1.2: For the words with multiple translations in the extracted lexicon,
we cannot automatically pick out the polysemous words. Therefore for each
of these Wp , we make the following assumption: Each word, with more than
one translation in the lexicon, could potentially be polysemous. Suppose
Wp is one of those words and let T1 and T2 be two possible translations.
When Wp is found in a source sentence, we search for the words T1 and T2
in the target sentence. If one and only one is matched, e.g., T1 , we conclude
that this is the correct translation and Wp is replaced in the source sentence
by Wp T1 . This is repeated for each word with multiple translations. In the
case of erroneous multiple translations (caused by statistical errors), none
of T1 or T2 are assigned as a sense, due to the simultaneous appearance of
more than one in the target sentence. In the end, we have a new corpus
in which some words appear as before and some have been labeled by their
“local senses”.
Step 2: In order to build the CVs, we address Wp T1 and Wp T2 as being
different. Then we isolate those source sentences where either “word-sense”
Wp T1 or Wp T2 appear exclusively. In each different set of sentences we
examine the context of Wp Ti in a window of certain length centered to
Wp Ti . The size of the window is defined as:
window size = 2n, (1)
where n denotes the number of word tokens on either side of Wp Ti .

Inside this window, we look for words, Wx , which belong to the selected
grammatical categories. Each of Wx is added to the CV, along with the
number of times this word has appeared in the context of Wp Ti . We follow
a similar procedure for each word Wu .
Step 3: The final formation of the CVs is based on the following equations:
NWx Wp Ti ≥k (2)
P (Wx |Wp Ti ) ≥ a1 , (3)

where NWx Wp Ti is the number of the total co-occurrences of words Wx in the
window of Wp Ti , k is the minimum co-occurrences that a word Wx must
have in order to participate in the CV which describes Wp Ti , P (Wx |Wp Ti )
is the conditional probability of the word Wx given the appearance of Wp Ti
and a1 is a threshold, which P (Wx |Wp Ti ) must exceed. P (Wx |Wp Ti ) is also
the score of Wx in the CV of Wp Ti .
In the case of a word Wu , with only one translation or no translation in
the lexicon, similar equations are used:
NWx Wu ≥ k (4)
P (Wx |Wu ) ≥ a2 , (5)

where NWx Wu is the number of the total co-occurrences of words Wx in the
window of Wu , k is defined as the minimum co-occurrences of Wx and Wu ,
P (Wx |Wu ) is the conditional probability of Wx given the appearance of Wu
and a2 is a threshold, which P (Wx |Wu ) must exceed. P (Wx |Wu ) is the score
of Wx in the CV of Wu .
Whenever at least one of these criteria in each set of equations is not
met, the word Wx is deleted from the CV. In (3) and (5) we use two distinct
thresholds a1 and a2 with
a1 < a2 , (6)
as we would like polysemous words to have greater CVs. The parameters
a1 and a2 are set to 0.05 and 0.1 respectively.
Thus, we have constructed, a CV for each word in the set of the specified
grammatical categories. The CV consists of words that systematically co-
occur with the word of interest and their CV scores.
2.3 Lexical transfer selection

Based on the lexicon and the CVs, the algorithm can disambiguate an am-
biguous word, when appearing in a certain context, by comparing this con-
text, with the previously created CVs of its “senses”. The process includes
the following steps:
Step I: Let Wp be an ambiguous word, with Ti translational equivalents

extracted from the lexicon. A sentence is fed to the system and Wp is one of
the words. Each Ti of the “senses” Wp Ti are considered to be translation
candidates for the sentence at hand.
Step II: For each of Wp Ti an extended context vector (ECV) Vxyzw is
produced. The main characteristic of the ECV is its depth di . The depth di
denotes the number of “co-occurrence connections” between words, which
we use in order to “meaningfully connect” the word Wp Ti with any word
Wx . In our methodology:
di = 4 (7)
We believe that a greater value for di would capture the spurious co - oc-
currences of words, thus it would not represent a logical and linguistically
expected “sense-connectivity”. The ECV of Wp T1 consists of the words Wx
that appear in the CV V1 of Wp T1 , the words that appear in the CVs V1i
of each word in V1 , and so on, until depth = 4 (V1 words are in depth 1, V1i
words are in depth 2 etc).
Step III: Each of the word Wx that participates in the ECV is assigned an
extended context vector score ECV ScorexWp or ECV ScorexWu , depending
on the type of the word with which it co-occurs:
P (Wx |Wp Ti )
ECV ScorexWp Ti = (8)
21−dx
P (Wx |Wu )
ECV ScorexWu = , (9)
21−dx
where P (Wx |Wp Ti ) and P (Wx |Wu ) are defined in (3) and (5) and dx is the
depth in which Wx was found. In case of multiple appearances of Wx in the
ECV, we choose the one in the lowest depth, as it is the most significant in
the process of defining the sense of Wp .
Step IV: The final lexical transfer selection procedure examines each ECV
of Wp Ti separately. We compare the words inside the ±n window of word
Wp of the sentence under examination with those included in the ECV.
For each matched word we compute the appropriate score, using (8) and
(9). By adding the scores of the matched words, we assign to each possible
translational equivalent Ti a total score, depending on the associated ECV.
Finally, for the lexical transfer selection, we choose the word-sense Wp Ti
with the highest score. If both scores are equal, the algorithm does not
choose randomly and can output both as candidate translations. A feedback
mechanism could be foreseen to minimize these cases, if appropriate, in a
subsequent transfer selection round.
Polysemous Word Wp Words in First Level Vector V1 of Wp

✞✟✠ agenerase aqueous be capsule child clear colourless
Wp_T1 = solution_ ☎ ✆ ✝
concentrate contain contraindicated fill infusion
(solution, as a homogeneous liquid)
injection insulin IU mg ml oral patients pen vial
✁✂ ✄
Wp_T2 = solution_
adopt be find have possible problem
(solution, as answer, decision)
Test Sentence Test Sentence

(dots indicate tokens not from the selected POS types) (dots indicate tokens not from the selected POS types)
. clear . . colourless . solution . . crossborder . EURES . adviser . help . . find
visible . particle . . be . use . practical . solution . . . problem . . customise . .
service . . . need . . regional . customer
Polysemous word✞✟✠
Wp = solution Polysemous word✞✟✠
Wp = solution
Wp_T1 = solution_ ☎ ✆ ✝ ✁ ✂ ✄ (solution, as a homogeneous liquid) Wp_T1 = solution_ ☎ ✆ ✝ ✁ ✂ ✄ (solution, as a homogeneous liquid)
Wp_T2 = solution_ (solution, as answer, decision) Wp_T2 = solution_ (solution, as answer, decision)
Context Vectors Comparison Context Vectors Comparison

✞✟✠ ✁✂ ✄ ✞✟✠ ✁✂ ✄
Wp_T1 = solution_ ☎ ✆ ✝ Wp_T2 = solution_ Wp_T1 = solution_ ☎ ✆ ✝ Wp_T2 = solution_
Wi found in context vector Wi found in context vector Wi found in context vector Wi found in context vector
be: score=0.3 be: score=0.44 service: score=0.0825 find: score=0.09
clear: score=0.11 problem:score=0.07
colourless: score=0.08
visible: score=0.025
use: score=0.0475
✞✟✠
✞✟✠
Score(Wp_T1=solution_ ☎ ✆ ✝ ✁ ✂ ✄ )=0.5625 Score(Wp_T1=solution_ ☎ ✆ ✝ ✁ ✂ ✄ )=0.0825
Score(Wp_T2=solution_ )=0.44 ✞ ✟ ✠ Score(Wp_T2=solution_ )=0.16✁ ✂ ✄
Correct Selected: Wp_T1 = solution_ ☎ ✆ ✝ Correct Selected: Wp_T2 = solution_
Fig. 2: Example of lexical transfer selection
3 Results
The corpus used was the intera parallel corpus (Gavrilidou et al. 2004)
consisting of official eu documents in English and Greek from five different
domains; education, environment, health, law and tourism. The corpus
comprises 100, 000 aligned sentences, containing on average 830, 000 tokens
of the selected grammatical categories (nouns, verbs and adjectives) in either
language. The corresponding lemmas are 20, 000. The complete bilingual
lexicon comprises 5, 280 records (where multiple translational equivalences
of a word are counted as one record).
Evaluation was focused on the ability to resolve truly ambiguous words,
leaving aside words with synonymous translations, or erroneous transla-
tional candidates. For this purpose, a set of ambiguous words in English,
the contextually correct translation equivalent of which is “univocal” in
Greek, were manually selected. The selected set was {active, floor, seal,
settlement, solution, square, vision}.
For the above words, we extracted the sentences, in the parallel cor-
pus, that contain them. We adopted the 10-fold cross validation technique
for evaluation, computing the average results over the 10 iterations of the
algorithm. The possible answers, given by the algorithm were:
• Correct, when only the selected translational equivalent was present

in the target sentence.
• Wrong, when the selected translational equivalent was different from
the one appearing in the target sentence.
• No answer, when the translational equivalents were assigned the same
score.
Precision was calculated as the ratio of the correct answers to the sum of
correct and wrong answers. Recall was calculated as the ratio of the correct
answers to the possible correct answers. The experiment was first performed
with three sets of k, n: (i) k=3 and a window of n=5, (ii) k=3 and a window
of n=7, and (iii) k=3 and a window of n=15 (k referring to (2) and (4)).
The averaged results over the 10 iterations are shown in Table 1. In order
to simulate a larger corpus, we enlarged the produced CVs. We conducted
the experiment again, with k=1 and the three variants for the size of the
windows defined as previously. The results are also shown in Table 1, while
in Figure 2 we present an example of the system’s operation.
k=3 k=3 k=3 k=1 k=1 k=1

n=±5 n=±7 n=±15 n=±5 n=±7 n=±15
Correct 85.5 90.8 91.8 88.1 92.9 87.4
Wrong 9.6 14.5 26.3 11.7 17.0 31.7
No answer 25.1 15.2 2.4 20.7 10.6 1.4
Precision 89.9% 86.2% 77.7% 88.2% 84.5% 73.3%
Recall 71.1% 75.3% 76.2% 73.1% 77.1% 72.5%
Answered 79.1% 87.4% 98.0% 82.8% 91.2% 98.8%
Table 1: Results for k = 3 and k = 1
As expected, the wider the window, the more likely that the system gives an
answer, although the precision decreases. Especially for a window size n=15,
which in most cases in the given corpus contains all the tokens in a sentence,
we notice that the system’s performance declines disproportionately. This
is due to multiple erroneous statistical co-occurrences that are semantically
irrelevant in such wide windows.
In the second experiment, the percentage of the answered cases and
recall increased, compared to the first experiment, while precision slightly
decreased. The results also indicate that although a smaller window and
a higher absolute appearance threshold k would lead to a lower number of
answers, the accuracy increases.
To evaluate the performance of the method taking into consideration the
special characteristics of our corpus, we computed a “baseline” performance
(Gale et al. 1992c). We assign to each polysemous word, found in the
test set, the most frequent of its possible senses. The estimated baseline
performance was 55% on average due to the almost equal distribution of

the different senses of the words. Thus, the employment of context vectors
method lead to an increase in recall of almost 20%.
The proposed method can be used as a translational tool, for cat and
mt, especially as it concerns translation customization processes. Further-
more the method can be used as feedback mechanism for a refinement in
statistical lexicon extraction. As a validation, we conducted a second ex-
periment, over all the words in the lexicon, which had multiple translations
(although not always correct). The results were similar to the ones presented
in Table 1. Thus, such methods can be of utmost importance for bootstrap-
ping the development of multilingual lexica with semantic constraints on the
potential cross-lingual equivalences. Forthcoming experiments will include
tests on larger corpora and use of linguistically principled window selection.
REFERENCES
Brown, Peter F., Stephen Della Pietra, Vincent J. Della Pietra & Robert L
Mercer. 1991. “Word-Sense Disambiguation Using Statistical Methods”.
Proceedings of the 29th Annual Meeting on Association for Computational
Linguistics (ACL’91 ), 264-270. Berkeley, Calif.
Brown, Ralf D. 1997. “Automated Dictionary Extraction for ‘Knowledge-Free’
Example-Based Translation”. Proceedings of the 7th International Confer-
ence on Theoretical and Methodological Issues in Machine Translation
(TMI-97 ), 111-118. Santa Fe, New Mexico.
Dagan, Ido, Alon Itai & Ulrike Schwall. 1991. “Two Languages Are More Infor-
mative Than One”. Proceedings of the 29th Annual Meeting on Association
for Computational Linguistics (ACL’91 ), 130-137. Berkeley, Calif.
Gale, William A. & Kenneth W. Church. 1991 “A Program for Aligning Sentences
in Parallel Corpora”. Proceedings of the 29th Annual Meeting on Association
for Computational Linguistics (ACL’91 ), 177-184. Berkeley, Calif.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992a. “One Sense per
Discourse”. Proceedings of the Workshop on Speech and Natural Language,
233-237. Harriman, New York.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992b. “Using Bilin-
gual Materials to Develop Word Sense Disambiguation Methods”. Proceed-
ings of the 4th International Conference on Theoretical and Methodological
Issues in Machine Translation (TMI-92 ), 101-112. Montreal, Canada.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992c. “Estimating
Upper and Lower Bounds on the Performance of Word-Sense Disambigua-
tion Programs”. Proceedings of the 30rd Annual Meeting on Association for
Computational Linguistics (ACL’92 ), 249-256. Delaware, Newark.
Gale, William A., Kenneth W. Church & David Yarowsky. 1992d. “A Method for
Disambiguating Word Senses in a Large Corpus”. Common Methodologies
in Humanities Computing and Computational Linguistics, (= Special issue
of Computers and the Humanities, December 1992) 26:5-6.415-439.
Gavrilidou, M., P. Labropoulou, E. Desipri, V. Giouli, V. Antonopoulos,
S. Piperidis. 2004. “Building Parallel Corpora for eContent Professionals”.
COLING Workshop on Multilingual Linguistic Resources (MLR2004 ), 90-93.
Geneva, Switzerland.
Ide, Nancy & Jean Véronis. 1998. “Introduction to the Special Issue on Word
Sense Disambiguation: The State of the Art”. Computational Linguistics
24:1.2-40.
Kilgarriff, Adam. 1997. “I don’t Believe in Word Senses”. Computers and the
Humanities 31:2.91-113.
Melamed, I. Dan. 1997. “A Word-to-Word Model of Translationan Equivalence”.
Proceedings of the 35th Annual Meeting on Association for Computational
Linguistics (ACL’97 ), 490-497. Madrid, Spain.
Piperidis, Stelios, Harris Papageorgiou & Sotiris Boutsis. 2000. “From Sentences
to Words and Clauses”. Parallel Text Processing ed. by Jean Véronis, 117-
138. Dordrecht, The Netherlands: Kluwer Academic.
Shütze, Hinrich. 1998. “Automatic Word Sense Discrimination”. Word Sense
Disambiguation (= Special issue of Computational Linguistics, March 1998)
24:1.97-123.
Tiedemann, Jörg. 2001. “Predicting Translations in Context”. Proceedings of the
International Conference on Recent Advances in Natural Language Processing
(RANLP-2001 ), 240-244. Tzigov Chark, Bulgaria.
Véronis, Jean. 2000. “From the Rosetta Stone to the Information Society: A
Survey of Parallel Text Processing”. Parallel Text Processing ed. by Jean
Véronis, 1-25. Dordrecht, The Netherlands: Kluwer Academic.
Yarowsky, David. 1993. “One Sense Per Collocation”. Proceeding of the Work-
shop on Human Language Technology, 266-271. Princeton, New Jersey.
Yarowsky, David. 1995. “Unsupervised Word Sense Disambiguation Rivaling Su-
pervised Methods”. Proceedings of the 33rd Annual Meeting on Association
for Computational Linguistics (ACL’95 ), 189-196. Cambridge, Mass.
Multi-Perspective Evaluation of the FAME
Speech-to-Speech Translation System for Catalan,
English and Spanish
Victoria Arranz∗ , Elisabet Comelles∗∗ & David Farwell∗∗
∗
ELDA - Evaluation and Language Resources Distribution Agency,
∗∗
TALP Research Centre, Universitat Politcnica de Catalunya & Instituci
Catalana de Recerca i Estudis Avanats
Abstract
This paper describes the final evaluation of the FAME interlingua-
based speech-to-speech translation system for Catalan, English and
Spanish. It is an extension of the already existing NESPOLE! Sys-
tem that translates between English, French, German and Italian.
However, the FAME modules have now been integrated in an Open
Agent Architecture platform. We describe three types of evalua-
tion (task-oriented, performance-based and of user satisfaction) and
present their results when applied to our system. We also compare
the results of the system with those obtained by a stochastic trans-
lator developed independently within the FAME project.1
1 Introduction
The FAME interlingual speech-to-speech translation system (SST) for Cata-

lan, English and Spanish has been developed at the Universitat Politcnica
de Catalunya (UPC), Spain, as part of the recently completed European
Union-funded FAME project (Facilitating Agent for Multicultural Exchange
http://isl.ira.uka.de/fame/). The FAME system is an extension of the
NESPOLE! system (Metze et al. 2002) to Catalan and Spanish in the do-
main of hotel reservations. At its core is a robust, scalable, interlingual SST
system having cross-domain portability that allows for effective translingual
communication in a multi-modal setting. However, despite being originally
developed within the NESPOLE! framework, it was later ported to an Open
Agent Architecture that resulted in a number of benefits, including a speed
up of the end-to-end translation process.
The main advantage of an interlingual approach lies in the ease to add
new languages to the translation system. Only analysis and generation
grammars need to be developed for the new languages. Furthermore, the
developers do not need to be bilingual: only monolingual source-language
analysis or target-language generation developers are required.
1
This research was financed by the FAME (IST-2001-28323) and ALIADO (TIC2002-
04447-C02) projects. We would also like to thank Climent Nadeu and Jaume Padrell.
238 VICTORIA ARRANZ, ELISABET COMELLES & DAVID FARWELL
The complexity of spontaneous speech also had to be overcome. Some of the

main problems were disfluencies, incomplete or non-grammatical sentences,
etc. This was also a major reason for using an interlingual approach, since
it allows the translation of sentence fragments, non-grammatical sentences
etc., without being totally dependent on well-formed syntax.
Portability was also an issue considered as it is one of the main draw-
backs generally attributed to interlingual systems. However, this was not
the case for our system. The structure of the grammars presents a clear divi-
sion between the rules and lexical items that are portable to other domains
and those that are task-specific. This allowed us to develop the translation
modules quickly and efficiently and to port newly acquired vocabulary and
forms of expression to the other languages. Other partners in the project
were also able to port the Spanish and Catalan components to a very dif-
ferent domain: the medical domain (Schultz et al. 2004).
The interlingua used is Interchange Format (IF), created by the C-STAR
Consortium (http://www.c-star.org; (Levin et al. 2002)) and adapted for
this effort. Its central advantage for representing dialogue interactions such
as those typical of SST systems is that rather than capturing the detailed
semantic and stylistic distinctions, it characterizes the intended conversa-
tional goal of the interlocutor. Even so, it is still necessary to consider the
structural and lexical properties related to Spanish and Catalan.
2 System architecture
Although the system architecture was initially based on NESPOLE!, all

of the modules have now been integrated in an Open Agent Architecture
platform (Holzapfel et al. 2003). This type of multi-agent framework offers a
number of technical features for a multi-modal environment that are highly
advantageous for both system developers and users.
Broadly speaking, the FAME system consists of an analysis component
and generation component. The analysis component transcribes spoken
source language utterances and maps that transcription into an interlin-
gual representation. The generation component maps from interlingua into
target language text and then produces a synthesized version of that text.
For both Catalan and Spanish automatic speech recognition (ASR),
we used the JANUS Recognition toolkit (JRTk) developed at UKA and
CMU (Woszczyna et al. 1993). For the text-to-text component, the anal-
ysis side utilizes the top-down, chart-based SOUP parser (Gavaldà 2000)
with full domain action level rules to parse input utterances. Natural lan-
guage generation is done with GenKit, a pseudo-unification based gener-
ation tool (Tomita et al. 1988). For both Spanish and Catalan, we use a
MULTI-PERSPECTIVE EVALUATION OF THE FAME SYSTEM 239
Text-to-Speech (TTS) system fully developed at the UPC, which uses a

unit-selection based, concatenative approach to speech synthesis.
For the initial development of the Spanish analysis grammar, the al-
ready existing NESPOLE! English and German analysis grammars were
used as a reference point. Despite using these grammars, great efforts
were taken to overcome important differences between English and German
and the Romance languages in focus. The Catalan analysis grammar, in
turn, was adapted from the Spanish analysis grammar and, in this case, the
process was rather straightforward. The Spanish generation grammar was
mostly developed from scratch, although some of the underlying structure
was taken from the NESPOLE! English generation grammar. Language-
dependent properties such as word order, gender and number agreement,
etc. needed to be dealt with representationally but, on the whole, starting
with existing structural descriptions was useful. On the other hand, the
generation lexica play a major role in the generation process and these had
to be developed from scratch. Again, the Catalan generation lexicon was
adapted from the Spanish directly with almost no significant complication.
3 Evaluation
Evaluation was done on real users of the SST system, in order to:
• examine the performance of the system in as real a situation as pos-
sible, as if it were to be used by a real tourist trying to book accom-
modation in Barcelona,
• study the influence of using ASR in translation,
• compare the performance of a statistical approach and an interlingual
approach in a restricted semantic domain and for a task of this kind2 ,
• investigate the relevance of certain standard evaluation methods used
in statistical translation when applied to interlingual translation.
3.1 Evaluation: Data recording and treatment

Prior to evaluation, several tasks had to be done to obtain the necessary
data. These included dialogue and data recording during real3 system us-
age; adapting the translation system to register every utterance from the
different translation approaches and from ASR; recruiting people to play
the roles of the users; designing the scenarios; designing the sequence of
events for the recording sessions; transcribing all speech data; etc.
2
A statistical system was built in parallel within the FAME project so as to compare
approaches, both in terms of results and efforts.
3
In order to perform a quantitative evaluation of the system, a number of scenarios as
real as possible were set up with external users and in reality-resembling situations.
Conversations took place between an English-speaking client and a Cata-

lan- or Spanish-speaking travel agent. Twenty dialogues were carried out
by a total of 12 people. Of these, 10 people were completely inexperienced
with respect to the task and unfamiliar with the system, while 2 were fa-
miliar with both the task and the system. The former (the 10 speakers)
participated in 2 dialogues each and the latter (the other 2) participated in
10 dialogues each. That way, each dialogue would resemble a real-situation
dialogue where one of the speakers would always be familiar with the task
and the system while the other one would not. It should also be added that
all English speakers recruited for the evaluation were non-native speakers of
the language and the results from speech recognition and translation have
suffered from this. However, we considered this realistic as most of the po-
tential system users would actually be from non-English speaking countries.
Five different scenarios were designed per speaker (agent or client) and
they were available in all relevant languages (agent scenarios in Catalan
and Spanish and client scenarios in English). Before starting the recording
of the data, speakers were provided with very basic knowledge about the
system. Computer screens only showed the user their own scenario related
information and system interface. The system interface provided them with
the ASR output of their own contribution and the translation output (from
both the interlingual and statistical systems) of the other user’s utterances.
The former allowed the speakers to check if the ASR had recognised their
utterances properly and thus allow for translation to go on or intervene
before communication failure took place, say by repeating their utterance.
The latter allowed them to have the two translation outputs from the other
speaker’s utterances on the screen since the synthesizer only provided one
of the translations (choice based on a very simple algorithm).
Dialogue recording took place in a room set up for that purpose. Speak-
ers were situated separately with their respective computers in such a way
that they could only view their own computer screen. Once recording was
finished and all conversations were registered: a) All speech files were tran-
scribed and all utterances were grouped according to the dialogue they
belonged to. Speech disfluencies were also marked and all utterances were
tagged; b) Reference translations were created for each speaker+dialogue
file, so as to evaluate translation using BLEU and mWER metrics.
3.2 Task-oriented evaluation metrics

A task-oriented methodology was developed to evaluate both the end-to-end
system and the source language transcription to target language text sub-
component. An initial version of this methodology had already proven useful
during system development since it allowed us to analyse content and form
independently and, thus, contributed to practical system improvements.

The evaluation criteria used were broken down into three main categories
(Perfect, Ok and Unacceptable), while the second was further subdivided
into Ok+, Ok and Ok-. During the evaluation these criteria were indepen-
dently applied to form and to content. In order to evaluate form, only
the generated output was considered by the evaluators. To evaluate con-
tent, evaluators took into account both the input utterance or text and the
output text or spoken utterance. Thus, the meaning of the metrics varies
according to whether they are being used to judge form or to judge content:
• Perfect: well-formed output (form) or full communication of speak-
ers’ information (content).
• Ok+/Ok/Ok-: acceptable output, grading from some minor error of
form (e.g., missing determiner) or missing information (Ok+) to some
more serious problem of form or content (Ok-) resulting in awkward-
ness or important missing information.
• Unacceptable: unacceptable output, either essentially unintelligible
or simply totally unrelated to the input.
3.2.1 Task-oriented evaluation results

The results obtained from the evaluation of the end-to-end translation sys-
tem for the different language pairs are shown in Tables 1, 2, 3 and 4,
respectively. After studying the results we can conclude that many of the
errors obtained are caused by the ASR component. However, results remain
rather good since, for the worst of our language pairs (English-Spanish), a
total of 62.4% of the utterances were judged acceptable in regard to content.
This is comparable to evaluations of other state-of-the-art systems such as
NESPOLE! (Lavie et al. 2002), which obtained slightly lower results and
were performed on Semantic Dialog Units (SDUs)4 instead of utterances
(UTTs), thus simplifying the translation task. The Catalan-English and
English-Catalan pairs were both quite good with 73.1% and 73.5% of the
utterances being judged acceptable, respectively, and the Spanish-English
pair performed very well with 96.4% of the utterances being acceptable.
As seen in these tables, better results have been obtained for the Spanish-
English and Catalan-English directions. We should point out that Catalan
and Spanish Language Models used for ASR were developed specifically for
this task while the English Language Models used were those provided by
the project’s partners. In addition, we should also consider that a great
effort was devoted to the development of both Catalan and Spanish anal-
ysis and generation grammars. However, English analysis and generation
4
SDUs are smaller meaning-porting units, where usually several of them are contained
within a dialogue utterance.
SCORES FORM CONT SCORES FORM CONT

PERFECT 70.59% 31.93% PERFECT 92.85% 71.42%
OK+ 5.04% 15.12% OK+ 4.77% 11.90%
OK 6.72% 9.25% OK 1.19% 7.14%
OK- 9.25% 16.80% OK- 0% 5.96%
UNACCEPT 8.40% 26.90% UNACCEPT 1.19% 3.58%
Table 1: Evaluation of end-to-end Table 2: Evaluation of end-to-end

translation for Catalan-English translation for Spanish-English
based on 119 UTTs based on 84 UTTs
SCORES FORM CONT SCORES FORM CONT

PERFECT 64.96% 34.19% PERFECT 64.80% 17.60%
OK+ 15.39% 11.97% OK+ 4.80% 10.40%
OK 8.54% 14.52% OK 12.00% 18.40%
OK- 5.12% 12.82% OK- 8.80% 16.00%
UNACCEPT 5.99% 26.50% UNACCEPT 9.60% 37.60%
Table 3: Evaluation of end-to-end Table 4: Evaluation of end-to-end
translation for English-Catalan translation for English-Spanish
based on 117 UTTs based on 125 UTTs
grammars were not so developed. Because generation from a well-formed IF

is more robust than from a fragmented IF, better analysis components tend
to result in better overall throughput. These two factors are why better
results are achieved when Spanish and Catalan are source languages.
3.3 Statistical evaluation metrics

Evaluation of our end-to-end speech-to-speech translation system has also
been carried out by means of statistical metrics such as BLEU and mWER.
We anticipated that results would drop drastically when compared to the
manual evaluation presented in Section 3.2 and this turned out to be the
case. This considerable drop is due to a number of factors:
• Resulting translations are compared to a single reference translation,
thus failing to account for language variety and flexibility and nega-
tively impacting results,
• BLEU and mWER penalise diversions from the reference translation,
even if these result from minor errors that do not affect intelligibility,
• The English-speaking volunteers for the evaluation were not native
speakers, which considerably complicated the ASR task.
3.3.1 Statistical evaluation results

Results obtained both from the statistical approach and the interlingua-
based one are shown below in Tables 5 and 6, respectively:
Lang Pairs # sent. mWER BLEU Lang Pairs # sent. mWER BLEU
CAT2ENG 119 74.66 0.1218 CAT2ENG 119 78.98 0.1456
ENG2CAT 117 77.84 0.1573 ENG2CAT 117 81.19 0.2036
SPA2ENG 84 61.10 0.1934 SPA2ENG 84 60.93 0.3462
ENG2SPA 125 80.95 0.1052 ENG2SPA 125 86.71 0.1214
Table 5: Results of the statistical Table 6: Results of the

MT system interlingua-based MT system
The results are consistent with respect to the relative performance of the dif-
ferent systems in terms of language pairs. The Spanish-to-English systems,
both statistical and rule-based, performed best. The English-to-Spanish sys-
tems, both statistical and rule-based, performed the worst. The Catalan-to-
English and English-to-Catalan systems performed somewhere in between
with the latter slightly outperforming the former.
As for the relative performance of the statistical systems as opposed to
the rule-based systems, the results are entirely contradictory. The mWER
scores of the statistical system are consistently better than those of the
rule-based systems (apart from the Spanish-to-English case where the two
systems essentially performed equally). On the other hand, the BLEU scores
of the rule-based system are consistently better than those of the statistical
systems. It is unclear how this happened although it is likely that since the
BLEU metric rewards overlapping strings of words (as opposed to simply
matching words) that the rule-based systems produced a greater number of
correct multiword sub-strings than the statistical systems did. In any case,
were it not for the low performance of all the systems and the very limited
size of the test corpus, this would be a very telling result with regard to the
validity of the evaluation metrics.
3.4 User satisfaction evaluation

A final user-satisfaction evaluation was carried out both from a quantitative
and a qualitative point of view. This is done on a system which provided
the user with both the interlingual-based and statistical-based translations.
3.4.1 Quantitative study

The quantitative study of the user satisfaction consists in measuring the re-
sults obtained from the end-to-end translation system according to a number
of metrics established for that purpose. Both metrics 1 and 3 are used as
reference points for determining the values for metrics 2 and 4, respectively.
The metrics used are detailed below:
1. Number of turns per dialogue: This establishes the number of turns
per dialogue.
2. Success in communicating the speaker’s intention/Successful turns per

dialogue: This measures the success of each turn.
3. Number of items of target information per dialogue: This refers to the
number of different blocks of semantic information contained in each
sentence or turn.
4. Successful items of target information obtained: This measures the
number of successful blocks of semantic information passed from one
user to the other (agent and client).
5. Number of disfluencies per dialogue: This refers to the number of
disfluencies uttered by the users, covering mostly erroneous mouse
clicks, pauses, doubts and mistakes while speaking.
6. Number of repetitions per dialogue: This reflects the number of re-
peated turns per dialogue so as to show how many repetitions the
users have had to go through to achieve their goal.
7. Number of abandoned turns per dialogue: This presents those turns
that have been abandoned by the user, mostly after several repetitions.
Before providing any figures, results obtained from metrics 2 and 4 should be
further explained given that they seem to provide much lower results than
they actually do. The success obtained both at the level of a turn and of
an item of target information is shown in a global way, that is, taking into
account the full number of repetitions (which are considered in reference
metrics 1 and 3). Thus, a dialogue may be successful by means of some
repetitions while the numbers of success in metrics 2 and 4 are rather low.
In order to establish this success, one should also look at metric 7, which
reflects the number of abandoned turns and, thus, failures in transmitting
target information (speaker’s intention).
Last but not least, and as already explained in Section 3.1, users playing
the role of the English-speaking client were not native speakers of English,
which certainly makes ASR an even more complex task. This is particularly
so in some dialogues where the speakers have little mastery of the language.
Table 7 shows the results obtained with the above metrics.
As observed in M-7, 7 dialogues have successfully communicated all in-
formation; 8 dialogues have only given up on one turn, and 3 dialogues on 2.
The remaining 2 dialogues have abandoned 3 and 4 turns, respectively. This
is not an important loss, bearing in mind that after analysing the results, it
was observed that a large number of problems come from very simple turns
like greetings and thanking.
3.4.2 Qualitative study

The quantitative study presented above has been supplemented by a qual-
itative evaluation based on users’ responses to a brief questionnaire. Users
Dialogues M-1 M-2 M-3 M-4 M-5 M-6 M-7

Eng/Spa-1 7 7 11 10 0 0 0
Eng/Spa-2 24 13,5 33 21,5 0 4 2
Eng/Spa-3 24 19 36 27 2 1 1
Eng/Spa-4 12 7,5 22 15 1 2 1
Eng/Spa-5 22 16,5 36 28 2 4 1
Eng/Spa-6 18 7 18 8 5 9 0
Eng/Spa-7 9 6 14 10 3 1 0
Eng/Spa-8 32 14,5 42 21 0 10 3
Eng/Cat-9 23 9,5 33 15,5 1 12 1
Eng/Cat-10 40 20,5 52 26,5 3 17 0
Eng/Cat-11 20 9 34 16 1 8 1
Eng/Cat-12 37 16 44 18,5 1 15 1
Eng/Cat-13 6 5,5 12 11 1 1 0
Eng/Cat-14 37 18,5 48 27 0 14 4
Eng/Cat-15 25 8 35 15 0 13 2
Eng/Cat-16 37 24 52 35,5 2 9 2
Eng/Cat-17 11 5,5 23 12 0 5 0
Eng/Cat-18 31 15 43 24 1 14 1
Eng/Cat-19 7 4,5 13 9 1 2 0
Eng/Cat-20 23 14,5 32 20,5 1 9 1
Table 7: User satisfaction results
were asked 10 questions about the naturalness and lenght of the dialogue,
the behaviour of the system, user’s success in getting what he wanted, the
difficulty to correct errors, if the user would use this system again, etc.The
average response was 3.4 points out of 5. An informal inspection of the
results per questionnaire indicates that the reaction of the users as a whole
was consistent and weakly positive (taking 3.0 as a median).
4 Conclusions
This article has described the FAME interlingua-based speech-to-speech

translation system for Catalan, English and Spanish and the three different
evaluations performed on real users and in lifelike situations.
The different evaluations prove that the system is already at an inter-
esting and promising stage of development. A public demonstration of the
system also took place with untrained users participating and testing the
system at the Forum of Cultures in Barcelona, during July 2004. Results
from this open event were also very satisfactory.
Having reached this level of development, our next step will be to solve
some remaining technical problems and to expand the system both within
this domain and to others. Among the technical problems, we need to focus
on improving the ASR component. Another problem to be confronted is
dealing with degraded translations. An option here may be to incorporate

within the dialogue model strategies for the speakers to be able to request
repetitions or reformulations.
Last but not least, a detailed study has been carried out of the pros and
cons of the interlingua representation used by our system when applied to
the Romance languages described.
REFERENCES
Gavaldà, Marsal. 2000. “SOUP: A Parser for Real-world Spontaneous Speech”.
Proceedings of the 6th International Workshop on Parsing Technologies
(IWPT-2000 ), XX-YY. Trento, Italy.
Holzapfel, Hartwig, I. Rogina, M. Wölfel & T. Kluge. 2003. “FAME Deliverable
D3.1: Testbed Software, Middleware and Communication Architecture”.
Lavie, Alon, F. Metze, R. Cattoni & E. Constantini. 2002. “A Multi-Perspective
Evaluation of the NESPOLE! Speech-to-Speech Translation System”. Pro-
ceedings of Speech-to-Speech Translation Workshop at the 40th Annual Meet-
ing of the Association of Computational Linguistics (ACL-2002 ), XX-YY.
Philadelphia, PA.
Levin, Lori, D. Gates, D. Wallace, K. Peterson, A. Lavie, F. Pianesi, E. Pianta,
R. Cattoni & N. Mana. 2002. “Balancing Expressiveness and Simplicity in
an Interlingua for Task based Dialogue”. Proceedings of Speech-to-Speech
Translation Workshop at the 40th Annual Meeting of the Association of Com-
putational Linguistics (ACL-2002 ), XX-YY. Philadelphia, Penn.
Metze, Florian, J. McDonough, J. Soltau, C. Langley, A. Lavie, L. Levin, T. Schultz,
A. Waibel, L. Cattoni, G. Lazzari, N. Mana, F. Pianesi & E. Pianta. 2002.
“The NESPOLE! Speech-to-Speech Translation System”. Proceedings of the
Human Language Technology Conference (HLT-2002 ), XX-YY. San Diego,
Calif.
Searle, John. 1969. Speech Acts: An Essay in the Philosophy of Language.
Cambridge, U.K.: Cambridge University Press.
Schultz, Tanja, D. Alexander, A.W. Black, K. Peterson, S. Suebvisai & A. Waibel.
2004. “A Thai Speech Translation System for Medical Dialogs”. Proceedings
of the Human Language Technology Conference HLT/NAACL-2004, XX-YY.
Boston, Mass.
Tomita, Masaru & E.H. Nyberg. 1988. “Generation Kit and Transformation
Kit, Version 3.2, User’s Manual”. Technical Report (CMU-CMT-88-MEMO).
Center for Machine Translation, Carnegie Mellon University, Pittsburgh, PA.
Woszczyna, Monika, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina,
C. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Aoki-Waibel, A. Waibel &
W. Ward. 1993. “Recent Advances in JANUS: A Speech Translation Sys-
tem”. Proceedings of the 1993 Eurospeech. Berlin.
Parallel Corpora for Medium Density Languages
Dániel Varga∗ , Péter Halácsy∗ , András Kornai∗ ,

Viktor Nagy∗∗ , László Németh∗ & Viktor Trón∗∗∗
∗
Media Research Centre at the Technical University of Budapest
∗∗
Hungarian Research Institute for Linguistics
∗∗∗
University of Edinburgh & University of Saarland
Abstract
The choice of natural language technology appropriate for a given
language is greatly impacted by ‘density’ (availability of digitally
stored material). More than half of the world speaks medium den-
sity languages, yet many of the methods appropriate for high or
low density languages yield suboptimal results when applied to the
medium density case. In this paper we describe a general method-
ology for rapidly collecting, building, and aligning parallel corpora
for medium density languages, illustrating our main points on the
case of Hungarian, Romanian, and Slovenian. We also describe and
evaluate the hybrid sentence alignment method we are using.
1 Introduction
There are only a dozen large languages with a hundred million speakers or
more, accounting for about 40% of the world population, and there are over
5,000 small languages with less than half a million speakers, accounting
for about 4% (Grimes 2003). In this paper we discuss some ideas about
how to build parallel corpora for the five hundred or so medium density
languages that lie between these two extremes based on our experience
building a 50M word sentence-aligned Hungarian-English parallel corpus.
Throughout the paper we illustrate our strategy mainly on Hungarian (14m
speakers), also mentioning Romanian (26m speakers), and Slovenian (2m
speakers), but we emphasize that the key factor leading the success of our
method, a vigorous culture of native language use and (digital) literacy, is
by no means restricted to Central European languages. Needless to say,
the density of a language (the availability of digitally stored material) is
predicted only imperfectly by the population of speakers: major Prakrit or
Han dialects, with tens, sometimes hundreds, of million speakers, are low
density, while minor populations, such as the Inuktitut, can attain high
levels of digital literacy given the political will and a conscious Hansard-
building effort (Martin et al. 2003). With this caveat, population (or better,
GDP) is a very good approximation for density, on a par with web size.
248 VARGA, HALÁCSY, KORNAI, NAGY, NÉMETH & TRÓN
The rest of the paper is structured as follows. In Section 2 we describe our

methods of corpus collection and preparation. Our hybrid sentence-level
aligner is discussed in Section 3. Evaluation is the subject of Section 4.
2 Collecting and preparing the corpus
Starting with Resnik (1998), mining the web for parallel corpora has emerged
as a major technique, and between English and another high density lan-
guage, such as Chinese, the results are very encouraging (Chen & Nie 2000,
Resnik & Smith 2003). However, when no highly bilingual domain (like .hk
for Chinese or .ca for French) exists, or when the other language is much
lower density, the actual number of automatically detectable parallel pages
is considerably smaller: for example, Resnik and Smith find less than 2,000
English-Arabic parallel pages for a total of 2.3m words.
For medium density languages parallel web pages turn out to be a sur-
prisingly minor source of parallel texts. Even in cases where the popula-
tion and the economy is sizeable, and a significant monolingual corpus can
be collected by crawling, mechanically detectable parallel or bilingual web
pages exist only in surprisingly small numbers. For example a 1.5 billion
word corpus of Hungarian (Halácsy et al. 2004), with 3.5 million unique
pages, yielded only 270,000 words (535 pages), and a 200m word corpus of
Slovenian (202,000 pages) yielded only 13,000 words (42 pages) using URL
parallelism as the primary matching criterion as in PTMiner (Chen & Nie
2000).
Web pages are undoubtedly valuable for a diversity of styles and contents
that is greater than what could be expected from any single source, but a few
hundred web pages alone fall short of a sensible parallel corpus. Therefore,
one needs to resort to other sources, many of them impossible to find by
mechanical URL comparison, and often not even accessible without going
through dedicated query interfaces.
Literary texts. The Hungarian National Library maintains a large
public domain digital archive Magyar Elektronikus Könyvtár ’Hungarian
Electronic Library’ mek.oszk.hu/indexeng.phtml with many classical texts.
Comparison with the Project Gutenberg archives at www.gutenberg.org
yielded well over a hundred parallel texts by authors ranging from Jane
Austen to Tolstoy. Equally importantly, many works still under copyright
were provided by their publishers under the standard research exemption
clause. While we can’t publish most of these texts in either language, we
publish the aligned sentence pairs alphabetically sorted. This “shuffling”
somewhat limits usability inasmuch as higher than sentence-level text layout
becomes inaccessible, but at the same time makes it prohibitively hard to
reconstruct the original texts and contravene the copyright. Since shuffling
PARALLEL CORPORA FOR MEDIUM DENSITY LANGUAGES 249
nips copyright issues in the bud, it simplifies the complex task of dissemi-
nating aligned corpora considerably.
Religious texts. The entire Bible has been translated to over 400
languages and dialects, and many religious texts from the Bhagavad Gita to
the Book of Mormon enjoy nearly as broad currency. The Catholic Church
makes a special effort to have papal edicts translated to other languages
from the original Latin (see www.vatican.va/archive).
International Law. From the Geneva Convention to the Universal
Declaration of Human Rights (www.unhchr.ch/udhr) many important le-
gal documents have been translated to hundreds of languages and dialects.
Those working on the languages of the European Union have long availed
themselves of the CELEX database.
Movie captioning. Large mega-productions are often dubbed, but
smaller releases will generally have only captioning, often available for re-
search purposes. For cult movies there is also a vigorous subgenre of amateur
translations by movie buffs.
Software internationalization. Multilingual software documentation
is increasingly becoming available, particularly for open source packages
such as KDE, Gnome, OpenOffice, Mozilla, the GNU tools, etc (Tiedemann
and Nygaard 2004).
Bilingual magazines. Both frequent flyer magazines and national
business magazines are often published with English articles in parallel.
Many magazines from Scientific American to National Geographic have edi-
tions in other languages, and in many countries there exist magazines with
complete mirror translations (for instance, Diplomacy and Trade Magazine
publishes every article both in Hungarian and English).
Annual reports, corporate home pages. Large companies will often
publish their annual reports in English as well. These are usually more
strictly parallel than the rest of their web pages.
There is no denying that the identification of such resources, negoti-
ating for their release, downloading, format conversion, and character-set
normalization remain labor-intensive steps, with good opportunities for au-
tomation only at the final stages. But such an effort leverages exactly the
strengths of medium density languages: the existence of a joint cultural
heritage both secular and religious, of national institutions dedicated to the
preservation and fostering of culture, of multinational movements (particu-
larly open source) and multinational corporations with a notable national
presence, and of a rising tide of global business and cultural practices. Al-
together, the effort pays off by yielding a corpus that is two-three orders of
magnitude larger, and covering a much wider range of jargons, styles, and
genres, than what could be expected from parallel web pages alone. Table 1
summarizes the different types of texts and their sizes in our Hungarian-
English parallel corpus. In addition to the texts, we identified other signifi-

cant lexical resources, such as public domain glossaries specifically prepared
for EU law, Microsoft software, Linux, and other particular domains and
most importantly, a large (over 254,000 records) general-purpose bilingual
dictionary manually created over many years by Attila Vonyó. Since there
is no guarantee that such materials are available for other languages, in the
next section we describe a sentence alignment algorithm which does not rely
on the existence of such bilingual dictionaries, but can take advantage of it
if it is available.
source docs E words (m) H words (m)

Literary 156 14.6 11.5
Legal 10374 24.1 18.3
Captioning 437 2.5 1.9
Sw docs 187 0.8 0.7
Magazines 107 0.3 0.3
Business 19 0.5 0.4
Religious 122 2.3 2.0
Web 435 0.3 0.2
Total 11550 44.6 34.6
Table 1: Distribution of text types in the Hungarian–English parallel

corpus
After some elementary format-detection and conversion routines (using stan-

dard open source tools such as catdoc and pdftotext), we have a corpus of
raw text consisting of assumed parallel documents. While the texts them-
selves were collected and converted predominantly manually, the aligned bi-
corpus is derived by entirely automatic methods. Due to the manual effort,
parallelism is nearly perfect, therefore the size of the raw corpus of collected
texts is not significantly different from the size of the useful (aligned) data.
The first steps of our corpus preparation pipeline are tokenizers per-
forming sentence and paragraph boundary detection and word tokeniza-
tion. These are relatively simple flex programs (along the lines of Mikheev
2002) both for English and Hungarian. For languages with more complex
morphology such as Hungarian, it makes sense to conflate by stemming mor-
phological variants of a lexeme before the texts are passed to the aligner.
We used hunmorph, a language-independent word analysis toolkit (Trón et
al. 2005) both for Hungarian and English.
The most important ingredient of the pipeline is of course automatic sen-
tence alignment which we carried out using our own algorithm and software
hunalign, described in detail in the next section.
3 Sentence level alignment
There are three main approaches to the problem of corpus alignment at

the sentence level: length-based (Brown et al. 1991, Gale & Church 1991),
dictionary- or translation based (Chen 1993, Melamed 1996, Moore 2002),
and partial similarity-based (Simard & Plamondon 1998). This last method
in itself may work well for Indo-European languages (probably better be-
tween English and Romanian than English and Slovenian), but for Hun-
garian the lack of etymological relation suggests that the number of cog-
nates will be low. Even where the cognate relationship is clear, as in com-
puter/kompjúter, strike/sztrájk etc., the differences in orthography make it
hard to gain traction by this method. Therefore, we chose to concentrate on
the dictionary and length-based methods, and designed a hybrid algorithm,
hunalign, that successfully amalgamates the two.
In the first step of the alignment algorithm, a crude translation of the
source text is produced by converting each word token into the dictionary
translation that has the highest frequency in the target corpus, or to itself
in case of lookup failure.
This pseudo target language text is then compared against the actual
target text on a sentence by sentence basis. The similarity score between a
source and a target sentence consists of two major components: token-based
and length-based. The dominant term of the token-based score is the num-
ber of shared words in the two sentences, normalized with the larger token
count of the two sentences. A separate reward term is added if the pro-
portion of shared numerical tokens is sufficiently high in the two sentences
(especially useful for the alignment of legal texts).
For the length-based component, the character counts of the original
texts are incremented by one, and the score is based on the ratio of longer
to shorter. The relative weight of the two components was set so as to
maximize precision on the Hungarian–English training corpus, but seems a
sensible choice for other languages as well. Paragraph boundary markers are
treated as sentences with special scoring: the similarity of two paragraph-
boundaries is a high constant, the similarity of a paragraph-boundary to a
real sentence is minus infinity, so as to make paragraph boundaries pair up.
The similarity score is calculated for every sentence pair around the di-
agonal of the alignment matrix (at least a 500-sentence neighborhood is cal-
culated or all sentences closer than 10% of the longer text). This is justified
by the observation that the beginning and the end of the texts are consid-
ered aligned and that the sentence ratio in the parallel text represents the
average one-to-many assignment ratio of alignment segments, from which
no significant deviations are expected. We find that 10% is high enough to
produce reassuringly high recall figures even in the case of faulty parallelism
such as long surplus chapters. Once the similarity matrix is obtained for the
relevant sentence pairs, the optimal alignment trail is selected by dynamic
programming, going through the matrix with various penalties assigned to
skipping and coalescing sentences. The score of skipping is a fixed param-
eter, learnt on our training corpus while the score of coalescing is the sum
of the minimum of the two token-based scores and the length-based score
of the concatenation of the two sentences. For performance reasons, the
dynamic programming algorithm does not take into account the possibility
of more than two sentences matching one sentence. After the optimal align-
ment path is found, a postprocessing step iteratively coalesces a neighboring
pair of one-to-many and zero-to-one segments wherever the resulting new
segment has a better character-length ratio than the starting one. With
this method, any one-to-many segments can be discovered.
The hybrid algorithm presented above remains completely meaningful
even in the total absence of a dictionary. In this case, the crude translation
will be just the source language text, and sentence-level similarity falls back
to surface identity of words.After this first phase a simple dictionary can
be bootstrapped on the initial alignment. From this alignment, the second
phase of the algorithm collects one-to-one alignments with a score above a
fixed threshold. Based only on all one-to-one segments, cooccurrences of
every source-target token pair are calculated. These, when normalized with
the maximum of the two tokens’ frequency yield an association measure.
Word pairs with association higher than 0.5 but are are used as a dictionary.
Our algorithm is similar in spirit to that of Moore (2002) in that they
both combine the length-based method with some kind of translation-based
similarity. In what follows we discuss how Moore’s algorithm differs from
ours.
Moore’s algorithm has three phases. First, an initial alignment is com-
puted based only on sentence length similarity. Next, an IBM ‘Model I’
translation model (Brown et al. 1993) is trained on a set of likely matching
sentence pairs based on the first phase. Finally, similarity is calculated us-
ing this translation model, combined with sentence length similarity. The
output alignment is calculated using this complex similarity score. Compu-
tation of similarity using Model I is rather slow, so only alignments close
to the initially found alignment are considered, thus restricting the search
space drastically.
Our simpler method using a dictionary-based crude translation model
instead of a full IBM translation model has the very important advantage
that it can exploit a bilingual lexicon, if one is available, and tune it ac-
cording to frequencies in the target corpus or even enhance it with extra
local dictionary bootstrapped from an initial phase. Moore’s method offers
no such way to tune a preexisting language model. This limitation is a real
one when the corpus, unlike the news and Hansard corpora more familiar
to those working on high density languages, is composed of very short and
heterogeneous pieces. In such cases, as in web corpora, movie captions, or
heterogeneous legal texts, average-based models are actually not close to
any specific text, so Moore’s workaround of building language models based
on 10,000 sentence subcorpora has little traction.
On top of this, our translation similarity score is very fast to calculate,
so the dictionary-based method can be used already in the first phase where
a much bigger search space can be traversed. If the lexicon resource is good
enough for the text, this first phase already gives excellent alignment results.
Maximizing alignment recall in the presence of noisy sentence segmen-
tation is an important issue, particularly as language density generally cor-
relates with the sophistication of NLP tools, and thus lower density implies
poorer sentence boundary detection. From this perspective, the focus of
Moore’s algorithm on one-to-one alignments is less than optimal, since ex-
cluding one-to-many and many-to-many alignments may result in losing
substantial amounts of aligned material if the two languages have different
sentence structuring conventions.
While speed is often considered a mundane issue, hunalign, written in
C++, is at least an order of magnitude faster than Moore’s implementation
(written in Perl), and the increase in speed can be leveraged in many ways
during the building of a parallel corpus with tens of thousands of documents.
First, rapid alignment allows for more efficient filtering of texts with low
confidence alignments, which usually point to faulty parallelism such as
mixed order of chapters (as we encountered in Arabian Nights and many
other anthologies), missing appendices, extensive extra editorial headers
(typical of Project Gutenberg), comments, different prefaces in the source
texts etc. Once detected automatically, most cases of faulty parallelism
can be repaired and the texts realigned. Second, debugging and fine-tuning
lower-level text processing steps (such as the sentence segmentation and
tokenization steps) may require several runs of alignment in order to monitor
the impact of certain changes on the quality of alignment. This makes speed
an important issue. Interestingly, runtime complexity of Moore’s program
seems to be very sensitive to the faults in parallelism. Adding a 300 word
surplus preface to one side of 1984 but not the other slows down this program
by a factor of five, while it has no detectable impact on hunalign.
Finally, Moore’s aligner, while open source and clearly licensed for re-
search, is not free software. In particular, parallel corpora aligned with it
can not be made freely available for commercial purposes. Since we wanted
to make sure that our corpus is available for any purpose, including com-
mercial use, Moore’s aligner program was not a viable choice.
4 Evaluation
In this section we describe our attempts to assess the quality of our parallel
corpus by evaluating the performance of the sentence aligner on texts for
which manually produced alignment is available. We also compare our
algorithm to Moore’s (2002) method.
Evaluation shows hunalign has very high performance: generally it
aligns incorrectly at most a handful of sentences. As measured by Moore’s
method of counting only on one-to-one sentence-pairs, precision and recall
figures in the high nineties are common. But these figures are overly op-
timistic because they hide one-to-many and many-to-many errors, which
actually outnumber the one-to-one errors. In 1984, for example, 285 of the
6732 English sentences or about 4.3% do not map on a unique Hungarian,
and 716 or 10.6% do not map on a unique Romanian sentence – similar
proportions are found in other alignments, both manual and automatic.
To take these errors into account, we used a slightly different figure of
merit, defined as follows. The alignment trail of a text can be represented
by a ladder, i.e. an array of pairs of sentence boundaries: rung (i, j) is
present in the ladder iff the first i sentences on the left correspond to the
first j sentences on the right. Precision and recall values are calculated by
comparing the predicted and actual rungs of the ladder: we will refer to
this as the ‘complete rung count’ as opposed to the ‘one-to-one count’. In
general, complete rung figures of merit tend to be lower than one-to-one
figures of merit, since the task of getting them right is more ambitious: it
is precisely around the one-to-many and many-to-one segments of the text
that the alignment algorithms tend to stumble.
condition precision recall
id 34.30 34.56
id+swr 74.57 75.24
len 97.58 97.55
len+id 97.65 97.42
len+id+swr 97.93 97.80
dic 97.30 97.08
len+dic-stem 98.86 98.88
len+dic 99.34 99.34
len+boot 99.12 99.18
Table 2: Performance of the sentence-level aligner
Table 2 presents precision and recall figures based on all the rungs of the
entire ladder against the manual alignment of the Hungarian version of
Orwell’s 1984 (Dimitrova et al. 1998).
If length-based scoring is switched off and we only run the first phase with-
out a dictionary, the system reduces to a purely identity based method
we denote by id. This will still often produce positive results since proper
nouns and numerals will “translate” to themselves. With no other steps
taken, on 1984 id yields 34.30% precision at 34.56% recall. By the simple
expedient of stopword removal, swr, the numbers improve dramatically, to
74.57% precision at 75.24% recall. This is due to the existence of short
strings which happen to have very high frequency in both languages (the
two predominant false cognates in the Hungarian-English case were a ‘the’
and is ‘too’).
Using the length-based heuristic len instead of the identity heuristic is
better, yielding 97.58% precision at 97.55% recall. Combining this with the
identity method does not yield significant improvement (97.65% precision
at 97.55% recall). If, on top of this, we also perform stopword removal, both
precision (97.93%) and recall (97.80) improve.
Given the availability of a large Hungarian-English dictionary by A.
Vonyó, we also established a baseline for a version of the algorithm that
makes use of this resource. Since the aligner does not deal with multiword
tokens, entries such as Nemzeti Bank ‘National Bank’ are eliminated, re-
ducing the dictionary to about 120k records. In order to harmonize the
dictionary entries with the lemmas of the stemmer, the dictionary is also
stemmed with the same tool as the texts. Using this dictionary (denoted by
dic in Table 2) without the length-based correction results in slightly worse
performance than identity and length combined with stop word removal.
If the translation-method with the Vonyó dictionary is combined with
the length-based method (len+dic), we obtain the highest scores 99.34%
precision at 99.34% recall on rungs (99.41% precision and 99.40% recall on
one-to-one sentence-pairs). In order to test the impact of stemming we let
the algorithm run on the non-stemmed text with a non-stemmed dictionary
(len+dic-stem). This established that stemming has indeed a substantial
beneficial effect, although without it we still get better results than any of
the non-hybrid cases.
Given that the dictionary-free length-based alignment is comparable
to the one obtained with a large dictionary, it is natural to ask how the
algorithm would perform with a bootstrapped dictionary as described in
Section 3. With no initial dictionary but using this automatically boot-
strapped dictionary in the second alignment pass, the algorithm yielded
results (len+boot), which are, for all intents and purposes, just as good as
the ones obtained from combining the length-based method with our large
existing bilingual dictionary (len+dic). This is shown in the last two lines
of Table 2. Since this method is so successful, we implemented it as a mode
of operation of hunalign.
To summarize our results so far, the pure sentence length-based method

does as well in the absence of a dictionary as the pure matching-based
method does with a large dictionary. Combining the two is ideal, but this
route is not available for the many medium density languages for which
bilingual dictionaries are not freely avaliable. However, a core dictionary can
automatically be created based on the dictionary-free alignment, and using
this bootstrapped dictionary in combination with length-based alignment
in the second pass is just as good as using a human-built dictionary for this
purpose. In other words, the lack of a high-quality bilingual dictionary is
no impediment to aligning the parallel corpus at the sentence level.
hunalign Moore’02
task
prec rec prec rec
1984-HE-S 99.22 99.24 99.42 98.56
1984-HE-U 98.88 99.05 99.24 97.39
1984-RE-U 97.10 97.98 97.55 96.14
CoG-HE-S 97.03 98.44 96.45 97.53
Table 3: Comparison of hunalign and Moore’s (2002 ) algorithm on three

texts. Performance figures are based on one-to-one alignments only
While we believe that an evaluation based on all the rungs of the ladder
gives a more realistic measure of alignment performance, for the sake of
correct comparison with Moore’s method, we present some results using the
one-to-one alignments metric. Table 3 summarizes results on Orwell’s 1984
for Hungarian–English (1984-HE-S, stemmed and 1984-HE-U, unstemmed),
Romanian–English (1984-RE-U, unstemmed), as well as on Steinbeck’s Cup
of Gold for Hungarian–English (CoG-HE-S, 80k words, stemmed) using
hunalign (with bootstrapped dictionary, no further tuning and omitting
paragraph information) and Moore’s (2002) algorithm (with the default
values).
In order to be able to compare the Hungarian and Romanian results
for 1984, we provide the Hungarian case for the unstemmed 1984. One
can see that both algorithms show a drop of performance. This makes
it clear that the drop in quality from Hungarian–English to Romanian–
English can not be attributed to the fact that we tuned our system on
the Hungarian case. As mentioned earlier, the Romanian translation has
716 non-one-to-one segments compared to the Hungarian translation’s 285.
Given both algorithm’s preference to globally diagonal and locally one-to-
one alignments, this difference in one-to-one alignments is likely to render
the Romanian–English alignment a harder task.
In order to sensibly compare our results with that of Moore’s, paragraph
information was not exploited. huntoken, the sentence tokenizer we use is
able to identify paragraph boundaries which are then used by the aligner.
Experiments showed that paragraph information can substantially improve

alignment scores: measured on the Hungarian–English alignment of Stein-
beck’s ‘Cup of Gold’, the number of incorrect alignments drop from 148 to
115.1 Therefore the figures shown in Table 3 are in no way absolute best
bisentence scores for the texts in question.
5 Conclusion
In the past ten years, much has been written on bringing modern language
technology to bear on low density languages. At the same time, the bulk
of commercial research and product development, understandably, concen-
trated on high density languages. To a surprising extent this left the medium
density languages, spoken by over half of humanity, underresearched. In this
paper we attempted to address this issue by proposing a methodology that
does not shy away from manual labor as far as the data collection step
is concerned. Harvesting web pages and automatically detecting parallels
turns out to yield only a meager slice of the available data: in the case
of Hungarian, less than 1%. Instead, we proposed several other sources
of parallel texts based on our experience with creating a 50 million word
Hungarian–English parallel corpus.
Once the data is collected and formatted manually, the subsequent steps
can be almost entirely automated. Here we have demonstrated that our
hybrid alignment technique is capable of efficiently generating very high
quality sentence alignments with excellent recall figures, which helps to get
the maximum out of small corpora. Even in the absence of any language
resources, alignment quality is very high, but if stemmers or bilingual dic-
tionaries are available, our aligner can take advantage of them.
REFERENCES
Brown, Peter F., Jennifer Lai & Robert Mercer. 1991. “Aligning Sentences
in Parallel Corpora”. 29th Meeting of the Association for Computational
Linguistics (ACL’91 ), 169-176. Berkeley: University of California.
Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra & Robert L.
Mercer. 1993. “The Mathematics of Statistical Machine Translation: Pa-
rameter Estimation”. Computational Linguistics 19:2.263-311.
Chen, Jiang & Jian-Yun Nie. 2000. “Automatic Construction of Parallel English-
Chinese Corpus for Cross-Language Information Retrieval”. 6th Conf. on
Applied Natural Language Processing, 21-28. San Francisco, Calif.
1
Although paragraph identification itself contains a lot of errors, improvement may be
due to the fact that paragraphs, however faulty, are consistent in terms of alignment.
The details of this and the question of exploiting higher-level layout information is left
for future research.
Chen, Stanley F. 1993. “Aligning Sentences in Bilingual Corpora using Lexical

Information”. 31st Conference of the Association for Computational Lin-
guistics, 9-16. Morristown, New Jersey, U.S.A.
Dimitrova, Ludmila, Tomaz Erjavec, Nancy Ide, Heiki Jaan Kaalep, Vladimir
Petkevic & Dan Tufiş. 1998. “Multext-east: Parallel and Comparable Cor-
pora and Lexicons for Six Central and Eastern European Languages”. 36th
Annual Meeting of the Association for Computational Linguistics and 17th
Int. Conf. on Computational Linguistics ed. by Christian Boitet & Pete
Whitelock, 315-319. San Francisco, Calif.: Morgan Kaufmann.
Gale, William A. & Kenneth Ward Church. 1991. “A Program for Aligning
Sentences in Bilingual Corpora”. 29th Annual Meeting of the Association for
Computational Linguistics (ACL’91 ), 177-184. Berkeley, Calif.
Grimes, Barbara, ed. 2003. The Ethnologue (14th ed.). Dallas, Texas: SIL
International.
Halácsy, Péter, András Kornai, László Németh, András Rung, István Szakadát
& Viktor Trón. 2004. “Creating Open Language Resources for Hungarian”.
Language Resources and Evaluation Conference, 203-210. Lisbon.
Martin, Joel, Howard Johnson, Benoit Farley & Anna Maclachlan. 2003. “Align-
ing and Using an English-Inuktitut Parallel Corpus”. HLT-NAACL Work-
shop: Building and Using Parallel Texts, 115-118. Edmonton, Canada.
Melamed, I. Dan. 2000. “Models of Translational Equivalence among Words”.
Computational Linguistics 26:2.221-249.
Mikheev, Andrei. 2000. “Periods, Capitalized Words, etc.”. Computational
Linguistics 28:3.289-318.
Moore, Robert C. 2002. “Fast and Accurate Sentence Alignment of Bilingual
Corpora”. 5th AMTA Conf.: Machine Translation: From Research to Real
Users, 135-244. Langhorne, Penn.: Springer.
Resnik, Philip & Noah Smith. 2003. “The Web as a Parallel Corpus”. Compu-
tational Linguistics 29:3.349-380.
Resnik, Philip. 1998. “Parallel Strands: A Preliminary Investigation into Mining
the Web for Bilingual Text”. Machine Translation and the Information Soup:
Third Conference of the Association for Machine Translation in the Americas
ed. by D.Farwell, L.Gerber & E.Hovy. Langhorne, Penn.: Springer.
Simard, Michel & Pierre Plamondon. 1998. “Bilingual Sentence Alignment:
Balancing Robustness and Accuracy”. Machine Translation 13:1.59-80.
Tiedemann, Jörg & Lars Nygaard. 2004. “The Opus Corpus - Parallel and Free”.
Language Resources and Evaluation Conference 1183-1186. Lisbon.
Trón, Viktor, György Gyepesi, Péter Halácsy, András Kornai, László Németh &
Dániel Varga. 2005. “Hunmorph: Open Source Word Analysis”. ACL 2005
Workshop on Software.
Varga, Dániel, Lászl Németh, Péter Halácsy, András Kornai, Viktor Trón & Vik-
tor Nagy. 2005. “Parallel Corpora for Medium Density Languages”. Recent
Advances in Natural Language Processing, 590-596. Borovets, Bulgaria.
The Role of Data in NLP:
The Case for Dataset Profiling
Anne De Roeck
The Open University
Abstract
There is a rich literature documenting the fact that text and docu-
ment properties affect the success of techniques in data driven Nat-
ural Language Processing and Information Retrieval. At the same
time, there is little by way of a systematic investigation of the precise
role played by data. This paper sets out the rationale for achieving
a better understanding of the role of data, and points out some un-
desirable methodological, epistemological and practical consequences
of not doing so. Some rough sparseness measures are used to show
that standard datasets differ in important ways. A case is made for
exploring the role of data by compiling dataset profiles. These would
contain measures tailored to highlight inherent bias in a collection,
reflecting its fitness to support a technique in the context of a task.
1 Data matters
It is an inescapable fact of life in Natural Language Processing (nlp) and
Information Retrieval (ir), that given some task, the performance of a tech-
nique will depend on the properties of the data on which it is deployed. A
substantial literature documents this three-way dependency between task,
data and technique performance, often highlighting it in the context of ex-
perimental evaluations. It is quite easy to find mention of text or document
properties that are believed to have an effect. For instance, breadth of do-
main coverage is a factor. “Deep” language understanding techniques are
reported to work well in narrow domains with a great deal of inherent struc-
ture (Copestake & Spärck Jones 1990), but they entail large processing costs
and so are less suited to handling large volumes of text, or for deployment
in interactive systems (Zaenen & Uzkoreit 1996). They are also too brittle
for general querying of broad, unstructured sources, such as Web search,
where “shallow” statistically-based ir techniques work well (Spärck Jones
1999). Explicit structure markers have an effect. For interactive search, hy-
perlinks do not significantly improve recall and precision in diverse domains,
such as the trec test data (Savoy & Pickard 1999, Hawking et al. 1999),
but they do in narrow domains and for searching intranets (Kruschwitz
2001). Stemming is a well established technique that does not improve
effectiveness of retrieval in general (Harman 1991), except for morphologi-
cally complex languages (Popovic & Willett 1992), and for short documents
260 ANNE DE ROECK
(Krovetz 1993). Document (and hence query) length is a factor in its own
right. Short keyword-based queries behave differently from long structured
queries (Pickens & Croft 2000), and general textbooks (Jurafsky & Martin
2000) state quite categorically that keyword-based retrieval works better on
long texts.
Whilst the role of data is very much acknowledged, a precise charac-
terisation of salient data features is not usually pursued. For instance,
whilst document length is a known factor, it remains unclear exactly what
is meant by a short document in any given set of experiments. Similarly, lit-
tle is known about how to recognise a “diverse” domain in a collection. On
the whole, papers setting out experimental results do not routinely contain
information on the datasets on which these were obtained.
As a research community with a large experimental agenda, the absence
of detailed characterisation of the role of data should worry us. In 1972, the
physicist Richard Feynman (Feynman 1972) described a cornerstone of his
notion of scientific integrity:
“. . . a principle of scientific thought that corresponds to a kind of utter
honesty–a kind of leaning over backwards. For example, if you’re doing
an experiment, you should report everything that you think might make
it invalid–not only what you think is right about it: other causes that
could possibly explain your results; and things you thought of that you’ve
eliminated by some other experiment, and how they worked–to make sure
the other fellow can tell they have been eliminated. [...] In summary, the
idea is to give all of the information to help others to judge the value of
your contribution; not just the information that leads to judgement in one
particular direction or another.” (Feynman 1992:341)
In data driven nlp and ir, text and document properties are clearly ac-
knowledged as one of these “other causes”, and consequently it would seem
necessary to engage with charting in what way they contribute to experi-
mental outcomes. However, determining such properties and reporting on
them is a non-trivial task, because there is no agreed common framework for
approaching it. Without such a framework for approaching the systematic
understanding of the role of data, we are faced with three serious, unde-
sirable consequences. The first is methodological in nature, and concerns
replicability: unless it is understood which aspect of the data give rise to
which effects, experimental outcomes cannot be replicated reliably, except
in a trivial sense by running the same experiment on the same data. The
second is epistemological: it is impossible to generalise from empirical find-
ings in the absence of a transparent framework for describing systematic,
testable relationships between data characteristics and technique perfor-
mance given some task. The third is practical: when faced with a new set
of data and some task, it is impossible to know how the dataset relates to
THE ROLE OF DATA IN NLP: DATASET PROFILING 261
known or established test collections, and hence which techniques can be

deployed effectively.
The point here is that ’there is an elephant in the room’: we know that
text and document characteristics matter, but we have not engaged sys-
tematically with charting their contribution in the context of different tasks
and techniques. In mitigation, establishing a framework for doing so is a
large and difficult task. Before making some suggestions on how it might be
approached, I will demonstrate that datasets drawn from standard reference
collections can indeed be dramatically different with respect to properties
which are known to affect the performance of nlp and ir techniques.
2 Sparseness
Given enough data, even simple frequency-based measures can show how
datasets differ significantly along dimensions we know will affect the be-
haviour of techniques and applications. Sparseness, for instance, is a well
known problem in nlp and ir. Type to token ratios (ttr) can be used as
’off-the-cuff’ sparseness indicators. Whilst they are coarse-grained, they are
cheap to calculate, by dividing the total number of words in a portion of
text by the number of terms. On the assumption that each occurrence of a
term provides evidence of its use, ttrs show, for some sample of running
text, how much evidence (in words) is present on average for every term.
Thus, sparser data have lower ttrs. Another way of interpreting the ratio
is as a very rough indicator of how far apart (in terms of running words)
evidence of new terms is spaced.
When run over standard collections, such as tipster, ttrs reveal some
interesting differences. For ease of reference, Table 1 gives a brief description
of all the datasets that will be used in this paper, including the tipster
ones. Table 2 sets out some ttrs for 7 of those datasets, calculated over text
samples of 100, 200, 800, 20,000 and 1 million words long. As expected, the
ttrs suggest that sparseness ratios drop with sample length: large datasets
are less likely to be sparse. However, they also show that there are significant
differences between datasets in the tipster collection. For instance, at one
million words, the U.S. Patents corpus (pat) will on average supply 62 words
worth of evidence for every term (or, in an alternative reading of the ttr,
evidence of a new term will crop up, on average, about 62 words apart).
A one million word section of the San Jose Mercury corpus (sjm) is much
sparser, with only 26 words worth of evidence for every term. This suggests
that techniques that are sensitive to sparseness might do less well when
confronted with one million words of sjm text than they might on the same
amount of pat text. Importantly, the ttrs in Table 2 further suggest there
are significant differences between languages. Comparing newspaper text,
262 ANNE DE ROECK
the ratio for one million words of Arabic is 8.25, which appears dramatically
worse than that of text drawn from the San Jose Mercury (26.38). Looking
at more balanced corpora, the ratios for the Bengali corpus are much sparser
than the overall tipster ratios.
Data Set Contents Collection size
TIPSTER Diverse English language reference dataset; has sub-collections.
AP TIPSTER. Copyrighted AP Newswire stories from 1989. 114,438,101
DOE TIPSTER. Short abstracts from the US Department of Energy. 26,882,774
FR TIPSTER. US government reports from Federal Register (1989). 62,805,175
PAT TIPSTER. US Patent Documents for the years 1983-1991. 32,151,785
SJM TIPSTER. Copyrighted stories from 1991 San Jose Mercury News. 39,546,073
WSJ TIPSTER. Stories from Wall Street Journal 1987-89 41,560,108
ZF TIPSTER. Computer Select disks 1989/90, Ziff-Davis Publishing 115,956,732
OU Dataset drawn from Open University Intranet. 39,807,404
Arabic Arabic newspaper articles drawn from Al-Hayat newspaper. 18,639,264
Bengali Modern balanced Bengali corpus - Institute of Indian Languages 3,052,522
Table 1: Summary description of datasets and collection size (in words)
Of course, ttrs have severe limitations as sparseness indicators. Nonethe-

less they should be sufficiently informative to alert us to the possibility
of encountering differences in the behaviour of a sparseness-sensitive tech-
nique on raw text drawn from different genres, and from different languages.
Given that sparseness is a widely acknowledged factor, it is perhaps puzzling
that we do not know more about its relative effects in such circumstances.
Are there ways of measuring sparseness in such a way we can predict its
impact? How does sparseness interact with other factors, and can these be
measured? Importantly, how much data of which kind is needed for boot-
strapping applications successfully in languages where structured resources
are not available? Given that sparseness appears, at first sight, to be sensi-
tive to genre, does it make sense to compare experimental outcomes without
understanding the type of textual data from which they were obtained?
Sparseness is only one factor, and similar questions can be raised about
other factors. The general point remains the same: in nlp and ir, can we
sensibly generalise from experimental outcomes without understanding the
way in which they have been affected by the data? If the answer is that we
cannot, the challenge will be how to approach the task of charting the role
of data in a useful way.
3 Profiling collection bias

There are different ways in which an investigation into the role of data might
be tackled. Both in nlp and ir, a large amount of experimental findings is
available, and it is important that the approach taken should supplement
rather than invalidate past work. One option would be to acknowledge that
a dataset, or collection, has an inherent “bias”, which determines its fitness
in supporting a technique in the context of some task. Measures can be

devised that highlight each bias, relative to technique and task. These mea-
sures can be combined into a dataset “profile”. Making profiles available has
methodological as well as practical benefits (De Roeck et al. 2004). They
can be worked up and published for standard reference collections, such as
the trec datasets, effectively presenting an opportunity for benchmarking.
They can be used alongside the large body of existing, published results ob-
tained on such collections, adding detail that enhances their interpretation,
so they would make it possible to aggregate across studies. New datasets
can be positioned in relation to standard collections by running appropriate
measures on them. Where new data are used in experiments, new profiles
can be drawn up and reported alongside findings. In practical settings, pro-
files can help developers in understanding the type of the collection they
are working with, and assist in selecting the most appropriate or effective
techniques for an application.
Sample length TIPSTER PAT SJM OU Arabic Bengali
100 1.40 1.31 1.43 1.47 1.19 1.20
200 1.60 1.54 1.61 1.70 1.43 1.39
800 2.32 3.06 2.03 2.62 1.58 1.86
20,000 6.46 11.03 4.46 6.94 2.87 5.21
1,000,000 38.48 62.64 26.38 36.13 8.25 10.81
Table 2: Type to token ratios (TTR) for six datasets
The notion of investigating collection characteristics is not new. As early

as 1973, Karen Spärck Jones looked at collection properties influencing au-
tomatic indexing, or term classification performance. Salton & Buckley
(1988), include collection statistics on the datasets that informed their rec-
ommendations on the effectiveness of different term weighting techniques.
There are other early examples where collection properties were published
alongside experimental results, but the practice has disappeared, and no
systematic approach was ever developed. Much early work had to contend
with the combined effects of small datasets and limited processing capability,
which affected the choice of measures. Both of these constraints have dis-
appeared. For comparison, Table 3 includes a row showing the corpus size,
measured in words, for some of the tipster sub-collections. The smallest,
the U.S. Department of Energy abstracts (doe) contains over 26 million
words. In contrast, dataset sizes in (Spärck Jones 1973) ranged between
2,713 and 6,574 “postings”, or words. (Salton & Buckley 1988) collections
ranged rougly between 67,000 and 534,000 words. Processing power makes
it possible to move away from measures that rely primarily on basic term
counting. (Sarkar et al. 2004), for instance, introduces a Baysian approach
264 ANNE DE ROECK
to modelling term distributions that may be useful in highlighting salient

differences between collections. Measures such as these are computationally
expensive, but have now come within our reach.
4 Measures for profiling

Useful profiling measures have to tell us something relevant about the data
in the context of a task or an application. They have to be sufficiently di-
verse and fine-grained to allow complex profiles that reflect combinations of
a range of relevant features. They have to be cheap to implement and run,
so they can be used in practical development over large datasets. Where
dataset properties are investigated (Spärck Jones 1973, Salton & Buckley
1988), the starting point tends to be a collection of vital statistics reflecting
document and collection size, together with basic term frequency data. Ta-
ble 3 shows how information of this kind can highlight differences between
collections.
Dataset AP DOE FR PAT SJM WSJ
No of Docs 242,918 226,086 45,820 6,711 90,257 98,732
No of Words 114,438,101 26,882,774 62,805,175 32,151,785 39,546,073 41,560,108
Av. Word/Doc 471 119 1371 4791 438 421
No of Terms 347966 179,310 157,313 146,943 178,571 159,726
Av Term/Doc 238 73 293 653 224 204
Min Doc 9 1 2 73 21 7
Max Doc 2,944 373 387,476 74,964 10,393 7,992
Table 3: Statistics for TIPSTER sub-collections: number of docs, corpus

size in words, avg. doc. length in words, number of distinct terms, avg.
number of terms per doc., length of shortest and longest docs
Simple, frequency-based profiles of this kind carry some useful information.
For example, document length is known to be a factor in key-word based
retrieval. On the other hand, measures such as these are not particularly
fine-grained, and there are others that might be more suited to bring out
salient differences between datasets. Profiling is related to detecting simi-
larity in the context of a task or technique, and there are some established
measures and methodologies that might be used as a starting point. (Rose &
Haddock 1997), for example, are interested in extending training data for
speech processing and use homogeneity measures to investigate whether two
collections are sufficiently similar for that purpose. (De Roeck et al. 2004)
further suggests profiling with homogeneity measures and focusing on the
behaviour of very frequent terms. These are useful for benchmarking be-
cause they are less subject to sparseness, and they occur abundantly in
virtually all collections, presenting a common point of comparison across
datasets. (De Roeck et al. 2006) in this volume uses a fine-grained Baysian
term distribution model and demonstrates that very frequent terms do be-
have significantly differently in different datasets. Recent work on genre
detection has used very frequent function word distribution, which further
strengthens their position as a suitable starting point for the exploration of
dataset profiling measures for genre-sensitive tasks and techniques.
5 Conclusion
Data matters, and there are significant differences between text and doc-
ument collections that impact on the behaviour of techniques. There are
many compelling reasons to investigate the role of dataset characteristics
on experimental results in nlp and ir. In order to do so, a framework is
required to guide systematic exploration of the influence of data character-
istics. One way to approach such a framework is to develop measures that
reflect inherent bias in a collection, by highlighting its fitness to support a
technique in the context of a task. Such measures can be combined into
dataset profiles which can be published. This would make benchmarking
possible and has the advantage of adding a level of detail to the interpreta-
tion of existing results for standard reference collections, introducing oppor-
tunities for aggregating across existing studies. It is important to select mea-
sures that are informative, and sufficiently fine-grained to bring out salient
characteristics. Early attempts at investigating collection properties were
hampered by lack of data and the limitations of computational resources.
These limitations are far less critical today, and apart from vast amounts of
data, it is now possible to expand the battery of standard frequency-based
measures with others that are computationally more demanding.
Acknowledgements. I became acutely aware of the need to get a firmer
handle on the role of data when I started working on natural language interfaces
to intranets and other extreme datasets, such as the Yellow Pages, with Udo
Kruschwitz, Nick Webb and John Carson. The notion of dataset profiling arose
directly from the need to develop reliable datasets to drive experiments on Arabic,
at a time when there were none. This I first explored with Waleed Al-Fares
and Abduelbaset Goweder. The idea of term distribution measures as a profiling
technique for data arose during work with Nikos Nanas and Victoria Uren, on user
profiles in multi-topic information filtering. Profiling with homogeneity measures,
and the use of fine-grained Bayesian models of term distribution builds on work
with Avik Sarkar and Paul Garthwaite, who deserve very special thanks, as does
Marian Petre for many helpful comments. I am indebted to all of these colleagues.
REFERENCES
Copestake, Anne & Karen Spärck Jones. 1990. “Natural Language Interfaces to
Databases”. Knowledge Engineering Review 5:4.225-249.
De Roeck, Anne, Avik Sarkar & Paul Garthwaite. 2004. “Frequent Term Dis-
tribution Measures for Dataset Profiling”. 4th International Conference of
266 ANNE DE ROECK
Language Resources and Evaluation (LREC ), 1647-1650. Lisbon. Portugal.

De Roeck, Anne, Avik Sarkar & Paul Garthwaite. 2006. “Function Words do not
Distribute Homogeneously”. Recent Advances in NLP, vol.4. ed. by Nicolas
Nicolov et al. Amsterdam & Philadelphia: John Benjamins. [this volume]
Feynman, Richard P. 1992. Surely You’re Joking Mr. Feynman!. New York:
Vintage.
Harman, Donna. 1991. “How Effective is Suffixing?”. Journal of the American
Society for Information Science 42:1.7-15.
Hawking, David, Ellen Voorhees, Nick Craswell, & Peter Bailey. 1999. “Overview
of the TREC-8 Web Track”. 8th Text Retrieval Conference (TREC-8 ), 131-
148. Gaithersburg, Maryland.
Jurafsky, Daniel & James H. Martin. 2000. Speech and Language Processing.
New Jersey: Prentice Hall.
Krovetz, Robert. 1993. “Viewing Morphology as an Inference Process”. 16th
Annual ACM Conference on Research and Development in Information Re-
trieval (SIGIR93 ), 191-202. Pittsburgh, Pennsylvania.
Kruschwitz, Udo. 2001. “Exploiting Structure for Intelligent Web Search”. 34th
Hawaii Int. Conf. on System Sciences (HICSS ), vol. 4. Maui, Hawaii. IEEE
paper #4010.
Pickens, Jeremy & Bruce W. Croft. 2000. “An Exploratory Analysis of Phrases in
Text Retrieval”. Recherche d’Informations Assistée par Ordinateur (RAIO ):
Content-based Multimedia Information Access. Paris, France.
Popovic, Mirko & Peter Willett. 1992. “The Effectiveness of Stemming for
Natural-Language Access to Slovene Textual Data”. Journal of the American
Society for Information Science 43:5.384-390.
Rose, Tony & Nick Haddock. 1997. “The Effects of Corpus Size and Homogene-
ity on Language Model Quality”. ACL-SIGDAT Workshop on Very Large
Corpora, 178-191. Beijing and Hong Kong.
Salton, Gerard & C. Buckley. 1988. “Term Weighting Approaches in Automatic
Text Retrieval”. Information Processing and Management 24:5.513-523.
Savoy, Jacques & Justin Picard. 1999. “Report on the TREC-8 Experiment:
Searching on the Web and in Distributed Collections”. 8th Text Retrieval
Conference (TREC-8 ), 229-241. Gaithersburg, Maryland.
Spärck Jones, Karen. 1973. “Collection Properties Influencing Automatic Term
Classification Performance”. Information Storage and Retrieval 9.499-513.
Spärck Jones, Karen. 1999. “What is the Role of NLP in Text Retrieval”. Natural
Language Information Retrieval, ed. by T. Strzalkowski, 1-25. Dordrecht:
Kluwer.
Zaenen, Annie & Hans Uszkoreit. 1996. “Language Analysis and Understand-
ing”. Survey of the State of the Art in Human Language Technology ed. by
R. Cole, 109-110. Cambridge: Cambridge University Press.
Even Very Frequent Function Words Do Not
Distribute Homogeneously
Anne De Roeck∗ , Avik Sarkar∗ & Paul H. Garthwaite∗∗
∗
Centre for Research in Computing, ∗∗ Department of Statistics
The Open University, Milton Keynes, MK7 6AA, UK
Abstract
We have known for some time that content words have “bursty” dis-
tributions in text. In contrast, much of the literature assumes that
function words are uninformative and they distribute homogeneously.
We describe two sets of experiments showing that assumptions of
homogeneity do not hold, even for the distribution of extremely fre-
quent function words. In the first experiment, we investigate the
behaviour of very frequent function words in the tipster collection
by postulating a “homogeneity assumption”, which we then defeat in
a series of experiments based on the χ2 test. Results show that it is
statistically unreasonable to assume homogeneous term distributions
within a corpus. We also found that document collections are not
neutral with respect to the property of homogeneity, even for very
frequent function words. In the second set of experiments, we model
the gaps between successive occurrences of a particular term using
a mixture of exponential distributions. Where the “homogeneity as-
sumption” holds these gaps should be uniformly distributed across
the entire corpus. Using the model we demonstrate that gaps are not
uniformly distributed, and even very frequent terms occur in bursts.
1 Introduction
Some areas of statistical Natural Language Processing (nlp), and Informa-

tion Retrieval (ir) adopt the “bag of words” model for text — i.e., they
assume that terms in a document occur independently of each other. In
spite of numerous drawbacks (Franz 1997), this model has been used exten-
sively, largely because it makes the application of standard mathematical
and statistical techniques very convenient. At the same time, it is widely
accepted that the term independence assumption is wrong, and that words
do not occur independently of each other.
The actual extent to which the occurrence of terms depend on each other
is relatively unexploited. There is a growing literature which investigates
term dependency between content words. Church (2000) describes “bursti-
ness” in the distribution of content words in documents — i.e., the fact that
repeated occurrences of an informative word in a document tend to cluster
268 A. DE ROECK, A. SARKAR, P. H. GARTHWAITE
together. In contrast, the distribution patterns of function words have re-

ceived less attention. Typically, function words are assumed to distribute
evenly throughout text, but are often used for applications of genre iden-
tification (Stamatatos et al. 2000) and authorship attribution (Argamon &
Levitan 2005). Katz (1996), for instance, develops a model for bursty dis-
tributions of “concept” terms, and distinguishes these from function words
on the basis that function words are distributed homogeneously.
This view of function words as general background noise is consistent
with their removal through stop lists or frequency thresholds in many ap-
plications. More sophisticated approaches, however, show that stop word
removal based on collection specific distribution patterns leads to improved
performance in text categorization (Wilbur & Sirotkin 1992, Yang & Wilbur
1996). This constitutes some evidence that function words perhaps do not
distribute quite that homogeneously throughout all text.
In short, the statistical nlp and ir literatures sustain a “homogeneity
assumption” in two respects. First, it is adopted as a consequence of the
“bag of words” model. Term independence is related to homogeneity in
term distribution: terms that occur independently (randomly) distribute
homogeneously. We know this is not the case for content words (Church
2000). Second, the assumption is adopted indirectly in the treatment of
function words, which are seen as uninformative precisely because they are
taken to distribute homogeneously. This is the assumption this paper aims
to contest.
We aim to show that the homogeneity assumption does not generally
hold, not just for content words, but also for the distribution of very fre-
quent function words. These results are significant in their own right be-
cause they demonstrate that it is statistically unreasonable to assume that
function word distribution within a corpus is homogeneous. In addition, we
show that data-sets and document collections display different homogene-
ity characteristics in the distribution of very frequent function words. The
homogeneity assumption is defeated substantially for collections known to
contain similar documents, and even more drastically for diverse collections.
We devise two sets of experiments to test this homogeneity assumption
using the tipster collection. In the first method, we start by postulating
the homogeneity assumption: that very frequent function words distribute
homogeneously in corpus text. We use the χ2 test (including the p-value)
to relate a notion of homogeneity to a level of statistical significance. We
also explore different ways of partitioning the datasets and measuring ho-
mogeneity.
In the second set of experiments, we study the gaps between successive
occurrences of some very frequent function words. Here we examine two
alternatives for modeling these gaps using exponential distribution. The
FUNCTION WORDS DO NOT DISTRIBUTE HOMOGENEOUSLY 269
first is based on the “bag of words” assumption that very frequent terms
are uniformly distributed and the gaps between successive occurrences of a
particular term are generated from a single exponential distribution. This
is in contrast to the second alternative, which assumes that terms occur in
bursts and the gaps between successive occurrences of a term are generated
from a mixture of two exponential distributions, one reflecting the rate of
occurrence of the term in the corpus and the other reflecting the rate of
re-occurrence after it has occurred recently (Sarkar et al. 2005a).
2 Experimental framework
2.1 χ2 based homogeneity

We adopted the methodology outlined in Kilgarriff (1997) for measuring
homogeneity in a corpus by measures of similarity. Specifically, he casts
homogeneity as internal similarity of distributions, between two halves of
a document collection as measured by the χ2 statistic. We adopted this
methodology as it is found to perform well in comparative experiments
(Rose & Haddock 1997, Cavagli 2002) as long as certain conditions are met
(Dunning 1993). However, our aim of investigating homogeneity in frequent
term distribution requires a more fine-grained tool than simple use of the χ2
statistic as a measure. We use the χ2 test, where a p-value is obtained. A p-
value < 0.05 would mean that non-homogeneity between the two partitions
is statistically significant (see De Roeck et al. 2004 for details).
Different partitions of a corpus may affect the outcome of similarity
based experiments. For instance, assigning one-word chunks to random
halves would inject a high degree of randomness in the data and destroy all
evidence of term dependence. In that case, we would expect our experiments
to be unable to defeat the homogeneity assumption. On the other hand,
repetition of Kilgarriff (1997) and others (dissolving document boundaries
and placing successive chunks of 5000 words in each partition) found re-
sounding evidence of heterogeneity between the distributions. This leads
to the following questions. (1) Do very frequent function words distribute
homogeneously across document boundaries? and (2) Do very frequent
function words distribute homogeneously throughout the same document?
We try to answer each of these questions by partitioning the collection in
different ways (De Roeck et al. 2004): (1) Choose a document and assign it
at random to either of two partitions (the docDiv experiment). (2) Split
each document in the middle, and randomly assign one half to either of
the partitions, and the other half to the other partition (the halfdocDiv
experiment).
2.2 Modeling gaps

The gaps between successive occurrences of a term is modeled based on a
mixture of exponential distributions (Sarkar et al. 2005a). The model as-
sumes that the term occurs at some low underlying base rate 1/λ1 but, after
the term has occurred, then the probability of it occurring soon afterwards
is increased to some higher rate 1/λ2 . Specifically, the rate of re-occurrence
is modeled by a mixture of two exponential distributions:
• The exponential component with larger mean (average), 1/λ1 , deter-
mines the rate with which the particular term will occur if it has not
occurred before or it has not occurred recently.
• The second component with smaller mean (average), 1/λ2, determines
the rate of re-occurrence in a document or text chunk given that it
has already occurred recently. This component captures the bursty
nature of the term in the text (or document).
The mixture model for a gap x is described as follows:
φ(x) = pλ1 e−λ1 x + (1 − p)λ2 e−λ2 x
where p and (1 − p) denote, respectively, the probabilities of membership

for the first and the second exponential distribution.
Now, if the “bag of words” homogeneity assumption is correct, then
the above mixture model will be over-parameterized, as the gaps will be
generated from a single exponential distribution. Then one of the following
conditions must hold so as to dissolve one of the mixture components and
end up with a single exponential distribution. These conditions are:
• p = 0 or p = 1
• λ1 = λ2
We first model the gaps based on a mixture of exponential distributions and
then investigate the above claims with respect to the model.
2.3 Data
We choose the tipster collection for our experiments because the dataset
is of good quality, it is well understood, and it contains a range of different
genres (Table 1). We assembled some basic profiling data on these datasets.
In Table 1, we list type to token ratios at 10 million words for each dataset.
These ratios are calculated by dividing the number of words by the number
of unique terms. They give a rough appreciation of the breadth of coverage,
to the extent where breath of terminology can reflect this. The value is
an indication of the average number of “old” words between occurrences of
“new” words in running text.
Data Set Contents of the documents 10 Most Frequent ADL TTR

Terms
ap Copyrighted AP Newswire stories from the of to a in and said 471.1 106.84
1989. for that on
doe Short abstracts from the Department of the of and in a to is 119.0 94.7
Energy. for with are
fr US government reports from Federal the of to and a in for 1,370.7 144.8
Register (1989). or that be
pat US Patent Documents for the years the of a and to in is 4,790.9 134.1
1983-1991. for said as
sjm Copyrighted stories from the San Jose the a of to and in for 438.1 102.2
Mercury News (1991). that is san
wsj Stories from Wall Street Journal 1987-89 the of to a in and that 420.9 116.2
for is said
zf Computer Select disks 1989/90, Ziff- the and to of a in is 395.6 121.8
Davis Publishing for that with
Table 1: Contents of each of the datasets along with top terms, Average
Document Length (ADL) and Type-to-Token Ratios (TTR )
We inspected the 100 most frequent terms from each of the datasets by
hand. Very frequent terms are not always function words, and lists were
sensitive to the domain of some datasets1 . The 10 most frequent terms
(Table 1), nonetheless showed a high degree of overlap, and were clearly
function words. Focusing on these ten terms in experiments across the
different collections should yield information on the behaviour of a small
collection of very frequent function words.
3 Experimental results
3.1 Homogeneity experiments
Experimental results for the χ2 based homogeneity experiments are shown

in Table 2. The top value in each cell shows the Chi-square By Degrees of
Freedom (cbdf) and the bottom value the p-value. Both are averaged over
iterations. Bold cells indicate cases where the homogeneity assumption
has survived the test (p-value > 0.05). In Kilgarriff (1997), inclusion of
the most frequent terms means that the behaviour of function words will
dominate the outcome of experiments, and that the cbdf measure examines
mostly stylistic homogeneity. Here, to allow more detailed tracking of the
distribution of very frequent terms, we calculated results at different values
for N.
1
For example, section in position 19 in fr, software in position 21 in zf, and invention
in position 26 in pat
3.1.1 docDiv experiment

The docDiv experiment maintains document boundaries and assigns whole
documents randomly to either partition: it investigates homogeneity across
documents in a collection. Table 2 shows that the homogeneity assumption
is defeated (p-value < 0.05) quite readily. In the ap and doe datasets the
assumption cannot be defeated for the 10 and 20 most frequent terms, and
in the wsj and sjm datasets for the 10 most frequent terms. All the other
datasets show heterogeneity with statistical significance, with p-values of 0
or close to it (very strong evidence against the homogeneity null-hypothesis).
cbdf values provide further insight. In most cases, these are very large, and
associated with very low p-values, indicating high levels of non-homogeneity
in the distribution of frequent words between documents. This is possibly
an indicator of high stylistic variance in a collection.
docDiv halfdocDiv
N most frequent terms N most frequent terms
DataSet 10 20 50 100 10 20 50 500
ap 2.11 1.58 2.58 2.29 1.77 1.47 1.27 1.17
0.12 0.21 0 0 0.09 0.12 0.06 0.02
doe 1.17 1.45 1.76 1.98 0.73 0.93 1.04 1.06
0.46 0.16 0.03 0 0.66 0.53 0.37 0.20
fr 54.52 41.72 72.09 66.79 7.91 9.55 11.64 8.85
0 0 0 0 0 0 0 0
pat 21.07 29.32 62.49 55.35 20.36 15.57 11.89 7.69
0 0 0 0 0 0 0 0
sjm 3.60 2.77 3.23 2.98 1.32 1.57 1.47 1.33
0.12 0 0 0 0.38 0.39 0.11 0
wsj 2.36 2.66 2.36 2.32 1.56 1.62 1.30 1.24
0.178 0 0 0 0.28 0.25 0.26 0.02
zf 11.95 8.13 6.91 6.58 1.95 1.86 1.61 1.56
0 0 0 0 0.13 0.12 0.02 0
Table 2: docDiv and halfdocDiv Results
3.1.2 halfdocDiv experiment

This experiment (Table 2) is sensitive to within-document homogeneity, as-
signing different halves of each document to each of the partitions. Again,
the homogeneity assumption was defeated at some point for most datasets.
The exception is the DOE collection where the null-hypothesis remains un-
defeated for the 20,000 most frequent words. This dataset contains very
short documents, each unlikely to deal with more than one topic. In stark
contrast, the null hypothesis was resoundingly defeated even for the 10 most
frequent terms for the fr and pat datasets, with very low p-values, and
comparatively high cbdf. The fr and pat datasets have the longest aver-
age document lengths of the tipster collection. In addition, they appear
the most diverse, with by far the highest type to token ratios (Table 1).
Generally speaking, the experiment finds statistically relevant heterogeneity

much more often than the earlier docDiv experiment. Also, cbdf values are
much lower here than in the corresponding docDiv table (with the exception
of the pat and fr collections). Comparing docDiv and halfdocDiv exper-
iments suggests that very frequent terms distribute more homogeneously
within documents than across document boundaries, but that document
length may be a significant factor.
3.2 Modeling gaps

We model the gaps between successive occurrences of a particular term using
a mixture of exponential distributions (Sarkar et al. 2005a). Modeling was
based on a Bayesian framework which enables complex models to be fitted
(Gelman et al. 1995). The model provides estimates of the mean of each of
the exponential distributions (λe1 and λe2 ) and estimates of the probability
of a gap being generated from each of these distributions (e p and 1 − pe).
To examine the homogeneity assumption, we have to investigate if any
of the following claims are true: pe = 0 or pe = 1 or λe1 = λe2 .
The validity of any of these claims would reduce the mixture model to a
single component exponential distribution, which would be consistent with
the assumption of homogeneity. We constructed the mixture models for the
10 most frequent terms in Table 1, and the parameter estimates for terms
the, of and said are in Table 3. For the term the (Table 3), λe1 and λe2 are
the of said
DataSet pe f1
λ f2
λ pe f1
λ f2
λ pe f1
λ f2
λ
ap 0.59 16.58 16.11 0.65 38.37 36.44 0.04 696.38 69.01
doe 0.29 20.49 12.72 0.62 21.10 19.72 0.67 61349.69 12224.94
fr 0.01 194.89 13.47 0.02 106.25 24.01 0.84 26385.22 392.62
pat 0.03 58.96 10.61 0.03 73.42 21.82 0.06 2167.32 13.10
sjm 0.02 168.52 17.80 0.04 205.38 39.45 0.16 2499.38 92.42
wsj 0.70 17.46 17.00 0.42 36.91 35.39 0.12 1608.49 72.62
zf 0.10 67.80 18.39 0.01 262.47 46.51 0.42 8810.57 177.21
Table 3: Parameter estimates of terms ‘the’, ‘of ’ and ‘said’
very similar in the ap and wsj datasets and pe is close to 0 in the fr, pat
and sjm datasets, so in these datasets the may distribute homogeneously.
In the doe and zf datasets, however, pe is near neither 0 nor 1, and λe1 and
λe2 differ markedly, so the does not appear to distribute homogeneously in
these two datasets. Similarly, the term of has very similar values of λe1 and
λe2 for ap, doe and wsj datasets and pe is close to 0 for the fr, pat, sjm
and zf datasets. The datasets provide little evidence against homogeneity
either based on the values of λe1 and λe2 or pe. In contrast, the term said
shows evidence of homogeneity only for the ap dataset, for which the value
of pe is close to 0.
To investigate the homogeneity assumption for other common terms, we
calculate the ratio between the two e λs, λe1 /λe2 and study how close this ratio
e e
is to 1. A λ1 /λ2 ratio of 1 indicates that the two exponential distributions
have equal means, and hence reduce to a single exponential distribution.
A large deviation of the λe1 /λe2 ratio from 1 reveals the presence of two
very distinct exponential distributions and provides evidence against the
homogeneity assumption of the term’s distribution in the corpus, provided
the value of pe is not close to either 0 or 1. If the value is very close to 0
or 1 (a difference of less than 0.05) we argue that one of the exponential
distributions has negligible effect and there is little evidence against the
term being homogeneously distributed. Table 4 provides the λe1 /λe2 ratio
and the values of pe for the most frequent terms of each of the datasets.
In the table ratios of λe1 /λe2 that are less than 1.2 are given in bold-face
type, as are values of pe that are below 0.05 or above 0.95. Combinations are
underlined when either of these is in bold. For terms with underlined values,
the model does suggests the assumption of homogeneity is not violated, but
the assumption seems poor for the terms that are not underlined in the
table. It can be observed from Table 4 that only the term of show signs
of being homogeneously distributed across all the datasets either based on
the λe1 /λe2 ratio or values of pe being close to 0. The terms and, are, the and
to also seem to be homogeneously distributed across many of the datasets.
The other 12 terms in Table 4 only appear to be homogeneously distributed
in at most 2 of the 7 datasets. Said is an interesting term in the table. It has
Term AP DOE FR PAT SJM WSJ ZF
a 1.9 (0.17) 3.6 (0.46) 3.8 (0.23) 3.3 (0.15) 2.3 (0.10) 2.0 (0.14) 2.5 (0.10)
and 1.1 (0.30) 1.1 (0.53) 2.7 (0.11) 3.6 (0.02) 3.6 (0.07) 1.1 (0.14) 1.1 (0.10)
are 1.1 (0.69) 1.1 (0.32) 3.0 (0.07) 5.0 (0.33) 10.4 (0.01) 1.1 (0.69) 1.2 (0.47)
as 31.6 (0.93) 31.5 (0.93) 3.9 (0.45) 4.5 (0.24) 40.8 (0.90) 65.1 (0.90) 56.4 (0.91)
be 3.0 (0.73) 1.3 (0.49) 6.1 (0.13) 5.7 (0.27) 3.5 (0.64) 2.1 (0.33) 2.2 (0.27)
for 2.0 (0.29) 3.2 (0.54) 4.4 (0.05) 4.1 (0.26) 2.6 (0.55) 1.9 (0.31) 15.9 (0.01)
in 2.2 (0.13) 2.5 (0.17) 7.1 (0.02) 4.0 (0.05) 2.9 (0.10) 1.9 (0.22) 2.8 (0.08)
is 2.9 (0.58) 4.6 (0.35) 3.8 (0.19) 5.3 (0.07) 4.0 (0.34) 2.4 (0.34) 5.7 (0.02)
of 1.1 (0.65) 1.1 (0.62) 4.4 (0.02) 3.4 (0.03) 5.2 (0.04) 1.1 (0.42) 5.6 (0.01)
on 1.9 (0.31) 5.7 (0.72) 4.7 (0.21) 5.9 (0.25) 2.6 (0.46) 1.9 (0.55) 2.5 (0.22)
or 41.5 (0.95) 3.7 (0.48) 6.8 (0.36) 9.9 (0.28) 8.6 (0.81) 4.6 (0.71) 3.6 (0.78)
said 10.1 (0.04) 5.0 (0.67) 67.2 (0.84) 165.3 (0.06) 27.0 (0.16) 22.1 (0.12) 49.7 (0.42)
san 112.4 (0.92) 21.1 (0.74) 579.3 (0.93) 855.8 (0.93) 14.4 (0.46) 149.6 (0.96) 90.6 (0.80)
that 2.7 (0.16) 1.2 (0.59) 4.8 (0.15) 4.7 (0.21) 3.4 (0.20) 2.5 (0.11) 4.3 (0.04)
the 1.0 (0.59) 1.6 (0.29) 14.5 (0.01) 5.5 (0.03) 9.5 (0.02) 1.0 (0.70) 3.7 (0.10)
to 3.2 (0.10) 1.2 (0.56) 12.5 (0.01) 3.8 (0.05) 6.8 (0.02) 1.1 (0.41) 3.2 (0.04)
with 2.6 (0.40) 2.4 (0.28) 3.5 (0.23) 3.6 (0.24) 2.7 (0.64) 1.3 (0.45) 2.5 (0.12)
Table 4: Values of λe1 /λe2 ratio and p for the frequent terms
very high values of the λe1 /λe2 ratio and the values vary over a huge range.
Also, the value of pe for said is close to 0 for the ap dataset. This is because
the term said has a huge dependence on the document’s content and style,
and these characteristics can be explored and studied by modeling the gaps.
It is this property that allows the model to be used in characterizing genre
and stylistic features (Sarkar et al. 2005b). The term san is an outlier in
the list, as it is not a function word. But it featured in the list of top 10
terms in the sjm (stories from San Jose Mercury news) collection, being a
very widely used term in that collection. As expected, based on the model,
it is a very rare term, as indicated by a large rate of occurrence λe1 , and it’s
bursty nature is indicated by small values of λe2 , leading to large values of
the λe1 /λe2 ratio. Also, in contrast, the λe1 /λe2 ratio for the sjm collection,
is relatively small when compared to the values of the other collections,
demonstrating the fact that the term san is a non-informative term in sjm,
relative to the other collections.
The term as is also quite interesting, as it exhibits large values of λe1 /λe2
ratio in all the collections other than fr and pat. pat has comparatively
large values of λe1 /λe2 ratio for most of the other terms, indicating the fact
that the term has dependence on the content, style and structure of the
document and collection. Under such circumstances will it be appropriate
to apply a generic “stop-word” list for a collection of any documents? Based
on the above experiments, we believe that the answer is “no”.
4 Conclusion
Our homogeneity experiments indicate that very frequent function words

do not distribute homogeneously in general, across different documents,
even when those documents are of the same, or a related genre (docDiv).
They also show that such words do distribute more homogeneously within
document boundaries, but that this behaviour is highly sensitive to doc-
ument type, and may well depend on factors related to document length,
and breadth of domain coverage per document. They demonstrate that the
same very frequent function words take on very different distribution pat-
terns in different collections, even where such collections belong to related
genres. (halfdocDiv).
We further investigated the 10 most frequent terms of each of the collec-
tions by modeling the gaps between successive occurrences of a particular
term based on a mixture of two exponential distributions. One of these
distributions measure the inherent rate of occurrence of the term in the
corpus, and the other measures the rate of re-occurrence after the term has
occurred recently. The experiments demonstrate that terms distribute in
this pattern, as compared to a “bag of words” homogeneity model where a
single exponential distribution would be sufficient. Our experiments reveal
that terms do occur in bursts, including most of the very frequent ones.
REFERENCES
Argamon, Shlomo & Shlomo Levitan. 2005. “Measuring the Usefulness of Func-
tion Words for Authorship Attribution”. Proceedings of the 2005 Conference
of the Association for Computing in the Humanities & Literary and Linguis-
tic Computing. Canada.
Cavaglia, Gabriel. 2002. “Measuring Corpus Homogeneity Using a Range of
Measures for Inter-document Distance”. 3rd Int. Conference on Language
Resources and Evaluation (LREC), 426-431. Spain.
Church, Ken. 2000. “Empirical Estimates of Adaptation: The Chance of Two
Noriega’s Is Closer to p/2 than p2 ”. 18th International Conference on Com-
putational Linguistics (COLING-2000 ), 173-179. Saarbrücken. Germany.
De Roeck, Anne, Avik Sarkar & Paul H. Garthwaite. 2004 “Defeating the Ho-
mogeneity Assumption”. Proceedings of 7th International Conference on the
Statistical Analysis of Textual Data (JADT ), 282-294. De Louvain, Belgium.
Dunning, Ted. 1993. “Accurate Methods for the Statistics of Surprise and Coin-
cidence”. Computational Linguistics 19:1.61-74.
Franz, Alexander. 1997. “Independence Assumptions Considered Harmful”. Pro-
ceedings of the European Chapter of the ACL (EACL’97 ) 182-189. Spain.
Gelman, Andrew, J. Carlin, H.S. Stern & D.B. Rubin. 1995. Bayesian Data
Analysis. London: Chapman and Hall.
Katz, Slava M. 1996. “Distribution of Content Words and Phrases in Text and
Language Modelling”. Natural Language Engineering 2:1.15-60.
Kilgarriff, Adam. 1997. “Using Word Frequency Lists to Measure Corpus Ho-
mogeneity and Similarity between Corpora”. Proceedings of ACL-SIGDAT
Workshop on very large corpora, 231-245. Hong Kong.
Rayson, P. & R. Garside. 2000. “Comparing Corpora Using Frequency Profiling”.
Proceedings of the Workshop on Comparing Corpora, 1-6. Hong Kong.
Rose, Tony & Nick Haddock. 1997. “The Effects of Corpus Size and Homogeneity
on Language Model Quality”. Proceedings of the ACL-SIGDAT Workshop
on Very Large Corpora, 178-191. Beijing.
Sarkar, Avik, Paul H. Garthwaite & Anne DeRoeck. 2005a. “A Bayesian Mix-
ture Model for Term Re-occurrence and Burstiness”. Ninth Conference on
Computational Natural Language Learning (CoNLL-2005 ), 48-55. Michigan.
Sarkar, Avik, Anne DeRoeck & Paul H. Garthwaite. 2005b. “Term Re-occurrence
Measures for Analyzing Style”. Proceedings of the SIGIR Workshop on Tex-
tual Stylistics in Information Access. Salvador, Brazil.
Stamatatos, E., N. Fakotakis & G. Kokkinakis. 2000. “Text Genre Detection
Using Common Word Frequencies”. Proceedings of the Association for Com-
putational Linguistics, 808-814. Hong Kong.
Wilbur, John. & Karl Sirotkin. 1992. “The Automatic Identification of Stop
Words”. Journal of Information Science 18:1.45-55.
Yang, Yiming & John Wilbur. 1996. “Using Corpus Statistics to Remove Re-
dundant Words in Text Categorization”. Journal of the American Society of
Information Science 47:5.357-369.
Exploiting Parallel Texts to Produce a Multilingual
Sense Tagged Corpus for Word Sense Disambiguation
Lucia Specia∗ , Maria das Graças Volpe Nunes∗ &

Mark Stevenson∗∗
∗ ∗∗
Universidade de São Paulo, University of Sheffield
Abstract
We describe an approach to the automatic creation of a sense tagged
corpus intended to train a word sense disambiguation (WSD) sys-
tem for English-Portuguese machine translation. The approach uses
parallel corpora, translation dictionaries and a set of straightforward
heuristics. In an evaluation with nine corpora containing 10 am-
biguous verbs, the approach achieved an average precision of 94%,
compared with 58% when a state of the art statistical alignment tool
was used. The resulting corpus consists of 113,802 tagged sentences
and can be readily used to train a supervised machine learning algo-
rithm to build WSD models for use in machine translation systems.
1 Introduction
Word Sense Disambiguation (wsd) is concerned with the identification of

an appropriate sense for an ambiguous word in a given context. Although
wsd can be thought of as an independent task, its importance is more
straightforwardly realized when it is used in an application, such as Infor-
mation Retrieval or Machine Translation (mt) (Wilks & Stevenson 1998).
In mt, which is the focus of this paper, wsd can be used to identify the
most appropriate translation for a source language word when the target
language offers more than one option with different meanings. It differs
from monolingual wsd because there is not always a direct relation be-
tween the number of possible senses and translations of a word (Hutchins
& Somers 1992). In this context, thus,“sense” can be considered equivalent
to “translation”. For example, assuming the translation from English to
Portuguese, the languages considered in this work, bank can be translated
as banco (financial institution or seat) or margem (land along the side of
a river ). Financial institution and land along the side of a river are both
senses of the English word bank, however, the seat sense is valid only in the
translation.
Sense ambiguity has been recognized as one of the most important prob-
lems in mt and a serious barrier to the progress in this area. However, the
various approaches to wsd are generally aimed at monolingual applications.
278 LUCIA SPECIA, MARIA G.V. NUNES & MARK STEVENSON
Recent monolingual approaches focusing on the use of corpus-based tech-

niques have shown good results, especially those using supervised learning
(Edmonds & Cotton 2001). However, supervised approaches are dependent
on sense tagged corpora. The lack or inadequacy of such corpora is one of
the main drawbacks of those approaches.
For multilingual applications, corpora are only available for a small num-
ber of language pairs. For English-Portuguese, in particular, there are no
currently available corpora. In this context, the automatic creation of sense
tagged corpora based on parallel data is a good strategy, but still little ex-
plored. Approaches aiming at the creation of English tagged sense corpora
include the work of Agirre & Martı́nez (2004) and Diab & Resnik (2002).
Dinh (2002) explored bilingual parallel corpora and word alignment meth-
ods to create an English-Vietnamese sense tagged corpus.
Given the large amount of multilingual machine readable texts currently
available, identifying the correspondent word pairs in the source and target
languages of parallel corpora seems to be indeed a very practical strategy
to automatically create sense tagged data. Parallel corpora can also act
as good knowledge sources for sense disambiguation, particularly for mt
(Brown et al. 1991). They have also been used for monolingual wsd (Dagan
& Itai 1994, Ide et al. 2002, Ng et al. 2003).
Most of these approaches rely on the existence of accurate word align-
ment methods. However, current word alignment methods do not perform
satisfactorily when applied to English-Portuguese. Indeed, experiments
with several alignment methods on English-Portuguese reported a precision
of 57% and a recall of 61% for the best method (Caseli et al. 2004). Con-
sidering these issues in the context of our ultimate goal of building a wsd
system for English-Portuguese mt, we developed an alternative approach
to automatically create a sense tagged corpus. In what follows, we first
describe our approach, including its scope, the parallel corpora explored,
and the sense tagging process (Section 2). We then evaluate the proposed
approach (Sections 3) and discuss some conclusions (Section 4).
2 The sense tagging approach
2.1 Experimental setting

This work focuses on verbs, which represent difficult cases for wsd. We
address seven frequent and highly ambiguous verbs identified as very prob-
lematic to English-Portuguese mt systems in a previous study (Specia 2005).
For comparison we also consider another three frequent verbs which are not
as ambiguous. The verbs, along with their number of possible translations
(cf. DIC Prático Michaelis 5.1), are given in Table 1. Possible translations
PARALLEL TEXTS TO PRODUCE A SENSE TAGGED CORPUS 279
Verb # Translations Verb # Translations

come 226 make 239
get 242 take 331
give 128 ask 16
go 197 live 15
look 63 tell 28
Table 1: Verbs and their possible translations
are single words, including synonyms and phrasal verb usages. The average
number of translations for the seven highly ambiguous verbs is 203, and the
average for the three other verbs (ask, live, and tell ) is 19.
The original untagged parallel corpus, consisting of English sentences
containing the 10 verbs along with their manually translated Portuguese
sentences, was collected from several sources, as shown in Table 2. Eu-
roparl (Koehn 2005) comprises bilingual versions of the European Parlia-
ment texts. Compara (Frankenberg-Garcia & Santos 2003) comprises fic-
tion books. Messages contains messages used by Linux software. Bible
contains versions of the Christian Bible. Finally, Miscellaneous contains
small texts from various sources, including user manuals for a programming
language, news and abstracts of theses. All these corpora were already sen-
tence aligned. Parallel sentences in a many-to-one or one-to-many relation-
ship were grouped together to form a “unit”. The number of units selected
from each corpus (in one language) is illustrated in Table 2, resulting in a
total of 220,406 units in each language.
Corpus # Units
Europarl 167,339
Compara 19,706
Messages 16,844
Bible 15,189
Miscellaneous 1,328
Table 2: Corpora and numbers of sentences
Some pre-processing steps were carried out to filter certain units and to
transform the corpus into a more usable format. These included the lemma-
tization of English units and Portuguese verbs, elimination of unit pairs
containing English idioms involving one of the 10 verbs and pos tagging of
the units in both languages. The resulting text contained 206,913 sentences.
2.2 Sense identification

In order to identify the translation of each verb occurrence, the following
assumptions were made:
• Given a sentence aligned parallel corpus, the translation of the verb
in an English unit can be found in its corresponding Portuguese unit.
• Every English verb has a pre-defined set of possible translations, which

can be extracted from bilingual dictionaries.
• Phrasal verbs have specific translations, which are preferred to the
translations of the verb occurring individually.
• Translations have different probabilities of being used in a given cor-
pus, which are given by a statistical analysis of such corpus.
• If there are two or more possible translations for an English verb then
the translation which is closer to the position of the English verb, in
its respective unit, is more likely to be correct.
Machine readable versions of bilingual dictionaries were used to define the
set of possible single-word translations for each verb and to identify a list
of phrasal verbs and their translations. An English dictionary of phrasal
verbs was used to give the lists of separable and inseparable phrasal verbs.
The NATools package (Simões & Almeida 2003) was used to produce the
translation probabilities. NATools employs statistical techniques to create
bilingual dictionaries from sentence aligned parallel corpora. It provides a
list with at most 20 possible translations for each word, along with their
probabilities. In Table 3 we illustrate the list of translation probabilities
produced by NATools for the verb to give, in the corpus Compara.
Probability Translation Probability Translation
ceder v 0.0117 lançar v 0.0131
devolver v 0.0053 pergunta 0.0063
null 0.1520 entregar v 0.0252
renunciar v 0.0055 provocar v 0.0077
desistir v 0.0225 fazer v 0.0309
soltar v 0.0060 dar v 0.5783
deixar v 0.0065 ser v 0.0230
receber v 0.0079
Table 3: Translation probabilities for ‘to give’
In general, the lists produced contain some verbs and words with other part-
of-speech which are not possible translations of the verb according to our
dictionaries (in italic in Table 3), and a null translation probability, that
is, the probability of the verb not being translated. Moreover, they do not
include all the possible translations, since many of them may not occur in
the corpus, or may occur with a very low frequency.
Given these assumptions, we defined a set of heuristics to find the most
adequate translation for each occurrence of the verb in an English unit (EU)
in the Portuguese unit (PU) (see Figure 1):
1. Identify and annotate inseparable phrasal verbs in the EU.
2. Identify and annotate, in the remaining EUs, separable phrasal verbs.
We assume the remaining EUs do not contain any phrasal verb.
Vj − E Vj − P
corpus corpus
Vj − EUi Vj−Dic.
occurrence x Vj − PUi
phrasal
Yes Seek
phrasal translation
No
Seek used No
Vj−Dic. translation alone found
individual Yes
No
found end
Yes
Positions Yes
& Probab. + one
No
Choose & Tagged

Annotate EU
Fig. 1: Sense identification process
3. Search for all possible translations of the verb in the verb lemmas of
the PU, consulting specific dictionaries for inseparable, separable, or
non-phrasal verbs. Three possible situations arise:
(a) No translation is found — go to step 4.
(b) Only one translation is found — go to step 5.
(c) Two or more translations are found — go to step 6.
4. If the occurrence is a non-phrasal verb, finalize the process (no ad-
equate translation was found). Otherwise, check if the verb can be
used as non-phrasal verb. If yes, go back to step 3, now looking for
possible translations of the verb in the dictionary of non-phrasal verbs.
If it can not be used as a non-phrasal verb, finalize the process (no
adequate translation was found).
5. Annotate the EU with the only possible translation.
6. Identify the absolute positions of the verb in the EU and of each
possible translations in the PU, assigning a position weight (PosW )
to each translation. PosW penalizes translations in distant positions

from the position of the EU verb, according to the equation (1).

|EU position − P U position|
P osW = 1 − (1)
10
7. Calculate the translation weight (TraW ) for each possible translation,

as shown in equation (2), using the translation probabilities.
T raW = P osW + T ranslationP robability (2)
8. Annotate the EU with translation having the highest TraW.

The position plus probability weighting schema adopted in the case of more
than one possible translation was empirically defined after experimenting
with different schemas. As an example of its use, consider the pair of sen-
tences shown in Figure 2, for to come (EU position = 7). The system
correctly identifies the translation as vir, the lemma of vindo (PU position
= 9, PosW = 0.8, probability = 0.432, TraW = 1.232), although there
are two more possible translations in the sentence, according to our list of
possible translations: sair (PU position = 2, PosW = 0.5, probability =
0.053, TraW = 0.553) and ir (lemma of for ) (PU position = 6, PosW =
0.9, probability = 0.04, TraW = 0.94). It is worth noticing that the word
position plays the most important role in this example. The probabilities
generally take effect when the possible translations are close to each other.
“I’d rather leave without whatever I came for.”
“Prefiro sair sem o que tenha vindo buscar.”
Fig. 2: Example of parallel sentences
3 Evaluation and discussion
Our approach determined a translation for 55% of all verbs (113,802 units)
in the corpora shown in Table 2. Similar identification percentages were
observed among verbs and corpora. The lack of identification for the re-
maining occurrences was due to three main reasons: (a) we do not consider
multi-word translations; (b) errors from the tools used in the pre-processing
steps, especially pos tagging errors; and (c) modified translations, including
cases of omission and addition of words.
Since our intention is to use this corpus to train a wsd model, we give
preference to precision of the sense tagging to the detriment of wide cover-
age. In order to estimate this precision, we randomly selected and manually
analyzed 30 tagged EU from each corpus for each verb, amounting to 1,500
units. The resultant precisions are shown in Table 4.
Verb Europarl Compara Messages Bible Miscellaneous

come 80% 84% 95% 90% 91%
get 93% 87% 100% 95% 82%
give 97% 95% 95% 97% 93%
go 90% 90% 95% 85% 95%
look 100% 98% 95% 90% 100%
make 87% 86% 100% 93% 97%
take 80% 88% 91% 90% 93%
ask 100% 98% 100% 100% 100%
live 100% 100% 100% 100% 100%
tell 100% 94% 100% 100% 96%
Average 93% 92% 97% 94% 95%
Table 4: Precision of the sense tagging process
On average, our approach was able to identify the correct senses of 94%
of the analyzed units. It achieved a very high average precision (99%) for
the less ambiguous verbs (the three last in Table 3). Of the seven highly
ambiguous verbs, to look and to give have lower numbers of possible senses
than the rest, and for them the system also achieved a very high average pre-
cision (96%). For the remaining five verbs, the system achieved an average
precision of 90.3%. Most of the tagging errors were related to characteristics
of the corpora. However, some errors were also due to limitations of our
heuristics, as illustrated in the distribution of errors sources for each corpus
in Table 5.
Corpus Idiom / Modified Tagger Heuristics
slang translation error
Europarl 6% 66% 8% 20%
Compara 8% 71% 0 21%
Messages 0 100% 0 0
Bible 6% 74% 10% 10%
Miscellaneous 10% 69% 16% 5%
Table 5: Tagging error sources
Most of the errors were due to modified translations, including omissions
and paraphrases (such as active voice sentences being translated by dif-
ferent verbs in a passive voice). In fact, with exception of the technical
corpora (particularly Messages), the translations were far from literal. In
those cases, as in the case of idioms or slang usages, the actual translation
was not in the sentence, or was written using words that were not in the
dictionary, but the system found other possible translation, corresponding
to other verb. Tagger errors refer to the incorrect tagging of the verbs with
any other pos. Therefore, the system also pointed out other possible trans-
lations in the PU. Errors due to the choices made by our heuristics are also
related to the other mentioned errors. For example, considering the position
of the words as the main evidence can be inappropriate when translations

are modified by the inclusion or omission of words. Moreover, some units
are very long (e.g., 180 words), with the PU containing between 1 and 15
possible translations of the EU verb (on average, 1.5 potential translations
for all verbs, and 2.4 for the seven most ambiguous verbs) despite the sen-
tence alignment. Nevertheless, in general the use of filters and heuristics
avoided many errors, reducing the coverage of the system, but increasing
its precision.
3.1 Comparison with an alternative approach

We compared the precision of our approach to the precision of the word
alignment package GIZA++ (Och & Ney 2003). Each pre-processed corpus
was individually submitted to GIZA++ and the alignments produced for the
verbs, in the same sentences used to evaluate our system, were analyzed.
Only the cases for which our system had proposed a possible translation
were considered. The average precision for each corpus is shown in Table 6.
Corpus Precision
Europarl 51%
Compara 61%
Messages 70%
Bible 42%
Miscellaneous 66%
Table 6: Precision of the GIZA++ word alignment
The alignments produced by GIZA++ often contained multiple words as

possible translations for a verb (i.e. the alignment was not one-to-one). We
considered an alignment to be correct if it included the correct translation.
Nevertheless, the precision achieved by GIZA++ is considerably lower than
the precision of our system, with a statistically significant difference (p <
0.05, Wilcoxon Signed Ranks Test). Since statistical evidence is the only
information used by GIZA++, it was not successful in identifying non-
frequent translations. Moreover, it rarely found the correct alignment in
the case of modified translations. This shows that the use of linguistic
knowledge can avoid many tagging errors. In fact, although our approach
also uses statistical information, it will still work if that information is not
available and perform well even for very small corpora.
4 Conclusions
We presented an approach to create a sense tagged corpus aimed at mt,

based on parallel corpora, linguistic knowledge and statistical evidence. The
results of an evaluation using several parallel corpora and 10 verbs showed

that the approach is effective, achieving an average precision of 94%. Most
of the tagging errors were related to characteristics of the corpora, espe-
cially non-literal translations and use of language constructions that are
very difficult to process automatically (idioms, etc.). The resultant corpus
of 113,802 instances provides, in addition to the sense tags, other kinds of
useful information: pos-tags, lemmas and the neighbor words. This corpus
will be used to train a supervised machine learning algorithm to produce a
wsd model. In order to be extended to wider contexts, besides the parallel
corpora, the approach requires only resources that can be extracted from
machine readable sources.
REFERENCES
Agirre, Eneko & David Martı́nez. 2004. “Unsupervised WSD Based on Au-
tomatically Retrieved Examples: The Importance of Bias”. Conference on
Empirical Methods in Natural Language Processing, 25-32. Barcelona.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L.
Mercer. 1991. “Word Sense Disambiguation Using Statistical Methods”.
29th Annual Meeting of the Association for Computational Linguistics, 264-
270. Berkley, Calif.
Caseli, Helena M., Aline M.P. Silva, Maria G.V. Nunes. 2004. “Evaluation of
Methods for Sentence and Lexical Alignment of Brazilian Portuguese and
English Parallel Texts”. 7th Brazilian Symposium on Artificial Intelligence,
184-193. Sao Luiz, Brazil.
Dagan, Ido & Alon, Itai. 1994. “Word Sense Disambiguation Using a Second
Language Monolingual Corpus”. Computational Linguistics 20:4.563-596.
Diab, Mona & Philip Resnik. 2002. “An Unsupervised Method for Word Sense
Tagging using Parallel Corpora”. 40th Annual Meeting of the Association
for Computational Linguistics, 255-262. Philadelphia, Penn.
Dinh, Dien. 2002. “Building a training corpus for word sense disambiguation
in the English-to-Vietnamese Machine Translation”. Workshop on Machine
Translation in Asia at COLING 2002, 26-32. Taipei.
Edmonds, Philip & Scott Cotton. 2001. “SENSEVAL-2: Overview”. 2nd Work-
shop on Evaluating Word Sense Disambiguation Systems, 1-5. Toulouse,
France.
Frankenberg-Garcia, Ana & Diana Santos. 2003. “Introducing COMPARA: The
Portuguese-English Parallel Corpus”. Corpora in Translator Education, 71-
87. Manchester.
Hutchins, W. John, Harold L. Somers. 1992. An Introduction to Machine Trans-
lation. London: Academic Press.
Ide, Nancy, Tomaz Erjavec & Dan Tufis. 2002. “Sense Discrimination with Par-
allel Corpora”. SIGLEX Workshop on Word Sense Disambiguation: Recent
Successes and Future Directions, 56-60. Philadelphia, Penn.
Koehn, Philipp. 2005. “Europarl: A Parallel Corpus for Statistical Machine
Translation”. MT Summit X, 79-86. Phuket, Thailand.
Ng, Hwee T., Bin Wang & Yee S. Chan. 2003. “Exploiting Parallel Texts for
Word Sense Disambiguation: An Empirical Study”. 41st Annual Meeting of
the Association for Computational Linguistics, 455-462. Sapporo, Japan.
Och, Franz J. & Hermann Ney. 2003. “A Systematic Comparison of Various
Statistical Alignment Models”. Computational Linguistics, 29:1.19-51.
Simões, Alberto M. & José J. Almeida. 2003. “NATools – A Statistical Word
Aligner Workbench”. Sociedade Española para el Procesamiento del Lenguaje
Natural, Madrid, 31:217-224.
Specia, Lucia. 2005. “A Hybrid Model for Word Sense Disambiguation in
English-Portuguese Machine Translation”. 8th Research Colloquium of the
UK Special-interest Group in Computational Linguistics, 71-78. Manchester,
U.K.
Wilks, Yorick, & Mark Stevenson. 1998. “The Grammar of Sense: Using Part-of-
speech Tags as a First Step in Semantic Disambiguation”. Journal of Natural
Language Engineering, 4:2.135-144.
Detecting Dangerous Coordination Ambiguities
Using Word Distribution
Francis Chantree∗ , Alistair Willis∗ , Adam Kilgarriff∗∗ &
Anne de Roeck∗
∗
The Open University, ∗∗ Lexical Computing Ltd
Abstract
In this paper we present heuristics for resolving coordination ambi-
guities. We test the hypothesis that the most likely reading of a coor-
dination can be predicted using word distribution information from a
generic corpus. Our heuristics are based upon the relative frequency
of the coordination in the corpus, the distributional similarity of the
coordinated words, and the collocation frequency between the coor-
dinated words and their modifiers. These heuristics have varying but
useful predictive power. They also take into account our view that
many ambiguities cannot be effectively disambiguated, since human
perceptions vary widely.
1 Introduction
Coordination ambiguity is a very common form of structural (i.e., syntac-

tic) ambiguity in English. However, although coordinations are known to
be a “pernicious source of structural ambiguity in English” (Resnik 1999),
they have received little attention in the literature compared with other
structural ambiguities such as prepositional phrase (pp) attachment.
Words and phrases of all types can be coordinated (Okumura & Muraki
1994), with the external modifier being a word or phrase of almost any type
and appearing either before or after the coordination. So for the phrase:
Assumptions and dependencies that are of importance
the external modifier that are of importance may apply either to both as-
sumptions and dependencies or to just the dependencies.
We address the problem of disambiguating coordinations, that is, de-
termining how the external modifier applies to the coordinated words or
phrases (known as ‘conjuncts’). We describe a novel disambiguation method
using several types of word distribution information, and empirically vali-
date this method using a corpus of ambiguous phrases, for which preferred
readings were selected by multiple human judges. We also introduce the
concept of an ambiguity threshold to recognise that the meaning of some
ambiguous phrases cannot be judged reliably. All the heuristics use infor-
mation generated by the Sketch Engine (Kilgarriff et al. 2004) operating on
the British National Corpus (bnc) (http://www.natcorp.ox.ac.uk).
288 CHANTREE, WILLIS, KILGARRIFF & DE ROECK
Throughout this paper, the examples have been taken from requirements
engineering documents. Gause and Weinberg (1989) recognise requirements
as a domain in which misunderstood ambiguities may lead to serious and
potentially costly problems.
2 Methodology
‘Central coordinators’, such as and and or, are the most common cause of
coordination ambiguity, and account for approximately 3% of the words in
the bnc. We investigate single coordination constructions using these (and
and/or ) and incorporating two conjuncts and a modifier, as in the phrase:
old boots and shoes,
where old is the modifier and boots and shoes are the two conjuncts. We
describe the case where old applies to both boots and shoes as ‘coordination-
first’, and the case where old applies only to boots as ‘coordination last’.
We investigate the hypothesis that the preferred reading of a coordi-
nation can be predicted by using three heuristics based upon word dis-
tributions in a general corpus. The first we call the Coordination-Matches
heuristic, which predicts a coordination-first reading if the two conjuncts are
frequently coordinated. The second we call the Distributional-Similarity
heuristic, which predicts a coordination-first reading if the two conjuncts
have strong ‘distributional similarity’. The third we call the Collocation-
Frequency heuristic, which predicts a coordination-last reading if the mod-
ifier is collocated with the first conjunct more often than with the second.
We represent the conjuncts by their head words in all these three types of
analysis.
In our example, we find that shoes is coordinated with boots relatively
frequently in the corpus. boots and shoes are shown to have strong distribu-
tional similarity, suggesting that boots and shoes is a syntactic unit. Both
these factors predict a coordination-first reading. Thirdly, the ‘collocation
frequency’ of old and boots is not significantly greater than that of old and
shoes and so a coordination-last reading is not predicted. Therefore, all the
heuristics predict a coordination-first reading for this phrase.
In order to test this hypothesis, we require a set of sentences and phrases
containing coordination ambiguities, and a judgement of the preferred read-
ing of the coordinations. The success of the heuristics is measured by how
accurately they are able to replicate human judgements. We obtained the
sentences and phrases from a corpus of requirements documents, manually
identifying those that contain potentially ambiguous coordinating conjunc-
tions. Table 1 lists the sentences by part of speech of the head word of the
conjuncts; Table 2 lists them by part of speech of the external modifier.
DISAMBIGUATING COORDINATIONS 289
Head Word % of Total Example from Surveys (head words underlined)

Noun 85.5 Communication and performance requirements
Verb 13.8 Proceed to enter and verify the data
Adjective 0.7 It is very common and ubiquitous
Table 1: Breakdown of sentences in dataset by head word type

Modifier % of Total Example from Surveys (modifiers underlined)
Noun 46.4 ( It ) targeted the project and election managers
Adjective 23.2 .... define architectural components and connectors
Prep 15.9 Facilitate the scheduling and performing of works
Verb 5.8 capacity and network resources required
Adverb 4.4 ( It ) might be automatically rejected or flagged
Rel. Clause 2.2 Assumptions and dependencies that are of importance
Number 0.7 zero mean values and standard deviation
Other 1.4 increased by the lack of funding and local resources
Table 2: Breakdown of sentences in dataset by modifier type
Ambiguity is context-, speaker- and listener-dependent, so there are no ab-

solute criteria for judging it. Therefore, rather than rely upon the judgement
of a single human reader, we took a consensus from multiple readers. This
approach is known to be very effective albeit expensive (Berry 2003).
In total, we extracted 138 suitable coordination constructions and showed
each one to 17 judges. They were asked to judge whether each coordination
was to be read coordination first, coordination last or “ambiguous so that
it might lead to misunderstanding”. In the last case, the coordination is
then classed as an ‘acknowledged ambiguity’ for that judge. We believe that
by using a sufficiently large number of judges, we can estimate how certain
we can be that the coordination should be read in a particular way. Then
we use the idea of an adjustable ‘ambiguity threshold’, which represents
the minimum acceptable level of certainty about the preferred reading of a
passage of text in order for it not to be considered ambiguous.
3 Related research
There is little work on automatically disambiguating coordination ambigu-

ities in English. What research there has been addresses several different
tasks, illustrating the difficulty of a full treatment of all ambiguities caused
by coordinations. For instance, Agarwal and Boggess (1992) developed a
method of recognising which phrases are conjoined by matching part of
speech and case labels in a tagged dataset. They achieved an accuracy
of 82.3% using the machine-readable Merck Veterinary Manual as their
dataset. In a full system, their methods would form a useful initial step
for identifying the coordinated structures, before attempting to determine

attachment. Goldberg (1999) adapted Ratnaparkhi’s (1998) pp attachment
method for use on coordination ambiguities. She achieved an accuracy of
72% on the annotated attachments of her test set, drawn from the Wall
Street Journal by extracting head words from chunked text.
Resnik (1999) investigated the role of semantic similarity in resolving
nominal compounds in coordination ambiguities of the form noun1 and
noun2 noun3, such as bank and warehouse guard. To disambiguate, Resnick
compares the relative information content of the classes in WordNet that
subsume the noun pairs; this method has achieved 71.2% precision and
66.0% recall of the correct human disambiguations in a dataset drawn from
the Wall Street Journal. By adding an evaluation of the selectional asso-
ciation between the nouns to his semantic similarity evaluation, Resnick
achieves precision of 77.4% and 69.7% recall on complex coordinations of
the form noun0 noun1 and noun2 noun3.
We believe that because our method is applicable to any part of speech
for which word distribution information is available, our results are more
generally applicable than those of Resnick, which are applied specifically
to nominal compounds. In addition, we do not know of other comparable
work in which multiple readers have been used to select a preferred reading.
This approach to collecting our datasets gives us an additional insight into
the relative certainty of different readings.
4 Disambiguation empirical study
We maximise our heuristics’ performance using ambiguity thresholds and

ranking cut-offs. The ambiguity threshold is the minimum level of certainty
that must be reflected by the consensus of survey judgements. Suppose
a coordination is judged to be coordination-first by 65% of judges, and
we use a heuristic that predicts coordination-first readings. Then, if the
ambiguity threshold is 60% the consensus judgement will be considered
to be coordination-first, whereas it will not if the ambiguity threshold is
70%. This can significantly change the baseline — the percentage of either
coordination-first or coordination-last judgements, depending on which of
these readings the heuristic is predicting. The ranking cut-off is the point
below which a heuristic is considered to give a negative result. We use data
in the form of rankings as these are considered more accurate than frequency
or similarity scores for word distribution comparisons (McLaughlan 2004).
True positives for a heuristic are those coordinations for which it pre-
dicts the consensus judgement. Precision for a heuristic is the number of
true positives divided by the number of positive results it produces; recall
is the number of true positives divided by the number of coordinations it
should have judged positively. Precision is much more important to us than

recall: we wish each heuristic to be a reliable indicator of how a coordina-
tion should be read, and hope to achieve good recall by the heuristics having
complementary coverage. We use a weighted f-measure statistic (van Ri-
jsbergen 1979) to combine precision and recall — with β = 0.25, strongly
favouring precision — and seek to maximise this for all of our heuristics:
(1 + β) ∗ P recision ∗ Recall
F−Measure =
β 2 ∗ P recision + Recall
We employ 10-fold ‘cross validation’, to avoid the problem of ‘overfitting’
(Weiss & Kulikowski 1991). Our dataset is split into ten equal parts, nine
of which are used for training to find the optimum ranking cut-off and am-
biguity threshold for each heuristic. (The former are found to be the same
for all 10 folds for all three heuristics.) The heuristics are then run on the
heldout tenth part using those cut-offs and ambiguity thresholds. This pro-
cedure is carried out for each heldout part, and the heuristics’ performances
over all the iterations are averaged to give their overall performances.
4.1 Our tools

All our heuristics use statistical information generated by the Sketch Engine
with the bnc as its data source. The bnc is a modern corpus of over 100
million words of English, collated from a variety of sources. The Sketch
Engine provides a thesaurus giving distributional similarity between words,
and word sketches giving the frequencies of word collocations in many types
of syntactic relationship. It accepts input of verbs, nouns and adjectives.
In the word sketches, head words of conjuncts are found efficiently by using
grammatical patterns (Kilgarriff et al. 2004).
The Sketch Engine’s thesaurus is in the tradition of Grefenstette (1994);
it measures distributional similarity between any pair of words according
to the number of corpus contexts they share. Contexts are shared where
the relation and one collocate remain the same, so hobject, drink, winei and
hobject, drink, beer i count towards the similarity between wine and beer.
Shared collocates are weighted according to the product of their mutual
information, and the similarity score is the sum of these weights across
all shared collocates, as in (Lin 1998). Distributional thesauruses are well
suited to our task, as words used in similar contexts but having dissimilar
semantic meaning, such as good and bad, are often coordinated.
4.2 Coordination-matches heuristic

We hypothesise that if a coordination is found frequently within a corpus
then a coordination-first reading is the more likely. We search the bnc for
each coordination in our dataset using the Sketch Engine, which provides
lists of words that are conjoined with and or or. Each head word is looked
up in turn. The ranking of the match of the second head word with the
first head word may not be the same as the ranking of the match of the
first head word with the second head word. This is due to differences in the
overall frequencies of the two words. We use the higher of the two rankings.
We find that considering only the top 25 rankings is a suitable cut-off. An
ambiguity threshold of 60% is found to be the optimum for all ten folds in
the cross-validation exercise. For the example from our dataset:
Security and Privacy Requirements,
the higher of the two rankings of Security and Privacy is 9. This is in the top
25 rankings so the heuristic yields a positive result. The survey judgements
were: 12 coordination-first, 1 coordination-last and 4 ambiguous, giving a
certainty of 12/17 = 70.5%. As this is over the ambiguity threshold of 60%,
the heuristic always yields a true positive result on this sentence.
Averaging over all ten folds, this heuristic achieves 43.6% precision,
64.3% recall and 44.0% f-measure. However, the baselines are low, given the
relatively high ambiguity threshold, giving 20.0 precision and 19.4 f-measure
percentage points above the baselines.
4.3 Distributional-similarity heuristic

Our second hypothesis follows a suggestion by Kilgarriff (2003) that if two
conjuncts display strong distributional similarity, then the conjunction is
likely to form a syntactic unit, giving a coordination-first reading.
For each coordination, the lemmatised head words of both the conjuncts
are looked up in the Sketch Engine’s thesaurus. We use the higher of the
ranking of the match of the second head word with the first head word and
the ranking of the match of the first head word with the second head word.
The optimal cut-off is to consider only the top 10 matches. An ambiguity
threshold of 50% produces optimal results for 7 of the folds, while 70% is
optimal for the other 3. For the example from our dataset:
processed and stored in database,
the verb process has the verb store as its second ranked match in the the-
saurus, and vice versa. As this is in the top 10 matches, the heuristic
yields a positive result. The survey judgements were: 1 coordination-first,
coordination-last and 5 ambiguous, giving a certainty of 1/17 = 5.9%. As
this is below both the ambiguity thresholds used by the folds, the heuristic’s
performance on this sentence always yields a false positive result.
Averaging for all ten folds, this heuristic achieves 50.8% precision, 22.4%
recall and 46.4% f-measure, and 11.5 precision and 5.8 f-measure percentage
points above the baselines.
Heuristic Re- Baseline Prec. Prec. F-meas. F-meas.

call Precision above base (β = 0.25) above base
(1) Coordination-match 64.3 23.6 43.6 20.0 44.0 19.4
(2) Distrib-similarity 22.4 39.3 50.8 11.5 46.4 5.8
(3) Collocation-freq. 35.3 22.1 40.0 17.9 37.3 14.1
(4) = (1) & not (3) 64.3 23.6 47.1 23.5 47.4 22.9
Table 3: Performance of our heuristics (%)
4.4 Collocation-frequency heuristic

Our third heuristic predicts coordination-last readings. We hypothesise
that if a modifier is collocated in a corpus much more frequently with the
conjunct head word that it is nearest to than it is to the further head word,
then it is more likely to form a syntactic unit with only the nearest head
word. This implies that a coordination-last reading is the more likely.
We use the Sketch Engine to find how often the modifier in each sen-
tence is collocated with the conjuncts, head words. We experimented with
collocation ratios, but found the optimal cut-off to be when there are no
collocations between the modifier and the further head word, and any non-
zero number of collocations between the modifier and the nearest head word.
An ambiguity threshold of 40% produces optimum results for 8 of the folds,
while 70% is optimal for the other 2. For the example from our dataset:
project manager and designer,
project often modifies manager in the bnc but never designer, and so
the heuristic yields a positive result. The survey judgements were: 8
coordination-last, 4 coordination-first and 5 ambiguous, giving a certainty
of 8/17 = 47.1%. This is over the ambiguity threshold of 40% but under
the threshold of 70%. On this sentence, the heuristic therefore yields a true
positive result for 8 of the folds but a false positive result for 2 of them.
Averaging for all ten folds, the heuristic achieves 40.0% precision, 35.3%
recall and 37.3% f-measure, and 17.9 precision and 14.1 f-measure percent-
age points above the baselines.
5 Evaluation and discussion
Table 3 summarises our results. Our use of ambiguity thresholds prevents

readings being assigned to highly ambiguous coordinations. This has two
contrary effects on performance: the task is made easier as the target set
contains more clear-cut examples, but harder as there are fewer examples
to find. Our precision and f-measure in terms of percentage points over the
baselines, except for the distributional-similarity heuristic, are encouraging.
Fig. 1: Heuristic 4: Left graph – absolute performance; Right graph –

performance as percentage points over baselines
We combine the two most successful heuristics, shown in the last line of Ta-
ble 3, by saying a coordination-first reading is predicted if the coordination-
matches heuristic gives a positive result and the collocation-frequency heuris-
tic gives a negative one. The left hand graph of Figure 1 shows the precision,
recall and f-measure for this fourth heuristic, at different ambiguity thresh-
olds. As can be seen, high precision and f-measure can be achieved with low
ambiguity thresholds, but at these thresholds even highly ambiguous coor-
dinations are judged to be either coordination-first or -last. The right hand
graph of Figure 1 shows performance as percentage points above the base-
lines. Here the fourth heuristic performs best, and is more appropriately
used, when the ambiguity threshold is set at 60%.
Instead of using the optimal ambiguity threshold, users of our technique
can choose whatever threshold they consider appropriate, considering how
critical they believe ambiguity to be in their work. Figure 2 shows the
proportions of ambiguous and non-ambiguous interpretations at different
ambiguity thresholds. None of the coordinations are judged to be ambiguous
with an ambiguity threshold of zero — which is a dangerous situation —
whereas at an ambiguity threshold of 90% almost everything is considered
ambiguous.
6 Conclusions and further work
Our results show that the collocation-frequency heuristic and (particularly)

the coordination-matches heuristic are good predictors of the preferred
reading of a sentence displaying coordination ambiguity, and that com-
bining them increases performance further. However, the performance of
Fig. 2: Ambiguous and non-ambiguous readings at different thresholds
the distributional-similarity heuristic suggests that distributional similar-

ity between head words of conjuncts is only a weak indicator of preferred
readings.
The success of these heuristics is perhaps surprising, as the distribution
information was obtained from a general corpus (the bnc), but tested on
a specialist data set (requirements documents). This indicates that many
distributions of head words in the data set are reflected in the corpus. These
are promising results, as they suggest that our techniques may be applica-
ble across different domains of discourse, without the need for distribution
information for specialist corpora. The results also show that the heuristics
are not specific to grammatical constructions: the method is applicable to
coordinations of different types of word, and different types of modifier.
We have found that people’s judgements can vary quite widely: different
people interpret a sentence differently, but do not themselves consider the
sentence ambiguous. We call this ‘unacknowledged ambiguity’; it is poten-
tially more dangerous than acknowledged ambiguity as it is not noticed and
therefore may not be resolved. Unacknowledged ambiguity is measured as
the number of judgements in favour of the minority non-ambiguous choice,
over all the non-ambiguous judgements. The average unacknowledged am-
biguity over all the examples in our dataset is 15.3%.
This paper is part of wider research into notifying users of ambiguities
in text and informing them of how likely they are to be misunderstood. We
are currently testing heuristics based on morphology, typography and word
sub-categorisation. In this work we investigate the multi-level conjunct
parallelism model of Okumura and Muraki (1994).
REFERENCES
Agarwal, Rajeev & Lois Boggess. 1992. “A Simple but Useful Approach to
Conjunct Identification”. Proceedings of the 30th Conference on Association
for Computational Linguistics, 15-21. Newark, Delaware.
Berry, Daniel & Erik Kamsties & Michael Krieger. 2003. From Contract Drafting
to Software Specification: Linguistic Sources of Ambiguity. A Handbook.
http://se.uwaterloo.ca/ dberry/handbook/ambiguityHandbook.pdf
Gause, Donald C. & Gerald M. Weinberg. 1989. Exploring Requirements: Quality
Before Design. New York: Dorset House.
Goldberg, Miriam. 1999. “An Unsupervised Model for Statistically Determining
Coordinate Phrase Attachment”. Proceedings of the 37th Annual Meeting of
the Association for Computational Linguistics on Computational Linguistics,
610-614. College Park, Maryland.
Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery.
Boston, Mass.: Kluwer Academic.
Kilgarriff, Adam. 2003. “Thesauruses for Natural Language Processing”. Pro-
ceedings of Natural Language Processing and Knowledge Engineering (NLP-
KE ) ed. by Chengqing Zong. 5-13. Beijing, China.
Kilgarriff, Adam & Pavel Rychly & Pavel Smrz & David Tugwell. 2004. “The
Sketch Engine”. 11th European Association for Lexicography International
Congress (EURALEX 2004 ), 105-116. Lorient, France.
Lin, Dekang. 1998. “Automatic Retrieval and Clustering of Similar Words”. Pro-
ceedings of the 17th International Conference on Computational Linguistics,
768-774. Montreal, Canada.
McLauchlan, Mark. 2004. “Thesauruses for Prepositional Phrase Attachment”.
Proceedings of Eight Conference on Natural Language Learning (CoNLL) ed.
by Hwee Tou Ng & Ellen Riloff, 73-80. Boston, Mass.
Okumura, Akitoshi & Kazunori Muraki. 1994. “Symmetric Pattern Matching
Analysis for English Coordinate Structures”. Proceedings of the 4th Confer-
ence on Applied Natural Language Processing, 41-46. Stuttgart, Germany.
Ratnaparkhi, Adwait. 1998. “Unsupervised Statistical Models for Prepositional
Phrase Attachment”. Proceedings of the 17th International Conference on
Computational Linguistics, 1079-1085. Montreal, Canada.
Resnik, Philip. 1999. “Semantic Similarity in a Taxonomy: An Information-
Based Measure and its Application to Problems of Ambiguity in Natural
Language”. Journal of Artificial Intelligence Research 11:95-130.
van Rijsbergen, C. J. 1979. Information Retrieval. London, U.K.: Butterworths.
Weiss, Sholom M. & Casimir A. Kulikowski. 1991. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets, Machine
Learning, and Expert Systems. San Francisco, Calif.: Morgan Kaufmann.
List and Addresses of Contributors
Laura Alonso i Alemany Francis Chantree

Sección de Ciencias de la Computación The Open University
FaMAF Walton Hall
Universidad Nacional de Córdoba Milton Keynes MK7 6AA, U.K.
Argentina f.j.chantree@open.ac.uk
alemany@famaf.unc.edu.ar
Jean-Cédric Chappelier
Victoria Arranz School of Computer &
ELDA - Evaluation and Language Communication Sciences
Resources Distribution Agency École Polytechnique Fédérale
55-57, rue Brillat Savarin, de Lausanne (EPFL)
75013 Paris, France Station 14, CH–1015 Lausanne, Switzerland
arranz @elda.org
Elisabet Comelles
Niraj Aswani TALP Research Centre
Dept. of Computer Science Universitat Politècnica de Catalunya
Univ. of Sheffield C/ Jordi Girona 1-3
Regent Court, 211 Portobello Str. 08034 Barcelona, Spain
Sheffield, S1 4DP, U.K. comelles@lsi.upc.edu
niraj @dcs.shef.ac.uk
Courtney D. Corley
Irene Balta Dept. of Computer Science
Institute for Language & Univ. of North Texas
Speech Processing P.O. Box 311366
6 Artemidos & Epidavrou Denton, TX 76203
151 25 Marousi, Athens, Greece corley@cs.unt.edu
balta@ilsp.gr
Andras Csomai
Eduard Barbu Dept. of Computer Science
Graphitech Italy Univ. of North Texas
2, Salita Dei Molini P.O. Box 311366
38050 Villazzano, Trento, Italy Denton, TX 76203
eduard.barbu@graphitech.it ac0225 @cs.unt.edu
Kalina Bontcheva Hamish Cunningham

Dept. of Computer Science Dept. of Computer Science
Univ. of Sheffield Univ. of Sheffield
Regent Court, 211 Portobello Str. Regent Court, 211 Portobello Str.
Sheffield, S1 4DP, U.K. Sheffield, S1 4DP, U.K.
kalina@dcs.shef.ac.uk hamish@dcs.shef.ac.uk
Ming-Wei Chang Robert Dale

Dept. of Computer Science Centre for Language Technology
Univ. of Illinois at Urbana-Champaign Macquarie Univ.
Urbana, IL 61801, U.S.A. North Ryde
mchang21 @uiuc.edu NSW 2109 Australia
robert.dale@mq.edu.au
298 LIST AND ADDRESSES OF CONTRIBUTORS
Anne De Roeck Arthur C. Graesser

Centre for Research in Computing Univ. of Memphis
The Open University Institute for Intelligent Systems
Milton Keynes MK7 6AA, U.K. Dept. of Psychology
a.deroeck @open.ac.uk 202 Psychology Building
Memphis, TN 38120, U.S.A.
Panagiotis Dimitrakis a-graesser @memphis.edu
Institute for Language &
Speech Processing Ralph Grishman
6 Artemidos & Epidavrou Computer Science Department
151 25 Marousi, Athens, Greece New York University
715 Broadway, 7th Floor
Quang Do New York, NY 10003, U.S.A.
Dept. of Computer Science grishman@cs.nyu.edu
Univ. of Illinois at Urbana-Champaign
Urbana, IL 61801, U.S.A. Péter Halácsy
quangdo2 @uiuc.edu Media Research Centre
Technical Univ. of Budapest
Ayman Farahat Stoczek u. 2, H-1111 Budapest, Hungary
Palo Alto Research Center halacsy@mokk.bme.hu
3333 Coyote Hill Road
Palo Alto, CA 94304, U.S.A. Samer Hassan
ayman.farahat@gmail.com Univ. of North Texas
PO Box 305241
David Farwell Denton, TX 76203, U.S.A.
TALP Research Centre samer @unt.edu
Universitat Politècnica de Catalunya
C/ Jordi Girona 1-3 Erhard W. Hinrichs
08034 Barcelona, Spain SfS-CL, Univ. of Tübingen
Wilhelmstr. 19
Katja Filippova 72074 Tübingen, Germany
SfS-CL, Univ. of Tübingen eh@sfs.uni-tuebingen.de
Wilhelmstr. 19
72074 Tübingen, Germany Adam Kilgarriff
katja.f @gmail.com Lexical Computing Ltd
71 Freshfield Road
Paul H. Garthwaite Brighton BN2 0BL, U.K.
Department of Statistics adam@lexmasterclass.com
The Open University
Milton Keynes MK7 6AA, U.K. András Kornai
p.h.garthwaite@open.ac.uk Media Research Centre
Filip Ginter Stoczek u. 2, H-1111 Budapest, Hungary
Turku Centre for Computer Science (tucs) andras@kornai.com
Univ. of Turku
Lemminkäisenkatu 14 A Milen Kouylekov
20540 Turku, Finland ITC-irst, Centro per la Ricerca
filip.ginter @it.utu.fi Scientifica e Tecnologica
via Sommarive 18
38050 Povo (TN), Italy
kouylekov @itc.it
LIST AND ADDRESSES OF CONTRIBUTORS 299
Sandra Kübler Verginica Barbu Mititelu

Dept. of Linguistics Romanian Academy Research Institute for
Indiana Univ. Artificial Intelligence
Memorial Hall 13, Calea 13 Septembrie
1021 E. Third Street 050711 Bucharest, Romania
Bloomington, IN 47405, USA vergi@racai.ro
skuebler @indiana.edu
Viktor Nagy
Alex Lascarides Research Institute for Linguistics
School of Informatics Benczúr u. 33, H-1068 Budapest, Hungary
Univ. of Edinburgh nagyv @nytud.hu
2 Buccleuch Place
László Németh
Edinburgh EH8 9LW, U.K.
Media Research Centre
alex @inf.ed.ac.uk
Stoczek u. 2, H-1111 Budapest, Hungary
Gina-Anne Levow
nemeth@mokk.bme.hu
Dept. of Computer Science
Univ. of Chicago John Nerbonne
5801 S. Ellis Ave Humanities Computing
Chicago, IL 60637, U.S.A. Univ. of Groningen
levow @cs.uchicago.edu 9700 AS Groningen, The Netherlands
j.nerbonne@rug.nl
Bernardo Magnini
ITC-irst, Centro per la Ricerca Scientifica Nicolas Nicolov
e Tecnologica Chief Scientist, Umbria Inc.
via Sommarive 18 4888 Pearl East Circle, Suite 300W
38050 Povo (TN) - Italy Boulder, CO 80301, U.S.A.
magnini@itc.it nicolas@umbrialistens.com
Irene Castellón Masalles Maria das Graças V. Nunes

Departament de Linguı́stica General Departamento de Ciências
Facultat de Filologia de Computaçao e Estatı́stica
Universitat de Barcelona, Spain ICMC – Universidade de São Paulo
icastellon@ub.edu Caixa Postal 668
13560-970 So Carlos, Brazil
Irina Matveeva gracan@icmc.usp.br
Univ. of Chicago Stelios Piperidis
5801 S. Ellis Ave Institute for Language &
Chicago, IL 60637, U.S.A. Speech Processing
matveeva@cs.uchicago.edu 6 Artemidos & Epidavrou
151 25 Marousi, Athens, Greece
Rada Mihalcea spip@ilsp.gr
Dept. of Computer Science Sampo Pyysalo
Univ. of North Texas Turku Centre for Computer Science (tucs)
PO Box 311366 Univ. of Turku
Denton, TX 76203 Lemminkäisenkatu 14 A
rada@cs.unt.edu 20540 Turku, Finland
sampo.pyysalo@it.utu.fi
300 LIST AND ADDRESSES OF CONTRIBUTORS
Dan Roth Keiji Shinzato

Department of Computer Science Graduate School of Informatics
Univ. of Illinois at Urbana-Champaign Kyoto University
Urbana, IL 61801, U.S.A. Yoshida-Honmachi, Sakyo-ku
danr @uiuc.edu Kyoto 606-8501, Japan
shinzato@nlp.kuee.kyoto-u.ac.jp
Christiaan Royer
Palo Alto Research Center Lucia Specia
3333 Coyote Hill Road Departamento de Ciências de
Palo Alto, CA 94304, U.S.A. Computaçao e Estatı́stica
royer @parc.com ICMC – Universidade de São Paulo
Caixa Postal 668
Vasile Rus 13560-970 So Carlos, Brazil
Univ. of Memphis lspecia@icmc.usp.br
Institute for Intelligent Systems
Dept. of Computer Science Caroline Sporleder
373 Dunn Hall ILK/Language and Information Science
Memphis, TN 38120, U.S.A. Tilburg Univ.
vrus@memphis.edu P.O. Box 90153
5000 LE Tilburg, The Netherlands
Tapio Salakoski c.sporleder @uvt.nl
Turku Centre for Computer Science (tucs)
Univ. of Turku Mark Stevenson
Lemminkäisenkatu 14 A Dept. of Computer Science
20540 Turku, Finland Univ. of Sheffield
tapio.salakoski@it.utu.fi Regent Court, 211 Portobello Str.
S1 4DP Sheffield, U.K.
Franco Salvetti m.stevenson@dcs.shef.ac.uk
Univ. of Colorado at Boulder, 430 UCB Valentin Tablan
Boulder, CO 80309-0430, U.S.A. Dept. of Computer Science
franco.salvetti@colorado.edu Univ. of Sheffield
Regent Court, 211 Portobello Str.
Avik Sarkar Sheffield, S1 4DP, U.K.
Centre for Research in Computing valyt @dcs.shef.ac.uk
The Open University
Milton Keynes MK7 6AA, U.K. Jörg Tiedemann
a.sarkar @open.ac.uk Alfa Informatica
Univ. of Groningen
Florian Seydoux Oude Kijk in ’t Jatstraat 26
School of Computer and 9712 EK Groningen, The Netherlands
Communication Sciences tiedeman@let.rug.nl
École Polytechnique Fédérale de Lausanne
(EPFL) Kentaro Torisawa
Station 14, CH–1015 Lausanne, Switzerland School of Information Science
Japan Advanced Institute of Science and
Technology (JAIST)
1-1 Asahidai, Nomi, Ishikawa, 923-1292 Japan
torisawa@jaist.ac.jp
LIST AND ADDRESSES OF CONTRIBUTORS 301
Viktor Trón Alistair Willis

Univ. of Edinburgh The Open University
2 Buccleuch Place Walton Hall
Edinburgh EH8 9LW, U.K. Milton Keynes MK7 6AA, U.K.
v.tron@ed.ac.uk a.g.willis@open.ac.uk
Dániel Varga Holger Wunsch

Media Research Centre SfS-CL, University of Tübingen
Technical Univ. of Budapest Wilhelmstr. 19
Stoczek u. 2, H-1111 Budapest, Hungary 72074 Tübingen, Germany
daniel @mokk.bme.hu wunsch@sfs.uni-tuebingen.de
Index of Subjects and Terms
A. central coordinators 288

adaptive representation 139, 140 chi-square (χ2 ) 271
aggregation 101 Chinese 19
alignment 21, 247, 251, 252, 257 collection bias 262
alignment ladder 254, 256 collocation frequency 288
ambiguity complete rung count 254
acknowledged 289 computational complexity 106, 108
lexical 228 computer-aided translation (cat)
lexical ambiguity resolution 228 227
threshold 289 conjuncts 287
unacknowledged 295 context 228, 230
anaphora resolution 115 context vector (cv) 228, 230
annotation scheme 79 extended context vector (ecv)
annotation tools 93 232
arguments vs. adjuncts 93 contrast set 107
automatic building 217 conversational agents 99
automatic labelling 157, 158, 163, coordination-first readings 288
165 coordination-last readings 288
coreference analysis 18
B.
corpus
backward maximum match 129
alignment 247, 251, 252
bag-of-words (bow) 142
Base de Datos Sintácticos del Español annotation 89-91, 93
Actual 91 bilingual 227
bias 262 cost 168
bigram model 157, 163, 165 coverage 18
bitext 227 CRITERION 213
BoosTexter 161, 163, 164 CRITERION (STRICT) 213
boosting 161 cross validation 291
bootstrapping 21 cue phrase 157
bow see bag-of-words
burstiness, term burstiness 267 D.
dataset 168
C. decision function 212
CACM 31 definite descriptions 110
304 INDEX OF SUBJECTS AND TERMS
dependency 187, 189, 190 German 115, 116

dependency parsing 56 graph 187-191, 195
dependency tree 169, 170 graph-based algorithms 149
dependency triples 172 greedy search 128
dictionary 220-222, 224, 226
dictionary extraction H.
automatic 227 heuristic 217, 218, 220-226
discourse history 118 heuristics 280
discourse marker homogeneity 269
connective 157-163, 165 homogeneity assumption 268
discourse relation 157, 165 hunalign 250, 251, 253-256
discourse structure 116 hyponymy 138, 139
distinguishing description 104, 105 hyponymy relation acquisition method
distributional similarity 288, 291 (hram) 208
document categorization 154 hypothesis 167
document classification 137
document frequency 211 I.
information extraction 17
E. information extraction (ie) 173
editing transformations 174 information personalisation 99
edr 26 information retrieval (ir) 25, 173
Eisner’s algorithm 57 initial reference 111
entailment 168, 187-189, 191-195 intelligent tutoring systems (its)
patterns 175 187
recognition 201 intended referent 100, 102-104, 106,
relation 167, 168 107
rules 169, 172, 174, 175 inter-annotator agreement 95
score 170 inverse document frequency (idf)
semantic 171 171
textual 201 itemization 207
itemized expression set (ies) 209
F. its see intelligent tutoring sys-
feature tems
contextual 19
local 18 K.
forward maximum match 129 k-nearest neighbors (knn) 140
FrameNet 90 kappa statistic 215
Spanish 91 knn see k-nearest neighbors
function words 268 knowledge base 99
G. L.
generalized latent semantic anal- language density 247, 253, 256,
ysis (glsa) 45 257
INDEX OF SUBJECTS AND TERMS 305
language variability 167 named entity (ne) 18

latent semantic analysis (lsa) 45 natural language understanding 167
lexical entailment 174 natural language generation (nlg)
lexical equivalence 229 99
lexical ontology 217 ne see named entity
lexical resource 168 negation 188, 191, 192, 195
lexical transfer selection 228, 231, Negra treebank 79, 80
232 Nihongo Goi Taikei thesaurus 213
lexicon nlg see natural language gener-
bilingual 228, 233 ation
corpus-specific 229 NomBank 20
statistical 228 nominalization 20
look-ahead search 67 non-projective 71
LoPar parser 82 non-trivial hypernym 213
M. O.
machine translation (mt) 173, 227, one-anaphoric expressions 110
277 one-to-one count 254
Maximum Entropy 131 optimal cost 175
features 131 overfitting 291
mbl see memory-based learning
memory-based learning (mbl) 117 P.
Message Understanding Conference pairwise mutual information 211
(muc) 18 parallel corpus 227, 228, 247, 248,
messages 101 250, 253, 254, 256, 257,
micro-planning 101 279
minimal distinguishing description paraphrase
106 semantic 20
minimal set cover 106 syntactic 20
minimum redundancy cut (mrc) paraphrase acquisition 173
26 paraphrase recognition 201
modeling gaps 270 paraphrasing resources 175, 176
mrc see minimum redundancy cut parser 170
muc see Message Understanding parsing
Conference partial 19
pattern matching 18
N. pcfg parsing 79
n-gram 131, 133 Penn Treebank 68, 79, 91
Naı̈ve Bayes 132, 150, 152, 153, pipeline process 55
155, 158 planning operators 105
text classification 149 point-wise mutual information (pmi)
name cache 19 47
name tagging 19 potential distractors 104, 105, 107
306 INDEX OF SUBJECTS AND TERMS
precision 229, 234 semantically coherent word class

predicate-argument representation 207
20 sense ambiguity 277
principle of adequacy 105, 108 sense clustering 202
principle of efficiency 105, 106, 108 SenSem 89, 91
principle of sensitivity 105, 108 sentence alignment 247, 250, 251
profiling 262 sentence plan 101
projective language 58 sentence semantics 89, 92
PropBank 20, 91 sentence splitter 170
shift-reduce parser 59
Q. similarity 172
question answering (qa) 168, 173 similarity database 173
R. singular value decomposition (svd)
random-walk algorithms 149 47
re-ranking 19 smoothing (Laplace) 132
reading comprehension 173 spam 125, 133-135
recall 229, 234 content model 135
recognizing textual entailment (rte) sparseness 261
task 167 speech acts 105
referring expression generation 100 splog 125-127
regular expressions 17 stylistic analysis 275
relational properties 108 summarization
relative clause 20 extractive 147
rhetorical relation 157 support vector machines (svm) 143,
rhetorical structure theory (rst) 211
158, 159 svm see support vector machines
Rocchio, text classification 149 synset 217-226
syntactic functions 89, 91, 93, 95,
S. 96
segmentation 126-128 syntactic matching 169
contextual 131 syntactico-semantic interface 89-
dictionary-based 128 91
non-symmetric, multi-break 130 system 168
symmetric sliding window 129
segmented discourse representation T.
theory (sdrt) 159 TüBa-D/Z treebank 80, 116
semantic coherence 207 term re-occurrence 269, 270
semantic indexing 25 term relevance 138, 139
semantic inferences 167 text categorization 147
semantic network 137, 139, 220 text classification 147
semantic roles 89, 91-93, 95 text planning 101
semantic similarity 197 text processing 170
INDEX OF SUBJECTS AND TERMS 307
text semantic similarity 197 V.

text summarization 148 verbal aspect 92, 93
textual entailment 173, 175 verbal lexicon 89-92, 97
thesaurus 172 verbal sense 89, 91, 92
threshold 168, 171
tipster dataset 270 W.
transformation 169 word
translation model 252 polysemous 227-230
translational equivalence 227 target word selection 228
tree edit distance 168, 175, 176 univocal 230
tree mapping 175 usage 227, 228
treebank comparison 79 word sense 227, 228
trivial hypernym 213 word sense disambiguation (wsd)
type-to-token ratio 270 228, 277
word sense discrimination 228
U. word sense similarity 202
uniform resource identifier 127 word similarity 198
uniform resource locator 127 WordNet 20, 26, 198, 200, 202,
uri see uniform resource identi- 203, 217-220, 225, 226
fier domains 222, 226
url see uniform resource locator
CURRENT ISSUES IN LINGUISTIC THEORY
e. F. K. Koerner, editor
Zentrum für Allgemeine sprachwissenschat, typologie
und universalienforschung, Berlin
ek.koerner@rz.hu-berlin.de
Current Issues in Linguistic heory (cilt) is a theory-oriented series which welcomes contributions from
scholars who have signiicant proposals to make towards the advancement of our understanding of lan-
guage, its structure, functioning and development. CILT has been established in order to provide a forum
for the presentation and discussion of linguistic opinions of scholars who do not necessarily accept the
prevailing mode of thought in linguistic science. It ofers an outlet for meaningful contributions to the
current linguistic debate, and furnishes the diversity of opinion which a healthy discipline must have.
A complete list of titles in this series can be found on the publishers’ website, www.benjamins.com
293 Detges, Ulrich and Richard WalteReit (eds.): he Paradox of Grammatical Change. Perspectives
from Romance. v, 254 pp. Expected February 2008
292 Nicolov, Nicolas, Kalina BoNtcheva, galia aNgelova and Ruslan MitKov (eds.): Recent
Advances in Natural Language Processing IV. Selected papers from RANLP 2005. 2007. xii, 307 pp.
291 BaaUW, sergio, Frank DRijKoNiNgeN and Manuela PiNto (eds.): Romance Languages and
Linguistic heory 2005. Selected papers from ‘Going Romance’, Utrecht, 8–10 December 2005. 2007.
viii, 338 pp.
290 MUghazy, Mustafa a. (ed.): Perspectives on Arabic Linguistics XX. Papers from the twentieth annual
symposium on Arabic linguistics, Kalamazoo, Michigan, March 2006. xii, 247 pp. Expected December 2007
289 BeNMaMoUN, elabbas (ed.): Perspectives on Arabic Linguistics XIX. Papers from the nineteenth annual
symposium on Arabic Linguistics, Urbana, Illinois, April 2005. xiv, 274 pp. + index. Expected December 2007
288 toivoNeN, ida and Diane NelsoN (eds.): Saami Linguistics. 2007. viii, 321 pp.
287 caMacho, josé, Nydia FloRes-FeRRáN, liliana sáNchez, viviane DéPRez and María josé
caBReRa (eds.): Romance Linguistics 2006. Selected papers from the 36th Linguistic Symposium on
Romance Languages (LSRL), New Brunswick, March-April 2006. 2007. viii, 340 pp.
286 WeijeR, jeroen van de and erik jan van der toRRe (eds.): Voicing in Dutch. (De)voicing – phonology,
phonetics, and psycholinguistics. 2007. x, 186 pp.
285 sacKMaNN, Robin (ed.): Explorations in Integrational Linguistics. Four essays on German, French, and
Guaraní. ix, 217 pp. Expected January 2008
284 salMoNs, joseph c. and shannon DUBeNioN-sMith (eds.): Historical Linguistics 2005. Selected
papers from the 17th International Conference on Historical Linguistics, Madison, Wisconsin, 31 July - 5
August 2005. 2007. viii, 413 pp.
283 leNKeR, Ursula and anneli MeURMaN-soliN (eds.): Connectives in the History of English. 2007.
viii, 318 pp.
282 PRieto, Pilar, joan MascaRó and Maria-josep solé (eds.): Segmental and prosodic issues in
Romance phonology. 2007. xvi, 262 pp.
281 veRMeeRBeRgeN, Myriam, lorraine leesoN and onno cRasBoRN (eds.): Simultaneity in Signed
Languages. Form and function. 2007. viii, 360 pp. (incl. CD-Rom).
280 heWsoN, john and vit BUBeNiK: From Case to Adposition. he development of conigurational syntax
in Indo-European languages. 2006. xxx, 420 pp.
279 NeDeRgaaRD thoMseN, ole (ed.): Competing Models of Linguistic Change. Evolution and beyond.
2006. vi, 344 pp.
278 Doetjes, jenny and Paz goNzález (eds.): Romance Languages and Linguistic heory 2004. Selected
papers from ‘Going Romance’, Leiden, 9–11 December 2004. 2006. viii, 320 pp.
277 helasvUo, Marja-liisa and lyle caMPBell (eds.): Grammar from the Human Perspective. Case,
space and person in Finnish. 2006. x, 280 pp.
276 MoNtReUil, jean-Pierre y. (ed.): New Perspectives on Romance Linguistics. Vol. II: Phonetics,
Phonology and Dialectology. Selected papers from the 35th Linguistic Symposium on Romance Languages
(LSRL), Austin, Texas, February 2005. 2006. x, 213 pp.
275 NishiDa, chiyo and jean-Pierre y. MoNtReUil (eds.): New Perspectives on Romance Linguistics. Vol.
I: Morphology, Syntax, Semantics, and Pragmatics. Selected papers from the 35th Linguistic Symposium on
Romance Languages (LSRL), Austin, Texas, February 2005. 2006. xiv, 288 pp.
274 gess, Randall s. and Deborah aRteaga (eds.): Historical Romance Linguistics. Retrospective and
perspectives. 2006. viii, 393 pp.

The Sensem Project Syntactico Semantic A PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

The Sensem Project Syntactico Semantic A PDF

Uploaded by

Copyright:

Available Formats

Recent AdvAnces in nAtuRAl lAnguAge PRocessing

series iv – cuRRent issues in linguistic tHeoRY

Advisory Editorial Board

lyle campbell (salt lake city); sheila embleton (toronto)

nicolas nicolov, Kalina Bontcheva, galia Angelova

Recent Advances in Natural Language Processing IV.

JoHn BenJAMins PuBlisHing coMPAnY

Library of Congress Cataloging-in-Publication Data

I. COMPUTATION FOR LINGUISTICS

II. INFORMATION EXTRACTION & INDEXING

Florian Seydoux & Jean-Cédric Chappelier

Niraj Aswani, Valentin Tablan, Kalina Bontcheva & Hamish

Irina Matveeva, Gina-Anne Levow, Ayman Farahat & Christiaan Royer

Laura Alonso, Joan Antoni Capilla, Irene Castellón, Ana Fernández-

IV. ANAPHORA & REFERRING EXPRESSIONS

Erhard W. Hinrichs, Katja Filippova & Holger Wunsch

Filip Ginter, Sampo Pyysalo & Tapio Salakoski

Rada Mihalcea & Samer Hassan

Caroline Sporleder & Alex Lascarides

VI. TEXTUAL ENTAILMENT &

Vasile Rus & Arthur Graesser

Courtney Corley, Andras Csomai & Rada Mihalcea

Eduard Barbu & Verginica Barbu Mititelu

VIII. MACHINE TRANSLATION

Victoria Arranz, Elisabet Comelles & David Farwell

Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy,

Anne De Roeck, Avik Sarkar & Paul H. Garthwaite

Lucia Specia, Maria das Graças V. Nunes & Mark Stevenson

Francis Chantree, Alistair Willis, Adam Kilgarriff & Anne de Roeck

List and Addresses of Contributors 297

Index of Subjects and Terms 303

Sandra Kübler (Indiana University, Bloomington)

John Tait (University of Sunderland)

August 2007 Nicolas Nicolov

The goal of this paper is to urge computational linguists to explore issues in

Computational Linguistics (CL) is often characterized as having a theo-

We shall examine dialectology first because it is an area we have directly

Nerbonne (2003) discusses at greater length the computational issues in an-

It is important to report here, as well, that specialists in dialectology—and

and/or dialect continua relied too extensively on the analysts’ choice of

pointing. For example, sociolinguistics has largely succeeded dialectology in

Diachronic linguistics investigates how languages change, and, most spec-

Studies of children’s acquisition of language are interesting to all sorts of in-

plying minimal description length to the problem of segmenting the speech

Language contact study is an active branch of linguistics focused on rec-

multilingual speakers often, albeit unconsciously, impose the structures of

ing inflectional and/or derivational structure to the “family size” of a word

We have urged computational linguists to consider how much they might

purpose of exploring linguistic structure or its processing, but its use in

Goldsmith, John. 2001. “Unsupervised Learning of the Morphology of a Natural

Moscoso del Prado Martı́n, Fermin. 2003. Paradigmatic Structures in Morpholog-

1 The challenges of information extraction

Information Extraction (ie) provides an interesting perspective from which

2 Identifying instances of a linguistic expression

In simplest terms, ie can be realized as a pattern matching process. We

2.1 Stages of analysis and joint inference

This fails to capture such basic tendencies as the increased likelihood of a

2.2 Stages of analysis: Alternative evidence

is more accurate than full parsing.3 However, machine learning methods

3 Finding linguistic expressions of an event or relation

3.1 Syntactic paraphrase

3.2 Semantic paraphrase

limited to single-word paraphrases and so cover only a portion of the myr-