AN APPROACH FOR INTERCONNECTING LEXICAL RESOURCES

LIVIU-ANDREI SCUTELNICU1, 2, ANCA-DIANA BIBIRI1, DAN CRISTEA1, 2

1 Faculty of Computer Science, Alexandru Ioan Cuza University of Iași
2 Institute of Computer Science of the Romanian Academy, Iași Branch

Abstract
Nowadays, great efforts are being made to create linguistic resources and
technologies for the Romanian language, in order to enhance the level of
research and innovation dedicated to it. Our work is dedicated to constructing
lexical resources that shed light on multiple aspects of exploiting our
language, in its synchronic as well as its diachronic dimensions, and to
knowledge engineering. To this end, we use different lexical resources for
Romanian, such as RoWordNet, DEX-Online and the COROLA corpus. As a case
study, we refer to the tip-of-the-tongue (TOT) problem, i.e. how to help a
person recall a word that is not immediately reachable. Our solution first
considers a standardization of the lexical resources that makes navigation
possible without concern for the representation of the data. The study makes a
step towards building a tool that would help people gain fluency and
continuity of ideas in communication, and it can also be useful to researchers
in natural language processing and to linguists interested in combining
different lexical resources.
Keywords: Corpus, Dictionary, Interconnection, Lexical Resources, WordNet

1. Introduction

The existence of lexical resources is of utmost importance for language technology.


Acquiring and developing the basic tools for the creation, structuring, exploitation and
maintenance of language resources depends on the datasets specific to each
language, which vary from simple word lists to complex resources such as
dictionaries, annotated corpora, grammars, treebanks, thesauri, ontologies, language
models, etc. At the same time, the involvement of several types of resources in complex
applications makes their easy interconnection compulsory. The heterogeneity of
language resources calls for integration and interoperability (Chiarcos et al., 2012), so
that they are reusable in many applications and available to the NLP community.
In this paper we propose a methodology for making different linguistic
knowledge bases compatible and exemplify this approach with a number of existing resources for
the Romanian language. The immediate application is a solution to the TOT
problem, but the approach could be useful in other application settings that require the
combination of heterogeneous resources.

2. Lexical resources
The lexical resources used in this study are monolingual, all representing the Romanian
language. Lexical resources are databases structured in many ways: word lists,
dictionary entries and synsets. More precisely, the following four resources have been
used, each with a different structure:
RoWordNet1: the Romanian WordNet (Tufiș & Cristea, 2002; Tufiș et al.,
2013) was created following the model of the Princeton WordNet (Fellbaum, 1998).
This database contains nouns, verbs, adjectives and adverbs grouped into sets of
cognitive synonyms called synsets, each expressing a distinct concept. Every
synset has a unique identifier and groups a set of lexicals paired with their sense
IDs; on this basis, the sense number of the searched/target word in the respective
synset can be extracted. The synsets for nouns and verbs are placed in a
conceptual hierarchy by hypernym/hyponym relations;
COROLA: the Computational Representative Corpus of Contemporary
Romanian Language is a resource currently under construction as a priority project of
the Romanian Academy. At the end of the project (2017), this corpus, built in
collaboration by the Research Institute for Artificial Intelligence in Bucharest and
the Institute of Computer Science in Iași, is supposed to include texts totalling
500 million Romanian words, acquired from a broad range of domains (politics,
humanities, theology, science, etc.) and covering all literary genres (prose,
poetry, theatre, scientific texts, journalistic texts, etc.) (Bibiri et al., 2015). In
order to acquire its running form, the texts (obtained on the basis of written
protocols with our providers), after discharging the formatting ballast, passed
through a processing chain which included at minimum: sentence segmentation,
tokenization, lemmatization and part-of-speech tagging. In our research we have
used about 130 million words (out of the 200 million which make up the corpus at
present);
DEX-Online2: a free online dictionary created by voluntary
contributions, merging a collection of sense definitions for 75,000 entries
extracted from prestigious dictionaries of the Romanian language;
A semantic lexicon: a vocabulary containing 6,000 words grouped by
grammatical categories (nouns, verbs, adjectives and adverbs) and semantic
classes (politics, religion, family, etc.).
Each of these resources is available (under specific licences) in XML format.
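As an illustration of how such XML-coded entries can be read programmatically, here is a minimal Python sketch. The tag and attribute names below are hypothetical stand-ins, since the actual schemas of the four resources differ.

```python
import xml.etree.ElementTree as ET

# A toy synset entry; the real RoWordNet schema may use different
# tag and attribute names -- this layout is only an assumption.
SAMPLE = """
<synset id="ENG30-example-n">
  <pos>noun</pos>
  <def>conducator al unei manastiri</def>
  <literal sense="1">abate</literal>
  <literal sense="2">staret</literal>
</synset>
"""

def parse_synset(xml_text):
    """Read one synset: its identifier, part of speech, definition,
    and the member words paired with their sense numbers."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "pos": root.findtext("pos"),
        "definition": root.findtext("def"),
        "literals": [(lit.text, lit.get("sense"))
                     for lit in root.findall("literal")],
    }

entry = parse_synset(SAMPLE)
```

A uniform in-memory shape of this kind is what makes the merge operation of Section 3 possible regardless of each resource's original markup.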

3. Towards lexical resources

Building up an infrastructure of language resources only makes sense if these resources
are standardized (Teubert, 1997). The development of reusable resources for
interchange and open-ended retrieval tasks always depends on the resource
having a homogeneous aspect (Sinclair, 1994). Thus, standardized
methods and specifications are needed for the process of comparison and for building
links/bridges between the resources taken into consideration.
As mentioned, our resources are coded in XML, but this common format is too general
to ensure compatibility. The idea is to rewrite the resources in a form that can be
used by the same interrogation code. Let us note, for the time being, that such goals
were formulated a long time ago.

1 http://www.racai.ro/en/tools/text/rowordnet/
2 https://dexonline.ro/
The Text Encoding Initiative (TEI), for instance, proposed an inventory of the features most
often deployed in computer-based text processing and formulated
recommendations about suitable ways of representing them. The idea was to
facilitate the processing of different types of resources by computer programs and
the loss-free interchange of data among individuals and research groups using
different programs, computer systems or application software. However, TEI is only a
near-standard, not a standard in itself. Closer to a standard is the Lexical Markup
Framework (LMF), the ISO (International Organization for Standardization)
ISO/TC37 standard for natural language processing (NLP) and machine-readable
dictionary lexicons3.
Both LMF and TEI model lexical material at a deep representational detail. However, we
want more than that, since our primary intention is to interconnect different
lexical material in an easy way. Here are the features we want to acquire:
if two resources are to be connected, their contents are simply merged, where
merging means putting the information together, not repeating common
information, and being specific where differences occur; the differences
should be reflected by specific feature names;
we should be able to represent variations in word forms, alternate orthography,
diachronic morphology, etc., if the primary resources include such data;
restrictions in querying the merged resource should be made possible by
applying various filtering criteria;
if the lexicographic data is hierarchical (for instance, a sense of a dictionary
entry contains, apart from a definition and examples, also sub-senses), then this
should be reflected in the representation and in the processing, by recursive searches.
Here is an example: give me the definition-neighbouring sphere of depth 2 of
the word captain should be solved as follows: take all senses of the entry captain and
form the list of words in the corresponding definitions, then, for each of them,
take all their senses and collect again the words in their definitions.
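The recursive search in this example amounts to a breadth-first expansion over definitions. The following Python fragment is a minimal sketch using a toy dictionary in place of the merged XML resource; the data structure (a word mapped to the words of its definitions) is an assumption for illustration.

```python
def definition_sphere(word, definitions, depth):
    """Collect all words reachable through chains of sense definitions.

    `definitions` maps a word to the list of words occurring in the
    definitions of its senses (a toy stand-in for a parsed dictionary).
    """
    collected = set()
    frontier = {word}
    for _ in range(depth):
        next_frontier = set()
        for w in frontier:
            for d in definitions.get(w, []):
                if d not in collected:
                    collected.add(d)
                    next_frontier.add(d)
        frontier = next_frontier
    return collected

# Toy dictionary: each entry lists the words of its definitions.
toy = {
    "captain": ["officer", "ship"],
    "officer": ["rank", "authority"],
    "ship": ["vessel", "water"],
}
sphere = definition_sphere("captain", toy, 2)
# depth 1 yields the definition words of "captain";
# depth 2 adds the definition words of those words in turn
```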
By applying these criteria, we implemented the merge operation by analysing the
contents of each XML-coded linguistic resource given as input, in order to determine both
their common and uncommon parts. For example, in the resources mentioned in the
previous section, some fields of the XML representation, such as definition, lemma and
part of speech, exist in all resources and, as such, were coded only once. The
other tags, which code specific content, remain in each resource.
Thus, after parsing the XML files and retrieving the information, we created a shared
resource for all inputs as follows: initially, for each entry (one entry represents a
word with all its related information: definition, examples, synonyms, part of speech, etc.)
we took the common parts and inserted them into an XML tag; for the uncommon parts,
different XML tags were created, one for each resource used, containing the specific
information that did not match the other resources; from the COROLA corpus we extracted
contexts that contain the searched word and its neighbours (4 preceding and 4
following), a kind of concordance (Figure 1).

3 http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=37327

Figure 1: Example of an XML entry after merging different lexical resources
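The neighbour extraction from COROLA (4 words before and 4 after each occurrence) is a classic keyword-in-context concordance. A minimal sketch, assuming the corpus is already tokenized into a list of strings:

```python
def concordance(tokens, target, window=4):
    """Return each occurrence of `target` together with up to `window`
    tokens of context on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

# Toy sentence standing in for a COROLA document.
hits = concordance("a b c target d e".split(), "target")
```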

4. A case study: the TOT state

However easily we talk, sometimes we retrieve words with difficulty: we look
for a word in the process of communication and it does not come to mind. This is
known as the tip-of-the-tongue (TOT) state or problem. The TOT phenomenon occurs to
each individual at least once per week. It is clear that a tool that would help a person
recover a forgotten word during a conversation or during a creative writing process
would have to be handy, act fast and involve a very large collection of words, properly
manipulated. Zock (2015) proposes a three-step methodology. He notes that, in searching for
a word that does not come to mind (the target word), we always start from one or more
source words. In a first step of the search, out of these source words, an extension
process generates a large number of associated words, too large to be
handled visually by the subject. Then, in a second step, this family of words is
clustered automatically, each cluster containing a manageable number of
semantically/phonetically/etc. related words. If the number of clusters is itself
manageable, then, in the last step, the subject decides which of the representative
words of the clusters is closest to the target word. Once a cluster is chosen, s/he
searches it and, if lucky, the target word is there. If unlucky, the process
is iterated, starting from a subset of the chosen cluster, which is now supposed to
be closer to the target word than at the beginning of the search process.
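Zock's three steps can be sketched as a small pipeline: expand the source word(s) through an association map, cluster the candidates, and hand the clusters to the user for inspection. The association map and the clustering criterion below are toy stand-ins; a real system would draw associations from the merged resource and cluster semantically or phonetically.

```python
from collections import defaultdict

def tot_search(source_words, associations, cluster_key):
    """Sketch of the three steps: (1) expand the source word(s) into a
    candidate set, (2) cluster the candidates, (3) return the clusters
    for the user to scan for the target word."""
    # Step 1: expansion via an association map (assumed precomputed
    # from the merged resource: synonyms, hypernyms, co-occurrences).
    candidates = set()
    for w in source_words:
        candidates.update(associations.get(w, []))
    # Step 2: clustering; here a stand-in criterion (first letter).
    clusters = defaultdict(list)
    for c in sorted(candidates):
        clusters[cluster_key(c)].append(c)
    # Step 3: the user picks the most promising cluster and scans it.
    return dict(clusters)

# Toy association map for the source word "abate".
clusters = tot_search(["abate"],
                      {"abate": ["vicar", "staret", "preot"]},
                      cluster_key=lambda w: w[0])
```

If no cluster yields the target, the chosen cluster's members become the new source words and the pipeline is run again, as in the iterated search described above.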
Generally, we reach for a dictionary of synonyms, hyponyms or hypernyms
to find a similar word that could help revive the missing item from memory.
The tip-of-the-tongue state is a very familiar problem not only for
speakers engaged in a discussion but also for writers, who can lose their
ideas. It puts pressure on a person's mind, because it interrupts the fluency of the
dialogue while they try to make connections to other words: similar words,
synonyms, antonyms, etc., in order to reach the target word. This is one of the most
frequently encountered problems nowadays, probably aggravated by the decline of
reading, which would otherwise lead to vocabulary enrichment.
Applied example: having the new XML obtained after combining the aforementioned
resources, we consider a concrete case: the target word is vicar, but we do not have
this word in mind on the spot, and the first word that crosses our mind is abate (the
noun abate, translated as abbot; we try to reach, after a series of
algorithmic steps, the target word vicar).
From the structure of RoWordNet we noticed that this noun belongs to the religious field, which
helped us search the DEX-Online definitions for other words from this field,
which we named keywords. We applied the same criteria to the neighbours of the
input word extracted from the COROLA corpus. From this list of extracted keywords,
a list of about 300 words was generated (for the given example word), representing
synonyms, hypernyms and hyponyms from RoWordNet. After sorting the list
alphabetically, we found the targeted word vicar in it.
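The keyword-based filtering of this worked example can be sketched as follows; the toy definitions and keyword set are illustrative only, standing in for DEX-Online definitions and the religious-field keywords derived from RoWordNet.

```python
def filter_by_keywords(candidates, definitions, keywords):
    """Keep the candidates whose dictionary definition mentions at least
    one domain keyword, then sort the survivors alphabetically."""
    kept = [w for w in candidates
            if keywords & set(definitions.get(w, "").split())]
    return sorted(kept)

# Toy definitions standing in for DEX-Online entries.
defs = {
    "vicar": "preot care ajuta un episcop",  # religious-field definition
    "capitan": "grad militar",
}
result = filter_by_keywords(["capitan", "vicar"], defs,
                            {"preot", "episcop"})
```

Scanning the sorted, filtered list is the final manual step in which the speaker recognizes the target word.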

5. Conclusions and future work

Our society is currently in a process of rapid evolution and change. Language
resources, as dynamic repositories of linguistic knowledge, have to be continuously
enriched and easily exploitable; the concatenation of many lexical resources also
serves to enrich statistical descriptions and to obtain more accurate linguistic resources.
In this paper we presented an approach to unifying several lexical resources and showed
how the standardization of different lexical resources can contribute to solving
the TOT problem.
The clusterization of the resulting list of semantically related words is meant to optimize the
task and to ease the finding of the target word. From this list, keywords related
to the target word can be selected, thus applying filters that narrow the
search area for the aimed word.
As a representation formalism, RDF (Resource Description Framework) is based on
the notion of triples and provides a graph-based data model (Chiarcos et al., 2012). It
facilitates data merging, allowing structured and semi-structured data to be combined,
exposed and shared across various applications.
Acknowledgements

This work was co-funded by the European Social Fund through the Sectoral Operational
Programme Human Resources Development 2007-2013, project number
POSDRU/187/1.5/S/155397, project title "Towards a New Generation of Elite
Researchers through Doctoral Scholarships".
Many thanks to Daniela Gîfu for the semantic lexicon.

References

A. D. Bibiri, C. Bolea, L. A. Scutelnicu, A. Moruz, L. Pistol, D. Cristea. 2015. Metadata
of a Huge Corpus of Contemporary Romanian. Data and Organization of the Work,
in Proceedings of the 7th Balkan Conference in Informatics. ACM, New York.
ISBN 978-1-4503-3335-1.
C. Chiarcos, S. Hellmann, S. Nordhoff. 2012. Introduction and Overview, in C.
Chiarcos, S. Hellmann, S. Nordhoff (eds.) Linked Data in Linguistics.
Representing and Connecting Language Data and Language Metadata. Springer-
Verlag, Berlin Heidelberg, 1-12.
C. Fellbaum (ed.). 1998. WordNet: An Electronic Lexical Database. Cambridge, MA,
MIT Press.
John Sinclair. 1994. Corpus Typology. EAGLES Document EAG-CWG-IR-2,
version of October.
Wolfgang Teubert. 1997. Language Resources for Language Technology, in Dan Tufiș,
Poul Andersen (eds.) Recent Advances in Romanian Language Technology,
Romanian Academy Publishing House, Bucharest, 25-34.
Dan Tufiș, Dan Cristea. 2002. Methodological Issues in Building the Romanian
Wordnet and Consistency Checks in Balkanet. Proceedings of LREC, Las
Palmas, 35-41.
Dan Tufiș, Verginica Barbu Mititelu, Dan Ștefănescu, Radu Ion. 2013. The Romanian
wordnet in a nutshell. Language Resources and Evaluation 47(4): 1305-1314.
