An Approach For Interconnecting Lexical Resources
Liviu-Andrei Scutelnicu, Anca Diana Bibiri, Dan Cristea
1 Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iași
2 Institute of Computer Science of the Romanian Academy, Iași Branch
Abstract
Nowadays, great efforts are being made to create linguistic resources and technologies for the Romanian language, in order to raise the level of research and innovation dedicated to this language. Our work is dedicated to constructing lexical resources that shed light on multiple aspects of exploiting our language, in its synchronic as well as its diachronic dimensions, and to knowledge engineering. To this end, we use different lexical resources for Romanian, such as RoWordNet, DEX-Online and the COROLA corpus. As a case study, we refer to the tip-of-the-tongue (TOT) problem, i.e. how to help a person recall a word that is not immediately reachable. Our solution considers first a standardization of the lexical resources that makes possible a navigation process free of concerns about data representation. The study takes a step forward towards building a tool that would help people gain fluency and continuity of ideas in communication, and that can also be useful to researchers in natural language processing and to linguists interested in combining different lexical resources.
Keywords: Corpus, Dictionary, Interconnection, Lexical Resources, WordNet
1. Introduction
2. Lexical resources
Our lexical resources are monolingual in this study, namely representing the Romanian language. Lexical resources are databases structured in many ways: word lists, dictionary entries and synsets. More precisely, the following four resources have been used, each with a different structure:
RoWordNet1, the Romanian WordNet (Tufiș & Cristea, 2002; Tufiș et al., 2013), is created following the model of Princeton WordNet (Fellbaum, 1998). This database contains nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Every synset has a unique identifier and groups a set of literals (word forms) paired with their sense IDs; on this basis, the sense number of the searched/target word in the respective synset can be extracted. The synsets for nouns and verbs are placed in a conceptual hierarchy by hypernym/hyponym relations;
COROLA, the Computational Representative Corpus of Contemporary Romanian Language, is a resource currently under construction as a priority project of the Romanian Academy. At the end of the project (2017), this corpus, built in collaboration by the Research Institute for Artificial Intelligence in Bucharest and the Institute of Computer Science in Iași, is supposed to include texts totalling 500 million Romanian words, acquired from a broad range of domains (politics, humanities, theology, science, etc.) and covering all literary genres (prose, poetry, theatre, scientific texts, journalistic texts, etc.) (Bibiri et al., 2015). In order to acquire its running form, the texts (obtained on the basis of written protocols with our providers), after being stripped of their formatting ballast, passed through a processing chain which included, at a minimum: sentence segmentation, tokenization, lemmatization and part-of-speech tagging. In our research we have used about 130 million words (out of the 200 million which make up the corpus at present);
DEX-Online2, a free online dictionary created by volunteer contributions, merging a collection of sense definitions for 75,000 entries extracted from prestigious dictionaries of the Romanian language;
A semantic lexicon, a vocabulary containing 6,000 words grouped by grammatical categories (nouns, verbs, adjectives and adverbs) and semantic classes (politics, religion, family, etc.).
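To make these structural differences concrete, here is a minimal sketch, in Python, of how a RoWordNet-style synset entry could be read from its XML distribution (each resource is distributed in XML, as discussed below). The tag names (SYNSET, ID, SYNONYM, LITERAL, SENSE) are assumptions modelled on the description above, not a guaranteed rendering of the actual RoWordNet schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical RoWordNet-style synset record; tag names are assumptions
# modelled on the description above, not the exact RoWordNet schema.
SAMPLE = """
<SYNSET>
  <ID>ENG30-02084071-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>câine<SENSE>1</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Mamifer carnivor domesticit ...</DEF>
</SYNSET>
"""

def parse_synset(xml_text):
    """Return the synset id and its (literal, sense-number) pairs."""
    root = ET.fromstring(xml_text)
    synset_id = root.findtext("ID")
    literals = []
    for lit in root.iter("LITERAL"):
        # The word form is the element text; the sense number is
        # carried by a nested <SENSE> element.
        literals.append((lit.text.strip(), lit.findtext("SENSE")))
    return synset_id, literals

print(parse_synset(SAMPLE))  # ('ENG30-02084071-n', [('câine', '1')])
```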
Each of these resources is available (under specific licences) in XML format, but each with its own schema; our aim is a unified representation that can be used by the same interrogation code. Let us note, for the time being, that such goals were formulated a long time ago.
The Text Encoding Initiative (TEI), for instance, has proposed an inventory of the features most often deployed in computer-based text processing and has formulated recommendations about suitable ways of representing these features. The idea was to facilitate the processing of different types of resources by computer programs and the loss-free interchange of data amongst individuals and research groups using different programs, computer systems or application software. However, TEI is only a near-standard, not a standard in itself. Closer to a standard is the Lexical Markup Framework (LMF), the International Organization for Standardization (ISO) TC37 standard for natural language processing (NLP) and machine-readable dictionary lexicons3.
3 http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=37327
Both LMF and TEI model lexical material in deep representational detail. However, we want more than that, since our primary intention is to interconnect different lexical materials in an easy way. Here are the features we want to acquire:
if two resources are to be connected, simply merge their contents, where merging means putting the information together without repeating common information, while staying specific where differences occur. The differences should be reflected by specific feature names;
we should be able to represent variations in word forms, alternate orthography, diachronic morphology, etc., if the primary resources include such data;
it should be possible to restrict queries over the merged resource by applying various filtering criteria;
if the lexicographic data is hierarchical (for instance, a sense of a dictionary entry contains, apart from a definition and examples, also sub-senses), then this should be reflected in the representation and processed by recursive searches. Here is an example: "give me the definition-neighbouring sphere of depth 2 of the word captain" should be solved by: take all senses of the entry captain and form the list of words in the corresponding definitions, then, for each of them, take all their senses and collect again the words in their definitions (see the sketch after this list).
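As an illustration of this last requirement, the depth-bounded "definition sphere" query can be sketched recursively as follows. This is only a sketch: senses_of and words_in_definition are hypothetical accessors over the merged resource, introduced here for the example.

```python
def definition_sphere(word, depth, senses_of, words_in_definition):
    """Collect the words reachable through definitions, up to `depth`.

    `senses_of(word)` is assumed to return the senses of a dictionary
    entry; `words_in_definition(sense)` the content words of one
    definition.  Both are hypothetical accessors over the merged
    resource.
    """
    if depth == 0:
        return set()
    neighbours = set()
    for sense in senses_of(word):
        neighbours.update(words_in_definition(sense))
    # Recurse: the sphere of depth d also contains the spheres of
    # depth d-1 around every word collected so far.
    for w in list(neighbours):
        neighbours |= definition_sphere(w, depth - 1,
                                        senses_of, words_in_definition)
    return neighbours
```

For depth 2 and the word captain, the first recursion level collects the words of the definitions of captain, and the second level the words of their definitions, exactly as described above.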
By applying these criteria, we implemented the merge operation by analysing the contents of each XML-coded linguistic resource given as input, to determine both their common and their specific parts. For example, in the resources mentioned in the previous section, some fields of the XML representation, such as definition, lemma and part of speech, exist in all resources and, as such, were coded only once. The other tags, which code specific content, remain attached to each resource.
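One simple reading of this merge rule can be sketched as follows, assuming each resource has already been parsed into a dictionary of fields for the same lemma; here a field counts as common only when every resource supplies it with an identical value. The function and field names are illustrative, not part of our released code.

```python
def merge_entries(entries):
    """Merge one lemma's entries coming from several resources.

    `entries` maps a resource name (e.g. "rowordnet", "dexonline")
    to a dict of field -> value for the same lemma (illustrative
    names).  Fields present in every resource with an identical
    value are emitted once, as common; all other fields stay in a
    per-resource block, mirroring the per-resource XML tags.
    """
    all_fields = set().union(*(e.keys() for e in entries.values()))
    common = {}
    specific = {name: {} for name in entries}
    for field in sorted(all_fields):
        values = {name: e[field] for name, e in entries.items() if field in e}
        if len(values) == len(entries) and len(set(values.values())) == 1:
            common[field] = next(iter(values.values()))
        else:
            for name, value in values.items():
                specific[name][field] = value
    return common, specific
```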
Thus, after parsing the XMLs and retrieving the information, we created a shared resource for all inputs as follows: initially, for each entry (one entry represents a word with all related information: definition, examples, synonyms, part of speech, etc.), we took the common parts and inserted them into a single XML tag; for the uncommon parts, separate XML tags were created, one for each resource used, each containing the information specific to that resource and absent from the other resources; from the COROLA corpus we extracted contexts that contain the searched word and its neighbours (4 preceding and 4 following), a kind of concordance (Figure 1).
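The concordance extraction itself reduces to a window scan over the tokenized corpus. A minimal sketch, assuming the corpus is accessible as a flat list of tokens (an assumption made for brevity; COROLA is in fact processed document by document):

```python
def concordances(tokens, target, window=4):
    """Return each occurrence of `target` as a KWIC-style line:
    up to `window` tokens on the left, the hit itself, and up to
    `window` tokens on the right (4 and 4, as in our extraction)."""
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits
```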
3. A case study: the tip-of-the-tongue problem
However easily we usually talk, we sometimes retrieve words with difficulty: we look for a word in the process of communication and it does not come to mind. This is known as the tip-of-the-tongue (TOT) state or problem. The TOT phenomenon occurs to each individual at least once a week. It is clear that a tool helping a person to recover a forgotten word during a conversation or during a creative writing process would have to be handy, act fast and involve a very large collection of words, properly manipulated. Zock (2015) proposes a three-step methodology. He notes that, in searching for a word that does not come to mind (the target word), we always start from one or more source words. In a first step of the search, out of these source words, an expansion process would generate a large number of associated words, too large to be inspected visually by the subject. In a second step, this family of words would be clustered automatically, each cluster containing a manageable number of semantically/phonetically/etc. related words. If the number of clusters is itself manageable, then, in the last step, the subject decides which of the representative words of the clusters is closest to the target word. Once a cluster is chosen, s/he searches it and, if lucky, the target word is there. If unlucky, the process is iterated starting from a subset of the chosen cluster, which is now supposed to be closer to the target word than at the beginning of the search.
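Zock's three steps can be read as a narrowing loop. Below is a schematic sketch of that loop; expand, cluster and ask_user are placeholders for the word-association, clustering and user-interaction components described above, not implemented algorithms.

```python
def tot_search(source_words, expand, cluster, ask_user):
    """One way to schematize Zock's expand / cluster / choose cycle.

    expand   : source words -> large set of associated words
    cluster  : word set -> manageable list of clusters
    ask_user : clusters -> the recognized target word (str),
               a cluster subset to refine (set), or None (give up)
    All three are placeholders for the components described above.
    """
    candidates = expand(source_words)
    while candidates:
        clusters = cluster(candidates)
        choice = ask_user(clusters)
        if choice is None:            # the user gives up
            return None
        if isinstance(choice, str):   # the target word was recognized
            return choice
        candidates = choice           # iterate inside the chosen subset
    return None
```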
Acknowledgements
This work was co-funded by the European Social Fund through the Sectoral Operational Programme Human Resources Development 2007-2013, project number POSDRU/187/1.5/S/155397, project title "Towards a New Generation of Elite Researchers through Doctoral Scholarships".
Many thanks to Daniela Gîfu for the semantic lexicon.
References