Professional Documents
Culture Documents
Terminology Acquisition and Description Using Lexical Resources and Local Grammars
Terminology Acquisition and Description Using Lexical Resources and Local Grammars
[Др.РГФ]
Cvetana Krstev, Ranka Stanković, Ivan Obradović, Biljana Lazić | Terminology Acquisition and Description Using Lexical
Resources and Local Grammars | Proceedings of the 11th Conference on Terminology and Artificial Intelligence, Granada,
Spain, 2015 | 2015
URI: dr.rgf.bg.ac.rs/s/repo/item/1759
Дигитални репозиторијум Рударско-геолошког факултета The Digital repository of The University of Belgrade
Универзитета у Београду омогућава приступ издањима Faculty of Mining and Geology archives faculty
Факултета и радовима запослених доступним у слободном publications available in open access, as well as the
приступу. Претрага репозиторијума доступна је на employees' publications. The Repository is available at:
dr.rgf.bg.ac.rs dr.rgf.bg.ac.rs
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
81
82
83
gray in Figure 1): inflectional class prediction Productiveness of all inflectional classes are
(step 2.3) and extraction of MWU candidates for not the same: some classes are used for a large
termbases using syntactic graphs (step 3). number of regular cases, while other pertain to
(rare) exceptions. Our approach is addressing the
3 Prediction of inflectional class for first group, having in mind that terminology usu-
simple words ally inflects regularly. Figure 3 presents the
number of inflectional classes and percent of
Prediction of inflectional class for a new word in lemmas that belong to them. For example, 10
Serbian is not an easy task because of complex classes for adjectives account for 98% of lem-
inflectional grammar with numerous rules and mas, 10 classes for nouns account for 61.8% of
exceptions. Morphological electronic dictionaries lemmas, and 10 classes for verbs account for
of Serbian for NLP are being developed for 59.6% of lemmas.
many years now. Their development follows the
methodology and format (known as 100
and subsequently used in various NLP tasks. Figure 3: FST classes and the percentage cover-
Serbian e-dictionaries of simple forms have ing the dictionary of lemmas
reached a considerable size: they have about
135,000 lemmas generating more than 5 million FST class prediction can be divided into two
forms and 13,000 compound lemmas, that is, parts: one is extraction of implicit knowledge
multi-word units (Krstev, 2008). The number of and the other is actual prediction of FST class for
simple lemmas by Part-Of-Speech (POS) is de- a new lemma. Extraction of implicit knowledge
picted in Figure 2 (left). in the form of a dataset with word endings,
grammatical categories and FST classes proceeds
POS lemmas FSTs as follows:
Nouns 81,866 61% 372 44% 1. Calculate frequencies for each POS and rela-
Verbs 17,071 13% 372 44% tive frequencies for each FST class within
Adjectives 31,071 23% 69 8% POS in the current dictionary of simple
Other 4,632 3% 41 5% lemmas.
Total 134,640 854 2. Create a dataset from DELAS lemma end-
Figure 2: Statistics of lemmas and inflectional ings of length 3,4,5 and 6 characters with
FSTs corresponding grammatical categories re-
trieved from DELAF (e.g. for nouns in that
Inflectional classes are described with dataset: POS, lemma, FST, gender,
metadata including most important features for animateness, pronunciation).
class distinction e.g. for nouns grammatical gen- 3. Create another dataset with frequencies for
der and number, case, and animateness are given. each combination of FST code and grammat-
Grammatical inflectional rules are encoded by ical category and for each ending of length
854 inflectional Finite-State Transducers (FST) 3,4,5,6, as an estimate of the probability that
Inflectional FSTs are a special kind of FSTs used the FST class is the appropriate one. The da-
for modeling inflectional paradigms, that is, in- taset includes: ending, POS, gender,
flectional classes. Each FST of this kind is used animateness, pronunciation, FST and proba-
for production of all inflected forms for all lem- bility (chance rank 0-100) for FST (table 1,
mas belonging to the same class. The number of column Rel. freq.).
Inflectional FSTs by POS is depicted in Figure 2
(right).
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
84
ending
POS length gender anim FST Total Frequency Rel. freq. Example
alica N 5 f - N650 97 95 98 sijalica
alica N 5 f - N1650 97 1 1 Skalica
alica N 5 f + N651 27 26 96 varalica
alica N 5 f + N1651 27 1 4 Lalica
ica N 3 m + N1683 145 142 98 Milojica
ica N 3 m + N1741 145 3 2 Prica
ica N 3 m/f + N683 40 33 83 tvrdica
ica N 3 m/f + N679 40 7 18 ubica
Table 1: Excerption from dataset with ending.
(m: masculine gender; f: feminine gender; m/f: nouns change gender in their inflectional paradigm)
Analysis of the relation between word end- ary of simple word lemmas with implicit
ings and inflectional FST classes shows that the knowledge about dependence between FST clas-
prediction of inflectional class by the abovemen- ses and dictionary entries.
tioned statistical analysis of existing dictionaries After preparing the list of new entries in the
is justified. Figure 4 illustrates this relation for form: lemma, POS, Grammatical_Categories
word endings of length 3, 4, 5 and 6. For exam- (e.g. grabuljar,N,Rud ‘rake’) the following pro-
ple, in the case of word endings of length 3, for cedure is applied:
33% of words from the existing dictionary there 1. For each candidate lemma filter the dataset
is only one corresponding FST class, for approx- prepared from previous step as follows:
imately 20% of words there are two classes, and 1.1. if the lemma has specific marks for pro-
so on, whereas for word endings of length 6 nunciation, then retain only dataset mem-
there is a single class for as much as 90% of bers with the same mark and remove the
words. rest;
In order to facilitate prediction of FST class, a 1.2. if the grammatical gender or animateness
set of rules based on inflectional class metadata is assigned, retain only dataset members
is used. Distinction between inflectional classes with the same grammatical category and
based on grammatical categories can be done remove the rest;
only to some extent, so implicit knowledge from 1.3. if the first letter of the lemma is in upper
the existing dictionary of simple words is used to case additional filtering can take place tak-
improve prediction. ing into account FST classes which have
only inflected singular forms.
100
90
2. After filtering and ranking the dataset, pre-
90
80 82
length 3 diction (FST assignment) for the lemma is
70
length 4
repeated with threshold from 99 to 95 for
60 63
length 5
relative frequency, for suffixes 6,5,4, and 3
% of words
50
length 6
respectively;
40 3. For thresholds under 95 and over 80 lemma
30
33
prefix (if longer than 2 characters) is used: if
20 the prefix is in the dictionary of prefixes and
10 the remainder of the lemma is a word in
0 DELAS, then the lemma is the inflectional
1 2 3 4 5 6
number of FST classes
class of the corresponding DELAS word is
Figure 4: Relation between word endings and assigned to the lemma.
inflectional FST classes 4. For thresholds 80 and less steps 1 and 2 only
are repeated.
The process of automatic prediction of inflec- From a sample of domain texts and dictionar-
tional FST class for a new entry follows a hybrid ies we manually filtered 623 new terms from
approach: one part is rule-driven with explicit domains of mining, geology and e-learning and
codification of knowledge about FST classes and applied the described procedure for FST class
the other is statistical, based on existing diction- prediction: to 582 (93%) of them the correct FST
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
85
class was assigned, 27 (4%) had a partly correct nical characteristics of machines, which are often
class assigned (for instance, inflection is correct longer MWUs than the less specific terminology
but falsely allows plural forms), and to 14 (2%) of the two other termbases. Two examples from
of them an incorrect class was assigned. RudOnto can illustrate this: a term for employee
position “Geologist for mineralogy, petrology,
4 Syntactic graphs for MWU recogni- sedimentology and geochemical research” and a
tion term for technical characteristics of machines
“Length of the caterpillar transporting device
4.1 Structure of terms in termbases measured from the vertical excavator rotation
In order to analyze the structure of terms in dif- axis to the front edge of the caterpillar”.
ferent domains, primarily the number of compo- 4.2 Extraction of MWUs from domain texts
nents they consist of, we used samples from
three terminological resources for Serbian. Two The extraction of MWUs from a text is preceded
terminological resources, GeolISSTerm 2 and by the retrieval of new simple word terms from it
RudOnto3 have been developed at University of and their incorporation in the existing system of
Belgrade, Faculty of Mining and Geology. morphological e-dictionaries as MWU extraction
GeolISSTerm is a bilingual thesaurus of geologi- relies heavily on existing lexical resources.
cal terms in Serbian and their English equiva- In the Serbian e-dictionary of MWUs, all en-
lents (Stankovic et al., 2011), divided in several tries are distributed in classes according to their
subdomains: petrology, mineralogy, hydrogeolo- syntactic structure, or more precisely, according
gy, geophysics, structural geology etc. RudOnto to the information needed for their inflection.
is covering the larger area of mining engineering The names of classes correspond to the names of
and mine safety terminology (Stankovic et al., special FSTs that are used for MWU inflection.
2012). The third termbase used is the Dictionary For instance, the class AXN pertains to MWUs
of Library and Information Sciences (RNBS), 4 with the syntactic structure: an adjective (A) fol-
developed by the National Library of Serbia. It lowed by a noun (N), where the two components
contains terminology in Serbian, English and agree in gender, number, case and animateness.
German, related to theory and practice of librari- In class names X stands for a component that
anship and information sciences and a wide does not inflect when a MWU inflects or for a
range of close or related fields. component separator. In the case of AXN, X
stands for the separator, usually a space. Some-
Table 2. Frequencies of terms of different lengths times, MWUs with different syntactic structure
in samples from 3 termbases belong to the same class. For instance, the class
Dictionary
Term length (in number of words) N4X implies that MWUs belonging to it consist
1 2 3 4 ≥5 of a noun followed by two other components
GeolISS
1436 2356 749 305 243
(separated by two separators) that do not inflect.
Term The syntactic structure of these components can
RNBS 3302 6180 2062 806 415 be a noun followed by two adjectives/nouns in
RudOnto 1004 1351 1350 1031 2341 the genitive case (e.g. eksploatacija mineralnih
sirovina ‘exploitation of mineral resources’) but
Table 2 presents the distribution of terms consist- also a noun followed by a prepositional phrase
ing of 1, 2, 3, 4 and more components for the (e.g. bager na šinama ‘excavator on rails’).
three termbases. These results are consistent with There are 29 such classes for Serbian nominal
the results presented in (Justeson et al., 1995), at MWUs.5 However, 10 of them are used for the
least for GeolISSTerm and RNBS, and show that inflection of more than 98% of all nominal
terms with 5 or more components are much less MWUs. Four of these classes are used for the
frequent than the shorter ones. The results are inflection of two component MWUs, four for the
somewhat different for RudOnto, as it contains inflection of 3-component MWUs and two for
very specific terms, such as causes of injuries, the inflection of 4-component MWUs. Given that
employee positions, types of injuries, or tech-
5
The number of FSTs (80) is greater than the number of
2
http://geoliss.mprrpp.gov.rs/term classes because they deal with other details of inflection:
3
http://rudonto.rgf.bg.ac.rs/ does the MWU inflect in number, are some components
4
http://rbi.nb.rs/en/home.html optional, etc.
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
86
they cover the large majority of MWUs, we have b. NNgiNgiNgi - a noun followed by three
developed syntactic FSTs for the extraction of nouns/adjectives in the genitive case; e.g.
MWUs belonging to these 10 classes. They are, istraživanje ležišta mineralnih sirovina ‘ex-
listed in the descending order of their frequency: ploration of mineral deposits’.
1. AXN – an adjective followed by a noun; the c. NprepNpNgi - a noun followed by a prepo-
adjective and the noun have to agree in all sitional phrase; e.g. bakar sa primesama
four grammatical categories; e.g. zemni gas zlata ‘copper with a sprinkling of gold’.
‘natural gas’. 9. AXN4X – a noun preceded by an adjective
2. 2XN – a noun preceded by a word that does that agrees with it in gender, number, case
not inflect in the MWU. Usually it is a word and animateness and followed by two words
used only in one or a few MWUs, a prefix or that do not inflect in the MWU. Two syntactic
an adverb derived from an adjective, while structures are possible:
the separator is usually a hyphen; e.g. anker- a. ANPrepNp - A noun preceded by an adjec-
mreža ‘anchor network’. tive and followed by a prepositional phrase
3. N2X – a noun followed by a word that does (as in case 4b); e.g. gravitacijska
not inflect in the MWU. Usually this word is koncentracija u vodi ‘gravity concentration
a noun in the genitive or in the instrumental in water’.
case; e.g. patrona eksploziva ‘explosive car- b. ANNgiNgi - a noun preceded by an adjec-
tridge’ and upravljanje krovinom ‘roof con- tive and followed by two adjectives/nouns
trol’. in the genitive case or in the instrumental
4. N4X – a noun followed by two words that do case (a 4a case); e.g. površinska
not inflect in the MWU. Two syntactic struc- eksploatacija mineralnih sirovina ‘surface
tures are possible: exploitation of mineral resources’.
a. NNgi - A noun followed by two adjec- 10. 2XAXN - an adjective followed by a noun
tives/nouns in the genitive case or in the in- that agrees in all four grammatical categories
strumental case; e.g. otkopavanje širokim and preceded by a word that does not inflect
čelom ‘broad forehead excavation’. in the MWU; e.g. magmatsko-eruptivni masiv
b. NprepNp - A noun followed by a preposi- ‘magmatic-igneous massif’.
tional phrase; e.g. lanac sa grabuljama FST for extraction of MWUs of type AXN with
‘chain with a rake’. two paths from one of the subgraphs that illus-
5. AXN2X – a noun preceded by an adjective trate the agreement between adjectives and nouns
that agrees with it in gender, number, case is depicted in Figure 5. Dictionary variable used
and animateness and followed by a word that for FST output in the form $a.LEMMA$ re-
does not inflect in the MWU, usually a noun trieves a lemma of recognized word form $a$
in the genitive or instrumental case; e.g. thus performing the simple word lemmatization.
geološko kartiranje terena ‘geological field Due to high homography of word forms it
mapping’. may happen that the same sequence of words is
6. NXN – a noun followed by a noun that agrees recognized by two or more graphs; naturally,
with it in number and case, where the separa- only one recognition may be correct. For in-
tor can be a hyphen; e.g. bager kašikar ‘shov- stance if the MWU bager kašikar (case 6, NXN)
el excavator’. is detected in the analyzed text in the genitive
7. AXAXN – a noun preceded by two adjectives case bagera kašikara it may be erroneously in-
that agree with it in gender, number, case and terpreted as a MWU of a form NNg (case 3) in
animateness; e.g. površinski istražni radovi the genitive case. Consequently, all NNg con-
‘surface exploration works’. structions in an analyzed text that appear in the
8. N6X - a noun followed by three words that do genitive case (which happens very frequently)
not inflect in the MWU. Three syntactic will be interpreted also as a NXN case. For that
structures are possible: reason, in the case of ambiguous recognition we
a. NNgiPrepNp - a noun followed by a noun always give precedence to the more probable
in the genitive case and a prepositional case. For instance, for 2-component MWUs the
phrase (as in case 4b); e.g. priprema ležišta precedence is: AXN, 2XN, N2X, NXN.
za otkopavanje ‘deposit preparation for min- As a rule, we are looking for the longest
ing’. match for a MWU, that is, if a text matches an
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
87
1. For MWUs with syntactic structure AXN, plural and singular forms, then both forms
AXAXN, AXN2X, AXN4X, and 2XAXN of lemmas will be offered.
the form of the adjectives has to be correct- Production of correct MWU lemmas is a pre-
ed so that the right gender is selected to cor- requisite for the successful evaluation. Moreover,
respond to the gender of the noun (simple entries for morphological e-dictionary of MWUs
word lemmas are always in the masculine can be produced only from correct MWU lem-
gender). For example, when simple word mas. Finally, as a byproduct of the whole process
lemmatization offers a lemma minskim MWU inflectional classes for newly retrieved
bušotinaf ‘blasting boreholes’ it has to be MWUs are obtained – they are derived directly
corrected to minskaf bušotinaf. from local grammars used for their extraction.
2. For all MWUs, the right number of the
MWU has to be selected: if it appeared in a 4.3 Evaluation of performance of MWU
text only in singular form or only in plural extraction
form, then the lemma will be in the respec- In order to evaluate our approach, we applied it
tive form (e.g. only singular form jamski to a collection of 74 papers in Serbian from the
vazduh ‘air in the underground mine’, only journal Infotheca. 6 The size of the corpus is
plural form atmosferske padavine ‘atmos-
pheric precipitation’); if it appeared in both 6
Infotheca - Journal for Digital Humanities
(http://infoteka.bg.ac.rs/index.php/en/infoteca)
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
88
89