TMF Normalization of Arabic Technical and Scientific Terms*
1 Introduction
* This work is partially funded by the European Commission H2020 project Parthenos (Grant
agreement: 654119, Call: H2020-INFRADEV-1-2014-1).
interests. To facilitate cooperation and avoid redundancy, it is therefore appropriate
to use standards and guidelines for creating and using terminological data collections
and for sharing and exchanging such data.
Technical and scientific documents are written in specialist language. They are
rich in well-presented terminology and cover multiple technical and scientific
fields. Scientists, technical editors and inventors are the best placed to present the
scientific and technical terms of a field because, when drafting scientific papers,
technical standards or patent applications, they carefully choose the words, terms and
named entities of a specific domain.
For many reasons (some of which are listed in Section 3), Arabic is considered a
poorly covered language, in the sense that the work of Arabic linguists and
terminologists seems insufficient to make Arabic a language of technology. As a
consequence, standardized multidisciplinary Arabic terminological databases are
lacking, and terminological data can be neither merged nor interchanged.
Another issue is that Arabic scientific and technical documents may contain
ambiguities and semantic inconsistencies due to regional specificities. Within the same
field, a term may be rendered differently from one scientific or technical document to
another and from one country to another.
In this paper, we aim, through a thorough study of the characteristics and typology
of Arabic technical and scientific terms (from a collection of patents, scientific
articles and manuals), to propose a process for developing a standardized
multidisciplinary Arabic terminological database from a corpus of Arabic technical and
scientific documents. The originality of our work lies in the automatic extraction of
Arabic terms and their representation in a standardized format.
This paper is organized as follows. Section 2 presents previous work on existing
standards for terminological products and on Arabic and non-Arabic terminological
databases. In Section 3, we detail the characteristics of the Arabic terms in our
corpus. In Section 4, we present our method for automatically generating a standardized
terminological database. Section 5 is devoted to experimentation and evaluation, and
Section 6 concludes and outlines some perspectives.
2 Previous Work
1 http://www.tei-c.org/index.xml
2 http://www.tei-c.org/Vault/P4/doc/html/TE.html
The TDC (Terminological Data Collection) is the level of the database itself. It is
the union of several terminological records, in various languages, dedicated to the
same purpose; in other words, it is the collection of informative data on the concepts
of a specific domain. In addition to the terminological information in the strict
sense, it includes global and complementary information. The GI (Global Information)
and CI (Complementary Information) are two levels that store the reference data needed
to manage or operate the database. These data do not belong directly to the terminology
collection.
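The levels named above can be pictured as nested containers. The following is a minimal Python sketch, assuming the usual TMF (ISO 16642) nesting in which a TDC holds GI, CI and a list of terminological entries, each entry holds language sections, and each language section holds term sections. All class and field names are illustrative and are not taken from the standard or from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermSection:
    """TS level: one term plus the data categories that describe it."""
    term: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class LanguageSection:
    """LS level: all terms of one language for a given concept."""
    lang: str
    terms: List[TermSection] = field(default_factory=list)


@dataclass
class TerminologicalEntry:
    """TE level: one concept, described in one or more languages."""
    entry_id: str
    languages: List[LanguageSection] = field(default_factory=list)


@dataclass
class TerminologicalDataCollection:
    """TDC level: the database itself."""
    global_info: Dict[str, str] = field(default_factory=dict)         # GI
    complementary_info: Dict[str, str] = field(default_factory=dict)  # CI
    entries: List[TerminologicalEntry] = field(default_factory=list)

    def terms_in(self, lang: str) -> List[str]:
        """Collect every term recorded for one language across all entries."""
        return [ts.term
                for te in self.entries
                for ls in te.languages if ls.lang == lang
                for ts in ls.terms]


# Hypothetical content: one concept ("code") with two Arabic terms.
tdc = TerminologicalDataCollection(global_info={"title": "Example TDB"})
tdc.entries.append(TerminologicalEntry("te-1", [
    LanguageSection("ar", [TermSection("رمز"), TermSection("كود")]),
    LanguageSection("en", [TermSection("code")]),
]))
```

Keeping GI and CI as plain dictionaries on the collection mirrors the remark that these data support the management of the database rather than describe any single concept.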
3 http://iate.europa.eu
4 http://www.termsciences.fr/
5 http://www.wipo.int/reference/en/wipopearl
terms in ten languages, including Arabic, derived from patent documents. It contains
15,000 concepts and 90,000 terms. ARABTERM⁶ is an online technical dictionary
organized by industry sector that allows user participation. Finally, UN-ARABTERM⁷
is a multilingual terminology database that offers technical or specialized terms in
four of the UN’s official languages: Arabic, English, French and Spanish.
Some previous works use automatic methods to extract Arabic terms and expressions.
For instance, the authors of [3] use a hybrid approach combining grammatical patterns
and statistical methods to extract Arabic multi-word terms. The authors of [2] combine
three approaches that rely on linguistic information, frequency counts and statistical
measures to extract multi-word expressions.
Very few studies and projects have addressed the standardization of Arabic
terminology products. Those that exist adopted the standards mentioned in the previous
section to achieve the desired results. One of these projects is CARTAGO [7], an
international project that pools expertise and provides networking facilities to
collect standardized multilingual terminology resources in the field of e-learning and
make them widely available.
These previous works have three major disadvantages. First, some of them ([3], [2],
WIPO Pearl, ARABTERM and UN-ARABTERM) do not provide (or are not) standardized Arabic
TDBs, so it is very difficult to exchange their terms or merge them with other TDBs.
Second, some of them build TDBs manually (WIPO Pearl, UN-ARABTERM, ARABTERM and
CARTAGO). Third, some of them provide a small number of Arabic terms and concepts for
only one field (CARTAGO) or for a very small number of fields (WIPO Pearl and
UN-ARABTERM).
In this section, we detail the characteristics of the Arabic technical and scientific
terms in our corpus. First, we explain the impact of foreign languages on Arabic terms.
Second, we describe terminological disparities. Next, we point out that some terms are
accompanied by translations into other languages. Then, we present multi-word terms.
Finally, we give examples of conceptual and terminological relationships.
6 http://www.arabterm.org/
7 http://unterm.un.org/dgaacs/arabterm.nsf
its ability to express complex notions of the modern world of science and technology
and to be a tool capable of conveying intellectual life in its entirety. The Arabic
language is, like any other language, a satisfactory instrument for expressing the
world to which it corresponds. It is only in scientific and technical terminology
that it may seem insufficient. Arabic linguists and terminologists have so far been
unable to bridge the terminological and conceptual gaps opened by socio-cultural and
technological development or to ensure a consistent, normalized nomenclature of
concepts. That is why the Arabic language is often criticized for the slow
modernization of its technical terminologies.
As a result, we noted that many patents and manuals use technical terms in French or
English rather than their Arabic equivalents (e.g. jump guard). It seems that Arabic
inventors and technical editors cannot find the appropriate technical terms in Arabic.
We also noted the intensive use of Arabization of English or French terms. This gives
such terms a double identity: the Arabized form and the Arabic equivalent. For
instance, the word “code” in the term “elevator code” has an Arabized form ‘كود’
kuwdo⁸ and an Arabic equivalent ‘رمز’ ramozo. Another phenomenon is periphrasis,
which arises from the abundance of foreign neologisms for which no one-word Arabic
equivalent can be found. For example, the word “microphone” can be rendered as
‘ميكروفون’ miykoruwfuwno (Arabization) or as ‘مكبر الصوت’ mukabGiro AloSawoto,
literally “amplifier of the sound” (periphrasis).
The Arab world extends over a large geographic area with historical and
socio-cultural specificities. Conceptual, terminological, semantic and taxonomic
disparities exist between the Mashreq (Middle East) and Maghreb (Arab Maghreb)
regions, between countries (urban versus rural), and between the literary or
administrative language and the dialectal or popular language. For instance, an
inventor or technical editor from the Mashreq uses the term ‘شفرة’ $aforapo for
“code”, whereas an inventor from the Maghreb uses the term ‘كود’ kuwdo.
Other disparities are caused by term variation: inflectional (e.g. ‘مصعد ثنائي الطوابق’
miSoEado vunaA}iy AloTawabiqo “double-deck elevator” and ‘مصاعد ثنائية الطوابق’
maSaEido vunaA}iyGapo AloTawabiqo “double-deck elevators”), graphic, syntactic and
morphosyntactic.
Most Arabs are at least bilingual, and the majority master French or English. For
this reason, technical and scientific documents generally give a translation of
technical and scientific terms or keywords in the same paragraph. These translations
are usually of very high quality because they are made by professional human
translators, and they facilitate the construction of multilingual terminology
databases.
8 Buckwalter Arabic Transliteration System: http://www.qamus.org/transliteration.htm
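The romanized forms given in this section follow the Buckwalter scheme, which maps each Arabic character to one ASCII character. A minimal Python sketch of that mapping is given below; the function name `to_buckwalter` is illustrative. Short vowels and case endings appear in the output only when the input is diacritized, so undiacritized input yields consonants and long vowels only.

```python
# Standard Buckwalter table, keyed by Unicode code point.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&", "\u0625": "<",
    "\u0626": "}", "\u0627": "A", "\u0628": "b", "\u0629": "p", "\u062A": "t",
    "\u062B": "v", "\u062C": "j", "\u062D": "H", "\u062E": "x", "\u062F": "d",
    "\u0630": "*", "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z", "\u0639": "E",
    "\u063A": "g", "\u0640": "_", "\u0641": "f", "\u0642": "q", "\u0643": "k",
    "\u0644": "l", "\u0645": "m", "\u0646": "n", "\u0647": "h", "\u0648": "w",
    "\u0649": "Y", "\u064A": "y",
    # Diacritics: tanwin, short vowels, shadda, sukun, dagger alif.
    "\u064B": "F", "\u064C": "N", "\u064D": "K", "\u064E": "a", "\u064F": "u",
    "\u0650": "i", "\u0651": "~", "\u0652": "o", "\u0670": "`",
}


def to_buckwalter(text: str) -> str:
    """Transliterate character by character; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


print(to_buckwalter("كود"))  # undiacritized input: consonants and long vowel only
```

Because the mapping is one-to-one, it can be inverted to recover the Arabic script from a transliterated term.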
There are two types of terms: single-word terms and multi-word (complex) terms. The
structures of Arabic complex units that can be lexicalized are mainly Noun+Adjective,
Noun+Noun and Noun+Preposition+Noun, although more complex structures exist. Our work
is to extract terms from complex sentences, and more precisely to reduce the
dispersion in how the content of complex sentences is modeled while keeping the
semantics of the base term. For example, from the sentence ‘زر يعمل عن طريق الضغط عليه’
zirGo yaEomalo Eano Tariyqo AloDGagoTo Ealayoho “button works by pressing it”, the
resulting term is ‘زر بالضغط’ zirGo biAloDGagoTo “pushbutton”.
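The three main structures listed above can be matched over part-of-speech tags. The sketch below is a minimal Python illustration and not the authors' extraction system: it assumes the text has already been tokenized and tagged, and the tagset (N, ADJ, PREP, V) and the function name `find_candidates` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); the tagset is an assumption

# Candidate structures, longest first so that the longer match wins.
PATTERNS = [
    ("N", "PREP", "N"),
    ("N", "ADJ"),
    ("N", "N"),
]


def find_candidates(tokens: List[Token]) -> List[str]:
    """Scan left to right and collect non-overlapping multi-word candidates."""
    candidates = []
    i = 0
    while i < len(tokens):
        for pattern in PATTERNS:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append(" ".join(word for word, _ in window))
                i += len(pattern)  # skip past the matched span
                break
        else:
            i += 1  # no pattern starts here
    return candidates


# "uses" + "amplifier" + "the sound": the Noun+Noun pair is the candidate.
tagged = [("يستخدم", "V"), ("مكبر", "N"), ("الصوت", "N")]
print(find_candidates(tagged))
```

Pattern matching of this kind only proposes candidates; deciding which of them are true terms of the domain requires further filtering.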
Our proposed method consists of three steps: first, the construction of a Data
Category Registry (DCR) and TMF modeling; then, the processing of the characteristics
of Arabic terms; finally, the population of the TDB.
In this paper, we build on the work of [1], in which the authors built a transducer
cascade using the CasSys tool [5] of the Unitex platform to extract terms from a corpus
of Arabic technical and scientific documents and annotate them in TMF format. However,
we noticed that many important characteristics of terms (Section 3) were not considered
or represented in the resulting TDB. In this paper, we therefore incorporate them by
including appropriate section levels and data categories. For example, the “definition”
of a multi-word term extracted from a complex sentence can be included in the
“termSection”. An example is shown in Fig. 4, a transducer that recognizes the term
‘مصعد يعمل بالجر ببكرة محزوزة’ miSoEado yaEomalo biAlojarGo bibakorapo maHozuwzapo
“Elevator works by traction with splined roller”. The boxes between the two brackets
“(definition” and “definition)” contain the complex sentence used as the definition of
the term. The term itself is obtained by concatenating the three variables $Nc1$,
$Prps$ and $Nc2$, where Nc = Noun, Prps = Preposition, V = Verb and Adj = Adjective.
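The effect of such a transducer can be approximated in a few lines of Python. The sketch below is not the Unitex/CasSys graph of Fig. 4. It assumes the sentence is already segmented and tagged with the abbreviations above (the separation of the proclitic preposition from its noun is an assumption), builds the term from Nc1 + Prps + Nc2, keeps the whole sentence as the definition, and wraps both in a simplified GMT-style `struct`/`feat` element. The function names and the feature names "term" and "definition" are illustrative, and the term it produces is the output of this sketch rather than a result reported here.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (surface form, tag); tags: Nc, V, Prps, Adj


def extract_term(tokens: List[Token]) -> Optional[str]:
    """Concatenate Nc1 + Prps + Nc2 for a sentence shaped 'Nc1 V Prps Nc2 ...'."""
    tags = [tag for _, tag in tokens]
    if tags[:4] != ["Nc", "V", "Prps", "Nc"]:
        return None
    nc1, prps, nc2 = tokens[0][0], tokens[2][0], tokens[3][0]
    # Assumption: the preposition is a proclitic, so it is written attached to Nc2.
    return f"{nc1} {prps}{nc2}"


def to_gmt(term: str, definition: str, lang: str = "ar") -> ET.Element:
    """Wrap a term and its definition in a simplified GMT-style term section."""
    ts = ET.Element("struct", {"type": "TS"})
    for name, value in (("term", term), ("definition", definition)):
        feat = ET.SubElement(ts, "feat", {"type": name, "xml:lang": lang})
        feat.text = value
    return ts


sentence = "مصعد يعمل بالجر ببكرة محزوزة"
tokens = [("مصعد", "Nc"), ("يعمل", "V"), ("ب", "Prps"), ("الجر", "Nc"),
          ("ب", "Prps"), ("بكرة", "Nc"), ("محزوزة", "Adj")]

term = extract_term(tokens)
if term is not None:
    print(ET.tostring(to_gmt(term, sentence), encoding="unicode"))
```

Storing the source sentence as a definition feature next to the term keeps the information that is lost when the sentence is reduced to its base term.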
Fig. 4. Transducer of terms extraction
With:
Noun1 = Noun2 = ‘مصعد’ = “elevator”
Noun11 = Noun21 = ‘متعدد’ = “multi”
Noun12 = ‘المقصورات’ synonym Noun22 = ‘العربات’ = “car”
6 Conclusion
In this paper, we pointed out the lack of Arabic multilingual and multidisciplinary
terminological databases. We studied the characteristics of the Arabic technical and
scientific terms in our corpus. We also took advantage of the interoperable GMT format
provided by the TMF standard and created a TML that fits Arabic terms by including the
appropriate data categories in the TMF meta-model. Finally, we used a rule-based
approach to extract terms and build a terminological database coupled with a Java
interface. In the future, we aim to enlarge our terminological database by extracting
terms from scientific and technical documents and annotating them in TMF format using
a statistical approach. We also aim to combine the two approaches into a hybrid one.
References
1. Ammar, C., Haddar, K., Romary, L.: Automatic Construction of a TMF Terminological
Database Using a Transducer Cascade. In: Recent Advances in Natural Language
Processing (RANLP ’15), pp. 17–23, Hissar (2015)
2. Attia, M., Toral, M., Tounsi, L., Pecina, P., Genabith, J.V.: Automatic extraction of Arabic
multiword expressions. In: Proceedings of the International Conference on Language
Resources and Evaluation LREC’10, pp. 19–26 (2010)
3. Boulaknadel, S., Daille, D., Aboutajdine, D.: A Multi-Word Term Extraction Program for
Arabic Language. In: Proceedings of the International Conference on Language Resources
and Evaluation LREC’08, pp. 630–634, Marrakech (2008)
4. Doumi, N., Lehireche, L., Maurel, D., Khater, M.: Using finite-state transducers to build
lexical resources for Unitex Arabic package. In: Second Symposium for Researcher
Students in Natural Language Processing and its Applications CEC-TAL’15, pp. 90–100,
Sousse (2015)
5. Friburger, N., Maurel, D.: Finite-state transducer cascades to extract named entities in texts.
In: Theoretical Computer Science, vol. 313, pp. 93–104 (2004)
6. Habash, N., Rambow, O., Roth, R.: MADA+TOKAN: A Toolkit for Arabic Tokenization,
Diacritization, Morphological Disambiguation, POS Tagging, Stemming and
Lemmatization. In: Proceedings of the 2nd International Conference on Arabic Language
Resources and Tools, pp. 102–109, Cairo (2009)
7. Hudrisier, H., Ben Henda, M.: CARTAGO : une terminologie large langues de
l’enseignement électronique à distance, dans un contexte de co-élaboration multilingue de
documents normatifs. In: Séminaire international Les outils d'aide à la traduction, Union
Latine, Bucarest (2008)
8. ISO 16642: Computer applications in terminology – Terminological markup framework
(TMF). (2003)
9. ISO 12620:1999: Computer applications in terminology – Data categories. (1999)
10. ISO 704: Terminology work – Principles and methods. (2000)
11. Khayari, M., Schneider, S., Kramer, I., Romary, L.: Unification of multi-lingual scientific
terminological resources using the ISO 16642 standard. The TermSciences initiative. In:
International Workshop Acquiring and representing multilingual, specialized lexicons: the
case of biomedicine, 6 p., Genoa (2006)
12. Lou, B., Sperberg-McQueen, C.M.: The Design of the TEI Encoding Scheme. Computers
and the Humanities, vol. 29(1), pp. 17–39 (1995)
13. Romary, L.: An Abstract Model for the Representation of Multilingual Terminological
Data: TMF Terminological Markup Framework. In: 5th TermNet Symposium TAMA’01,
Antwerp (2001)
14. Romary, L., Kramer, I., Salmon-Alt, S., Roumier J.: Gestion de données terminologiques :
principes, modèles, méthodes. Widad Mustafa El Hadi. Terminologie et accès à
l'information, Hermes, 13 p. (2006)
15. Romary, L., Van Campenhoudt, M.: Normalisation des échanges de données en
terminologie : les cas des relations dites conceptuelles. In: 4ème Rencontre Terminologie et
Intelligence Artificielle TIA'01, 10 p., Nancy (2001)