Hybridization Based Machine Translations For Low-Resource Language With Language Divergence
NANDINI SETHI
Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New
Delhi,110006, Delhi, India. nandinisethi2104@gmail.com
AMITA DEV
Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New
Delhi,110006, Delhi, India. vc@igdtuw.ac.in
POONAM BANSAL
Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New
Delhi,110006, Delhi, India. poonambansal@igdtuw.ac.in
DEEPAK GUPTA
Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Delhi, India.
deepakgupta@mait.ac.in
This paper presents a machine translation system from Sanskrit to Hindi that hybridises direct and rule-based language processing.
It also discusses the divergence between Sanskrit and Hindi and proposes a way to handle it. The proposed system uses
Sanskrit-Hindi bilingual dictionaries, a grammatical Sanskrit corpus, and a Sanskrit analysis rule base. Elasticsearch is used to
speed up access to the dictionaries and rule bases on which the system is built. The paper also presents a novel technique that
builds a parse tree from the parsing table. The system processes the input Sanskrit sentence using the parsing approach and a
Context-Free Grammar in normal form for Sanskrit language processing. Because no standard Sanskrit-Hindi grammatical corpora are
available for machine translation, such corpora were designed and developed in the proposed work. The target-language sentence is
produced using the grammatical corpora and bilingual dictionaries. Tested using the API of Python's Natural Language Toolkit, the
proposed system achieved a Bilingual Evaluation Understudy (BLEU) score of 51.6 percent. In the comparison, the proposed system
performs better than current state-of-the-art systems.
Keywords: Corpora, Direct translation, Elasticsearch, Natural language processing, Sanskrit grammar, Tagger.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To
copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2375-4699/2022/1-ART1 $15.00
http://dx.doi.org/10.1145/3571742
ACM Trans. Asian Low-Resour. Lang. Inf. Process.
1 INTRODUCTION
Sanskrit is a decorative, refined, and pure language written in the Devanagari script, also known as
"Samaskrita." All other Indian languages have their roots in Sanskrit, one of the world's oldest languages. The
Sanskrit corpus is around 30 million words long, which is one million times larger than the aggregate corpus
of Greek and Latin dialects. It was also available in the form of inscriptions even before the invention of the
printing machine. Sanskrit literature includes literary, theatrical, lyrical, devotional, scientific, engineering,
and mathematical works. The Vedic language is highly inflected by nature, and Panini's Ashtadhyayi presents
its structure in eight chapters in a well-organised and understandable manner [1]. Since
Georges Artsrouni's Mechanical Brain (1937), an early machine for translating text, commercial-level
products have been created thanks to the dependability, resilience, and efficiency of modern
computer-based translation approaches. According to the 1991 census, only 49,736 people in India spoke
Sanskrit. A further decline in speakers and a possible shortage of Sanskrit scholars could make it
more challenging to comprehend the ancient Indian texts that are still available in Sanskrit. Government
and non-government organisations have recently taken a number of actions at the national and
international levels to preserve this indigenous language and to unearth the innovation and
engineering written in it for the benefit of humankind, particularly in the medicinal disciplines.
According to [2], Sanskrit has the most orderly and mathematically organised grammatical style of any
natural language, making it a preferable language for computer comprehension. The Sanskrit language has
the following characteristics:
(a) Sanskrit is touted as the language of processing because of its generally straightforward structure
and lack of ambiguity.
(b) Sanskrit is the most methodical and organised language, and because of its more precisely defined
grammar, it is also more technically computable.
(c) Sanskrit contains a number of hidden algorithms that can be used to examine "Meanings" or "Word
sense" from a variety of angles as part of its extensive scholarly treatises.
(d) Sanskrit depicts words based on their qualities rather than their objects; any vowel or consonant can
be found in a Sanskrit word. Vowels are self-contained, whereas consonants are dependent on
vowels. It establishes the Sandhi method.
Over 60 million Indians use Hindi on the internet, and Hindi ranks fourth in the world by
number of speakers [87]. In light of these facts, our goal is to introduce a large number of Hindi
speakers to Sanskrit and help them become at least somewhat conversant in it. Several language
translation projects are being pursued globally, but Sanskrit-related research is still lacking. Machine Translation
(MT) is a method of using computers to translate text from a source language into another language. Using MT,
the problem of converting Sanskrit text into equivalent Hindi content can be solved efficiently and quickly [119].
Human translation cannot be the primary solution to this problem. Various approaches are available for developing
a machine translation system. Table 1 shows a comparison of those approaches.
Table 1 (surviving row). Neural based: Because of its complex and dynamic character, context-based machine
translation is possible. Continuous learning necessitates a high level of processing.
1.3 Contributions
(a) To begin with, a grammatical corpus was manually compiled for the construction of the proposed
model, which combines direct and rule-based mechanisms.
(b) To improve the accuracy of the proposed system, we trained and tested it using a variety of models,
training data, and sentence lengths.
(c) We then combined direct machine translation with the linguistic tools produced by the
traditional rule-based approach. This allowed us to disambiguate word translations, because
the same word can have several meanings depending on the context. We also examined the system's
ability to tokenize data effectively and reduce data sparsity.
(d) Performance testing compared rule-based, corpus-based, and hybrid systems on three metrics:
BLEU, Fluency Score, and Adequacy Score.
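To make the BLEU metric concrete, the following is a minimal pure-Python sketch of sentence-level BLEU against a single reference, with uniform n-gram weights and no smoothing. It is not the NLTK implementation used in the paper; the function names `ngram_counts` and `sentence_bleu` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing (illustrative)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalise candidates that are shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# A candidate identical to its reference scores 1.0.
print(sentence_bleu("a b c d".split(), "a b c d".split()))  # 1.0
```

Because short sentences often have no matching 4-grams, an unsmoothed score of this kind drops to zero easily; corpus-level aggregation or smoothing is the usual remedy.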
Sanskrit language parsers have been proposed in three different forms: shallow [67], deterministic [68],
and constraint-based [69]. Bhadra et al. [70] developed a Sanskrit analysis system that uses segmentation,
shallow parsing, and a Karaka analysis module to examine Sanskrit texts. The authors of [71] suggested a system for
parsing Sanskrit compounds that automatically performs segmentation and detects the Vedic
compounds using statistical methods. The Anusaaraka platform was used to offer a Sanskrit-to-Hindi MTS [17,
72]. For the English translation of Sanskrit texts, Aparna developed an RBMT-based MTS in 2005. The
morphological properties of the input Sanskrit sentence were analysed in order to derive simple words
from complex Sanskrit words. These simple words were then supplied to the transducer module, which
created word-by-word transducers. The target language was created using a translator module, which
received the output from a collection of transducers used to create the parser. In 2014, Upadhayay et al.
proposed another DMT system for converting Sanskrit text to English text and delivering text-to-speech
interpretation, although the system did not undertake any syntactic or semantic analysis; it simply substituted
words and rearranged the resulting words. As the preceding discussion shows, a variety of
methodologies have been employed in the creation of MT systems for the Sanskrit language. For example,
[17, 72, 74] used the DMT approach, whereas [73] used the RBMT approach to develop
translations from Sanskrit to Hindi. Every approach has both advantages and disadvantages. Because
constructing a rule base that covers a large part of the language is complex, and because the DMT approach
lacks semantic feature analysis, the investigators present a hybrid of the DMT and
RBMT methods that takes advantage of the strengths of both.
Table 4 shows various areas where Sanskrit and Hindi deviate, along with possible examples that might
occur during translation. Sanskrit and Hindi sentences are denoted by "SS" and "HS," respectively, in Table 4.
The ITRANS (Indian Language TRANSliteration) format is used to write the Sanskrit sentences.
(a) Irony is present in the morphology and inflection of Sanskrit. This changes the
string's last consonant and its gender, and makes it difficult to remember the different string
inflection configurations.
(b) The structural differences between the source and target languages are mostly to blame for
translation issues. Sanskrit and Hindi are based on vibhakti and karaka; however, English replaces
vibhakti with prepositions or nothing when NP is present.
(c) Sanskrit and Hindi sentences are typically written in the passive voice, whereas English
sentences are typically written in the active voice. Due to this voice change, it is difficult to
translate between such language pairs.
4 PROPOSED SYSTEM
[Figure: flowchart of the proposed system. Recoverable labels: Input Sanskrit Sentence; Pre-Processing and
Tokenization; Tokens; POS Tagging; Sanskrit Rule-Base; Tagged Tokens; Sanskrit Grammar; Text Parsed? (YES:
Sanskrit Parse Tree; NO: Direct MT); Output as Hindi Sentence.]
4.2 Corpora
This module serves as the proposed system's database and contains
(a) a bilingual Sanskrit-Hindi dictionary
(b) Grammatical Corpora consisting of tables of Communal Words (Word and Meaning), Interrogative
words (Includes the Shabdh, Meaning, Vacchan and Ling), Pronouns (Includes the Shabdh, Meaning,
Vacchan and Ling), Verbs (Includes the Word, Meaning, Joining Word and Word Type), Definition of
Vowels (Includes the Shabdh Type, Vacchan Type, Ling Type, Vibhakti, Ending Meaning, Suffix and
Prefix) and Lakaar (Includes the Shabdh Type, Kaala Type, Purush Type, Vacchan Type, End
Meaning, Suffix, Prefix and Verb Gender).
This dataset is used by many modules throughout the translation process, as depicted in Figure 6. To
improve access to data from these vocabularies and the labelled dataset, the authors employed
Elasticsearch, an open-source, scalable, full-text search and analytics engine. It indexes words
together with their positions so that they can be searched quickly. Large amounts of data can
be stored, searched, and analysed in near real time. With this technique, any data can be
searched, whether it is structured or not.
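The indexing idea described above can be illustrated with a small positional inverted index. This is only a sketch of the underlying data structure, not Elasticsearch's API or the authors' configuration; the function names and the ITRANS-style sample sentences are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.split()):
            index[token].append((doc_id, position))
    return index


def lookup(index, token):
    """Return the postings for a token, or an empty list if it is unknown."""
    return index.get(token, [])


# Hypothetical ITRANS-style sentences, used only to exercise the index.
documents = {
    "s1": "rAmaH vanam gacChati",
    "s2": "bAlakaH pustakam paThati",
    "s3": "rAmaH pustakam paThati",
}
index = build_index(documents)
print(lookup(index, "pustakam"))  # [('s2', 1), ('s3', 1)]
```

A lookup is a single dictionary access regardless of how many entries are stored, which is why an index of this kind speeds up dictionary and rule-base access compared with scanning every table.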
4.3 Parsing
This module describes the Sanskrit grammar and the Cocke-Younger-Kasami (CYK) parser used to process
source-language content. It is further divided into two submodules: the CYK parsing table and the
construction of the Sanskrit parse tree.
If the proposed grammar successfully parses the input sentence, the parse table is used to build the
Sanskrit parse tree in Section 4.3.2; otherwise, control is sent to Section 4.5.
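As a rough illustration of this step, the sketch below fills a CYK table for a grammar in Chomsky normal form and rebuilds a parse tree from stored backpointers, returning `None` when the sentence cannot be parsed. It uses the textbook backpointer method, not the authors' tree-building algorithm, and the two-rule toy grammar, the tags, and the sample sentence are assumptions made for the example.

```python
def cyk_parse(tagged, binary_rules, start="Sentence"):
    """Fill the CYK table and return a parse tree, or None if parsing fails.

    tagged: list of (word, tag) pairs from the tagging stage
    binary_rules: {(B, C): set of non-terminals A such that A -> B C}
    """
    n = len(tagged)
    if n == 0:
        return None
    # table[i][j] maps a symbol to a backpointer for the span of tokens i..j.
    table = [[{} for _ in range(n)] for _ in range(n)]
    for i, (word, tag) in enumerate(tagged):
        table[i][i][tag] = word
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point between the two children
                for b in table[i][k]:
                    for c in table[k + 1][j]:
                        for lhs in binary_rules.get((b, c), ()):
                            table[i][j].setdefault(lhs, (k, b, c))
    if start not in table[0][n - 1]:
        return None  # the caller would fall back to direct translation

    def build(i, j, symbol):
        back = table[i][j][symbol]
        if i == j:
            return (symbol, back)  # leaf: (tag, word)
        k, b, c = back
        return (symbol, build(i, k, b), build(k + 1, j, c))

    return build(0, n - 1, start)


# Toy grammar (assumed): Sentence -> NNP VP, VP -> NN Verb
rules = {("NNP", "VP"): {"Sentence"}, ("NN", "Verb"): {"VP"}}
tagged = [("rAmaH", "NNP"), ("vanam", "NN"), ("gacChati", "Verb")]
tree = cyk_parse(tagged, rules)
print(tree)
```

Returning `None` on failure keeps the decision between the parse-tree path and the direct-translation path in the caller.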
[Figure: example parse tree. Recoverable node labels: Sentence, NNP, NN, VP, Verb; leaf text: जाता है.]
4.6 Post-processing
The complete output is generated during this step, and the final result is reordered. The target Hindi
sentence is produced by traversing the leaf nodes of the tree in the target language parse tree module from
left to right. For the direct method, the words obtained in Sect. 4.5 are rearranged so that the sentence
has the structure of the target language.
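The left-to-right leaf traversal can be sketched as follows, assuming the parse tree is stored as nested tuples of the form (label, child, ...). The tree representation, labels, and Hindi words are hypothetical and serve only to illustrate the traversal.

```python
def leaves(tree):
    """Yield the words at the leaf nodes of a (label, child, ...) tuple tree, left to right."""
    label, *children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from leaves(child)
        else:
            yield child  # a leaf holds the word itself


# Hypothetical target-language parse tree; labels and words are illustrative only.
hindi_tree = ("Sentence", ("NNP", "राम"), ("VP", ("NN", "वन"), ("Verb", "जाता है")))
print(" ".join(leaves(hindi_tree)))  # राम वन जाता है
```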
5 EXPERIMENTAL SETUP
Bilingual vocabularies and a grammatical corpus were created in preparation for the planned system's
implementation.
(a) a bilingual Sanskrit-Hindi dictionary.
(b) Grammatical Corpora consisting of tables of Communal Words, Interrogative words, Pronouns,
Verbs, Definition of Vowels and Lakaar.
Using our method, we have created corpora for the parasmai pad verb tenses. The tenses in Sanskrit are:
present tense (latlakaar), past tense (langlakaar), future tense (lritlakaar), imperative mood (lotlakaar), and
potential mood (vidhilingalakaar). Verb forms have three numbers, just like noun
forms: singular (ekvachan), dual (dvivachan), and plural (bahuvachan).
Sanskrit has three persons (purush): the first is referred to as Prathama purush, the second
is Madhyam purush, and the third is Uttam purush. Prathama purush is the verb person for
nouns (such as a name, place, or object) and pronouns (such as He, She, It, and They).
Madhyam purush is the verb person that refers to You. Uttam purush is the
verb person for I / We. For our noun (and a few pronoun) forms, we took into consideration 15 different
swaraanta and vyanjanaanta endings. There are three genders, three numbers, and eight cases for each type of
ending letter. These noun form groups can therefore contain a total of 707 distinct noun words. Counting them
gives 261 different suffixes for these noun forms; that is, every noun form can be described by one of 261
distinct suffixes (the rest are repetitions). With the five verb tenses mentioned above, three persons,
and three numbers for each person, there are 5 × 3 × 3 = 45 possible verb forms in these verb form categories.
Since each of these suffixes is unique, they must all be considered. In one tense, langlakaar, the extra
prefix "a" is used in addition to the suffix.
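The count of 45 verb forms follows from enumerating every combination of tense, person, and number. The sketch below builds one empty slot per combination and marks the "a" prefix for langlakaar; the dictionary layout and key spellings are assumptions for illustration, and the suffixes themselves are left unfilled because they come from the Lakaar table.

```python
from itertools import product

LAKAARS = ["latlakaar", "langlakaar", "lritlakaar", "lotlakaar", "vidhilingalakaar"]
PURUSHES = ["prathama", "madhyam", "uttam"]
VACHANS = ["ekvachan", "dvivachan", "bahuvachan"]

# One slot per (lakaar, purush, vachan) combination. The suffix is a placeholder
# here; only langlakaar additionally takes the prefix "a".
verb_slots = {
    (lakaar, purush, vachan): {
        "suffix": None,
        "prefix": "a" if lakaar == "langlakaar" else "",
    }
    for lakaar, purush, vachan in product(LAKAARS, PURUSHES, VACHANS)
}

print(len(verb_slots))  # 45
```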
According to the explanation of Sanskrit grammar above, constructed Sanskrit corpora comprise the
following entities:
Elasticsearch was applied to increase translation speed. The web environment for the
proposed system is built using PHP on a WAMP server together with the Sanskrit-Hindi generation rules.
Sanskrit sentence processing consists of tokenizing the input and assigning part-of-speech tags based on
information from the Sanskrit rule base/tagged corpus. For the next stage, Sanskrit parsing, the
tagged tokens are passed to the Sanskrit CNF grammar and the CYK bottom-up parsing algorithm. The CYK
parser produces the parsing table. The first proposed algorithm then produces the parse tree as
an array of the left, parent, and right nodes of the tree. The post-processing phase reorders the
phrases and produces the target-language text. If the proposed grammar cannot parse
the input sentence, the DMT technique is used to perform the translation with the help of bilingual
dictionaries.
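The control flow described above, rule-based translation first with direct translation as the fallback, can be sketched as below. The function names, the stub parser, and the three dictionary entries are hypothetical; the actual system is implemented in PHP and its dictionaries are far larger.

```python
# Hypothetical bilingual dictionary entries (ITRANS Sanskrit -> Hindi), for illustration only.
DICTIONARY = {"rAmaH": "राम", "vanam": "वन", "gacChati": "जाता है"}


def direct_translate(tokens, dictionary):
    """Word-by-word substitution; unknown words are passed through unchanged."""
    return " ".join(dictionary.get(token, token) for token in tokens)


def translate(sentence, parse, generate, dictionary):
    """Try the rule-based path first and fall back to direct translation."""
    tokens = sentence.split()
    tree = parse(tokens)  # expected to return None when the grammar fails
    if tree is not None:
        return generate(tree)
    return direct_translate(tokens, dictionary)


def failing_parser(tokens):
    """Stub parser that always fails, to exercise the fallback branch."""
    return None


print(translate("rAmaH vanam gacChati", failing_parser, None, DICTIONARY))
# राम वन जाता है
```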
REFERENCES
[1] Kak SC. 1987. The paninian approach to natural language processing. Int J Approx Reason 1(1), 117–130.
[2] Briggs R. 1985. Knowledge representation in Sanskrit and artificial intelligence. AI Mag 6(1), 32.
[3] Bahadur P, Jain A, and Chauhan DS. 2011. English to Sanskrit machine translation. In Proceedings of the international conference & workshop on
emerging trends in technology. ACM, 641–645.
[4] Mishra V, and Mishra RB. 2008. Study of example-based English to Sanskrit machine translation. J Res Dev Comput Sci Eng, 37, 43–54.
[5] Mishra V, and Mishra RB. 2009. ANN and rule-based model for English to Sanskrit machine translation. INFOCOMP J Comput Sci 9(1), 80–89.
[6] Bahadur P, Jain AK, and Chauhan DS. 2012. Etrans-A complete framework for English to Sanskrit machine translation. International Journal of
Advanced Computer Science and Applications (IJACSA) from international conference and workshop on emerging trends in technology. Citeseer,
52–59.
[7] Lewis MP, Simons GF, and Fennig CD. 2015. Ethnologue: languages of Ecuador. SIL International, Dallas.
[8] Mallikarjun B. 2010. Patterns of Indian multilingualism. In: Strength for today and bright hope for tomorrow, vol 10, no 6, 1–18.
[9] Dorr BJ, Hovy EH, and Levin LS. 2004. Natural language processing and machine translation encyclopaedia of language and linguistics, (ELL2).
Machine translation: interlingual methods. In Proceeding international conference of the world congress on engineering.
[10] Dorr BJ. 1994. Machine translation divergences: a formal description and proposed solution. Comput Linguist 20(4), 597–633.
[11] Goyal P, and Sinha RMK. 2009. Translation divergence in English Sanskrit Hindi language pairs. In International sanskrit computational linguistics
symposium. Springer, 134–143.
[12] Shukla P, Shukl D, and Kulkarni A. 2010. Vibhakti Divergence between Sanskrit and Hindi. In Jha GN (ed) Sanskrit Computational Linguistics.
ISCLS 2010. Lecture Notes in Computer Science, vol 6465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17528-2_15
[13] Goyal V, and Lehal GS. 2010. Web based Hindi to Punjabi machine translation system. J Emerg Technol Web Intell 2(2), 148–151.
[14] Dubey P. 2013. Machine translation system for Hindi Dogri language pair. In 2013 international conference on machine intelligence and research
advancement (ICMIRA). IEEE, 422–425.
[15] Dubey P. 2019. The Hindi to Dogri machine translation system: grammatical perspective. Int J Inf Technol 11(1), 171–182.
[16] Narayana VN. 1994. Anusarak: a device to overcome the language barrier. PhD thesis, Department of CSE, IIT Kanpur.
[17] Bharati A, Chaitanya V, Kulkarni AP, and Sangal R. 1997. Anusaaraka machine translation in stages. VIVEK-Bombay 10, 22–25.
[18] Bharati RM, Sankar B, Reddy P, Sharma DM, and Sangal R. 2003. Machine translation: the shakti approach. Pre-conference tutorial. ICON-2003.
[19] Josan GS, and Lehal GS. 2008. A Punjabi to Hindi machine translation system. In 22nd international conference on computational linguistics:
demonstration papers. Association for Computational Linguistics, 157–160.