Hybridization Based Machine Translations For Low-Resource Language With Language Divergence

NANDINI SETHI
Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Kashmere Gate, New
Delhi,110006, Delhi, India. nandinisethi2104@gmail.com
AMITA DEV
Delhi,110006, Delhi, India. vc@igdtuw.ac.in
POONAM BANSAL
Delhi,110006, Delhi, India. poonambansal@igdtuw.ac.in
DEEPAK KUMAR SHARMA

Delhi,110006, Delhi, India. dk.sharma1982@yahoo.com
DEEPAK GUPTA
Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Delhi, India.
deepakgupta@mait.ac.in
This paper presents a machine translation system from Sanskrit to Hindi that hybridises direct and rule-based language
processing. The divergence between Sanskrit and Hindi is also discussed, along with a proposal for handling it. The
proposed system uses Sanskrit-Hindi bilingual dictionaries, a grammatical Sanskrit corpus, and a Sanskrit analysis rule
base. Elasticsearch is used to speed up access to the dictionaries and rule bases employed in the system. Additionally,
a novel technique that builds a parse tree from the parsing table is presented. The system processes the input Sanskrit
sentence using a parsing approach and a Context Free Grammar in normal form for Sanskrit. Because no standard
Sanskrit-Hindi grammatical corpus is available for machine translation, one is designed and developed in the proposed
work. The target-language sentence is produced using the grammatical corpora and bilingual dictionaries. The proposed
system achieved a Bilingual Evaluation Understudy (BLEU) score of 51.6 percent when tested using Python's Natural
Language Toolkit (NLTK) API. In this comparison, the proposed system performs better than existing state-of-the-art
systems.
Keywords: Corpora, Direct translation, Elasticsearch, Natural language processing, Sanskrit grammar, Tagger.
1 INTRODUCTION
Sanskrit is a decorative, refined, and pure language written in the Devanagari script, also known as
"Samaskrita." All other Indian languages have their roots in Sanskrit, one of the world's oldest languages. The
Sanskrit corpus is around 30 million words long, which is one million times larger than the aggregate corpus
of Greek and Latin dialects. It was also available in the form of inscriptions even before the invention of the
printing machine. Sanskrit literature includes literary, theatrical, lyrical, devotional, scientific, engineering,
and mathematical works. The Vedic language is highly inflected by nature, and [1] described it in eight chapters
in a better organised and more understandable manner (Panini's Ashtadhyayi). Since
Georges Artsrouni's Mechanical Brain (1937), an early machine for translating text, commercial-level
products have been created thanks to the dependability, robustness, and efficiency of modern
computer-based translation approaches. Only 49,736 people in India were Sanskrit language speakers as of the
census taken in 1991. Future population declines and a possible shortage of Sanskrit scholars could make it
more challenging to comprehend the ancient Indian texts that are still available in Sanskrit.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To
copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2375-4699/2022/1-ART1 $15.00
http://dx.doi.org/10.1145/3571742
ACM Trans. Asian Low-Resour. Lang. Inf. Process.
Multiple
government and non-government organisations have recently taken a number of actions at the national and
international levels to preserve this indigenous language (Sanskrit) and to unearth the innovation and
engineering written in this language for the benefit of humankind, particularly in the medicinal disciplines.
According to [2], Sanskrit has the most orderly and mathematically organised grammatical style of any
natural language, making it a preferable language for computer comprehension. The Sanskrit language has
the following characteristics:
• Sanskrit is touted as the language of processing because of its generally straightforward structure
and lack of ambiguity.
• Sanskrit is the most methodical and organised language, and because of its more precisely defined
grammar, it is also more technically computable.
• Sanskrit contains a number of hidden algorithms that can be used to examine "Meanings" or "Word
sense" from a variety of angles as part of its extensive scholarly treatises.
• Sanskrit depicts words based on their qualities rather than their objects; any vowel or consonant can
be found in a Sanskrit word. Vowels are self-contained, whereas consonants are dependent on
vowels. It establishes the Sandhi method.
Over 60 million Indians use Hindi on the internet, placing it fourth in the world in proportion to the
number of speakers [87]. In light of the aforementioned facts, our goal is to introduce a large number of Hindi
speakers to Sanskrit and help them become at least somewhat conversant in it. There are several language
translation projects being pursued globally, but Sanskrit-related research is still lacking. Machine Translation
(MT) is the use of computers to translate text from a source language into another language. Using MT, the problem of
converting Sanskrit text into equivalent Hindi content can be solved efficiently and quickly [119]; human translation
alone is not a practical primary solution. Various approaches are available for developing a machine translation
system. Table 1 compares these approaches.
Table 1: Comparison of existing Machine translation approaches
Approaches | Advantages | Disadvantages
Rule based | Simple to create an initial system; based on notions of linguistics; suitable for fundamental phenomena | Experts create the rules; difficult to keep and expand; ineffective for minor occurrences
Example based | Knowledge is extracted from a corpus; based on corpus translation patterns; lowers the human cost | The similarity metric is system dependent; the expense of searching is high; acquiring knowledge remains difficult
Statistics based | Knowledge of numbers; knowledge is extracted from a corpus; the concept is mathematically supported | No linguistic experience; search expenses are high; long-distance phenomena are challenging to capture
Neural based | Because of its complex and dynamic character, context-based machine translation is possible | Continuous learning necessitates a high level of processing
1.1 Need of the Machine Translation Systems (MTS)

According to Ethnologue: Languages of the World [7], humankind uses around 7102 languages and many more
dialects. Due to the scarcity of human translators, the high expense of manual
translation, and the barriers to entry, human translation has never been an efficient way to bridge
these languages. Data from the Census of India 2001 show that there are about 1600 local dialects spoken across 22
scheduled and 100 non-scheduled languages [8, 9]. People must collaborate without regard to language in order
to trade technology, knowledge, and ideas for the development of nations like India. MT approaches can
effectively remove such barriers. Therefore, there is a huge need for MT both globally and locally
in India. In general, MTS has applications in every aspect of life, including tourism, health
care, finance, defence, commerce, public works, web content, and app development. The envisaged translation
system can be used for research purposes to understand the characteristics of the Sanskrit language, which
is one of the most explicit languages, has a well-structured grammar, is regarded as divine in nature, is well
suited for computers according to NASA, and is a rich trove of ancient science and innovation. MTS has a
number of advantages over traditional translation techniques, including high translation speed, lower cost,
more memory than a human for retaining large amounts of data, ease of simultaneous translation into
multiple languages in a multilingual environment, the ability to translate without getting tired, and the
system's availability anywhere, anytime.
The statistical, probabilistic language model was enhanced with context, perplexity, and feature vectors through
the proposal of a neural probabilistic language model [103]. Later work employing neural
language models in Statistical Machine Translation (SMT) demonstrated significant advancements [104],
although many research groups were still hindered by the GPU computing requirements of training.
Other research incorporating neural components into SMT included joint neural network
models [105], statistical language models based on neural networks, and the integration of neural network models into
the decoder. The last of these produced successful empirical findings [106]. Other language model integrations
include parameterized recurrent neural network word embeddings in TED lecture transcription [108] and
a constant-space language model on a GPU for SMT [107]. Source-side pre-ordering is often used for faster and
better SMT [109], and reordering in SMT systems has also been improved using neural models [110].
1.2 Novelty of the work

• A hybridised form of direct and rule-based language processing is used to present a Machine
translation system from Sanskrit to Hindi.
• No standard Sanskrit-Hindi grammatical corpus is available for machine translation, so one is
designed and developed in the proposed work.
• Additionally, a novel technique that builds a parse tree from the parsing table is presented in this
paper.
• The divergence between Sanskrit and Hindi is also discussed in this work, along with a suggestion
for how to handle it.
1.3 Contributions
• To begin with, a grammatical corpus was manually compiled for the proposed model's
construction using a combination of direct and rule-based mechanisms.
• To improve the suggested system's accuracy, we trained and tested it using a variety of models,
training data, and sentence lengths.
• After that, we combined direct machine translation with the linguistic tools produced by the
traditional rule-based approach. This allowed us to disambiguate translations of words, because
the same word might have many meanings depending on the context. We also examined its
ability to tokenize data effectively and lessen data sparsity.
• Performance testing was done to compare rule-based, corpus-based, and hybrid systems on three
metrics: BLEU, Fluency Score, and Adequacy Score.
1.4 Organization of the paper

This article is divided into seven segments. The first segment introduces the Sanskrit language as well
as the need for MT from Sanskrit to Hindi. The second segment presents an overview of the literature on
several translation strategies and MTS established by various scholars. Segment 3 describes several
divergences between the languages that arise during Sanskrit-to-Hindi translation, with examples. The
proposed six-module Sanskrit-to-Hindi machine translation system is described in Segment 4. The data
dictionary, rule base, grammatical corpus, and technologies that are used to implement the proposed system
are described in Segment 5. Segment 6 discusses the evaluation processes that were used to examine and
compare the proposed system with various current systems. Segment 7 brings the article to a close with a
conclusion and references.

2 BACKGROUND
Language divergence must be studied carefully before any translation between any two languages can begin.
The divergence in languages may arise at various levels, namely lexical, syntactic, gerund, thematic, inflation,
voice and particle, according to [10-12]. Chomsky's hierarchy of grammars has been extensively applied to the
computer processing of natural language text. The input sentence is parsed using a suitable parsing
algorithm. Researchers have employed a variety of strategies to create MTS for various languages, and these
strategies can be divided into four main clusters: direct, rule-based, corpus-based and hybrid machine
translation. This section reviews numerous MT systems created for Indian languages using various
methodologies. It is clear from the analysis that more MTS were created using Corpus Based Machine
Translation (CBMT) than Rule Based Machine Translation (RBMT), which in turn accounts for more systems than
Hybrid Based Machine Translation (HBMT) and Direct Machine Translation (DMT).
2.1 Direct Machine Translation (DMT)

MTS built with the DMT method include [13-19]. Since this method does not
require a thorough analysis of the source or target language's syntax or semantics, it requires
less effort to develop. This strategy is still used to construct MTS quickly. Figure 1 represents the evolution of
various machine translation systems using the direct approach.
Figure 1: Direct-based MTS
2.2 Rule-based Machine Translation (RBMT)

MTS built with RBMT include [6, 20-39, 86, 89-91]. For translation, this method calls for a syntactic and semantic
study of the source and target languages. Although it takes a long time and a lot of effort to design, an
MT system that relies on the RBMT technique has the highest efficiency compared to other systems. Figure
2 represents the evolution of various machine translation systems using the rule-based approach.

Figure 2: Rule-based MTS
2.3 Corpus-based Machine Translation (CBMT)

MTS built with CBMT include [4, 23, 40-56, 92-99]. This method covers neural, statistical, and example-based
MT systems. It has gained popularity among MT developers due to the expansion of digital
language resources, computational techniques, and processing power. Figure 3 represents the evolution of
various machine translation systems using the corpus-based approach.
Figure 3: Corpus-based MTS

2.4 Hybrid Based Machine Translation (HBMT)
MTS built with HBMT include [5, 18, 57-66, 85, 100-102]. HBMT is a preferred translation process because it
combines the best aspects of human-engineered and machine-engineered approaches. One way it differs from
other MT systems is the use of several MT modelling approaches. The creation of a hybrid strategy was
driven by the lack of any single technique that could achieve a respectable level of accuracy. Numerous HBMT
systems have succeeded in raising the accuracy of translation systems. Figure 4 represents the evolution of
various machine translation systems using the hybrid approach.
Figure 4: Hybrid-based MTS
Sanskrit language parsers have been proposed in three different forms: shallow [67], deterministic [68],
and constraint-based [69]. Bhadra et al. [70] developed a Sanskrit analytic system that uses segmentation,
shallow parsing, and the Karaka Analysis module to examine Sanskrit texts. [71] suggested a system for
parsing Sanskrit compounds that automatically executes the segmentation operation and detects the Vedic
compounds using statistical methods. The Anusaaraka platform was used to offer a Sanskrit-to-Hindi MTS [17,
72]. For the English translation of Sanskrit texts, Aparna developed an RBMT MTS in 2005. Morphological
analysis of the input Sanskrit sentence was performed to derive simple words from complex Sanskrit
words. These simple words were then supplied to the transducer module, which
created word-by-word transducers. The target language was generated by a translator module, which
received the output from the collection of transducers used to create the parser. In 2014, Upadhayay et al.
proposed another DMT system for converting Sanskrit text to English text and delivering text-to-speech
output, although the system did not undertake any syntactic or semantic analysis; it simply substituted
words one for another and rearranged the resulting words. As the preceding discussion shows, a variety of
methodologies have been employed in the creation of MT systems for the Sanskrit language. For example, [17,
72, 74] used the DMT approach, whereas [73] used the RBMT approach to develop
translations from Sanskrit to Hindi. Each approach has both advantages and disadvantages. Because of the
complexity of constructing a rule base that covers a large part of the language in RBMT, and the lack of
semantic feature analysis in the DMT approach, the authors present a hybrid of the DMT and
RBMT methods to take advantage of the strengths of both.

Table 2: Existing Hybrid Machine Translation System for Indian Languages
Author | Toolkit | Technique | Domain | Corpus | Language
K. M. Kavitha [85] | CLUTO, NLTK, pyiwn | Word-Word Translation Generation (pivoting + clusters) | NA | CFILT, IIT-Bombay | English-Hindi, English-Sanskrit, Konkani-Hindi, Konkani-Sanskrit
M. Singh et al. [111] | TensorFlow and Keras | NMT and RBMT | General domain, news, health, tourism, literature, Wikipedia, and the arts | 162,760 parallel sentences | Sanskrit-Hindi
Salunkhe et al. [112] | Open NLP | RBMT + SMT | NA | Parallel Corpus (IIT-Bombay) | English-Marathi
Dhore and Dixit [115] | NA | RBMT + SMT | Banking glossary | Reserve Bank of India | English-Hindi, English-Marathi, English-Gujarati
Chatterji et al. [116] | NA | RBMT (lattice based lexical transfer) + SMT | Tourism | 2000 sentences | Bengali-Hindi
Chatterji et al. [118] | Giza++ | RBMT (lexical transfer based) + SMT | Written | EMMILE-CIIL | Bengali-Hindi
Nithya et al. [113] | GIZA++, Moses' decoder | SMT + translation memory | Indian and Islamic history | 563 sentences | English-Malayalam
Kaur and Laxmi [114] | NA | RBMT + EBMT | News headlines | 300 sentences | English-Punjabi
Shahnawaz and Mishra [117] | java(jdk1.5) with MATLAB 7.1 | RBMT + NMT | NA | NA | English-Urdu
2.5 Research Gaps and Motivation

• Limited or no work has been done on machine translation for the Sanskrit-Hindi language pair.
• No standard Sanskrit-Hindi grammatical corpora are available for machine translation.
• In non-rule match cases, the rule-based model does not return any output; on the other hand, the
proposed hybrid model always returns the best solution.
Table 3: Comparison of proposed model with Existing system

Research work | MTS Language pair | BLEU Score | Modelling Technique used
K. M. Kavitha et al. [85] | English-Sanskrit | 49.26% | Word-Word Translation Generation (pivoting + clusters)
P. Agrawal et al. [86] | Sanskrit-Hindi | NA | Rule-based Machine Translation
M. Singh et al. [88] | Sanskrit-Hindi | 24% better than RBMT | Corpus-based Machine Translation
Proposed System | Sanskrit-Hindi | 51.6% | Direct Machine Translation fused with Rule based Machine Translation
3 IDENTIFICATION OF SANSKRIT AND HINDI LANGUAGE DIVERGENCE

Before beginning the translation process, it is vital to comprehend the differences between the languages
being taken into consideration. Based on grammatical and syntactic features, Dorr [10] divided the
linguistic divergence issue into seven groups, each with potential remedies. The language divergence between
Hindi and Sanskrit, which also incorporates Dorr's categorization and other divergence patterns, was also
identified by Goyal and Sinha [11] and Shukla, P. [12]. The suggested remedies for the divergence found were
presented as algorithms by Shukla, P. [12]. Any divergence must be resolved using three different sorts of
information: Generalized LR (GLR), and Lexical Conceptual Structure (LCS). While LCS retains the dialect
information about lexical elements, GLR, and LCS are language independent. A lexical divergence is present
when there is an exception in either the GLR, LCS, or both of the languages. Figure 5 shows a divergence
example between the Sanskrit and Hindi languages.

Sanskrit-Source Language (SL): aham pipaThiShaami |
Divergence Example: A single verb from the source language was transformed into two verbs from the target language.
Hindi-Target Language (TL): mai. paDhanaa chaahataa huM |
Figure 5: Language divergence example between Sanskrit and Hindi
Table 4 shows various areas where Sanskrit and Hindi deviate, along with possible examples that might
occur during translation. Sanskrit and Hindi sentences are denoted by "SS" and "HS," respectively, in Table 4.
The ITRANS (Indian Language TRANSliteration) format is used to write the Sanskrit sentences.
Table 4: Language divergence between Sanskrit and Hindi

Divergence | Explanation | Example
Thematic Divergence | Thematic divergence refers to differences in the manifestation of a verb's argument structure. Between Hindi and Sanskrit, there is a rift. In Hindi, the experience verb 'ruc' is given an active construction, whereas in Sanskrit, it conditions a dative subject. There is, however, no differentiation between Sanskrit and Hindi because the closest Hindi equivalent has a dative subject as well. | SS: aham madhuram khādāmi. HS: main mithai khātā huM.
Conflational and Inflectional Divergence | There are countless examples of this mismatch in Sanskrit. The same can be said about Hindi and Sanskrit. | SS: aham pipathisami. HS: main paḍhanā cahatā huM.
Categorial Divergence | Categorial divergences can be evident in the difference in parts of speech between the translation languages. We have an alternate solution in Hindi when transcribing from Hindi to Sanskrit. | SS: sa mahyam irsyati. HS: vaha mujhase irsya karati hai
Optional Divergence | When a sutra assigns two alternate vibhaktis in Sanskrit but only one in Hindi, this is known as optional divergence. | SS: akṣaiḥ dīvyati HS: pāsoṃ se khelatā hai
Exceptional Divergence | When sutras establish extraordinary restrictions for specific instances by confining general standards, these regulations are not appropriate in Hindi. | SS: bālakaḥ paryaṅkam adhiśete. HS: laḍakā palaṃga para sotā hai
Alternative Divergence | When it comes to Alternative Divergence, Sanskrit allows for multiple case suffixes, whereas Hindi only allows for a few, and seldom a totally separate case suffix. | SS: grāmasya/grāmāt vanaṃ dūram asti. HS: gānva se jaṅgala dūra hai
Differential Divergence | Differential divergence occurs when Panini's rule assigns a certain vibhakti, either by karaka or explicit case allocation, but Hindi uses an entirely different vibhakti. | SS: ahnā anuvākaḥ adhītaḥ HS: dinabhara meṃ anuvāka paṛha liyā

The divergence cases can then be categorised as follows:
1. Divergences stemming from Sanskrit Grammar

• Optional Divergence: In addition to the default vibhakti, Sanskrit uses optional vibhakti. Only the
default vibhakti is permitted in Hindi.
• Exceptional Divergence: The default vibhakti is not used in Sanskrit; instead, Panini prohibits it
by treating it as an exceptional circumstance. The default vibhakti is used in Hindi.
• Differential Divergence: Certain karakas or vibhaktis that Panini imposes cannot be understood
through semantic generalisations. There are numerous vibhaktis in Hindi.
• Alternative Divergence: Sanskrit has several other vibhakti options. A single one is used in
Hindi.
• Non-Karaka Divergence: Sanskrit has several other vibhakti options. The sixth vibhakti is used in
Hindi.
2. Divergences brought on by Hindi's peculiarities
• Special vibhakti expectancy of verbs: Divergences brought on by the unique requirements of
some verbs.
• Complex Predicate Divergence: If a Sanskrit verb corresponds to a complex Hindi predicate, the
verb's karma will adopt the sixth case suffix.
Challenges in Processing Sanskrit Language

The difficulties of machine translation vary from language to language, and it is difficult to come up
with a general technique that works for all languages. Dorr's review of the divergence of distinct languages [121]
has created a foundation for further research into the numerous languages spoken in India.
Before creating a machine translation system, linguistic issues specific to Sanskrit must be addressed:
• Irony is present in the morphology and inflection of Sanskrit. This causes a change in the
string's last consonant and its gender, and makes it difficult to keep track of the different string
inflection configurations.
• The structural differences between the source and target languages are mostly to blame for
translation issues. Sanskrit and Hindi are based on vibhakti and karaka; however, English replaces
vibhakti with prepositions or nothing when an NP is present.
• Sanskrit and Hindi sentences are typically written in the passive voice, whereas English
sentences are typically written in the active voice. Due to this voice change, it is difficult to
translate between certain language pairs.
4 PROPOSED SYSTEM

The proposed Sanskrit-to-Hindi translation system is built by combining rule-based and direct machine
translation techniques. The system represents the Sanskrit sentence in Unicode. As shown in Figure 6,
the proposed Sanskrit-to-Hindi MTS is divided into six modules for translation.
[Flowchart components: Sanskrit sentence input; pre-processing and tokenization (tokens); POS tagging (tagged tokens); Sanskrit grammar; Sanskrit rule base; decision "Text parsed?" (YES: Sanskrit parse tree, then Hindi parse tree; NO: Direct MT); Elasticsearch indexer; Hindi generation rule base; Sanskrit-Hindi dictionary; grammatical corpus; post-processing; output as Hindi sentence.]
Figure 6: Architecture of Projected System.
4.1 Data Pre-processing

This module performs the pre-processing and Part-of-Speech (POS) tagging of the sentence in the source
language.
4.1.1 Input Language pre-processing

Even though Sanskrit has a flexible word order, the proposed system processes Sanskrit sentences in
subject-object-verb (SOV) order because other Indian languages also employ this structure. The Sanskrit
sentence is input at this stage using Unicode encoding. The input sentence is checked for SOV
structure; if it does not adhere to it, the sentence is rearranged into SOV
format. The Karaka analysis (case structure) Sanskrit grammatical rule base, which recognises the subject,
object, and verbs in the input sentence, is used to reformat the sentence [122].
A notable feature of Sanskrit is that a word's position does not indicate its function in a sentence.
Therefore, the system uses grammatical features (the grammar rule base) to determine which word
serves as the subject, object, or verb and then converts the sentence into SOV format for
straightforward translation. The interrogative Sanskrit sentence, which was covered in the prior section on
language divergence, is an exception to this rule. If a word is a compound, the Sandhi rule base is then
applied in reverse to obtain the constituent tokens. Once the word order is finalised, tokens are generated
using spaces as the delimiter and passed to the next module.
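The reordering and tokenization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ROLE_OF` table is a hypothetical stand-in for the Karaka-analysis rule base, the words are written in a romanised form for readability rather than Unicode Devanagari, and reverse Sandhi splitting is not shown.

```python
# Hypothetical role lookup standing in for the Karaka-analysis rule base.
ROLE_OF = {
    "rAmaH": "subject",
    "vidyAlayam": "object",
    "gacChati": "verb",
}

# Rank of each role in subject-object-verb order.
SOV_ORDER = {"subject": 0, "object": 1, "verb": 2}


def tokenize(sentence):
    """Drop the sentence-final danda ("|") and split on whitespace."""
    return sentence.replace("|", " ").split()


def to_sov(tokens):
    """Stable-sort tokens so subjects precede objects and objects precede verbs.

    Tokens with an unknown role are ranked with the objects (an assumption
    made only for this sketch), so they end up before the verb.
    """
    return sorted(tokens, key=lambda t: SOV_ORDER[ROLE_OF.get(t, "object")])


print(to_sov(tokenize("gacChati rAmaH vidyAlayam |")))
```

Because Python's `sorted` is stable, words that share a role keep their original relative order.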
4.1.2 Part-of-Speech Tagger

The interpretation of the Sanskrit sentence depends heavily on the appropriate POS tag, making this the most
crucial component of the system. A number of POS tag sets, such as JPOS, have been proposed [75-78].
The IL-POSTS Sanskrit tagset has been chosen for the proposed translation system based on the
comparison in Table 5. The POS tagging procedure is as follows:
(a) The rule base first processes the tokens produced in the previous phase.
(b) By applying the rules, tagging is completed token by token and passed on to the next
stage.
(c) If no rule is identified for a token, the tagged corpus is used to perform the tagging, with
Elasticsearch used to increase processing performance.
(d) If there is still ambiguity, the Grammatical Corpus is used to resolve it by means of
different attributes, and the tokens are then tagged appropriately. Tagged tokens are delivered to the
Parsing module for processing.
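Steps (a) to (d) amount to a cascade of fallbacks, which might be sketched as below. The suffix rules, the two small lookup tables, and the tag labels are all hypothetical placeholders (they are not the IL-POSTS tagset or the authors' resources); plain dictionaries stand in for the Elasticsearch-backed corpora.

```python
# Hypothetical miniature resources; the real rule base and corpora are far larger.
SUFFIX_RULES = [("ati", "V"), ("aH", "N")]        # (suffix, tag) pairs, checked in order
TAGGED_CORPUS = {"vidyAlayam": "N"}               # token -> tag seen in the tagged corpus
GRAMMATICAL_CORPUS = {"aham": "PR", "kim": "Q"}   # e.g. pronouns and interrogatives


def tag_by_rule(token):
    """Step (a)/(b): return the tag of the first matching rule, or None."""
    for suffix, tag in SUFFIX_RULES:
        if token.endswith(suffix):
            return tag
    return None


def tag_token(token):
    """Try the rule base, then the tagged corpus, then the grammatical corpus."""
    tag = tag_by_rule(token)
    if tag is None:                      # step (c): no rule matched
        tag = TAGGED_CORPUS.get(token)
    if tag is None:                      # step (d): still unresolved
        tag = GRAMMATICAL_CORPUS.get(token, "UNK")
    return tag


def tag_sentence(tokens):
    return [(token, tag_token(token)) for token in tokens]


print(tag_sentence(["rAmaH", "vidyAlayam", "gacChati"]))
```

The `"UNK"` label for tokens that no resource covers is an addition made for this sketch; the source does not say how such tokens are handled.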
Table 5: Comparison of existing POS Taggers

Tagset | Unilingual/Multilingual | Structure | Base | Labels
ILMT | Multilingual | Flat | Penn Tree Bank | 26
JPOS | Unilingual | Flat | Paninian Grammar | 134
CPOS | Unilingual | Flat | ILMT+JPOS | 28
IL-POSTS | Multilingual | Hierarchy | EAGLES | 7 categories
4.2 Corpora
This module serves as the proposed system's database and contains
(a) a bilingual Sanskrit-Hindi dictionary
(b) Grammatical Corpora consisting of tables of Communal Words (Word and Meaning), Interrogative
words (Includes the Shabdh, Meaning, Vacchan and Ling), Pronouns (Includes the Shabdh, Meaning,
Vacchan and Ling), Verbs (Includes the Word, Meaning, Joining Word and Word Type), Definition of
Vowels (Includes the Shabdh Type, Vacchan Type, Ling Type, Vibhakti, Ending Meaning, Suffix and
Prefix) and Lakaar (Includes the Shabdh Type, Kaala Type, Purush Type, Vacchan Type, End
Meaning, Suffix, Prefix and Verb Gender).
This dataset is used by many modules throughout the translation process, as depicted in Figure 6. The
authors employed Elasticsearch, an open-source, extensible text search and analytics
engine, to improve access to data from these vocabularies and the labelled dataset. It
indexes words together with their positions so that they can be searched quickly. Large amounts of data can
be stored, searched, and analysed in near real time. Using this technique, searching may be
performed on any data, whether it is structured or not.
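The idea of indexing words with their positions can be illustrated with a small positional inverted index. This is a pure-Python stand-in for the concept only; it does not use or reproduce the Elasticsearch API, and the entry identifiers and romanised sentences are invented for the example.

```python
from collections import defaultdict


def build_index(entries):
    """Map each word to the (entry_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for entry_id, text in entries.items():
        for position, word in enumerate(text.split()):
            index[word].append((entry_id, position))
    return index


def lookup(index, word):
    """Return all postings for a word, or an empty list if it was never indexed."""
    return index.get(word, [])


# Hypothetical corpus entries keyed by an integer id.
entries = {
    1: "rAmaH vidyAlayam gacChati",
    2: "aham vidyAlayam gacChAmi",
}
index = build_index(entries)
print(lookup(index, "vidyAlayam"))
```

A lookup is then a single dictionary access instead of a scan over every entry, which is the kind of speed-up the indexing step is meant to provide.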
4.3 Parsing
This module describes the Sanskrit grammar and the Cocke-Younger-Kasami (CYK) parser used to process
source-language content. It is divided into two submodules: the CYK parsing table and the construction of
the Sanskrit parse tree.
4.3.1 CYK Parsing Table

The Sanskrit grammar is implemented with a CYK parser, which employs bottom-up parsing and a dynamic
programming technique. The CYK parser requires the context-free grammar (CFG) to be in Chomsky Normal Form
(CNF). For an input of length m and a grammar with p non-terminals, it fills a triangular table. The
worst-case time complexity of the CYK parser is O(m^3), and the worst-case space complexity is O(m^2),
where m is the length of the input string [120]; this compares favourably with other parsing techniques in
the worst case. Parsing is carried out using a matrix representation.
The following is the parsing procedure:
(a) Take the words and their parts of speech as input.

(b) Create a matrix with the dimensions [N, N], where N represents the total number of tokens in the
text.
(c) Place the variables and terminals of the mapped grammar in the diagonal cells of the matrix in
the same order as the tokens appear in the sentence.
(d) When there are multiple possibilities, the CYK implementation takes the one that is identified
later.
(e) Turn the CYK matrix into a tree by tracing the descendants at each point, starting from the
start symbol of the grammar located at [0, N].
If the proposed grammar successfully parses the input sentence, the parse table is used to create the
Sanskrit parse tree in Section 4.3.2; otherwise, control passes to Section 4.5.
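The procedure above can be sketched with a compact CYK implementation. The grammar below is a hypothetical toy grammar in CNF, not the authors' Sanskrit grammar, and the sketch makes two simplifications that should be read as assumptions: cells are indexed by the inclusive token span [i][j] (so the start symbol is looked up at [0][N-1]), and each cell stores the subtree itself rather than back-pointers to be traced afterwards.

```python
# Hypothetical toy grammar in Chomsky Normal Form.
BINARY_RULES = {("NP", "VP"): "S", ("NNP", "NN"): "NP"}   # (B, C) -> A for rule A -> B C
LEXICAL_RULES = {"NNP": "NNP", "NN": "NN", "Verb": "VP"}  # POS tag -> non-terminal


def cyk_parse(tagged_tokens, start="S"):
    """Return a nested-tuple parse tree, or None if the grammar rejects the input."""
    n = len(tagged_tokens)
    # table[i][j] maps a non-terminal to the tree covering tokens i..j inclusive.
    table = [[{} for _ in range(n)] for _ in range(n)]
    # Step (c): fill the diagonal from the POS tags.
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos in LEXICAL_RULES:
            table[i][i][LEXICAL_RULES[pos]] = (LEXICAL_RULES[pos], word)
    # Fill longer spans bottom-up from every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, left_tree in table[i][k].items():
                    for right, right_tree in table[k + 1][j].items():
                        parent = BINARY_RULES.get((left, right))
                        if parent is not None:
                            # Step (d): a later derivation overwrites an earlier one.
                            table[i][j][parent] = (parent, left_tree, right_tree)
    return table[0][n - 1].get(start) if n else None


tree = cyk_parse([("rAmaH", "NNP"), ("vidyAlayam", "NN"), ("gacChati", "Verb")])
print(tree)
```

A return value of `None` corresponds to the "not parsed" branch, where control would pass to the direct translation module.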
4.3.2 Tree Generation

The parse tree is created by the parse tree algorithm from the parsing table built in Section 4.3.1, as shown
in Figure 7.
[Sentence [Noun Phrase [NNP रामः] [NN विद्यालयं]] [Verb Phrase [Verb गच्छति]]]
Figure 7: Parse tree for Sanskrit Sentence
4.4 Target Language Parse Tree

This module produces the Hindi counterpart of the Sanskrit parse tree, as shown in Figure 8. The target
parse tree is generated using the Sanskrit-Hindi bilingual dictionary and the language divergence mechanism
described in Section 3. Elasticsearch reduces the time needed to retrieve entries from the dictionaries.
When target-language sentences are generated, the grammatical corpora resolve word ambiguity and contribute
semantic information.

[Sentence [Noun Phrase [NNP राम] [NN विद्यालय]] [Verb Phrase [VP [Verb जाता] [VM है]]]]
Figure 8: Parse Tree of Hindi Sentence
4.5 DMT-based Translation

This module carries out the translation using the direct machine translation (DMT) technique. The
target-language word for each source-language word is generated using the Sanskrit-Hindi dictionary and the
grammatical corpora. Term-by-term replacement is completed in this stage. Retrieval of the corresponding
Hindi word is accelerated by the technique of Section 4.2. Word order is adjusted to the target language in
Section 4.6.
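Term-by-term replacement amounts to a dictionary lookup per token. The following sketch assumes a hypothetical three-entry dictionary `SA_HI` and a policy of leaving unknown words unchanged; the actual system draws on the full bilingual dictionary and the grammatical corpora.

```python
# Hypothetical three-entry dictionary, for illustration only.
SA_HI = {"रामः": "राम", "विद्यालयं": "विद्यालय", "गच्छति": "जाता है"}


def direct_translate(tokens, dictionary):
    """Replace each source token by its dictionary entry; keep unknown tokens as they are."""
    return [dictionary.get(token, token) for token in tokens]


hindi_tokens = direct_translate("रामः विद्यालयं गच्छति".split(), SA_HI)
print(" ".join(hindi_tokens))
```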
4.6 Post-processing
The complete output is assembled in this step and, where necessary, reordered. For the rule-based path, the
target Hindi sentence is produced by traversing, from left to right, the leaf nodes of the tree built by the
target language parse tree module (Section 4.4). For the direct method, the words produced in Section 4.5
are rearranged so that the sentence follows the structure of the target language.
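Reading the sentence off the target tree is a left-to-right traversal of its leaves. The sketch below assumes a nested-tuple tree of the form (label, child, ...), with words as plain strings, and an assumed reading of the Hindi example; the representation is chosen for illustration and is not the system's internal format.

```python
def leaves(node):
    """Yield the words of a (label, child, ...) tuple tree from left to right."""
    _label, *children = node
    for child in children:
        if isinstance(child, str):
            yield child  # a leaf word
        else:
            yield from leaves(child)  # descend into a subtree


# Assumed Hindi parse tree for the running example.
hindi_tree = (
    "S",
    ("NP", ("NNP", "राम"), ("NN", "विद्यालय")),
    ("VP", ("Verb", "जाता"), ("VM", "है")),
)
sentence = " ".join(leaves(hindi_tree))
print(sentence)
```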
5 EXPERIMENTAL SETUP
The following resources were created for the implementation of the proposed system:
(a) A bilingual Sanskrit-Hindi dictionary.
(b) Grammatical corpora consisting of tables of Communal Words, Interrogative Words, Pronouns,
Verbs, Definition of Vowels and Lakaar.
5.2 Grammatical Corpora

The four fundamental components of Sanskrit noun grammar are gender (linga), number (vachan), case
(vibhakti), and the ending letter of the noun (swaraanta / vyanjanaanta). Like English, Sanskrit has three
lingas: masculine (Pullinga), feminine (Strilinga), and neuter (Napunsaklinga). Sanskrit has three categories
of number, whereas Hindi has only two: singular (ekvachan), dual (dvivachan), and plural (vahuvachan); the
dual, dvivachan, is the category that Hindi lacks. Each Sanskrit noun has eight vibhaktis: the nominative
case (prathamaa), the accusative case (dvitiya), the instrumental case (tritiya), the dative case
(chaturthi), the ablative case (panchami), the genitive case (shashthi), the locative case (saptami), and
the vocative case (sambodhan).
The basic meaning of any Sanskrit verb is governed by three elements: tense (lakaar), number (vachan), and
person (purush). Unlike English grammar, Sanskrit has five tenses, which yield ten sets of forms: five for
parasmai pad verbs and five for aatmanai pad verbs. Sanskrit verbs thus fall into two categories, aatmanai
pad verbs and parasmai pad verbs, of which parasmai pad verbs are used more frequently.
We have created corpora for the parasmai pad verb tenses. The tenses in Sanskrit are: present tense
(latlakaar), past tense (langlakaar), future tense (lritlakaar), imperative mood (lotlakaar), and potential
mood (vidhilingalakaar). Like noun forms, verb forms have three numbers: singular (ekvachan), dual
(dvivachan), and plural (vahuvachan).
Sanskrit has three persons (purush): Prathama purush, Madhyam purush, and Uttam purush. Prathama purush is
the verb person for nouns (such as names, places, and objects) and pronouns such as He, She, It, and They.
Madhyam purush is the verb person for You. Uttam purush is the verb person for I / We. We considered 15
different swaraanta and vyanjanaanta endings for our noun (and a few pronoun) forms. Each type of ending has
three genders, three numbers, and eight cases. These noun form groups can therefore contain a total of 707
distinct noun words, which together have 261 different suffixes; that is, each noun form can be described by
one of 261 distinct suffixes (the rest are repetitions). With the five verb tenses mentioned above, three
persons, and three numbers for each person, there are 45 possible verb forms for these verb form categories.
Since each of these suffixes is unique, all of them must be considered. In one tense, langlakaar, the prefix
"a" is used in addition to the suffix.
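The count of 45 verb forms is the product of five tenses, three persons, and three numbers, and each noun ending likewise spans 3 x 3 x 8 = 72 gender, number, and case combinations. The short enumeration below makes this arithmetic explicit; the romanised labels are shorthand introduced for the example.

```python
from itertools import product

LAKAARS = ["lat", "lang", "lrit", "lot", "vidhiling"]
PURUSH = ["prathama", "madhyam", "uttam"]
VACHAN = ["ekvachan", "dvivachan", "vahuvachan"]
LINGA = ["pullinga", "strilinga", "napunsaklinga"]
VIBHAKTI = ["prathamaa", "dvitiya", "tritiya", "chaturthi",
            "panchami", "shashthi", "saptami", "sambodhan"]

# Every (tense, person, number) combination needs its own verb suffix.
verb_slots = list(product(LAKAARS, PURUSH, VACHAN))
# Every (gender, number, case) combination is one cell of a noun paradigm.
noun_slots = list(product(LINGA, VACHAN, VIBHAKTI))

print(len(verb_slots))  # 5 * 3 * 3 = 45
print(len(noun_slots))  # 3 * 3 * 8 = 72
```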
Following the description of Sanskrit grammar above, the constructed Sanskrit corpora comprise the following
entities:
• Communal Words (Word and Meaning)
• Interrogative Words (Includes the Shabdh, Meaning, Vacchan and Ling)
• Pronouns (Includes the Shabdh, Meaning, Vacchan and Ling)
• Verbs (Includes the Word, Meaning, Joining Word and Word Type)
• Definition of Vowels (Includes the Shabdh Type, Vacchan Type, Ling Type, Vibhakti, Ending
Meaning, Suffix and Prefix)
• Lakaar (Includes the Shabdh Type, Kaala Type, Purush Type, Vacchan Type, End Meaning,
Suffix, Prefix and Verb Gender)
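To make the shape of one such table concrete, a Lakaar record could be modelled as below. The field names follow the list above, but the class name `LakaarEntry`, the lookup key, and the two sample rows are illustrative assumptions and are not taken from the authors' corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakaarEntry:
    shabdh_type: str
    kaala_type: str
    purush_type: str
    vachan_type: str
    end_meaning: str  # Hindi ending attached to the translated verb stem
    suffix: str       # Sanskrit suffix that identifies the form
    prefix: str       # Sanskrit prefix, empty when the form has none
    verb_gender: str


# Illustrative rows only (assumed values).
entries = [
    LakaarEntry("parasmaipad", "lat", "prathama", "ekvachan", "ता है", "ति", "", "pullinga"),
    LakaarEntry("parasmaipad", "lang", "prathama", "ekvachan", "ा", "त्", "अ", "pullinga"),
]

# Index the table by (tense, person, number) for direct lookup.
index = {(e.kaala_type, e.purush_type, e.vachan_type): e for e in entries}
print(index[("lang", "prathama", "ekvachan")].prefix)
```

The second row reflects the remark above that langlakaar carries an extra prefix in addition to the suffix.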
Elasticsearch was applied to increase translation speed. The web environment for the proposed system was
built with PHP on a WAMP server, together with the Sanskrit-Hindi generation rules. Sanskrit sentence
processing consists of tokenizing the input and assigning part-of-speech tags based on information from the
Sanskrit rule base/tagged corpus. For the next stage, Sanskrit parsing, the tagged words are passed to the
Sanskrit CNF grammar and the CYK bottom-up parsing algorithm. The CYK parser produces the parsing table. The
proposed tree-generation algorithm then produces the parse tree as an array of the left, parent, and right
nodes of the tree. The post-processing phase reorders the output of the preceding phases and delivers the
target-language text. If the proposed grammar cannot parse the input sentence, the DMT technique is used to
perform the translation with the help of the bilingual dictionaries.
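The control flow between the rule-based path and the DMT fallback can be summarised in a few lines. In the sketch below the function `translate` and the three toy components are hypothetical stand-ins for the parsing, rule-based generation, and direct translation modules; only the fallback logic is the point.

```python
def translate(tokens, parse, rule_based, direct):
    """Use the rule-based path when parsing succeeds, otherwise fall back to DMT."""
    tree = parse(tokens)
    if tree is not None:
        return rule_based(tree), "RBMT"
    return direct(tokens), "DMT"


# Stand-ins for the real modules (all hypothetical).
def toy_parse(tokens):
    # Pretend the grammar only covers three-token sentences.
    return ("S", list(tokens)) if len(tokens) == 3 else None


def toy_rule_based(tree):
    return " ".join(tree[1])


def toy_direct(tokens):
    return " ".join(tokens)


print(translate(["w1", "w2", "w3"], toy_parse, toy_rule_based, toy_direct))
print(translate(["w1", "w2"], toy_parse, toy_rule_based, toy_direct))
```

Because the fallback is taken whenever the parser returns nothing, every input produces some output, which is the property the hybrid design relies on.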
6 RESULTS AND DISCUSSION

For evaluation, a set of 100 manually translated sentences in Sanskrit and Hindi from the general domain
(daily conversation sentences) was used. Results of the proposed system are shown in Table 6.

Table 6: Testing Results
Columns: Sanskrit Sentence | Observed Output | Desired Output | Information Transmission
[Six example rows; the Devanagari sentence text is not recoverable. Information Transmission values in row
order: complete, almost complete, small, complete, almost complete, small.]
6.1 BLEU Score

The Bilingual Evaluation Understudy (BLEU) is a widely used metric for measuring how close translated
sentences are to human reference translations; it is defined in Equation (1). The proposed system is
evaluated with a weighted BLEU score. Because the BLEU score drops rapidly for the 3-gram and 4-gram models
relative to the 2-gram model, a cumulative 2-gram BLEU score is computed in Python using NLTK. The BLEU-2
(2-gram BLEU) score of the proposed system is 51.6 percent.
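\[ \mathrm{BLEU} = \min\left(1, \exp\left(1 - \frac{r}{c}\right)\right) \left(\prod_{n=1}^{N} p_n\right)^{1/N} \tag{1} \]
where r is the length of the reference translation, c is the length of the system output, p_n is the
modified n-gram precision, and N is the highest n-gram order (N = 2 in this evaluation).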
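For a single sentence pair with one reference, a cumulative 2-gram BLEU score can be computed with the standard library alone, as sketched below. The function names and the example sentences are assumptions for illustration; the evaluation itself used NLTK, where the corresponding call is `sentence_bleu(references, hypothesis, weights=(0.5, 0.5))`.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return clipped / total if total else 0.0


def bleu(reference, hypothesis, max_n=2):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "राम विद्यालय जाता है".split()
print(bleu(reference, reference))                    # identical output
print(bleu(reference, "राम विद्यालय जाता".split()))  # shorter output is penalised
```

In the second call every n-gram of the output appears in the reference, so the score is reduced only by the brevity penalty exp(1 - 4/3).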
6.2 Fluency Score

The proposed system scores 3.2 out of 4 on a four-point Fluency scale. The score reflects how closely the
sentences generated by the proposed system adhere to the grammatical conventions of the target language. A
perfect translation receives a score of 4, a fair translation 3, an acceptable translation 2, and an
unacceptable translation 1.
The analysis of the 100-sentence sample is as follows:
(a) 68 sentences received a score of 4 (perfect translation).
(b) 12 sentences received a score of 3 (fair translation).
(c) 11 sentences received a score of 2 (acceptable, but requires effort to understand).
(d) 9 sentences received a score of 1 (not acceptable).
6.3 Adequacy Score

The Adequacy Score reflects how well the translated sentence conveys the information of the source-language
sentence. On a four-point scale, the proposed system received a score of 3.32. A score of 4 represents full
information transmission, 3 nearly full transmission, 2 small transmission, and 1 no transmission.
The analysis of the 100-sentence sample is as follows:
(a) 69 sentences received a score of 4 (full information transmission).
(b) 14 sentences received a score of 3 (almost all information transmitted).
(c) 10 sentences received a score of 2 (small information transmission).
(d) 7 sentences received a score of 1 (no information transmission).
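The per-score counts listed in Sections 6.2 and 6.3 can be averaged directly. The sketch below assumes a plain arithmetic mean, which is an assumption because the averaging scheme is not stated. Under that assumption the counts give 3.39 for fluency and 3.45 for adequacy, which differ from the reported 3.2 and 3.32, so the reported figures may rest on a different weighting.

```python
def mean_score(counts):
    """Arithmetic mean of scores, given a mapping score -> number of sentences."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


fluency = {4: 68, 3: 12, 2: 11, 1: 9}
adequacy = {4: 69, 3: 14, 2: 10, 1: 7}

print(mean_score(fluency))   # 3.39
print(mean_score(adequacy))  # 3.45
```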

Table 7 compares the proposed system with several existing systems; the proposed system achieves the
highest BLEU score among the systems for which a score is reported.
Table 7: Evaluation of proposed system with prevailing MTS

MTS                      | BLEU Score           | Modelling Technique                                    | Citation
English-Sanskrit (EN-SN) | 49.26%               | Word-Word Translation Generation (pivoting + clusters) | [85]
Sanskrit-Hindi           | NA                   | RBMT                                                   | [86]
Sanskrit-Hindi           | 24% better than RBMT | CBMT                                                   | [88]
Sanskrit-Hindi           | 51.6%                | DMT with RBMT                                          | Proposed System
7 CONCLUSION AND FUTURE WORK

In the presented Sanskrit-to-Hindi translation system, a grammatical corpus, a CYK parser, and a parse-tree
generation algorithm were developed, together with a hybridization of DMT and RBMT. This work also describes
the different types of divergence in the Sanskrit-Hindi language pair. The proposed system includes a
Sanskrit-Hindi bilingual dictionary, a grammatical Sanskrit corpus, and a rule-based Sanskrit analyser. It
addresses the shortcomings of earlier work in extensibility, generalizability, and adaptability [85, 86,
88]. The approach is novel and can be adapted to other language pairs. The proposed system obtained a BLEU
score of 51.6 percent, higher than the scores reported for the other MT systems it was compared with. The
hybrid model outperforms the existing RBMT in terms of speed and efficiency. It is also more robust when no
rule matches: if the rule-based technique fails to deliver output, the output is produced by the direct
method. A purely rule-based model produces no output when no rule matches, whereas the proposed model always
returns a translation. The grammatical corpus and the syntax tree generation technique could be used to
build other NLP applications, such as a Sanskrit language analyser. The proposed system may be useful to
young people who want to learn Sanskrit but lack the resources to do so. It has the potential to assist the
more than 60 million Indians who use the Hindi internet [87] in self-learning Sanskrit. Domain-specific
applications can be created by expanding the domain rule base.
Long sentences are difficult to handle under the current approach and occasionally cannot be translated at
all. The proposed model performs well for short and medium-length sentences, but the results for long
sentences are less reliable. In future work, deep learning techniques can be incorporated into the proposed
model to obtain more fluent, real-time results.
REFERENCES
[1] Kak SC. 1987. The paninian approach to natural language processing. Int J Approx Reason 1(1), 117-130.
[2] Briggs R. 1985. Knowledge representation in Sanskrit and artificial intelligence. AI Mag 6(1):32.
[3] Bahadur P, Jain A, and Chauhan DS. 2011. English to Sanskrit machine translation. In Proceedings of the international conference & workshop on
emerging trends in technology. ACM, 641-645.
[4] Mishra V, and Mishra RB. 2008. Study of example-based English to Sanskrit machine translation. J Res Dev Comput Sci Eng, 37, 43-54.
[5] Mishra V, and Mishra RB. 2009. ANN and rule-based model for English to Sanskrit machine translation. INFOCOMP J Comput Sci 9(1), 80-89.
[6] Bahadur P, Jain AK, and Chauhan DS. 2012. Etrans-A complete framework for English to Sanskrit machine translation. International Journal of
Advanced Computer Science and Applications (IJACSA) from international conference and workshop on emerging trends in technology. Citeseer, 52-59.
[7] Lewis MP, Simons GF, and Fennig CD. 2015. Ethnologue: languages of Ecuador. SIL International, Dallas.
[8] Mallikarjun B. 2010. Patterns of Indian multilingualism. In: Strength for today and bright hope for tomorrow, vol 10, no 6, 1-18.
[9] Dorr BJ, Hovy EH, and Levin LS. 2004. Natural language processing and machine translation encyclopaedia of language and linguistics, (ELL2).
Machine translation: interlingual methods. In Proceeding international conference of the world congress on engineering.
[10] Dorr Bonnie J. 1994. Machine translation divergences: a formal description and proposed solution. Comput Linguist 20(4), 597-633.
[11] Goyal P, and Sinha RMK. 2009. Translation divergence in English Sanskrit Hindi language pairs. In International sanskrit computational linguistics
symposium. Springer, 134-143.
[12] Shukla, P., Shukl, D., and Kulkarni, A. 2010. Vibhakti Divergence between Sanskrit and Hindi. In Jha, G.N. (eds) Sanskrit Computational Linguistics.
ISCLS 2010. Lecture Notes in Computer Science, vol 6465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17528-2_15
[13] Goyal V, and Lehal GS. 2010. Web based Hindi to Punjabi machine translation system. J Emerg Technol Web Intell 2(2), 148-151.
[14] Dubey P. 2013. Machine translation system for Hindi Dogri language pair. In 2013 international conference on machine intelligence and research
advancement (ICMIRA). IEEE, 422-425.
[15] Dubey P. 2019. The Hindi to Dogri machine translation system: grammatical perspective. Int J Inf Technol 11(1), 171-182.
[16] Narayana VN. 1994. Anusarak: a device to overcome the language barrier. PhD thesis, Department of CSE, IIT Kanpur.
[17] Bharati A, Chaitanya V, Kulkarni AP, and Sangal R. 1997. Anusaaraka machine translation in stages. VIVEK-Bombay 10, 22-25.
[18] Bharati RM, Sankar B, Reddy P, Sharma DM, and Sangal R. 2003. Machine translation: the shakti approach. Pre-conference tutorial. ICON-2003.
[19] Josan GS, and Lehal GS. 2008. A Punjabi to Hindi machine translation system. In 22nd international conference on computational linguistics:
demonstration papers. Association for Computational Linguistics, 157-160.
[20] Rajan R, Sivan R, Ravindran R, and Soman KP. 2009. Rule based machine translation from English to Malayalam. In ACT’09. International conference
on advances in computing, control, & telecommunication technologies, IEEE, 439-441.
[21] Goyal P, and Sinha RMK. 2009. A study towards design of an English to Sanskrit machine translation system. Sanskrit computational linguistics.
Springer, 287-305.
[22] Pathak GR, and Godse SP. 2010. English to Sanskrit machine translation using transfer approach. In International conference on methods and models
in science and technology. American Institute of Physics, Pune, 122-126.
[23] Mishra V, and Mishra RB. 2012. English to Sanskrit machine translation system: a rule-based approach. Int J Adv Intell Paradig 4(2), 168-184.
[24] Reddy MV, and Hanumanthappa M. 2013. Indic language machine translation tool: English to Kannada/Telugu. In Multimedia processing,
communication and computing applications. Springer, 35-49. https://doi.org/10.1007/978-81-322-1143-3_4
[25] Jayan V, and Bhadran VK. 2014. Anglabharati to Anglamalayalam: an experience with English to Indian language machine translation. In 2014
international conference on contemporary computing and informatics (IC3I). IEEE, 282-287.
[26] Desai P, Sangodkar A, and Damani OP. 2014. A domain-restricted, rule based, English Hindi machine translation system based on dependency parsing.
In Proceedings of the 11th international conference on natural language processing, 177-185.
[27] Balyan R, and Chatterjee N. 2015. Translating noun compounds using semantic relations. Comput Speech Lang 32(1), 91-108.
[28] Aasha VC, and Ganesh A. 2015. Machine translation from English to Malayalam using transfer approach. In 2015 international conference on advances
in computing, communications and informatics (ICACCI). IEEE, 1565-1570.
[29] Sridhar R, Sethuraman P, and Krishnakumar K. 2016. English to Tamil machine translation system using universal networking language. Sādhanā
41(6), 607-620.
[30] Sinha R, Sivaraman KS, Agrawal A, Jain R, Srivastava R, and Jain A. 1995. Anglabharti: a multilingual machine aided translation project on translation
from English to Indian languages. In IEEE international conference on systems, man and cybernetics, 1995. Intelligent systems for the 21st century, vol
2. IEEE, 1609-1614.
[31] Darbari H. 1999. Computer-assisted translation system an Indian perspective. In Machine translation summit VII, 13th-17th September, 80-85.
[32] Dave S, Parikh J, and Bhattacharyya P. 2001. Interlingua-based English-Hindi machine translation and language divergence. Mach Transl 16(4), 251-304.
[33] Singh S, Dalal M, Vachani V, Bhattacharyya P, and Damani OP. 2007. Hindi generation from interlingua. In Proceedings of machine translation
summit, 1-8.
[34] Choudhary A, and Singh M. 2009. Gb theory-based Hindi to English translation system. In 2nd IEEE international conference on computer science and
information technology, ICCSIT 2009. IEEE, 293-297.
[35] Christopher M, and Rao UM. 2010. IL-ILMT sampark: a hybrid machine translation system. In 32nd all India conference of linguistics (AICL32).
Lucknow University, 69-75.
[36] Batra KK, and Lehal GS. 2010. Rule based machine translation of noun phrases from Punjabi to English. Int J Comput Sci Issues 7(5), 409-413.
[37] Batra KK, and Lehal GS. 2011. Automatic translation system from Punjabi to English for simple sentences in legal domain. Int J Trans 23(1), 79-98.
[38] Kumar P, and Sharma RK. 2012. Punjabi to unl enconversion system. Sadhana 37(2), 299-318.
[39] Parteek Kumar, and Rajendra Kumar Sharma. 2013. Punjabi deconverter for generating Punjabi from universal networking language. J Zhejiang Univ
Sci C 14(3), 179-196.
[40] Udupa UR, and Faruquie TA. 2005. An English Hindi statistical machine translation system. In Su KY, Tsujii J, Lee JH, Kwong OY (eds) Natural
language processing IJCNLP 2004. IJCNLP 2004. Lecture notes in computer science, vol 3248. Springer, Berlin, Heidelberg, 254-262.
https://doi.org/10.1007/978-3-540-30211-7_27
[41] Antony PJ. 2013. Machine translation approaches and survey for Indian languages. Int J Comput Linguist Chin Lang Process 18(1), 47-78.
[42] Garje GV, and Kharate GK. 2013. Survey of machine translation systems in India. Int J Nat Lang Comput (IJNLC) 2(4), 47-67.
[43] Sinha RMK. 2004. An engineering perspective of machine translation: anglabharti-ii and anubharti-ii architectures. In Proceedings of international
symposium on machine translation, NLP and translation support system (iSTRANS-2004), 10-17.
[44] Jain R, Sinha RMK, and Jain A. 2001. Anubharti-using hybrid example-based approach for machine translation. In: STRANS2001, IIT Kanpur, 20-32.
[45] Sinha RMK, and Thakur A. 2005. Machine translation of bi-lingual Hindi English (Hinglish) text. In 10th Machine translation summit (MT Summit X),
Phuket, Thailand, 149-156.
[46] Sachdeva K, Srivastava R, Jain S, and Sharma DM. 2014. Hindi to English machine translation: using effective selection in multimodel SMT. In LREC,
1807-1811.
[47] Dungarwal P, Chatterjee R, Mishra A, Kunchukuttan A, Shah R, and Bhattacharyya P. 2014. The IIT bombay Hindi English translation system at WMT
2014. In: ACL 2014, 90-96.
[48] Och FJ. 2007. Google translator. In Joint conference on empirical methods in natural language processing and computational natural language learning.
Prague. Association for Computational Linguistics, 858-867.
[49] Venkatapathy S, and Bangalore S. 2009. Discriminative machine translation using global lexical selection. ACM Trans Asian Lang Inf Process (TALIP)
8(2).
[50] Sharma N. 2011. English to Hindi statistical machine translation system. PhD thesis, Thapar University Patiala.
[51] Khan N, Anwar W, Bajwa UI, and Durrani N. 2013. English to Urdu hierarchical phrase-based statistical machine translation. In WSSANLP2013, Japan,
72-76.
[52] Ali A, Hussain A, and Malik MK. 2013. Model for English Urdu statistical machine translation. World Appl Sci 24, 1362-1367.
[53] Sheikh M, and Conlon S. 2013. Application of machine translation in bilingual knowledge management. Int J Intercult Inf Manag 3(2), 123-137.
[54] Jawaid B, Kamran A, and Bojar O. 2014. English to Urdu statistical machine translation: establishing a baseline. In Proceedings of the Fifth workshop
on south and southeast Asian natural language processing, 37-42.
[55] Naskar S, and Bandyopadhyay S. 2005. Use of machine translation in India: current status. AAMT J 16, 25-31.
[56] Badodekar S. 2003. Translation resources, services and tools for Indian languages. In Computer science and engineering department, Indian Institute of
Technology.
[57] Saini TS, Lehal GS, and Kalra VS. 2008. Shahmukhi to Gurmukhi transliteration system. In 22nd international conference on computational
linguistics: demonstration papers. Association for Computational Linguistics, 177-180.
[58] Goyal V, and Lehal GS. 2011. Hindi to Punjabi machine translation system. In Proceedings of the 49th annual meeting of the association for
computational linguistics: human language technologies: systems demonstrations. Association for Computational Linguistics, 1-6.
[59] Narayan R, Singh VP, and Chakraverty S. 2014. Quantum neural network based machine translator for Hindi to English. Sci World J 2014, 1-8.
https://doi.org/10.1155/2014/485737
[60] Sinha RMK, and Jain A. 2003. Anglahindi: an English to Hindi machine-aided translation system. In MT Summit IX, New Orleans, USA, 494-497.
[61] Sinha RMK. 2005. Integrating CAT and MT in Anglabharti-II architecture. In 10th EAMT conference, 235-244.
[62] Saha GK. 2005. The eb-anubad translator: a hybrid scheme. J Zhejiang Univ Sci A 6(10), 1047-1050.
[63] NCST. 2008. Matra: an English to Hindi machine translation system. Technical report, NCST Mumbai.
[64] Shahnawaz A, and Mishra RB. 2011. Translation rules and ANN based model for English to Urdu machine translation. INFOCOMP J Comput Sci 10(3),
25-35.
[65] Shahnawaz, and Mishra RB. 2015. An English to Urdu translation model based on CBR ANN and translation rules. Int J Adv Intell Paradig 7(1), 1-23.
[66] Jaideepsinh K, and Jatinderkumar S. 2016. Sanskrit machine translation systems: a comparative analysis. Int J Comput Appl 136, 1-4.
[67] Huet G. 2006. Shallow syntax analysis in Sanskrit guided by semantic nets constraints. In Proceedings of the 2006 international workshop on research
issues in digital libraries. ACM.
[68] Kulkarni A, Pokar S, and Shukl D. 2010. Designing a constraint-based parser for Sanskrit. In Sanskrit computational linguistics. Springer, 70-90.
[69] Kulkarni A. 2013. A deterministic dependency parser with dynamic programming for Sanskrit. In Proceedings of the second international conference
on dependency linguistics (DepLing 2013), 157-166.
[70] Bhadra M, Singh SK, Kumar S, Agrawal M, Chandrasekhar R, Mishra SK, and Jha GN. 2009. Sanskrit analysis system (SAS). In Sanskrit computational
linguistics. Springer, 116-133.
[71] Kumar A, Mittal V, and Kulkarni A. 2010. Sanskrit compound processor. In Sanskrit computational linguistics. Springer, 57-69.
[72] Bharati A, and Kulkarni A. 2009. Anusaaraka: an accessor cum machine translator. Department of Sanskrit Studies, University of Hyderabad, 1-7.
[73] Aparna S. 2005. Sanskrit to English translator. In Language in India, vol 5.
[74] Upadhyay P, Jaiswal UC, and Ashish K. 2014. Transish: translator from Sanskrit to English-a rule based machine translation. Int J Curr Eng Technol
4(5), 2277-4106.
[75] Gopal M, Mishra D, and Singh DP. 2010. Evaluating tagsets for Sanskrit. In International sanskrit computational linguistics symposium. Springer,
150-161.
[76] Gopal M, and Jha GN. 2011. Tagging Sanskrit corpus using bis pos tagset. In International conference on information systems for Indian languages.
Springer, 191-194.
[77] Gopal M, and Jha GN. 2007. Indian language part of speech tagger (IL-post). http://sanskrit.jnu.ac.in/corpora/tagset.jsp. Accessed 24 Aug 2022.
[78] Chandershekhar R, and Jha GN. 2007. Part-of-speech tagging for Sanskrit. PhD thesis, Special Centre for Sanskrit Studies, JNU Delhi.
http://sanskrit.jnu.ac.in/corpora/JNU-Sanskrit-Tagset.htm
[79] Sitender, and Bawa S. 2018. Sansunl: A Sanskrit to UNL enconverter system. IETE J Res. https://doi.org/10.1080/03772063.2018.1528187
[80] Younger DH. 1967. Recognition and parsing of context-free languages in time n^3. Inf Control 10(2), 189-208.
[81] Li T, and Alagappan D. 2006. A comparison of CYK and earley parsing algorithms. In ICAR-CNR, 1-5.
[82] Papineni K, Roukos S, Ward T, and Zhu W-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual
meeting on association for computational linguistics. Association for Computational Linguistics, 311-318.
[83] LDC. 2005. Linguistic data annotation specification: assessment of adequacy and fluency in translations. revision 1.5. Technical report, Linguistic Data
Consortium.
[84] Kumar P, and Sharma RK. 2012. UNL based machine translation system for Punjabi language. PhD thesis, Thapar University.
[85] K. M. Kavitha, V. Naik, S. Angadi, S. Satish and S. Nayak. 2020. Hybrid Approaches for Augmentation of Translation Tables for Indian Languages. In
19th IEEE International Conference on Machine Learning and Applications (ICMLA), 965-970. doi: 10.1109/ICMLA51294.2020.00157.
[86] Agrawal, P., and Jain, L. 2018. Anuvaadika: Implementation of Sanskrit to Hindi Translation Tool Using Rule-Based Approach. Recent Advances in
Computer Science and Communications, 13(6), 1136-1151. https://doi.org/10.2174/2213275912666181226155829.
[87] http://www.business-standard.com/article/current-affairs/hindiinternet-users-estimated-at-60-million-in-india-survey116020400922_1.html
[88] Singh, M., Kumar, R., and Chana, I. 2020. Corpus based Machine Translation System with Deep Neural Network for Sanskrit to Hindi Translation.
Procedia Computer Science, 167(2019), 2534-2544. https://doi.org/10.1016/j.procs.2020.03.306
[89] Kaka-Khan, K. Mikael. 2018. English to Kurdish Rule-based Machine Translation System. UHD Journal of Science and Technology, 2(2), 32-39.
https://doi.org/10.21928/uhdjst.v2n2y2018.pp32-39
[90] Mukta, A. P., Mamun, A. A., Basak, C., Nahar, S., and Arif, M. F. H. 2019. A Phrase-Based Machine Translation from English to Bangla Using
Rule-Based Approach. In 2nd International Conference on Electrical, Computer and Communication Engineering, ECCE 2019, 1-5.
https://doi.org/10.1109/ECACE.2019.8679456
[91] Singh, M., Kumar, R., and Chana, I. 2019. Neuro-FGA Based Machine Translation System for Sanskrit to Hindi Language. In International Conference
on Innovative Sustainable Computational Technologies, CISCT 2019. https://doi.org/10.1109/CISCT46613.2019.9008136.
[92] Goyal, V., and Sharma, D. M. 2019. The IIIT-H Gujarati-English Machine Translation System for WMT19. 2(1), 191-195.
https://doi.org/10.18653/v1/w19-5316.
[93] Vikrant Goyal and Dipti Misra Sharma. 2019. LTRC-MT Simple and Effective Hindi-English Neural Machine Translation Systems at WAT 2019. In
Proceedings of the 6th Workshop on Asian Translation, 137-140.
[94] Koul, N., and Manvi, S. S. 2021. A proposed model for neural machine translation of Sanskrit into English. International Journal of Information
Technology (Singapore), 13(1), 375-381. https://doi.org/10.1007/s41870-019-00340-8.
[95] Mujadia, V., and Sharma, D. 2020. NMT based Similar Language Translation for Hindi - Marathi. Proceedings of the Fifth Conference on Machine
Translation, 414-417. https://aclanthology.org/2020.wmt-1.48.
[96] Laskar, S. R., Pakray, P., and Bandyopadhyay, S. 2021. Neural Machine Translation for Low Resource Assamese English. Lecture Notes in Networks
and Systems, 170 LNNS(May), 35-44. https://doi.org/10.1007/978-981-33-4084-8_4.
[97] Chauhan, Shweta, Saxena, Shefali and Daniel, Philemon. 2021. Monolingual and Parallel Corpora for Kangri Low Resource Language.
[98] Rahul, L., Meetei, L.S., Jayanna, H.S. 2021. Statistical and Neural Machine Translation for Manipuri-English on Intelligence Domain. In Thampi, S.M.,
Gelenbe, E., Atiquzzaman, M., Chaudhary, V., Li, KC. (eds) Advances in Computing and Network Communications. Lecture Notes in Electrical
Engineering, vol 736. Springer, Singapore. https://doi.org/10.1007/978-981-33-6987-0_21.
[99] Donald Jefferson Thabah, N., and Purkayastha, B.S. 2021. Low Resource Neural Machine Translation from English to Khasi: A Transformer-Based
Approach. In Maji, A.K., Saha, G., Das, S., Basu, S., Tavares, J.M.R.S. (eds) Proceedings of the International Conference on Computing and
Communication Systems. Lecture Notes in Networks and Systems, vol 170. Springer, Singapore. https://doi.org/10.1007/978-981-33-4084-8_1.
[100] Salunkhe P, Kadam AD, Joshi S, Patil S, Thakore D, and Jadhav S. 2016. Hybrid machine translation for English to Marathi: a research evaluation in
machine translation: (hybrid translator). In International conference on electrical, electronics, and optimization techniques (ICEEOT). IEEE, 924-931.
[101] Haroon, R. P., and Shaharban, T. A. 2016. Malayalam machine translation using hybrid approach. International Conference on Electrical, Electronics,
and Optimization Techniques, 1013-1017. https://doi.org/10.1109/ICEEOT.2016.7754839.
[102] Dhariya, O., Malviya, S., and Tiwary, U. S. 2017. A hybrid approach for Hindi-English machine translation. International Conference on Information
Networking, 389-394. https://doi.org/10.1109/ICOIN.2017.7899465
[103] Bengio Y, Ducharme R, Vincent P, and Jauvin C. 2003. A neural probabilistic language model. J Mach Learn Res 3 (Feb), 1137 1155.
[104] Schwenk H. 2007. Continuous space language models. Comput Speech Lang 21 (3), 492 518.
[105] Mikolov T. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.
[106] Devlin J, Zbib R, Huang Z, Lamar T, and Schwartz R, Makhoul J. 2014. Fast and robust neural network joint models for statistical machine translation.
In Proceedings of the 52nd annual meeting of the association for computational linguistics (vol 1: Long Papers), vol 1, 1370 1380.
[107] Schwenk H, Rousseau A, and Attik M. 2012. Large, pruned or continuous space language models on a GPU for statistical machine translation. In
Proceedings of the NAACL-HLT 2012 workshop: will we ever really replace the N-gram model? On the future of language modeling for HLT,
Association for Computational Linguistics, 11 19.
[108] Wu Y, Yamamoto H, Lu X, Matsuda S, Hori C, and Kashioka H. 2012. Factored recurrent neural network language model in ted lecture transcription. In
International workshop on spoken language translation (IWSLT).
[109] De Gispert A, Iglesias G, and Byrne B. 2015. Fast and accurate preordering for SMT using neural networks. In Proceedings of the 2015 conference of
the North American chapter of the association for computational linguistics: human language technologies, 1012 1017.
[110] Kanouchi S, Sudoh K, and Komachi M. 2016. Neural reordering model considering phrase translation and word alignment for phrase-based translation.
In Proceedings of the 3rd workshop on Asian translation (WAT2016), 94-103.
[111] Singh, M., Kumar, R., and Chana, I. 2019. Improving Neural Machine Translation Using Rule-Based Machine Translation. In 7th International
Conference on Smart Computing and Communications, 1-5. https://doi.org/10.1109/ICSCC.2019.8843685.
[112] Salunkhe P, Kadam AD, Joshi S, Patil S, Thakore D, and Jadhav S. 2016. Hybrid machine translation for English to Marathi: a research evaluation in
machine translation: (hybrid translator). In International conference on electrical, electronics, and optimization techniques (ICEEOT). IEEE, 924-931.
[113] Nithya B, and Joseph S. 2013. A hybrid approach to English to Malayalam machine translation. Int J Comput Appl 81(8), 11-15.
[114] Kaur H, and Laxmi DV. 2013. A web-based English to Punjabi MT system for news headlines. Int J Adv Res Comput Sci Softw Eng 3(6), 1092-1094.
[115] Dhore M, Dixit S, and Karande J. 2011. Web page interface localisation in Devanagari for commercial interactive applications by enhancing basic
functionality of Apache server. Int J Comput Appl 18(4), 6-10.
[116] Chatterji S, Sonare P, Sarkar S, and Basu A. 2011. Lattice based lexical transfer in Bengali Hindi machine translation framework. In Proceedings of
ICON-2011: 9th international conference on natural language processing.

[117] Shahnawaz, and Mishra R. 2015. An English to Urdu translation model based on CBR, ANN and translation rules. Int J Adv Intell Paradig 7(1), 1-23.
[118] Chatterji S, Roy D, Sarkar S, and Basu A. 2009. A hybrid approach for Bengali to Hindi machine translation. In 7th international conference on natural
language processing, 83-91.
[119] Z. Guo, K. Yu, Z. Lv, K.-K. R. Choo, P. Shi, and J. J. P. C. Rodrigues. 2022. Deep Federated Learning Enhanced Secure POI Microservices for
Cyber-Physical Systems. In IEEE Wireless Communications, vol. 29, no. 2, 22-29. doi: 10.1109/MWC.002.2100272.
[120] Z. Guo, K. Yu, N. Kumar, W. Wei, S. Mumtaz, and M. Guizani. 2022. Deep Distributed Learning-based POI Recommendation Under Mobile Edge
Networks. In IEEE Internet of Things Journal. doi: 10.1109/JIOT.2022.3202628.
[121] A. K. Sangaiah, D. V. Medhane, T. Han, M. S. Hossain, and G. Muhammad. 2019. Enforcing Position-Based Confidentiality With Machine Learning
Paradigm Through Mobile Edge Computing in Real-Time Industrial Informatics. In IEEE Transactions on Industrial Informatics, vol. 15, no. 7,
4189-4196. doi: 10.1109/TII.2019.2898174.
[122] G. Jain, T. Mahara, S. C. Sharma, and A. K. Sangaiah. 2022. A Cognitive Similarity-Based Measure to Enhance the Performance of Collaborative
Filtering-Based Recommendation System. In IEEE Transactions on Computational Social Systems. doi: 10.1109/TCSS.2022.3187430.
