Machine Translation Systems For Indian Languages: Review of Modelling Techniques, Challenges, Open Issues and Future Research Directions

Archives of Computational Methods in Engineering
https://doi.org/10.1007/s11831-020-09449-7
ORIGINAL PAPER
Muskaan Singh1 · Ravinder Kumar1 · Inderveer Chana2
Received: 23 October 2019 / Accepted: 1 June 2020

© CIMNE, Barcelona, Spain 2020
Abstract
With the advancement of language technology in a multilingual country like India, speakers of numerous languages require
technology for translation. It also aids research by making literature in ancient languages such as Sanskrit, Tamil, Telugu and
Malayalam available to society. Processing these languages requires natural language processing, and machine translation is
one of its essential areas. It plays a significant role in breaking the language barrier and facilitating inter-lingual
communication by translating one language into another. With the advent of information technology, many documents and web
pages are available in local languages, and there is a tremendous need to establish proper communication amongst people
belonging to different backgrounds and cultures. This paper contributes in two ways. Firstly, a review of different modelling
techniques is performed. It provides developers with the resources required for the different modelling techniques, such as
corpus, domains, toolkits, techniques, models, features and their evaluation measures. Secondly, a comparison of research work
on different Indic language pairs based on their modelling techniques has been performed. It shows that work on the Sanskrit
language is minimal despite Sanskrit holding an ancient scientific and comprehensive literature of India. The paper also
presents the linguistic and technical challenges of processing the Sanskrit language, open issues, and future research
directions in this field.
* Muskaan Singh
muskaan_singh@thapar.edu
Ravinder Kumar
ravinder@thapar.edu
Inderveer Chana
inderveer@thapar.edu
1 Language Engineering and Machine Learning Research Lab, Thapar Institute of Engineering and Technology (TIET), Patiala, Punjab, India
2 CSED, Thapar Institute of Engineering and Technology (TIET), Patiala, Punjab, India

1 Introduction

Artificial intelligence (AI) aims to develop an intelligent system examined by humans intuitively [1]. Natural language
processing (NLP) is one of many applications of AI. NLP is an area of research and application involving computers to
understand text in natural language. It builds a computational model for its analysis and generation. It involves
technological, cognitive and linguistic motivation for developing intelligent computer systems such as machine translation,
information extraction, sentiment analysis, speech recognition, text classification, etc. Machine translation (MT) is one of
the key applications of NLP [2, 3].

1.1 Machine Translation

MT is the process of translating a source language into a target language using a computerised system. Human translators or
editors can be involved in the process, although minimal human aid is the goal of MT. The field of man–machine interaction
involves the processing of natural language. Some of the technologies that aid in the development of MT are listed below:

• Computational Linguistics: It covers word formation and ordering, analysis of meaning and other aspects of communication.
• Knowledge Representation: It is an area that deals with formalisms used in logic, frames and semantic networks.
• Semantic Network: It provides linkage through relationships of concept collections.
• Machine Learning: It is a field where the machine acquires new knowledge from data through different machine learning models.
• Search Algorithms: They assist in finding a solution to the problem without getting stuck in an infinite loop.

A translation provided by a Machine Translation System (MTS) must cover both the syntactic and semantic aspects of language
to provide the correct version of the translation [4]. MT is considered a challenging task as it includes immense statistical
models that have been created using complex semantic information. It is a sub-field of computational linguistics that
examines the utilisation of programming to disambiguate sentences at the discourse level. At the basic level, MT performs
substitution of words from one language to another using a dictionary. Substitution of words alone does not provide a sound
translation of a text, because recognition of whole phrases and their closest counterparts in the target language is required.
Improving translation quality involves corpus improvement, human involvement, an efficient approach, handling contrasts in
etymology and typology, and the segregation of anomalies. MTS has its applicability in enterprises, professional translation,
casual/home use, bilingual communication and bilingual dictionaries. It may act as a benchmark for human professional
translators. The uses of MTS are categorized into dissemination, assimilation, interaction and translation aids [5, 6]. The
properties of MTS are depicted in Table 1.

Even with the advantages listed in Table 1, MT has not yet surpassed human translation in terms of translation accuracy and
fluency. Minimal human intervention remains the goal of MT, and recent deep learning techniques have moved closer to it, but
in the prevailing scenario MT output still needs a human translator to edit and post-process it for effective translation. As
the web opens up to a more extensive multilingual and worldwide audience, innovative work on MT keeps developing at a quick
rate. A few MTS are available commercially, but minimal work has been performed for Indian languages as compared to European
languages. MTS utilises bilingual information in the form of corpora and other linguistic resources for processing the
language and for training machine learning models. Building an MTS requires skill along with immense knowledge of punctuation,
linguistic structure and semantics in the source and target languages. Human translation and MT each have their share of
difficulties. Developing an MTS also requires choosing an appropriate modelling technique, based on the criteria listed below.

• Resource availability (parallel corpus, monolingual corpus, human resources, lexical resources and technical resources).
• Background of the developer (linguists prefer the rule-based technique, translators prefer the example-based technique,
  computer scientists prefer the interlingua-based technique, statisticians prefer the statistical technique and
  mathematicians prefer the neural-based technique).
• Goal or purpose of the MTS (assimilation, dissemination, one or more language pairs, general or specific domain).
Table 1 Properties of MT system
Properties | Description
Quick translation | MT system enables you to save your time while translating large texts
Enhanced timeline | A key benefit of MT is speed. It is significantly faster than human translators. It is estimated that an average human translator could translate around 2000 words per day. Also, multiple translators can be assigned to a given project to increase the output. Still, it lacks in comparison to MT as MT can generate thousands of words every minute
Confidentiality | Many people use MTS to translate their private emails because no one would agree to give his private correspondence to a translator whom he does not know. Also, no one would like to entrust financial documents to an unknown person
Cost | The cost of MT is also significantly lower than that of human translation. There are incredibly advanced MT platforms that are accessible for free. Google translation is becoming more accurate each year and users can also enter an unlimited amount of contents for translation for free
Universality | Usually, a professional translator has specialisation in a particular field, but the MTS can translate any text about any particular field. For the translation of specialised terminology, you have to switch to a corresponding setting
Consistency | MT allows systems to memorise key terms and phrases used within a given area. This leads to increase in consistency across the entire text. If human translators are used for translating a text, their translations may change slightly depending on many factors whereas a MT output is consistent
Online translation and translation of web page content | The advantage of online translation services is evident. Online translation services are at hand, and you can translate information more quickly. Furthermore, you can translate any web page content and query of the search engine by the use of MTS
1.2 Research Problem

The process of communication comprises coding and decoding. The speaker encodes his thoughts in a language string and the
listener decodes it. The challenge in this communication process arises from the restrictions that language imposes at both
the word and the sentence level. The word choices in any language are many but finite, which makes the proper selection of
words a challenge. These words are then combined under the grammatical rules of the language, which further restrain the
speaker in encoding his thoughts. Thus the encoding process poses specific challenges for the speaker in communicating his
thoughts to the listener, and approximation methods are used in real-world scenarios. The thoughts of the speaker are
restricted by language constraints, and the encoded information is only an approximation of those thoughts. The computer is an
information processing device and has been used in the field of NLP since its invention. Since the 1950s, efforts have been
made to build automatic MTS. Early on, the difficulties were not recognized and researchers were enthusiastic about
advancement in this field. This led to advances in linguistic tools, statistical and deep learning techniques, and computer
hardware. As a consequence, the field developed better open research theories, data and tools for processing language [7].

Sanskrit is the ancient language of India with an enormous collection of literature in different domains such as medicine,
poetics, mathematical logic, astronomy, literature, technology, philosophy and dramatics. With a considerable literature of
its own, it has lived as a spoken language for almost 1000 years. Until recent times it was a means of scholarly communication
and discourse. This produced a perpetual generation of literature in Sanskrit in different domains for almost 2 millennia. The
corpus size of Sanskrit is 100 times more than those of Latin and Greek. Besides literary work, there is a large grammatical
and philosophical tradition that has continued to live with undiminished strength until the present time [8]. Nevertheless, in
the last two centuries, the situation has completely changed. Western learning systems have replaced traditional learning
methods. As an outcome, the vast reserves of knowledge in Sanskrit texts are unavailable to Indian scholars [9]. Therefore
there is a need for an efficient MTS that can easily translate Sanskrit to Hindi so that the vast scholarly and other
traditional knowledge is readily available to the current generation. This was our main motivation for carrying out review
work addressing the Sanskrit language and for opening this research area to further exploration by deriving challenges and
opportunities in this field.

1.3 Brief Overview of Recent Works

This section examines the various review papers available for MTS. It distinguishes the generic surveys from the
Sanskrit-specific surveys. Although several reviews have been carried out for MTS addressing Indic languages, only two surveys
have been found for processing Sanskrit as the source, target or key supporting language [10, 11]. Most of the previous works
on MT were generic and give a brief description of MT approaches along with major MTS development in India for English to
Indian languages and Indian to Indian languages [12]. The survey conducted by [13] also describes briefly the MTS for Indian
languages along with their features, domains and limitations. Another survey by [14] reviewed only important MTS, focussing on
their methodology. Naskar and Bandyopadhyay [15] and Dwivedi and Sukhadeve [16] discuss details of MTS developed for English
to Indian languages and Indian to Indian languages as a field or Web service.

The review studies specific to the Sanskrit language [10, 11] describe briefly the MT approaches and MTS for the Sanskrit
language. Many other aspects, such as challenges and the lexical and technical resource requirements pertaining to the
Sanskrit language, need to be considered before developing MTS. These include the resources required for developing MTS, such
as an appropriate modelling technique and its specific linguistic and technical requirements (corpus, domains, toolkits,
techniques, models and features). The novelty of the current survey lies in its focus on the resources of different MT
modelling techniques for Indian language research. The modelling techniques were compared for ancient languages such as
Punjabi, Bengali, Marathi, Telugu, Tamil, Assamese, Urdu, Malayalam, Gujarati, Sanskrit, Kannada, Dogri, Sinhala and Devnagari,
and from the synthesis it was concluded that work on the Sanskrit language was minimal despite such rich literature. Therefore
this work contributes to research on the Sanskrit language with the resources, challenges and future research directions in
this field. Given recent advances in deep learning, it is also necessary to analyse the research in this dimension. Therefore,
the current new paradigm in MT, i.e. Neural Machine Translation, has been reviewed along with its modelling technique, corpus,
tools, domains, techniques, features and evaluation measures. The technical challenges of applying this modelling technique to
the Sanskrit language are also covered in Sect. 3 along with the linguistic challenges. The open issues and future research
directions in this field are also highlighted.

1.4 Goal of Paper

This section discusses the goal of the paper with the help of a few highlights contributing to the paper.
1. To identify gaps in the existing research work for devising an effective MT technique.
2. The MTS should easily translate Sanskrit to Hindi so that the vast scholarly and other traditional knowledge is readily
   available to the current generation.
3. The review of different modelling techniques from the perspective of resources is presented in the paper. It provides
   developers with the resources required for the different modelling techniques, such as corpus, domains, toolkits,
   techniques, models, features and their evaluation measures.
4. A comparison of research work on different Indic language pairs based on modelling techniques has been performed. The
   modelling techniques were compared for ancient languages such as Punjabi, Bengali, Marathi, Telugu, Tamil, Assamese, Urdu,
   Malayalam, Gujarati, Sanskrit, Kannada, Dogri, Sinhala and Devnagari, and from the synthesis it was concluded that work on
   the Sanskrit language was minimal despite such rich literature. Therefore this paper contributes to the research of MTS for
   processing the Sanskrit language.
5. The review also covers recent advances of deep learning in the MT paradigm, i.e. Neural Machine Translation.
6. The critical challenges, technical as well as linguistic, that are likely to be faced in building MTS for the Sanskrit
   language are reported in depth.
7. Open issues and future research directions in this field are also discussed in this paper.

1.5 Organization of Paper

The remaining paper is organised into the following sections. Section 2 describes the background of MTS as a brief
introduction to MT and its approaches. Section 3 contains challenges pertaining to the Sanskrit language. It covers different
linguistic and technical challenges. Section 4 contains a comparison of various modelling techniques based on the resource
requirement. Section 5 compares MTS for different ancient languages, while Section 6 compares MTS only for processing the
Sanskrit language. This paper further contributes open issues in Sect. 7, future research directions in Sect. 8 and finally
the conclusion in Sect. 9.

2 Background: Machine Translation Modelling Techniques

This section describes the pre-eminent MTS modelling techniques in brief. These techniques have been classified on the basis
of the engineering involved in developing them, either manual or mechanical. The techniques are broadly categorised into
human-engineered, machine-engineered and a combination of both, i.e. hybrid. The human-engineered modelling technique is
rule-based and the machine-engineered one is corpus-based. These techniques are further classified as shown in Fig. 1 and
described in the following subsections.

Fig. 1 Classification of translation modelling

2.1 Human-Engineered Translation

This modelling technique involves rule-based MTS, which can be built using the dictionary-based, transfer or interlingua
approach. It involves more human intervention, as all the modules of these systems require human insight.

2.1.1 Rule-Based Machine Translation (RBMT) Modelling

The RBMT modelling technique is one of the oldest and is still being used for less-resourced languages [17]. This technique
depends on the linguistic features of the source and target language. The linguistic information, along with grammatical
properties (morphological, semantic and syntactic), is acquired from lexical resources such as bilingual, unilingual or
multilingual corpora, dictionaries and rules. A given source sentence is processed through many linguistic phases to collect
all the grammatical information, and then words are disambiguated to generate the target sentence. The syntax is mapped with
the help of parsing. The technique depends on colossal lexicons and linguistic rules. The translation quality can be enhanced
by adopting the contextual reference of a word in the sentence. This technique supersedes the MTS default settings of the
linguistic rules of each word. An MTS using the rule-based technique can be built using the dictionary, transfer and
interlingua approach.

1. Dictionary-based MT This technique provides direct transfer of meaning from source to target language words based on
   dictionary entries. It does not cover the syntactic structure and semantic information of the language pair. It is one of
   the simplest and easiest techniques. It identifies the root word by removing suffixes and looks up the bilingual dictionary
   for the target word corresponding to the source word. The final output can also be post-edited by re-ordering the
   sentence [18].
2. Transfer-based MTS This technique transfers the source language parse tree into the target language parse tree. It covers
   the syntactic aspect of the language pair. It analyses the source language structure, transfers it to the target language
   and finally generates the target sentence [19].
3. Interlingua-based MTS This technique converts the source language into an intermediate or interlingua structure to
   generate the target translation. It analyses the source sentence and performs synthesis to convert it into the target
   sentence. The idea is to capture the meaning in the interlingua. It is independent of the language pair. The transfer of
   interlingua is performed on the semantic level as well as the syntactic level [20].

RBMT is one of the oldest modelling techniques applied to develop MTS. Some of the recent research works using a rule-based
approach have been studied along with their evaluation measures and are shown in Table 2. Table 3 exhibits the methodology
used by RBMT systems along with their domains. The analytical study performed for MTS built using the rule-based approach
concludes that most RBMT systems are for English to Hindi and English to Marathi, as shown in Fig. 2.

Fig. 2 Rule-based modelling for different languages

2.2 Machine Engineered Translation

These modelling techniques involve more machine intervention than the human-engineered ones, as all the modules of the
system require machine processing. A translation model is built from the corpus (monolingual and bilingual) using machine
learning. The machine-engineered MTS include the corpus-based modelling technique, which has been further classified into
statistical and neural-based.

2.2.1 Corpus-Based MTS

The approach depends entirely on the corpus (bilingual, multi-text or parallel corpus). An MTS can be developed by training
on the corpus to perform translation. In comparison to the rule-based modelling technique, it requires less effort as it is
more machine-dependent. This technique has been further classified into statistical and neural-based.
Table 2 RBMT system based on their language pair with its respective accuracy
Author | Year | Language | BLEU | Accuracy
Kavirajan et al. [21] | 2017 | English–Tamil | NA | 71.80
Rana and Antique [22] | 2016 | English–Hindi | NA | 81.70
Darbari et al. [23] | 2015 | English–Hindi | 64.6 | NA
Garje et al. [24] | 2014 | English–Marathi | 44.29 | 49.78
Dubey [25] | 2014 | Hindi–Dogri | NA | 95
Basavaraddi et al. [26] | 2014 | English–Kannada | NA | NA
Adapanawar et al. [27] | 2013 | English–Bengali | NA | 82.92
Adapanawar et al. [27] | 2013 | English–Marathi | NA | NA
Pisharoty et al. [28] | 2012 | English–Marathi | NA | NA
Nair [29] | 2012 | Malayalam–English | NA | NA
Batra and Lehal [30] | 2010 | Punjabi–English | NA | 85.33
Rajan et al. [31] | 2009 | English–Malayalam | NA | 79.60
Sinha [32] | 2009 | English–Hindi | 34.12 | NA
Sinha [32] | 2009 | English–Urdu | 35.44 | NA
Sinha and Jain [33] | 2003 | English–Hindi | NA | 90
Dave et al. [34] | 2001 | English–Hindi | NA | NA

1. Statistical machine translation (SMT) modelling techniques.
   It produces translation based on a statistical model learned from parallel and monolingual corpora. It was introduced in
   the late 1950's [37] and has been performing quite well until the present time. The statistical model assumes the presence
   of a large quantity of aligned, high-quality data. It encodes the information in the data in a language model and a
   translation model, which are decoded by the decoder to generate the translation. The language and translation models are
   produced using the given source and target sentences. The language model is built using the monolingual corpus of the
   target language. It assigns a probability to each string by calculating relative frequencies. The translation model is
   built using parallel corpora by assigning the probability that the source sentence is a translation of the target
   sentence. Mathematically, using Bayes theorem in Eq. (1) [38],

   $$P(s \mid t) = \frac{P(s) \times P(t \mid s)}{P(t)} \tag{1}$$

   The highest probability sentence is chosen as the best translation using Eq. (2)

   $$\hat{e}_{\max} = \arg\max_{s} P(s)\, P(t \mid s) \tag{2}$$

   In the decoding phase, a source sentence chooses a translation sentence with maximum probability as in Eq. (3),

   $$\arg\max_{t} P(t \mid s) = \arg\max_{t} P(t) \times P(s \mid t) \tag{3}$$

   The decoding phase substitutes phrases from left to right to produce the translation. It provides a dynamic programming
   solution by applying the beam search algorithm. The fluency of the translation output depends on the language model, and
   the adequacy of the translation is derived from the translation model. Though decoding is a complex process, it provides
   the output sentence along with re-ordering.

   There are different models of SMT based on segmentation of the source sentence into words [39, 40], phrases [41], syntax
   [42, 43] and hierarchy [44]. SMT systems make better use of data and human resources. SMT is better than the rule-based
   technique as it does not require human-formed linguistic rules, which are time-consuming and language-specific. It has been
   shown to have higher productivity and quality for domain-specific translations such as news, official documents and
   literature.

   The survey performed on some of the recent research work using the statistical approach is presented in the form of
   subsequent tables. Table 4 compares various SMT-based MTS in terms of the different evaluation measures used for computing
   accuracy. SMT systems are developed using different models and several toolkits, which are described in Table 5. The
   corpora details along with their domains are mentioned in Table 6. The analytical study performed for Indic languages using
   the SMT modelling technique is shown in Fig. 3 for various languages; it concludes that most work has been done on
   translating different languages into English and Hindi.

2. Neural machine translation (NMT) modelling
   NMT is the most recent technique for MT and is said to make substantially more accurate translations. It is based on a
   model of the neural systems in the human cerebrum, with data being sent through various "layers" for processing before
   output. It utilises deep-learning procedures to learn to translate content from existing reference translations and to
   build models. It makes the technique faster, as it requires a single sequence model rather than multiple models as in SMT.
   It also produces higher quality output. It models the source sentence $s_1, s_2, s_3, \ldots, s_m$ to the target sentence
   $t_1, t_2, t_3, \ldots, t_n$ with a conditional probability over the context vector $c$, as in Eq. (4)

   $$\log P(t \mid s) = \sum_{i=1}^{n} \log P(t_i \mid t_{<i}, c) \tag{4}$$
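As a small worked illustration of Eq. (4), the sketch below sums per-word conditional log-probabilities to score one target sentence. The probability values are invented toy numbers and `sentence_log_prob` is a hypothetical helper name; in an actual NMT system each term would be produced by the decoder, conditioned on the previously generated words and the context vector c.

```python
import math


def sentence_log_prob(step_probs):
    """Return log P(t|s) as the sum of per-step log-probabilities.

    step_probs[i] stands for the probability of the i-th target word given
    the previous target words and the context vector c. The values passed
    in below are toy numbers, not the output of a trained model.
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Assumed per-word probabilities for a three-word target sentence.
toy_probs = [0.5, 0.25, 0.8]
log_p = sentence_log_prob(toy_probs)

# The sum of logs equals the log of the product of the probabilities.
assert math.isclose(log_p, math.log(0.5 * 0.25 * 0.8))
print(round(log_p, 4))
```

Working in log space, as Eq. (4) does, avoids numerical underflow when many small probabilities are multiplied for long sentences.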
Table 3 RBMT system methodology adopted with its specific domains and corpus
Author name | Year | Methodology | Domain | Corpus
Kavirajan et al. [35] | 2017 | Sentence simplifier, linguistic rules | NA | 250 sentences
Rana and Antique [22] | 2016 | Fuzzy rule-based translation | NA | 50, 100, 200, 300, 500, 1000 sentences
Darbari et al. [23] | 2015 | TAG & MCSSG approach | Rajya Sabha using Tree Adjoining Grammar (TAG) | Rajya Sabha website since 2006
Basavaraddi et al. [26] | 2014 | Re-formatting, pre-editing, morphological analysis, transfer of internal representation and generation along with reformatting | Government and education sector | NA
Garje et al. [24] | 2014 | Pre-translation processor, parsing, named entity tagger, rearrangement generator, sentence filter, word by word translation, disambiguator, target generator | Tourism | 1000 sentences from TDIL
Adak [36] | 2014 | Morphological analysis, lexicalization: POS tagging, re-ordering, transliteration and combine | General | 2574 words
Dubey [25] | 2014 | Direct technique | NA | 18,500 words
Pisharoty [28] | 2012 | Tokenization, lemmatization, parsing, syntax validation, semantic validation, translation, transformation and reconstruction of sentence | NA | NA
Nair [29] | 2012 | Pre-processing, morphological parser, transfer and generator | NA | NA
Adapanawar et al. [27] | 2013 | Tokenisation, POS tagging, dictionary lookup and rule extraction from database | Assertive sentences | NA
Batra and Lehal [30] | 2010 | Three components based approach: analyzer, a transfer component, and a generation component | NA | 500 sentences
Rajan et al. [31] | 2009 | Transfer link rules, morphological rules | Word dictionary | 5000
Sinha [32] | 2009 | Interlingua based rule-based approach | NA | 100,000 words
Sinha and Jain [33] | 2003 | Rule based | NA | NA
Dave et al. [34] | 2001 | English analyzer with disambiguation, UNL conversion and Hindi generator | Political and stock market stories | 180 sentences (Ministry of Information and technology) and Brown corpus
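The dictionary-based (direct) technique of Sect. 2.1.1, which also appears among the methodologies in Table 3, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation of any surveyed system: the three-entry English to romanised Hindi dictionary, the suffix list, and the helper names `find_root` and `translate_direct` are all hypothetical.

```python
# Toy bilingual dictionary (English -> romanised Hindi); entries are illustrative.
BILINGUAL_DICT = {"book": "kitab", "house": "ghar", "water": "pani"}

# Assumed, deliberately tiny suffix list used for root-word identification.
SUFFIXES = ("es", "s", "ing", "ed")


def find_root(word):
    """Return the word itself or a suffix-stripped form found in the dictionary."""
    if word in BILINGUAL_DICT:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in BILINGUAL_DICT:
                return stem
    return None


def translate_direct(sentence):
    """Translate word by word; unknown words are passed through unchanged."""
    output = []
    for token in sentence.lower().split():
        root = find_root(token)
        output.append(BILINGUAL_DICT[root] if root is not None else token)
    return " ".join(output)


print(translate_direct("Books houses water"))
```

The sketch performs no re-ordering and generates no target-side inflection, which reflects the limitation noted in the text: direct transfer does not cover the syntactic structure or semantic information of the language pair.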
Fig. 3 Statistical modelling for different languages

The basic form of NMT consists of encoder and decoder components. The encoder encodes the source sentence into a context
vector c, while the decoder decodes this context vector and generates one word at a time. NMT requires minimal domain
knowledge. The sequence to sequence learning proposed by [59, 60] and employed by [61, 62] reads a sentence till the end and
outputs one word at a time. It produced good translation results. It has the drawback of encoding the source sentence into a
fixed-size vector, whose quality deteriorates when exposed to longer sentences. However, this drawback can be overcome by the
attention mechanism [63]. There are two different architectures for constructing NMT.

1. Recurrent neural network (RNN) It has been producing good quality translation results. An RNN is composed of an encoder and
   a decoder, working in a way similar to sequence to sequence learning. Different RNN architectures have been experimented
   with in different models [62, 64–69].
2. Convolution neural network (CNN) It has achieved surpassing results for word-based MTS, but in combination with RNN
   [64, 70]. These works applied a convolution layer at the bottom of the recurrent layer, which hinders the performance. The
   bottleneck was handled by implementing the fully convolutional model as suggested by [71, 72]. Later work formed stacked
   convolutional layers [73] with a GPU version of the neural model. The performance and accuracy were improved with a number
   of models [72–75].

NMT makes it easier to train large models and generalise to long sentences. It does not have to store phrase tables, language
models and score tables as in SMT.

The survey performed on some of the recent research work using the neural-based approach is presented in the form of
subsequent tables. There are different toolkits available for developing NMT systems and experimenting with different
methodologies, as in Table 7. The experiments are conducted on the different corpora and domains mentioned in Table 8. These
systems are evaluated with different evaluation measures, automatic or human, which are presented in Table 9. The analytical
study performed for Indic languages using neural-based techniques is shown in Fig. 4 and concludes that English to Hindi
translation systems are more prominent than those for other languages. This is because more parallel corpora exist for this
language pair than for other Indic languages.

2.3 Hybrid Machine Translation (HBMT) Modelling

HBMT is a preferred translation technique as it combines the best of the human-engineered and machine-engineered approaches.
It is characterised by the use of multiple MT modelling techniques within a single MT system. The motivation for developing a
hybrid approach stems from the failure of any single technique to achieve a satisfactory level of accuracy. Many HBMT systems
have been successful in improving the accuracy of translation systems.

The HBMT architecture can be guided by the human-engineered approach: rule-based systems built using corpora [86–93],
corpus-based tools used for weighing the RBMT output [94–99], and RBMT guided by statistical post-editing [100–103].
Corpus-based HBMT uses rules at the pre-processing and post-processing stages as suggested by [104–109], incorporates
dictionaries and rules in corpus-based MTS [110–117], or builds HBMT using a corpus with a statistical approach [118–124].
These works demonstrate that HBMT provides better translation quality. Most of the techniques combine rules and data, whereas
fewer works combine machine-engineered approaches. Some of the recent work incorporates additional information to guide RBMT
or the corpus-based technique. The hybridisation technique has been extended to speech translation, cross-lingual information
retrieval, and computer-aided and post-edited systems.

The survey performed on some of the recent research work using the hybrid approach is presented in the form of subsequent
tables. The HBMT systems developed for Indic languages using different experiments or methodologies, along with toolkits, are
depicted in Table 10. The corpora and domains used to develop these systems are mentioned in Table 11. The evaluation of these
systems with different evaluation measures is shown in Table 12. The analytical study performed for Indic languages using the
hybrid approach is shown in Fig. 5 and concludes that English to Marathi, English to Hindi, Bengali to Hindi, and English to
Punjabi have notably more systems than other language pairs.
Table 4 SMT systems for languages with their respective accuracy
Author Year Language BLEU NIST TER
Subalalitha et al. [45] 2018 English–Hindi NA NA 73.43

Jindal [46] 2018 English–Punjabi 87.67 NA NA
Patel et al. [47] 2018 English–Malayalam 8.25 NA 21.57
English–Hindi 19.43 NA 37.77
English–Punjabi 23.09 NA 44.06
English–Tamil 7.56 NA 23.62
Khan et al. [48] 2017 Bengali–English 19.7 3.786 NA
Hindi–English 19.3 37.79 NA
Malayalam–English 11.1 NA NA
Tamil–English 12.8 NA NA
Telugu–English 14.2 NA NA
Urdu–English 24.7 4.26 NA
Patel et al. [49] 2016 Bengali–Hindi 31.3 6.802 46.06
Marathi–Hindi 38.71 7.369 41.38
Telugu–Hindi 28.51 6.268 50.8
Tamil–Hindi 19.19 5.012 62.9
English–Hindi 20.75 5.616 61.7
Patel and Pimpale [50] 2016 Bengali–Hindi 33.77 7.195 45.52
Marathi–Hindi 41.2 7.935 39.25
Telugu–Hindi 29.72 6.669 49.26
Tamil–Hindi 20.38 5.34 62.25
English–Hindi 24 6.121 58.99
Das and Baruah [51] 2014 Assamese–English 11.32 NA NA
Ali et al. [52] 2014 English–Urdu 9.035 NA NA
Khan et al. [53] 2013 English–Urdu 2.9 6.36 NA
Kumar and Kumar [55] 2013 Punjabi–English NA 97 NA
Anwar et al. [56] 2009 Bangla–English NA 92.53 NA
Udupa and Faruquie [57] 2005 English–Hindi 13.445 4.5741 NA
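Most scores in Table 4 are BLEU values on a 0 to 100 scale. As a rough illustration of what such a score measures, the following is a minimal sketch of sentence-level BLEU with clipped n-gram precision and a brevity penalty. It assumes a single reference, whitespace tokenisation and no smoothing; the function names `ngrams` and `sentence_bleu` are hypothetical, and the surveyed systems typically report corpus-level scores from their own toolkits, which may differ from this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU (0-100) of one candidate against one reference.

    Both arguments are lists of tokens. Returns 0.0 if any n-gram
    order has no match, since no smoothing is applied.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    c_len, r_len = len(candidate), len(reference)
    bp = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    ref = "the cat sat on a mat".split()
    hyp = "the cat sat on the mat".split()
    print(round(sentence_bleu(hyp, ref), 2))
```

Because the sketch is unsmoothed, short or poorly matching sentences score zero, which is one reason published results are normally computed over a whole test corpus.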
3 Challenges of MTS for Processing Sanskrit Language

This section discusses the types of challenges that one must consider in developing an MTS. We examine these challenges along two dimensions: the first covers linguistic considerations (e.g. syntactic word order and semantic ambiguity) and the second covers operational or technical considerations.

3.1 Linguistic Challenges

The pattern of divergence between two languages needs to be recognised before building an MTS. The challenges for translation are specific to a language pair, as it is difficult to build a general approach applicable to all languages. The work by Dorr [133, 134] studying the deviation of languages has formed a basis for further research for Indian languages. The challenges of translation for English–Sanskrit–Hindi MT [135], English–Hindi [34] and English–Sanskrit [136] are also considered. These linguistic challenges are specific to the Sanskrit language and should be considered before building an MTS for Sanskrit to English or Sanskrit to Hindi.

• Sanskrit contains complex or compound words influenced by oral tradition, i.e. continuous strings of characters without word boundaries or punctuation. It becomes difficult to guess the boundaries as they undergo euphonic changes [9].
• In Sanskrit, there is a special category of verbs requiring special treatment, termed thematic divergence. The subject Noun Phrase (NP) in Sanskrit and Hindi is in the dative case while in English it is in the nominative case, which causes a divergence in translation [135].
M. Singh et al.
Table 5 SMT models along with methodology and toolkit

SMT model Author Year Methodology Toolkit
Phrase-based Subalalitha [45] 2018 n-gram and Naive Bayes probability NA

Shishpal Jindal [46] 2018 IRSTLM language model, trained model with Moses, IRSTLM and GIZA++
GIZA++ alignment and testing
Patel et al. [47] 2018 Pre-processing, re-ordering, suffix separation, Moses, MERT, KenLM and Kneser-Ney
transliteration
Khan et al. [48] 2017 Sampling, tokenization, tuning set Moses, Giza++, Kneser-Ney, SRILM
Patel and Pimpale [49] 2016 Pre-processing, transliteration Modified KenLM, Moses
Patel et al. [50] 2016 Suffix separation (compound splitting and Modified KenLM
reordering)
Ali et al. [52] 2014 Manual alignment, translation and tuning Moses , Giza++
Das and Baruah [51] 2014 Text, decoder and transliteration IRSTLM tool, Giza++, Moses Decoder
Ali et al. [54] 2013 Tokenization, training and tuning Moses, IRSTLM
Kumar and Kumar [55] 2013 N-gram model and transliteration NA
Ali [52] 2010 Data distribution, mkcls, GIZA++ and Moses, GIZA++ and MERT
MERT tuning
Anwar et al. [56] 2009 Tokenization, syntax analysis, parsing and NA
NLP conversion
Cherry [58] 2008 NA Moses, Giza++, and phrasal
Udupa and Faruquie [57] 2005 NA Moses, Giza++, and Phrasal
Udupa and Faruquie [57] 2005 NA IBM models 1, 2, and 3
Hierarchical-based Khan et al. [48] 2013 Sampling, tokenization and tuning Moses Decoder, Giza++, SRILM
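Several of the phrase-based systems in Table 5 rely on an n-gram language model built with toolkits such as KenLM, SRILM or IRSTLM. The sketch below only illustrates the underlying idea and does not reproduce any of those toolkits' interfaces: it estimates bigram probabilities with add-one smoothing, which is a deliberately simpler substitute for the Kneser-Ney smoothing named in the table. The function names `train_bigram_lm` and `sentence_prob` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Train an add-one smoothed bigram model on whitespace-tokenised text.

    Returns (prob, vocab), where prob(prev, word) is P(word | prev) and
    vocab is the set of words the model can predict.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        # Every token except the last one acts as a context.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Add-one (Laplace) smoothing over the predictable vocabulary.
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

    return prob, vocab


def sentence_prob(prob, sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(prev, word)
    return p


if __name__ == "__main__":
    prob, vocab = train_bigram_lm(["the cat sat", "the dog sat"])
    # A word order seen in training should score higher than a scrambled one.
    print(sentence_prob(prob, "the cat sat") > sentence_prob(prob, "sat cat the"))
```

In an SMT decoder, a score of this kind is combined with the translation model score so that fluent target-language word sequences are preferred.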
Table 6 SMT classification based on domains and corpus

Author name Year Domain Corpus
Subalalitha [45] 2018 News, agriculture and technical phrases IIT Bombay and manually collected corpus
Jindal et al. [46] 2018 Health, tourism and Gyan nidhi Manually curated
Patel [47] 2018 General MTIL-2017
Khan et al. [48] 2017 Consumer, education, health, housing, legal and EMILLE
social documents
Patel et al. [49] 2016 Health, tourism, and general domain ILSMT and 23K sentences from other
Patel and Pimpale [50] 2016 Health, tourism, and general domain ILSMT, 23K, and 500 sentences from other
Das and Baruah [51] 2014 Tourism data Parallel corpora of about 8000 sentences
Ali et al. [52] 2014 Quran, ahadeeth 6000 sentences
Ali [52] 2010 Religious Adaheeth
Khan et al. [48] 2013 NA EMILLE
Ali et al. [54] 2013 NA 41,208 sentences
Kumar and Kumar [55] 2013 Names 15,000
Anwar et al. [56] 2009 NA NA
Udupa and Faruquie [57] 2005 News, government documents, conversation, 150,000 sentence pairs
and magazine articles
• Sanskrit is rich in inflection and morphology. This richness changes the last character of a word and its gender, and makes it difficult to remember the different forms of word inflections [9].
• The vocabulary of almost all Indian languages has been derived from Sanskrit. There have been cases of meaning shift, reduction and expansion. This makes it difficult to understand Sanskrit text without prior knowledge
Fig. 4 Neural modelling for different languages
Table 7 NMT system based on toolkit with its respective methodology

Toolkit Author name Year Methodology Model
Tensorflow Jha et al. [76] 2018 Sequence to sequence, alignment, hierarchical Bi-LSTM with attention
attention network, transformer network and
character encoding
GoogleTranslate API Choudhary et al. [77] 2018 Sequence to sequence model, attention, BPE, Bi-LSTM with attention
word embedding
OpenNMT Pathak and Pakray [78] 2018 Data pre-processing, training, encoding, NA
decoding and translation
OpenNMT Ramesh and Sankaranayanan [79] 2018 NA Attention-based model
TensorFlow Singh et al. [80] 2018 Neural machine translation, training, testing LSTM with attention
Nematus Jigar Mistry [81] 2017 Encoder–decoder model, attention and BPE Vanilla LSTM or GRU
OpenNMT Revanuru et al. [82] 2017 Input, Bi-LSTM, Sum, LSTM, bridge and Deep Bi-LSTM
decoder
OpenNMT Tennage et al. [83] 2017 Pre-processing, OpenNMT, benchmark, train- NA
ing and word phrases
TensorFlow Aggarwal and Sharma [84] 2017 LSTM and GRU cells, encoders and decoders Bi-LSTM with attention mode
Theano Yerra et al. [85] 2016 NA Attention-based model
Table 8 NMT systems with their respective domain and corpus

Author Year Domain Corpus
Choudhary et al. [77] 2018 News, Bible, Cinema, Movie Subtitles EnTamV2.0 and Opus
Jha et al. [76] 2018 Dictionary Manually curated words from dic.
Pathak and Pakray [78] 2018 General domain MTIL
Ramesh and Sankaranayanan [79] 2018 Wikipedia Wikimedia dumps
Singh et al. [80] 2018 Health, tourism, agriculture and entertainment TDIL, EMILLE, OPUS
Revanuru et al. [82] 2017 Agriculture, entertainment, health and tourism TDIL-DC
Aggarwal and Sharma [84] 2017 General domain 50,000 sentences from Bojar corpus and ILCI
Tennage et al. [83] 2017 Annual reports, establishment codes, order Official government documents of Sri Lanka
papers, and official letters
Jigar et al. [83] 2017 Health and tourism ILCI
Table 9 NMT systems with their language pairs and respective accuracy
Author name Year Language BLEU Accuracy
Jha et al. [76] 2018 Hindi–Bhojpuri 90.89 90.23

Choudhary [77] 2018 English–Tamil 8.33 NA
Pathak and Pakray [78] 2018 English–Hindi 52.54 NA
Ramesh and Sankaranayanan [79] 2018 English–Tamil 5.53 NA
English–Hindi 3.97 NA
Singh et al. [80] 2018 English–Punjabi 26.07 NA
Mistry [81] 2018 English–Hindi 26.88 NA
Bengali–Hindi 33.87 NA
Gujrati–Hindi 53.95 NA
Revanuru et al. [82] 2017 Punjabi–Hindi 46.47 NA
Gujrati–Hindi 35.69 NA
Urdu–Hindi 22.47 NA
Tamil– Hindi 7.56 NA
Aggarwal and Sharma [84] 2017 English–Hindi 9.23 NA
Tennage et al. [83] 2017 Sinhala–Tamil 7.5 NA
Tamil–Sinhala 12.75 NA
Yerra et al. [85] 2016 Bengali–Hindi 20.41 NA
Fig. 5 Hybrid modelling for different languages
of original meaning. There are various trends in Sanskrit literature for commentaries. The presentation of commentaries in the Sanskrit language is in a nested form, which makes it difficult for modern scholars to understand [9].
• The structural difference between languages leads to problems in translation. Sanskrit and Hindi are Vibhakti and karaka based languages, while in English, whenever a Noun Phrase is encountered, the Vibhakti place is replaced with prepositions or null [135].
• Sanskrit consists of complex words, so that two to three English words may correspond to a single Sanskrit word. This conflational and inflational divergence is encountered for Sanskrit and Hindi translation as well [135].
Table 10 Hybrid MTS modelling technique based on its model and toolkit

Technique Author Year Model Toolkit
Rule-based + statistical-based Salunkhe et al. [126] 2016 NA Open NLP

Rule based + statistical phrase based Dhore and Dixit [129] 2011 NA NA
Rule-based (lattice based lexical transfer) + statistical-based Chatterji et al. [130] 2011 Phrase based NA
Rule-based (lexical transfer based) + statistical-based Chatterji et al. [132] 2009 Phrase based Giza++
Statistical machine translation + translation memory Nithya and Joseph [127] 2013 Phrase based IRSTLM, GIZA++, Moses decoder
Rule-based + example-based Kaur and Laxmi [128] 2013 NA NA
Rule-based + neural-based Shahnawaz and Mishra [131] 2011 NA Java (JDK 1.5) with Matlab 7.1
Rule-based + example-based + statistical-based Dhariya et al. [125] 2017 Phrase based NA
Table 11 Hybrid systems with their domain and corpus
Author Year Domain Corpus
Dhariya et al. [125] 2017 NA CFILT, IIT-Bombay
Salunkhe et al. [126] 2016 NA Parallel Corpus (IIT-Bombay)
Nithya and Joseph [127] 2013 Indian history and 563 sentences
Islamic history
Kaur and Laxmi [128] 2013 News headlines 300 sentences
Dhore and Dixit [129] 2011 Banking glossary Reserve Bank of India
Chatterji et al. [130] 2011 Tourism 2000 sentences
Shahnawaz and Mishra [131] 2011 NA NA
Chatterji et al. [132] 2009 Written EMILLE-CIIL
Table 12 Hybrid MT based on language with its respective accuracy
Author Year Language BLEU Accuracy
Dhariya et al. [125] 2017 Hindi–English NA 86.50

Salunkhe et al. [126] 2016 English–Marathi NA 83
Nithya and Joseph [127] 2013 English–Malayalam 69.33 75.30
Kaur and Laxmi [128] 2013 English–Punjabi NA 81.67
Dhore and Dixit [129] 2011 English–Hindi NA 97.25
English–Marathi NA 97
English–Gujarati NA 96.50
Chatterji et al. [130] 2011 Bengali–Hindi 29.45 NA
Shahnawaz and Mishra [131] 2011 English–Urdu 69.54 NA
Chatterji et al. [132] 2009 Bengali–Hindi 22.57 NA
• Though Sanskrit is a free word order language, there are still cases where a change in word order changes the meaning of a sentence when translating between Sanskrit and English [135].
• The Sanskrit and Hindi language pair most commonly uses passive voice, while English uses active voice sentences. This change of voice leads to problems in translation for the specific language pair [135].
• There is categorical and lexical divergence for the Sanskrit–English and Sanskrit–Hindi language pairs. Categorical divergence occurs when the parts of speech of the translation pair do not match. Lexical divergence arises when no exact match can be mapped from one language to the other during translation. In Sanskrit, a different meaning is generated by adding an upsarga to the verb [135].
• In the translation of Sanskrit to English/Hindi, adjuncts and clauses are located differently, causing divergence in translation. This divergence changes the sentence construction of the language pair [137].
• Sanskrit has honorific features containing verb endings with adjectives and nouns. These plural verb and plural pronoun inflections are caused by socio-cultural aspects of the languages [137].
• The mapping of time from English to Sanskrit and Hindi causes a problem, as am and pm in English cannot be mapped in Sanskrit as afternoon and morning. The terms am and pm cannot be translated as such in Sanskrit and Hindi [137].
• There is a formal grammar defined by Panini for Sanskrit, but no such parallel grammar exists for Hindi or European languages. In the absence of such a parallel grammar, exceptional cases are covered by forming linguistic rules. There are cases where Vibhakti diverges from Sanskrit to Hindi, such as optional, exceptional, differential, alternative, non-Karaka, verbal and complex-predicate divergence [137].

3.2 Technical Challenges

This section discusses the technical challenges of Sanskrit-based MTS. The challenges are encountered while developing and applying modelling techniques for MTS.

• Domain mismatch: An NMT-based system exhibits poor performance if the content is not from the domain it was trained on. It gives low quality for out-of-domain text, as it sacrifices adequacy for fluency. In information gisting, the fluent output can mislead the user [138, 139].
• Amount of training data: NMT performs well for high-resource languages as compared to low-resource languages, as its learning depends on the amount of training data. NMT systems are trained on millions of sentences, and their accuracy grows with the amount of data [140, 141].
• Noisy data: An NMT system is not robust to noisy data in corpora such as misaligned sentences [142], poorly translated sentences and content in the wrong language. In such cases, NMT fails to predict the relationship between the language model and the input data context. Even when the training ratios are varied, the problem does not disappear, as the system still produces inadequate output.
• Word alignment: Aligning input words to output words is served by the attention model of NMT [63]. The performance of attention-based NMT is very poor for longer sentences, and it does not provide accurate word alignment. Discrete translation and lexicon dictionaries [143] have been incorporated to improve the system, along with fertility and coverage modelling [144]. The attention mechanism may produce better alignment if provided with guided learning [145, 146]. In SMT as well, the source sentence can be translated into many target sentences, and aligning phrases in the target with the source becomes cumbersome. An efficient alignment algorithm is required after translation. There are several techniques, such as templates for alignment [118, 147], the Hidden Markov Model (HMM) [148] and toolkits [149–151]. Despite various attempts to enhance alignment in SMT systems, it does not fulfil the role and shows divergence.
• Beam search: The decoding process generates the translation with the highest probability. To find it, a search is performed over the possible translations. In NMT, the number of candidate translations kept at each step is termed the beam size. Accuracy is not directly proportional to the beam size, as after a point the quality starts degrading. The scores must be normalised manually by sentence length. Outside the optimal beam-size range (30–50), translation quality starts degrading. Wider beams deteriorate the quality by producing shorter translations [152].
• Longer sentences and inflectional category words: The translation quality of complex and long sentences is low as compared to short sentences. Even low-frequency words are not easily translated by NMT-based systems [152].
• Parallel corpus: Corpus-based modelling techniques require a large amount of parallel and monolingual corpus, which is costly and time-consuming to create. The quality of the constructed model depends directly on the amount of training data. A collection of millions of monolingual sentences yields a better language model producing fluent output. The translation model is trained using a parallel corpus to produce adequate translations. Even with huge corpora, translation quality is very coarse. This modelling technique does not apply to many Indic languages, as there are very few or no parallel corpora for some low-resource languages [152].
• Idioms and multiword expressions: The properties exhibited by idioms and multiword expressions make them difficult to translate. To be most effective, the corpora should be customised for the specific idioms of a particular language pair. Even then, SMT requires pre-processing and post-processing steps for handling such cases [153].
• Time consuming: SMT and RBMT can also be expensive, as they require a lot of upfront cost. Both pre-processing and corpus creation are expensive and time-consuming. They also require collaboration between computer scientists, translators, linguists and statisticians [154].
• Learning: Every phase of system learning requires continuous error detection. It is harder to fix mistakes in the system once they have been implemented. With models like RBMT, you can fix errors and remove certain words quite easily. With SMT, you need to retrain the whole system and check whether other errors have emerged [153].
• Linguistic issues: Even an SMT system trained on 100 million words produces only a partially excellent translation. It suffers from various linguistic issues such as mangled grammar, wrong word choices, name translation, unknown words and syntactic transformations [155].
• Linguistic knowledge: Some linguistic information still needs to be set manually (such as rules and parts of speech). This requires human intervention and linguistic knowledge of the source and target languages [155].
• Generalized system: It is hard to deal with rule interactions in big systems, resolve ambiguity and handle idiomatic expressions. It is difficult to build a generalized system handling all the aspects of a language pair [154].
• Domain adaptability: Although RBMT systems usually provide a mechanism to create new rules and to extend and adapt the lexicon, adapting to new domains requires extensive knowledge of the language and human effort, as each word requires a rule to disambiguate its meaning [154].
• Resources: The resources required for developing RBMT are linguistic rules, dictionaries, language-specific tools, a morph-analyser, a parser and a generator. It is an expensive process as it requires much human effort and knowledge. There is also a requirement for efficient corpora for a particular language. Some languages do not have sufficient data, and preparing these data is a challenging and time-consuming task [154].
• Modification: Changing an existing RBMT system requires altering its rules, which is more expensive than generating new rules [154].

4 Comparison of Various Modeling Techniques

In MT research for Indic languages, RBMT and SMT are the methods used most frequently. Though RBMT is the oldest approach and can achieve good results, its development is very time-consuming, as linguistic rules need to be fixed manually for every word in a sentence. In terms of investment, the customisation cycle needed to reach the quality threshold can be quite long and costly in terms of human resources. RBMT systems are built with less data than SMT systems, along with dictionaries and language rules for translation. RBMT does not produce fluently translated output. Also, language is constantly changing, which means rules must be managed and updated wherever necessary in RBMT systems. SMT systems, in contrast, require much less time and linguistic knowledge. SMT models require more computer processing power and storage capacity to build and manage large translation models. Fig. 6 shows that SMT systems are the ones most used by researchers for the translation of different languages. On the other hand, neural and hybrid-based systems are used very little for Indic language pairs due to insufficient parallel data. The performance of these modelling techniques is good, with fluent output. The computational requirement for these techniques is greater than the human resource requirement. Machine learning and linguistic knowledge can be applied to these techniques to improve performance. These techniques sometimes exhibit out-of-domain quality. Error analysis is difficult to perform for these techniques, so these systems need to be tested for future translations. The performance of some of the MT modelling techniques for different quality parameters is compared in Table 13. The year-wise usage of these techniques is shown in Fig. 7.

Fig. 6 Overall percentage of various modelling techniques (pie chart; segment labels: SMT, RBMT, Neural, Hybrid; segment values: 41%, 23%, 22%, 14%)

5 Comparison of Machine Translation Systems

MT aims to translate one language to another by utilising different resources. Over the years, different modelling techniques have evolved to provide efficient translations (Fig. 7). The principal objective is to bridge the language gap between two distinct languages involving individuals, groups or nations. In India, we have different languages and growing web content, which need massive language translation. Tables 14 and 15 contain the classification of MTS based on the human-engineered and machine-engineered approaches along with their features and outcomes. These tables contain information like the modelling technique used, features of the technology and outcomes along with different language
Table 13 Comparison of various MTS techniques based on various parameters
Features RBMTS SMTS EBMTS NMTS
Performance Good Medium Good Good
Fluency Less Medium High Medium
Robust Yes No Yes No
Human resources requirement High Less Medium Less
Computational resources requirement Less Medium Medium High
Machine learning applicability No Yes Yes Yes
Linguistic knowledge requirement Yes No Yes Yes
Use of grammar Yes No No No
Out of domain quality Medium Low High Low
Predic quality Good Similar Very well Similar
Consistency High Low Medium Low
GPU requirement No No No Yes
Language dependency Yes No No No
Maintenance Difficult Easy Easy Easy
Model size Huge Huge Moderate Small
Error analysis Easy Difficult Difficult Impossible
Parallel corpus requirement No Yes Yes Yes
Dictionary and rule requirement Yes No No No
Extendable Difficult Easy Easy Easy
Fig. 7 Modelling techniques based on the year of its development
Table 14 Human-engineered systems developed for various languages
Author Corpus Language pair Features Outcome
Technique: Rule-based
Adapanawar et al. [27] General database English–Marathi Open source NLP tools are used for developing the system. The research represents a theoretical and grammatical frame-
Database of rules using a bilingual dictionary is used for work which is extendable
mapping
Adak [36] Parallel corpus English–Bengali Use a soft computational technique where the fuzzy If-Then The proposed system works well in sentence translation from
rule is applied to choose a lemma from prior knowledge English to Bengali and they obtain 82.92% F-measure on the
basis of their test case analysis
Pisharoty et al. [28] General database English–Marathi For improving the performance of the system, grammar and The additional functionalities have improved the system accu-
spell checker can be used. The sentiment analysis module racy, although there is a trade-off of time
can also be used
Garje et al. [24] 1000 sentences English–Marathi Semantic and morphological properties are maintained in the The accuracy of the system using TDIL corpora is 44.29% and
lexicon grammatical structure of the target language gives for human translation is 49.78%
importance for better translation
Basavaraddi et al. [26] General database English–Kannada Complex morphology of the target language handled by the Differences were found in the syntactic module (word order and
morphological generator. Syntax reordering overcomes morphological level)
syntactic differences
Technique: Transfer based
Nair and Peter [29] 1000 sentences Malayalam–English Artificial intelligence techniques used for system develop- The system was tested for 1000 different sentences and reported
ment. Splitter for splitting compound words, bilingual true result for the sentences which had two subordinate
dictionary, the morphological parser clauses. The system is easily extendable for other language
pairs
Table 15 Machine-engineered systems developed for various languages
Author Corpus Language pair Features Outcome
Technique: Statistical based

Kumar and Kumar [55] 1000 names Punjabi–English System has training to learn and transliteration The accuracy achieved by the system is 97% test set gives
the BLEU score 32.11
Ali et al. 20,173 sentence pair English-Urdu Moses is used for language training with modeling NA
toolkit IRSTLM
Technique: Phrase-based statistical
Pingali and Vasudeva [156] 43,500 sentences Telugu–English Modules: language model, translation model, and The system exhibits based on geometric average fluency
decoder of 2.693 and adequacy of 2.93
Technique: Hierarchical phrase-based
Khan et al. [53] EMILLE Corpus English–Urdu EMILLE corpus is used. On Urdu monolingual corpus BLEU score for the system in fivefold test data, 40%
(12,500 sentences) language model is built. Using SRILM toolkit N-gram (phrase-based) and 29% (hierarchical-based) NIST
model is used score for the system is 73% (phrase-based) and 63%
(hierarchical-based)
Technique: Hybrid
Nithya and Joseph [127] 563 sentences Malayalam–English A statistical method is applied to the corpus and apply- BLEU score for the baseline system was 68.14 and for the
ing machine learning techniques for translation hybrid system was 69.33
Dhore and Dixit [157] 1000 words English–Devnagari The multilingual dictionary is created using 1000 bank- The multilingual dictionary is improved and semantic
ing glossary, which is available on the RBI website. C rules in the parser design have also improved
language is used for the lexical analyzer. For running
the system Bison tool is used
Nair et al. [29] General database English–Hindi New rules has been added The proposed system design shows the accurate results
than other systems that are 94%.
Godase and Govilkar [158] Dictionary database English–Marathi Bilingual corpus is used for training. Parsing is used by It makes resources available to everyone by presenting
the system a complete architecture and several algorithms for the
system
Singla and Baghla [159] 15,000 sentences English–Punjabi Using rule-based modelling technique parsing is per- The accuracy achieved by the system is 81.67%
formed by the system
Technique: Example based
Sinhal and Gupta [160] 677 sentences English–Hindi Comparing sentence to extract the translation. Parallel The system provides 96.07% word strength and 86%
corpus is used for Training. Uses various modules such precision for 677 sentences
as similarity matrix, training matrix, tagging matrix
Anuj and Manoj Kumar [161] General database Malayalam–English Phases of the system are an acquisition, matching, Best translation quality is given by the 75% test and reor-
and recombination searching mechanism is used for dering problems
searching fragments of Malayalam
pairs. An analysis of the percentage of work performed for different languages is shown in Table 16. In the field of MT, more systems have been developed to translate the Hindi and English languages, as per the analysis shown in Fig. 8. The paper contains details of research work for other Indian languages such as Punjabi, Bengali, Marathi, Telugu, Tamil, Assamese, Urdu, Malayalam, Gujarati, Sanskrit, Kannada, Dogri, Sinhala and Devnagari in Tables 14, 15 and 16, and concludes that work for the Sanskrit language is minimal despite its huge corpora and literature.

Table 16 Comparison of modelling techniques based on its development for Indic language pairs
Language pair SMT RBMT NMT Hybrid
Bengali–Hindi ✓ ✗ ✓ ✓
Marathi–Hindi ✓ ✗ ✗ ✗
Telugu–Hindi ✓ ✗ ✗ ✗
Tamil–Hindi ✓ ✗ ✓ ✗
English–Hindi ✓ ✓ ✓ ✓
Assamese–English ✓ ✗ ✗ ✗
English–Urdu ✓ ✓ ✗ ✓
Punjabi–English ✓ ✓ ✗ ✗
Bengali–English ✓ ✗ ✗ ✗
Hindi–English ✓ ✗ ✗ ✓
Malayalam–English ✓ ✗ ✗ ✓
Tamil–English ✓ ✗ ✗ ✗
Telugu–English ✓ ✗ ✗ ✗
Urdu–English ✓ ✗ ✗ ✗
Bangla–English ✓ ✗ ✗ ✗
English–Malayalam ✗ ✓ ✗ ✓
English–Tamil ✗ ✓ ✓ ✗
English–Dogri ✗ ✓ ✗ ✗
English–Marathi ✗ ✓ ✗ ✗
English–Bengali ✗ ✓ ✗ ✗
English–Kannada ✗ ✓ ✗ ✗
Punjabi–Hindi ✗ ✗ ✓ ✗
Gujrati–Hindi ✗ ✗ ✓ ✗
Urdu–Hindi ✗ ✗ ✓ ✗
Sinhala–Tamil ✗ ✗ ✓ ✗
Tamil–Sinhala ✗ ✗ ✓ ✗
English–Punjabi ✗ ✗ ✓ ✗
English–Gujrati ✗ ✗ ✗ ✗
English–Devnagari ✗ ✗ ✗ ✗
English–Sanskrit ✗ ✓ ✗ ✓
Sanskrit–Hindi ✓ ✗ ✗ ✗

6 Machine Translation System for Sanskrit and Hindi Language Processing

In a large multilingual society like India, there is great interest in the translation of text from one language to many languages. Sanskrit is the mother tongue of 24,821 people and Hindi of 52,83,47,193 people, i.e., 43% of the total population of India according to the census of India [162]. Sanskrit is considered the donor of almost all Indian languages [163]. The vast reserves in the Sanskrit language can be converted into other languages [164]. The rich knowledge base in Sanskrit is its grammatical tradition, which attracts Indian and western scholars. It is one of the spoken languages and at one time was also known as the 'Lingua Franca' of the world's intellectuals [165]. Sanskrit has texts in different domains such as Ayurveda, Philosophy and Astronomy. It holds a rich grammar defined by Panini nearly 2500 years ago, formulating 3949 rules, which were extended later on [2]. Sanskrit has the strongest and simplest non-ambiguous grammar [166]. Sanskrit has a rich scientific literature with extensive and comprehensive analysis, a structured approach and traditional grammar [167]. Many people have attempted to write a grammar for the Sanskrit language using the Paninian framework and used it to develop translation systems [168]. The Sanskrit grammar is termed the 'Father of Informatics' as it builds a relationship between the speech and utterance of the speaker and the meaning derived by the listener [169]. Hence, the primary objective of Paninian grammar is to form a theory of human natural language communication. Sanskrit and Hindi belong to the same Indo-Aryan family [137]. They have structural and lexical similarity, as Hindi inherits from Sanskrit. Sanskrit has a rich and structured grammar in the form of Panini's Ashtadhyayi, whereas no such parallel grammar exists for Hindi. Therefore, it becomes difficult to map the divergence between these two languages. The non-existence of such a grammar leads to exceptional cases which uncover linguistic generalisations such as Vibhakti in Hindi. Despite its rich grammar, choosing Sanskrit as a source language is difficult because parsing fails due to its synthetic nature, in which a single word can run up to 32 pages. Despite the rich diversity of grammar, text and resources, it is perplexing to find access to Sanskrit computational tools. One of the many reasons stated is that the literature remains inaccessible because Sanskrit scholars do not turn towards computer science. There are few systems for processing the Sanskrit language and translating it to English or vice-versa. These are depicted in Table 17. There are decidedly fewer systems for translation of Sanskrit to Hindi as compared to English–Sanskrit, as displayed in Fig. 10. Different modelling techniques have been used for processing the Sanskrit language, as depicted in Fig. 9. Figure 9 shows that a lot of work has been done on RBMT systems for the Sanskrit language, whereas no work has been done on Sanskrit translation using the neural modelling technique. The other techniques, example-based and hybrid-based, are equally used for Sanskrit translation. The performance
13
M. Singh et al.
33%
30 %
24%
Percentage(%)
20 %
10 %
7%
6%
5%
4% 4%
3% 3%
2% 2%
1% 1% 1% 1% 1%
0%
Telgu
Tamil
Gujrati
English
Hindi
Kannada
Sinhala
Dogri
Bangali
Bhojpuri
Assamese
Sanskrit
Punjabi
Malyalam
Marathi
Urdu
Languages
Fig. 8 MTS developed for various languages
Fig. 9 MTS developed for dif-

ferent modelling techniques Neural-based Approach 0%
Hybrid Approach 14%
Example-based Approach 14%
Statistical-based Approach 28%
Rule-based Approach 43%
0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50%

Percentage of modelling techniques for Sanskrit languages
Sanskrit-Hindi 37.5%
English-Sanskrit 62.5%
0% 10% 20% 30% 40% 50% 60% 70%

Percentage of Systems
Fig. 10 MTS developed for Sanskrit language
13
Table 17 Comparison of different MTS for Sanskrit and Hindi language processing
Author Technique Language pair Corpus Features Results Issues
Pandey and Jha [170] Statistical-based Sanskrit–Hindi 24,000 (bilingual) The system is being trained 39.17 % for long, complex and The system takes input only in
25,000 (monolin- simultaneously on Microsoft compound sentences. 41.17 Devanagari Unicode script
gual) Translator Hub (MTHub) for bilingual and monolin- and gives output in same.
and is intended only for sim- gual sentences Sometimes the system does
ple Sanskrit prose texts not a response to long and
compound sentences. One can
translate Sanskrit sentence by
giving direct input in the box
Rathod et al. [171] Rule and example-based English–Sanskrit NA The text input is processed An average improvement of EBMT is better than RBMT but
with spell checker followed 10% has been achieved by performance degraded in case
by token generator, transla- using EBMT in the transla- of extra large sentences
tor, parser, EBMT/RBMT tion quality than RBMT
database and generator
Shukla and Shukla [172] Rule-based English–Sanskrit NA It unifies the isolated word An average improvement of Translates only simple sen-
class under the speech 7% has been achieved in the tences, not complex sentences
recognition type, tradi- translation quality
tional dictionary rule-
based machine translation
technique and text to speech
synthesizer
Bahadur et al. [173] Rule-based English–Sanskrit 500 sentences The sentences can be simple An average improvement of The sentence is correct in terms
and compound with the 13% has been achieved in of grammar, but the transla-
affirmative and imperative the translation quality. And tion is not correct. Few words
type or of active or passive 90% accuracy is achieved by in English may be used as
voice having any of the three this modeling technique for both noun and verb. This
tenses i.e. present, past, and extra large sentences generates ambiguity for the
future system
Jayan et al. [174] Example-based English–Sanskrit 125 input-output pairs Proposed a novel method An average improvement of For better improvement, future
that uses rules and ANN 8% has been achieved in the work is carrying to perform
technique to detect and translation quality case-based reasoning in a
implement the adaptation combination with rule-based
rules for the divergence in and ANN model for this
English to Sanskrit machine purpose
translation
13
M. Singh et al.
of these techniques for the Sanskrit language is compared in Fig. 10, and accordingly the hybrid approach performs much better than the other techniques, achieving a 37% improvement over them.

Table 17 (continued)

Author: Subramanian et al. [175]
Technique: Rule-based
Language pair: English–Sanskrit
Corpus: NA
Features: Since the conjunction of individual words of Sanskrit into longer ones is a common occurrence in Sanskrit texts, sandhi vichcheda is the most important part of the translator
Results: An average improvement of 7% has been achieved in the translation quality
Issues: Semantic ambiguities were not handled; pragmatics and discourse considerations were out of scope. Some among them are the ambiguities in the use of the right prepositions and the plurals of words. Another aspect that has not been dealt with is the structuring of the sentence

7 Open Issues

MTS still face many issues. This section discusses current as well as future challenges, given the pressing need for efficient MTS.

• MT is an NP-hard problem whose goal is to provide accurate translations. The technology has improved drastically in the past 10 years, but it is still under development. Therefore, even after post-processing, the meaning of the original document is not preserved with 100% accuracy.
• A large amount of parallel data (corpus) needs to be developed in order to apply statistical and neural techniques to the Sanskrit language. Machine-engineered techniques require an enormous amount of parallel text, which is not available for some language pairs because some languages are not rich in lexical resources [176].
• When translating between language pairs that have different word orders, such as Subject Verb Object (SVO, as in English) and Subject Object Verb (SOV, as in Hindi), difficulties arise because the output translation does not follow the word order of the target language. This is also termed structural divergence, as the structures of the two languages differ from each other. Structural ambiguity adds to the difficulty. For example, in "He saw a man on the hill with a telescope", it is ambiguous whether the man is seen with a telescope or the man is seen on the hill. Post-editing rules can be used for handling such problems. However, ambiguity remains a challenging problem in MTS [177].
• Translating one language into another also entails structural and semantic divergence. Ambiguity, whether partial or complete, needs to be modelled at various levels such as lexical, semantic and structural.
• Various grammatical issues related to a particular language need to be handled, such as Subject-Verb and Adjective-Noun arrangement and parts of speech assignments.
• In the RBMT approach, the anaphora resolution problem arises, in which the contextual interpretation of an expression depends on another expression. It also includes problems such as zero-pronoun anaphora and pro-drop anaphora [18].
• Case marking is a grammatical category applied to a particular word or group of words (nouns, pronouns, adjectives) that denotes the course of action of that word in a sentence or phrase. Case marking is language-
dependent, and therefore it differs between languages such as Sanskrit and Hindi.
• Idiomatic expressions, rare words and out-of-vocabulary words, whose meaning cannot be inferred from the individual words, are difficult to handle.
• Setting the linguistic features of languages poses many challenges, such as the parser, root, GNP (Gender, Number, Person), TAM (Tense–Aspect–Modality) and reordering.

8 Future Research Directions

From the literature and the analysis of current MTS, the following research directions are suggested.

• MTS helps to unite the world socially, culturally and technologically. Hence, there is a significant requirement for inter-language translation for the transfer and sharing of information and ideas. Sanskrit is considered an essential language in the Indo-European family. Research to explore the potential of this language is required to open perspectives in the computational linguistics domain. Using Sanskrit is a challenging task for MTS because of the morphological complexity of the language. Currently, there are a few systems for English to Sanskrit translation. However, more work in this field is highly desirable [11] because, in India, most of the holy books (Granths) are available in Sanskrit. Moreover, almost all Indic languages are derived from Sanskrit.
• Using ensemble techniques to develop multiple systems requires manual intervention [69]. This involves merging multiple alternative generative systems and combining the outputs of different systems. The ensemble technique can be applied using checkpoints, averaging the outputs, and re-ranking the translations with right-to-left decoding.
• Dealing with rare or unknown words in the vocabulary of language pairs is tedious when building MTS [178]. Managing inventory and company names containing special characters is also challenging. In machine-engineered approaches, these words need to be handled by forming a special entry such as an UNK tag or an entry in the back-off dictionary, whereas in human-engineered approaches they are handled by manually constructed rules.
• For most Indic languages, monolingual corpora are available. Because parallel corpora are scarce, machine-engineered techniques are not directly applicable. Therefore, there is an absolute need to either embed language models in training the system or create synthetic parallel corpora to build MTS for Indic languages [179].
• The recent trend of deep learning in computer vision and speech recognition has inspired the MT field to develop deeper models. This reduces the model perplexity, increases the accuracy and builds a better coverage model [180].
• Aligning the input sentence with the translated sentence requires human intervention [181]. In the human-engineered approach, manual rules for Vibhakti, TAM and gerunds handle alignment, whereas in the machine-engineered approach there are several tools such as GIZA++ [149], fast-align [182] and the attention mechanism [69]. Computer-aided translation requires the origin of the output translation and an optimized alignment.
• The machine-engineered approach sometimes translates some words multiple times while skipping others, leading to over-generation and under-generation, respectively. Possible solutions are the attention mechanism, feature engineering and the fertility of words [183].
• MT data may contain text of differing topic, style, formality and so on. This variation needs to be handled by adapting the system with out-of-domain data [184].
• Indic languages have rich linguistic features, including morphology and inflexion. These features need to be integrated into MTS to yield efficient translation. They can be embedded in the input sentences or the output sentences, or used to build linguistically structured models [185].
• Languages with rich morphology require an entirely different approach for dealing with linguistically rich words. It is necessary to find a way to scale up the training of a neural network in terms of both computation and memory, so that much larger vocabularies for both source and target languages can be used [66].
• In machine-engineered approaches, performance degrades on inflected words and long sentences. Handling this requires linguistic knowledge of the language pair. As suggested by [185], incorporating linguistic information when training a corpus-based system improves the system.
• Different neural architectures need to be explored, especially for the decoder [66].
• Techniques for paraphrase creation need to be explored, from manual word substitution to the pivot technique using other translation systems. Future work possibilities include the handling of noisy paraphrases, evaluation strategy, named entities and genre mismatch [186].
• Using Sanskrit is a challenging task for MTS. There are a limited number of systems for English to Sanskrit translation. However, more work in this field is highly desirable [11].
• Sanskrit grammar is well structured and least ambiguous. Nevertheless, MT for Sanskrit is not an easy task. Whether Sanskrit is the source or the target language, MTS development is still at a developing stage. Some systems are confined to specific domains and handle only concise sentences and phrases. Due to the morphological richness of Sanskrit, a
separate lexicon for Sanskrit sentences with morphological details may be maintained in a database stored in the form of the logic of a programming language [171].
• The techniques used for Sanskrit to Hindi MTS are dictionary-based (word-to-word translation), rule-based and statistical. Further extensions can be made by using other methods (example-based, corpus-based or a hybrid technique) for Sanskrit to Hindi MTS.
• Decades of work on the digitisation of Sanskrit text and the development of Sanskrit lexical and linguistic resources have culminated in a collaborative effort to develop an efficient MTS for Sanskrit to Hindi [172].
• Exploring the use of high-order statistical methods for processing the Sanskrit language is desirable [187], as current methodologies lack generalisation, domain adaptability and extensibility. Much more work in the field of MTS for low-resource language pairs and morphologically rich languages such as Indic languages is highly desirable [176].
• The most common problem of MTS, i.e. updates, can be handled by making the system function as a repository. It can also be packaged as a virtual appliance to solve this problem. The key benefit of a virtual machine is fine granularity together with reduced time for adding and removing computational resources. Virtualisation also increases the mobility of the application and reduces deployment time. It can be deployed both on the cloud and on a standalone system [176].

9 Conclusion

In this paper, a review of different modelling techniques, along with the research challenges, is presented. It provides developers with the resources required for modelling different techniques, such as corpus, domains, toolkits, techniques, models, features and their evaluation measures. A comparison of research work on different Indic language pairs based on modelling techniques is performed. It shows that work on the Sanskrit–Hindi language pair is minimal, despite Sanskrit holding an ancient, scientific and comprehensive literature of India. As per the analysis, more translation systems are available for translating any language to English and Hindi than to other languages. The analysis also shows that SMT-based MTS are the most used, at 41%, whereas hybrid MTS account for 14%, rule-based for 22% and neural-based for 23%. Neural and hybrid approaches perform better than the other techniques, so these techniques may be considered for future use. Existing work for the Sanskrit language has also been reviewed, and the analysis shows that the most frequent approaches, i.e. RBMT and SMT, were used with low accuracy. Therefore, more accurate and efficient techniques, i.e. NMT and HBMT, need to be implemented. The paper further contributes open issues, technical and linguistic challenges, and future research directions in the field of MT for processing the Sanskrit language.

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Trujillo A (2012) Translation engines: techniques for machine translation. Springer, Berlin
2. Bharati A, Chaitanya V, Sangal R, Ramakrishnamacharyulu K (1995) Natural language processing: a Paninian perspective. Prentice-Hall, New Delhi
3. Chowdhury GG (2003) Natural language processing. Annu Rev Inf Sci Technol 37(1):51–89
4. Drummer A (1996) Literature review: MT. MT for South African Languages. http://people.cs.uct.ac.za/~bsharwood/downloads.html. Retrieved 28 May 2019
5. Hutchins J (2005) Current commercial machine translation systems and computer-based translation tools: system types and their uses. Int J Transl 17(1–2):5–38
6. Hutchins J (2009) Multiple uses of machine translation and computerised translation tools. Mach Transl 13–20
7. Bharati A, Kulkarni A (2008) Information coding in a language: some insights from Paninian grammar
8. Briggs R (1985) Knowledge representation in Sanskrit and artificial intelligence. AI Mag 6(1):32
9. Kulkarni A, Das M (2012) Discourse analysis of Sanskrit texts. In: Proceedings of the workshop on advances in discourse analysis and its computational aspects, pp 1–16
10. Mane D, Hirve A (2013) Study of various approaches in machine translation for Sanskrit language. Int J Adv Res Technol 2(4):383–387
11. Raulji JK, Saini JR (2016) Sanskrit machine translation systems: a comparative analysis. Int J Comput Appl 136(1):1–4
12. Antony P (2013) Machine translation approaches and survey for Indian languages. Int J Comput Linguist Chin Lang Process 18(1):13–20
13. Garje G, Kharate G (2013) Survey of machine translation systems in India. Int J Nat Lang Comput 2(4):47–65
14. Saini S, Sahula V (2015) A survey of machine translation techniques and systems for Indian languages. In: 2015 IEEE international conference on computational intelligence & communication technology. IEEE, pp 676–681
15. Naskar S, Bandyopadhyay S (2005) Use of machine translation in India: current status. AAMT J 36:25–31
16. Dwivedi SK, Sukhadeve PP (2010) Machine translation system in Indian perspectives. J Comput Sci 6(10):1111
17. Hutchins WJ (2001) Machine translation over fifty years. Histoire epistémologie langage 23(1):7–31
18. Hutchins WJ, Somers HL (1992) An introduction to machine translation, vol 362. Academic Press, London
19. Noone G (2003) Machine translation a transfer approach. Computer Science, Linguistics and a Language (CSLL) Department, University of Dublin, Trinity College, Final Report
20. Dorr BJ, Hovy EH, Levin LS (2004) Machine translation: interlingual methods. In: Natural language processing and machine translation encyclopedia of language and linguistics, 2nd edn. Elsevier, Amsterdam
21. Kavirajan B, Kumar MA, Soman K, Rajendran S, Vaithehi S (2017) Improving the rule based machine translation system using sentence simplification (English to Tamil). In: 2017 international conference on advances in computing, communications and informatics (ICACCI). IEEE, pp 957–963
22. Rana M, Atique M (2016) Use of fuzzy tool for example based machine translation. Procedia Comput Sci 79:199–206
23. Darbari H, Kumar A, Dasgupta A, Mishra SK (2010) Complexity of language in Rajya Sabha domain and MANTRA approach. In: Proceedings of ICON-2010: 8th international conference on natural language processing
24. Garje G, Kharate G, Kulkarni H (2014) Transmuter: an approach to rule-based English to Marathi machine translation. Int J Comput Appl 98(21):33–37
25. Dubey P (2014) Need for Hindi–Dogri machine translation system. In: 2014 international conference on computing for sustainable global development (INDIACom). IEEE, pp 136–140
26. Basavaraddi MCCS, Shashirekha DH (2014) A typical machine translation system for English to Kannada. Int J Sci Eng Res 5(4). ISSN 2229-5518
27. Adapanawar A, Garje A, Thakare P, Gundawar P, Kulkarni P (2013) Rule based English to Marathi translation of assertive sentence. Int J Sci Eng Res 4(5):516–518
28. Pishartoy D, Priya SW (2012) Extending capabilities of English to Marathi machine translator. IJCSI Int J Comput Sci Issues 9(3):375
29. Nair LR, Peter D, Ravindran RP (2012) Design and development of a Malayalam to English translator: a transfer based approach. Int J Comput Linguist (IJCL) 3(1):1–11
30. Batra KK, Lehal G (2010) Rule based machine translation of noun phrases from Punjabi to English. Int J Comput Sci Issues (IJCSI) 7(5):409
31. Rajan R, Sivan R, Ravindran R, Soman K (2009) Rule based machine translation from English to Malayalam. In: International conference on advances in computing, control, & telecommunication technologies, 2009, ACT’09. IEEE, pp 439–441
32. Sinha RMK, Mahesh K (2009) Developing English–Urdu machine translation via Hindi. In: Third workshop on computational approaches to Arabic-script-based languages, Citeseer
33. Sinha R, Jain A (2003) AnglaHindi: an English to Hindi machine-aided translation system. MT Summit IX, New Orleans, USA, pp 494–497
34. Dave S, Parikh J, Bhattacharyya P (2001) Interlingua-based English–Hindi machine translation and language divergence. Mach Transl 16(4):251–304
35. Kumar MA, Premjith B, Shivkaran S, Kavirajan B, Rajendran A, Soman K (2017) Overview of the shared task on machine translation in Indian languages (MTIL-2017). J Intell Syst 28:455–464
36. Adak C (2014) A bilingual machine translation system: English & Bengali. In: 2014 first international conference on automation, control, energy and systems (ACES). IEEE, pp 1–4
37. Weaver W (1955) Translation. Mach Transl Lang 14:15–23
38. Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311
39. Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: Proceedings of the 16th conference on computational linguistics, vol 2. Association for Computational Linguistics, pp 836–841
40. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
41. Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology, vol 1. Association for Computational Linguistics, pp 48–54
42. Yamada K, Knight K (2001) A syntax-based statistical translation model. In: Proceedings of the 39th annual meeting of the Association for Computational Linguistics
43. Charniak E, Knight K, Yamada K (2003) Syntax-based language models for statistical machine translation. In: Proceedings of MT Summit IX, pp 40–46
44. Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp 263–270
45. Subalalitha BSB Aarthi Venkataraman (2018) Statistical machine translation system from English–Hindi. Int J Pure Appl Math 118(20):1649–1655
46. Jindal S, Goyal V, Bhullar JS (2018) English to Punjabi statistical machine translation using moses (corpus based). J Stat Manag Syst 21(4):553–560
47. Patel RN, Pimpale PB, Sasikumar M (2018) Machine translation in Indian languages: challenges and resolution. J Intell Syst. https://doi.org/10.1515/jisys-2018-0014
48. Khan NJ, Anwar W, Durrani N (2017) Machine translation approaches and survey for Indian languages. arXiv preprint arXiv:170104290
49. Patel RN, Pimpale PB, Sasikumar M (2016) Statistical machine translation for Indian languages: mission Hindi. arXiv preprint arXiv:161007418
50. Patel RN, Pimpale PB (2016) Statistical machine translation for Indian languages: mission Hindi 2. arXiv:161007418v1
51. Das P, Baruah KK (2014) Assamese to English statistical machine translation integrated with a transliteration module. Int J Comput Appl 100(5):401–406
52. Ali A, Siddiq S, Malik MK (2010) Development of parallel corpus and English to Urdu statistical machine translation. Int J Eng Technol IJET-IJENS 10:31–33
53. Khan N, Anwar MW, Bajwa UI, Durrani N (2013) English to Urdu hierarchical phrase-based statistical machine translation. In: Proceedings of the 4th workshop on South and Southeast Asian natural language processing, pp 72–76
54. Ali A, Hussain A, Malik MK (2013) Model for English–Urdu statistical machine translation. World Appl Sci 24:1362–1367
55. Kumar P, Kumar V (2013) Statistical machine translation based Punjabi to English transliteration system for proper nouns. Int J Appl Innov Eng Manag 2(8):318–321
56. Anwar MM, Anwar MZ, Bhuiyan MAA (2009) Syntax analysis and machine translation of Bangla sentences. Int J Comput Sci Netw Secur 9(8):317–326
57. Udupa R, Faruquie TA (2004) An English–Hindi statistical machine translation system. In: International conference on natural language processing. Springer, pp 254–262
58. Cherry C (2008) Cohesive phrase-based decoding for statistical machine translation. In: Proceedings of ACL-08: HLT, pp 72–80
59. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112
60. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:14061078
61. Luong MT, Le QV, Sutskever I, Vinyals O, Kaiser L (2015) Multi-task sequence to sequence learning. arXiv preprint arXiv:151106114
62. Jean S, Firat O, Cho K, Memisevic R, Bengio Y (2015) Montreal neural machine translation systems for WMT’15. In: Proceedings of the tenth workshop on statistical machine translation, pp 134–140
63. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:14090473
64. Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1700–1709
65. Luong MT, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:14108206
66. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder–decoder approaches. arXiv preprint arXiv:14091259
67. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. In: Advances in neural information processing systems. MIT Press, pp 577–585
68. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
69. Luong MT, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:150804025
70. Meng F, Lu Z, Wang M, Li H, Jiang W, Liu Q (2015) Encoding source language with convolutional neural network for machine translation. arXiv preprint arXiv:150301838
71. Kaiser Ł, Bengio S (2016) Can active memory replace attention? In: Advances in neural information processing systems. MIT Press, pp 3781–3789
72. Kalchbrenner N, Espeholt L, Simonyan K, Oord Avd, Graves A, Kavukcuoglu K (2016) Neural machine translation in linear time. arXiv preprint arXiv:161010099
73. Kaiser L, Gomez AN, Chollet F (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:170603059
74. Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:160903499
75. Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learning. Corr arXiv:1705.03122
76. Jha S, Sudhakar A, Singh AK (2018) Neural machine translation based word transduction mechanisms for low-resource languages. arXiv preprint arXiv:181108816
77. Choudhary H, Pathak AK, Saha RR, Kumaraguru P (2018) Neural machine translation for English–Tamil. In: Proceedings of the third conference on machine translation: shared task papers, pp 770–775
78. Pathak A, Pakray P (2017) Neural machine translation for Indian languages. J Intell Syst 28:465–477
79. Ramesh SH, Sankaranarayanan KP (2018) Neural machine translation for low resource languages using bilingual lexicon induced from comparable corpora. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: student research workshop, pp 112–119
80. Singh S, Anand Kumar M, Soman K (2018) Attention based English to Punjabi neural machine translation. J Intell Fuzzy Syst 34(3):1551–1559
81. Mistry J, Verma AA, Bhattacharyya P (2017) Literature survey: study of neural machine translation. www.cfilti itbac in/resour ces/surveys/ajay-jigar-nmt-survey-jun17pdf
82. Revanuru K, Turlapaty K, Rao S (2017) Neural machine translation of Indian languages. In: Proceedings of the 10th annual ACM India compute conference on ZZZ. ACM, pp 11–20
83. Tennage P, Sandaruwan P, Thilakarathne M, Herath A, Ranathunga S, Jayasena S, Dias G (2017) Neural machine translation for Sinhala and Tamil languages. In: 2017 international conference on Asian language processing (IALP). IEEE, pp 189–192
84. Agrawal R, Sharma DM (2017) Building an effective mt system for English–Hindi using RNN’s. Int J Artif Intell Appl (IJAIA) 8(5):602–609
85. Das A, Yerra P, Kumar K, Sarkar S (2016) A study of attention-based neural machine translation model on Indian languages. In: Proceedings of the 6th workshop on South and Southeast Asian natural language processing (WSSANLP2016), pp 163–172
86. Habash N, Dorr B, Monz C (2009) Symbolic-to-statistical hybridization: extending generation-heavy machine translation. Mach Transl 23(1):23–63
87. Sánchez-Martínez F, Forcada ML, Way A et al (2009) Hybrid rule-based-example-based MT: feeding apertium with sub-sentential translation units. In: Proceedings of the 3rd workshop on example based machine translation, pp 11–18
88. Antonova A, Misyurev A (2014) Improving the precision of automatically constructed human-oriented translation dictionaries. In: Proceedings of the 3rd workshop on hybrid approaches to machine translation (HyTra), pp 58–66
89. Göhring A (2014) Building a Spanish–German dictionary for hybrid MT. In: Proceedings of the 3rd workshop on hybrid approaches to machine translation (HyTra), pp 30–35
90. Sánchez-Martínez F, Forcada ML (2009) Inferring shallow-transfer machine translation rules from small parallel corpora. J Artif Intell Res 34:605–635
91. Tyers FM, Sánchez-Martínez F, Forcada ML et al (2012) Flexible finite-state lexical selection for rule-based machine translation. In: Proceedings of the 16th EAMT conference, Trento, Italy
92. Rudnick A, Gasser M (2013) Lexical selection for hybrid Mt with sequence labeling. In: Proceedings of the second workshop on hybrid approaches to translation, pp 102–108
93. Ruiz Costa-Jussà M, Centelles J (2015) Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques. ACM Trans Asian Lang Inf Process 15(1):1–13
94. Federmann C, Hunsicker S (2011) Stochastic parse tree selection for an existing RBMT system. In: Proceedings of the sixth workshop on statistical machine translation. Association for Computational Linguistics, pp 351–357
95. Dove C, Loskutova O, de la Fuente R (2012) What’s your pick: RBMT, SMT or hybrid. In: Proceedings of the tenth conference of the Association for Machine Translation in the Americas (AMTA 2012), San Diego, CA
96. Hunsicker S, Yu C, Federmann C (2012) Machine learning for hybrid machine translation. In: Proceedings of the seventh workshop on statistical machine translation. Association for Computational Linguistics, pp 312–316
97. Labaka G, España-Bonet C, Màrquez L, Sarasola K (2014) A hybrid machine translation architecture guided by syntax. Mach Transl 28(2):91–125
98. Crego J (2014) Systran RBMT engine: hybridization experiments. In: 3rd workshop on hybrid approaches to machine translation (HyTra), Gothenburg, Sweden
99. Eberle K (2014) Hybrid strategies for better products and shorter time-to-market. In: Proceedings of the 3rd workshop on hybrid approaches to machine translation (HyTra), p 97
100. Simard M, Ueffing N, Isabelle P, Kuhn R (2007) Rule-based translation with statistical phrase-based post-editing. In:
Proceedings of the second workshop on statistical machine trans- 117. Enache R, España Bonet C, Ranta A, Màrquez Villodre L
lation. Association for Computational Linguistics, pp 203–206 (2012) A hybrid system for patent translation. In: Proceedings
101. Lagarda AL, Alabau V, Casacuberta F, Silva R, Díaz-de Liaño E of the 16th annual conference of the European Association for
(2009) Statistical post-editing of a rule-based machine transla- Machine Translation: EAMT 2012: Trento, Italy, May 28th–
tion system. In: Proceedings of human language technologies: 30th 2012, pp 269–278
the 2009 annual conference of the North American chapter of 118. Och FJ, Ney H (2004) The alignment template approach to sta-
the Association for Computational Linguistics, Companion Vol- tistical machine translation. Comput Linguist 30(4):417–449
ume: Short Papers, Association for Computational Linguistics, 119. Groves D, Way A (2005) Hybrid data-driven models of
pp 217–220 machine translation. Mach Transl 19(3–4):301–323
102. Suzuki H (2011) Automatic post-editing based on SMT and its 120. Wang K, Zong C, Su KY (2013) Integrating translation mem-
selective application by sentence-level automatic quality evalu- ory into phrase-based machine translation during decoding.
ation. Language 1:59–429 In: Proceedings of the 51st annual meeting of the Association
103. Béchara H, Rubino R, He Y, Ma Y, van Genabith J (2012) An for Computational Linguistics (vol 1: Long Papers), vol 1, pp
evaluation of statistical post-editing systems applied to RBMT 11–21
and SMT systems. In: Proceedings of COLING 2012, pp 121. Carbonell JG, Klein S, Miller D, Steinbaum M, Grassiany T, Frey
215–230 J (2006) Context-based machine translation. In: 7th conference
104. Xia F, McCord M (2004) Improving a statistical MT system of the Association for Machine Translation in the Americas
with automatically learned rewrite patterns. In: Proceedings of 122. Vandeghinste V, Schuurman I, Carl M, Markantonatou S, Badia
the 20th international conference on computational linguistics. T (2006) METIS-II: machine translation for low resource lan-
Association for Computational Linguistics, p 508 guages. In: LREC, pp 1284–1289
105. Collins M, Koehn P, Kučerová I (2005) Clause restructuring for 123. Ruiz Costa-Jussà M, Rodríguez Fonollosa JA (2011) Using
statistical machine translation. In: Proceedings of the 43rd annual linear interpolation and weighted reordering hypotheses in the
meeting on association for computational linguistics. Association Moses system. In: Seventh conference on international language
for Computational Linguistics, pp 531–540 resources and evaluation, pp 1712–1718
106. Wang C, Collins M, Koehn P (2007) Chinese syntactic reor- 124. Tambouratzis G, Sofianopoulos S, Vassiliou M (2013) Language-
dering for statistical machine translation. In: Proceedings of independent hybrid MT with present. In: Proceedings of the sec-
the 2007 joint conference on empirical methods in natural lan- ond workshop on hybrid approaches to translation, pp 123–130
guage processing and computational natural language learning 125. Dhariya O, Malviya S, Tiwary US (2017) A hybrid approach for
(EMNLP-CoNLL) Hindi–English machine translation. In: 2017 international con-
107. Patel RN, Gupta R, Pimpale PB et al (2016) Reordering rules for ference on information networking (ICOIN). IEEE, pp 389–394
English–Hindi SMT. arXiv preprint arXiv:161007420 126. Salunkhe P, Kadam AD, Joshi S, Patil S, Thakore D, Jadhav
108. Farrús M, Costa-Jussa MR, Marino JB, Poch M, Hernández S (2016) Hybrid machine translation for English to Marathi: a
A, Henríquez C, Fonollosa JA (2011) Overcoming statistical research evaluation in machine translation: (hybrid translator).
machine translation limitations: error analysis and proposed In: International conference on electrical, electronics, and opti-
solutions for the Catalan–Spanish language pair. Lang Resour mization techniques (ICEEOT). IEEE, pp 924–931
Eval 45(2):181–208 127. Nithya B, Joseph S (2013) A hybrid approach to English to
109. Formiga Fanals L, Hernández Huerta A, Mariño Acebal JB, Malayalam machine translation. Int J Comput Appl 81(8):11–15
Monte Moreno E (2012) Improving English to Spanish out- 128. Kaur H, Laxmi DV (2013) A web based English to Punjabi MT
of-domain translations by morphology generalization and gen- system for news headlines. Int J Adv Res Comput Sci Softw Eng
eration. In: Proceedings of the monolingual machine transla- 3(6):1092–1094
tion-2012 workshop, pp 6–16 129. Dhore M, Dixit S, Karande J (2011) Web page interface locali-
110. Carl M, Pease C, Iomdin LL, Streiter O (2000) Towards a sation in Devanagari for commercial interactive applications by
dynamic linkage of example-based and rule-based machine enhancing basic functionality of apache server. Int J Comput
translation. Mach Transl 15(3):223–257 Appl 18(4):6–10
111. Hua W, Haifeng W (2004) Improving statistical word alignment 130. Chatterji S, Sonare P, Sarkar S, Basu A (2011) Lattice based
with a rule-based machine translation system. In: Proceedings of lexical transfer in Bengali Hindi machine translation framework.
the 20th international conference on computational linguistics. In: Proceedings of ICON-2011: 9th international conference on
Association for Computational Linguistics, p 29 natural language processing
112. Okuma H, Yamamoto H, Sumita E (2008) Introducing a transla- 131. Shahnawaz Mishra R (2015) An English to Urdu translation
tion dictionary into phrase-based SMT. IEICE Trans Inf Syst model based on CBR, ANN and translation rules. Int J Adv Intell
91(7):2051–2057 Paradig 7(1):1–23
113. Eisele A, Federmann C, Saint-Amand H, Jellinghaus M, Her- 132. Chatterji S, Roy D, Sarkar S, Basu A (2009) A hybrid approach
rmann T, Chen Y (2008) Using Moses to integrate multiple for Bengali to Hindi machine translation. In: 7th international
rule-based machine translation engines into a hybrid system. In: conference on natural language processing, pp 83–91
Proceedings of the third workshop on statistical machine transla- 133. Dorr BJ (1994) Machine translation divergences: a for-
tion, pp 179–182 mal description and proposed solution. Comput Linguist
114. Sánchez-Cartagena VM, Sánchez-Martínez F, Pérez-Ortiz JA 20(4):597–633
et al (2011) Integrating shallow-transfer rules into phrase-based 134. Habash N, Dorr B (2002) Handling translation divergences:
statistical machine translation. In: Machine translation summit. combining statistical and symbolic techniques in generation-
pp 562–569 heavy machine translation. In: Conference of the association for
115. Chen Y, Eisele A (2010) Integrating a rule-based with a hierar- machine translation in the Americas. Springer, pp 84–93
chical translation system. In: LREC 135. Goyal P, Sinha RMK (2009) Translation divergence in English–
116. Ahsan A, Kolachina P, Kolachina S, Sharma DM, Sangal R Sanskrit–Hindi language pairs. In: International Sanskrit com-
(2010) Coupling statistical machine translation with rule-based putational linguistics symposium. Springer, pp 134–143
transfer and generation. In: Proceedings of the 9th conference 136. Mishra V, Mishra R (2008) Study of example based English to
of the association for machine translation in the Americas Sanskrit machine translation. Polibits 37:43–54
13
M. Singh et al.
137. Shukla P, Shukl D, Kulkarni A (2010) Vibhakti divergence between Sanskrit and Hindi. In: International Sanskrit computational linguistics symposium. Springer, pp 198–208
138. Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the international workshop on spoken language translation, pp 76–79
139. Farajian MA, Turchi M, Negri M, Bertoldi N, Federico M (2017) Neural vs. phrase-based machine translation in a multi-domain scenario. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers, vol 2, pp 280–284
140. Turchi M, De Bie T, Cristianini N (2008) Learning performance of a machine translation system: a statistical and computational analysis. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, pp 35–43
141. Irvine A, Callison-Burch C (2013) Combining bilingual and comparable corpora for low resource machine translation. In: Proceedings of the eighth workshop on statistical machine translation, pp 262–270
142. Chen B, Kuhn R, Foster G, Cherry C, Huang F (2016) Bilingual methods for adaptive training data selection for machine translation. In: Proceedings of AMTA, pp 93–103
143. Arthur P, Neubig G, Nakamura S (2016) Incorporating discrete translation lexicons into neural machine translation. arXiv preprint arXiv:1606.02006
144. Tu Z, Lu Z, Liu Y, Liu X, Li H (2016) Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811
145. Chen W, Matusov E, Khadivi S, Peter JT (2016) Guided alignment training for topic-aware neural machine translation. arXiv preprint arXiv:1607.01628
146. Liu L, Utiyama M, Finch A, Sumita E (2016) Neural machine translation with supervised attention. arXiv preprint arXiv:1609.04186
147. Liu Y, Liu Q, Lin S (2006) Tree-to-string alignment template for statistical machine translation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 609–616
148. Och FJ, Ney H (2000) A comparison of alignment models for statistical machine translation. In: COLING 2000, the 18th international conference on computational linguistics, vol 2
149. Tian L, Wong F, Chao S (2011) Word alignment using Giza++ on windows. Mach Transl 2:1762–1765
150. Deng Y, Byrne W (2006) MTTK: an alignment toolkit for statistical machine translation. In: Proceedings of the 2006 conference of the North American chapter of the association for computational linguistics on human language technology: companion volume: demonstrations. Association for Computational Linguistics, pp 265–268
151. Ortiz-Martínez D, García-Varea I, Casacuberta F (2005) THOT: a toolkit to train phrase-based statistical translation models. In: Tenth machine translation, Citeseer
152. Koehn P, Knowles R (2017) Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872
153. Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge
154. Okpor M (2014) Machine translation approaches: issues and challenges. Int J Comput Sci Issues (IJCSI) 11(5):159
155. Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D, White M (2005) Edinburgh system description for the 2005 NIST MT evaluation. In: Proceedings of machine translation evaluation workshop 2005
156. Pingali P, Varma V (2006) Hindi and Telugu to English cross language information retrieval at CLEF 2006. In: CLEF (working notes)
157. Dhore M, Dixit S (2011) English to Devanagari translation for UI labels of commercial web based interactive applications. Int J Comput Appl 35(10):6–12
158. Godase A, Govilkar S (2015) A novel approach for rule based translation of English to Marathi. Adv Comput Intell Int J. https://doi.org/10.5121/acii.2015.2401
159. Savita Singla SB (2013) Hybrid approach for English to Punjabi translation system for news paper headlines in a specific domain. Int J Eng Res Technol 2:1792–1795
160. Sinhal RA, Gupta KO (2014) A pure EBMT approach for English to Hindi sentence translation system. Int J Mod Educ Comput Sci 6(7):1
161. Anju E, Manoj Kumar K (2014) Malayalam to English machine translation: an EBMT system. IOSR J Eng (IOSRJEN) 4(01):18–23
162. Census of India Government (2018) Census-2018, “language-census of India, states and union territories”. www.censusIndia.gov.in/2011Census/C-16_25062018_NEW.pdf. Accessed 29 May 2019
163. Bahadur P, Jain A, Chauhan D (2011) English to Sanskrit machine translation. In: Proceedings of the international conference & workshop on emerging trends in technology. ACM, pp 641–645
164. Jha GN, Mishra SK, Chandrashekar R (2005) Developing a Sanskrit analysis system for machine translation. In: Proceedings of national seminar on translation today: state and issues. Department of Linguistics, University of Kerala, Trivandrum, pp 23–25
165. Bharati A, Kulkarni A (2007) Sanskrit and computational linguistics. In: First international Sanskrit computational symposium. Department of Sanskrit Studies, University of Hyderabad
166. Huet G (2009) Formal structure of Sanskrit text: requirements analysis for a mechanical Sanskrit processor. In: Nath Jha G (ed) Sanskrit computational linguistics. Springer, Berlin, pp 162–199
167. Baindur M (2015) Nature in Indian philosophy and cultural traditions. Springer, Berlin
168. Bharati A, Chaitanya V, Sangal R (1994) Paninian framework and its application to Anusaraka. Sadhana 19(1):113–127
169. Nair RR, Devi LS (2011) Sanskrit informatics: informatics for Sanskrit studies and research. Centre for Informatics Research and Development, Trivandrum
170. Pandey RK, Jha GN (2016) Error analysis of Sahit—a statistical Sanskrit–Hindi translator. Procedia Comput Sci 96:495–501
171. Rathod SG (2014) Machine translation of natural language using different approaches. Int J Comput Appl 102(15):26–31
172. Shukla P, Shukla A (2014) English speech to Sanskrit speech (ESSS) using rule based translation. Int J Comput Appl 92(10):37–42
173. Bahadur P, Jain A, Chauhan D (2012) ETRANS—a complete framework for English to Sanskrit machine translation. In: International Journal of Advanced Computer Science and Applications (IJACSA) from international conference and workshop on emerging trends in technology, Citeseer
174. Jayan V, Sunil R, Kurambath GS, Kumar RR (2012) Divergence patterns in machine translation between Malayalam and English. In: Proceedings of the international conference on advances in computing, communications and informatics. ACM, pp 788–794
175. Aparna S (2005) Sanskrit to English translator. In: Language in India, vol 5. http://www.languageinindia.com/jan2005/aparnasanskritdissertation1.html. Accessed 29 May 2019
176. Lopez A, Post M (2013) Beyond bitext: five open problems in machine translation. In: Proceedings of the EMNLP workshop on twenty years of Bitext
177. Pathak KN, Jha GN (2011) Challenges in NP case-mapping in Sanskrit Hindi machine translation. In: Information systems for Indian languages. Springer, pp 289–293
178. Li X, Zhang J, Zong C (2016) Towards zero unknown word in neural machine translation. In: IJCAI, pp 2852–2858
179. Sennrich R, Haddow B, Birch A (2015) Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709
180. Britz D, Goldie A, Luong MT, Le Q (2017) Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906
181. Alkhouli T, Bretschner G, Peter JT, Hethnawi M, Guta A, Ney H (2016) Alignment-based neural machine translation. In: Proceedings of the first conference on machine translation: volume 1, research papers, vol 1, pp 54–65
182. Lamraoui F, Langlais P (2013) Yet another fast, robust and open source sentence aligner. Time to reconsider sentence alignment. In: XIV machine translation summit
183. Takebayashi Y, Chenhui C, Arase Y, Nagata M (2018) Word rewarding for adequate neural machine translation. In: International workshop on spoken language translation
184. Freitag M, Al-Onaizan Y (2016) Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897
185. Sennrich R, Haddow B (2016) Linguistic input features improve neural machine translation. arXiv preprint arXiv:1606.02892
186. Madnani N, Ayan NF, Resnik P, Dorr BJ (2007) Using paraphrases for parameter tuning in statistical machine translation. In: Proceedings of the second workshop on statistical machine translation. Association for Computational Linguistics, pp 120–127
187. Goyal P, Huet G, Kulkarni A, Scharf P, Bunker R (2012) A distributed platform for Sanskrit processing. In: Proceedings of COLING 2012, pp 1011–1028

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
