

2019 22nd International Conference of Computer and Information Technology (ICCIT), 18-20 December, 2019

Neural Machine Translation for the Bangla-English Language Pair

Md. Arid Hasan, Cognitive Insight Limited, Bangladesh (arid.hasan.h@gmail.com)
Firoj Alam, QCRI, Qatar (fialam@hbku.edu.qa)
Shammur Absar Chowdhury, QCRI, Qatar (shchowdhury@hbku.edu.qa)
Naira Khan, Dhaka University, Bangladesh (nairakhan@du.ac.bd)

The research leading to these results has been supported and funded by Cognitive Insight Limited (https://cogniinsight.com/).

Abstract—Due to the rapid advancement of different neural network architectures, the task of automated translation from one language to another is now in a new era of Machine Translation (MT) research. In the last few years, Neural Machine Translation (NMT) architectures have proven to be successful for resource-rich languages, trained on large datasets of translated sentences, with variations of NMT algorithms used to train the models. In this study, we explore different NMT algorithms, namely Bidirectional Long Short Term Memory (BiLSTM) and Transformer based NMT, to translate the Bangla to English language pair. For the experiments, we used several datasets, and our experimental results outperform the existing reported performance by a large margin on these datasets. We also investigated the factors affecting data quality and how they influence the performance of the models. This points to a promising research avenue for enhancing NMT for the Bangla-English language pair.

Index Terms—Machine Translation, Bangla-to-English, Neural Machine Translation, Transformer, Bidirectional LSTM

I. Introduction

The task of automated translation from one language to another has undergone rapid advancement due to the emergence of deep neural networks. Neural networks have been studied for machine translation since the 20th century [1], but only very recently have they reached state-of-the-art performance [2] with large-scale deployment. In the Machine Translation (MT) community, a neural network based model for machine translation is referred to as Neural Machine Translation (NMT), where a sequence-to-sequence (seq2seq) [3] model is most commonly used. Although Statistical Machine Translation (SMT) has been successful in the community over the last decade, the complete pipeline becomes complex with the addition of more features, saturating the translation quality. This limitation of SMT and the success of deep learning have led to a focus on NMT approaches for machine translation in the MT community.

Typically, an NMT system consists of an encoder and a decoder. The first network, the encoder, processes a source sentence (e.g., Bangla) into a vector (also referred to as a context vector or thought vector). A second network, called the decoder, uses this vector to predict the words in the target language (e.g., English). Traditionally, NMT uses some variant of Recurrent Neural Networks (RNNs); however, other architectures such as Convolutional Neural Networks (CNNs) can also be used for the encoder.

The advantage of NMT is that it learns the mapping from input to output in an end-to-end fashion, trained as a single big neural network. The model jointly learns its parameters in order to maximize the performance of the translation output [4]–[6], which also requires minimal domain knowledge. In addition, unlike Statistical Machine Translation (SMT), NMT does not need to tune and store separate translation, language, and reordering models. The study of Cho et al. [6] reports that NMT models require only a fraction of the memory needed by traditional SMT models.

Since NMT emerged, it has been providing state-of-the-art performance for various language pairs; however, the literature also reports its limitations, such as dealing with long sentences [7]. To address these issues, attention based mechanisms have been introduced, in which the model jointly learns to align and translate. Various attention mechanisms have been proposed in the literature [8], [9]; among them, the transformer architecture [9], which is based on self-attention, has become well known to the community and is discussed in detail in Section IV-B2.

The literature on NMT techniques reports higher performance for resource-rich languages such as English to German [10] and English to French [11]. Compared to resource-rich languages, the literature on NMT for the Bangla-English language pair is relatively sparse; more details of the current state of the art can be found in the next section. In this study, we aim to shed light on this area. Our contributions include (i) conducting experiments using different NMT approaches, and (ii) consolidating publicly available data from different sources and evaluating it using these approaches.

The structure of this paper is as follows. Section II provides a brief overview of the existing work on Bangla MT systems. In Section III, we discuss the datasets that we use in this study. We present the approaches that we use for our experiments in Section IV. In Section V, we discuss the results of our experiments. Finally, we conclude our work in Section VI.
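To make the encoder-decoder idea described above concrete, the following minimal PyTorch sketch encodes a source sentence into a context vector and decodes it into target-language logits. All sizes, the GRU cells, and the names used here are illustrative assumptions for this sketch only; they are not the architecture used in our experiments (see Section IV).

```python
# Minimal encoder-decoder sketch: the encoder compresses the source sentence
# into a context ("thought") vector; the decoder predicts target words from it.
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))       # context vector
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                             # logits per target position

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # shifted target inputs (teacher forcing)
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```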


II. Related Work

The first MT research for the Bangla-English language pair, along with other Indic languages, was introduced in 1991 [12], and there have been many endeavours for the Bangla-English language pair since then. Initial studies used rule based approaches, relying either on a set of predefined rules obtained by examining the grammar [13], [14] or on rules learned by studying sentence structure during training [15], [16]. The SMT approaches later introduced for Bangla-English MT systems include studies on word- and phrase-based models. Amongst these, phrase-based models have been widely used for Bangla-English MT systems [17].

In [18], the authors report the first MT system for the Bangla-English language pair. The study used a rule-based translation model together with morphological analysis, in which sentences were tokenized into a phrasal template. The study of Arefin et al. [19] proposed a context-sensitive grammar rule-based approach for Bangla to English machine translation: the authors generate rules based on POS tagging of the source sentence, find the equivalent target rule for the sentence, and transfer the translations of the source words according to the target rule. In another study [20], the authors report a comparative study of current state-of-the-art phrase-based SMT model performance for Indian languages, in which a BLEU score of 11.8 is reported for the Bangla to English language pair on the EMILLE corpus.

In [21], the authors used the SUPara corpus [22] and achieved a BLEU score of 17.43 with a log-linear phrase-based SMT technique. In another study [17], the authors used a many-to-one phrase-based SMT approach and achieved a BLEU score of 13.98; their experimental settings remove sentences longer than 50 tokens, use a 3-gram language model, and extract phrases with the grow-diag-final-and heuristic. In [23], the authors report a phrase-based SMT model for the Bangla-English language pair termed Anuvadak (http://anubadok.sourceforge.net) and achieved a maximum BLEU score of 18.20 using source reordering; the study also reports on the morphological complexity of the Indic language family. In [24], the authors report a phrase-based MT system for low-resource languages and achieved a BLEU score of 12.74.

One of the recent studies for the Bangla-English language pair was done by Liu et al. [25], in which the authors used an LSTM-based sequence-to-sequence model with an attention mechanism. The study introduced the NMT approach for the Bangla-English language pair and achieved a notable BLEU score of 10.92. In [26], the authors reported a study of NMT results on the Bangla-English language pair; this study used a different dataset than the one we use, so the results are not directly comparable.

Our study differs from the reported studies on the Bangla-English language pair in that we present the effectiveness of BiLSTM based NMT and Transformer based approaches. From the evaluation across different corpora, we report that our system outperforms existing baselines [17], [21] on the Bangla-English language pair.

III. Parallel Corpora

In this section, we discuss the publicly available corpora that we used for this study:

1) Indic Languages Multilingual Parallel Corpus (ILMPC): This dataset was released at the Workshop on Asian Translation (WAT) [27] and consists of 7 parallel languages. The text of this corpus has been collected from OPUS and consists of spoken language, namely subtitles from movies and TV series (http://www.opensubtitles.org/) [28]. It contains ∼337K, 500, and 1,000 parallel sentences in the training, development, and test sets, respectively. Since this is a spoken language dataset, this corpus is different from the other corpora mentioned below, and it poses challenges for translation even when combined with other corpora.

2) Six Indian Parallel Corpora (SIPC) [24]: This corpus consists of six language pairs. The parallel sentences were collected from the top-100 most-viewed documents from the Wikipedia page of each language [24]. The training set contains ∼20K parallel sentences, the development set contains 914 parallel sentences, and the test set contains 1K parallel sentences.

3) Penn Treebank Bangla-English parallel corpus (PTB): This corpus has been developed by the Bangladesh team of the PAN Localization Project (http://www.panl10n.net), in which the source English sentences were collected from the Penn Treebank corpus. The aim was to develop a multilingual parallel corpus. The dataset has been translated by expert translators and consists of 1,313 parallel sentences.

4) SUPara Corpus [29], [30]: This corpus has been developed by the Shahjalal University of Science and Technology (SUST), Bangladesh, in which the sentences cover different genres such as Literature, Journalistic, Instructive, Administrative, and External Communication. The aim was to develop a multilingual parallel corpus for the Bangla-English language pair. It consists of ∼70.8K parallel sentences [29].

5) AmaderCAT Corpus [31]: This corpus has been developed using a collaborative platform named AmaderCAT by the students of Daffodil International University, Bangladesh, in which the source English sentences were collected from newspapers. The corpus consists of 1,782 parallel sentences.

Table I presents the statistics of the individual corpora that we used in the current study. Our combined dataset is the largest dataset for the Bangla-English language pair.

Table I: Corpora statistics. Bangla (BN), English (EN).

Corpus Name | # of Sentences | # of Tokens
ILMPC       | 324,366        | 2,826,203 (BN), 2,673,485 (EN)
SIPC        | 20,788         | 271,461 (BN), 323,696 (EN)
SUPara      | 70,861         | 832,657 (BN), 998,717 (EN)
PTB         | 1,313          | 31,232 (BN), 32,220 (EN)
AmaderCAT   | 1,782          | 16,626 (BN), 19,794 (EN)
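The consolidation summarized in Table I amounts to concatenating the Bangla and English sides of each corpus and recomputing sentence and token counts. The short Python sketch below illustrates the idea; the file names and the whitespace-based token count are illustrative assumptions, not the exact scripts used in this work.

```python
# Sketch: merge several Bangla-English parallel corpora and report
# Table I style statistics. File names are hypothetical placeholders.
from pathlib import Path

CORPORA = {
    "ILMPC": ("ilmpc.bn", "ilmpc.en"),
    "SIPC": ("sipc.bn", "sipc.en"),
    "SUPara": ("supara.bn", "supara.en"),
    "PTB": ("ptb.bn", "ptb.en"),
    "AmaderCAT": ("amadercat.bn", "amadercat.en"),
}

def read_side(path):
    """Return the sentences of one side of a corpus (one sentence per line)."""
    return Path(path).read_text(encoding="utf-8").splitlines()

merged_bn, merged_en = [], []
for name, (bn_file, en_file) in CORPORA.items():
    bn, en = read_side(bn_file), read_side(en_file)
    assert len(bn) == len(en), f"{name}: sides are not parallel"
    bn_tokens = sum(len(s.split()) for s in bn)
    en_tokens = sum(len(s.split()) for s in en)
    print(f"{name}: {len(bn)} sentences, {bn_tokens} (BN) / {en_tokens} (EN) tokens")
    merged_bn.extend(bn)
    merged_en.extend(en)

# Write out the combined corpus used for the merged training settings.
Path("merged.bn").write_text("\n".join(merged_bn), encoding="utf-8")
Path("merged.en").write_text("\n".join(merged_en), encoding="utf-8")
```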
IV. Experimental Methodology

In this section, we discuss the data preparation process, the NMT approaches (i.e., BiLSTM and Transformer), and the training and experimental settings. To prepare the training data, we first preprocess it: we remove all sentences that contain any English words or characters from the ILMPC test set, as they add additional complexity to the translations. For the other corpora, we used the tokenization approach discussed in Section IV-A. For the different experimental setups, we merged the datasets as shown in Table II.

Table II: Training, development and test data statistics. Bangla (BN), English (EN). Merged = merged dataset, Semi-merged = ILMPC + SIPC + PTB.

Data Set    | # of Sent | # of Tokens                    | Source
Train       | 70,861    | 832,657 (BN), 998,717 (EN)     | SUPara
Development | 500       | 8,838 (BN), 10,841 (EN)        | SUPara
Test        | 500       | 8,793 (BN), 10,843 (EN)        | SUPara
Train       | 346,845   | 2.92M (BN), 2.42M (EN)         | Semi-merged
Development | 500       | 8,838 (BN), 7,762 (EN)         | ILMPC
Test        | 956       | 14,229 (BN), 16,065 (EN)       | ILMPC
Train       | 419,109   | 3,421,177 (BN), 4,047,912 (EN) | Merged
Development | 500       | 8,838 (BN), 10,841 (EN)        | SUPara
Test        | 500       | 8,793 (BN), 10,843 (EN)        | SUPara
Test        | 956       | 14,229 (BN), 16,065 (EN)       | ILMPC

A. Preprocessing

We tokenized both the Bangla and English sentences and prepared the data for training and evaluation. As part of the tokenization process, all English sentences were converted into lowercase. We limited the sentence length to 40 tokens and removed longer sentences to reduce the computational time [32]. For tokenizing Bangla sentences, we followed the approach discussed in [33], where semiotic classes are identified before tokenization. This includes identifying numbers, dates, times, percentages, abbreviations, emails, URLs, money expressions, and other such textual content. A set of rules is then implemented using regular expressions, which are used to separate the tokens. Such approaches are important for tokenization, as issues often arise, for example full stops (dots) used mid-sentence, which make tokenization difficult when using simple whitespace as a delimiter. More details of this tokenization process can be found in [33].

B. NMT Architectures

Our experimental setup consists of different NMT architectures: BiLSTM and attention models, i.e., the Transformer. We also explored the benefits of pre-trained embeddings with BiLSTM.

1) BiLSTM based Network: For the preliminary experiments, we used the BiLSTM network-based approach, as discussed earlier; the architecture is shown in Figure 1. It consists of an encoder and a decoder. The encoder takes the input source text with embeddings and learns to convert the input into a thought vector. During the training process, the decoder learns to convert this vector into the output translation. In the encoder and decoder, we use BiLSTMs. LSTMs [34] are a form of Recurrent Neural Network (RNN) widely used for capturing long-term dependencies. To predict the output for an input sequence, RNNs capture all previous information in a memory cell, which is limited when predicting the output over a very long distance. To overcome this limitation, LSTMs were introduced, which consist of input, output, and forget gates and are capable of capturing long-term dependencies. During training, we initialized the model parameters using pre-trained word embeddings: for English, we used the GloVe model [35], and for Bangla, we used a word2vec model [36]. The embedding dimension is 300 for both models. To avoid overfitting, we used a dropout rate of 0.30.

Figure 1: BiLSTM based NMT architecture.

2) Transformer – Self-Attention based Network: Transformers are a variant of attention models that additionally use position-wise feed-forward neural networks. Although LSTMs perform better than other RNNs for sequence labeling tasks such as language modeling and speech recognition, they have limitations in parallelization [9], [37] and in dealing with long sentences. To address these issues, transformer models rely on self-attention together with position-wise feed-forward networks; the attention mechanism used in the transformer is termed "self-attention" [9]. The basic structure of the transformer model uses 6 encoders and 6 decoders. In Figure 2, we present the transformer architecture. Each encoder has two layers: multi-head self-attention and a position-wise fully connected feed-forward network. Each decoder uses multi-head self-attention, multi-head attention, and a position-wise feed-forward neural network.

Self-Attention: The input to the encoder first passes through the self-attention layer; while processing a specific word, self-attention also looks at the other words of the input sequence. The decoder has both layers of the encoder and additionally uses a multi-head attention layer between the two for finding related information in the input sequence. In the first self-attention layer, each input word is represented as a vector of size 512; the vectors of the input words are termed "word embeddings". For self-attention, the transformer model uses scaled dot-product attention, which is much faster and more space-efficient than "additive attention" [7], [9], although additive attention outperforms unscaled dot-product attention for larger values of the dimension [9], [38]. Self-attention can easily learn long-range dependencies, which helps it outperform other existing models.

Multi-Head Attention: For multi-head attention, we set 8 parallel attention layers, or heads, for our training. Multi-head attention allows the model to jointly attend to information from different representation sub-spaces at different positions [9]. Instead of attending along a single dimension, the transformer attends to each word based on the types of input sequences that were present.
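To make the scaled dot-product attention described above concrete, the following minimal NumPy sketch computes single-head attention following the formulation Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V from [9]. The toy dimensions and random projection matrices are assumptions for illustration only; this is not the implementation used by our training toolkit.

```python
# Minimal sketch of scaled dot-product attention for a single head.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    weights = softmax(scores, axis=-1)     # attention distribution per query position
    return weights @ V                     # weighted sum of the value vectors

# Toy example: a 5-token sequence with model dimension 512, as in the base model.
seq_len, d_model, d_head = 5, 512, 64
x = np.random.randn(seq_len, d_model)
# In the transformer, Q, K and V are learned linear projections of x; random
# matrices are used here purely for illustration.
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (5, 64): one d_head-dimensional output per input position
```

In multi-head attention, this computation is repeated with independent projection sets per head, and the per-head outputs are concatenated and projected back to the model dimension [9].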
Position-wise Feed-Forward Networks: In addition to attention, each of the layers in our encoder and decoder contains a fully connected feed-forward network [9]. The position-wise feed-forward network is a variant of the feed-forward neural network that is applied to each position of the input sequence separately and identically. For the base model, we set the dimensionality of the input and output to 512, and the inner layer has a dimensionality of 2048.

Positional Encoding: This is used to encode the position of each word in an input sequence. Positional encodings are added to the input embeddings at the bottom of the encoder and decoder stacks [9]. The dimensionality of the positional encodings and the input embeddings is the same, so that the two can be added together.

Figure 2: Transformer Encoder-Decoder architecture, taken from Vaswani et al. [9] for illustration and to be self-contained.

C. Experiments

For training the systems, we used the OpenNMT toolkit [39], with which we trained the different types of neural network based systems discussed earlier. For the BiLSTM network, we used a word embedding dimension of 512, a hidden layer size of 500, and 2 layers, and we saved our models every 10,000 steps; the model also uses a dropout rate of 0.3 and a learning rate of 1. For the transformer based network, we used a word embedding dimension of 512, a multi-head attention size of 8, a hidden transformer feed-forward size of 2048, the Adam optimizer [40], token-based batching with a batch size of 4096, a learning rate of 2, and a dropout rate of 0.1. The development set is used to optimize the parameters of the model trained on the training set. Bilingual Evaluation Understudy (BLEU) [41] with up to 4-grams was used to compute the performance on the test sets (an illustrative BLEU computation is sketched after the list of experiment settings below). We conducted three different sets of experiments:

1) Exp Setting 1: In this experiment setting, we used only the SUPara corpus. We maintained the official training, development, and test sets, as presented in Table II. We conducted this experiment using only the BiLSTM model. The reason for not using the Transformer in this experiment is the size of the dataset, which is not large enough for Transformer training; however, we plan to investigate this further in a future study.

2) Exp Setting 2: In this experiment setting, we used the ILMPC, SIPC, and PTB corpora for the training set and the ILMPC development and test sets for evaluation. For both training and testing, we removed all sentences from the merged corpus that contained any English words, resulting in 346,845 and 956 parallel sentences in the training and test sets, respectively (instead of the ∼358K and 1,000 parallel sentences presented in Table II). We used both the BiLSTM and Transformer models for this set of experiments.

3) Exp Setting 3: In this experimental setting, we used the SUPara, SIPC, PTB, AmaderCAT, and ILMPC corpora for training. For the evaluation, we used both the SUPara and ILMPC test sets. We maintained the official SUPara test set and removed all sentences that contained any English words from the ILMPC test set and the merged training set. The data we use in this experiment setting is the largest dataset for the Bangla-English language pair, as presented in Table II. In this experiment setting, we use both BiLSTM and Transformers to train the models.
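As noted above, corpus-level BLEU is used to score the test-set translations. The snippet below is a hedged example of computing such a score in Python with the sacrebleu package; the package choice and the file names are assumptions for illustration and not necessarily the exact scorer used in this work.

```python
# Sketch: corpus-level BLEU for a hypothesis file against a reference file.
# File names are placeholders; sacrebleu is used as one possible scorer.
import sacrebleu

with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypothesis list and a list of reference streams
# (one stream per reference set); BLEU uses up to 4-grams by default.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```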
V. Results and Discussion

A. Results

In Table III, we present the performance of the NMT systems, including the baseline and Google Translator's results, on the SUPara test set. As mentioned earlier, we used only the SUPara training set to train the model. From the results, it is noticeable that we achieve higher performance using the different NMT based approaches. Interestingly, Google Translator's result is far better than any other result, including ours. We also tried to use the Transformer model with the SUPara training set, but we were unable to measure its performance due to the small size of the dataset.

In Table IV, we report the results of our second experiment setting using different NMT approaches. As discussed earlier, we used the ILMPC, SIPC, and PTB corpora to train the models. It is evident from the results that our NMT variants outperform both the baseline and Google Translator's results, and the Transformer model provides the best result on the ILMPC test set.

In Tables V and VI, we show the results of our third experiment setting using different NMT variants on the merged corpus. We used the SUPara, SIPC, PTB, AmaderCAT, and ILMPC corpora to train both the BiLSTM and Transformer approaches. For the evaluation, we use two test sets: the first one from the SUPara corpus and the second one from the ILMPC corpus, as mentioned earlier. Table V presents the SUPara test set results and Table VI presents the ILMPC test set results using the NMT variants on the merged corpus.

Table III: NMT results for Exp Setting 1. EN-Emb: GloVe pretrained embeddings, BN-Emb: Bangla pretrained embeddings.

Experiments              | BLEU
shu-torjoma [21]         | 17.43
BiLSTM                   | 18.72
BiLSTM + BN-Emb          | 19.68
BiLSTM + BN-Emb, EN-Emb  | 19.98
Google Translator        | 28.09

In Table V, we notice that the merged corpus results surprisingly decreased in comparison to our first experiment setting; however, the results still outperform the baseline. In Table VI, the results show that the merged corpus outperforms our second experiment setting, including both the baseline and Google Translator's results. We achieve a notable result using the transformer model on the ILMPC test set. After obtaining the unexpected results on the SUPara test set, we examined the outputs generated by our best models and observed some challenges, which are described in Section V-B.

Table IV: NMT results for Exp Setting 2.

Experiments              | BLEU
Baseline [17]            | 14.17
BiLSTM                   | 15.24
BiLSTM + BN-Emb          | 15.56
BiLSTM + BN-Emb, EN-Emb  | 15.62
Transformer (Base)       | 16.58
Google Translator        | 12.59

Table V: Results on the SUPara test set for Exp Setting 3.

Experiments              | BLEU
BiLSTM                   | 18.13
BiLSTM + BN-Emb          | 19.40
BiLSTM + BN-Emb, EN-Emb  | 19.24
Transformer (base)       | 18.99

Table VI: Results on the ILMPC test set for Exp Setting 3.

Experiments              | BLEU
BiLSTM                   | 15.94
BiLSTM + BN-Emb          | 16.21
BiLSTM + BN-Emb, EN-Emb  | 16.36
Transformer (base)       | 18.73

We provide a few example translations below, produced by our best system. From Example 1, we see that the system translated the sentence well; for Example 3, however, neither Google Translator nor our system performed well. In Example 2, the sentence should, in our opinion, have a comma after the word বাবা/baba/[Father], and the Gold translation is also not convincing. Note that the example sentences we selected are from the test dataset.

Example 1
Sent: গণসচেতনতা সৃষ্টি করা উচিত।
Pred: mass awareness should be created.
Gold: mass awareness should be created.

Example 2
Sent: বাবা মাকে আমার সালাম জানিও।
Pred: my mother is my mummy.
Google: please bless my parents.
Gold: convey my compliments to parents.

Example 3
Sent: আমি একটা চমৎকার সিনেমা দেখছি।
Pred: i see a cinema movie.
Gold: i am watching a nice movie.

B. Discussion

Achieving higher performance for the Bangla-English language pair is challenging for various reasons, including i) morphological richness [31], which results in highly-inflected words, and ii) limited resources (e.g., the amount of training data covering different domains), among others. Some of these issues have also been highlighted in [21].

Domain coverage and adaptation is one of the main challenges. We used a total of five corpora whose domains do not match each other, and in this case we faced a few challenges. For experiment setting 1, we used only the SUPara corpus, and for experiment setting 3, we used the SUPara, SIPC, PTB, AmaderCAT, and ILMPC corpora. While examining the output of both experiment settings 1 and 3, we noticed that a few words were translated with synonyms that had not been used when training only on the SUPara corpus (Example 4).

Example 4
Sent: আমি তাকে কখনো মাফ করব না।
Exp Setting 1: i will never forgive you.
Exp Setting 3: i will never forget you.
Gold: i will never forgive you.

We also realized that our merged corpus contains more noisy words than the SUPara corpus, such as the combination of British and American English spellings of the same word. Hence, when predicting the output in our third experiment setting, the model has more than one option to choose from for a single word. One such example is the word প্রিয়, which can be translated with either the spelling 'favourite' or 'favorite'. In both cases the prediction should be considered correct; however, because the spelling variation does not exactly match the Gold reference, the system counts the word as a substitution error.

In our study, we also found that NMT has a weakness in translating rare words. In Example 3, the frequency of the word 'চমৎকার'/chomotkar/[nice/excellent] is 1 in the training set, and the model could not translate the proper meaning of this word. In the sentence of Example 4, the meaning of the word 'মাফ'/maf/[forgive] changed after adding more data, and the translation became incongruous. This indicates the dependency of the performance on the quality of the data and annotation.

Our reported results are higher than those of existing systems. However, reaching the results of resource-rich languages is still a big challenge, and more effort is required to improve on these results in the future.
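Returning to the British/American spelling issue discussed above, one practical mitigation is to fold spelling variants onto a single form in both the hypotheses and the references before scoring, so that 'favourite' versus 'favorite' is not counted as a substitution error. The sketch below is only an illustration of that idea; the variant table is a tiny, hand-picked assumption, and this normalization is not part of the pipeline described above.

```python
# Sketch: map a few British spellings onto American ones before evaluation.
# The variant table is a small illustrative example, not an exhaustive list.
VARIANTS = {
    "favourite": "favorite",
    "colour": "color",
    "realise": "realize",
}

def normalize_spelling(sentence: str) -> str:
    return " ".join(VARIANTS.get(token, token) for token in sentence.split())

print(normalize_spelling("my favourite colour is blue"))
# -> "my favorite color is blue"
```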
VI. Conclusions

In this study, we explored neural machine translation approaches for the Bangla-English language pair and compared the performance of different NMT based approaches. Our experiments outperform existing reported results on both the ILMPC and the SUPara test sets: on the SUPara and ILMPC test sets, our system shows 14.63% and 32.18% relative improvement, respectively, using the best models. Compared to Google Translator, our system performs better on the ILMPC dataset but not on the SUPara dataset. For the experiments and evaluation, we used publicly available datasets. While analyzing the datasets, we realized that the performance and evaluation are affected by some existing noisy translations; we studied and discussed some of these factors in the paper and will address them experimentally in future studies. We also found that the performance of NMT based techniques is highly dependent on parameter optimization and architecture, which we also plan to explore in the future.

References

[1] A. Waibel, A. N. Jain, A. E. McNair, H. Saito, A. Hauptmann, and J. Tebelskis, "Janus: A speech-to-speech translation system using connectionist and symbolic processing strategies," in Proc. of ICASSP, 1991, pp. 793–796.
[2] S. Jean, O. Firat, K. Cho, R. Memisevic, and Y. Bengio, "Montreal neural machine translation systems for wmt-15," in Proc. of the 10th WSMT. Lisbon, Portugal: Association for Computational Linguistics, September 2015, pp. 134–140. [Online]. Available: http://aclweb.org/anthology/W15-3014
[3] O. Kuchaiev, B. Ginsburg, I. Gitman, V. Lavrukhin, J. Li, H. Nguyen, C. Case, and P. Micikevicius, "Mixed-precision training for nlp and speech recognition with openseq2seq," 2018.
[4] N. Kalchbrenner and P. Blunsom, "Recurrent continuous translation models," in Proc. of EMNLP, 2013, pp. 1700–1709.
[5] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[6] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[7] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[8] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[10] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," arXiv preprint arXiv:1412.2007, 2014.
[11] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba, "Addressing the rare word problem in neural machine translation," arXiv preprint arXiv:1410.8206, 2014.
[12] S. Naskar and S. Bandyopadhyay, "Use of machine translation in india: Current status," AAMT Journal, pp. 25–31, 2005.
[13] S. Dasgupta, A. Wasif, and S. Azam, "An optimal way of machine translation from english to bengali," in Proc. 7th International Conference on Computer and Information Technology (ICCIT), 2004, pp. 648–653.
[14] J. Francisca, M. M. Mia, and S. M. Rahman, "Adapting rule based machine translation from english to bangla," Indian Journal of Computer Science and Engineering (IJCSE), vol. 2, no. 3, pp. 334–342, 2011.
[15] G. K. Saha, "The e2b machine translation: a new approach to hlt," Ubiquity, vol. 2005, no. August, pp. 1–1, 2005.
[16] K. M. Salm, A. Salam, M. Khan, and T. Nishino, "Example based english-bengali machine translation using wordnet," 2009.
[17] T. Banerjee, A. Kunchukuttan, and P. Bhattacharyya, "Multilingual indian language translation system at wat 2018: Many-to-one phrase-based smt," in Proc. of the 5th Workshop on Asian Translation, 2018.
[18] S. Naskar, D. Saha, and S. Bandyopadhyay, "Anubaad – a hybrid machine translation system from english to bangla," SIMPLE'04, 2004.
[19] M. S. Arefin, M. M. Hoque, M. O. Rahman, and M. S. Arefin, "A machine translation framework for translating bangla assertive, interrogative and imperative sentences into english," in 2015 International Conference on Electrical Engineering and Information Communication Technology (ICEEICT). IEEE, 2015, pp. 1–6.
[20] N. J. Khan, W. Anwar, and N. Durrani, "Machine translation approaches and survey for indian languages," arXiv preprint arXiv:1701.04290, 2017.
[21] M. A. A. Mumin, M. H. Seddiqui, M. Z. Iqbal, and M. J. Islam, "shu-torjoma: An english↔bangla statistical machine translation system," Journal of Computer Science (Science Publications), 2019.
[22] M. A. Al Mumin, A. A. M. Shoeb, M. R. Selim, and M. Z. Iqbal, "Supara: A balanced english-bengali parallel corpus," SUST Journal of Science and Technology, pp. 46–51, 2012.
[23] A. Kunchukuttan, A. Mishra, R. Chatterjee, R. Shah, and P. Bhattacharyya, "Shata-anuvadak: Tackling multiway translation of indian languages," in LREC, Reykjavik, Iceland, 2014.
[24] M. Post, C. Callison-Burch, and M. Osborne, "Constructing parallel corpora for six indian languages via crowdsourcing," in Proc. of the 7th WSMT. ACL, 2012, pp. 401–409.
[25] N. F. Liu, J. May, M. Pust, and K. Knight, "Augmenting statistical machine translation with subword translation of out-of-vocabulary words," arXiv preprint arXiv:1808.05700, 2018.
[26] S. Dandapat and W. Lewis, "Training deployable general domain mt for a low resource language pair: English–bangla," 2018.
[27] T. Nakazawa, S. Kurohashi, S. Higashiyama, C. Ding, R. Dabre, H. Mino, I. Goto, W. P. Pa, A. Kunchukuttan, and S. Kurohashi, "Overview of the 5th Workshop on Asian Translation," Tech. Rep., 2018. [Online]. Available: http://www2.nict.go.jp/astrec-att/
[28] P. Lison and J. Tiedemann, "Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles," 2016.
[29] M. A. Al Mumin, M. H. Seddiqui, M. Z. Iqbal, and M. J. Islam, "Supara0.8m: A balanced english-bangla parallel corpus," 2018. [Online]. Available: http://dx.doi.org/10.21227/gz0b-5p24
[30] M. A. Al Mumin, S. M. H., M. Z. Iqbal, and M. J. Islam, "Supara-benchmark: A benchmark dataset for english-bangla machine translation," 2018. [Online]. Available: http://dx.doi.org/10.21227/czes-gs42
[31] M. A. Hasan, F. Alam, and S. R. H. Noori, "A collaborative platform to collect data for developing machine translation systems," in Proc. of International Joint Conference on Computational Intelligence. Springer, 2020, pp. 407–416.
[32] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proc. of the 2nd WSMT, 2007, pp. 224–227.
[33] F. Alam, S. Habib, and M. Khan, "Text normalization system for bangla," in Conference on Language and Technology, 2009.
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[35] J. Pennington, R. Socher, and C. Manning, "Glove: Global vectors for word representation," in Proc. of EMNLP, 2014, pp. 1532–1543.
[36] F. Alam, S. A. Chowdhury, and S. R. H. Noori, "Bidirectional lstms—crfs networks for bangla pos tagging," in Proc. of ICCIT. IEEE, 2016, pp. 377–382.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, p. 8, 2019.
[38] D. Britz, A. Goldie, T. Luong, and Q. Le, "Massive exploration of neural machine translation architectures," ArXiv e-prints, Mar. 2017.
[39] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, "Opennmt: Open-source toolkit for neural machine translation," arXiv preprint arXiv:1701.02810, 2017.
[40] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[41] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: A method for automatic evaluation of machine translation," in Proc. of the 40th ACL, ser. ACL '02. Stroudsburg, PA, USA: ACL, 2002, pp. 311–318. [Online]. Available: https://doi.org/10.3115/1073083.1073135
