Context-Based Bengali Next Word Prediction: A Comparative Study of Different Embedding Methods
1. Introduction

Because of the digitization trend in every aspect of our life, all types of recorded documentation nowadays tend to be in electronic format. Typing on digital devices can be made faster if the computer is able to accurately predict the next word that is intended. This prediction is important in increasing the productivity of the writer by reducing typing time. Furthermore, a next word prediction system can also be useful in detecting spelling errors and other typing errors. It also reduces the gap between people who are expert typists and people who are differently abled or who are not regular computer users [1].

There have been some works on Bengali next word prediction, most of which are developed using the n-gram method, which cannot predict context-relevant words [2]. The n-gram technique is well known to be useful for sequence prediction in various fields such as text prediction [3] and genome prediction [4]. But to truly capture the essence of the language, the words should be predicted from the given context. To understand the importance of context, let us consider an example. The context-relevant next word of "সে কবিতা" can be "লিখে" or "পড়ে". When the previous sentence is "রহিম একজন কবি", the relevant word is "লিখে". In the same way, when the previous sentence is "রহিম কবিতা পছন্দ করে", the next word should be "পড়ে". There have been a couple of attempts at capturing this contextual information in Bengali texts.

In [5], the proposed method consists of a GRU model extended on the n-gram model to give the prediction in context. Another hybrid approach of sequential LSTM and n-gram along with a trie implementation [6] is proposed for Bengali word completion and sequence prediction.

In this paper, we conduct a comparison of different embedding methods using an LSTM model for next word prediction. Word embedding is the learned representation of words in numerical format. This embedding contains inner word-level knowledge as well as the relational understanding among words. Four models are developed using different embeddings: we designed LSTM models using word2vec skip-gram and continuous bag of words (CBOW) embeddings, as well as fastText skip-gram and CBOW embeddings. The main contribution of this research is implementing the four word embeddings mentioned above, training the next word prediction model using those embeddings, and finally analyzing empirically the best-fit embedding for next word prediction using the LSTM network. n-gram models are also developed for the purpose of comparison.

2. Preliminaries

2.1 n-gram

n-gram is a frequency-based method that assigns probabilities to sequences of words. An n-gram is a sequence of N words of a sentence, where N can be two (called bigram), three (called trigram), and so on. In the following equation 1, p(w|h) refers to the probability of the word w given the previous n words, which are denoted as h.

p(w|h) = Count(h, w) / Count(h)   (1)
Here w is the word that should come after the previous words h. The counts of w and h are taken from our training corpus. For a bigram, the next word is predicted given one previous word. For example, the probability of the word "ভাত" after the previous word "আমি" is the ratio of the counts of "আমি ভাত" and "আমি". Suppose the count of "আমি ভাত" is 6 and the total count of "আমি" is 10; then the probability from the bigram model is 6/10 = 0.6. So for every n-gram, the previous N−1 words of a sentence are given in sequence and the N-th word is predicted.

2.2 LSTM

From the human perspective, to predict the next word at the end of a sequence, it is important to know which words are already present in the sequence. So past memories along with the context from the sequence are needed to predict the next word. Here the recurrent structure of the neural network comes into play. The Recurrent Neural Network (RNN) is widely used for purposes like this as it has a chain-like structure. Though in theory an RNN can preserve long-term dependencies, in practice this is not the case. That is because of the repeated arithmetic operations applied to the memories obtained from the previous cell, resulting in the vanishing gradient problem [7]. Suppose there is a sentence "বিড়াল একটি আদুরে প্রাণী, এটি হল একটি গৃহপালিত প্রাণী যাকে দেখলেই". Here the probable predicted next word should be "আদর". This prediction is influenced by the word "আদুরে" in the sequence. An RNN cannot capture this type of long-term dependency in practice. To solve this, an RNN with a gate structure (LSTM/GRU) [8][9] can be used instead of a classical RNN. The major difference between the RNN and the gate-based RNN (LSTM/GRU) is in their repeating module. For next word prediction, the LSTM [10] is used here. LSTM cells use the cell state, through which information passes between cells with only minor linear interactions. The LSTM has the ability to add or remove information to and from the cell state by optionally letting information through to the next cell. The LSTM uses the "forget gate layer" to decide whether to throw away information from the previous layer. The forget gate equation is shown in equation 2, where Wf is the weight matrix, bf is the bias, and ht−1 is the previous hidden state for the forget gate ft.

ft = σ(Wf [ht−1, xt] + bf)   (2)

A "tanh layer" produces the candidate values for the cell state, as shown in equation 4.

C̃t = tanh(Wc [ht−1, xt] + bc)   (4)

Finally, the values from the "input gate layer" and the "tanh layer" are combined to create an update to the state.

2.3 Embedding

Distributed word representations have often proved to be helpful for understanding words and their inner properties. These embeddings can be divided into two classes based on their use of context. word2vec and fastText are embeddings from the non-contextual class, whereas contextual embeddings like ELMo and BERT are trained on sequence-level semantics by considering the sequence of words, which can result in different representations for polysemous words. Here, the non-contextual embeddings word2vec and fastText, with their different training variants CBOW and skip-gram, are used for word embedding.

1) CBOW: CBOW tries to predict the target word given the context words from a contextual window of size T, considering the n history words and m future words where T = n + m. The CBOW architecture, shown in Figure 1, is similar to the Feedforward Neural Net Language Model (NNLM) [11], where the hidden layers are discarded and the projection layer is shared for all words in the vocabulary [12]. It tries to maximize the predicting probability of the center word, p(wc | wt), where t ∈ T, wc represents the center word, and wt represents the neighboring words. Because of its architecture, CBOW learns better syntactic relationships between words.

2) Skip-gram: The architecture of the skip-gram model, shown in Figure 2, is like the CBOW model except that it predicts the context words based on the current word, which is the opposite of CBOW's methodology. It tries to predict the history words and future words from the context window. This model also discards the hidden layer. The predictive words in the training set are sampled based on their distance from the input word [12]. Skip-gram tries to maximize the probability of predicting the neighboring words given the center word, p(w1, w2, …, wn−1, wn | wc), where w1, w2, …, wn−1, wn are the neighboring words and wc is the given center word. In contrast to the architecture of the CBOW model, the input is one word and the output is a set of words that are considered the neighbors of the given center word. Skip-gram captures better semantic relationships.

3) Word2vec, fastText and the Subword Model: word2vec and fastText both use the skip-gram and CBOW architectures, but fastText introduces an additional subword model [13]. To extract the internal structure of a word, the word w is divided into a bag of n-grams of itself. The n-gram bag also includes the main word w itself. The embedding of the word w is the sum of the embeddings of the elements in the corresponding bag [14]. In the skip-gram model, the probability of the word wc given the surrounding word wt is given in equation 5,

p(wc | wt) = e^s(wt, wc) / Σj=1..W e^s(wt, j)   (5)

s(w, c) = Σg∈Gw zg^T vc   (6)

In equation 6, the scoring function, denoted as s, is the dot product of two word embeddings in the original model, whereas in fastText s is the sum over the vector representations of the n-grams of the word [13]. An embedding of appropriate size captures the necessary features; an embedding size smaller or bigger than this can lead to missing important features or overestimating them.

2.4 Next Word Prediction

Next word prediction works based on contextual knowledge of past words. In an n-gram model, the next word is predicted based on the availability of the n-gram in the training set. n-gram models do not consider linguistic knowledge while predicting the next word, and they give the same importance to all the past words in the sequence. By linguistic knowledge we mean inner word-level knowledge and the inter-relational understanding of words (such as semantic and syntactic knowledge), along with the contextual understanding of the sequence. In the LSTM-based model used in this paper, linguistic knowledge is taken into consideration by injecting word embeddings in the embedding layer and also through the LSTM layers. Positional importance is also learned by the model, since it considers each word's position in the sentence. So, in terms of the context of a sequence in a sentence, the embedding-based LSTM model should work well in comparison to the n-gram model and plain LSTM-based modeling, because those models cannot provide all the characteristics mentioned above.
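To make the subword model concrete, the following sketch builds a word's bag of character n-grams and scores a center word against a context word in the spirit of equations 5 and 6. All names are illustrative and the n-gram vectors are deterministic random stand-ins for learned embeddings z_g; for simplicity both words are represented with subword vectors, whereas the original fastText model keeps a separate output vector vc for the center word.

```python
import zlib

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams of `word`, plus the word itself;
    '<' and '>' mark word boundaries as in fastText [13]."""
    w = f"<{word}>"
    grams = {w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)  # the full word is also a member of its own bag
    return sorted(grams)

def gram_vector(gram, dim=8):
    """Stand-in for the learned n-gram embedding z_g (deterministic random)."""
    rng = np.random.default_rng(zlib.crc32(gram.encode("utf-8")))
    return rng.standard_normal(dim)

def word_vector(word, dim=8):
    """Subword model: the word vector is the sum over its n-gram bag (eq. 6)."""
    return np.sum([gram_vector(g, dim) for g in char_ngrams(word)], axis=0)

def score(context_word, center_word, dim=8):
    """Scoring function s(wt, wc) as a dot product of word vectors."""
    return float(word_vector(context_word, dim) @ word_vector(center_word, dim))

def p_center_given_context(center, context, vocab, dim=8):
    """Softmax over the vocabulary, as in eq. 5."""
    scores = np.array([score(context, w, dim) for w in vocab])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[vocab.index(center)]
```

Because every word shares n-gram vectors with similar words, even an unseen word receives a vector from its subword bag, which is the property the comparison in this paper relies on.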
3. Literature Review
In equation 6, the scoring function, which is denoted as s,
is the dot product of two-word embedding in the original Next word prediction is a Natural Language Processing
model. Whereas, in fastText, s is denoted as the sum of the (NLP) task which is one of the important topics regarding
vector representations of n-grams of the word wc [13]. NLP research. n-gram method is the early approach for the
completion of a sentence [15]. This is developed for English.
Another approach [16] is proposed where a novel relevance
and enhanced Deng entropy-based Dempster’s combination
rule is proposed for next word prediction. Actually, for
English, there are numerous efforts for accurate word
prediction [17]. However, there are only a small number of
works in the Bengali Language for next word prediction. One
of the major works in this regard uses the n-gram method [2].
They use unigram, bigram, trigram, back-off propagation,
and deleted interpolation which are the various variants of
n-gram. the computational cost of these approaches is very
high as they count N-before state as well as N-after state
frequencies.
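The frequency counting behind these n-gram approaches (Section 2.1) can be sketched in a few lines. The toy corpus and function names below are illustrative, not taken from [2]:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigram and bigram frequencies, as in Section 2.1:
    p(w | h) = count(h, w) / count(h)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def next_word_probability(h, w, unigrams, bigrams):
    """Maximum-likelihood bigram probability of w following h."""
    if unigrams[h] == 0:
        return 0.0
    return bigrams[(h, w)] / unigrams[h]

def predict_next(h, unigrams, bigrams):
    """Most frequent word observed after the history word h, if any."""
    candidates = {w: c for (prev, w), c in bigrams.items() if prev == h}
    return max(candidates, key=candidates.get) if candidates else None

corpus = ["I like Science", "I like Photography", "I love Mathematics"]
uni, bi = train_bigram(corpus)
```

On this corpus, "like" follows "I" in two of three sentences, so p("like" | "I") = 2/3 and "like" is the predicted next word; the prediction uses only the immediately preceding word, never the wider context, which is exactly the limitation discussed above.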
Another line of work generates text with hidden Markov models over n states [19]. An explanation of the process is given below. Suppose three sentences: "I like Science", "I like Photography", "I love Mathematics". All the unique words from the sentences above ('I', 'like', 'love', 'Photography', 'Science' and 'Mathematics') form different states. 'like' appeared two times after 'I', so the more probable word after 'I' is 'like'. Words are predicted using only their previous N states, not based on the context.

Mikolov et al. [20] implemented next word prediction of sequential data using a Recurrent Neural Network for the English language. Zhou et al. [21] implemented text classification and prediction on sentence representations using a combined approach of CNN and LSTM.

In [22], the authors propose both word completion and sentence completion using a trie and a sequential LSTM model. Their proposed trie model is supposed to be dynamic: as the vocabulary increases, it checks the word prefix; if it exists in the model, it suggests the probable word, otherwise it approves the new word after some spell checking. The sequential LSTM is used to detect the context and complete the sequence of a sentence given some previous words. They show that their model has better suggestion and word completion capability than n-gram. They did not specify any word embedding model used in their language model.

There are some works published on next word prediction for several other languages apart from English. For example, an n-gram based [23] and an RNN based [24] next word prediction model have been proposed for the Assamese language. In [25], some sequence prediction models were explored to evaluate next word prediction performance for Hindi. The same analogy is applied to the Ukrainian language [26] to evaluate the next word prediction performance of existing sequence models.

In [27], the authors experimentally fine-tune the word2vec embedding parameters to get better performance for Bengali word embedding. The authors claim that 300 dimensions with a window size of 4 for the word2vec skip-gram model performs best for robust vector representation of the Bengali language. They mainly use newspaper datasets, where the data can be categorized into economy, international, entertainment, sports, and state. Their fine-tuned word2vec skip-gram model achieves 90% purity in word clustering.

In [28], the authors analyze word2vec and fastText embeddings for a Bengali dataset. There has been much research on finding appropriate word clusterings, and n-gram models are one approach. Now, with the improvement of deep learning methods, dynamic word embedding models are preferred because they reduce processing time and improve memory efficiency. Word2vec and fastText are two popular dynamic word embedding methods, and the authors use these two embeddings. They found that fastText embedding performs better than the other embeddings in word clustering. The reason is that fastText can produce a cluster for an unknown word where other methods cannot.

In [29], the researchers proposed a language model which utilizes a GRU architecture along with the n-gram for predicting the next word. Next word suggestion makes it easier to complete user-to-user communication. The result is promising because the GRU can capture context for the output prediction. They use newspapers (BBC, Prothom Alo) and academic datasets for their experiment and found that LSTM performs best in word prediction and sentence completion when the previous 5 words are given.

As noted in [30], sentence generation from a sequence of words is useful in machine translation, speech recognition, image captioning, language identification, video captioning, and much more. In that paper, the author develops a text generation method using a deep learning approach, taking advantage of Long Short-Term Memory (LSTM) to capture context. But their dataset was very small, containing only 4500 sentences, so performance will likely decrease on unseen data as the model is trained on a small set of data.

4. Experimental Setup

4.1 Corpus Description

Our corpus combines a corpus of Bengali newspaper articles (Prothom Alo, Kalerkontho, Ittefaq) and Bengali Wikipedia data. The data is collected by crawling and by a manual approach; the collected data are then preprocessed for further use. Table 1 shows the proportions of data we collected from different sources.

Table 1: Details of the datasets used for corpus building.

Data                        Proportion (%)
Bangla_newspaper_article    65.36
Bangla Wikipedia            34.64

The total number of sentences in the corpus is 10,340,273, containing 8,928,472 unique words. We use 25,000 sentences in the LSTM model for next word prediction in 4 different models. We split the data 90% for training and 10% for testing. We used filters to remove special characters, symbols, and English words. We use the whole corpus for word embedding and the portion of the data mentioned above for next word prediction.

4.2 Implementation Details

The neural model used here consists of an embedding layer followed by an LSTM [8] layer. The prediction layer uses adaptive softmax [31], with size equal to the number of unique words in the dataset used for next word prediction. The LSTM model architecture is shown in Figure 3.
Mahir Mahbub, Suravi Akhter, Ahmedul Kabir and Zerina Begum
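As an illustration of the corpus preparation described in Section 4.1, the sketch below filters out non-Bengali characters and performs the 90/10 split. The exact filter rules and names here are assumptions, not the paper's code:

```python
import re

# Keep only characters from the Unicode Bengali block plus whitespace;
# everything else (symbols, English words, punctuation) is dropped.
# The precise filtering used in the paper is not specified, so this
# pattern is an assumption.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")

def clean_sentence(text):
    """Remove non-Bengali characters, then collapse runs of whitespace."""
    cleaned = NON_BENGALI.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

def train_test_split(sentences, train_frac=0.9):
    """Plain 90/10 split, as described in Section 4.1."""
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]
```

For example, a sentence containing embedded English words or symbols keeps only its Bengali tokens after cleaning, matching the filtering the corpus description calls for.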
5. Results and Discussion

The performance of the models has been evaluated using accuracy and perplexity. The word2vec and fastText embeddings, each with their two variants, skip-gram and CBOW, are used in the embedding layer of the next word prediction model. In addition, four n-gram based models are implemented for the same purpose of predicting the next word. In Table 2, N denotes the number of predictions taken into account when calculating the accuracy. When the test label matches any of the N predictions, the result is taken as positive.

From the natural language perspective, perplexity can be defined as the inverse probability of a test set, normalized by the total number of words. Equation 8 is equivalent to the statement above, where N is the number of words in the test set and w represents the words.

PP(W) = P(w1 w2 … wN)^(−1/N)   (8)

Considering our next word prediction model, the perplexity is defined in equation 9, where wi is the word for which the probability is calculated depending on the other words w1 … wi−1.
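Perplexity as in equation 8 is usually computed in log space to avoid numerical underflow on long sequences. The following minimal sketch assumes, as input, the list of probabilities a model assigned to each test word:

```python
import math

def perplexity(word_probs):
    """Perplexity of a test set, following eq. 8: the inverse probability of
    the word sequence, normalized by the number of words N. Computed in log
    space so the product over many small probabilities does not underflow."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)
```

As a sanity check, a model that assigns every word a uniform probability of 1/V has perplexity exactly V, so lower perplexity indicates a model that concentrates probability on the words that actually occur.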
References

1. N. Garay-Vitoria and J. Gonzalez-Abascal, "Intelligent word-prediction to enhance text input rate (a syntactic analysis-based word-prediction aid for people with severe motor and speech disability)," in Proceedings of the 2nd International Conference on Intelligent User Interfaces, pp. 241-244, 1997.
2. M. Haque, M. Habib, M. Rahman et al., "Automated word prediction in bangla language using stochastic language models," arXiv preprint arXiv:1602.07803, 2016.
3. N. Garay-Vitoria and J. Abascal, "Text prediction systems: a survey," Universal Access in the Information Society, vol. 4, no. 3, pp. 188-203, 2006.
4. T. S. Rani and R. S. Bapi, "Analysis of n-gram based promoter recognition methods and application to whole genome promoter prediction," In Silico Biology, vol. 9, no. 1-2, pp. S1-S16, 2009.
5. O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, "Bangla word prediction and sentence completion using gru: an extended version of rnn on n-gram language model," in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, pp. 1-6, 2019.
6. S. Sarker, M. E. Islam, J. R. Saurav, and M. M. H. Nahid, "Word completion and sequence prediction in bangla language using trie and a hybrid approach of sequential lstm and n-gram," in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, pp. 162-167, 2020.
7. S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116, 1998.
8. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
9. K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
10. M. Sundermeyer, R. Schluter, and H. Ney, "Lstm neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
11. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
12. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
13. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
14. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
15. S. Bickel, P. Haider, and T. Scheffer, "Predicting sentences using n-gram language models," in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 193-200, 2005.
16. G. L. Prajapati and R. Saha, "Reeds: Relevance and enhanced entropy based dempster shafer approach for next word prediction using language model," Journal of Computational Science, vol. 35, pp. 1-11, 2019.
17. K. Trnka, J. McCaw, D. Yarrington, K. F. McCoy, and C. Pennington, "User interaction with word prediction: The effects of prediction quality," ACM Transactions on Accessible Computing (TACCESS), vol. 1, no. 3, pp. 1-34, 2009.
18. H. X. Goulart, M. D. Tosi, D. S. Goncalves, R. F. Maia, and G. A. Wachs-Lopes, "Hybrid model for word prediction using naive bayes and latent information," arXiv preprint arXiv:1803.00985, 2018.
19. G. Szymanski and Z. Ciota, "Hidden markov models suitable for text generation," in WSEAS International Conference on Signal, Speech and Image Processing (WSEAS ICOSSIP 2002), pp. 3081-3084, 2002.
20. T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, no. 3, pp. 1045-1048, 2010.
21. C. Zhou, C. Sun, Z. Liu, and F. Lau, "A c-lstm neural network for text classification," arXiv preprint arXiv:1511.08630, 2015.
22. S. Sarker, M. E. Islam, J. R. Saurav, and M. M. H. Nahid, "Word completion and sequence prediction in bangla language using trie and a hybrid approach of sequential lstm and n-gram," in 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT). IEEE, pp. 162-167, 2020.
23. M. Bhuyan and S. Sarma, "An n-gram based model for predicting of word-formation in assamese language," Journal of Information and Optimization Sciences, vol. 40, no. 2, pp. 427-440, 2019.
24. P. P. Barman and A. Boruah, "A rnn based approach for next word prediction in assamese phonetic transcription," Procedia Computer Science, vol. 143, pp. 117-123, 2018.
25. R. Sharma, N. Goel, N. Aggarwal, P. Kaur, and C. Prakash, "Next word prediction in hindi using deep learning techniques," in 2019 International Conference on Data Science and Engineering (ICDSE). IEEE, pp. 55-60, 2019.
26. K. Shakhovska, I. Dumyn, N. Kryvinska, and M. K. Kagita, "An approach for a next-word prediction for ukrainian language," Wireless Communications and Mobile Computing, vol. 2021, 2021.
27. R. Rahman, "Robust and consistent estimation of word embedding for bangla language by fine-tuning word2vec model," in 2020 23rd International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 1-6, 2020.
28. Z. S. Ritu, N. Nowshin, M. M. H. Nahid, and S. Ismail, "Performance analysis of different word embedding models on bangla language," in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, pp. 1-5, 2018.
29. O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, "Bangla word prediction and sentence completion using gru: an extended version of rnn on n-gram language model," in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, pp. 1-6, 2019.
30. M. S. Islam, S. S. S. Mousumi, S. Abujar, and S. A. Hossain, "Sequence-to-sequence bangla sentence generation with lstm recurrent neural networks," Procedia Computer Science, vol. 152, pp. 51-58, 2019.
31. A. Joulin, M. Cisse, D. Grangier, H. Jegou et al., "Efficient softmax approximation for gpus," in International Conference on Machine Learning. PMLR, pp. 1302-1310, 2017.
32. T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
33. A. Pal and A. Mustafi, "Vartani spellcheck: automatic context-sensitive spelling correction of ocr-generated hindi text using bert and levenshtein distance," arXiv preprint arXiv:2012.07652, 2020.
34. Y. Hong, X. Yu, N. He, N. Liu, and J. Liu, "Faspell: A fast, adaptable, simple, powerful chinese spell checker based on dae-decoder paradigm," in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 160-169, 2019.