A Multifaceted Evaluation of Representation of Graphemes For Practically Effective Bangla OCR
https://doi.org/10.1007/s10032-023-00446-7
ORIGINAL PAPER
Abstract
Bangla Optical Character Recognition (OCR) poses a unique challenge due to the presence of hundreds of diverse conjunct
characters formed by the combination of two or more letters. In this paper, we propose two novel grapheme representation
methods that improve the recognition of these conjunct characters and the overall performance of OCR in Bangla. We have
utilized the popular Convolutional Recurrent Neural Network architecture and implemented our grapheme representation
strategies to design the final labels of the model. Due to the absence of a large-scale Bangla word-level printed dataset, we
created a synthetically generated Bangla corpus containing 2 million samples that are representative and sufficiently varied
in terms of fonts, domain, and vocabulary size to train our Bangla OCR model. To test the various aspects of our model,
we have also created 6 test protocols. Finally, to establish the generalizability of our grapheme representation methods, we
have performed training and testing on external handwriting datasets. Experimental results proved the effectiveness of our
novel approach. Furthermore, our synthetically generated training dataset and the test protocols are made available to serve
as benchmarks for future Bangla OCR research.
Keywords Bangla OCR · Word-level OCR · Synthetic Bangla OCR dataset · OCR benchmark · Neural networks
Koushik Roy, Md Sazzad Hossain and Pritom Kumar Saha have contributed equally to this work.

Corresponding author: Fuad Rahman (fuad@apurbatech.com)
Koushik Roy (koushik.roy@northsouth.edu)
Md Sazzad Hossain (sazzad_hossain@apurbatech.com)
Pritom Kumar Saha (pritom_saha@apurbatech.com)
Shadman Rohan (shadman.rohan@northsouth.edu)
Imranul Ashrafi (imranul.ashrafi@northsouth.edu)
Nabeel Mohammed (nabeel.mohammed@northsouth.edu)
Ifty Mohammad Rezwan (mohammad.rezwan@northsouth.edu)
B. M. Mainul Hossain (mainul@iit.du.ac.bd)
Ahmedul Kabir (kabir@iit.du.ac.bd)

1 Apurba Technologies, Dhaka, Bangladesh
2 Apurba-NSU R&D Lab, North South University, Dhaka 1229, Bangladesh
3 Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh

1 Introduction

Optical character recognition, or OCR, can be considered a complicated and diverse problem. Many approaches have been taken to address this complex problem. These solutions include both rule-based deterministic and heuristic methods. Although rule-based solutions can be considered for their interpretability, it has proven difficult to capture all the applicable rules, which reduces such a model's adaptability. This has led to the rise of heuristic method-based solutions, which stem mostly from statistics and machine learning. Earlier solutions contained a mixture of rule-based and popular statistical
machine learning-based solutions such as Hidden Markov Models [1] and Bayesian Models [2]. The recent availability of large datasets and deep learning models has brought significant improvements [3, 4]. These deep learning models have been found to learn more useful features and outperform their statistical and rule-based counterparts. During the recent emergence of deep learning in OCR, most solutions have been geared towards accessible languages such as English [5, 6] or Chinese [7], which have a plethora of resources. Languages such as English tend to have a limited number of characters and do not have any special conjunct characters. On the other hand, languages such as Bangla, as addressed in this paper, tend to have a considerably larger set of characters. In addition, these characters have some special characteristics that make OCR difficult. One instance is the usage of modifiers, where one character is conjoined with another character to form a new grapheme. Another special attribute of Bangla emerges with the conjunct characters. A certain variant form of a character may appear before or after (or both) the base character, mainly discerning a certain form of sound in the word. Such linguistic flexibility of Bangla adds a new dimension of difficulty to the OCR problem.

There are multiple approaches to optical character recognition, including the character-based approach and the word-based approach. The character-based approach requires the document to be first segmented into lines, then words, and finally characters. Each level of segmentation adds a new level of complexity to the problem. This is specifically true for the Bangla language due to the morphological flexibility mentioned before. The word-level approach removes the last layer of complexity and preserves the conjuncts as much as possible. Moreover, there are two different approaches to word-level OCR. The first, commonly known as the Fixed method [8, 9], performs OCR within a fixed set of words from a dictionary. The second, known as the Dynamic method [10, 11], takes into account the characters in words and performs OCR based on the dictionary of characters or graphemes it uses while being trained on a diverse set of words. When it performs recognition, the dynamic architecture enables the model to potentially detect words that are not present in the dictionary it was trained on. The latter approach is used in this paper. At the time of conducting our research, no diverse, correct, large-scale dataset for Bangla word-level OCR was present. Moreover, the datasets that existed were designed primarily for character-level recognition and, in some cases, contained numerous errors.

The research conducted in this paper is part of building a word-level printed text recognition service required for a commercial project. Following this, we propose a pipeline to create efficient, diverse, large-scale synthetic datasets with word-level annotations. The popular CRNN [10] word-level OCR model was used as the baseline, which was further adapted as per the specifications of the grapheme representation methods in the final layer. The models with the various grapheme representation methods score over a number of test sets that also include real, non-synthetic samples. To summarize, the main contributions of this paper are threefold:

1. We present a fully synthetic corpus for Bangla word-level OCR consisting of 2 million (2,074,992 to be exact) samples covering a vocabulary of 691,664 words.
2. Six test sets we created are used to evaluate the OCR models trained on the synthetic corpus on aspects such as real and synthetic seen and unseen words, fonts, formatting, etc.
3. We propose two novel grapheme representation methods suitable for the morphological richness of an intricate language like Bangla, relevant to determining the final design of our models. To confirm the generalizability of the proposed methods, we performed training and testing on external publicly available Bangla handwriting datasets, which are completely different from our synthetic data and test protocols.

The datasets generated and used during the experiments are available at figshare (https://doi.org/10.6084/m9.figshare.20186825). The code to reproduce the experiments is publicly available at GitHub (https://github.com/apurba-nsu-rnd-lab/bangla-ocr-crnn).

The remainder of this paper is organized as follows. Sect. 2 reviews the literature relevant to this paper. Next, in Sect. 3, we describe the methodology to generate our synthetic training set and further outline the test protocols. In Sect. 4, we elaborate on the methods we used to perform the experiments, our proposed grapheme representation methods, and the evaluation metrics. In Sect. 5, we report all the metrics achieved by our models using different grapheme representation methods and provide an in-depth analysis. Finally, we provide our concluding observations in Sect. 7.

2 Related works

Despite being a challenging task, OCR has seen many applications in real life. The problem's challenges can relate to the difficulty of understanding the target language. Other challenges include predicting document content at different levels, which can include paragraphs, sentences, words and characters. Concentrating on these levels of challenges of OCR, in addition to the inherent complexity of the Bangla
language, multiple efforts have been made towards ameliorating the challenges [12]. OCR pipelines are commonly divided into two components: the first is the Text Detection module [13–16], responsible for segmenting and localizing each sentence, word, and character, and the second is the Text Recognition module [5, 6, 8–11], which leverages the segmented word or character images for recognition. Another variation of OCR models uses an end-to-end strategy to perform both text detection and recognition in a single pipeline [6, 17–20]. As this paper looks into the recognition problem, we limit the scope of the review to Text Recognition only, more specifically, to word-level recognition.

In recent years, due to the prevalence of Neural Networks, OCR models have achieved outstanding performances [10, 21, 22]. However, for Bangla, the development of word-level OCR has been slower compared with the growth of similar OCR pipelines in other languages. We discuss the relevant works for this literature survey in two parts. First, we discuss the state-of-the-art OCR models in other languages, and then we review the pre-existing OCR models for Bangla.

2.1 OCR in other languages

The advancement of deep neural networks in the last decade kept pushing the ceiling of OCR research, and there have been many influential works. The OCR challenge can be divided into two key domains: one is scene text and printed word recognition, and the other is handwriting recognition.

2.1.1 Scene text and printed words recognition

One of the earlier approaches phrased the text recognition problem as a classification problem [9]. Each word was assigned a specific label, and a deep Convolutional Neural Network (CNN) was trained with an English vocabulary of 90,000 words. However, the major drawback of the model appears when it is used in languages with a larger character set; the model is unable to handle words outside the designated vocabulary. For instance, in Chinese, a language with a very large number of characters, it is extremely difficult for a deep CNN model to learn such a large number of patterns. Thus, a CNN model phrased as an object detector would fail to generalize in multiple scenarios and cannot be used for a better-performing OCR model.

A dynamic word-based recognition was proposed by Shi et al. [10]. Prior to this, almost all deep learning models for word-level recognition focused on prediction from a fixed vocabulary, which, as aforementioned, has its own drawbacks. Shi et al. proposed a novel neural network architecture that does not require any pre-determined word vocabulary. This architecture includes three chief modules, known as the feature extractor, the sequence modeling module and the transcription layer. The feature extraction module is a CNN-based feature extractor loosely based on the well-known VGG16 [3] architecture. Using a sequence model implemented with a Bidirectional Long Short-Term Memory (LSTM) [23] based encoder–decoder architecture, the dependency on various lengths of texts was removed. The transcription layer was responsible for predicting the sequence labels for each timestep in the model using their maximum probability scores and determining the respective label. This model is popularly known as the Convolutional Recurrent Neural Network (CRNN). The paper reports a recognition accuracy of 89.6% for English on the ICDAR 2013 [24] dataset.

Baek et al. [25] showed that using an attention [26] based prediction layer instead of Connectionist Temporal Classification (CTC) [27] yielded 1.7% and 2.6% better accuracy in their experiments, and even before them, using attention instead of CTC had become a popular method of achieving state-of-the-art performance. ASTER [28] uses a rectification network and a recognition network, where the rectification network transforms the images to deal with perspective and curved texts without any human annotations. The recognition network uses an attention LSTM to decode the prediction. FAN [29] addresses the alignment issue between feature areas and targets, which the authors call attention drift, present in attention-based encoder–decoder models. SCATTER [30] improved the architecture proposed by Baek et al. [25] by introducing a selective decoder that operates on both visual features from the CNN layer and contextual features from the Bidirectional Long Short-Term Memory (BiLSTM) layer, harnessing a two-step 1D attention mechanism. The method recognized cursive texts and texts on complex backgrounds better than Baek et al. [25].

To combat the shortcomings of Recurrent Neural Network (RNN) based models, Yu et al. [31] proposed an end-to-end framework known as the Semantic Reasoning Network. The paper introduces a Global Semantic Reasoning Module that flows the semantic information in a "multi-way parallel" manner. This sub-network can essentially learn the words or characters simultaneously, making it more robust. The introduction of this network eliminates the need for a time-step-based sequential learner, which can at times pass erroneous information, resulting in the accumulation of unexpected semantic information. To capture the semantic information, the authors used Transformer blocks [26], where the input is a feature extracted from the Parallel Visual Attention Module. The paper reports a recognition rate of 95.5% on the ICDAR 2013 dataset [24] and a recognition rate of 82.7% on the ICDAR 2015 dataset [32].

Transformer-based architectures [33–37] have been widely used recently for their better learning and generalization capability. Aberdam et al. [38] proposed a Multimodal
Semi-Supervised contrastive learning-based method utilizing a visual representation learning algorithm for scene text recognition. Self-supervised contrastive learning and masked image modeling are utilized by [39] to learn discrimination and generalization for text recognition. Chu et al. [40] proposed an Iterative Language Modeling Module (IterLM) to further enhance scene text recognition. A novel single visual model is introduced by Du et al. [41], where recognition is done through simple linear prediction. Zheng et al. [42] proposed a regularization-based method to reduce the domain discrepancy between real and synthetic data.

2.1.2 Handwriting recognition

It is difficult to replicate the success of scene text and printed word recognition in handwriting recognition. Handwriting is far more varied than printed words, and even human recognition accuracy on handwriting can be lower than on printed words. However, the deep learning architectures used for printed word recognition are also used for recognizing handwriting. Chammas et al. [43] used the CRNN architecture, applied Temporal Dropout to the image level and internal network representations during fine-tuning, and reported better results than the CRNN baseline on Spanish and German benchmark datasets. The CRNN baseline has also been used in [44], where the authors proposed a novel loss function called SoftCTC and demonstrated state-of-the-art results on handwriting recognition benchmark datasets. Yousef et al. [45] and Maillette de Buy Wenniger et al. [46] have also achieved good performance by utilizing CTC loss-based architectures. Encoder–decoder-based sequence-to-sequence models such as AttentionHTR [47] have also seen use in handwriting recognition. Resource scarcity is another issue facing handwriting recognition, and thus Fogel et al. proposed ScrabbleGAN [48] to generate varying-length handwritten texts. Many researchers have also worked on the resource efficiency of handwriting recognition [43, 45, 46, 49].

2.2 OCR in Bangla language

Although a significant amount of research has been done on OCR, the improvements in word-level OCR for Bangla have been slower in comparison with other languages. The major hindrance to this dormant growth is the unavailability of any standard dataset for word-level OCR. We combat this problem by presenting a Bangla word-level OCR corpus that can be used to train and evaluate a word-level text recognition model. Character recognition of the Bangla language is challenging as most characters are cursive in nature, and there are no well-defined strokes [50]. In the pre-deep-learning phase, researchers tried to improve the performance of character-level OCR by combining structural analysis and algorithmic analysis and by using template and feature matching techniques [51–59]. However, since the advent of deep learning, feed-forward neural networks have become the preferred method in OCR studies [60–66].

Due to the absence of an open-source printed word dataset, research on Bangla OCR has mostly advanced in handwriting recognition, specifically character-based classification models. On segmented character recognition tasks, many studies [67–71] have used deep learning to achieve good performance. Sharif et al. proposed a hybrid HOG-CNN model [72] to classify Bangla compound characters and achieved 92.50% accuracy on the CMATERdb 3.1.3.3 isolated characters dataset. Hasan et al. [73] proposed a deep CNN with a Bidirectional Long Short-Term Memory model to predict Bangla compound characters. They experimented with their proposed model on the same CMATERdb 3.1.3.3 dataset and achieved a new milestone recognition accuracy of 98.50%.

Paul et al. [74] proposed a BiLSTM-CTC-based model trained on 47,720 text lines. Their dataset contained 472,167 words, consisting of 2,867,659 characters, and was classified into 166 unique labels. The test set comprised 61,582 words containing 369,931 characters. During training, they ignored the peephole connections of the BiLSTM to reduce the CTC loss, and the weights of the model were initialized with the Xavier Initializer [75], which helped the loss function converge fast. On the test set, they achieved a character recognition rate and a word recognition rate of 99.32% and 96.65% respectively, which outperforms Google's Tesseract 4.0 [21] (91.79% and 76.31%) and Google Drive OCR (98.54% and 92.86%). However, despite the reported great performance, the dataset is not available for comparative study. Recently, two Bangla handwritten word datasets, BN-HTRd [76] and BanglaWriting [77], have been published, but no baseline word recognition models or results have been presented in those papers.

3 Proposed datasets

Neural network-based solutions require a dataset with sufficient size and variation to perform well under current training methods. Although optical character recognition is a well-defined problem with satisfactory models, the requirement of language-specific datasets at scale, in this case a Bangla word-level dataset, still remains. We are aware that the process of deriving annotated data is a sufficiently lengthy process and requires a large number of resources. To solve our problem, we propose a synthetic word dataset. Despite steps taken to produce synthetically generated Bangla character datasets [78], no large synthetic word dataset prior to ours exists to our knowledge. In addition to our synthetic training dataset, we have also created six unique test
Footnote 3: https://www.unicode.org/charts/PDF/U0980.pdf
Footnote 4: https://github.com/Belval/TextRecognitionDataGenerator
Fig. 3 Synthetically generated image samples from Protocol I of the test sets
Fig. 4 Synthetically generated image samples from Protocol II of the test sets
sets for evaluating the word-level OCR. For segmentation, we used the Apurba Segmentation pipeline [12], which depends on three significant steps to extract words from a document effectively. To be clear, we used this framework, and it does not constitute a contribution of this paper. The choice of this particular segmentation pipeline is not influential to our research, as the words segmented by the pipeline were later double-checked by a human before compiling the word-level test datasets.

Instead of one big test set, six different protocols were created to rigorously test the OCR model and bring robustness to the evaluation. The following sections provide details on why each protocol was created and the steps taken to create these six test protocols in the form of datasets, each serving a particular purpose. These sections also contain image samples that illustrate the different partitions of the test data. Additionally, the annotations of the real-world images were performed by the authors. In order to improve the quality of the annotation, each annotation was later verified by more than one person.

3.2.1 Protocol I

Evaluation aspect: In order to evaluate the performance of our models on words they have not encountered in the training set, we establish the first test protocol. This protocol consists of a large number of unique synthetic words to test the performance of our model at a scale that is difficult to reach with real data, given the resource constraints of creating a sizeable real dataset.

Creation method: For this protocol, from a list of around one hundred and fifty thousand unique words taken from the SUMono dataset [80], we curated a list of 75,138 unique words that were not used to generate the previously defined 2 million training dataset or the validation dataset. We then used the synthetic data generation pipeline to generate text images in the prior eight variant sizes along with five new fonts as the test set. These fonts were not used to generate the training set or the validation set. Samples of word images for this protocol are shown in Fig. 3.

3.2.2 Protocol II

Evaluation aspect: This is a hybrid test set that was created by producing two documents, each containing a hundred words. These documents were printed, scanned, and finally processed by the Apurba segmentation pipeline. This protocol has been created to evaluate text recognition performance on clean, printed and scanned documents. Upon segmentation, the authors annotated each word sample in this protocol. It is cleaner in content than many other real-world examples, as observed from the samples in Fig. 4.

Creation method: For this dataset, we used two fonts, "Lohit" and "Kalpurush", with fixed sizes, where the "Lohit" font was not used to generate the 2 million training set. The words were also randomly taken and had overlaps.

3.2.3 Protocol III

Evaluation aspect: For this variation, a diverse set of articles was randomly selected from a collection of newspaper articles available online. This protocol was created to test the robustness of the model on styled texts, such as bold, underlined and italic texts.

Creation method: The fonts used to generate this protocol were "SiyamRupali" and "Kalpurush", which were also present in the training set, and "Lalsalu", "Lohit" and "BenSen", which were not present in the training set. To introduce further variations in the data, the fonts were in variable sizes, and we replicated some of the samples by generating images in bold, italic, and underlined font variations. Figure 5 shows the variations introduced in this protocol.
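The replication step above amounts to enumerating a grid of rendering configurations. The sketch below is illustrative only: the font names follow the Protocol III description, while the size values are placeholders we chose, since the paper only states that sizes were variable.

```python
from itertools import product

# Fonts named in the Protocol III description; the sizes are assumed
# placeholder values (the paper only says "variable sizes").
fonts = ["SiyamRupali", "Kalpurush", "Lalsalu", "Lohit", "BenSen"]
styles = ["regular", "bold", "italic", "underline"]
sizes = [28, 32, 36]

# One rendering configuration per (font, style, size) combination.
variants = [{"font": f, "style": s, "size": z}
            for f, s, z in product(fonts, styles, sizes)]

print(len(variants))  # 5 fonts x 4 styles x 3 sizes = 60 configurations
```

Each sampled word can then be rendered once per configuration (or per a random subset of configurations) to produce the styled variants.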
We printed out the documents and scanned them again to introduce real-world foreground and background noises [83]. Consequently, we used the Apurba segmentation pipeline explained in the earlier section to convert the documents into single images of words. This eventually culminated in a total of 3056 test samples for this protocol.

3.2.4 Protocol IV

Evaluation aspect: In this test protocol, we introduced three different types of documents: typeset pages, printed books and old binarized books. This protocol evaluates the recognition model's performance on documents from different domains.

Creation method: The documents were scanned and further segmented using the Apurba segmentation pipeline. The resulting test protocol comprises 1105 real-world word image samples. Among them, 456 word images are from typeset documents, 305 images are extracted from printed books and 344 words are retrieved from old binarized books. In Fig. 6, we demonstrate segmented word samples from the three categories of documents collected for this test set.

3.2.5 Protocol V

Evaluation aspect: The fifth test protocol has six variations of real-world noise across nine different documents. Moreover, this protocol contains single-character images, unlike the prior test sets. It is also known that single-character-based word recognition adds depth to the challenge for word-level recognition models [10]. This protocol consists of multiple real-world challenging cases with varying noises, as shown in Fig. 7.

Creation method: This test set is based on documents collected from various government organizations in Bangladesh. Upon segmenting the documents, we retrieved 1931 real-world images. This test set has further complications, which come with the noise that the documents carry after the scanning process.

3.2.6 Protocol VI

Evaluation aspect: In addition to having real-world noises, all the words in this protocol contain at least one conjunct character. We created this protocol to test the performance of our three grapheme representation methods on conjunct characters. Samples are shown in Fig. 8.

Fig. 8 Real-world image samples of words with at least one conjunct character from Protocol VI of the test sets

Creation method: Our final test protocol is based on various books of stories and poems, religious books, scientific textbooks, etc. After running multiple pages from different documents through the segmentation pipeline, the extracted
word images that met the criteria of this protocol were hand-picked.

In Table 1, we provide a summary of the six test protocols.

4 Methodology
Table 1 Summary of the test protocols including their respective data type, font information, sources, and the sample counts

Proto | Type      | Font                                                         | Sources                               | Samples | Tests
1     | Synthetic | Fonts not in training                                        | Online articles                       | 75,138  | Performance on large-scale data
2     | Hybrid    | 1 out of 2 fonts in training                                 | Real documents                        | 199     | Performance on clean scanned text
3     | Hybrid    | 2 out of 5 fonts in training; contains bold, italic, underline | Online articles                     | 3056    | Performance on stylized text
4     | Real      | Unknown                                                      | Typeset, printed, and binarized books | 1105    | Performance on images from different domains
5     | Real      | Unknown                                                      | Government documents                  | 1931    | Performance on real-world noise
6     | Real      | Unknown                                                      | Textbooks, story, poem and religious books, etc. | 114 | Performance on conjunct characters
Table 3 A possible representation of a grapheme in multiple combinations, illuminating the difficulty of prediction

Consonants can also form clusters with one or two other consonants connected by a 'hoshonto', an example of which can be seen in Table 3. In Table 3, we show an example of a grapheme extracted from a word. In all traditional sequential recognition models, Bangla Unicode text is used as labels for words. In that way, the grapheme in Table 3 will have three labels to represent it, shown in the upper right cell. However, by observing Table 3, we realize that as we are working with image-level data, it may be harder for the model to understand that the grapheme in the left column is composed of three different characters. While the model can certainly learn it, treating graphemes distinctly based on their visual presentation may be easier. Instead of using three different labels to represent this grapheme, we can combine the unicodes and use them as a single label for the grapheme, shown in the lower right cell.

In our experiments, we break down words into three different representations. As mentioned above, the graphemes may also involve conjuncts or modifiers, which are essentially special structural variants of characters or combined characters. The takeaway is that if they are visually different, they too can be considered a separate class in the vocabulary. This adds a new level of complexity, because if we try to add the possibility of all combinations of consonant clusters along with the vowel diacritics, the number of classes in the vocabulary will be in the thousands. On the other hand, if we separate the graphemes naively as in other languages, namely English and Chinese, where the characters are split to their root form, we will end up with a small set of characters which poorly represents the consonant clusters and diacritics, and therefore the recognition accuracy may fall. So we try to reach an optimum trade-off. In Table 4, we provide examples of representing characters in different ways and further discuss the proposed representation approach.

In the first Naive Separation method, the graphemes are separated at the root level and broken into their smallest form, similar to the legacy implementation of character extraction from unicode strings. Due to this, the modifiers are also separated, and consonant clusters are broken down. Thus, each character is treated as a class. In Bangla, a consonant cluster comprises two to three consonants joined by the 'hoshonto' character, previously shown in Table 3. In the first row of Possible Representations in Table 3, we show that the grapheme can be represented with multiple characters joined by the 'hoshonto' character in between.

In the second Vowel Diacritics Separation (VDS) method, the vowel diacritics are separated, and the consonant clusters are kept in their own form, which is further demonstrated in Table 4.

Finally, the last All Diacritics Separation (ADS) method for grapheme representations is based on the vowel diacritics separation method. But now, atop vowel diacritics separation, we also separate a few Brahmi-Sanskrit diacritics: jofola (grapheme created by 'Hoshonto + Jo'), rofola (grapheme created by 'Hoshonto + Ro') and ref (grapheme created by 'Ro + Hoshonto'). Like the vowel diacritics, these diacritics are used along with almost every consonant and create consonant clusters. But jofola, rofola and ref remain visually identical in every consonant cluster. It can be shown that these diacritics only appear along with the base consonant clusters or characters and produce unique clusters. From Table 4, for the first and the third example of the separation method, we can find that due to the presence of the 'Ro + Hoshonto' grapheme, it is followed by newer graphemes such as 'Ro + Hoshonto + Bo' and 'Ro + Hoshonto + To', as shown in Table 5. We can also show that some graphemes may coexist in other graphemes where they do not appear visually. Since these diacritics themselves are not unique but may produce thousands of unique clusters, we separate them from the clusters and only keep the unique clusters formed by combining different consonants. This ensures that the consonant clusters without any diacritics are kept in their original forms while the clusters with vowel diacritics and/or the Brahmi-
A multifaceted evaluation of representation of graphemes for practically effective Bangla OCR
Table 5 Composition of multiple conjunct-based characters and rein- the number of insertions, deletions and substitutions respec-
force the examples in Table 4, showing how one grapheme may co-exist tively.
in other graphemes in their Conjunct form edit_distance( pr ed, gt)
NED = (3)
max(| pr ed|, |gt|)
Sanskrit diacritics are broken down. Thus we get a list of Our final metric is the Character Recognition Rate (CRR)
classes that is neither too low to represent the language nor as defined in Eq. 4 which denotes the percentage of char-
too complex. acters correctly recognized. Edit Distance (ED) counts the
Both our novel Vowel Diacritics Separation (VDS) and number of characters that were not correctly recognized by
All Diacritics Separation (ADS) do not break down com- aligning pr ed and gt. By dividing the total number of incor-
pound characters, even if it is a combination of more than rectly predicted characters by the total number of characters
two consonants. The algorithm to extract labels in the VDS which is calculated by taking the summation of the length of
and ADS method has been provided in “Appendix B”. Our each longest string among the prediction pr ed and ground
experiments showed that VDS and ADS-based methods per- truth string gt, we get the percentage of incorrectly recog-
formed better than the legacy Naive method on consonant nized characters in a word. Finally, by subtracting from 1,
conjuncts or consonant clusters. The experimental setup and we find the percentage of correctly recognized characters or
results regarding the grapheme representation methods are the Character Recognition Rate (CRR).
described in the following section.
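To make the metric definitions concrete, the following is a minimal plain-Python sketch of Eq. 3 and the CRR computation described above; the function names (`edit_distance`, `ned`, `crr`) are illustrative choices of ours, not code released with the paper.

```python
def edit_distance(pred: str, gt: str) -> int:
    """Levenshtein distance: minimum number of insertions,
    deletions and substitutions turning pred into gt."""
    m, n = len(pred), len(gt)
    dp = list(range(n + 1))  # one row of the DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (pred[i - 1] != gt[j - 1]))  # substitution
            prev = cur
    return dp[n]

def ned(pred: str, gt: str) -> float:
    """Normalized Edit Distance, Eq. 3."""
    return edit_distance(pred, gt) / max(len(pred), len(gt))

def crr(pairs) -> float:
    """Character Recognition Rate over (pred, gt) pairs:
    1 - (total edit distance / total of the longer lengths)."""
    total_ed = sum(edit_distance(p, g) for p, g in pairs)
    total_len = sum(max(len(p), len(g)) for p, g in pairs)
    return 1.0 - total_ed / total_len
```

For Bangla, iterating over Python `str` compares Unicode code points, which matches the per-character treatment of edit distance discussed in Sect. 6.1.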
Kaiming initialization [87] and initially trained with a learning rate of 5e−5. A Cosine Annealing with Warm Restarts [88] learning rate scheduler was used, where the first restart occurs after 15 epochs and the minimum learning rate was set to 1e−6. We trained the models for 30 epochs and optimized them using the Adam optimizer. We used 128 × 32 sized inputs, from which a 29-frame feature sequence is generated. This differs from the input size of 100 × 32 mentioned in [10]. We have increased the width of the input image because Bangla words tend to be longer, and a bigger input size lets the model recognize the visual complications of the conjunct characters better. Initial testing with these longer images gave us better accuracy on all the tests, and thus we report results using this size in this paper.

To test the robustness of our grapheme representation methods on real data, we chose two OHR datasets, BN-HTRd [76] and BanglaWriting [77]. Handwriting datasets were chosen due to the lack of proper Bangla scene text or printed word datasets, which is why we created our synthetic dataset in the first place. However, choosing these datasets also enabled us to test our grapheme representation methods in OHR. During training on the real handwriting datasets, a learning rate of 1e−3 was used, and on the BanglaWriting dataset the models were trained for 60 epochs. All other configurations remained the same as for the models trained on the synthetic dataset. Ignoring some of the data lost to mislabeling, we were able to extract 108,061 words from BN-HTRd and 21,221 words from BanglaWriting. BanglaWriting provides both raw and noise-reduced versions, and we have chosen the raw images as they are more natural and difficult to recognize. As no prior train-test split was provided with either of the datasets, we randomly split each of them and used 90% for training and 5% each for validation and testing. We have performed both intra-dataset and inter-dataset testing. Intra-dataset testing is performed on the 5% test split, but inter-dataset testing was performed on the entire dataset, meaning the model trained on BN-HTRd was tested on the entire BanglaWriting dataset and vice-versa. All our experiments were conducted on a machine with an NVIDIA V100 GPU using the PyTorch framework [89], where we used a mini-batch size of 256. We used the weights of the best validation epoch during testing and provide results on the test sets mentioned above.

Table 6 Labels for all the variants of our models

Configuration   Classes   Total params
CRNN-Naive      115       8.76M
CRNN-VDS        591       9.01M
CRNN-ADS        262       8.84M

Table 7 Performance of our models with two feature extractors using our three proposed grapheme representation methods on the six test protocols

Protocol   Model        WRR (%)   Total NED    CRR (%)
I          CRNN-Naive   93.43     1029.3625    98.28
           CRNN-VDS     92.11     1340.5018    97.88
           CRNN-ADS     93.08     1131.4137    98.15
II         CRNN-Naive   81.91     18.9000      95.42
           CRNN-VDS     85.93     14.8988      95.83
           CRNN-ADS     80.90     24.1583      93.88
III        CRNN-Naive   62.11     373.5325     90.34
           CRNN-VDS     64.01     372.2867     90.36
           CRNN-ADS     63.02     376.7820     90.15
IV         CRNN-Naive   43.71     269.0556     80.66
           CRNN-VDS     48.69     239.9502     82.31
           CRNN-ADS     42.81     272.4529     79.80
V          CRNN-Naive   87.52     150.5297     97.17
           CRNN-VDS     87.99     135.8027     97.10
           CRNN-ADS     86.85     148.4270     96.93
VI         CRNN-Naive   78.07     7.1286       94.25
           CRNN-VDS     78.95     7.1302       94.25
           CRNN-ADS     81.58     6.1917       95.33

Bold indicates that the model has the best result on the protocol among the three models

5.2 Results

In this section, we report the results of our models in two different scenarios. First, the models trained on the synthetic dataset were tested on the six test protocols we described earlier. Then we trained on real handwriting datasets and report intra- and inter-dataset results on those real datasets. Three different configurations were trained and tested; Table 6 lists all the model configurations explored. The labels in Table 6 are also used to present the test results in the following subsections. Each model configuration denotes one of the grapheme representation strategies paired with the CRNN architecture. For example, CRNN-Naive represents the CRNN architecture with the Naive representation method. For reporting results for each model on all of our test sets, we use WRR, Total NED, and CRR as indicators for evaluating the performance of the models.

5.2.1 Results of training on the synthetic dataset and testing on our six test protocols

We report the results of the training experiments performed on the synthetic dataset in Table 7.

As previously mentioned, the purpose of Protocol I is to determine how the models perform on unseen synthetic data, that is, words that the model has not encountered during training. For the first protocol, from Table 7, we can observe that the model with the Naive method outperformed our proposed VDS and ADS grapheme representation methods, but from the table it is clear that the margin is very low. Protocol I is
also the only protocol that is a completely synthetic dataset, while the rest of the protocols are either hybrid or real.

Protocol II mainly comprises synthetically generated words that were later printed and scanned, and it is thus considered a hybrid dataset. In Table 7, for the second protocol, it can be observed that the CRNN-VDS model achieved the highest WRR and CRR of 85.93% and 95.83%, respectively. In addition, due to its low Total NED, we consider CRNN-VDS the best-performing model for the second protocol.

Protocol III is an extension of the previously reported protocol. This test set consists of multiple varieties of fonts and font sizes, with bold, italic and underlined samples as well. All of these data were generated at the document level and later segmented for evaluation. From the results of Protocol III in Table 7, we can clearly observe that our CRNN-VDS model achieves the highest WRR (64.01%) and CRR (90.36%) among all the models, and the result is reinforced by its low Total NED.

The purpose of Protocol IV was to further evaluate the model with real samples comprising typeset documents and printed and old binarized books. Upon segmentation of the documents, we retrieved around eleven hundred word samples, on which we performed the testing while the model was trained on the synthetic 2-million-word dataset. From the results of Protocol IV in Table 7, we can observe that our CRNN-VDS model displayed the best results.

Our fifth test set, Protocol V, mainly consists of documents from various government organizations in Bangladesh. Additionally, it also contains certain variations of real-world noise. From Table 7, we can observe that the CRNN-VDS model attained the best WRR of 87.99% and achieved a CRR of 97.10%, which is slightly lower than that of the model with the Naive representation method. Then again, the lowest Total NED among the three models makes CRNN-VDS the better model.

Protocol VI consists of real-world data where each word comprises at least one conjunct character. It was specifically designed to test our grapheme representation schemes. Here we can observe the consistency of the CRNN-VDS model, which achieved a better WRR than the CRNN-Naive model. The CRNN-ADS model with the ADS grapheme representation was the best-performing model, with a WRR of 81.58% and a CRR of 95.33%, accompanied by the lowest Total NED. This shows the CRNN-ADS model's competitive performance on real data alongside CRNN-VDS, and its superiority on words with conjunct characters.

Table 8 Training and testing schemes on the handwriting datasets

Trained on            Tested on             Scheme
BN-HTRd               BN-HTRd               INTRA1
BanglaWriting (Raw)   BanglaWriting (Raw)   INTRA2
BN-HTRd               BanglaWriting (Raw)   INTER1
BanglaWriting (Raw)   BN-HTRd               INTER2

Table 9 Performance of our three proposed grapheme representation methods on real handwriting datasets

Scheme   Model        WRR (%)   Total NED     CRR (%)
INTRA1   CRNN-Naive   79.48     831.5276      89.33
         CRNN-VDS     82.91     757.5683      90.27
         CRNN-ADS     83.69     712.8907      90.85
INTRA2   CRNN-Naive   69.40     116.3189      87.47
         CRNN-VDS     69.30     126.9041      86.39
         CRNN-ADS     72.50     108.6174      88.27
INTER1   CRNN-Naive   60.58     3647.4147     82.02
         CRNN-VDS     61.96     3692.2762     81.75
         CRNN-ADS     62.94     3516.3509     82.93
INTER2   CRNN-Naive   50.42     23897.9807    76.64
         CRNN-VDS     49.47     26075.8238    74.13
         CRNN-ADS     52.14     23325.9911    76.99

Bold indicates that the model has the best result on the scheme among the three models

5.2.2 Results of training and testing on real handwriting datasets

For these experiments, we have chosen BN-HTRd [76] and BanglaWriting [77] and demonstrate our novelty on these datasets. As these are published and fairly recent datasets, and due to the scarcity of a good Bangla real or synthetic printed text or scene text dataset, these datasets were chosen to test our novel representation methods. We have divided our training and testing on these datasets into four schemes, as presented in Table 8. The first two schemes are intra-dataset training and testing, and in the last two schemes we tested the model trained on BN-HTRd on the entire BanglaWriting dataset and vice-versa. The results of these experiments are shown in Table 9.

From the results in Table 9, we can observe the consistency of the CRNN-ADS model with the ADS grapheme representation method on the real handwriting datasets, which surpassed the baseline Naive extraction method in terms of all the metrics.
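The learning-rate schedule used for all these runs (initial rate 5e−5, cosine annealing with warm restarts [88], first restart after 15 epochs, minimum 1e−6) can be sketched in plain Python as below. This is a minimal sketch that assumes a fixed restart period (T_mult = 1), which the paper does not state; in PyTorch the equivalent scheduler is `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math

def sgdr_lr(epoch: float, lr_max: float = 5e-5, lr_min: float = 1e-6,
            t0: float = 15.0) -> float:
    """Cosine annealing with warm restarts (SGDR [88]).
    The rate decays from lr_max toward lr_min over each cycle of t0
    epochs and then restarts; a fixed cycle length is assumed here."""
    t_cur = epoch % t0  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t0))

# Shape of the schedule over the 30 synthetic-training epochs:
# starts at 5e-5, anneals toward 1e-6, warm-restarts at epoch 15.
schedule = [sgdr_lr(e) for e in range(30)]
```

The warm restart at epoch 15 periodically returns the model to a high learning rate, which helps it escape shallow minima while still finishing each cycle near the 1e−6 floor.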
Fig. 11 Percentage of conjuncts mispredicted by each of the models on the printed words test protocols

Fig. 12 Percentage of conjuncts mispredicted by each model on the handwriting training and testing schemes

6.1 Analysis on conjunct characters

The evaluation metrics used so far were the WRR, which represents absolute matches, and the CRR and Total NED, which depend heavily on the calculation of edit distance. To calculate the edit distance between two words, the words are broken down into single characters. Bangla consonant clusters are formed from two or more consonants with a 'hoshonto' character in between each of the consonants, and thus the edit distance does not reflect how well these consonant clusters are being predicted. For this reason, we have performed a separate analysis to calculate the percentage of mispredicted consonant clusters.

Although the CRNN-Naive model did better in terms of our evaluation metrics on Protocol I, as shown in Table 7, from Fig. 11 we can see that the CRNN-Naive model had a higher error rate than CRNN-VDS and CRNN-ADS. This trend was consistent across all the protocols, as the novel character representation methods performed better on the consonant clusters. Protocols IV, V and VI consist of real data, and as Protocol V is mainly comprised of simple words and single-character words, there are not many complex consonant clusters to predict and thus the error rate is low for all the models. However, on Protocols IV and VI, we can really see the difference. On Protocol IV, the error rate of the CRNN-Naive model is 30.18%, whereas the CRNN-VDS and CRNN-ADS models have considerably lower error rates of 21.52% and 21.2%, respectively. Protocol VI, specially built for testing such cases, also demonstrates a visible difference between the novel representation methods and the naive method. Here the CRNN-VDS and CRNN-ADS models have error rates of 10.74% and 11.57%, respectively, while the CRNN-Naive model mispredicted 17.36% of the consonant clusters.

Some word samples with conjunct characters are shown in Table 10. We can see that the CRNN-Naive model has a high edit distance on the samples, while the CRNN-VDS and CRNN-ADS models have predicted the words perfectly or with a low edit distance.

Figure 12 also exhibits a trend similar to Fig. 11, as here too the CRNN-VDS and CRNN-ADS models with the novel grapheme representation methods perform better than the CRNN-Naive model with the naive method. In all the schemes, there is a noticeable difference. In the first scheme, there is almost a 10 percent difference between the best-performing CRNN-VDS model and the CRNN-Naive model with the naive representation. On the last three schemes, the margins were even greater.

Finally, by observing the performance of the models on Bangla consonant clusters in Figs. 11 and 12, as well as the results of the models in Sects. 5.2.1 and 5.2.2, we can conclude that the VDS grapheme representation is fitting for the Bangla OCR problem. The model has achieved better scores on the test protocols as well as on the real handwriting datasets and has also recognized the consonant clusters better. The ADS grapheme representation also showed better performance than the naive method on the real handwriting datasets and did better at recognizing consonant clusters. Also, the high WRR and CRR achieved on our test protocols indicate that the models trained on the synthetic dataset were performing well on the hybrid and real test sets. As the CRNN-VDS model with the VDS grapheme representation performed better than the rest, we chose it to further test its generalizability over three disparate real-world test sets reported in "Appendix A" and found that the model was performing exceptionally well on an unseen real-world printed dataset.

6.2 Analysis on short and long words

In Bangla text, as conjunct characters can be broken down into two or more characters, the length of Bangla words tends to be longer than in other scripts, such as Latin. This is why it is important for a Bangla OCR system to be able to predict longer words effectively. However, the prediction sequence length is usually fixed (29 in our case) and thus it is difficult to predict a long sequence using the naive method. This is where our CRNN-VDS and CRNN-ADS models shine. We have conducted experiments on various word lengths, 1–4, 5–8, 9–12 and 13+, to test the models' performance at different text lengths. As different representation methods can provide different numbers of labels for the same text, we have considered the number of Unicode
characters in a text to be the word length, to keep the calculation consistent across all the extraction methods. From Table 11, we can observe that our novel methods perform better than the naive method on both short and long words on all real and hybrid test protocols.

6.3 Limitations and error cases

The novel representation methods do not perform exceptionally well on the large-scale synthetic testing Protocol I. The synthetic Protocol I data are visually similar to the synthetic training dataset. Some of the training samples with synthetically added noise are shown in Fig. 13. Due to the visual similarity, it is the easiest protocol for our recognition models, and on this protocol the margin between the models is very slim.

The CRNN-VDS method performs best on real-world samples when it has been trained on a large enough dataset. However, as the number of classes for CRNN-VDS is more than five times higher than for the baseline CRNN-Naive model, it needs a lot of data to achieve good accuracy and distinguish each class properly. This was made clear by the results on the handwriting datasets, where the training samples were few and the CRNN-ADS model overperformed
7 Conclusion
Declarations
22. Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176 (2016)
23. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
24. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: ICDAR 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493 (2013). IEEE
25. Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., Oh, S.J., Lee, H.: What is wrong with scene text recognition model comparisons? Dataset and model analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
27. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
28. Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2035–2048 (2019). https://doi.org/10.1109/TPAMI.2018.2848939
29. Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5086–5094 (2017). https://doi.org/10.1109/ICCV.2017.543
30. Litman, R., Anschel, O., Tsiper, S., Litman, R., Mazor, S., Manmatha, R.: SCATTER: selective context attentional scene text recognizer. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11959–11969 (2020). https://doi.org/10.1109/CVPR42600.2020.01198
31. Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., Ding, E.: Towards accurate scene text recognition with semantic reasoning networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12113–12122 (2020)
32. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: ICDAR 2015 competition on robust reading. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160 (2015). IEEE
33. Feng, X., Yao, H., Qi, Y., Zhang, J., Zhang, S.: Scene text recognition via transformer. arXiv preprint arXiv:2003.08077 (2020)
34. Atienza, R.: Vision transformer for fast and efficient scene text recognition. In: Document Analysis and Recognition, ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I, vol. 16, pp. 319–334 (2021). Springer
35. Wu, J., Peng, Y., Zhang, S., Qi, W., Zhang, J.: Masked vision-language transformers for scene text recognition. arXiv preprint arXiv:2211.04785 (2022)
36. Wang, P., Da, C., Yao, C.: Multi-granularity prediction for scene text recognition. In: Computer Vision, ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pp. 339–355 (2022). Springer
37. Xie, X., Fu, L., Zhang, Z., Wang, Z., Bai, X.: Toward understanding WordArt: corner-guided transformer for scene text recognition. In: Computer Vision, ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pp. 303–321 (2022). Springer
38. Aberdam, A., Ganz, R., Mazor, S., Litman, R.: Multimodal semi-supervised learning for text recognition. arXiv preprint arXiv:2205.03873 (2022)
39. Yang, M., Liao, M., Lu, P., Wang, J., Zhu, S., Luo, H., Tian, Q., Bai, X.: Reading and writing: discriminative and generative modeling for self-supervised text recognition. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4214–4223 (2022)
40. Chu, X., Wang, Y.: IterVM: iterative vision modeling module for scene text recognition. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1393–1399 (2022). IEEE
41. Du, Y., Chen, Z., Jia, C., Yin, X., Zheng, T., Li, C., Du, Y., Jiang, Y.-G.: SVTR: scene text recognition with a single visual model. arXiv preprint arXiv:2205.00159 (2022)
42. Zheng, C., Li, H., Rhee, S.-M., Han, S., Han, J.-J., Wang, P.: Pushing the performance limit of scene text recognizer without human annotation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14116–14125 (2022)
43. Chammas, E., Mokbel, C., Likforman-Sulem, L.: Handwriting recognition of historical documents with few labeled data. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 43–48 (2018). IEEE
44. Kišš, M., Hradiš, M., Beneš, K., Buchal, P., Kula, M.: SoftCTC: semi-supervised learning for text recognition using soft pseudo-labels. arXiv (2022). arXiv:2212.02135
45. Yousef, M., Hussain, K.F., Mohammed, U.S.: Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. Pattern Recogn. 108, 107482 (2020). https://doi.org/10.1016/j.patcog.2020.107482
46. Maillette de Buy Wenniger, G., Schomaker, L., Way, A.: No padding please: efficient neural handwriting recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 355–362 (2019). https://doi.org/10.1109/ICDAR.2019.00064
47. Kass, D., Vats, E.: AttentionHTR: handwritten text recognition based on attention encoder–decoder networks. In: Uchida, S., Barney, E., Eglin, V. (eds.) Document Analysis Systems, pp. 507–522. Springer, Cham (2022)
48. Fogel, S., Averbuch-Elor, H., Cohen, S., Mazor, S., Litman, R.: ScrabbleGAN: semi-supervised varying length handwritten text generation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4323–4332 (2020). https://doi.org/10.1109/CVPR42600.2020.00438
49. Souibgui, M.A., Fornés, A., Kessentini, Y., Megyesi, B.: Few shots are all you need: a progressive learning approach for low resource handwritten text recognition. Pattern Recogn. Lett. 160, 43–49 (2022). https://doi.org/10.1016/j.patrec.2022.06.003
50. Rahman, A., Kaykobad, M.: A complete Bengali OCR: a novel hybrid approach to handwritten Bengali character recognition. J. Comput. Inf. Technol. 6(4), 395–413 (1998)
51. Pal, U., Chaudhuri, B.B.: OCR in Bangla: an Indo-Bangladeshi language. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3, Conference C: Signal Processing (Cat. No. 94CH3440-5), vol. 2, pp. 269–273 (1994). https://doi.org/10.1109/ICPR.1994.576917
52. Sattar, M., Rahman, S.: An experimental investigation on Bangla character recognition system. Bangladesh Comput. Soc. J. 4(1), 1–4 (1989)
53. Rahman, A.F.R., Fairhurst, M.: Multi-prototype classification: improved modelling of the variability of handwritten data using statistical clustering algorithms. Electron. Lett. 33(14), 1208–1210 (1997)
54. Pal, U.: On the development of an optical character recognition (OCR) system for printed Bangla script. PhD thesis, Indian Statistical Institute, Calcutta (1997)
55. Chaudhuri, B., Pal, U.: A complete printed Bangla OCR system. Pattern Recogn. 31(5), 531–549 (1998)
56. Rahman, A.F.R., Fairhurst, M.C.: A new hybrid approach in combining multiple experts to recognise handwritten numerals. Pattern Recogn. Lett. 18(8), 781–790 (1997)
57. Rahman, A.F.R., Rahman, R., Fairhurst, M.C.: Recognition of handwritten Bengali characters: a novel multistage approach. Pattern Recogn. 35(5), 997–1006 (2002)
58. Mahmud, J.U., Raihan, M.F., Rahman, C.M.: A complete OCR system for continuous Bengali characters. In: TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region, vol. 4, pp. 1372–1376 (2003). IEEE
59. Kamruzzaman, J., Aziz, S.: Improved machine recognition for Bangla characters. In: International Conference on Electrical and Computer Engineering 2004, pp. 557–560 (2004). ICECE 2004 Conference Secretariat, Bangladesh University of Engineering and Technology
60. Alam, M.M., Kashem, M.A.: A complete Bangla OCR system for printed characters. JCIT 1(01), 30–35 (2010)
61. Ahmed, S., Kashem, M.A.: Enhancing the character segmentation accuracy of Bangla OCR using BPNN. Int. J. Sci. Res. (IJSR), ISSN (Online) 2319–7064 (2013)
62. Chowdhury, A.A., Ahmed, E., Ahmed, S., Hossain, S., Rahman, C.M.: Optical character recognition of Bangla characters using neural network: a better approach. In: 2nd ICEE (2002)
63. Ahmed, S., Sakib, A.N., Ishtiaque Mahmud, M., Belali, H., Rahman, S.: The anatomy of Bangla OCR system for printed texts using back propagation neural network. Glob. J. Comput. Sci. Technol. (2012)
64. Afroge, S., Ahmed, B., Hossain, A.: Bangla optical character recognition through segmentation using curvature distance and multilayer perceptron algorithm. In: 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 253–257 (2017). IEEE
65. Hossain, S.A., Tabassum, T.: Neural net based complete character recognition scheme for Bangla printed text books. In: 16th International Conference on Computer and Information Technology, pp. 71–75 (2014). IEEE
66. Pramanik, R., Bag, S.: Shape decomposition-based handwritten compound character recognition for Bangla OCR. J. Vis. Commun. Image Represent. 50, 123–134 (2018)
67. Ghosh, R., Vamshi, C., Kumar, P.: RNN based online handwritten word recognition in Devanagari and Bengali scripts using horizontal zoning. Pattern Recogn. 92, 203–218 (2019)
68. Purkaystha, B., Datta, T., Islam, M.S.: Bengali handwritten character recognition using deep convolutional neural network. In: 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–5 (2017). IEEE
69. Islam, M.S., Rahman, M.M., Rahman, M.H., Rivolta, M.W., Aktaruzzaman, M.: RatNet: a deep learning model for Bengali handwritten characters recognition. Multimed. Tools Appl. 81, 10631–10651 (2022). https://doi.org/10.1007/s11042-022-12070-4
70. Maity, S., Dey, A., Chowdhury, A., Banerjee, A.: Handwritten Bengali character recognition using deep convolution neural network. In: Bhattacharjee, A., Borgohain, S.K., Soni, B., Verma, G., Gao, X.-Z. (eds.) Machine Learning, Image Processing, Network Security and Data Sciences, pp. 84–92. Springer, Singapore (2020)
71. Roy, A.: AKHCRNet: Bengali handwritten character recognition using deep learning (2020)
72. Sharif, S., Mohammed, N., Momen, S., Mansoor, N.: Classification of Bangla compound characters using a HOG-CNN hybrid model. In: Proceedings of the International Conference on Computing and Communication Systems, pp. 403–411 (2018). Springer
73. Hasan, M.J., Wahid, M.F., Alom, M.S.: Bangla compound character recognition by combining deep convolutional neural network with bidirectional long short-term memory. In: 2019 4th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–4 (2019). IEEE
74. Paul, D., Chaudhuri, B.B.: A BLSTM network for printed Bengali OCR system with high accuracy. arXiv preprint arXiv:1908.08674 (2019)
75. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010). JMLR Workshop and Conference Proceedings
76. Rahman, M.A., Tabassum, N., Paul, M., Pal, R., Islam, M.K.: BN-HTRd: a benchmark dataset for document level offline Bangla handwritten text recognition (HTR) and line segmentation. arXiv (2022). https://doi.org/10.48550/ARXIV.2206.08977. https://arxiv.org/abs/2206.08977
77. Mridha, M.F., Ohi, A.Q., Ali, M.A., Emon, M.I., Kabir, M.M.: BanglaWriting: a multi-purpose offline Bangla handwriting dataset. Data Brief 34, 106633 (2021). https://doi.org/10.1016/j.dib.2020.106633
78. Banik, M., Rifat, M.J.R., Nahar, J., Hasan, N., Rahman, F.: Okkhor: a synthetic corpus of Bangla printed characters. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Proceedings of the Future Technologies Conference (FTC) 2020, vol. 1, pp. 693–711. Springer, Cham (2021)
79. Roark, B., Wolf-Sonkin, L., Kirov, C., Mielke, S.J., Johny, C., Demirsahin, I., Hall, K.: Processing South Asian languages written in the Latin script: the Dakshina dataset. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 2413–2423. European Language Resources Association, Marseille, France (2020). https://aclanthology.org/2020.lrec-1.294
80. Al Mumin, M.A., Shoeb, A.A.M., Selim, M.R., Iqbal, M.Z.: Sumono: a representative modern Bengali corpus. SUST J. Sci. Technol. 21(1), 78–86 (2014)
81. Biswas, E.: Bangla largest newspaper dataset. Kaggle (2021). https://doi.org/10.34740/KAGGLE/DSV/1857507. https://www.kaggle.com/dsv/1857507
82. Ahmed, M.F., Mahmud, Z., Biash, Z.T., Ryen, A.A.N., Hossain, A., Ashraf, F.B.: Bangla online comments dataset. Mendeley Data (2021). https://doi.org/10.17632/9xjx8twk8p.1. https://data.mendeley.com/datasets/9xjx8twk8p/1
83. Farahmand, A., Sarrafzadeh, H., Shanbehzadeh, J.: Document image noises and removal methods (2013)
84. Lee, C.-Y., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2231–2239 (2016). https://doi.org/10.1109/CVPR.2016.245
85. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)
86. Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information (2020). https://doi.org/10.3390/info11020125
87. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. IEEE Computer Society, Los Alamitos, CA, USA (2015). https://doi.org/10.1109/ICCV.2015.123
88. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts (2017)
89. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf