Bangla Compound Character Recognition by Combining Deep Convolutional Neural Network With Bidirectional Long Short-Term Memory


4th International Conference on Electrical Information and Communication Technology (EICT), 20-22 December 2019, Khulna, Bangladesh

Md. Jahid Hasan∗, Md. Ferdous Wahid†, Md. Shahin Alom‡
Department of Electrical and Electronic Engineering,
Hajee Mohammad Danesh Science and Technology University
Dinajpur-5200, Bangladesh
jahidnoyon36@gmail.com∗, mfwahid26@gmail.com†, ashahin200@gmail.com‡

Abstract— Recognition of Bangla handwritten compound characters plays a significant role in developing a complete Bangla OCR. It is a challenging task owing to the high variation in individual writing styles and the structural resemblance between characters. This paper proposes a novel approach to recognize Bangla handwritten compound characters by combining a deep convolutional neural network with bidirectional long short-term memory (CNN-BiLSTM). The efficacy of the proposed model is evaluated on the test set of the compound character dataset CMATERdb 3.1.3.3, which consists of 171 distinct character classes. The model achieved 98.50% recognition accuracy, which is significantly better than current state-of-the-art techniques on this dataset.

Keywords—Bangla handwritten compound characters, Bangla OCR, deep CNN, Bidirectional long short-term memory, compound characters recognition.

I. INTRODUCTION

Bangla is the seventh most widely spoken native language in the world; around 270 million people speak Bangla [1]. There is therefore a pressing need to improve computer support for the Bangla language, and in recent years Bangla OCR has attracted considerable attention from researchers. Many studies have been conducted on handwritten Bangla basic character and numeral recognition, but handwritten Bangla compound character recognition is still a comparatively less investigated area in this domain. Hence, there is ample scope for improvement in handwritten Bangla compound character recognition. Recognition of handwritten Bangla compound characters, which are formed by combining multiple characters, is a challenging task due to their structural closeness and non-uniform scale. A few examples of compound characters are shown in Table I.

TABLE I. FEW EXAMPLES OF COMPOUND CHARACTERS

Several approaches, such as neural networks, handcrafted features, support vector machines and deep learning, have been employed for handwritten Bangla compound character recognition. Among them, deep learning approaches, especially CNNs [2], are widely preferred in OCR-based applications because of their remarkable recognition performance and low pre-processing requirements for analyzing visual imagery. Roy et al. [6] explored deep learning for Bangla handwritten isolated compound character recognition. They trained a LeNet-5 CNN architecture in a supervised manner using greedy layer-wise training and optimized the model using the RMSProp optimizer. Their model attained 90.33% accuracy on the CMATERdb 3.1.3.3 dataset. Sharif et al. [7] combined a handcrafted feature extraction technique, Histogram of Oriented Gradients (HOG), with a CNN, which learns features from an image automatically, to recognize handwritten Bangla compound characters. The hybrid model combining HOG features with a CNN obtained 92.57% recognition accuracy on the classes of the CMATERdb 3.1.3.3 dataset. A deep CNN with dropout was proposed by Ashiquzzaman et al. [8] for recognizing handwritten Bangla compound characters. They used dropout to reduce over-fitting, which improved the generalization of the model, and the Exponential Linear Unit (ELU) to mitigate the vanishing and exploding gradient problem. A greedy layer-wise training approach was adopted during training. This scheme achieved a test accuracy of 93.68% on the CMATERdb 3.1.3.3 dataset. Sarkar et al. [9] proposed a deep learning model with a spatial transformer network (STN) for classification of handwritten Bangla compound characters. They used the STN to align the feature maps of the CNN and claimed that their model could predict any non-uniform character in a scale-invariant manner with an accuracy of 96.34% on the CMATERdb 3.1.3.3 dataset. Keserwani et al. [10] presented a two-phase learning CNN architecture for classifying handwritten Bangla compound characters. In the first phase, a CNN was trained in an unsupervised manner to minimize the reconstruction loss; in the second phase, these weights were used to train a second CNN architecture in a supervised manner to reduce the recognition loss. They validated their model on CMATERdb 3.1.3.3, where it showed promising results. Fardous et al. [11] achieved a test accuracy of 95.5% on the CMATERdb 3.1.3.3 database for isolated handwritten Bangla compound character recognition by employing a deep CNN architecture with eight convolutional layers, four pooling layers and two dense layers. All the above-mentioned studies reveal the potential of deep learning in image recognition tasks. Thus, we were encouraged to develop an intelligent system based on a deep CNN.

In the proposed approach, we combine a deep Convolutional Neural Network with Bidirectional Long Short-Term Memory in order to capture the contextual dependencies among different image regions, which helps to achieve better recognition accuracy. The proposed deep CNN-BiLSTM model sets a new benchmark recognition accuracy of 98.50% on the 171 classes of the CMATERdb 3.1.3.3 database in recognizing handwritten Bangla compound characters. The proposed model clearly outperforms most of the existing approaches.

978-1-7281-6040-5/19/$31.00 ©2019 IEEE

Authorized licensed use limited to: University of Wollongong. Downloaded on May 31,2020 at 07:58:29 UTC from IEEE Xplore. Restrictions apply.
II. DATASET PREPROCESSING

The handwritten Bangla compound character dataset CMATERdb 3.1.3.3 [12], from the pattern recognition database repository, was used to evaluate the performance of the proposed hybrid model. The images of this dataset are split into a training set of 44152 images and a testing set of 11126 images. The dataset contains 171 unique classes of Bangla handwritten compound characters, with a minimum of 200 grayscale images per class. The grayscale images are at different scales and are not centered. Moreover, the images contain noise introduced during electronic transmission. Thus, in order to increase the accuracy of the proposed model, we applied preprocessing to this dataset. Table II shows the parameters of the collected and preprocessed dataset.

TABLE II. STATE OF CMATERDB 3.1.3.3 DATASET AFTER PREPROCESSING

Parameter         Collected       Preprocessed
Scale             Non-uniform     Uniform
Translation       Non-centered    Centered
Output-Class      171             171
Training sample   44152           93455
Testing sample    11126           17688

Firstly, we used a mean filter of kernel size 3x3, which effectively reduces the noise in the original dataset image, using equation (1). Secondly, the background of the original image is inverted using the inverse binarization technique to obtain an image with white writing on a black background, which helps to recognize invariant features easily. Thirdly, a cropping process is employed to crop only the character portion of the image, which ensures that the character is held at the center of the image. Finally, all images are rescaled to 50x50 pixels.

g(x, y) = \frac{1}{9} \sum_{i=-1}^{1} \sum_{j=-1}^{1} I(x+i, y+j)    (1)

where I is the input image and g is the filtered image.

Afterwards, data augmentation techniques such as random rotation, flipping and zooming are applied to the preprocessed images. After augmentation, the preprocessed dataset has a total of 93455 training images with a minimum of 540 images per class and 17688 testing images with a minimum of 100 images per class. Fig. 1 shows the adopted preprocessing steps.

Fig. 1. Image preprocessing steps: denoising, invert binarization, centralization and resize (50x50 px), data augmentation.

III. METHODOLOGY

This section describes the CNN-BiLSTM hybrid network proposed for the recognition of handwritten Bangla compound characters. Here, the CNN layers learn mid-level image representations at different levels of abstraction, and the contextual dependencies between these mid-level representations are extracted by the BiLSTM layer. Afterwards, we use two fully-connected layers and dropout. The first fully-connected layer collects the output of the BiLSTM layer to learn a global image representation, and classification is performed by the second fully-connected layer, which uses the softmax activation function. Dropout is used to reduce over-fitting of the network. The proposed hybrid network is depicted in Fig. 2.

Fig. 2. Architecture of the complete CNN-BiLSTM model: input image (50x50x3 px); feature extractor (three blocks of Conv2D, Conv2D, Max_pooling2D); Flatten; sequential feature learning through the BiLSTM layer; dense layer (FC 1); Dropout; classification layer (FC 2).

A. Convolutional Neural Network

The CNN part of the proposed network consists of six convolution layers with three pooling layers. A convolution layer takes an input and generates feature maps using kernels, and a pooling layer reduces the spatial resolution of the feature maps to retain the dominant features. Among the six convolutional layers, we use a kernel size of 5x5 in the first two layers and 3x3 in the remaining layers. ReLU, a nonlinear activation function, is used in each convolutional layer; it accelerates the learning of complex functions. In the proposed architecture, we place a max-pooling layer with kernel size 2x2 after every two convolutional layers; it outputs the maximum activation observed in each region and also reduces the number of parameters.

B. Bi-directional Long Short Term Memory Network

A Recurrent Neural Network (RNN) works on sequential information by updating its weight matrix through a recursive process. Unfortunately, it suffers from the exploding or vanishing gradient problem, which makes it difficult for an RNN to remember long-range dependencies in the input sequence. This problem is mitigated by the LSTM architecture, in which a memory cell is used to preserve state over long time steps.
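To illustrate how a memory cell can preserve state over many time steps, the following is a minimal sketch of the standard LSTM cell update, reduced to scalar states. It is not taken from the paper: the function names (`lstm_step`, `rnn_step`), the parameter dictionary and the hand-picked gate biases are assumptions chosen purely for illustration. With the forget gate held near 1 and the input gate near 0, the cell state survives 100 steps almost unchanged, whereas a plain recurrent update with recurrent weight 0.5 shrinks towards zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell in its standard formulation."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # memory cell: retain old state, add gated input
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c


def rnn_step(x, h_prev, w_x=1.0, w_h=0.5):
    """One step of a plain recurrent unit, for comparison."""
    return math.tanh(w_x * x + w_h * h_prev)


# Hypothetical parameters: all weights zero, forget gate held near 1
# (bias +10), input gate held near 0 (bias -10).
params = {k: 0.0 for k in
          ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bo", "bg")}
params["bf"] = 10.0
params["bi"] = -10.0

h, c = 0.0, 1.0        # LSTM starts with cell state 1.0
h_rnn = 1.0            # plain RNN starts with hidden state 1.0
for _ in range(100):   # 100 time steps with zero input
    h, c = lstm_step(0.0, h, c, params)
    h_rnn = rnn_step(0.0, h_rnn)

print(f"LSTM cell state after 100 steps: {c:.4f}")
print(f"plain RNN state after 100 steps: {h_rnn:.2e}")
```

In a trained network the gate values are learned rather than fixed, so the cell decides per time step how much of its state to keep.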

In an LSTM, information flows in one direction only, i.e. backward to forward, so it does not work well when the direction of the input sequence is reversed. In contrast, in a BiLSTM, information flows in both directions, i.e. backward to forward and forward to backward, using two hidden layers: a forward layer and a backward layer. There are no hidden-to-hidden connections between the forward and backward layers. The output dense layer receives information from both layers. The forward layer provides past information, whereas future information is provided by the backward layer. Therefore, a BiLSTM captures spatial correlation more accurately than an LSTM. Fig. 3 shows the structure of a BiLSTM.

Fig. 3. Construction of a BiLSTM layer: an input layer feeding a forward layer (hidden states h_f) and a backward layer (hidden states h_b).

The forward and backward hidden states are computed and merged using the following equations.

h_f^{t} = \sigma\left(W_f * [x^{t}, h_f^{t-1}] + b_f\right)    (2)
h_b^{t} = \sigma\left(W_b * [x^{t}, h_b^{t+1}] + b_b\right)    (3)
h^{t} = [h_f^{t}, h_b^{t}]    (4)

where h is the hidden layer state, W is the weight, b is the bias term, \sigma is the activation function and x^{t} is the input at step t.

In this study, we add a BiLSTM unit at the end of the CNN architecture to find spatial dependencies among the extracted features. We employed 100 LSTM units to construct the BiLSTM unit, which interprets the features across time steps in both directions. Table III shows the layer-wise description of the proposed CNN-BiLSTM architecture.

TABLE III. LAYER DESCRIPTION OF THE PROPOSED CNN-BILSTM MODEL

Layer             No. of Neurons/Filters   Activation Function   Kernel Size
Conv1             32                       ReLU                  5x5
Conv2             32                       ReLU                  5x5
MaxPool           -                        -                     2x2
Conv3             128                      ReLU                  3x3
Conv4             128                      ReLU                  3x3
MaxPool           -                        -                     2x2
Conv5             128                      ReLU                  3x3
Conv6             128                      ReLU                  3x3
MaxPool           -                        -                     2x2
Flatten_1         4608                     -                     -
BiLSTM            200                      ReLU                  -
FullyConnected1   256                      ReLU                  -
FullyConnected2   171                      Softmax               -

IV. RESULT AND DISCUSSION

The proposed model was trained using the training dataset of 93455 images. In order to validate the model during the training phase, a random 20% of these images were set aside. The Adam optimizer, a stochastic gradient descent optimization procedure, was used to train the network. The initial learning rate was set to 0.0001 for the training phase. The parameter settings are summarized in Table IV. The model was implemented using the Keras library in Python on the Windows 10 operating system. The whole experiment was run on a machine with an Intel(R) Core(TM) i7 CPU @ 3.20GHz, 16GB RAM and an NVIDIA GeForce GTX-1070.

TABLE IV. PARAMETER VALUES OF THE PROPOSED MODEL

Parameter       Value
Learning rate   0.0001
Epochs          40
Optimizer       Adam
Loss function   Categorical cross entropy

The proposed model achieved 99.50% accuracy on the training set within 40 epochs, with a validation accuracy of around 98.50% during training. Fig. 4 shows the accuracy curves during the training phase. The value of the loss function over the whole training process is shown in Fig. 5. Finally, the performance of the proposed CNN-BiLSTM model was tested using the 17688 images of the augmented test dataset of CMATERdb 3.1.3.3; it achieved an average accuracy of 98.50% on 171 classes and took 0.58 seconds to recognize an image.

Fig. 4. Training and validation accuracy of our proposed model on the CMATERdb 3.1.3.3 dataset.

Fig. 5. Training and validation loss of our proposed model on the CMATERdb 3.1.3.3 dataset.
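As a consistency check on Table III, the feature-map sizes can be traced layer by layer. The paper does not state the padding mode or strides, so the sketch below assumes stride-1 convolutions and non-overlapping 2x2 max-pooling with floor rounding; the helper names `conv_out` and `trace_shapes` are hypothetical. Under these assumptions, 'same' padding reproduces the 4608 features listed for the Flatten layer (6 x 6 x 128), whereas 'valid' padding would give only 512, which suggests that 'same' padding was used.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    if padding == "same":
        return size
    return size - kernel + 1      # 'valid' padding


def trace_shapes(input_size=50, padding="same"):
    """Follow Table III: two convolutions, then a 2x2 max-pool, three times."""
    # (filters, kernel size) for each pair of convolution layers in Table III
    blocks = [(32, 5), (128, 3), (128, 3)]
    size, channels = input_size, 3
    shapes = []
    for filters, kernel in blocks:
        for _ in range(2):
            size = conv_out(size, kernel, padding)
            channels = filters
            shapes.append((size, size, channels))
        size //= 2                # assumed 2x2 pooling with stride 2
        shapes.append((size, size, channels))
    return shapes, size * size * channels


shapes_same, flat_same = trace_shapes(padding="same")
shapes_valid, flat_valid = trace_shapes(padding="valid")
print(shapes_same)
print("Flatten with 'same' padding:", flat_same)
print("Flatten with 'valid' padding:", flat_valid)
```

How the 4608-element vector is arranged into a sequence for the BiLSTM layer is not specified in the paper, so the sketch stops at the Flatten layer.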

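The training setup of Section IV (a random 20% of the 93455 training images held out for validation, with the hyperparameters of Table IV) can be sketched as follows. The paper does not say how the split was implemented, so the function `split_train_validation`, the fixed seed and the `config` dictionary are illustrative assumptions, not the authors' code. An explicit shuffle is used here because the `validation_split` argument of Keras `fit` takes the last fraction of the samples rather than a random subset.

```python
import random


def split_train_validation(n_samples, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) drawn from a random permutation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


# Hyperparameters as listed in Table IV (the key names are hypothetical)
config = {
    "learning_rate": 0.0001,
    "epochs": 40,
    "optimizer": "Adam",
    "loss": "categorical_crossentropy",
}

train_idx, val_idx = split_train_validation(93455)
print(len(train_idx), len(val_idx))
```

With 93455 augmented training images, this leaves 74764 images for fitting and 18691 for validation.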
In order to validate the effectiveness of the proposed method, the model is compared with other existing models on the same dataset. Table V shows the accuracy comparison for Bangla handwritten compound character recognition. The obtained result clearly outperforms the previously published methods for Bangla handwritten compound character recognition on the same dataset.

TABLE V. COMPARISON OF ACCURACY AMONG DIFFERENT MODELS

Ref              Method                 Accuracy
[8]              DCNN + ELU + Dropout   93.68%
[9]              CNN + 4 STN            96.34%
[11]             CNN-softmax            95.50%
Proposed model   CNN-BiLSTM             98.50%

Although the proposed model shows improved performance, it sometimes has difficulty classifying characters of some classes due to the closeness of the character structures. Some of the frequently misclassified characters are shown in Table VI.

TABLE VI. SOME MISCLASSIFICATION RESULTS

Actual Label (Test Image)   Predicted Label (Predicted Image)
1                           127
71                          18
125                         23
27                          7

(The sample images for the actual and predicted classes are not reproduced here.)

V. CONCLUSION

This paper has presented a CNN-BiLSTM model to recognize handwritten compound Bangla characters. Experiments demonstrate that the proposed hybrid model obtained 98.50% test accuracy on a test dataset with 171 classes. The performance of the model is superior to that of other implemented classification models for handwritten compound Bangla character recognition. The model shows improved accuracy due to its capability to learn contextual dependencies and spatial correlation between different regions of an image. Despite its promising performance, Bangla compound characters with high structural similarity are sometimes misclassified by the model. In future, this model will be employed in a complete Bangla OCR for handwritten texts.

REFERENCES

[1] "Bengali," Ethnologue. [Online]. Available: https://www.ethnologue.com/language/ben. [Accessed: 13-Sep-2019].
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998.
[3] Z. Zuo et al., "Convolutional recurrent neural networks: Learning spatial dependencies for image representation," 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, 2015, pp. 18-26.
[4] K. Zhang, "LSTM: An Image Classification Model Based on Fashion-MNIST Dataset".
[5] Y. Zhu, X. Li, X. Li, J. Sun, X. Song, and S. Jiang, "Joint Learning of CNN and LSTM for Image Captioning," CLEF, 2016.
[6] S. Roy, N. Das, M. Kundu, and M. Nasipuri, "Handwritten isolated Bangla compound character recognition: A new benchmark using a novel deep learning approach," Pattern Recognition Letters, vol. 90, pp. 15–21, Apr. 2017.
[7] S. M. A. Sharif, N. Mohammed, S. Momen, and N. Mansoor, "Classification of Bangla Compound Characters Using a HOG-CNN Hybrid Model," in Proceedings of the International Conference on Computing and Communication Systems, Springer Singapore, 2018, pp. 403–411.
[8] A. Ashiquzzaman, A. K. Tushar, S. Dutta, and F. Mohsin, "An efficient method for improving classification accuracy of handwritten Bangla compound characters using DCNN with dropout and ELU," in Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), 2017.
[9] P. R. Sarkar, D. Mishra, and G. R. Manyam, "Improving Isolated Bangla Compound Character Recognition Through Feature-map Alignment," 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), pp. 1-5, 2017.
[10] P. Keserwani, T. Ali, and P. P. Roy, "A two phase trained Convolutional Neural Network for Handwritten Bangla Compound Character Recognition," 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), Bangalore, pp. 1-6, 2017.
[11] A. Fardous and S. Afroge, "Handwritten Isolated Bangla Compound Character Recognition," 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 7-9 February 2019.
[12] N. Das, S. Basu, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu, "Handwritten Bangla Compound character recognition: Potential challenges and probable solution," in 4th Indian International Conference on Artificial Intelligence, Bangalore, 2009, pp. 1901-1913.

