Figure 3. (Original Image) Two dashed lines divide a Devanagari text-line into three distinct zones. These zones are determined on the basis of a statistical analysis of the training data. Zone 1, with a height of 30% of the image height, extends from the top of the image to the Shirorekha and contains the maatras. Zone 2, with a height of 50%, contains the basic character shapes. Zone 3, with a height of 20%, contains the vowels that attach to the bottom of the consonants. (Normalized Image) Each of these zones is normalized individually for each image. The total height of the resultant image is set to 40 pixels and each of the three zones is scaled accordingly.

Figure 4. Four sample images taken from the training set. These are synthetic line images generated using different fonts and different degradation models.
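The zone-wise normalization illustrated in Figure 3 can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the authors' code: it assumes the two zone boundaries (end of the Shirorekha zone and the baseline) have already been located, that each zone is non-empty, and the function names and interface are hypothetical.

```python
import numpy as np

def resize_rows(zone, new_h):
    """Rescale a zone to new_h rows by nearest-neighbour row sampling."""
    old_h = zone.shape[0]
    idx = np.minimum(np.arange(new_h) * old_h // new_h, old_h - 1)
    return zone[idx]

def normalize_line(img, shirorekha_end, baseline, zone_heights=(12, 20, 8)):
    """Normalize a binarized line image (2-D array) so that Zone 1 (maatras),
    Zone 2 (base characters) and Zone 3 (lower vowels) get fixed heights of
    12, 20 and 8 pixels, for a total height of 40."""
    zones = (img[:shirorekha_end],          # Zone 1: top of image to Shirorekha
             img[shirorekha_end:baseline],  # Zone 2: basic character shapes
             img[baseline:])                # Zone 3: below the baseline
    return np.vstack([resize_rows(z, h) for z, h in zip(zones, zone_heights)])
```

Whatever the input height, the output is always 40 rows, so every column (frame) fed to the LSTM has the same depth.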
outputs (k is the number of Unicode code-points available in the Devanagari script). The hidden layer consists of 100 units. The learning rate for the LSTM is set to 0.0001.

B. Normalization and Features

As reasoned in Section III, the input image needs to be normalized to make the depth of the individual frames equal. We tried the OCRopus normalizers but found that they do not yield good results on Devanagari script. To normalize the images in our experiments, we therefore used a normalization technique similar to the one used by Rashid et al. [19]. Any given image was divided into three zones (see Fig. 3 for details). To calculate the size of each zone, we performed a statistical analysis of the height of each zone over all images in the training set. For each image in the training set, the size of each zone was rescaled to the average value of that zone as obtained from this analysis. The same values were used to normalize the zones of the test set. The sizes of Zone 1, Zone 2 and Zone 3 were calculated to be 30%, 50% and 20% of the total image height, respectively. For our experiments, we set the total height of the image to 40 pixels (Zone 1 = 12, Zone 2 = 20, Zone 3 = 8). A height of 40 gave enough pixels for each character shape to be identified uniquely.

In all our experiments, we did not extract any features from the images. The binarized pixel values of the line image are used directly as features.

IV. Experimental Evaluation

We performed three kinds of experiments, differing in the number of fonts used for the training and test data. These experiments are classified as (i) Single Font-Single Font (train-test), (ii) Multi Font-Single Font, and (iii) Multi Font-Multi Font. Our experiments differ significantly from those of Sankaran et al. [3], since we use whole, unsegmented line images (which include a space character as a label in the output layer) for training the network. Moreover, we do not compute any extra features on our images.

A. Deva-DB – Devanagari Printed Text-line Image Database¹

For the experiments we generated both synthetic and real scanned text-line images. The training data consisted of pairs of Devanagari line images and the corresponding ground-truth text files, with the ground truth represented in Unicode. The synthetic line images were generated with OCRopus using various degradation models [20]. To generate the training set, we collected Devanagari (Hindi) text documents from various sources. The topics in the training set include current affairs, science, religion and classical literature. This was done to make sure that the training set covered all the commonly used word and character shapes. Fig. 4 shows a few examples of the synthetic line images used for training.

For real scanned images, we collected Devanagari text-book scans from the work published by Setlur et al. [21]. We segmented these whole-page scans into line images and manually generated the ground-truth data in Unicode format.

To assess the quality of our training set, we compared character and word statistics. Chaudhuri et al. [22] have published the twenty most frequently used characters in Hindi (Table I), based on three million Hindi words. We collected similar statistics for our training data, based on 9,56,405 words. As seen from Table I, the relative frequency of the characters in our training set is close to the statistics published by Chaudhuri et al. [22]. IIIT Hyderabad, India [23] has published word frequencies from a Hindi-language corpus containing three million words; our training set also matches the top ten frequent words of that table.

The test set was divided into two groups. The first group consisted of 1000 synthetic line images generated from a different text corpus than the one used for training. The second group consisted of 621 real scanned text-line images with different fonts and different levels of degradation. Fig. 5 shows a few sample images from this second group.

¹ Available free of cost from the authors.
Table I. Comparison of the character frequencies in our training data set. The left side shows the frequency of characters as published by [22] and the right side shows the character frequencies in our set. The frequencies of the most common characters in our set are quite close to those published by [22].

    [22]                      Our training set
    Symbol  Occurrence(%)  |  Symbol  Occurrence(%)
    ◌ा       10.12          |  ◌ा        8.33
    क        7.94          |  क         6.81
    र        7.40          |  र         6.07
    ◌े        6.53          |  ◌े         5.13
    न        5.06          |  त         4.22
    ◌ी        4.47          |  ि◌        4.19
    ह        4.39          |  न         4.08
    स        4.36          |  स         3.93
    त        4.32          |  ह         3.50
    ि◌       4.22          |  ◌ी         3.33

Table II. Performance comparison of four sets of experiments performed on Deva-DB. These experiments differ in the types of fonts used for training/evaluation and in the type of test data, i.e., synthetic or scanned.

    No.  Training Data                              Test Data                                            CER(%)
    1    Single Font, 24000 images (1,855,410 chars)  Single Font, Synthetic, 1000 images (80,500 chars)   1.2
    2    Multi-Font, 24000 images (1,855,410 chars)   Single Font, Synthetic, 621 images (22,517 chars)    4.1
    3    Multi-Font, 24000 images (1,855,410 chars)   Multi-Font, Synthetic, 621 images (22,517 chars)     2.7
    4    Multi-Font, 24000 images (1,855,410 chars)   Multi-Font, Real Data, 621 images (22,517 chars)     9.0

B. Performance Metric

The recognition accuracy is measured as the Character Error Rate (%) in terms of edit distance: the ratio between the insertion, deletion and substitution errors and the total number of ground-truth characters.

C. Results

The LSTM network was trained at two levels: first with images from a single font only (Lohit Hindi), and second with a training set consisting of images from multiple fonts. The results of the experiments are summarised in Table II. For the multi-font training set, we used seven different fonts: Lohit Hindi, Mangal, Samanata, Kalimati, NAGARI SHREE, Sarai and DV ME Shree.

V. Error Analysis and Discussions

Sample images of the scanned text-lines, along with the output of the LSTM-based line recognizer, are shown in Fig. 5. The errors made by the network are highlighted in red. Apart from errors due to confusions between similar shapes, one main reason for errors in these images is the ink spread in the scanned text-lines, which makes it harder for the LSTM network to correctly distinguish similar characters.

Although the confusion matrix in Table III shows most of the confusions to be between similar shapes, the top confusions (◌ं, ◌्, र) happen to be deletions (the network missing a character) or insertions (the network erroneously inserting a character). This appears very strange at first look, but a deeper analysis of the problem led us to the conjunct characters in Devanagari.

The character '◌ं' is a vowel sign which, when combined with a consonant, appears as a dot on top of the consonant. For instance, when combined with the consonant 'त' (ta), the resulting character appears as 'तं' (tan). A network trained with distorted images can, with high probability, treat '◌ं' as a distortion in the image and fail to detect it in some cases. The first image in Fig. 5 shows an example of such a deletion on the character ज.

The character '◌्' (virama) indicates that the preceding character should fuse with the character following '◌्'. If 'प' (pa) were to fuse with the character 'र' (ra), the code sequence would be 'प' + '◌्' + 'र', and the compound character has the shape 'प्र' (pra). As can be seen, the shape of the compound character represents 'प' (pa) more than 'र' (ra); this is the case with all consonants that fuse with 'र', i.e., they take the shape of the first consonant. Therefore, there is a high chance that the network predicts it as 'प'. When this happens, we have a deletion of two characters, '◌्' and 'र'. This explains why the top three confusions are deletions. In the fourth image in Fig. 5, the conjunct character ' ' is replaced by the consonant 'ख' because of a similar shape; this is one such example of a deletion of '◌्'.

The top deletion errors can be reduced by also taking the pixel variation in the vertical direction into consideration, whereas the other substitution errors can be removed by training on data that has more samples of these substituted shapes.

To compare the performance of the LSTM network, we evaluated the well-known OCR engine Tesseract [24] on our real test set. Tesseract showed an error of 11.96% (with the default Hindi model) on the same test set where the LSTM network gave an error of 9%.

Figure 5. Samples taken from the real scanned set of line images where the LSTM-based line recognizer fails. These samples vary not only in fonts but also in the degree of distortion due to ink spread and the scanner. The corresponding Unicode output from the LSTM is shown on the right side of each image, with errors marked in red. The errors have mostly occurred at places where the ink spread is such that the LSTM could not differentiate between two similar characters.
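The conjunct encoding discussed in the error analysis ('प' + '◌्' + 'र' rendering as a single glyph) can be checked directly at the Unicode level, which also shows why one visual confusion costs multiple code points in the edit distance:

```python
# Devanagari conjuncts are sequences of code points rendered as one glyph:
# consonant + virama (U+094D) + consonant.
pa, virama, ra = "\u092A", "\u094D", "\u0930"   # प, ◌्, र

pra = pa + virama + ra
print(pra)        # rendered as the single conjunct glyph प्र
print(len(pra))   # 3 -- three code points, one visual shape
```

If the recognizer outputs only 'प' for this glyph, the ground truth loses two code points ('◌्' and 'र') at once, which is exactly the deletion pattern seen in the confusion matrix.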
Table III. The top confusions from all four experiments (see Table II) and from Tesseract OCR Engine. ‘_’ corresponds to
a deletion (Pred.) or an insertion (GT). The top two confusions are deletions or insertions in all but one case.
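The character error rate reported in these tables (edit distance between prediction and ground truth, divided by the number of ground-truth characters) can be sketched as follows; this is a generic Levenshtein implementation, not the authors' evaluation script:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ground_truth, prediction):
    """Character Error Rate in percent: edit distance over GT length."""
    return 100.0 * edit_distance(ground_truth, prediction) / len(ground_truth)
```

Because the strings are compared code point by code point, a single missed conjunct glyph can contribute two or three errors, as described in the error analysis.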
The confusion matrix from Tesseract also shows the top confusions to be deletions, and the character '◌ं' also appears as the top deleted character in that confusion matrix.

VI. Conclusions

The complex nature of the Devanagari script (involving fused/conjunct characters) makes OCR research for Devanagari a challenging task. We have introduced a new database, Deva-DB, comprising ground-truthed text-line images from various scanned pages as well as synthetically generated text-lines. The OCRopus line recognizer has been adapted and trained on this database. This LSTM-based system yielded a character error rate of 1.2% when the test fonts matched those of the training data, but the error rate increased to 9% when tested on scanned data with a different set of fonts. The most important issue the network faced while classifying characters was that of conjunct characters and of characters that are vertically stacked; the shape and position of these vertically stacked glyphs vary widely across fonts. The top error is the deletion of '◌ं'. To address these issues, as a future step, 2D-LSTM networks can be evaluated on this database and may improve the accuracy, as these networks scan the text-lines not only sideways but also top-down. We believe this would give improved results, since the pixel variation in the vertical direction would then also be taken into account.

References

[1] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 5, pp. 855–868, 2009.
[2] W. Wei and G. Guanglai, "Online cursive handwriting mongolia words recognition with recurrent neural networks," International Journal of Information Processing and Management, vol. 2, no. 3, 2011.
[3] N. Sankaran and C. Jawahar, "Recognition of printed devanagari text using blstm neural network," in Pattern Recognition (ICPR), 2012 21st International Conference on, Nov 2012, pp. 322–325.
[4] T. M. Breuel, A. Ul-Hasan, M. Al Azawi, and F. Shafait, "High Performance OCR for Printed English and Fraktur using LSTM Networks," in ICDAR, Washington D.C., USA, Aug 2013.
[5] R. M. K. Sinha, "A syntactic pattern analysis system and its application to devnagari script recognition," 1973.
[6] B. Chaudhuri and S. Palit, "A feature-based scheme for the machine recognition of printed devanagari script," 1995.
[7] U. Pal and B. Chaudhuri, "Printed devanagari script ocr system," VIVEK-BOMBAY-, vol. 10, pp. 12–24, 1997.
[8] U. Pal, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, pp. 1887–1899, 2004.
[9] B. Shaw, S. Kumar Parui, and M. Shridhar, "Offline handwritten devanagari word recognition: A holistic approach based on directional chain code feature and hmm," in Information Technology, 2008. ICIT'08. International Conference on. IEEE, 2008, pp. 203–208.
[10] U. Bhattacharya, S. Parui, B. Shaw, K. Bhattacharya et al., "Neural combination of ann and hmm for handwritten devanagari numeral recognition," in Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.
[11] C. Jawahar, M. P. Kumar, S. R. Kiran et al., "A bilingual ocr for hindi-telugu documents and its applications," in ICDAR, vol. 3, 2003, pp. 408–412.
[12] U. Bhattacharya and B. B. Chaudhuri, "Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 444–457, 2009.
[13] R. Singh, C. Yadav, P. Verma, and V. Yadav, "Optical character recognition (ocr) for printed devnagari script using artificial neural network," International Journal of Computer Science & Communication, vol. 1, no. 1, pp. 91–95, 2010.
[14] A. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: a review," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 22, no. 1, pp. 4–37, Jan 2000.
[15] J. Chen and N. S. Chaudhari, "Protein secondary structure prediction with bidirectional lstm networks," in International Joint Conference on Neural Networks: Post-Conference Workshop on Computational Intelligence Approaches for the Analysis of Bio-data (CI-BIO), August 2005.
[16] A. Graves, "Supervised sequence labelling," in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 5–13.
[17] A. Ul-Hasan, S. B. Ahmed, S. F. Rashid, F. Shafait, and T. M. Breuel, "Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks," in ICDAR, Washington D.C., USA, 2013, pp. 1061–1065.
[18] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv preprint arXiv:1206.6392, 2012.
[19] S. Rashid, F. Shafait, and T. Breuel, "Scanning neural network for text line recognition," in DAS, Gold Coast, Australia, March 2012, pp. 105–109.
[20] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. Springer-Verlag, 1992.
[21] S. Setlur, S. Kompalli, V. Ramanaprasad, and V. Govindaraju, "Creation of data resources and design of an evaluation test bed for devanagari script recognition," March 2003, pp. 55–61.
[22] B. Chaudhuri and U. Pal, "An ocr system to read two indian language scripts: Bangla and devnagari (hindi)," in Document Analysis and Recognition, 1997, Proceedings of the Fourth International Conference on, vol. 2. IEEE, 1997, pp. 1011–1015.
[23] IIIT. (2014, Jun.) http://ltrc.iiit.ac.in/corpus/corpus.html.
[24] Tesseract. (2014, Jun.) http://code.google.com/p/tesseract-ocr/.