2013 Ul-Hasan ICDAR: Can We Build Language Independent OCR Using LSTM Networks
In what follows, the preprocessing step is reported in the next section, Section 3 describes the configuration of the LSTM network used in the experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.

2. PREPROCESSING
Scale and relative position of a character are important features for distinguishing characters in Latin script (and some other scripts), so text-line normalization is an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary, created from a collection of text lines, contains information about the x-height, baseline (geometric features) and shape of individual characters. These models are then used to normalize any text-line image.

3. LSTM NETWORKS
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures like Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier architectures.
Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their shortcomings are attributed mainly to the vanishing gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem. It is a highly non-linear recurrent network with multiplicative "gates" and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are then connected to a single output layer. To avoid the requirement of segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.
For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]. We found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all the experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm ("CTC", connectionist temporal classification; [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism). This implementation has been released in open-source form [15] (ocropus version 0.7).
During the training stage, randomly chosen input text-line images are presented as 1D sequences to the forward-propagation step through the LSTM cells, and the forward-backward alignment of the output is then performed. Errors are then back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. It is to be noted that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The implicit features in the 1D sequence are the baseline and x-heights of individual characters.

4. EXPERIMENTAL EVALUATION
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling or other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed. We trained four separate LSTM networks for English, German, French, and a mixed set of all these languages. For testing, we have a total of 16 permutations: each LSTM network was tested on

Table 1: Statistics on the number of text-line images in each of the English, French, German and mixed-script datasets.

Language        Total      Training   Test
English         85,350     81,600     4,750
French          85,350     81,600     4,750
German          114,749    110,400    4,349
Mixed-script    85,350     81,600     4,750
Table 2: Experimental results of applying LSTM networks to multilingual OCR. These results validate our hypothesis that a single LSTM model trained on a mixture of scripts (from a single script family) can be used to recognize the text of the individual family members. Note that the error rates for testing the LSTM network trained for German on French, and the networks trained for English on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), to correctly gauge the effect of the language model of a particular language. LSTM networks trained for individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.

Script \ Model   English (%)   German (%)   French (%)   Mixed (%)
English          0.5           1.22         4.1          1.06
German           2.04          0.85         4.7          1.2
French           1.8           1.4          1.1          1.05
Mixed-script     1.7           1.1          2.9          1.1
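The error rates in Table 2 use the measure described in Section 4: the ratio of character insertions, deletions and substitutions relative to the ground truth, i.e. a character-level edit distance. A minimal sketch of that measure (standard Levenshtein distance; not the authors' exact scoring code):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref` (Levenshtein)."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the ground-truth length."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, `character_error_rate("abcd", "abd")` is 0.25: one deleted character over a four-character ground truth.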
the respective script and on the other three scripts, e.g. testing the LSTM network trained on German on German, French, English and mixed-script. These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the ground truth; accuracy was measured at the character level.

4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a collection of UTF-8 encoded text files and a set of TrueType fonts. With these two inputs, one can artificially generate any number of text-line images. The utility also provides controls to induce scanning artefacts such as distortion, jitter, and other degradations. Separate corpora of text-line images in the German, English and French languages were generated in commonly used fonts (including bold, italic, and italic-bold) from freely available literature. These images were degraded using the degradation models of [17] to reflect scanning artefacts. There are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test datasets. Statistics on the number of text-line images in each of the four scripts are given in Table 1.

4.2 Parameters
The text lines were normalized to a height of 32 in the preprocessing step. Both the left-to-right and the right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4, and the momentum was set to 0.9. The training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted. The network corresponding to the minimum training error was used for test-set evaluation.

4.3 Results
Since there are no umlauts (German) or accented letters (French) in English, when testing the LSTM model trained for German on French, and the model trained for English on French and German, the words containing those special characters were omitted from the recognition results. The reason for doing this was to be able to correctly gauge the effect of not using language models. If those words were not removed, the resulting error would also contain a proportion of errors due to character mis-recognition; by removing the words with special characters, the true performance of an LSTM network trained on a language with a smaller alphabet can be evaluated on a language with a larger alphabet. It should be noted that these results were obtained without the aid of any post-processing step, such as language modelling or the use of dictionaries to correct OCR errors.
The LSTM model trained on mixed data was able to obtain similar recognition results (around 1% recognition error) when applied to English, German and French script individually. The other results indicate a small language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.
To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of the submission date) for English, French and German to the same test data. The Tesseract system achieved higher error rates than the LSTM-based models. Tesseract's model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and mixed data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, French, German and mixed data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error when applied to English, French, German and mixed data respectively. These results show that the absence of language modelling, or the application of a different language's model, affects recognition. Since no model for mixed data is available for Tesseract, the effect of evaluating such a model on the individual scripts could not be computed.

5. DISCUSSION AND CONCLUSIONS
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such models as a post-processing step). They show great promise in learning the various shapes of a given character in different fonts and under degradations (as is evident from our highly versatile data). The language dependence is observ-
Table 3: Sample outputs from the four LSTM networks trained on English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize special characters of other languages, as they were not part of its training; it is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on the mixed data of a script family and use it to recognize the individual languages of that family with a very low recognition error.

[Three sample text-line images, each followed by the recognition outputs of the English, German, French and mixed-data models; the images are not reproduced here.]
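As the caption of Table 3 notes, errors on special characters that a model has never seen must be excluded before scoring. Mechanically, this means dropping every ground-truth word containing a character outside the model's alphabet. A small illustrative sketch (a hypothetical helper, not the authors' evaluation code):

```python
import string

# Characters an English-only model can emit; umlauts and accented
# letters fall outside this set and cannot be recognized by that model.
ENGLISH_ALPHABET = set(string.ascii_letters + string.digits +
                       string.punctuation + " ")

def filter_scorable_words(ground_truth: str, alphabet: set) -> list:
    """Keep only the words whose characters all lie inside `alphabet`,
    so that errors on foreign special characters are excluded."""
    return [w for w in ground_truth.split()
            if all(ch in alphabet for ch in w)]
```

For example, scoring the English model on the German phrase "über die Straße" would keep only "die"; "über" and "Straße" are dropped because ü and ß are outside the model's alphabet.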
able, but its effects are small compared to other state-of-the-art OCR systems, where the absence of language models results in relatively poor results. To gauge the language dependence more precisely, one could evaluate the performance of LSTM networks by training them on randomly generated data using n-gram statistics and testing those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.
In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the same language for which it was trained is not discussed here, as it is not relevant to the discussion of the cross-language performance of LSTM networks.
Most of the errors caused by the LSTM network trained on mixed data are non-recognition (deletion) of certain characters such as l, t, r, i. These errors may be removed by better training.
Looking at the first column of Table 4 (applying the LSTM network trained for English to the other 3 scripts), most of the errors are due to confusions between characters of similar shape, like I with l (and vice versa), Z with 2, and c with e. Two confusions, namely Z with A and Z with L, are interesting, as there is apparently no shape similarity between them. One possible explanation for this behaviour is that Z is the least frequent letter in English (see http://en.wikipedia.org/wiki/Letter_frequency), so there may not be many Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also present in the other models) are unrecognized space and ' (∅ denotes that the letter was deleted).
For the LSTM networks trained on the German language (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained for French to the other scripts are confusions of w/W with v/V. An interesting observation, which could be a possible reason for this behaviour, is that the relative frequency of w is very low in French (see the letter-frequency reference above). In other words, 'w' may be considered a special character with respect to the French language when applying the French model to German and English. So this is a language-dependent issue, which is not observable in the case of mixed data.
This work can be extended in many directions. First, more European languages, such as Italian, Spanish and Dutch, may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other script families, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.

Table 4: Top confusions when applying the LSTM models to the various tasks. The confusions for an LSTM model on the language for which it was trained are not listed, as they are not needed for the present paper. ∅ denotes the garbage class, i.e. the character is not recognized at all (a deletion); the original symbol for this class was lost in extraction and is rendered here as ∅. When the LSTM network trained on English was applied to recognize the other scripts, the resulting top errors are similar: shape confusions between characters. Non-recognition of "space" and " ' " are other noticeable errors. For the network trained on the German language, most errors are due to deletion of characters. Confusions of w/W with v/V are the top confusions when the LSTM network trained on French was applied to the other scripts.

Script \ Model   English      German       French       Mixed
English          -            ∅ ← space    v ← w        ∅ ← space
                              ∅ ← c        vv ← w       ∅ ← t
                              ∅ ← t        ∅ ← space    ∅ ← 0
                              ∅ ← 0        ∅ ← w        l ← I
                              v ← y        l ← I        ∅ ← l
German           l ← I        -            v ← w        ∅ ← space
                 L ← Z                     û ← ü        ∅ ← t
                 A ← Z                     V ← W        ∅ ← l
                 c ← e                     ∅ ← space    ∅ ← i
                 2 ← Z                     vv ← w       ∅ ← r
French           ∅ ← 0        ∅ ← space    -            ∅ ← space
                 ∅ ← space    ∅ ← 0                     ∅ ← i
                 I ← l        e ←                       e ← é
                 t ← l        ∅ ← c                     ∅ ← l
                 I ← !        ∅ ← l                     ∅ ← 0
Mixed-script     ∅ ← 0        ∅ ← space    v ← w        -
                 l ← I        ∅ ← 0        ô ← ö
                 I ← l        g ← q        â ← ä
                 ∅ ← space    e ←          V ← W
                 t ← l        T ← l0       û ← ü

6. REFERENCES
[1] A. C. Popat, "Multilingual OCR Challenges in Google Books," 2012. [Online]. Available: http://dri.ie/sites/default/files/files/popat_multilingual_ocr_challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR," in Int. Workshop on Multilingual OCR, Jul. 2009.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S. Alam, "Multilingual OCR (MOCR): An Approach to Classify Words to Languages," Int'l Journal of Computer Applications, vol. 32, no. 1, pp. 46-53, Oct. 2011.
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and J. Makhoul, "Multilingual Machine Printed OCR," IJPRAI, vol. 15, no. 1, pp. 43-63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High Performance OCR for English and Fraktur using LSTM Networks," in Int. Conf. on Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855-868, May 2009.
[8] J. L. Elman, "Finding Structure in Time," Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[9] H. Jaeger, "Tutorial on Training Recurrent Neural Networks, Covering BPTT, RTRL, EKF and the 'Echo State Network' approach," Sankt Augustin, Tech. Rep., 2002.
[10] A. W. Senior, "Off-line Cursive Handwriting Recognition using Recurrent Neural Networks," Ph.D. dissertation, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157-166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pennsylvania, USA, 2006, pp. 369-376.
[14] A. Graves, "RNNLIB: A recurrent neural network library for sequence learning problems." [Online]. Available: http://sourceforge.net/projects/rnnl
[15] "OCRopus - Open Source Document Analysis and OCR system." [Online]. Available: https://code.google.com/p/ocropus
[16] T. M. Breuel, "The OCRopus open source OCR system," in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. New York: Springer-Verlag, 1992.
[18] R. Smith, "An Overview of the Tesseract OCR Engine," in ICDAR, 2007, pp. 629-633.