

Can we build language-independent OCR using LSTM networks?

Adnan Ul-Hasan and Thomas M. Breuel
Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
adnan@cs.uni-kl.de, tmb@cs.uni-kl.de

ABSTRACT
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems and also narrows the range of texts that an OCR system can be used with. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore the question to what extent LSTM models can be used for multilingual OCR without the use of language models. To do this, we measure the cross-language performance of LSTM models trained on different languages. LSTM models show good promise for language-independent OCR: the recognition errors are very low (around 1%) without any language model or dictionary correction.

Keywords
MOCR, LSTM Networks, RNN

1. INTRODUCTION
Multilingual OCR (MOCR) is of interest for many reasons: digitizing historical books containing two or more scripts, bilingual books, dictionaries, and books with line-by-line translation are a few of the reasons to want reliable multilingual OCR systems. However, MOCR also presents several unique challenges, as Popat [1] pointed out in the context of the Google Books project (see http://en.wikipedia.org/wiki/Google_Books). Some of the unique challenges are:

• Multiple scripts/languages on a page (multi-script identification).
• Multiple languages in the same or similar fonts, like Arabic-Persian and English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, e.g. 18th-century English, Fraktur (historical German), etc.

There have been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [2] is one such example. Its basic classification is based on hierarchical shape-classification: the character set is first reduced to a few characters, and at the last stage the test sample is matched against the representatives of this short list. Although Tesseract can be used for a variety of languages (due to the support available for many languages), it cannot be used as an all-in-one solution for situations where we have multiple scripts together.

The usual approach to the multilingual OCR problem is to somehow combine two or more separate classifiers [3], as it is believed that reasonable OCR output for a single script cannot be obtained without sophisticated post-processing steps such as language modelling, dictionary-based correction of OCR errors, font adaptation, etc. Natarajan et al. [4] proposed an HMM-based script-independent multilingual OCR system whose feature extraction, training and recognition components are all language-independent; however, it uses language-specific word lexica and language models for recognition. To the best of our knowledge, no single OCR method has been proposed that can achieve very low error rates without the aforementioned sophisticated post-processing techniques. However, recent experiments on English and German using LSTM networks [5] have shown that reliable OCR results can be obtained without such techniques.

Our hypothesis for multilingual OCR is that if a single model can be obtained, at least for a family of scripts (e.g. Latin, Arabic, Indic), we can then use this single model to recognize all scripts of that particular family, thereby reducing the effort of combining multiple classifiers. Since LSTM networks can achieve very low error rates without a language-modelling post-processing step, they are good candidates for multilingual OCR.

In this paper, we report the results of applying LSTM networks to the multilingual OCR problem. The basic aim is to benchmark to what extent LSTM networks rely on language modelling to predict the correct labels, and whether we can do better without language modelling and other post-processing steps. Additionally, we also want to see how well LSTM networks use context to recognize a particular character. Specifically, we trained LSTM networks for English, German, French and a mixed set of these three languages, and tested them on each other. The LSTM-based models achieve very high recognition accuracy without the aid of language modelling, and they show good promise for multilingual OCR tasks.
Figure 1: Some sample images from our database. There are 96 variations of standard fonts in common use; e.g. for the Times true-type font, the normal, italic, bold and italic-bold variations were included. Also note that these images were degraded to reflect scanning artefacts.

In what follows, the preprocessing step is described in the next section, Section 3 describes the configuration of the LSTM networks used in the experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.

2. PREPROCESSING
Scale and the relative position of a character are important features for distinguishing characters in Latin script (and in some other scripts). Text-line normalization is therefore an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary, created from a collection of text lines, contains information about the x-height, the baseline (geometric features) and the shape of individual characters. These models are then used to normalize any text-line image.
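As a rough illustration of what line normalization involves, the sketch below rescales a line image using a simple ink-profile heuristic; the trainable, shape-based model of [5] is considerably more robust, and the helper below is purely hypothetical:

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_line(img, target_height=32, body_fraction=0.5):
    """Crude line normalization: estimate the vertical ink extent,
    rescale the line so that this body spans a fixed fraction of the
    target height, and center it on a target_height-tall canvas.
    (Illustration only; the trainable, shape-based model of [5] also
    uses character shape and baseline information.)"""
    ink = img > img.mean()                    # rough binarization
    rows = np.nonzero(ink.any(axis=1))[0]     # rows containing ink
    top, bottom = rows[0], rows[-1]
    scale = body_fraction * target_height / max(bottom - top + 1, 1)
    scaled = zoom(img.astype(float), scale, order=1)
    out = np.zeros((target_height, scaled.shape[1]))
    h = min(target_height, scaled.shape[0])
    off = (target_height - h) // 2            # vertical centering
    out[off:off + h, :] = scaled[:h, :]
    return out
```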
3. LSTM NETWORKS
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures like Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier architectures.

Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their shortcomings are attributed mainly to the vanishing gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem. It is a highly non-linear recurrent network with multiplicative "gates" and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are then connected to a single output layer. To avoid the requirement of segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.

For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]; we found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all the experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm ("CTC", connectionist temporal classification [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism); this implementation has been released in open-source form [15] (OCRopus version 0.7).
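The following sketch illustrates the usual best-path heuristic for such decoding (take the most likely class per frame, collapse repeated labels, drop blanks); it shows the general idea only, not the exact decoder of [14] or of our implementation:

```python
import numpy as np

def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path decoding: argmax class per frame, collapse runs of
    repeated labels, and drop blanks. A common heuristic for mapping
    frame-wise network outputs to a symbol sequence."""
    best = frame_probs.argmax(axis=1)      # (frames,) class indices
    out, prev = [], blank
    for c in best:
        if c != blank and c != prev:       # collapse repeats, skip blanks
            out.append(alphabet[c])
        prev = c
    return "".join(out)

# e.g. with alphabet[1] == 'a' and alphabet[2] == 'b',
# per-frame argmax [1, 1, 0, 2, 2] decodes to "ab"
```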
During the training stage, randomly chosen input text-line images are presented as 1D sequences to the forward-propagation step through the LSTM cells, and the forward-backward alignment of the output is then performed. The errors are then back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. It is to be noted that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The implicit features in the 1D sequence are the baseline and the x-height of individual characters.
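Schematically, one training step looks like the sketch below, written here with PyTorch's built-in CTC loss as a modern stand-in for the paper's own implementation; the network size and optimizer settings follow Section 4.2, while all names and data handling are illustrative only:

```python
import torch
import torch.nn as nn

class BiLSTMRecognizer(nn.Module):
    """1D bidirectional LSTM over the column slices of a normalized
    text-line image, with a per-frame output layer over the alphabet
    plus a CTC blank (index 0)."""
    def __init__(self, height=32, hidden=100, n_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=height, hidden_size=hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes + 1)

    def forward(self, lines):                 # lines: (batch, width, height)
        seq, _ = self.lstm(lines)             # each pixel column = one frame
        return self.out(seq).log_softmax(-1)  # (batch, width, classes)

model = BiLSTMRecognizer()
ctc = nn.CTCLoss(blank=0)
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

def train_step(lines, targets, target_lens):
    """One update: forward pass, CTC forward-backward alignment of the
    frame-wise outputs against the transcript, back-propagation, update.
    `targets` holds label indices >= 1 (0 is reserved for the blank)."""
    logp = model(lines).permute(1, 0, 2)      # CTCLoss expects (T, N, C)
    input_lens = torch.full((lines.size(0),), logp.size(0), dtype=torch.long)
    loss = ctc(logp, targets, input_lens, target_lens)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```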
4. EXPERIMENTAL EVALUATION
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling or other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed. We trained four separate LSTM networks, for English, German, French and a mixed set of all these languages. For testing, we have a total of 16 permutations: each LSTM network was tested on the script it was trained for and on the other three scripts, e.g. the network trained on German was tested on German, French, English and mixed-script data.

Table 1: Statistics on the number of text-line images in the English, French, German and mixed-script datasets.

Language        Total      Training   Test
English         85,350     81,600     4,750
French          85,350     81,600     4,750
German          114,749    110,400    4,349
Mixed-script    85,350     81,600     4,750
Table 2: Experimental results of applying LSTM networks to multilingual OCR. These results support our hypothesis that a single LSTM model trained on a mixture of scripts (from a single family of scripts) can be used to recognize the text of the individual family members. Note that the error rates for testing the German-trained network on French, and the English-trained network on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), to correctly gauge the effect of the language models of a particular language. LSTM networks trained for individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.

Script \ Model   English (%)   German (%)   French (%)   Mixed (%)
English          0.5           1.22         4.1          1.06
German           2.04          0.85         4.7          1.2
French           1.8           1.4          1.1          1.05
Mixed-script     1.7           1.1          2.9          1.1

These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the ground truth, with accuracy measured at the character level.
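This measure can be computed with the standard Levenshtein dynamic program; the sketch below (a hypothetical helper, not the paper's evaluation code) makes the definition concrete:

```python
def cer(truth, pred):
    """Character error rate: (insertions + deletions + substitutions)
    needed to turn `pred` into `truth`, divided by the length of
    `truth`. Standard Levenshtein dynamic program."""
    n, m = len(truth), len(pred)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (truth[i - 1] != pred[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

# e.g. cer("Straße", "Strasse") -> 2/6 ≈ 0.33
```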
4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a set of UTF-8 encoded text files and a set of true-type fonts; with these two things available, one can artificially generate any number of text-line images. The utility also provides control to induce scanning artefacts such as distortion, jitter, and other degradations. Separate corpora of text-line images in German, English and French were generated in commonly used fonts (including bold, italic and italic-bold) from freely available literature. These images were degraded using the degradation models of [17] to reflect scanning artefacts; there are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test datasets; statistics on the number of text-line images in each of the four scripts are given in Table 1.
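As a rough sketch of what such degradation involves, the toy function below perturbs a line image using stand-ins for the four parameters named above; the actual defect models of [17] differ in detail, and the function is illustrative only:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def degrade(img, elongation=1.0, jitter=0.3, sensitivity=0.1,
            threshold=0.5, rng=None):
    """Toy degradation using stand-ins for the four parameters named
    above (elastic elongation, jitter, sensitivity, threshold)."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    # Elastic elongation: smooth random horizontal displacement field.
    dx = gaussian_filter(rng.normal(0, elongation, (h, w)), sigma=8)
    # Jitter: small independent per-pixel displacements.
    jy = rng.normal(0, jitter, (h, w))
    jx = rng.normal(0, jitter, (h, w))
    warped = map_coordinates(img.astype(float), [yy + jy, xx + dx + jx],
                             order=1)
    # Sensitivity: additive pixel noise before the final binarization.
    noisy = warped + rng.normal(0, sensitivity, (h, w))
    return (noisy > threshold).astype(np.uint8)  # threshold parameter
```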
4.2 Parameters
The text lines were normalized to a height of 32 pixels in the preprocessing step. Both the left-to-right and right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4, and the momentum to 0.9. Training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted, and the network corresponding to the minimum training error was used for test-set evaluation.

4.3 Results
Since English has no umlauts (German) or accented letters (French), words containing those special characters were omitted from the recognition results when testing the LSTM model trained for German on French, and the model trained for English on French and German. The reason for doing this was to be able to correctly gauge the effect of not using language models: if those words were not removed, the resulting error would also contain a proportion of errors due to character mis-recognition. By removing the words with special characters, the true performance of an LSTM network trained for a language with a smaller alphabet on a language with a larger alphabet can be evaluated. It should be noted that these results were obtained without the aid of any post-processing step, like language modelling or the use of dictionaries to correct OCR errors.
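A minimal sketch of this filtering step, assuming a simple alphabet-membership test (the paper does not specify its tooling, so the helper and alphabet below are hypothetical):

```python
def filter_words(gt_line, alphabet):
    """Remove words containing characters outside `alphabet` (e.g.
    umlauts or accented letters unknown to an English-only model)
    before scoring, mirroring the evaluation protocol described above."""
    keep = [w for w in gt_line.split() if all(c in alphabet for c in w)]
    return " ".join(keep)

english = set("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?'\"-")
# filter_words("das Mädchen läuft schnell", english) -> "das schnell"
```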
The LSTM model trained on mixed data was able to obtain similar recognition results (around 1% recognition error) when applied to English, German and French individually. The other results indicate a small language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.

To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of submission date) for English, French and German to the same test data. The Tesseract system achieved higher error rates than the LSTM-based models. Tesseract's model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and mixed data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, French, German and mixed data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error when applied to English, French, German and mixed data respectively. These results show that the absence of language modelling, or applying a different language model, affects recognition. Since no Tesseract model for mixed data is available, the effect of evaluating such a model on the individual scripts could not be computed.

5. DISCUSSION AND CONCLUSIONS
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such models as a post-processing step). They show great promise in learning the various shapes of a certain character in different fonts and under degradations (as is evident from our highly versatile data).
Table 3: Sample outputs from the four LSTM networks trained on English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize the special characters of other languages, since they were not part of its training; it is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on the mixed data of a family of scripts and use it to recognize the individual languages of that family with very low recognition error.

[The body of this table, pairing each sample text-line image with the transcriptions produced by the English, German, French and mixed-data models, consists of image content that cannot be reproduced here.]

The language dependence is observable, but its effects are small compared to other state-of-the-art OCR systems, where the absence of language models results in relatively poor performance. To gauge the language dependence more precisely, one could evaluate LSTM performance by training LSTM networks on randomly generated data using n-gram statistics and testing those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.

In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the same language for which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.

Most of the errors made by the LSTM network trained on mixed data are non-recognition (deletion) of certain characters like l, t, r and i. These errors may be removed by better training.

Looking at the first column of Table 4 (applying the LSTM network trained for English to the other three scripts), most of the errors are due to confusions between characters of similar shapes, like I with l (and vice versa), Z with 2, and c with e. Two confusions, namely Z with A and Z with L, are interesting because there is apparently no shape similarity between them. One possible explanation is that Z is the least frequent letter in English (see http://en.wikipedia.org/wiki/Letter_frequency), so there may be few Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also present in other models) are the unrecognized "space" and "'" (∅ in Table 4 denotes that the letter was deleted).

For the LSTM networks trained on German (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained for French to other scripts are confusions of w/W with v/V. An interesting observation, which could be a possible reason for this behaviour, is that the relative frequency of w is very low in French (see the letter-frequency statistics referenced above). In other words, 'w' may be considered a special character with respect to French when applying the French model to German and English. This is thus a language-dependent issue, which is not observable in the case of mixed data.

This work can be extended in many directions. First, more European languages like Italian, Spanish and Dutch may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other families of scripts, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.
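Such confusion tables can be extracted from the same character alignment used for the error measure; the sketch below (a hypothetical helper) tallies the most frequent confusions in the "recognized ← truth" convention of Table 4:

```python
from collections import Counter

def top_confusions(pairs, k=5):
    """Tally the most frequent confusions from aligned
    (truth, prediction) character pairs, in the 'recognized <- truth'
    convention of Table 4; '∅' stands for a deletion (garbage class).
    `pairs` would come from a Levenshtein alignment such as the one
    used for the error measure above."""
    counts = Counter()
    for truth, pred in pairs:
        if truth != pred:
            counts[(pred if pred else "∅", truth)] += 1
    return [f"{p} <- {t}" for (p, t), _ in counts.most_common(k)]

# e.g. aligning truths "Z", "u", "I" with predictions "2", "u", "l":
# top_confusions([("Z", "2"), ("u", "u"), ("I", "l")]) -> ['2 <- Z', 'l <- I']
```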
Table 4: Top confusions when applying the LSTM models across languages. The notation "x ← y" means that the ground-truth character y was recognized as x; ∅ denotes the garbage class, i.e. the character was not recognized at all. Confusions of a model on the language for which it was trained are not listed, as they are not relevant to the cross-language analysis. When the LSTM network trained on English was applied to other scripts, the resulting top errors are shape confusions between similar characters; non-recognition of "space" and "'" are other noticeable errors. For the network trained on German, most errors are deletions of characters. Confusions of w/W with v/V are the top errors when the network trained on French was applied to other scripts.

Test script: English
  German model:   ∅ ← space, ∅ ← c, ∅ ← t, ∅ ← 0, v ← y
  French model:   v ← w, vv ← w, ∅ ← space, ∅ ← w, l ← I
  Mixed model:    ∅ ← space, ∅ ← t, ∅ ← 0, l ← I, ∅ ← l

Test script: German
  English model:  l ← I, L ← Z, A ← Z, c ← e, 2 ← Z
  French model:   v ← w, û ← ü, V ← W, ∅ ← space, vv ← w
  Mixed model:    ∅ ← space, ∅ ← t, ∅ ← l, ∅ ← i, ∅ ← r

Test script: French
  English model:  ∅ ← 0, ∅ ← space, I ← l, t ← l, I ← !
  German model:   ∅ ← space, ∅ ← 0, e ← é, ∅ ← c, ∅ ← l
  Mixed model:    ∅ ← space, ∅ ← i, e ← é, ∅ ← l, ∅ ← 0

Test script: Mixed-script
  English model:  ∅ ← 0, l ← I, I ← l, ∅ ← space, t ← l
  German model:   ∅ ← space, ∅ ← 0, g ← q, e ← é, T ← l0
  French model:   v ← w, ô ← ö, â ← ä, V ← W, û ← ü

6. REFERENCES
[1] A. C. Popat, "Multilingual OCR Challenges in Google Books," 2012. [Online]. Available: http://dri.ie/sites/default/files/files/popat_multilingual_ocr_challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR," in Int. Workshop on Multilingual OCR, Jul. 2009.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S. Alam, "Multilingual OCR (MOCR): An Approach to Classify Words to Languages," Int'l Journal of Computer Applications, vol. 32, no. 1, pp. 46–53, Oct. 2011.
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and J. Makhoul, "Multilingual Machine Printed OCR," IJPRAI, vol. 15, no. 1, pp. 43–63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High Performance OCR for English and Fraktur using LSTM Networks," in Int. Conf. on Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, May 2009.
[8] J. L. Elman, "Finding Structure in Time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] H. Jaeger, "Tutorial on Training Recurrent Neural Networks, Covering BPTT, RTRL, EKF and the 'Echo State Network' Approach," Sankt Augustin, Tech. Rep., 2002.
[10] A. W. Senior, "Off-line Cursive Handwriting Recognition using Recurrent Neural Networks," Ph.D. dissertation, University of Cambridge, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pennsylvania, USA, 2006, pp. 369–376.
[14] A. Graves, "RNNLIB: A recurrent neural network library for sequence learning problems." [Online]. Available: http://sourceforge.net/projects/rnnl
[15] "OCRopus – Open Source Document Analysis and OCR System." [Online]. Available: https://code.google.com/p/ocropus
[16] T. M. Breuel, "The OCRopus open source OCR system," in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. New York: Springer-Verlag, 1992.
[18] R. Smith, "An Overview of the Tesseract OCR Engine," in ICDAR, 2007, pp. 629–633.
