
The RODRIGO database

N. Serrano, F. Castro, A. Juan

DSIC/ITI, Universitat Politècnica de València


Camí de Vera, s/n, 46022 València, SPAIN
{nserrano,francas,ajuan}@iti.upv.es
Abstract
Annotation of digitized pages from historical document collections is very important to research on automatic extraction of
text blocks, lines, and handwriting recognition. We have recently introduced a new handwritten text database, GERMANA,
which is based on a Spanish manuscript from 1891. To our knowledge, GERMANA is the first publicly available database
mostly written in Spanish and comparable in size to standard databases. In this paper, we present another handwritten
text database, RODRIGO, completely written in Spanish and comparable in size to GERMANA. However, RODRIGO
comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are
more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that
of GERMANA. We also provide baseline results of handwriting recognition for reference in future studies, using standard
techniques and tools for preprocessing, feature extraction, HMM-based image modelling, and language modelling.

1. Introduction

There are huge historical document collections residing in libraries, museums and archives that are currently being digitized for preservation purposes and to make them available worldwide through large, on-line digital libraries. The main objective, however, is not simply to provide access to raw images of digitized documents, but to annotate them with their real informative content and, in particular, with text transcriptions. However, automatic extraction of text blocks, lines, and handwriting recognition are still open research problems (Ramos et al., 2010; Likforman-Sulem et al., 2007; Bertolami and Bunke, 2008).

In (Pérez et al., 2009), we presented a new handwritten text database, GERMANA, to facilitate empirical comparison of different approaches to automatic extraction of text blocks, lines, and handwriting recognition. GERMANA is the result of digitizing and annotating a 764-page Spanish manuscript from 1891, in which most pages only contain nearly calligraphed text written on ruled sheets of well-separated lines. To our knowledge, it is the first publicly available database for handwriting research, mostly written in Spanish and comparable in size to standard databases such as IAM (Marti and Bunke, 2002; Su et al., 2007).

In this paper, we present another handwritten text database, which will be referred to as RODRIGO. In this case, we have selected a manuscript much older than that of GERMANA, from 1545, which is publicly available in digitized form, at 300dpi in true colors, from the Spanish “Ministerio de Cultura” web site (Rod, 2006). The original manuscript is an 853-page bound volume, entitled “Historia de España del arçobispo Don Rodrigo”, and completely written in old Castilian (Spanish) by a single author. We carefully annotated all text blocks, lines and transcriptions, resulting in approximately 20K lines and 231K running words from a lexicon of 17K words, that is, very similar to GERMANA in size. The main purpose of this work is to make this annotation known to researchers and to provide an adequate reference for future studies. The interested reader can download it from (Rod, 2010).

Like GERMANA, RODRIGO is not a particularly difficult task for text block and line detection, since most pages only contain a single text block of nearly calligraphed handwriting on well-separated lines. It is also a single-author manuscript on a limited-domain task and, unlike GERMANA, it is written only in Spanish, which makes it easier in this respect. Nevertheless, RODRIGO comes from a much older manuscript, and thus the typical difficult characteristics of historical documents are more evident. In particular, the writing style, which has clear Gothic influences, is significantly more complex than that of GERMANA.

In what follows, we first describe the manuscript and the database in Sections 2 and 3, respectively. Then, in Section 4, some preliminary results are reported using a standard, HMM-based recognizer. Finally, conclusions and future work are discussed in Section 5.

Work supported by the EC (FSE), the Spanish Government (MEC, MICINN, “Plan E”, under grants MIPRCV “Consolider Ingenio 2010” CSD2007-00018, iTransDoc TIN2006-15694-CO2-01, MITTRAL TIN2009-14633-C03-01 and FPU AP2007-02867) and the Generalitat Valenciana (grant Prometeo/2009/014).

2. The manuscript

As said above, the RODRIGO database corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, and completely written in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles of Spanish history. Most pages only contain a single text block of nearly calligraphed handwriting on well-separated lines. This can be seen in Fig. 1, where it is also apparent that the writing style has clear Gothic influences (Millares and Ruiz, 1983).

Other characteristic details of RODRIGO that can be clearly appreciated in Fig. 1 are:

• The author tends to embellish the writing, especially in broad white spaces, resulting in the extension of some ascenders and descenders across whole words.

• Natural blank spaces between successive words are often omitted; e.g., the words “de la” are written as a single word “dela” in the third line from the bottom of page 15. Sometimes, on the contrary, artificial blank spaces are inserted within a single word; e.g., the word “llegaronse” is written as two words, “llegaron se”.

• Each chapter should begin with a dropcap, but the manuscript contains no dropcaps, probably because it was never brought to an artist to do so. Instead, there is a blank area in each position where a dropcap should have been inserted and, in most cases, the corresponding letter is written in small size.

• The first words in each even page are also copied in the bottom right corner of its preceding page.

Figure 1: Pages 15 and 16 of RODRIGO.

3. The database

The manuscript was carefully digitized by experts from the Spanish Ministry of Culture, at 300dpi in true colors, and it is publicly available at (Rod, 2006). As with historical documents in general, scanned pages have noise effects like spots, tears, ink fading and transparency of the back side. Also, they show a slight warping due to book binding. Nevertheless, the manuscript can be easily read and thus we decided not to apply any preprocessing to it (apart from desaturation) for ground-truth annotation.

We followed an annotation procedure very similar to the one used for the GERMANA database (Pérez et al., 2009). First, all text blocks were annotated with minimal enclosing rectangles and, within each text block, each text line was marked by its (straight) baseline. This was done semi-automatically by means of the GIDOC prototype (Serrano et al., 2010). All blocks and baselines automatically detected were also manually supervised, and corrected when needed.

On the other hand, the whole manuscript was transcribed line by line, by a paleography expert, in accordance with the following transcription rules:

• Page and line breaks are copied exactly.

• Missing natural blank spaces between successive words are indicated by the symbol “⌣”.

• Inserted artificial blank spaces within words are indicated by the symbol “ ”.

• No spelling mistakes are corrected.

• No case or accentuation change is done.

• Punctuation signs are copied as they appear.

• Word abbreviations are first copied verbatim, except for subscripts and superscripts, which are written in LaTeX-like notation as _{sub} and ^{super}, respectively. Then, they are followed by the corresponding word between brackets. Thus, for instance, qⁱer. is transcribed as q^{i}er[quier].

• The symbol “$” is appended to each line having a broken word at its end.

The total time required for a single expert to manually annotate (text blocks, baselines and transcriptions of) the whole manuscript was estimated as 500 hours; that is, approximately 35 minutes per page on average. The complete annotation of RODRIGO is publicly available, for non-commercial use, at (Rod, 2010). It comprises about 20K text lines and 231K running words from a lexicon of 17K words, which is comparable in size to standard databases such as IAM (Marti and Bunke, 2002; Su et al., 2007). It is worth noting that more than half of the words in the lexicon (54.4%) are singletons (hapax legomena), but they only account for 4.1% of the running words. Please see Table 1 for more details.

Pages               853
Lines               20357
Running words       232K
Perplexity          166
Lexicon size        17.3K
Singletons (%)      54.4
Character set size  115

Table 1: Basic statistics of the RODRIGO text transcriptions (with isolated punctuation signs and abbreviations substituted by their corresponding words). Perplexity was computed using a bigram language model and a 100-fold cross-validation experiment. Singletons refers to words occurring exactly once.

4. Experiments

As discussed in the introduction, RODRIGO is introduced to facilitate comparison of different approaches to automatic extraction of text blocks, lines, and handwriting recognition. In this section, however, we will restrict ourselves to (automatic) transcription (handwriting recognition). More specifically, our aim is simply to provide baseline results for reference in future studies, using standard techniques and tools; i.e., HMM-based text image modeling and n-gram language modeling (Pérez et al., 2009).

Due to its sequential book structure, the most basic task on RODRIGO is to transcribe it line by line, from the beginning to the end. We assume that an automatic transcription system is used, and that each (automatically) transcribed line is supervised and, if necessary, amended by an expert. Clearly, after processing a block of lines or pages, all supervised transcriptions may well be used for better (re-)training of image and language models, thus improving system accuracy (Serrano et al., 2010).

Taking into account the above discussion, we divided RODRIGO into 20 consecutive blocks of 1000 lines each (1-1000, 1001-2000, ..., 19001-20357), except for the last block, which contains the remaining 1357 lines. Then, after each of blocks 1 to 19, the system was (re-)trained using all blocks supervised up to that point, with block 2 also used for further adjustment of a few key parameters. After each retraining, the system accuracy was measured in terms of Word Error Rate (WER) on both the next block to supervise and the last block. The resulting curves are shown in Fig. 2. Also shown is the part of the WER due to the occurrence of out-of-vocabulary (OOV) words.

As expected, the WER on the last block decreases as the amount of training data increases. Interestingly, however, the curve of the WER on the next block reveals considerable fluctuations in the recognition complexity of intermediate blocks. Nevertheless, this curve also tends to decrease and, indeed, both curves converge to a WER around 36.5% for block 20. We think
that this WER is not too bad for effective computer-assisted transcription, though there is certainly room for significant improvement. Note that, as can be observed from the OOV curves, many errors are caused by the occurrence of out-of-vocabulary words.

[Plot: error rate (0 to 80) versus Training Lines (1000 to 19000); curves: WER on last block, OOVs on last block, WER on next block, OOVs on next block.]

Figure 2: Transcription Word Error Rate (WER) on RODRIGO as a function of the number of blocks (of 1000 lines) already supervised and thus available for training (Training lines). The WER is computed for both the next block to supervise (solid line with black circles) and the last block (lines 19001-20357). Also shown is the part of the WER due to the occurrence of out-of-vocabulary (OOV) words (dashed lines).

5. Conclusions and future work

A new handwritten text database, RODRIGO, has been presented to facilitate empirical comparison of different approaches to text line extraction and off-line handwriting recognition. RODRIGO is completely written in old Castilian (Spanish) by a single author and comparable in size to standard databases. Some preliminary empirical results have also been reported, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modeling, and language modeling. Although we think that there is room for significant improvements, the word error rates obtained are already acceptable for effective computer-assisted transcription. For future work, we plan to also provide annotated data in accordance with the guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative Consortium.

6. References

R. Bertolami and H. Bunke. 2008. Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recognition, 41:3452–3460.

L. Likforman-Sulem, A. Zahour, and B. Taconet. 2007. Text line segmentation of historical documents: a survey. International Journal on Document Analysis and Recognition (IJDAR), 9:123–138.

U. V. Marti and H. Bunke. 2002. The IAM-database: an English sentence database for off-line handwriting recognition. International Journal on Document Analysis and Recognition (IJDAR), 5:39–46.

A. Millares and J. M. Ruiz. 1983. Tratado de paleografía española, volume 1. Espasa-Calpe, 3rd edition.

D. Pérez, L. Tarazón, N. Serrano, F. Castro, O. Ramos, and A. Juan. 2009. The GERMANA database. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR 2009), pages 301–305, Barcelona (Spain).

O. Ramos, N. Serrano, and A. Juan. 2010. Interactive-predictive detection of handwritten text blocks. In Proc. of the 17th Document Recognition and Retrieval (DRR 2010), San Jose, CA (USA).

2006. The RODRIGO database: digitized data. bvpb.mcu.es.

2010. The RODRIGO database: annotated data. prhlt.iti.es/rodrigo.php.

N. Serrano, A. Sanchis, and A. Juan. 2010. Balancing error and supervision effort in interactive-predictive handwritten text recognition. In Proceedings of the 15th International Conference on Intelligent User Interfaces (IUI 2010), pages 373–376, Hong Kong (China), June.

T. Su, T. Zhang, and D. Guan. 2007. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. International Journal on Document Analysis and Recognition, 10:27–38.
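
As a complement to the transcription rules of Section 3, the following minimal Python sketch shows one way a transcribed line could be normalized before language model training: each abbreviation is replaced by the word given between brackets, and the “$” mark for a broken word at the end of the line is detected and removed. The function name normalize_line is hypothetical, the sketch assumes that “$” is the last non-blank character of the line, and the symbols for missing and inserted blank spaces are deliberately left untouched. It is an illustration only, not the tooling used to build the database.

```python
import re

# Assumption: an abbreviation is a run of non-blank characters immediately
# followed by its expansion between square brackets, e.g. q^{i}er[quier].
ABBREVIATION = re.compile(r"\S*?\[([^\]\s]+)\]")


def normalize_line(line):
    """Return (text with abbreviations expanded, True if the line ends in a broken word)."""
    text = line.rstrip()
    broken = text.endswith("$")  # "$" marks a broken word at the end of the line
    if broken:
        text = text[:-1].rstrip()
    # Keep only the bracketed word and drop the verbatim abbreviation.
    expanded = ABBREVIATION.sub(r"\1", text)
    return expanded, broken
```

For instance, a line such as "como q^{i}er[quier] $" would yield the text "como quier" together with a flag indicating that its last word continues on the next line.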

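
The lexicon figures of Table 1 can be checked against each other: 54.4% of a 17.3K-word lexicon is about 9.4K singletons, and 9.4K out of 232K running words is about 4.1%, as stated in Section 3. The minimal Python sketch below shows how such statistics could be computed from transcribed lines, assuming whitespace tokenization; the function name lexicon_statistics and the dictionary keys are choices made for this example only.

```python
from collections import Counter


def lexicon_statistics(lines):
    """Compute basic lexicon statistics from an iterable of transcribed lines."""
    # Assumption: words are separated by blank spaces.
    counts = Counter(word for line in lines for word in line.split())
    running_words = sum(counts.values())
    lexicon_size = len(counts)
    # Singletons (hapax legomena) are words occurring exactly once.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "running_words": running_words,
        "lexicon_size": lexicon_size,
        "singletons_pct": 100.0 * singletons / lexicon_size,
        "singleton_coverage_pct": 100.0 * singletons / running_words,
    }
```

Here singletons_pct corresponds to the "Singletons (%)" row of Table 1, while singleton_coverage_pct is the share of running words that the singletons account for.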

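
The evaluation protocol of Section 4 can be sketched as follows. This minimal Python example splits the 20357 lines into consecutive blocks of 1000 lines, with the remaining lines merged into the last block, and computes the WER as the word-level edit distance divided by the number of reference words. The function names (make_blocks, edit_distance, word_error_rate) are assumptions made for this illustration, the recognizer itself is not shown, and this is not the code used to obtain the results in Fig. 2.

```python
def make_blocks(num_lines, block_size=1000):
    """Return 1-based inclusive (first, last) line ranges; leftover lines join the last block."""
    num_blocks = max(1, num_lines // block_size)
    blocks = []
    for b in range(num_blocks):
        first = b * block_size + 1
        last = num_lines if b == num_blocks - 1 else (b + 1) * block_size
        blocks.append((first, last))
    return blocks


def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """WER (%) over a set of lines: total word edits over total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words
```

With these helpers, the protocol amounts to training on blocks 1 to k for k = 1, ..., 19 and, after each retraining, calling word_error_rate on the recognized lines of block k + 1 and of block 20.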