The RODRIGO Database: N. Serrano, F. Castro, A. Juan
2. The manuscript
As mentioned above, the RODRIGO database corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles of Spanish history. Most pages contain only a single text block of nearly calligraphed handwriting on well-separated lines. This can be seen in Fig. 1, where it is also apparent that the writing style has clear Gothic influences (Millares and Ruiz, 1983).
Other characteristic details of RODRIGO that can be clearly appreciated in Fig. 1 are:
• The author tends to embellish the writing, especially in broad white spaces, resulting in the extension of some ascenders and descenders across whole words.
• Natural blank spaces between successive words are often omitted; e.g., the words “de la” are written as a single word “dela” in the third line from the bottom of page 15. Sometimes, on the contrary, artificial blank spaces are inserted within a single word; e.g., the word “llegaronse” is written as two words, “llegaron se”.
• Each chapter should begin with a dropcap, but the manuscript contains no dropcaps, probably because it was never brought to an artist to add them. Instead, there is a blank area in each position where a dropcap should have been inserted and, in most cases, the corresponding letter is written there in small size.
• The first words of each even page are also copied in the bottom right corner of the preceding page.

3. The database
The manuscript was carefully digitized by experts from the Spanish Ministry of Culture, at 300 dpi in true color, and it is publicly available at (Rod, 2006). As with historical documents in general, the scanned pages show noise effects such as spots, tears, ink fading and show-through of the back side. They also show a slight warping due to book binding. Nevertheless, the manuscript can be easily read, and thus we decided not to apply any preprocessing to it (apart from desaturation) for ground-truth annotation.
We followed an annotation procedure very similar to the one used for the GERMANA database (Pérez et al., 2009). First, all text blocks were annotated with minimal enclosing rectangles and, within each text block, each text line was marked by its (straight) baseline. This was done semi-automatically by means of the GIDOC prototype (Serrano et al., 2010). All automatically detected blocks and baselines were also manually supervised, and corrected when needed. In addition, the whole manuscript was transcribed line by line by a paleography expert, in accordance with the following transcription rules:
• Page and line breaks are copied exactly.
• Missing natural blank spaces between successive words are indicated by the symbol “⌣ ”.
• Inserted artificial blank spaces within words are indicated by the symbol “ ”.
• No spelling mistakes are corrected.
• No case or accentuation change is made.
• Punctuation signs are copied as they appear.
• Word abbreviations are first copied verbatim, except for sub-indices and super-indices, which are written in LaTeX-like notation as _{sub} and ^{super}, respectively. Then, they are followed by the corresponding word between brackets. Thus, for instance, qⁱer is transcribed as q^{i}er[quier].
• The symbol “$” is appended to each line having a broken word at its end.
The total time required for a single expert to manually annotate the whole manuscript (text blocks, baselines and transcriptions) was estimated at 500 hours; that is, approximately 35 minutes per page on average. The complete annotation of RODRIGO is publicly available, for non-commercial use, at (Rod, 2010). It comprises about 20K text lines and 231K running words from a lexicon of 17K words, which is comparable in size to standard databases such as IAM (Marti and Bunke, 2002) and HIT-MW (Su et al., 2007). It is worth noting that more than half of the words in the lexicon (54.4%) are singletons (hapax legomena), but they account for only 4.1% of the running words. Please see Table 1 for more details.

4. Experiments
As discussed in the introduction, RODRIGO is introduced to facilitate comparison of different approaches to automatic extraction of text blocks and lines, and to handwriting recognition. In this section, however, we will restrict ourselves to automatic transcription (handwriting recognition). More specifically, our aim is simply to provide baseline results for reference in future studies, using standard techniques and tools; i.e.,
Figure 1: Pages 15 and 16 of RODRIGO.
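The baseline results of this section are reported as transcription word error rate (WER), together with the part of it attributable to out-of-vocabulary (OOV) words. As a reading aid, the following is a minimal sketch of how such figures are commonly computed, not the scoring procedure actually used for RODRIGO. It assumes whitespace tokenization, defines WER as the word-level edit distance divided by the number of reference words, and approximates the OOV contribution by the percentage of reference words absent from the training vocabulary. The function names and the sample lines are hypothetical.

```python
from typing import Iterable, List, Set


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # hypothesis word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: Iterable[str], hyps: Iterable[str]) -> float:
    """WER (%) over paired reference and hypothesis lines."""
    errors = 0
    words = 0
    for ref, hyp in zip(refs, hyps):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("no reference words to score")
    return 100.0 * errors / words


def oov_rate(refs: Iterable[str], vocabulary: Set[str]) -> float:
    """Percentage of reference words that are not in the training vocabulary."""
    tokens = [word for ref in refs for word in ref.split()]
    if not tokens:
        raise ValueError("no reference words to score")
    return 100.0 * sum(word not in vocabulary for word in tokens) / len(tokens)


if __name__ == "__main__":
    # Hypothetical lines, made up for illustration only.
    references = ["dela tierra de españa", "llegaron se al rey"]
    hypotheses = ["de la tierra de españa", "llegaron se el rey"]
    training_vocabulary = {"tierra", "de", "españa", "llegaron", "se", "rey"}
    print(f"WER = {wer(references, hypotheses):.1f}%")                   # WER = 37.5%
    print(f"OOV = {oov_rate(references, training_vocabulary):.1f}%")     # OOV = 25.0%
```

Under this assumption an OOV word can never be recognized by a closed-vocabulary system, so the OOV rate acts as a floor on the WER, which is one way to read the dashed curves of Figure 2.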
[Plot: horizontal axis, Training Lines (1000 to 19000); vertical axis, 0 to 80; curves: WER on next block, OOVs on next block, WER on last block, OOVs on last block.]
Figure 2: Transcription Word Error Rate (WER) on RODRIGO as a function of the number of blocks (of 1000 lines) already supervised and thus available for training (Training lines). The WER is computed both for the next block to supervise (solid line with black circles) and for the last block (lines 19001–20357). Also shown is the part of the WER due to the occurrence of out-of-vocabulary (OOV) words (dashed lines).

that this WER is not too bad for effective computer-assisted transcription, though there is certainly room for significant improvement. Note that, as can be observed from the OOV curves, many errors are caused by the occurrence of out-of-vocabulary words.

5. Conclusions and future work
A new handwritten text database, RODRIGO, has been presented to facilitate empirical comparison of different approaches to text line extraction and off-line handwriting recognition. RODRIGO is written entirely in old Castilian (Spanish) by a single author and is comparable in size to standard databases. Some preliminary empirical results have also been reported, using standard techniques and tools for preprocessing, feature extraction, HMM-based image modeling, and language modeling. Although we think that there is room for significant improvement, the word error rates obtained are already acceptable for effective computer-assisted transcription. For future work, we also plan to provide annotated data in accordance with the guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative Consortium.

6. References
R. Bertolami and H. Bunke. 2008. Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recognition, 41:3452–3460.
L. Likforman-Sulem, A. Zahour, and B. Taconet. 2007. Text line segmentation of historical documents: a survey. International Journal on Document Analysis and Recognition (IJDAR), 9:123–138.
U. V. Marti and H. Bunke. 2002. The IAM-database: an English sentence database for off-line handwriting recognition. International Journal on Document Analysis and Recognition (IJDAR), 5:39–46.
A. Millares and J. M. Ruiz. 1983. Tratado de paleografía española, volume 1. Espasa-Calpe, 3rd edition.
D. Pérez, L. Tarazón, N. Serrano, F. Castro, O. Ramos, and A. Juan. 2009. The GERMANA database. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR 2009), pages 301–305, Barcelona (Spain).
O. Ramos, N. Serrano, and A. Juan. 2010. Interactive-predictive detection of handwritten text blocks. In Proc. of the 17th Document Recognition and Retrieval (DRR 2010), San Jose, CA (USA).
2006. The RODRIGO database: digitized data. bvpb.mcu.es.
2010. The RODRIGO database: annotated data. prhlt.iti.es/rodrigo.php.
N. Serrano, A. Sanchis, and A. Juan. 2010. Balancing error and supervision effort in interactive-predictive handwritten text recognition. In Proceedings of the 15th International Conference on Intelligent User Interfaces (IUI 2010), pages 373–376, Hong Kong (China), June.
T. Su, T. Zhang, and D. Guan. 2007. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. International Journal on Document Analysis and Recognition (IJDAR), 10:27–38.