Josep Lladós
Daniel Lopresti
Seiichi Uchida (Eds.)

LNCS 12822

Document Analysis
and Recognition –
ICDAR 2021
16th International Conference
Lausanne, Switzerland, September 5–10, 2021
Proceedings, Part II
Lecture Notes in Computer Science 12822

Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA

Editorial Board Members


Elisa Bertino
Purdue University, West Lafayette, IN, USA
Wen Gao
Peking University, Beijing, China
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Gerhard Woeginger
RWTH Aachen, Aachen, Germany
Moti Yung
Columbia University, New York, NY, USA
More information about this subseries at http://www.springer.com/series/7412
Josep Lladós · Daniel Lopresti · Seiichi Uchida (Eds.)

Document Analysis
and Recognition –
ICDAR 2021
16th International Conference
Lausanne, Switzerland, September 5–10, 2021
Proceedings, Part II

Editors

Josep Lladós
Universitat Autònoma de Barcelona
Barcelona, Spain

Daniel Lopresti
Lehigh University
Bethlehem, PA, USA

Seiichi Uchida
Kyushu University
Fukuoka-shi, Japan

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Computer Science
ISBN 978-3-030-86330-2 ISBN 978-3-030-86331-9 (eBook)
https://doi.org/10.1007/978-3-030-86331-9
LNCS Sublibrary: SL6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

© Springer Nature Switzerland AG 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword

Our warmest welcome to the proceedings of ICDAR 2021, the 16th IAPR International
Conference on Document Analysis and Recognition, which was held in Switzerland for
the first time. Organizing an international conference of significant size during the
COVID-19 pandemic, with the goal of welcoming at least some of the participants
physically, is similar to navigating a rowboat across the ocean during a storm. For-
tunately, we were able to work together with partners who have shown a tremendous
amount of flexibility and patience including, in particular, our local partners, namely
the Beaulieu convention center in Lausanne, EPFL, and Lausanne Tourisme, and also
the international ICDAR advisory board and IAPR-TC 10/11 leadership teams who
have supported us not only with excellent advice but also financially, encouraging us to
set up a hybrid format for the conference.
We were not a hundred percent sure if we would see each other in Lausanne but we
remained confident, together with almost half of the attendees who registered for
on-site participation. We relied on the hybridization support of a motivated team from
the Luleå University of Technology during the pre-conference, and professional support
from Imavox during the main conference, to ensure a smooth connection between the
physical and the virtual world. Indeed, our welcome is extended especially to all our
colleagues who were not able to travel to Switzerland this year. We hope you had an
exciting virtual conference week, and look forward to seeing you in person again at
another event of the active DAR community.
With ICDAR 2021, we stepped into the shoes of a longstanding conference series,
which is the premier international event for scientists and practitioners involved in
document analysis and recognition, a field of growing importance in the current age of
digital transitions. The conference is endorsed by IAPR-TC 10/11 and celebrates its
30th anniversary this year with the 16th edition. The very first ICDAR conference was
held in St. Malo, France in 1991, followed by Tsukuba, Japan (1993), Montreal,
Canada (1995), Ulm, Germany (1997), Bangalore, India (1999), Seattle, USA (2001),
Edinburgh, UK (2003), Seoul, South Korea (2005), Curitiba, Brazil (2007), Barcelona,
Spain (2009), Beijing, China (2011), Washington DC, USA (2013), Nancy, France
(2015), Kyoto, Japan (2017), and Sydney, Australia (2019).
The attentive reader may have remarked that this list of cities includes several
venues for the Olympic Games. This year the conference was hosted in Lausanne,
which is the headquarters of the International Olympic Committee. Not unlike the
athletes who were recently competing in Tokyo, Japan, the researchers profited from a
healthy spirit of competition, aimed at advancing our knowledge on how a machine can
understand written communication. Indeed, following the tradition from previous
years, 13 scientific competitions were held in conjunction with ICDAR 2021 including,
for the first time, three so-called “long-term” competitions addressing wider challenges
that may continue over the next few years.

Other highlights of the conference included the keynote talks given by Masaki
Nakagawa, recipient of the IAPR/ICDAR Outstanding Achievements Award, and
Mickaël Coustaty, recipient of the IAPR/ICDAR Young Investigator Award, as well as
our distinguished keynote speakers Prem Natarajan, vice president at Amazon, who
gave a talk on “OCR: A Journey through Advances in the Science, Engineering, and
Productization of AI/ML”, and Beta Megyesi, professor of computational linguistics at
Uppsala University, who elaborated on “Cracking Ciphers with ‘AI-in-the-loop’:
Transcription and Decryption in a Cross-Disciplinary Field”.
A total of 340 publications were submitted to the main conference, which was held
at the Beaulieu convention center during September 8–10, 2021. Based on the reviews,
our Program Committee chairs accepted 40 papers for oral presentation and 142 papers
for poster presentation. In addition, nine articles accepted for the ICDAR-IJDAR
journal track were presented orally at the conference and a workshop was integrated in
a poster session. Furthermore, 12 workshops, 2 tutorials, and the doctoral consortium
were held during the pre-conference at EPFL during September 5–7, 2021, focusing on
specific aspects of document analysis and recognition, such as graphics recognition,
camera-based document analysis, and historical documents.
The conference would not have been possible without hundreds of hours of work
done by volunteers in the organizing committee. First of all we would like to express
our deepest gratitude to our Program Committee chairs, Josep Lladós, Dan Lopresti,
and Seiichi Uchida, who oversaw a comprehensive reviewing process and designed the
intriguing technical program of the main conference. We are also very grateful for all
the hours invested by the members of the Program Committee to deliver high-quality
peer reviews. Furthermore, we would like to highlight the excellent contribution by our
publication chairs, Liangrui Peng, Fouad Slimane, and Oussama Zayene, who nego-
tiated a great online visibility of the conference proceedings with Springer and ensured
flawless camera-ready versions of all publications. Many thanks also to our chairs and
organizers of the workshops, competitions, tutorials, and the doctoral consortium for
setting up such an inspiring environment around the main conference. Finally, we are
thankful for the support we have received from the sponsorship chairs, from our valued
sponsors, and from our local organization chairs, which enabled us to put in the extra
effort required for a hybrid conference setup.
Our main motivation for organizing ICDAR 2021 was to give practitioners in the
DAR community a chance to showcase their research, both at this conference and its
satellite events. Thank you to all the authors for submitting and presenting your out-
standing work. We sincerely hope that you enjoyed the conference and the exchange
with your colleagues, be it on-site or online.

September 2021 Andreas Fischer


Rolf Ingold
Marcus Liwicki
Preface

It gives us great pleasure to welcome you to the proceedings of the 16th International
Conference on Document Analysis and Recognition (ICDAR 2021). ICDAR brings
together practitioners and theoreticians, industry researchers and academics, repre-
senting a range of disciplines with interests in the latest developments in the field of
document analysis and recognition. The last ICDAR conference was held in Sydney,
Australia, in September 2019. A few months later the COVID-19 pandemic locked
down the world, and the Document Analysis and Recognition (DAR) events under the
umbrella of IAPR had to be held in virtual format (DAS 2020 in Wuhan, China, and
ICFHR 2020 in Dortmund, Germany). ICDAR 2021 was held in Lausanne,
Switzerland, in a hybrid mode. Thus, it offered the opportunity to resume normality,
and show that the scientific community in DAR has kept active during this long period.
Despite the difficulties of COVID-19, ICDAR 2021 managed to achieve an
impressive number of submissions. The conference received 340 paper submissions, of
which 182 were accepted for publication (54%) and, of those, 40 were selected as oral
presentations (12%) and 142 as posters (42%). Among the accepted papers, 112 had a
student as the first author (62%), and 41 were identified as coming from industry
(23%). In addition, a special track was organized in connection with a Special Issue
of the International Journal on Document Analysis and Recognition (IJDAR). The
Special Issue received 32 submissions that underwent the full journal review and
revision process. The nine accepted papers were published in IJDAR and the authors
were invited to present their work in the special track at ICDAR.
The review model was double blind, i.e. the authors did not know the names of the
reviewers and vice versa. A plagiarism filter was applied to each paper as an added
measure of scientific integrity. Each paper received at least three reviews, totaling more
than 1,000 reviews. We recruited 30 Senior Program Committee (SPC) members and
200 reviewers. The SPC members were selected based on their expertise in the area,
considering that they had served in similar roles in past DAR events. We also included
some younger researchers who are rising leaders in the field.
In the final program, authors from 47 different countries were represented, with
China, India, France, the USA, Japan, Germany, and Spain at the top of the list. The
most popular topics for accepted papers, in order, included text and symbol recogni-
tion, document image processing, document analysis systems, handwriting recognition,
historical document analysis, extracting document semantics, and scene text detection
and recognition. With the aim of establishing ties with other communities within the
concept of reading systems at large, we broadened the scope, accepting papers on
topics like natural language processing, multimedia documents, and sketch
understanding.
The final program consisted of ten oral sessions, two poster sessions, three keynotes,
one of them given by the recipient of the ICDAR Outstanding Achievements Award,
and two panel sessions. We offer our deepest thanks to all who contributed their time
and effort to make ICDAR 2021 a first-rate event for the community. This year’s
ICDAR had a large number of interesting satellite events as well: workshops, tutorials,
competitions, and the doctoral consortium. We would also like to express our sincere
thanks to the keynote speakers, Prem Natarajan and Beta Megyesi.
Finally, we would like to thank all the people who spent time and effort to make this
impressive program: the authors of the papers, the SPC members, the reviewers, and
the ICDAR organizing committee as well as the local arrangements team.

September 2021 Josep Lladós


Daniel Lopresti
Seiichi Uchida
Organization

Organizing Committee
General Chairs
Andreas Fischer University of Applied Sciences and Arts Western
Switzerland, Switzerland
Rolf Ingold University of Fribourg, Switzerland
Marcus Liwicki Luleå University of Technology, Sweden

Program Committee Chairs


Josep Lladós Computer Vision Center, Spain
Daniel Lopresti Lehigh University, USA
Seiichi Uchida Kyushu University, Japan

Workshop Chairs
Elisa H. Barney Smith Boise State University, USA
Umapada Pal Indian Statistical Institute, India

Competition Chairs
Harold Mouchère University of Nantes, France
Foteini Simistira Luleå University of Technology, Sweden

Tutorial Chairs
Véronique Eglin Institut National des Sciences Appliquées, France
Alicia Fornés Computer Vision Center, Spain

Doctoral Consortium Chairs


Jean-Christophe Burie La Rochelle University, France
Nibal Nayef MyScript, France

Publication Chairs
Liangrui Peng Tsinghua University, China
Fouad Slimane University of Fribourg, Switzerland
Oussama Zayene University of Applied Sciences and Arts Western
Switzerland, Switzerland

Sponsorship Chairs
David Doermann University at Buffalo, USA
Koichi Kise Osaka Prefecture University, Japan
Jean-Marc Ogier University of La Rochelle, France

Local Organization Chairs


Jean Hennebert University of Applied Sciences and Arts Western
Switzerland, Switzerland
Anna Scius-Bertrand University of Applied Sciences and Arts Western
Switzerland, Switzerland
Sabine Süsstrunk École Polytechnique Fédérale de Lausanne,
Switzerland

Industrial Liaison
Aurélie Lemaitre University of Rennes, France

Social Media Manager


Linda Studer University of Fribourg, Switzerland

Program Committee
Senior Program Committee Members
Apostolos Antonacopoulos University of Salford, UK
Xiang Bai Huazhong University of Science and Technology,
China
Michael Blumenstein University of Technology Sydney, Australia
Jean-Christophe Burie University of La Rochelle, France
Mickaël Coustaty University of La Rochelle, France
Bertrand Coüasnon University of Rennes, France
Andreas Dengel DFKI, Germany
Gernot Fink TU Dortmund University, Germany
Basilis Gatos Demokritos, Greece
Nicholas Howe Smith College, USA
Masakazu Iwamura Osaka Prefecture University, Japan
C. V. Jawahar IIIT Hyderabad, India
Lianwen Jin South China University of Technology, China
Dimosthenis Karatzas Computer Vision Center, Spain
Laurence Likforman-Sulem Télécom ParisTech, France
Cheng-Lin Liu Chinese Academy of Sciences, China
Angelo Marcelli University of Salerno, Italy
Simone Marinai University of Florence, Italy
Wataru Ohyama Saitama Institute of Technology, Japan
Luiz Oliveira Federal University of Parana, Brazil
Liangrui Peng Tsinghua University, China
Ashok Popat Google Research, USA
Partha Pratim Roy Indian Institute of Technology Roorkee, India
Marçal Rusiñol Computer Vision Center, Spain
Robert Sablatnig Vienna University of Technology, Austria
Marc-Peter Schambach Siemens, Germany
Srirangaraj Setlur University at Buffalo, USA


Faisal Shafait National University of Sciences and Technology, Pakistan
Nicole Vincent Paris Descartes University, France
Jerod Weinman Grinnell College, USA
Richard Zanibbi Rochester Institute of Technology, USA

Program Committee Members


Sébastien Adam Nadir Farah
Irfan Ahmad Rafael Ferreira Mello
Sheraz Ahmed Miguel Ferrer
Younes Akbari Julian Fierrez
Musab Al-Ghadi Francesco Fontanella
Alireza Alaei Alicia Fornés
Eric Anquetil Volkmar Frinken
Srikar Appalaraju Yasuhisa Fujii
Elisa H. Barney Smith Akio Fujiyoshi
Abdel Belaid Liangcai Gao
Mohammed Faouzi Benzeghiba Utpal Garain
Anurag Bhardwaj C. Lee Giles
Ujjwal Bhattacharya Romain Giot
Alceu Britto Lluis Gomez
Jorge Calvo-Zaragoza Petra Gomez-Krämer
Chee Kheng Ch’Ng Emilio Granell
Sukalpa Chanda Mehdi Hamdani
Bidyut B. Chaudhuri Gaurav Harit
Jin Chen Ehtesham Hassan
Youssouf Chherawala Anders Hast
Hojin Cho Sheng He
Nam Ik Cho Jean Hennebert
Vincent Christlein Pierre Héroux
Christian Clausner Laurent Heutte
Florence Cloppet Nina S. T. Hirata
Donatello Conte Tin Kam Ho
Kenny Davila Kaizhu Huang
Claudio De Stefano Qiang Huo
Sounak Dey Donato Impedovo
Moises Diaz Reeve Ingle
David Doermann Brian Kenji Iwana
Antoine Doucet Motoi Iwata
Fadoua Drira Antonio Jimeno
Jun Du Slim Kanoun
Véronique Eglin Vassilis Katsouros
Jihad El-Sana Ergina Kavallieratou
Jonathan Fabrizio Klara Kedem
Christopher Kermorvant Thierry Paquet


Khurram Khurshid Swapan Kr. Parui
Soo-Hyung Kim Antonio Parziale
Koichi Kise Antonio Pertusa
Florian Kleber Giuseppe Pirlo
Pramod Kompalli Réjean Plamondon
Alessandro Lameiras Koerich Stefan Pletschacher
Bart Lamiroy Utkarsh Porwal
Anh Le Duc Vincent Poulain D’Andecy
Frank Lebourgeois Ioannis Pratikakis
Gurpreet Lehal Joan Puigcerver
Byron Leite Dantas Bezerra Siyang Qin
Aurélie Lemaitre Irina Rabaev
Haifeng Li Jean-Yves Ramel
Zhouhui Lian Oriol Ramos Terrades
Minghui Liao Romain Raveaux
Rafael Lins Frédéric Rayar
Wenyin Liu Ana Rebelo
Lu Liu Pau Riba
Georgios Louloudis Kaspar Riesen
Yue Lu Christophe Rigaud
Xiaoqing Lu Syed Tahseen Raza Rizvi
Muhammad Muzzamil Luqman Leonard Rothacker
Sriganesh Madhvanath Javad Sadri
Muhammad Imran Malik Rajkumar Saini
R. Manmatha Joan Andreu Sanchez
Volker Märgner K. C. Santosh
Daniel Martín-Albo Rosa Senatore
Carlos David Martinez Hinarejos Amina Serir
Minesh Mathew Mathias Seuret
Maroua Mehri Badarinath Shantharam
Carlos Mello Imran Siddiqi
Tomo Miyazaki Nicolas Sidère
Momina Moetesum Foteini Simistira Liwicki
Harold Mouchère Steven Simske
Masaki Nakagawa Volker Sorge
Nibal Nayef Nikolaos Stamatopoulos
Atul Negi Bela Stantic
Clemens Neudecker H. Siegfried Stiehl
Cuong Tuan Nguyen Daniel Stoekl Ben Ezra
Hung Tuan Nguyen Tonghua Su
Journet Nicholas Tong Sun
Jean-Marc Ogier Yipeng Sun
Shinichiro Omachi Jun Sun
Umapada Pal Suresh Sundaram
Shivakumara Palaiahnakote Salvatore Tabbone
Kazem Taghva Da-Han Wang


Ryohei Tanaka Curtis Wigington
Christopher Tensmeyer Liang Wu
Kengo Terasawa Mingkun Yang
Ruben Tolosana Xu-Cheng Yin
Alejandro Toselli Fei Yin
Cao De Tran Guangwei Zhang
Szilard Vajda Heng Zhang
Ernest Valveny Xu-Yao Zhang
Marie Vans Yuchen Zheng
Eduardo Vellasques Guoqiang Zhong
Ruben Vera-Rodriguez Yu Zhou
Christian Viard-Gaudin Anna Zhu
Mauricio Villegas Majid Ziaratban
Qiu-Feng Wang
Contents – Part II

Document Analysis for Literature Search

Towards Document Panoptic Segmentation with Pinpoint Accuracy:
Method and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Rongyu Cao, Hongwei Li, Ganbin Zhou, and Ping Luo

A Math Formula Extraction and Evaluation Framework
for PDF Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Ayush Kumar Shah, Abhisek Dey, and Richard Zanibbi

Toward Automatic Interpretation of 3D Plots . . . . . . . . . . . . . . . . . . . . . . . 35
Laura E. Brandt and William T. Freeman

Document Summarization and Translation

Can Text Summarization Enhance the Headline Stance Detection Task?
Benefits and Drawbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Marta Vicente, Robiert Sepúlveda-Torrres, Cristina Barros,
Estela Saquete, and Elena Lloret

The Biased Coin Flip Process for Nonparametric Topic Modeling . . . . . . . . . 68
Justin Wood, Wei Wang, and Corey Arnold

CoMSum and SIBERT: A Dataset and Neural Model for Query-Based
Multi-document Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, and Eugene Ie

RTNet: An End-to-End Method for Handwritten Text Image Translation . . . . 99
Tonghua Su, Shuchen Liu, and Shengjie Zhou

Multimedia Document Analysis

NTable: A Dataset for Camera-Based Table Detection . . . . . . . . . . . . . . . . . 117
Ziyi Zhu, Liangcai Gao, Yibo Li, Yilun Huang, Lin Du, Ning Lu,
and Xianfeng Wang

Label Selection Algorithm Based on Boolean Interpolative Decomposition
with Sequential Backward Selection for Multi-label Classification . . . . . . . . . 130
Tianqi Ji, Jun Li, and Jianhua Xu

GSSF: A Generative Sequence Similarity Function Based on a Seq2Seq
Model for Clustering Online Handwritten Mathematical Answers . . . . . . . . . 145
Huy Quang Ung, Cuong Tuan Nguyen, Hung Tuan Nguyen,
and Masaki Nakagawa

C2VNet: A Deep Learning Framework Towards Comic Strip
to Audio-Visual Scene Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Vaibhavi Gupta, Vinay Detani, Vivek Khokar,
and Chiranjoy Chattopadhyay

LSTMVAEF: Vivid Layout via LSTM-Based Variational Autoencoder
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Jie He, Xingjiao Wu, Wenxin Hu, and Jing Yang

Mobile Text Recognition

HCRNN: A Novel Architecture for Fast Online Handwritten Stroke
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Andrii Grygoriev, Illya Degtyarenko, Ivan Deriuga, Serhii Polotskyi,
Volodymyr Melnyk, Dmytro Zakharchuk, and Olga Radyvonenko

RFDoc: Memory Efficient Local Descriptors for ID Documents
Localization and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Daniil Matalov, Elena Limonova, Natalya Skoryukina,
and Vladimir V. Arlazarov

Dynamic Receptive Field Adaptation for Attention-Based
Text Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Haibo Qin, Chun Yang, Xiaobin Zhu, and Xucheng Yin

Context-Free TextSpotter for Real-Time and Mobile End-to-End Text
Detection and Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Ryota Yoshihashi, Tomohiro Tanaka, Kenji Doi, Takumi Fujino,
and Naoaki Yamashita

MIDV-LAIT: A Challenging Dataset for Recognition of IDs with
Perso-Arabic, Thai, and Indian Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Yulia Chernyshova, Ekaterina Emelianova, Alexander Sheshkus,
and Vladimir V. Arlazarov

Determining Optimal Frame Processing Strategies for Real-Time Document
Recognition Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Konstantin Bulatov and Vladimir V. Arlazarov

Document Analysis for Social Good

Embedded Attributes for Cuneiform Sign Spotting . . . . . . . . . . . . . . . . . . . 291
Eugen Rusakov, Turna Somel, Gerfrid G. W. Müller, and Gernot A. Fink

Date Estimation in the Wild of Scanned Historical Photos: An Image
Retrieval Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Adrià Molina, Pau Riba, Lluis Gomez, Oriol Ramos-Terrades,
and Josep Lladós

Two-Step Fine-Tuned Convolutional Neural Networks for Multi-label
Classification of Children’s Drawings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Muhammad Osama Zeeshan, Imran Siddiqi, and Momina Moetesum

DCINN: Deformable Convolution and Inception Based Neural Network for
Tattoo Text Detection Through Skin Region . . . . . . . . . . . . . . . . . . . . . . . 335
Tamal Chowdhury, Palaiahnakote Shivakumara, Umapada Pal,
Tong Lu, Ramachandra Raghavendra, and Sukalpa Chanda

Sparse Document Analysis Using Beta-Liouville Naive Bayes with
Vocabulary Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Fatma Najar and Nizar Bouguila

Automatic Signature-Based Writer Identification
in Mixed-Script Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
Sk Md Obaidullah, Mridul Ghosh, Himadri Mukherjee, Kaushik Roy,
and Umapada Pal

Indexing and Retrieval of Documents

Learning to Rank Words: Optimizing Ranking Metrics for Word Spotting . . . 381
Pau Riba, Adrià Molina, Lluis Gomez, Oriol Ramos-Terrades,
and Josep Lladós

A-VLAD: An End-to-End Attention-Based Neural Network for Writer
Identification in Historical Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Trung Tan Ngo, Hung Tuan Nguyen, and Masaki Nakagawa

Manga-MMTL: Multimodal Multitask Transfer Learning for Manga
Character Indexing and Search for Hyphenated Words . . . . . . . . . . . . . . . . 410
Enrique Vidal and Alejandro H. Toselli

Physical and Logical Layout Analysis

SandSlide: Automatic Slideshow Normalization . . . . . . . . . . . . . . . . . . . . . 445
Sieben Bocklandt, Gust Verbruggen, and Thomas Winters

Digital Editions as Distant Supervision for Layout Analysis
of Printed Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Alejandro H. Toselli, Si Wu, and David A. Smith

Palmira: A Deep Deformable Network for Instance Segmentation of Dense
and Uneven Layouts in Handwritten Manuscripts . . . . . . . . . . . . . . . . . . . . 477
S. P. Sharan, Sowmya Aitha, Amandeep Kumar, Abhishek Trivedi,
Aaron Augustine, and Ravi Kiran Sarvadevabhatla

Page Layout Analysis System for Unconstrained Historic Documents . . . . . . 492
Oldřich Kodym and Michal Hradiš

Improved Graph Methods for Table Layout Understanding . . . . . . . . . . . . . 507
Jose Ramón Prieto and Enrique Vidal

Unsupervised Learning of Text Line Segmentation by Differentiating
Coarse Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Berat Kurar Barakat, Ahmad Droby, Raid Saabni, and Jihad El-Sana

Recognition of Tables and Formulas

Rethinking Table Structure Recognition Using Sequence
Labeling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
Yibo Li, Yilun Huang, Ziyi Zhu, Lemeng Pan, Yongshuai Huang, Lin Du,
Zhi Tang, and Liangcai Gao

TABLEX: A Benchmark Dataset for Structure and Content Information
Extraction from Scientific Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
Harsh Desai, Pratik Kayal, and Mayank Singh

Handwritten Mathematical Expression Recognition with Bidirectionally
Trained Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du,
and Ziyin Zhang

TabAug: Data Driven Augmentation for Enhanced Table Structure
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
Umar Khan, Sohaib Zahid, Muhammad Asad Ali, Adnan Ul-Hasan,
and Faisal Shafait

An Encoder-Decoder Approach to Handwritten Mathematical Expression
Recognition with Multi-head Attention and Stacked Decoder . . . . . . . . . . . . 602
Haisong Ding, Kai Chen, and Qiang Huo

Global Context for Improving Recognition of Online Handwritten
Mathematical Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
Cuong Tuan Nguyen, Thanh-Nghia Truong, Hung Tuan Nguyen,
and Masaki Nakagawa

Image-Based Relation Classification Approach for Table
Structure Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Koji Ichikawa

Image to LaTeX with Graph Neural Network for Mathematical
Formula Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
Shuai Peng, Liangcai Gao, Ke Yuan, and Zhi Tang

NLP for Document Understanding

A Novel Method for Automated Suggestion of Similar Software Incidents
Using 2-Stage Filtering: Findings on Primary Data . . . . . . . . . . . . . . . . . . . 667
Badal Agrawal, Mohit Mishra, and Varun Parashar

Research on Pseudo-label Technology for Multi-label News Classification . . . 683
Lianxi Wang, Xiaotian Lin, and Nankai Lin

Information Extraction from Invoices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
Ahmed Hamdi, Elodie Carel, Aurélie Joseph, Mickael Coustaty,
and Antoine Doucet

Are You Really Complaining? A Multi-task Framework for Complaint
Identification, Emotion, and Sentiment Classification . . . . . . . . . . . . . . . . . 715
Apoorva Singh and Sriparna Saha

Going Full-TILT Boogie on Document Understanding with Text-Image-
Layout Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak,
Michał Pietruszka, and Gabriela Pałka

Data Centric Domain Adaptation for Historical Text with OCR Errors . . . . . 748
Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth,
and Hinrich Schütze

Temporal Ordering of Events via Deep Neural Networks . . . . . . . . . . . . . . . 762
Nafaa Haffar, Rami Ayadi, Emna Hkiri, and Mounir Zrigui

Document Collection Visual Question Answering . . . . . . . . . . . . . . . . . . . . 778
Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny

Dialogue Act Recognition Using Visual Information . . . . . . . . . . . . . . . . . . 793
Jiří Martínek, Pavel Král, and Ladislav Lenc

Are End-to-End Systems Really Necessary for NER on Handwritten
Document Images? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
Oliver Tüselmann, Fabian Wolf, and Gernot A. Fink

Training Bi-Encoders for Word Sense Disambiguation . . . . . . . . . . . . . . . . 823
Harsh Kohli

DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End
Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Freddy C. Chua and Nigel P. Duffy

Consideration of the Word’s Neighborhood in GATs for Information
Extraction in Semi-structured Documents . . . . . . . . . . . . . . . . . . . . . . . . . 854
Djedjiga Belhadj, Yolande Belaïd, and Abdel Belaïd

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871


Document Analysis for Literature Search
Towards Document Panoptic
Segmentation with Pinpoint Accuracy:
Method and Evaluation

Rongyu Cao1,2(B) , Hongwei Li3 , Ganbin Zhou4 , and Ping Luo1,2,5


1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences,
Institute of Computing Technology, CAS, Beijing 100190, China
{caorongyu19b,luop}@ict.ac.cn
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Research Department, P.A.I. Ltd., Beijing 100025, China
lihw@paodingai.com
4 WeChat Search Application Department, Tencent Ltd., Beijing 100080, China
ganbinzhou@tencent.com
5 Peng Cheng Laboratory, Shenzhen 518066, China

Abstract. In this paper we study the task of document layout recog-
nition for digital documents, requiring that the model should detect
the exact physical object region without missing any text or contain-
ing any redundant text outside objects. It is the vital step to support
high-quality information extraction, table understanding and knowledge
base construction over the documents from various vertical domains
(e.g. financial, legal, and government fields). Here, we consider digital
documents, where characters and graphic elements are given with their
exact text and positions inside the document page, in contrast to image
documents. Towards document layout recognition with pinpoint accuracy,
we consider this problem as a document panoptic segmentation task, in
which each token in the document page must be assigned a class label
and an instance id. Under a traditional visual segmentation method such
as Mask R-CNN, two predicted objects may intersect; document objects,
however, never intersect, because most document pages follow a
Manhattan layout. Therefore, we propose a novel framework,
named document panoptic segmentation (DPS) model. It first splits the
document page into column regions and groups tokens into line regions,
then extracts the textual and visual features, and finally assigns class
label and instance id to each line region. Additionally, we propose a
novel metric based on the intersection over union (IoU) between the
tokens contained in the predicted and the ground-truth objects, which is
more suitable than a metric based on the area IoU between the predicted
and the ground-truth bounding boxes. Finally, empirical experiments on
the PubLayNet, ArXiv and Financial datasets show that the proposed DPS
model obtains mAP scores of 0.8833, 0.9205 and 0.8530 on the three
datasets, a clear improvement over the Faster R-CNN and Mask R-CNN
models.

© Springer Nature Switzerland AG 2021


J. Lladós et al. (Eds.): ICDAR 2021, LNCS 12822, pp. 3–18, 2021.
https://doi.org/10.1007/978-3-030-86331-9_1

1 Introduction
Although a large quantity of digital documents, in the form of HTML, WORD
and PDF files, can be electronically accessed in the Web, these documents are
mostly for human reading, but not machine-readable to support downstream
applications with quantitative analysis. In other words, electronic access is a
necessary but insufficient condition of machine readability [2]. Document lay-
out recognition is the cornerstone task for this electronic access, which aims to
automatically transform semi-structured information (e.g. tokens, vectors and
rasters that scattered on the page) to structured physical objects (e.g. tables,
graphs, paragraphs, etc.). Document layout recognition is critical for a variety
of downstream applications such as information extraction, table understanding
and knowledge base building over digital documents [4,15,35].

[Fig. 1 panels: (a) ground-truth physical objects; (b) predicted physical objects]

Fig. 1. Examples on a part of a document page to illustrate how imprecisely predicted
document objects will harm various downstream tasks. Red rectangles and green rect-
angles in (a) and (b) represent the tokens contained in paragraph and table objects.
(Color figure online)

Since various downstream tasks, such as information extraction and table
understanding, depend on the predicted physical objects, high accuracy is
required of the document layout recognition task. In other words, the docu-
ment layout recognition model should detect the exact physical object region
without missing any text or containing any redundant text outside objects.
Otherwise, it will cause harmful impact on subsequent tasks. Figure 1 depicts
several examples about how imprecisely-predicted physical objects will harm
various downstream tasks. For example, ground-truth tables are two green solid
boxes in Fig. 1(a) and the true relevant time of quantity “471,250” is “Septem-
ber 30, 2017” in downstream table understanding task. We assume that the
model predicts two tables as one table by mistake as shown in Fig. 1(b). Thus,
the relevant time of “471,250” might be recognized as “September 30, 2016”
since “September 30, 2016” will probably be recognized as the column header of
“471,250”. So, the reason for the error is predicting two ground-truth tables as
one table by mistake. Another example is that “Purchases of Equity Securities
by the Company” has been merged into other paragraphs by mistake. Thus, in
downstream table-of-content recognition task, the true heading “Purchases of
Equity Securities by the Company” will be lost. Therefore, pinpoint accuracy
of physical objects is indispensable for document layout recognition task. Next,
we introduce three key aspects towards pinpoint accuracy: problem formation,
solution and evaluation metric.

[Fig. 2 panels: (a) original document page and each column region; (b) one column region and tokens; (c) document object detection; (d) document semantic segmentation; (e) document panoptic segmentation]

Fig. 2. For a given document page (a), we select the right bottom column region (b).
Then, we show the ground-truth of object detection (c), semantic segmentation (d) and
panoptic segmentation (e) for document layout recognition. Transparent yellow boxes
in (b) represent the tokens. Solid red boxes and green boxes in (c) represent paragraph
and table objects. Transparent red boxes and green boxes in (d) and (e) represent the
tokens contained in paragraph and table objects. Note that, we use different shades
of the red color in (e) to distinguish different instances of paragraph objects. (Color
figure online)

Problem Formation. Previous studies mostly regard the document lay-
out recognition problem as an object detection or a semantic segmentation
task [9,16–18,28,38,39]. Document object detection is defined as recognizing
each physical object with the class label and its position by delineating it with
a bounding box, as shown in Fig. 2(c). However, in the inference phase, two
predicted objects may overlap since there are no restrictions among predicted
bounding boxes. Note that, unlike visual objects in natural images, which may
sometimes overlap, physical objects on a document page never overlap. This
limitation may cause some tokens to belong to multiple physical
objects, which will harm downstream tasks. Here, tokens refer to English words
or Chinese characters on the digital document page.
On the other hand, document semantic segmentation aims to simply assign a
class label to each pixel in a document image, or each token in a digital document.
This way fixes the above limitation since each token can only be assigned to one
class label. However, it causes another limitation - it cannot specify instances.
In other words, it cannot distinguish adjacent objects of the same class as shown
in Fig. 2(d).
To combine the merits and overcome the limitations of these two definitions,
we propose a new formation called document panoptic segmentation, inspired
by the previous study [13] in the computer vision field. We formulate document
panoptic segmentation as follows: each token in the document page must be assigned
a class label and an instance id, as shown in Fig. 2(e). Thus, tokens with the same
class label and instance id constitute a physical object. This definition guaran-
tees that each token belongs to only one physical object, meanwhile, adjacent
document objects can be separated via distinct instance id.
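To make this formulation concrete, the following minimal Python sketch (not from the paper; the names Token and group_objects are illustrative) shows how a panoptic labelling over tokens determines the physical objects: tokens sharing the same class label and instance id form exactly one object.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Token:
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page
    class_label: str                         # e.g. "paragraph", "table"
    instance_id: int                         # distinguishes adjacent objects of the same class

def group_objects(tokens: List[Token]) -> Dict[Tuple[str, int], List[Token]]:
    """Group tokens into physical objects by (class_label, instance_id).

    Every token belongs to exactly one object, so predicted objects can never
    overlap, and adjacent objects of the same class stay separated by their
    distinct instance ids.
    """
    objects: Dict[Tuple[str, int], List[Token]] = defaultdict(list)
    for tok in tokens:
        objects[(tok.class_label, tok.instance_id)].append(tok)
    return objects
```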
Solution. In this paper, we assume that the document page follows a Manhattan
layout: that is, physical objects can be segmented by a set of horizontal and
vertical lines [24]. Thus, for a given document page, we propose a novel framework,
named document panoptic segmentation (DPS) model, which first segments the
page into column regions to avoid interference from different columns, as shown
in Fig. 2(a), inspired by [19]. Then, DPS further splits each column region into
an ordered sequence of line regions. Finally, a network predicts the class label
and instance id for each token and can group these line regions into different
types of physical objects. The framework of solution is detailed in Sect. 3.
Evaluation Metric. Previous studies on document layout recognition mostly
use the intersection over union area (area-IoU) between the predicted object
and the matched ground-truth object as the evaluation metric [6,9,16–19,38,39].
However, area-IoU based metric only reflects the overlap of the pixels between
the predicted object and the ground-truth object, which cannot faithfully reflect
whether the predicted object matches the ground-truth with pinpoint accuracy.
To this end, we propose token-IoU, which calculates the intersection over union
between the tokens contained in predicted object and the ground-truth object.
Then, the mAP value is calculated based on token-IoU as the final metric. The
example and experiment (detailed in Sect. 4) show that the proposed token-IoU
is more suitable than area-IoU for document layout recognition task.
In this paper, we use the PubLayNet Dataset and additionally collect two
private datasets - ArXiv Dataset and Financial Dataset. ArXiv Dataset contains
scientific publication document pages from ArXiv and Financial Dataset contains
financial document pages from annual reports, prospectuses, research reports,
etc. The empirical experiments based on these three datasets show that the
proposed DPS model obtains 0.8833, 0.9205 and 0.8530 mAP scores on three
datasets. However, Faster R-CNN model obtains 0.8415, 0.8451 and 0.8133 mAP
scores and Mask R-CNN model obtains 0.8587, 0.8500 and 0.8209 mAP scores
on three datasets. Therefore, the proposed model obtains a great improvement in
mAP score compared with the baseline Faster R-CNN and Mask R-CNN models.
Our contributions in this paper are as follows:
– To the best of our knowledge, we are the first to consider the document layout
recognition problem as a document panoptic segmentation task.
– Towards document panoptic segmentation with pinpoint accuracy, we propose
a novel and more rigorous evaluation metric.
– We propose a neural-based model and the empirical experiments based on
three datasets show that the proposed model outperforms other baselines.

2 Related Work
Generally, all previous studies focused on document layout can be grouped into
four classes: object detection, semantic segmentation, sequence labeling and
graph learning methods.
Object detection method deals with detecting instances of semantic objects
of a certain class in images or videos [1]. Considering physical objects in the
document as the target semantic objects, this type of method mostly applies
the existing object detection framework to regress the bounding box of table
regions. Along this line, Gilani et al. [6] add distance transformations to the
input image and apply Faster R-CNN [27] to detect table regions. Li et al. [16]
use Feature Pyramid Networks to predict physical object regions and apply three
layer alignment modules for cross-domain object detection in documents.
Semantic segmentation is a pixel-labeling technology that maps every pixel of
an image or tokens of a document into a class [21,30]. Then, a method to frame
the region of table pixels is applied. Yang et al. [38] apply FCN [21] for classifying
pixels and generate bounding box of table region by applying rule-based methods
that rely on heuristics or additional PDF metadata. He et al. [9] first predict the
contour of each table and apply connected components analysis (CCA) to make
the contour more clear. Li et al. [18] use LayoutLM [37] framework to predict
each token in the document a class label. Thus, adjacent tokens with the same
class label can be grouped into a physical object.
Sequence labeling attempts to assign labels to a sequence of observations,
where each observation in the sequence refers to the text line on the document
page. Luong et al. [22] split text lines on the document page and apply CRFs to
classify each text line into figure, table and other physical objects. Li et al. [19]
split text lines and employ a verification network to classify each text line, then
employ a pairwise network to merge adjacent text lines with the same type.
Graph network methods consider the texts inside a page as nodes and the relations
between texts as edges, then apply a spatial-based [20] or spectral-based [12]
graph network to classify each node and edge. Koci et al. [14] propose a remove
and conquer algorithm and combine spectral-based graph network for detecting
tables in spreadsheets. Riba et al. [29] apply a spectral-based graph network to
classify nodes and edges, then group nodes that with the same label and with
high edge confidence as a predicted table.

3 Document Panoptic Segmentation Model


3.1 System Overview

We aim to assign a class label and an instance id for each token on the document
page. After obtaining the input document page, we first segment it into column
cell regions. For each column cell region, we further split it into a sequence of
line regions from top to bottom. Then, an FCN and a LayoutLM layer are used to
extract local and global features and obtain the representation of each line region.
A UNet is further used to extract the relational information between line regions.
Finally, each line region is assigned a class label with a BIS tag via a Softmax layer.
After that, line regions can be merged to obtain the different physical objects.

3.2 Column Region Segmentation

[Figure 3: document page → splitting model → grid structure → merging model → column regions]

Fig. 3. An example about column region segmentation on a document page. On the
right part, red solid boxes represent each column region on the document page. (Color
figure online)

Previous column region segmentation methods [19,22] design heuristic rules to
segment column regions. However, for some complex document pages, these
methods might obtain low accuracy. Here, column region is a part of the docu-
ment page, in which the bounding box of two arbitrary physical objects must not
intersect on the horizontal projection. In this paper, we consider that column
region segmentation in the document page is analogous to parsing cell structure
in table regions. Thus, we use the state-of-the-art table structure recognition method
in [34] to segment column regions. In detail, first, the splitting model predicts
the basic grid of the page, ignoring regions that horizontally or vertically span
multiple regions. Then, the merging model predicts which grid elements should
be merged to recover spanning column regions. Figure 3 illustrates this process
with a document page as an example.
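As a rough illustration of the split-then-merge idea (the splitting and merging models of [34] are learned networks; here their decisions are taken as given inputs, and all function names are hypothetical), one could represent the intermediate grid and the merged column regions as follows.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def grid_cells(h_seps: List[float], v_seps: List[float],
               page_w: float, page_h: float) -> List[List[Box]]:
    """Build the basic grid of the page from separator positions
    predicted by a splitting model."""
    xs = [0.0] + sorted(v_seps) + [page_w]
    ys = [0.0] + sorted(h_seps) + [page_h]
    return [[(xs[j], ys[i], xs[j + 1], ys[i + 1])
             for j in range(len(xs) - 1)]
            for i in range(len(ys) - 1)]

def merge_cells(cells: List[Box]) -> Box:
    """Merge grid cells that a merging model decided belong to one
    spanning column region, by taking their enclosing box."""
    x0 = min(c[0] for c in cells)
    y0 = min(c[1] for c in cells)
    x1 = max(c[2] for c in cells)
    y1 = max(c[3] for c in cells)
    return (x0, y0, x1, y1)
```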

[Figure 4: a column region containing tokens and vectors (e.g. inside graph and table objects) is split into token line regions and vector line regions; FCN visual features and LayoutLM textual features of each line region are combined, passed through a UNet, and each line region receives a class label with a BIS tag (e.g. begin of graph, inner of graph, begin of table, inner of table).]
Fig. 4. Examples about line region splitting and sequence labeling. In the left top part,
the purple and green dashed boxes represent graph and table object. In the right top
part, the yellow box represent each token region. Some vectors such as polygons, curves
and blocks are pointed with red arrows. In the left bottom part, the yellow and red
boxes represent token line regions and vector line regions. (Color figure online)

3.3 Line Region Splitting

For each column region extracted in the previous step, we assume that the
bounding box of two arbitrary physical objects in one column region must not
intersect on the horizontal projection. Thus, if we further finely split the column
region into a sequence of line regions from top to bottom, each physical object
will contain continuous several line regions and each line region will be assigned
to only one physical object. The basic components on the PDF
document page are tokens and vectors. Here, tokens refer to English words or
Chinese characters with the bounding box, vectors refer to some graphical ele-
ment, such as polygons, curves and colorful blocks. Some physical objects could
contain some tokens and vectors at the same time, such as graphs or tables, as
shown in the example of Fig. 4.
Now we split the column region into line regions as follows. First, for each
pixel line, if any token box exists in it, we name this pixel line as a token pixel
line. Else, if no token box exists but any vector exists in this pixel line, we name
it as a vector pixel line. Else, if no token box or vector exist in this pixel line,
we name it as a blank pixel line. Then, we group continuous token pixel lines as
token line region and group continuous vector pixel line as vector line region as
shown in the left bottom part of Fig. 4. Thus, we obtain a sequence of token line
regions and vector line regions from top to bottom in the column region. Each
of these line regions contains some tokens or vectors.
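A simple sketch of this pixel-line labelling and grouping, under the assumption that token and vector bounding boxes are available in page coordinates (function and variable names are illustrative, not from the paper):

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downwards

def split_line_regions(col_box: Box, token_boxes: List[Box],
                       vector_boxes: List[Box]) -> List[Tuple[str, int, int]]:
    """Split a column region into (kind, y_start, y_end) line regions.

    Each pixel line is labelled "token" if any token box crosses it,
    else "vector" if any vector element crosses it, else "blank";
    consecutive pixel lines with the same non-blank label are grouped.
    """
    y0, y1 = int(col_box[1]), int(col_box[3])

    def crosses(boxes: List[Box], y: int) -> bool:
        return any(b[1] <= y < b[3] for b in boxes)

    labels = []
    for y in range(y0, y1):
        if crosses(token_boxes, y):
            labels.append("token")
        elif crosses(vector_boxes, y):
            labels.append("vector")
        else:
            labels.append("blank")

    regions, start = [], None
    for i, lab in enumerate(labels + ["blank"]):   # sentinel flushes the last run
        prev = labels[i - 1] if i > 0 else "blank"
        if lab != prev:
            if prev != "blank" and start is not None:
                regions.append((prev, y0 + start, y0 + i))
            start = i if lab != "blank" else None
    return regions
```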

3.4 Line Region Sequence Labeling

Now, the problem is to design a sequence labeling model over line regions. First,
we use a fully convolutional network (FCN) [21] to obtain a feature map with
the same size of the document page. After pooling the feature embeddings in
each line region, we can obtain the fixed-length vector to represent the visual
information of each line region.
Meanwhile, for each line region, we use a LayoutLM [37] layer to obtain
the fixed-length vector to represent the textual information of each line region.
In detail, for each line region, the input of LayoutLM is a collection of tokens
and vectors contained in this line region. LayoutLM adds text embeddings and
2-d position embeddings of each token and vector and uses a Transformer to
obtain the fixed-length vector of each token and vector. Here, we use the overall
sequence vector (the vector of [CLS] token) as the output of the LayoutLM layer.
Then, we concatenate the visual vector and textual vector of each line region
as the inner line feature. To further capture the relationship between each line
region, we use a UNet [30] layer to obtain cross line features.
Finally, a Softmax layer is used to assign the class label with BIS tag to each
line region. Each class corresponds to 3 BIS tags, and each one represents begin,
inner and single, respectively. In other words, if a line region is the beginning line
of a table object, its tag is “begin of table”. Other line regions contained in this
table object correspond to the tag “inner of table”. If a table object contains
only one line region, the tag of this line region is “single of table”. Thus, if there
are 10 class labels, i.e. 10 types of physical objects, the number of labels is 30.
According to these class labels with the BIS tag, line regions can be merged to
get different physical objects. Note that, since we use 3 BIS tags, the adjacent
objects of the same class can be split separately.
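The following sketch (illustrative only; the paper gives no pseudo-code) shows one way the predicted class labels with BIS tags could be decoded back into physical objects:

```python
from typing import List, Tuple

def decode_bis(labels: List[Tuple[str, str]]) -> List[Tuple[str, List[int]]]:
    """Merge an ordered sequence of line-region labels into physical objects.

    `labels[i]` is (class_label, tag) for the i-th line region, where tag is
    one of "begin", "inner", "single".  Returns (class_label, line_indices)
    per object.  An "inner" tag that does not continue an open object of the
    same class defensively starts a new object.
    """
    objects: List[Tuple[str, List[int]]] = []
    open_cls = None
    for i, (cls, tag) in enumerate(labels):
        if tag == "single":
            objects.append((cls, [i]))
            open_cls = None
        elif tag == "begin" or open_cls != cls:
            objects.append((cls, [i]))
            open_cls = cls
        else:  # "inner" continuing the currently open object of the same class
            objects[-1][1].append(i)
    return objects
```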

3.5 Page Representation

Inspired by Katti et al. [11], we make the page representation to integrate both
visual and textual features in the document page. The page representation is a
tensor with the size of H × W × 3, where H and W refer to the original height
and width of the document page. First, we initialize the page representation with
the image of the document page. To add semantic information of tokens, we pre-
train the 3-dimension text embedding of each token by skip-gram model [23] and
normalize the value in each dimension between 0 to 255. Thus, we can regard that
different tokens distinguish their semantics according to their different “colors”.
Then, for each token, the area covered by the token on the page representation
is filled with its text embedding. In other words, pixels that belong to the same
token will share the same text embedding. All remaining pixels that do not
belong to any tokens are filled with the color vector of the original document
image. Therefore, we combine visual and textual features into one tensor as the
page representation, which is the input of the DPS model.
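A minimal sketch of building such a page representation, assuming the rendered page image, token boxes, and 3-dimensional token embeddings (already scaled to the range 0–255) are given; names are illustrative:

```python
import numpy as np

def page_representation(page_image: np.ndarray,
                        token_boxes: list,
                        token_embeddings: np.ndarray) -> np.ndarray:
    """Build the H x W x 3 page representation described above.

    `page_image` is the rendered page as an H x W x 3 array, `token_boxes[k]`
    is the (x0, y0, x1, y1) box of the k-th token and `token_embeddings[k]` its
    3-dimensional skip-gram embedding normalised to 0..255.  Pixels covered by
    a token take the token's embedding as their "colour"; all remaining pixels
    keep the original image values.
    """
    rep = page_image.astype(np.float32).copy()
    for (x0, y0, x1, y1), emb in zip(token_boxes, token_embeddings):
        rep[int(y0):int(y1), int(x0):int(x1), :] = emb
    return rep
```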

Fig. 5. An example to illustrate the difference between area-IoU and token-IoU. The
blue dash-dot and purple solid bounding box represent two predicted tables. Intuitively,
the blue dash-dot table is better than the purple solid table. (Color figure online)

4 Evaluation Metric
To adapt to pinpoint accuracy, we propose a new evaluation measure by considering
tokens in document pages. For example, in a document page, we assume that
there are $n$ physical objects predicted by the model, namely $P = \{t^p_1, t^p_2, \cdots, t^p_n\}$,
and $m$ ground-truth objects, namely $G = \{t^g_1, t^g_2, \cdots, t^g_m\}$. Here, we assume these
objects are of the same class for simplification. For each pair $(t^p_i, t^g_j)$, we can
calculate two values, namely token-IoU and area-IoU, as follows:

$$\text{token-IoU}(t^p_i, t^g_j) = \frac{|\mathrm{token}(t^p_i) \cap \mathrm{token}(t^g_j)|}{|\mathrm{token}(t^p_i) \cup \mathrm{token}(t^g_j)|}, \qquad
\text{area-IoU}(t^p_i, t^g_j) = \frac{\text{intersection-area}(t^p_i, t^g_j)}{\text{union-area}(t^p_i, t^g_j)}, \tag{1}$$

where $\mathrm{token}(t)$ represents the set of tokens contained in the bounding box of
object $t$, and $\text{intersection-area}(t_i, t_j)$ and $\text{union-area}(t_i, t_j)$ represent the intersection
area and union area of the two bounding boxes. Note that token-IoU and area-IoU are
both bounded between 0 and 1. Area-IoU is widely used in traditional object
detection tasks [7,25–27].
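For concreteness, Eq. (1) can be computed as follows (a sketch; the token sets are assumed to be given as sets of token ids, and the function names are illustrative):

```python
from typing import Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def token_iou(pred_tokens: Set[int], gt_tokens: Set[int]) -> float:
    """Token-IoU from Eq. (1): overlap of the token sets of the two objects."""
    union = pred_tokens | gt_tokens
    return len(pred_tokens & gt_tokens) / len(union) if union else 0.0

def area_iou(pred: Box, gt: Box) -> float:
    """Area-IoU from Eq. (1): overlap of the two bounding boxes."""
    ix0, iy0 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix1, iy1 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0
```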
We now explain why token-IoU is more suitable than area-IoU for reflecting the
accuracy of detection on digital documents. As shown in Fig. 5, the predicted
table A (the blue dash-dot box) misses only a few tokens, while the predicted
table B (the purple solid box) misses the whole text line below the table; thus
the predicted table B is clearly worse than the predicted table A. However, they
have similar area-IoU values (around 0.8), which means that area-IoU cannot
distinguish the quality of these two different predictions. In contrast,

predicted table B obtains much lower token-IoU (0.81) than predicted table A
(0.98). Therefore, token-IoU is intuitively more suitable than area-IoU for the
document layout recognition task.
Based on token-IoU or area-IoU, we can further calculate mean average precision
(mAP) for a sequence of IoU thresholds. Then, the mean of the mAP over all
object categories is computed as the final score. The ICDAR 2021 document layout
recognition competition1 and the COCO competition2 both use mAP based on
area-IoU, aka mAP@area-IoU, for a sequence of IoU thresholds ranging from 0.5
to 0.95 with a step size of 0.05. However, in this paper, since our goal is pinpoint
accuracy, we use mAP based on token-IoU, aka mAP@token-IoU, for a sequence
of IoU thresholds ranging from 0.9 to 1.0 with a step size of 0.01. Note that,
when setting the IoU threshold to 1.0, only a predicted object that contains exactly
the same tokens as the ground-truth object is regarded as a successful prediction,
which is the most rigorous metric (Fig. 6).
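The full mAP@token-IoU score follows the usual average-precision protocol over confidence-ranked detections; the sketch below only illustrates the matching step at a single token-IoU threshold, which is the part that changes relative to area-IoU. The greedy one-to-one matching and the function names are simplifying assumptions, not the authors' exact evaluation code.

```python
# Simplified sketch: precision/recall at one token-IoU threshold via greedy matching.
def token_iou(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def match_at_threshold(pred_token_sets, gt_token_sets, threshold=1.0):
    """pred_token_sets: predicted objects as sets of token ids, ideally sorted by
    descending confidence; gt_token_sets: ground-truth objects as token-id sets."""
    matched_gt, true_positives = set(), 0
    for pred in pred_token_sets:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(gt_token_sets):
            if j in matched_gt:
                continue
            iou = token_iou(pred, gt)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= threshold:
            matched_gt.add(best_j)   # at threshold 1.0 this requires identical token sets
            true_positives += 1
    precision = true_positives / len(pred_token_sets) if pred_token_sets else 0.0
    recall = true_positives / len(gt_token_sets) if gt_token_sets else 0.0
    return precision, recall
```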

(a) paragraph object (b) equation object

Fig. 6. Comparing the mAP@area-IoU and mAP@token-IoU metrics of the Faster R-CNN
model on the paragraph and equation classes.

Finally, we design an experiment to further analyze the advantages of token-IoU
compared with area-IoU. First, following [39], we use the Faster R-CNN
model to detect physical objects on each document page in the ArXiv Dataset.
After obtaining the predicted bounding box of each physical object, we employ
post-processing to adjust the predicted bounding box. In detail, for a predicted
bounding box b, we first find all the tokens contained in b. Then, we shrink
b to the bounding box b′ that tightly and exactly bounds the contained token areas.
In other words, the left edge of b′ is the leftmost edge of these tokens, and the
top, right and bottom edges are defined analogously.
1 https://icdar2021.org/competitions/competition-on-scientific-literature-parsing/.
2 https://cocodataset.org/.
Then, we use the mAP@area-IoU and mAP@token-IoU for IoU thresholds ranging from 0.9 to 1.0 with a step
size of 0.01. Along with the increase of the IoU threshold, the mAP@area-IoU
and mAP@token-IoU both decline. From the perspective of the rate of decline,
area-IoU declines slowly when the threshold is close to 0.9, but declines steeply
when the threshold is close to 1.0. In contrast, token-IoU declines uniformly over
the 0.9 to 1.0 threshold range. That is to say, compared with area-IoU, token-IoU
more objectively reflects the model's performance under both coarse and
pinpoint accuracy. Therefore, in the experiments, we will use mAP@token-IoU
as the final evaluation metric.
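For reference, the box-shrinking post-processing described above can be sketched as follows; the (x0, y0, x1, y1) box layout and the token-containment rule (token center inside the predicted box) are assumptions.

```python
# Illustrative sketch: shrink a predicted box b to b' that tightly bounds its tokens.
def shrink_to_tokens(box, token_boxes):
    """box and token_boxes[i] are (x0, y0, x1, y1); a token counts as contained
    if its center falls inside the predicted box (one possible containment rule)."""
    x0, y0, x1, y1 = box
    contained = [t for t in token_boxes
                 if x0 <= (t[0] + t[2]) / 2 <= x1 and y0 <= (t[1] + t[3]) / 2 <= y1]
    if not contained:
        return box                        # nothing to shrink to; keep the prediction
    return (min(t[0] for t in contained), min(t[1] for t in contained),
            max(t[2] for t in contained), max(t[3] for t in contained))
```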

5 Experiments and Results


5.1 Dataset
There exist some publicly available document layout recognition datasets; however,
many of them are not suitable for the pinpoint detection studied in this paper.
The ICDAR 2013 Table Competition Dataset [8], UNLV Table Dataset [31] and
Marmot Dataset [5] contain 128, 427 and 2,000 document pages, respectively;
that is, these datasets only contain at most a few thousand document pages.
TableBank Dataset [17] contains more than 700,000 document images in all.
Document images can be converted to digital pages by executing optical character
recognition (OCR); in this scenario, we employ the Tesseract OCR Engine [33]
to extract characters and words from these document images. However, due to the
low resolution of these images, the produced OCR results are too poor to support
pinpoint accuracy. DocBank Dataset [18] contains 5,000+ PDF documents, so
each token can be easily obtained from the PDF metadata. However, DocBank
Dataset only provides the class label of each word and does not provide the
bounding box of each physical object. In this scenario, it cannot distinguish two
adjacent paragraphs or tables. Thus, DocBank Dataset can only handle the
document semantic segmentation problem; it cannot handle document instance
segmentation or document panoptic segmentation. In this work, we focus on the
document panoptic segmentation problem on documents with complex layouts.
Therefore, we do not use DocBank Dataset.
PubLayNet Dataset [39] contains 300,000+ document pages and has recently
released the PDF format of each document page. Meanwhile, the bounding box
of each physical object is also provided, thus it is suitable for the pinpoint
document layout recognition task. Additionally, we also collect another two datasets
- ArXiv Dataset and Financial Dataset. ArXiv Dataset contains 300,000+ scientific
publication document pages from ArXiv, and Financial Dataset contains
200,000+ financial document pages from annual reports, prospectuses, research
reports, etc. For ArXiv Dataset, we use a method similar to [17,32] to automatically
annotate the bounding box of each physical object. For Financial Dataset,
each document page is assigned to at least two annotators for annotating the
bounding box of each physical object. If the results on a document page differ,
another senior annotator resolves the conflicts and outputs the final answer.
These datasets provide the PDF format of each document and annotated
bounding boxes of each physical object. The detailed information of each dataset
is shown in Table 1. Thus, in this paper, we use PubLayNet Dataset, ArXiv
Dataset and Financial Dataset for experiments towards document panoptic
segmentation with pinpoint accuracy.

Table 1. Statistics of the different datasets.

Dataset    #class  #document  #page    #physical object
PubLayNet  4       125,621    335,640  3,208,989
ArXiv      9       23,770     313,424  4,583,502
Financial  5       2,998      228,036  2,352,714

5.2 Baseline Methods and Proposed Method


In this paper, we compare our proposed DPS model with an object detection
based model [39] and an instance segmentation based model [18] as follows.

– Faster R-CNN. Zhong et al. [39] employ Faster R-CNN [27] to predict the
bounding box of each physical object. Faster R-CNN is an object detection
method. It is implemented with Detectron2 [36] and set as the baseline model
in the ICDAR 2021 document layout recognition competition (a configuration
sketch follows this list). We use the same
hyper-parameters and other configurations. We also use the page representa-
tion introduced in Sect. 3.5 as the input.
– Mask R-CNN. Zhong et al. [39] also employ Mask R-CNN [10] to predict
the bounding box of each physical object on the document page. Mask R-CNN
is an instance segmentation method. It is implemented with Detectron2 [36].
We use the same hyper-parameters and other configurations. We also use the
page representation introduced in Sect. 3.5 as the input.
– DPS. We propose document panoptic segmentation (DPS) model towards
document layout recognition with pinpoint accuracy. The framework is intro-
duced in Sect. 3. The input page representation is the same as Faster R-CNN
and Mask R-CNN.
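As a rough illustration of how such a Detectron2 baseline is typically configured (not the exact settings used in [39] or in the competition; the dataset names and hyper-parameters below are placeholders):

```python
# Rough sketch of a Detectron2 Faster R-CNN baseline; dataset names and
# hyper-parameters are placeholders, not the exact competition settings.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("publaynet_train",)    # assumes the datasets were registered beforehand
cfg.DATASETS.TEST = ("publaynet_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4          # text, title, list, table for PubLayNet
cfg.SOLVER.IMS_PER_BATCH = 8

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```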

5.3 Experimental Results


In this section, we compare the proposed DPS model with two baseline models
on three datasets. Since our goal is pinpoint accuracy, we use mAP@token-IoU
under 1.0 threshold as the evaluation metric. Note that, under this metric, only
the predicted object that contains the same tokens as the ground-truth object is
regarded as a successful prediction. In other words, we require that the predicted
object must not miss any tokens or contain any redundant tokens compared with
the ground-truth object. Although this metric is very rigorous, it guarantees that
the results of document layout recognition will not have adverse effects on the
downstream tasks.

Table 2. Final results for the different physical objects on the three datasets.

(a) The PubLayNet Dataset
model         average  text    title   list    table
Mask R-CNN    0.8415   0.9051  0.9666  0.6807  0.8136
Faster R-CNN  0.8587   0.9057  0.9707  0.7290  0.8294
DPS           0.8833   0.9062  0.9273  0.8194  0.8804

(b) The ArXiv Dataset
model         average  table   graph   paragraph  header  footer  equation  reference  topic   caption
Faster R-CNN  0.8451   0.8031  0.6482  0.9209     0.9591  0.9993  0.7145    0.6755     0.9138  0.9711
Mask R-CNN    0.8500   0.8110  0.6513  0.9285     0.9534  0.9941  0.7155    0.7018     0.9245  0.9698
DPS           0.9205   0.8261  0.7468  0.9689     0.9700  0.9800  0.9483    0.9383     0.9483  0.9581

(c) The Financial Dataset
model         average  table   graph   paragraph  header  footer
Faster R-CNN  0.8133   0.8110  0.4505  0.9147     0.9295  0.9608
Mask R-CNN    0.8209   0.8293  0.4648  0.9206     0.9320  0.9580
DPS           0.8530   0.8648  0.5358  0.9501     0.9345  0.9798
The final results are shown in Table 2. For each dataset, we list the
mAP@token-IoU under the 1.0 threshold for each object class. Clearly, the DPS
model obtains 0.8833, 0.9205 and 0.8530 mAP scores, which represent
improvements of 0.0246 and 0.0418 over Faster R-CNN and Mask R-CNN on the
PubLayNet Dataset, of 0.0754 and 0.0705 on the ArXiv Dataset, and of 0.0397
and 0.0321 on the Financial Dataset. Note that, for list objects in the PubLayNet
Dataset and for equation and reference objects in the ArXiv Dataset, the DPS
model obtains a large improvement over the baselines. Since these objects have
relatively small regions and appear more densely on the document page, Faster
R-CNN and Mask R-CNN cannot predict them with pinpoint accuracy. However,
the DPS model can predict these objects more accurately because it considers
line regions sequentially. For some objects such as footer, caption and title, the
DPS model does not obtain better performance, because there exist some errors
in the column region segmentation stage. Overall, the proposed DPS model
obtains a clear improvement compared with Faster R-CNN and Mask R-CNN.
Additionally, we further explore the overlap ratio of physical objects in differ-
ent models. For Faster R-CNN and Mask R-CNN, we retain the physical objects
whose confidence score is more than 0.5. For the DPS model, we retain all the
physical objects. Then, for each model, we judge whether each physical object
overlaps with other physical objects and calculate the overlap ratio of physical
objects. The experimental results show that 3.7% and 3.4% of the physical objects
are overlapped in the Faster R-CNN and Mask R-CNN models, respectively.
However, in our proposed DPS model, this ratio is 0. That is to say, the DPS
model tackles the problem of overlapping between predicted physical objects.
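A minimal sketch of this overlap check is given below; the pairwise test simply asks whether two retained boxes have a positive-area intersection, which is an assumption about the exact criterion used.

```python
# Illustrative sketch: fraction of retained boxes that overlap at least one other box.
def overlap_ratio(boxes):
    """boxes: list of (x0, y0, x1, y1) retained predictions."""
    def overlaps(a, b):
        return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])
    overlapped = sum(
        any(overlaps(a, b) for j, b in enumerate(boxes) if j != i)
        for i, a in enumerate(boxes)
    )
    return overlapped / len(boxes) if boxes else 0.0
```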

Fig. 7. PDFlux performs document panoptic segmentation for a PDF document.

5.4 Application

The core model in this study is also adopted in a PDF parser software, PDFlux3,
which is dedicated to accomplishing the task of document panoptic segmentation.
For disclosure documents in the financial field, we carried out a large number of
manual annotations and trained the proposed model. After years of polishing, it
can now handle various types of documents, including financial annual reports,
prospectuses, audit reports, scientific publications, industry reports, etc.
Figure 7 shows the screenshot of PDFlux. A user only needs to upload a
PDF document to PDFlux and click the “recognizing” button. Then, PDFlux
will automatically execute all the modules in document panoptic segmentation.
First, PDFlux recognizes the column region for each document page. Through
the column lines, users can clearly see what contents are contained in each col-
umn region. Then, PDFlux extracts the physical objects in each column region,
including paragraphs, tables, graphs, page headers, page footers, and so on. The
user can locate each physical object by moving the cursor, and perform dif-
ferent operations such as copying texts, editing texts, translation, etc. on each
physical object. Especially for the table object, PDFlux displays the in-depth
analysis results on the table, such as row and column lines, merged cells, cell
alignment and so on. Additionally, PDFlux also recognizes whether two paragraphs
or two tables that run across two pages or two columns need to be merged. In the
future, PDFlux aims to recover the logical hierarchy of the entire document [3].
Finally, the entire document can be converted to WORD/EPUB/MOBI format
for users to download. In order to facilitate further analysis and understanding
of PDF documents, PDFlux also provides external SaaS services in the form of
REST API.

3 http://pdflux.com/.

6 Conclusion
In this paper, we focus on document layout recognition and propose a document
panoptic segmentation framework towards pinpoint accuracy. We also propose
a novel evaluation metric, named token-IoU, for this task, which is more suit-
able for pinpoint accuracy. Furthermore, we use three datasets and compare the
proposed DPS model with Faster R-CNN and Mask R-CNN. The results show
that the DPS model outperforms the other two baselines on all three datasets.

Acknowledgements. The research work was supported by the National Key Research
and Development Program of China under Grant No. 2017YFB1002104, and the National
Natural Science Foundation of China under Grant Nos. 62076231 and U1811461. We thank
Xu Wang and Jie Luo (from P.A.I Tech) for their kind help. We also thank the anonymous
reviewers for their valuable comments and suggestions.

References
1. Object detection. https://en.wikipedia.org/wiki/Object_detection
2. Bauguess, S.W.: The role of machine readability in an AI world (2018). https://
www.sec.gov/news/speech/speech-bauguess-050318
3. Cao, R., Cao, Y., Zhou, G., Luo, P.: Extracting variable-depth logical document
hierarchy from long documents: method, evaluation, and application. J. Comput.
Sci. Technol. (2021)
4. Cao, Y., Li, H., Luo, P., Yao, J.: Towards automatic numerical cross-checking:
extracting formulas from text. In: WWW (2018)
5. Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and perfor-
mance metrics for table detection evaluation. In: DAS (2012)
6. Gilani, A., Qasim, S.R., Malik, I., Shafait, F.: Table detection using deep learning.
In: ICDAR (2017)
7. Girshick, R.: Fast R-CNN. In: ICCV (2015)
8. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In:
ICDAR (2013)
9. He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for
semantic page segmentation and table detection. In: ICDAR (2018)
10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
11. Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP
(2018)
12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: ICLR (2017)
13. Kirillov, A., He, K., Girshick, R., Rother, C., Dollar, P.: Panoptic segmentation.
In: CVPR (2019)
14. Koci, E., Thiele, M., Lehner, W., Romero, O.: Table recognition in spreadsheets
via a graph representation. In: DAS (2018)
15. Li, H., Yang, Q., Cao, Y., Yao, J., Luo, P.: Cracking tabular presentation diversity
for automatic cross-checking over numerical facts. In: KDD (2020)
16. Li, K., et al.: Cross-domain document object detection: benchmark suite and
method. In: CVPR (2020)
17. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: table benchmark
for image-based table detection and recognition (2019)
18. Li, M., et al.: Docbank: A benchmark dataset for document layout analysis. arXiv
(2020)
19. Li, X.H., Yin, F., Liu, C.L.: Page object detection from pdf document images by
deep structured prediction and supervised clustering. In: ICPR (2018)
20. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural
networks. In: ICLR (2016)
21. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: CVPR (2015)
22. Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly
articles with rich document features. Int. J. Digit. Libr. Syst. (2010)
23. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. In: ICLR (2013)
24. Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents.
In: Conference on Pattern Recognition (1984)
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection. In: CVPR (2016)
26. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017)
27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. In: NeurIPS (2015)
28. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detec-
tion in invoice documents by graph neural networks. In: ICDAR (2019)
29. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detec-
tion in invoice documents by graph neural networks (2019)
30. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: MICCAI (2015)
31. Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach towards the
benchmarking of table structure recognition systems. In: DAS (2010)
32. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with
distantly supervised neural networks. In: JCDL (2018)
33. Smith, R.: An overview of the tesseract OCR engine. In: ICDAR (2007)
34. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting
and merging for table structure decomposition. In: ICDAR (2019)
35. Wu, S., et al.: Fonduer: Knowledge base construction from richly formatted data.
In: SIGMOD (2018)
36. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://
github.com/facebookresearch/detectron2
37. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of
text and layout for document image understanding. In: KDD (2020)
38. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to
extract semantic structure from documents using multimodal fully convolutional
neural networks. In: CVPR (2017)
39. Zhong, X., Tang, J., Yepes, A.J.: PublayNet: largest dataset ever for document
layout analysis. In: ICDAR (2019)
A Math Formula Extraction
and Evaluation Framework for PDF
Documents

Ayush Kumar Shah , Abhisek Dey , and Richard Zanibbi(B)

Rochester Institute of Technology, Rochester, NY 14623, USA


{as1211,ad4529,rxzvcs}@rit.edu

Abstract. We present a processing pipeline for math formula extraction
in PDF documents that takes advantage of character information
in born-digital PDFs (e.g., created using LaTeX or Word). Our pipeline
is designed for indexing math in technical document collections to sup-
port math-aware search engines capable of processing queries containing
keywords and formulas. The system includes user-friendly tools for visu-
alizing recognition results in HTML pages. Our pipeline is comprised
of a new state-of-the-art PDF character extractor that identifies precise
bounding boxes for non-Latin symbols, a novel Single Shot Detector-
based formula detector, and an existing graph-based formula parser (QD-
GGA) for recognizing formula structure. To simplify analyzing struc-
ture recognition errors, we have extended the LgEval library (from the
CROHME competitions) to allow viewing all instances of specific errors
by clicking on HTML links. Our source code is publicly available.

Keywords: Math formula recognition · Document analysis systems ·
PDF character extraction · Single Shot Detector (SSD) · Evaluation

1 Introduction

There is growing interest in developing techniques to extract information from
formulas, tables, figures, and other graphics in documents, since not all infor-
mation can be retrieved using text [3]. Also, the poor accessibility of technical
content presented graphically in PDF files is a common problem for researchers,
and in various educational fields [19].
In this work, we focus upon improving automatic formula extraction [21].
Mathematical expressions are essential in technical communication, as we use
them frequently to represent complex relationships and concepts compactly.
Specifically, we consider PDF documents, and present a new pipeline that
extracts symbols present in PDF files with exact bounding box locations where
available, detects formula locations in document pages (see Fig. 1), and then
recognizes the structure of detected formulas as Symbol Layout Trees (SLTs).
SLTs represent a formula by the spatial arrangement of symbols on writing lines
[23], and may be converted to LaTeX or Presentation MathML.
© Springer Nature Switzerland AG 2021
J. Lladós et al. (Eds.): ICDAR 2021, LNCS 12822, pp. 19–34, 2021.
https://doi.org/10.1007/978-3-030-86331-9_2

Fig. 1. Detection of symbols and expressions. The PDF page shown in (a) contains
encoded symbols shown in (b). (c) shows formula regions identified in the rendered
page image, and (d) shows symbols located in each formula region.

There are several challenges in recognizing mathematical structure from PDF
documents. Mathematical formulas have a much larger symbol vocabulary than
ordinary text, formulas may be embedded in textlines, and the structure of
formulas may be quite complex. Likewise, it is difficult to visualize and evaluate
formula structure recognition errors - to address this, we present an improved
evaluation framework that builds upon the LgEval library from the CROHME
handwritten math recognition competitions [10,12,14]. It provides a convenient
HTML-based visualization of detection and recognition results, including the
ability to view all individual structure recognition errors organized by ground
truth subgraphs. For example, the most frequent symbol segmentation, symbol
classification, and relationship classification errors are automatically organized
in a table, with links that take the user directly to a list of specific formulas with
the selected error.

The contributions of this work include:

1. A new open-source formula extraction pipeline for PDF files,1
2. new tools for visualization and evaluation of results and parsing errors,
3. a PDF symbol extractor that identifies precise bounding box locations in
born-digital PDF documents, and
4. a simple and effective algorithm which performs detection of math expressions
using visual features alone.

We summarize related work for previous formula extraction pipelines in the
next Section, present the pipeline in Sect. 3, our new visualization and evaluation
tools in Sect. 4, preliminary results for the components in the pipeline in Sect. 5,
and then conclude in Sect. 6.

2 Related Work
Existing mathematical recognition systems use different detection and recogni-
tion approaches. Lin et al. identify three categories of formula detection meth-
ods, based on features used [7]: character-based (OCR-based), image-based, and
layout-based. Character-based methods use OCR engines, where characters not
recognized by the engine are considered candidates for math expression elements.
Image-based methods use image segmentation, while layout-based detection uses
features such as line height, line spacing, and alignment from typesetting infor-
mation available in PDF files [7], possibly along with visual features. Likewise, for
parsing mathematical expressions, syntax (grammar-based) approaches, graph
search approaches, and image-based RNNs producing LaTeX strings as output
are common.
This Section summarizes the contributions and limitations of some existing
math formula extraction systems, and briefly describes our system’s similarities
and differences.
Utilizing OCR and recognition as a graph search problem, Suzuki et
al.’s INFTY system [18] is perhaps the best-known commercial mathematical for-
mula detection and recognition system. The system uses simultaneous character
recognition and math-text separation based on two complementary recognition
engines; a commercial OCR engine not specialized for mathematical documents
and a character recognition engine developed for mathematical symbols, followed
by structure analysis of math expressions performed by identifying the minimum
cost spanning-trees in a weighted digraph representation of the expression. The
system obtains accurate recognition results using a graph search-based formula
structure recognition approach.
Using symbol information from PDFs directly rather than applying
OCR to rendered images was first introduced by Baker et al. [2]. They used a syn-
tactic pattern recognition approach to recognize formulas from PDF documents
using an expression grammar. The grammar requirement and need for manually
segmenting mathematical expressions from text make it less robust than INFTY.
1 https://www.cs.rit.edu/~dprl/software.html.
However, the system was faster by avoiding rendering and analyzing document
images, and improved accuracy using PDF character information. In later work,
Sorge et al. [16] were able to reconstruct fonts not embedded in a PDF docu-
ment, by mapping unicode values to standard character codes where possible.
They then use CC analysis to identify characters with identical style and spac-
ing from a grouping provided by pdf2htmlEX, allowing exact bounding boxes to
be obtained. Following on Baker et al.’s approach to PDF character extraction
[2], Zhang et al. [24] use a dual extraction method based on a PDF parser and
an OCR engine to supplement PDF symbol extraction, recursively dividing and
reconstructing the formula based on symbols found on the main baseline for
formula structure analysis. Later, Suzuki et al. improved the recognition rate in
INFTYReader [18], by also utilizing extracted PDF character information from
PDFMiner [19].
Some PDF characters are composed of multiple glyphs, such as large braces
or square roots (commonly drawn with a radical symbol connected to a horizon-
tal line). These ‘compound’ characters were identified by Baker et al. [2] using
overlapping bounding boxes in modern PDF documents containing Type 1 fonts.
Image-based detection using RNN-based recognition was first pro-
posed by Deng et al. [4], inspired by RNN-based image captioning techniques.
A more recent example of this approach is Phong et al.’s [15], which uses a
YOLO v3 network based on a Darknet-53 network consisting of 53 convolu-
tional layers for feature extraction and detection, and an advanced end to end
neural network (Watch, Attend and Parse (WAP)) for recognition. The parser
uses a GRU with attention-based mechanisms, which makes the system slow due
to the pixel-wise computations. Also, error diagnosis in these recurrent image-
based models is challenging, since there is not a direct mapping between the
input image regions and the LATEX output strings. To address this issue for the
CROHME 2019 handwriten math competition, evaluation was performed using
the LgEval library, comparing formula trees over recognized symbols (e.g., as
represented in LATEX strings), without requiring those symbols to have assigned
strokes or input regions [10]. This alleviates, but does not completely resolve
challenges with diagnosing errors for RNN-based systems.
This Paper. We introduce algorithms to create a unified system for detecting
and recognizing mathematical expressions. We first introduce an improved PDF
symbol extractor (SymbolScraper) to obtain symbol locations and identities. We
also present a new Scanning Single Shot Detector (ScanSSD) for math formu-
las using visual features, by modifying the Single Shot Detector (SSD) [8] to
work with large document images. For structure recognition, we use the existing
Query-driven Global Graph Attention (QD-GGA) model [9], which uses CNN
based features with attention. QD-GGA extracts formula structure as a maxi-
mum spanning tree over detected symbols, similar to [18]. However, unlike [18],
the system trains the features and attention modules concurrently in a feed-
forward pass for multiple tasks: symbol classification, edge classification, and
segmentation, resulting in fast training and execution.

Fig. 2. Mathematical Expression Extraction Pipeline. Symbol Extraction outputs symbols
and their bounding boxes directly from drawing commands in the PDF file, avoid-
ing OCR. However, formula regions and formula structure are absent in PDF, and so
are identified from rendered document pages (600 dpi).

Some important distinctions between previous work and the systems and
tools presented in this paper include:
1. Our system focuses solely on formula extraction and its evaluation.
2. Grammars and manual segmentation are not required (e.g., per [2])
3. Recognition is graph-based, with outputs generated more quickly and with
more easily diagnosed errors than RNN models (e.g., [15]).
4. Symbol information from born-digital PDFs is used directly; where absent,
characters are recognized from connected components (CCs) in images.
5. Structure recognition errors can be directly observed in graphs grounded in
input image regions (e.g., CCs), and the LgEval library [10,13,20,22] has been
extended to visualize errors both in isolation, and in their page contexts.

3 Formula Detection Pipeline Components


We describe the components of our formula processing pipeline in this Section.
As seen in Fig. 2, the extraction pipeline has three major components: 1) symbol
extraction from PDF, 2) formula detection, and 3) formula structure recognition
(parsing). The final outputs are Symbol Layout Trees (SLTs) corresponding to
each detected mathematical expression in the input PDF documents, which we
visualize as both graphs and using Presentation MathML (see Fig. 5).

3.1 SymbolScraper: Symbol Extraction from PDF


Unless formula images are embedded in a PDF file (e.g., as a .png), born-digital
PDF documents provide encoded symbols directly, removing the need for char-
acter recognition [2]. In PDF documents, character locations are represented by


Fig. 3. Symbol extraction from PDF. (a) Symbol locations in a PDF formula from
various tools. (b) Correcting bounding box translations from glyph data. (c) Compound
character: a large brace (‘{’) drawn with three characters.

their position on writing lines and in ‘words,’ along with their character codes
and font parameters (e.g., size). Available systems such as PDFBox, PDF Miner,
and PyMuPDF return the locations of characters using the drawing commands
provided in PDF files directly. However, they return inconsistent locations (see
Fig. 3(a)).
We resolve this problem by intercepting the rendering pipeline, and using
the character outlines (glyphs) provided in embedded font profiles in PDF files.
Glyphs are vector-based representations of how characters are drawn using a
series of lines, arcs, and ‘pen’ up/down/move operations. These operations detail
the outlines of the lines that make up the character. Glyphs are represented by a
set of coordinates with winding rules (commands) used to draw the outline of a
character. The glyph along with font scaling information is used to determine a
character’s bounding box and relative position in the PDF. Our symbol extractor
extends the open-source PDFBox [1] library from Apache.
Character Bounding Boxes. For high-quality page rendering, individual
squares used to represent the character outlines (i.e., glyphs) are known as ‘em
squares.’ The em square is typically 1000 or 2048 pixels in length on each side,
depending upon the font type used and how it is rendered.2 Glyphs are repre-
sented by vector drawing commands in the em square ‘glyph space,’ which has
a higher resolution (e.g., a 1000 × 1000) than that used to render the character
in page images (in ‘text space’). To obtain the bounding box for a character on
a page, we need to: 1) convert glyph outline bounding boxes in glyph space to
a bounding box in ‘text space’ (i.e. page coordinates), and 2) translate the box
based on the varying origin location within each glyph’s em square represen-
tation. The difference in origin locations by glyph reflects the different writing
line positions for symbols, and allows the whole box to be used in representing
character shapes (maximizing their level of detail/resolution).
2 You can see the various ways we obtain the em square value by looking at the getBoundingBox() method in our BoundingBox class.
Glyph outlines are translated to page coordinates using character font size
and type information. A scaling matrix for the character, defined by the font size,
text scaling matrix, and current transformation matrix is used for the conversion.
PDF files provide a baseline starting point coordinate (bx , by ) for characters
on writing lines, but this provided coordinate does not take character kerning,
font-weight, ascent, descent, etc., into account, as seen in Fig. 3(a). To produce
the appropriate translations for character bounding boxes, we use the textline
position of the character provided within the em square for each glyph (in glyph
space). We use displacement (δx , δy ) of the bottom left coordinate within the
glyph em square to obtain the corrective translation. Next, we scale the glyph
space coordinates to text space coordinates and then use coordinate list G from
the character’s glyph, and find the width and height using the minimum and
maximum coordinates of G.
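A much-simplified sketch of this conversion is shown below; it assumes a single uniform scale of font size over em-square size and ignores the full PDF text and transformation matrices, so it only illustrates the idea rather than reproducing SymbolScraper's implementation.

```python
# Simplified sketch of glyph-space -> text-space bounding-box computation.
def glyph_bbox_to_page(glyph_points, units_per_em, font_size, baseline_xy):
    """glyph_points: (x, y) outline coordinates in glyph space; baseline_xy: the
    character's baseline start point (bx, by) on the page. Assumes a uniform
    scale and ignores kerning and the full PDF matrices."""
    scale = font_size / units_per_em
    xs = [x * scale for x, _ in glyph_points]
    ys = [y * scale for _, y in glyph_points]
    dx, dy = min(xs), min(ys)             # corrective displacement inside the em square
    bx, by = baseline_xy
    width, height = max(xs) - dx, max(ys) - dy
    return (bx + dx, by + dy, bx + dx + width, by + dy + height)
```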
The extracted character bounding boxes are very precise even for non-Latin
characters (e.g., ) as seen in Fig. 3(a) at far right. Likewise, the recognized
symbol labels are accurate. Some documents contain characters embedded in
images, in which case these characters must be identified in downstream process-
ing. We currently use font tables provided by PDFBox for mapping characters
to glyphs. Using these font tables, we have been able to retrieve glyphs for the
following font families: TrueType, Type 1, Type 2, and CID. Font information
is extracted using the getFont() method in the PDFBox TextPosition class.
There are cases where the bounding boxes do not align perfectly with characters,
due to the use of rarer font types not handled by our system (e.g., Type 3, which
seems to be commonly used for raster fonts in OCR output). We hope to handle
Type 3 fonts in the future.
Compound Characters. Compound characters are formed by two or more
characters as shown for a large brace in Fig. 3(c). For simplicity, we assume
that bounding boxes for sub-character glyphs intersect each other, and merge
intersecting glyphs into a single character. As another common example, square
roots are represented in PDF by a radical symbol and a horizontal line. To
identify characters with multiple glyphs, we test for bounding box intersection
of adjacent characters, and intersecting glyphs are merged into a compound
character. The label unknown is assigned to compound characters other than
roots, where the radical symbol is easily identified.
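A sketch of this merging rule is shown below; the box format, label names, and the simple left-to-right pass are assumptions for illustration, not the SymbolScraper source.

```python
# Illustrative sketch: merge adjacent glyphs with intersecting boxes into compound characters.
def merge_compound_characters(glyphs):
    """glyphs: list of (label, (x0, y0, x1, y1)) in reading order."""
    def intersects(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
    merged = []
    for label, box in glyphs:
        if merged and intersects(merged[-1][1], box):
            prev_label, prev_box = merged.pop()
            union = (min(prev_box[0], box[0]), min(prev_box[1], box[1]),
                     max(prev_box[2], box[2]), max(prev_box[3], box[3]))
            # square roots keep a root label; other merges are labeled "unknown"
            new_label = "sqrt" if "radical" in (prev_label, label) else "unknown"
            merged.append((new_label, union))
        else:
            merged.append((label, box))
    return merged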
Horizontal Lines. Horizontal lines are drawn as strokes and not characters. To
extract them, we override the strokepath() method of the PageDrawer class in
the PDFBox library [1]. By overriding this method, we obtain the starting and
ending coordinates of horizontal lines directly. Some additional work is needed to
improve the discrimination between different line types, and other vector-based
objects could be extracted similarly (e.g., boxes, image BBs).

Fig. 4. ScanSSD architecture. Heatmaps illustrate detection confidences with gray ≈ 0,
red ≈ 0.5, white ≈ 1.0. Purple and green bounding boxes show formula regions after
stitching window-level detections and pooling, respectively. (Color figure online)

3.2 Formula Detection in Page Images

We detect formulas in document page images using the Scanning Single Shot
Detector (ScanSSD [11]). ScanSSD modifies the popular Single Shot Detector
(SSD [8]) to work with sliding windows, in order to process large page images (600
dpi). After detection is performed in windows, candidate detections are pooled,
and then finally regions are selected by pixel-level voting and thresholding.
Figure 4 illustrates the ScanSSD workflow. First, a sliding window samples
overlapping page regions and applies a standard 512 × 512 Single-Shot Detector
to locate formula regions. Non-Maximal Suppression (NMS) is used to select the
highest confidence regions from amongst overlapping detections in each window.
Formulas detected within each window have associated confidences, shown using
colour in the 3rd stage of Fig. 4. As seen with the purple boxes in Fig. 4, many
formulas are found repeatedly and/or split across windows.
To obtain page-level detections, we stitch window-level detections on the
page, and then use a voting-based pooling method to produce output detections
(green boxes in Fig. 4). Details are provided below.
Sliding Windows and SSD. Starting from a 600 dpi page image we slide a
1200 × 1200 window with a vertical and horizontal stride (shift) of 120 pixels
(10% of window size). Windows are roughly 10 text lines in height. The SSD
is trained using ground truth math regions cropped at the boundary of each
window, after scaling and translating formula bounding boxes appropriately.
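As a small sketch of the window sampling (names are assumed; border windows that would extend past the page are simply omitted here, whereas the real system handles the page boundary and crops ground truth per window):

```python
# Illustrative sketch: 1200 x 1200 sliding-window crops with a 120-pixel stride.
import numpy as np

def sliding_windows(page_image: np.ndarray, window: int = 1200, stride: int = 120):
    """page_image: H x W (x C) array rendered at 600 dpi; yields (x, y, crop)."""
    h, w = page_image.shape[:2]
    for y in range(0, max(h - window, 0) + 1, stride):
        for x in range(0, max(w - window, 0) + 1, stride):
            yield x, y, page_image[y:y + window, x:x + window]
```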
There are four main advantages to using sliding windows. The first is data
augmentation: only 569 page images are available in our training set, which
is very small for training a deep neural network. Our sliding windows produce
656,717 sub-images. Second, converting the original page image directly to 300×
300 or 512 × 512 loses a great deal of visual information, and detecting formulas
using sub-sampled page images yielded low recall. Third, windowing provides
multiple chances to detect a formula. Finally, Liu et al. [8] mention that SSD
is challenged when detecting small objects, and formulas with just one or two
characters are common. Using high-resolution sub-images increases the relative
size of small formulas, making them easier to detect.
but I—you may not know it—I state under my oath that in Denmark
and Holland, and also in Belgium, local firms, which recruited their
own labor under the labor order, worked on these fortifications and
that the populations of these areas were particularly glad about this,
because the stronger their coast was fortified, the more certain were
they that the invasion would not take place in their neighborhood.
And, of course, they were greatly interested in preventing an
invasion, which they knew would destroy everything. Though it
sounds incredible, the local inhabitants did work on these
fortifications, some of them with the greatest enthusiasm. That is a
fact.
MR. ROBERTS: No, I did not stop you. But had the necessary
orders been given—that is in the last sentence—to compel these
people who did not want to, to compel them to work on fortification? I
am not talking about the people who did want it, but the people who
did not.
JODL: I understand. I did not know details of the procedure, as I
did not concern myself with it, but I did know that compulsory labor
orders had been issued in the occupied countries.
MR. ROBERTS: Very good. I will leave that, if you have said all
you want to say. Will you look now, please, at a new document,
Number 1383-PS, which I offer as GB-489. This is a report of a
discussion of the current military situation, 12 December 1942,
Pages 65 and 66, Jodl speaking:
“The military commander of France reports: The number of
French workers deported into the Reich since 1 June has
now passed 220,000. There are in round figures 100,000
skilled laborers in Berlin.”
How many of these 220,000 were volunteers, did you find out?
JODL: I cannot say that; I only quoted from a report which was
appended to the situation report from France. That a large-scale
exchange between prisoners of war and workers had been in
progress has already been stated in detail by Sauckel.
MR. ROBERTS: I will leave that. I ask only two questions now
on Sagan, Stalag Luft III.
You said yesterday that after the incident of the Sagan shooting,
you thought Hitler was no longer “humane.” Did you say that?
JODL: I said yesterday, I had the impression then that he was
disavowing all humane conceptions of right.
MR. ROBERTS: Had you thought that he was humane up to
March of 1944?
JODL: Before this time, I personally knew of no action of his
which could not be justified legally, at least under international law.
All his previous orders, so far as I knew, could still be justified in
some way. They were reprisals. But this act was not a reprisal.
MR. ROBERTS: This was—would you agree with me—the word
is not too strong—that this was sheer murder of these 50 airmen?
JODL: I completely agree with you: I consider it sheer murder.
MR. ROBERTS: How could you honorable generals go on
serving a murderer with unabated loyalty?
JODL: I did not serve with unabated loyalty after this event, but I
did everything in my power to avoid further injustice.
MR. ROBERTS: Now I come to something else, the question of
destruction in Norway. The document is 754-PS. It has not yet been
exhibited. I offer it as GB-490. This document is signed by you, is it
not?
JODL: I have known this document for a long time; it is signed
by me.
MR. ROBERTS: Yes. Perhaps I might just read parts of it to the
Tribunal. Dated 28 October 1944. It is from your staff, and the
distribution is to the Army supreme command; commander-in-chief,
Norway; to the Reich Commissioner, Norway; and the Navy.
“Because of the unwillingness of the northern Norwegian
population to evacuate voluntarily the Führer has agreed to
the proposals of the Reich Commissioner and has ordered
that the entire Norwegian population east of the Fjord of
Lyngen be evacuated by force in the interest of their own
security, and that all homes are to be burned to the ground
or destroyed.
“The commander, Northern Finland, is responsible that the
Führer’s orders be carried out without consideration. Only
by this method can the Russians with strong forces, aided
by these homesteads and the population familiar with the
terrain, be prevented from following our withdrawal
operations during the winter and shortly appearing in front
of our position in Lyngen. This is not the place for sympathy
for the civilian population.”
Lyngen is in the very north of Norway, is it not, on the west
coast?
JODL: No, on the northern coast, where Finland is closest to the
coast of the polar region and very near Norway.
MR. ROBERTS: Now, that order was carried out, according to
the Norwegian report, UK-79, which the Tribunal will find as the last
document in the small book, 7A, Page 26 of the Norwegian report, at
the bottom of the page, Page 26:
“As a result of the advance of the Russian troops and the
retreat of the German Army in Finnmark, October-
November 1944, the Germans practiced the ‘scorched
earth’ policy for the first time in Norway. Orders were issued
that the civilian population was to evacuate, and that all
houses, transport, and stores were to be destroyed. As a
result of this, about 30,000 houses were damaged apart
from 12,000 items of damage to chattels amounting to 176
million kroner.”
And then, for photographs will the Tribunal turn to Pages 62 and
63; 62 is a copy of the German order, and 63 is a photograph of the
ruins of a fishing village.
That was a cruel order, was it not, Witness?
JODL: No, not exactly. I should like to make a few explanatory
remarks about it. Typically, as I have always said, this order was
urged upon the Führer by the Reich Commissioner Terboven; not by
the soldiers but much against their will.
Secondly, this order was not carried out, because otherwise the
cities of Kirkenes, Hammerfest, and Alta would today no longer exist.
All these cities are east of the Lyngen Fjord. In practice this order
was moderated by our forces in agreement with me, and in
conversations I had with my brother, who was the commanding
general in that region—and whom I wanted to call as a witness since
I expected this document to be produced—it was moderated to such
an extent that, in fact, only what was necessary from a military point
of view and could be justified under Article 23 of the Hague
Regulations on Land Warfare was destroyed. Otherwise no city or
house would be left today in northern Norway; and if you were to
travel there, you would see that these cities are still standing, not
destroyed.
The Armed Forces commander in Norway strongly protested
against this attitude of Terboven, and I repeated these objections to
the Führer in similarly strong terms, but nevertheless he demanded
that this order be issued. We who retained our humanitarian
sentiments carried out the order only insofar as it was absolutely
necessary for military reasons. These are the facts.
MR. ROBERTS: I think you said, when you were interrogated,
that your brother complained of this order, did he not?
JODL: Yes, quite, he was enraged by this decree.
MR. ROBERTS: Very well. I am now going to turn to two
documents with regard to the treatment of the Norwegian civilian
population.
They are in your Document Book 1, Pages 99 and 100—well, it
begins at Page 98. These are regulations on the conduct during the
occupation of Denmark and Norway. And there are instructions to
the troops to treat the inhabitants politely and well and to behave
themselves with due decorum. That is right, is it not?
JODL: Yes, that is correct.
MR. ROBERTS: And they must be told that they are entering
Norway for the protection of the country and the safety of its
inhabitants. That appears on Page 99. That is rather a euphemistic
description of a sudden invasion with no declaration of war, is it not?
JODL: Yes, but at first it was carried out in a fairly peaceful
manner on the whole.
MR. ROBERTS: From your point of view?
JODL: No, from the point of view of the Norwegians as well. The
most extraordinary things...
MR. ROBERTS: Well, you know, we have seen—we can see in
the Norwegian Government’s report photograph after photograph of
these towns and villages bombed to ruins. Is that your idea of an
orderly occupation?
JODL: What was bombed on the day of the landing is hardly
worth mentioning; just a few coastal batteries and a few fortifications,
but no cities. Villages were destroyed only later in the battle with the
English brigade at Dombass and at Lillehammer, but nothing was
destroyed when the country was first occupied. Then the
Norwegians only stood at the quays, hands in their pockets, and
looked on with great interest.
MR. ROBERTS: And naturally, Witness, if you could have
landed without opposition and occupied the country without
opposition, so much the better for you? That is obvious, is it not?
JODL: Yes, undoubtedly; that would have been even better; and
the Norwegians would certainly have fared very well during the
occupation if Terboven had not come.
MR. ROBERTS: Now, I want you to look at a part of that
document which, quite properly, of course, was not read.
It is Appendix 5 which will be part, My Lord, I assume, of Exhibit
AJ-14, the number which this document was given when it was put in
in the examination-in-chief. But I am handing the Tribunal copies of
Appendix 5, because it does not appear in the Jodl document book.
[Turning to the defendant.] Well, now, Appendix 5, I can
describe as the sting in the tail of this document:
“Guiding Principles for the Attitude of Troops in Occupied
Areas.
“Only”—I do not read the first few paragraphs—“Only in the
event of the civil population’s putting up a resistance or
behaving rebelliously can the following decisions be carried
out:
“1) If the civilian population offers resistance or if attacks
are to be feared, the arrest of hostages should, on principle,
be resorted to. Hostages should only be arrested on orders
of the commander of a regiment or a commander of
equivalent rank.
“When accommodating and feeding hostages it should be
borne in mind that they are not imprisoned because of
crimes. Hostages and population are to be informed that
the hostages will be shot at any sign of hostile action.
Previous sanction of the shooting by the divisional
commander must be obtained....”
Then:
“Armed resistance by the civilian population is to be
crushed by force of arms.”
The last sentence on that page:
“The death penalty will be imposed for violence of any kind
against the German Armed Forces. Immediate trials will be
held by a field court martial. The regimental commander
can appoint the summary court, composed of one captain,
one sergeant, one corporal, hear witnesses, draw up the
sentence in writing. The verdict will be the death penalty if
guilty, otherwise acquittal. The sentence will be executed
immediately after confirmation by the regimental
commander.
“The following are to be considered as acts of violence:
Sabotage, destruction of our lines of communications,
cutting of telephone wires, demolitions, et cetera.”
(Document Number Jodl-37, Exhibit Number AJ-14.)
A little drastic, that, was it not? Only the death penalty?
JODL: These instructions are, word for word, in complete accord
with our directives which, in times of peace, were laid down by the
group of experts on international law in co-operation with the Foreign
Office and with German professors of international law. It would have
been well, if only these, our military precepts, our military court
procedure laid down before we went to war, had been followed
consistently everywhere. Our official directives laid down the
question of hostages from the point of view of international law, and
there is no doubt that under international law as applicable in the
year 1939, the taking of hostages was admissible.
MR. ROBERTS: I suggest to you, as you raise that point, that
nowhere in international law will you find the shooting of hostages
legalized at all.
JODL: Then it is not with certainty prohibited anywhere in
international law. I believe it is an open question. In our directives,
even in the Handbook on Tactics, the concept of taking hostages
had been laid down for years.
MR. ROBERTS: That may be so, and I do not want to argue
with you about it. I suggest to you that the Hague Regulations
protect the lives of civilians in occupied countries, unless they
commit crimes, of course, and also prohibit collective punishment of
the innocent.
If you do not want to say any more on that—I do not want to
stop you if you do.
JODL: I can only summarize and say that every word here is in
accord with the directives applicable in the German Army, and these
directives were not illegal. But one would have to argue this problem
with experts on international law.
MR. ROBERTS: Very good. Now, will you look at one other
document dealing with Norway? It is D-582.
My Lord, it is a new document, and I offer it as GB-491.
[Turning to the defendant.] Is that a document which comes from
your office?
JODL: Yes. It originated with the Armed Forces Operations
Staff, Quartermaster Section.
MR. ROBERTS: Do you know of it or not?
JODL: I cannot recall it, but there are some notes of mine on it,
and so I undoubtedly saw the document.
MR. ROBERTS: Oh, yes. Where are the notes, Witness?
JODL: They are on the back page of the last teleprint message.
MR. ROBERTS: Oh, I see what you mean, yes. Well, will you
take first of all—I had forgotten that you were getting more than one
document. Will you take first of all the document dated the 2d of
February 1945? I think it is the top one.
JODL: There are no remarks of mine on that document, so I
cannot say with certainty whether I have seen it.
MR. ROBERTS: Just have a look at it and tell me whether you
have seen it.
JODL: I do not think I have seen this. I do not—I have no
recollection of having ever read it.
MR. ROBERTS: Well, then, I do not think it would be right to
cross-examine you on that document.
My Lord, in that case, I would ask to withdraw it, and I will not
put it in as an exhibit.
THE PRESIDENT: I think the defendant said that it was from his
office.
MR. ROBERTS: Very well, then. I will—he did that.
[Turning to the defendant.] You see what the document says,
Defendant. It is dated 2 April 1945; it deals with...
JODL: The 2d of February.
MR. ROBERTS: It is the 2d of February. It deals with Reich
Commissioner Terboven’s report to the Führer. It says:
“Those responsible for attempts to murder and to carry out
sabotage are the illegal elements within Norway with a
