Download as pdf or txt
Download as pdf or txt
You are on page 1of 2

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/371349178

Automated Extraction of Bioclimatic Time Series from PDF Tables

Conference Paper · June 2023


DOI: 10.5194/egusphere-egu23-15072

CITATIONS READS

0 37

3 authors:

Sabino Maggi Silvana Fuina


Italian National Research Council Università degli Studi di Bari Aldo Moro
137 PUBLICATIONS 470 CITATIONS 13 PUBLICATIONS 45 CITATIONS

SEE PROFILE SEE PROFILE

Saverio Vicario
Italian National Research Council
137 PUBLICATIONS 4,485 CITATIONS

SEE PROFILE

All content following this page was uploaded by Sabino Maggi on 07 June 2023.

The user has requested enhancement of the downloaded file.


EGU23-15072
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Automated Extraction of Bioclimatic Time Series from PDF Tables


Sabino Maggi, Silvana Fuina, and Saverio Vicario
CNR National Research Council, Institute of Atmospheric Research, Bari, Italy (sabino.maggi@cnr.it)

Since the development of the original specifications in the '90s the PDF document format has
become the de-facto standard for the distribution and archival of documents in electronic form
because of its ability to preserve the original layout of the documents, independently of the
hardware, operating system and application software used to visualize them.

Unfortunately the PDF format does not contain explicit structural and semantic information,
making it very difficult to extract structured information from them, in particular data presented in
tabular form.
The automatic extraction of tabular data is a difficult and challenging task because tables can have
extremely different formats and layouts, and involves several complex steps, from the proper
recognition and conversion of printed text into machine-encoded characters, to the identification
of logically coherent table constructs (headers, columns, rows, spanning elements), and to the
breaking down of the data constructs into elemental objects.

Several tools have been developed to support the extraction process. In this work we survey the
most interesting tools for the automatic detection and extraction of tabular data, analyzing their
respective advantages and limitations. A particular emphasis is given on programmable open
source tools because of their flexibility and long-term availability, together with the possibility to
easily tweak them to meet the peculiar needs of the problem at hand.

As a practical application, we also present a workflow based on a set of R and AWK scripts that can
automatically extract daily temperature and precipitation data from the official PDF documents
made available each year by Regione Puglia, in Italy. The lessons learned from the development of
this workflow and the possibility to generalize the approach to different kinds of PDF documents
are also discussed.

View publication stats


Powered by TCPDF (www.tcpdf.org)

You might also like