
2020 5th International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA) | 978-1-7281-9437-0/20/$31.00 ©2020 IEEE | DOI: 10.1109/CITISIA50690.2020.9371784

Rule Based Approach to Extract Metadata from Scientific PDF Documents

Ahmer Maqsood Hashmi∗, Muhammad Tanvir Afzal†, and Sabih ur Rehman‡

∗ Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan
† Department of Computer Science, Namal Institute, Mianwali, Pakistan
‡ School of Computing and Mathematics, Charles Sturt University, Australia

Email: ∗ ahmermaqsoodh@gmail.com, † tanvir.afzal@namal.edu.pk, ‡ sarehman@csu.edu.au

Authorized licensed use limited to: Consortium - Saudi Arabia SDL. Downloaded on May 27,2021 at 13:46:54 UTC from IEEE Xplore. Restrictions apply.

Abstract—The number of scientific PDF documents is increasing at a very rapid pace, and searching through these documents is becoming a time-consuming task. To make search and storage more efficient, we need a mechanism to extract metadata from these documents and store it according to its semantics. Extracting and storing this metadata is a very time-consuming task that requires a lot of human effort if performed manually, due to the large number of documents and their varying formats. In this paper, we present a rule-based approach to extract metadata from research articles. The approach was developed and evaluated on a diverse data-set provided by ESWC (2016), covering a number of different formats and features. Evaluation results show that our proposed approach performs 22% better than CERMINE and 9% better than GROBID.

Index Terms—Rule-based, information, Machine Learning (ML), extraction, PDF

I. INTRODUCTION

Nowadays, most research articles are available in Portable Document Format (PDF). According to a study conducted at the University of Ottawa [1], more than 50 million articles were published between 1665 and 2009. These articles are mostly available as PDF documents and are stored by digital libraries and Citation Indexes (CI). It has been observed that it is very hard to find relevant research articles in digital libraries and citation indexes: these services search on the keywords typed into the search box rather than understanding the actual query. This happens due to the lack of metadata for the documents. There are several reasons for this lack of metadata: (1) authors do not provide complete information due to lack of knowledge; (2) journals/publishers do not have the capability to store such metadata and therefore do not collect it; (3) existing systems are not efficient enough to extract metadata in an automated manner.

To make these searches and indexes more refined, we need metadata for published research documents. Metadata extraction is a very difficult task that requires a lot of human effort (if performed manually), due to the sheer volume of journals/publishers, each with its own format and feature set.

In this paper, we present a rule-based approach to extract information from scientific PDF documents in an automated manner. We developed and tested our approach on the data-set [3] provided for the Extended Semantic Web Conference (ESWC) 2016. Our study focuses on extracting the following metadata properties from research articles: (1) Title, (2) Author, (3) Email, (4) Affiliation, (5) Country, (6) Section heading/title, (7) Table caption, (8) Figure caption, and (9) Funding agency. We were able to achieve very high F-scores for all the extracted metadata properties. Evaluation results show that our approach performed 22% better than CERMINE and 9% better than GROBID.

The rest of the paper is organized as follows: Section II presents a brief overview of previous research. Section III describes the data-set used for training and evaluation of the system. Section IV describes the complete working and metadata extraction process of the system. Section V presents the numerical results and a comparison with CERMINE and GROBID. Section VI concludes the paper.

II. STATE-OF-THE-ART

In this section, we briefly discuss the state-of-the-art and benchmark approaches currently used for metadata extraction. In [7], the authors conducted a detailed survey describing various techniques for extracting information from PDF documents, along with their advantages and disadvantages. These approaches fall into three main categories: (1) Rule/Heuristic based, (2) Machine Learning (ML) based, and (3) Hybrid approaches.

Rule/Heuristic based approaches work by critically analyzing the data-set and extracting common patterns to develop rules/heuristics. PDFX [5] extracts metadata from PDF documents using a rule-based approach. In the first step, this approach converts the PDF document into XML
format. After converting the document into XML format, PDFX performs a reconstruction of the document's logical structure, which extracts the logical sections of the PDF document. After the extraction of the logical sections, multiple rules are applied to the extracted sections to find the metadata. In the same manner, the authors in [10] proposed a rule-based approach that extracts information from the PDF document by converting it into XML and textual form, and then applies multiple rules to find the metadata. Most rule-based approaches convert the PDF document into XML/HTML or textual format and then develop rules for metadata extraction. Rule/Heuristic based approaches usually provide good results and do not need a very large data-set for training or rule identification; however, accommodating all the possibilities while crafting rules makes them very complex, which can result in low performance.

ML approaches work by extracting distinct features from PDF documents and applying ML techniques such as Conditional Random Fields (CRF) and Support Vector Machines (SVM) to identify the information available in the PDF. GROBID (GeneRation Of BIbliographic Data) [8] and CERMINE (Content ExtRactor and MINEr) [9] are two very popular tools in this area of research. Both are ML approaches under continuous development, and they are often used as benchmarks when comparing new methods proposed in this domain. GROBID identifies different font and geometrical features in the PDF document and then applies K-means clustering, CRF, and SVM classifiers to extract the metadata properties. GROBID is one of the most well-known tools available for metadata extraction from scientific PDF documents. CERMINE, on the other hand, applies ML techniques for logical and generic section classification. Each section is decomposed into zones, lines, words, and characters; once these are extracted, CERMINE uses ML to label each extracted item. ML approaches work very efficiently; however, they depend on the feature set obtained from the data-set and also on the data-set size. ML approaches are not very effective on smaller data-sets, because all ML approaches require a large tagged data-set to train the system.

Hybrid approaches combine the benefits of multiple approaches for the extraction of metadata. PDFMEF [11] incorporates multiple approaches to extract information from PDF documents: it extracts the header section metadata properties (title, author, affiliation, and country) with GROBID, whereas figures, tables, and headings are extracted by ParsCit. Another hybrid approach, proposed by Sateli and Witte [12], combines a LOD-based Named Entity Recognition (NER) tool with a rule-based approach: the NER tool extracts sections from the PDF document, and rules are then applied to those sections to extract the metadata. The authors in [13] proposed an approach that identifies section boundaries using machine learning; after identification, the sections are labeled with standard identifiers (Abstract, Introduction, Background, Experiment, etc.) using a set of rules/heuristics prepared by analysis of PDF documents. Hybrid approaches incorporate the advantages and effectiveness of multiple approaches; however, they are very difficult to develop, due to the complexity of combining multiple approaches. A major problem that often arises when combining an ML approach with other approaches is that the result still requires a large data-set in order to train properly.

III. DATA-SET

The data-set used in this research was provided by the ESWC [3] 2016 conference. It contains 45 papers from different CEUR [4] proceedings, with a variety of formats and textual/font features. Many heuristics were identified by performing a critical analysis of these research articles. Several different header section formats (the header section is the part that includes metadata properties such as title, author, and affiliations) were found in this data-set. "Fig. 1" shows some of the header formats identified in the data-set. In the same manner, a number of different formats and font features were identified from the data-set to develop the rule-based approach for this research.

Fig. 1. Header section formats

IV. METHODOLOGY

In this research, we propose a rule-based approach that first converts documents into XML format, and then extracts the metadata from the converted XML. Converting the PDF to XML provides a number of font and geometrical features that, when combined with textual features, yield very strong rules and good results. The extraction process involves two stages: (1) converting the PDF document into XML format, and (2) passing the converted XML document to the metadata extractor to perform the actual extraction. "Fig. 2" shows the methodology followed in this research.
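The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration only: the XML element and attribute names (`text`, `font-size`, etc.) are our assumptions in the spirit of the physical features the paper lists (page number, top, height, width, left, font size, bold, italic), not the actual output of the converter tool used by the authors.

```python
import xml.etree.ElementTree as ET

# Stage 1 normally converts the PDF to XML via an external tool; here we
# use a tiny hand-written fragment (hypothetical schema) so the sketch is
# self-contained.
SAMPLE_XML = """
<document>
  <page number="1">
    <text top="72" left="90" font-size="18" bold="yes">A Rule Based Study</text>
    <text top="110" left="95" font-size="10" bold="no">Jane Doe, John Roe</text>
    <text top="130" left="95" font-size="9" bold="no">jane@example.org, john@example.org</text>
  </page>
</document>
"""

def extract_title(xml_string):
    """Stage 2 sketch: apply the paper's title heuristic, i.e. the line
    with the largest font size in the header region is taken as the title."""
    root = ET.fromstring(xml_string)
    return max(root.iter("text"),
               key=lambda t: float(t.get("font-size", 0))).text

print(extract_title(SAMPLE_XML))  # -> A Rule Based Study
```

Each of the extractors described in Section IV would follow this shape: read the font/geometrical attributes from the converted XML, apply a rule, then clean the matched text.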
Fig. 2. Methodology Diagram

A. PDF to XML Conversion

A number of different approaches were considered for converting the PDF document into XML format. Numerous PDF-to-XML libraries are available for Java, Python, and other languages; however, the XML they produced was very noisy and did not give optimal conversion results. We therefore used a free online PDF-to-XML converter tool [6] to convert the PDF documents into XML format. This conversion had very little noise compared to the available libraries, and the font/geometrical features provided by this tool were very useful. The converted XML contained physical features such as (1) Page number, (2) Top, (3) Height, (4) Width, (5) Left, (6) Font size, (7) Bold, and (8) Italic.

B. Metadata Extraction

This stage extracts the metadata from the PDF document. The properties residing in the header section (the part of the paper that contains the title, author, email, affiliation, and country metadata) are obtained by first extracting the header section, after which each metadata property is extracted by its respective extractor. Each extractor identifies the information using rules/heuristics, extracts it, and cleans the extracted information to produce the final output.

1) Header Section Extraction: This section is extracted from the start of the page up to the "Abstract" or "Introduction" keyword; these keywords identify the end of the header section. Extracting this part of the paper helps in finding the title, author, email, affiliation, and country metadata. Separating the header section from the complete research article allows the efficient extraction of these metadata properties from this part only, instead of from the complete article.

2) Title Extraction: The title of a research article is extracted from the header section. The text with the largest font in the header section is considered the title of the research article. Using this heuristic, we were able to achieve an F-score of 1 on the research articles in the data-set.

3) Email Extraction: We identified a number of different email formats in the research articles. Email extraction is also performed on the header section. A line containing the "@" symbol is identified as containing an email. After the email lines are identified, they are extracted by our system. The final stage cleans and separates the emails by removing braces, brackets, ampersands, and pipes. After the cleaning stage, our system outputs each email on a separate line.

4) Affiliation & Country Extraction: These metadata properties are also found in the header section. The start of an affiliation is identified by checking for any of the following keywords: "Laboratory, Intelligence, Institut, Division, Faculty, University, College, Universit, Educational, Department, School, Centre, Institute, Group, Universität, Escuela, Engineering, Dept, St., Research". These keywords were identified by critically analyzing the research articles, and they help locate the start of an affiliation. The end of the affiliation is detected when a country name or an email is found. After the affiliation part is extracted, the information is cleaned and separated by outputting each affiliation on a separate line. Country names are extracted from the affiliation section using a predefined list of 195 countries and their short forms. We achieved an F-score of 1 for affiliation and 0.93 for country extraction.

5) Author Extraction: The authors of a research paper are also extracted from the header section of the PDF. All the previously extracted metadata properties (title, affiliation, and emails) are removed from the header section, and the remaining part is considered the author information. This data is passed through an author cleaner that removes extra words and separates the authors onto a single line. This heuristic process resulted in an F-score of 0.95 for author extraction.

6) Table Caption Extraction: The table extractor extracts table labels. It starts by identifying the beginning of a table label using multiple heuristics: the "Table|TABLE" keywords are matched, together with Arabic and Roman numbers, to find the start of the label. The geometrical properties of a table label are also used for identification. By critically analyzing all the research articles, the typical gap between a table label and the next/previous section was found to be "25". Using this gap helps greatly in finding the exact table labels instead of false positives. With these heuristics, we were able to achieve an F-score of 0.98.

7) Figure Caption Extraction: The figure extractor works in a similar way to the table extractor, but with different font and feature properties. The start of a figure label is identified by the "Figure|FIGURE|Fig|FIG" keywords, matched together with Roman and Arabic numbers. The gap between a figure label and the previous/next section was found to be "26". This geometrical gap helps in identifying correct figure labels instead of false positives. These heuristics identified figure labels with an F-score of 0.94.
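As an illustration of the caption heuristics above, the keyword-plus-number test can be approximated with a regular expression. The pattern below is our own sketch, not the authors' implementation, and the geometrical gap checks ("25"/"26") are omitted:

```python
import re

# Keywords from the table/figure caption heuristics, followed by an
# Arabic numeral (1, 2, ...) or a Roman numeral (I, II, IV, ...).
CAPTION_START = re.compile(
    r"^(Table|TABLE|Figure|FIGURE|Fig\.?|FIG\.?)\s+"
    r"([0-9]+|[IVXLC]+)\b"
)

def is_caption_start(line):
    """Return True if the line looks like the start of a table/figure label."""
    return CAPTION_START.match(line.strip()) is not None

print(is_caption_start("TABLE I"))                      # True
print(is_caption_start("Fig. 2. Methodology Diagram"))  # True
print(is_caption_start("The table shows"))              # False
```

In the actual pipeline, a match like this would only be accepted when the vertical gap to the neighboring text block also satisfies the threshold, which is what filters out false positives.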
8) Headings Extraction: In this research, we focused on identifying level 1 headings. The start of a heading is identified by a Roman or Arabic number, after which the heading always starts with a capital letter. Using these heuristics, we were able to identify the headings of a research article with an F-score of 0.8.

9) Funding Agency Extraction: The funding agency is first searched for in the "Acknowledgement" section. If there is no "Acknowledgement" section, the complete paper is searched. A set of keywords ("supported by", "financial support", "supported", "in part", etc.) is used to identify the start of the funding agency mention. After the start is identified, the end is detected by the following keywords: "under grant, fig, within, ), under, grant, .". Using these heuristics, we achieved an F-score of 0.90.

V. RESULTS

This section discusses the results and evaluation of our proposed approach. We evaluated our approach on three parameters: (1) Precision, (2) Recall, and (3) F-measure. Precision (1) is the ratio of relevant results to retrieved results. Recall (2) is the ratio of relevant results to actual results. F-measure (3) is the harmonic mean of precision and recall and reflects the actual test accuracy of a system.

    Precision = RelevantResults / RetrievedResults    (1)

    Recall = RelevantResults / ActualResults    (2)

    F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (3)

The equations above are used to calculate the precision, recall, and F-measure of our proposed approach. To evaluate our approach, we also calculated results for CERMINE and GROBID. We were able to achieve higher values than CERMINE and GROBID for most of the properties. "Table I" summarizes the evaluation results of our approach against CERMINE and GROBID. We extracted nine metadata properties, while in comparison GROBID and CERMINE extract only 8 and 5 respectively. Our system performed much better than these two techniques, providing 22% and 9% better metadata extraction than CERMINE and GROBID respectively.

TABLE I
COMPARISON OF PROPOSED APPROACH WITH CERMINE & GROBID

Metadata Property | Proposed Approach | CERMINE | GROBID
Title             | 1                 | 1       | 1
Author            | 0.95              | 0.70    | 0.98
Email             | 0.99              | 0.65    | 0.73
Affiliation       | 0.93              | 0.80    | 0.90
Country           | 1                 | 0.91    | 0.92
Table Caption     | 0.98              | 0       | 0.73
Figure Caption    | 0.94              | 0       | 0.76
Headings          | 0.79              | 0.70    | 0.88
Funding Agency    | 0.82              | 0       | 0

VI. CONCLUSION

In this paper, we have presented a rule-based approach for metadata extraction from scientific PDF documents. Evaluation results show that we were able to achieve much higher accuracy than CERMINE and GROBID. There were some PDF-to-XML conversion deformities and some cases very specific to the data-set, due to which we were unable to reach complete 100% results. Our future work includes evaluating our approach on multiple data-sets and removing the discrepancies from the PDF-to-XML conversion as much as possible. Besides this, we will also extract more metadata properties present inside research articles, such as publisher, second-level headings, keywords, and supplementary material.

REFERENCES

[1] J. Arif, "Article 50 million: An estimate of the number of scholarly articles in existence," Learned Publishing, vol. 23, pp. 258–263, July 2010.
[2] "Extended Semantic Web Conference," https://github.com/ceurws/lod/wiki/SemPub2016, accessed on 12 June, 2020.
[3] "ESWC Task 2 Training Data-Set," https://github.com/ceurws/lod/wiki/SemPub2016, accessed on 12 June, 2020.
[4] "CEUR," http://ceur-ws.org/, accessed on 12 June, 2020.
[5] A. Constantin, S. Pettifer, and A. Voronkov, "PDFX: fully-automated PDF-to-XML conversion of scientific literature," DocEng '13, pp. 177–180, 2013.
[6] "PDF to XML Conversion Tool," https://www.freefileconvert.com/pdf-xml, accessed on 12 June, 2020.
[7] Ahmer M. Hashmi, F. Qayyum, and Muhammad T. Afzal, "Insights to the state-of-the-art PDF Extraction Techniques," IPSI Transaction on Internet Research, vol. 16, 2020.
[8] P. Lopez, "GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications," in Research and Advanced Technology for Digital Libraries, Springer, pp. 473–474, 2009.
[9] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, and L. Bolikowski, "CERMINE: automatic extraction of structured metadata from scientific literature," International Journal on Document Analysis and Recognition (IJDAR), vol. 18, no. 4, pp. 317–335, 2015.
[10] R. Ahmad, M. T. Afzal, and M. A. Qadir, "Information extraction for PDF sources based on rule-based system using integrated formats," in Communications in Computer and Information Science, pp. 293–308, 2016.
[11] J. Wu, J. Killian, H. Yang, K. Williams, S. R. Choudhury, S. Tuarob, C. Caragea, and C. L. Giles, "PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search," in Proceedings of the 8th International Conference on Knowledge Capture, pp. 1–8, 2015.
[12] B. Sateli and R. Witte, "An Automatic Workflow for the Formalization of Scholarly Articles' Structural and Semantic Elements," in Communications in Computer and Information Science, pp. 309–320, 2016.
[13] S. Tuarob, P. Mitra, and C. L. Giles, "A Hybrid Approach to Discover Semantic Hierarchical Sections in Scholarly Documents," ICDAR '13, pp. 1081–1085, 2015.
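For completeness, equations (1)–(3) from Section V can be computed directly. The function and the example sets below are hypothetical, intended only to make the three measures concrete; this is not the authors' evaluation harness:

```python
def precision_recall_f1(extracted, ground_truth):
    """Compute equations (1)-(3) for one metadata property.

    `extracted` and `ground_truth` are sets of strings; relevant results
    are the items the extractor got right (the intersection).
    """
    relevant = len(extracted & ground_truth)
    precision = relevant / len(extracted) if extracted else 0.0       # (1)
    recall = relevant / len(ground_truth) if ground_truth else 0.0    # (2)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                             # (3)
    return precision, recall, f1

# Hypothetical example: three emails extracted, four expected, two correct.
p, r, f = precision_recall_f1(
    {"a@x.org", "b@x.org", "wrong@x.org"},
    {"a@x.org", "b@x.org", "c@x.org", "d@x.org"},
)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.5 0.57
```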
