SciCap: Generating Captions for Scientific Figures
arXiv:2110.11624v2 [cs.CL] 25 Oct 2021

Abstract

Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SciCap (available at https://github.com/tingyaohsu/SciCap), a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing – including figure-type classification, sub-figure identification, text normalization, and caption text selection – SciCap contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.

Figure 1: The figure captioning model takes a scientific figure (e.g., a graph plot) as input and generates captions that describe the figure.

1 Introduction

Researchers use figures to explain complex concepts or show critical results. In scholarly articles, figure captions are critical to getting the message across effectively. Captions that are too generic (e.g., "Results of Experiment A.") or poorly written (e.g., "Relations between X and Y.") represent missed opportunities to explain scientific narratives to readers. Unfortunately, such low-quality captions still occur in published scientific articles. This paper aims to develop automatic figure-captioning models that generate high-quality captions for figures and charts in scientific papers (Figure 1).

Our motivation is two-fold. First, we aim to help researchers write better captions for the figures and charts in their papers. Automatic caption models trained on informative, high-quality captions can suggest better captions. Second, the proposed technology can make scientific charts and figures more accessible to blind or visually impaired readers. Researchers have developed technologies to assist the blind in navigating graphical content, such as data visualization charts (Swaminathan et al., 2014), printed physical maps (Swaminathan et al., 2016), 3D chemical diagrams (Bernareggi et al., 2019), and images on social media (Wu et al., 2017; Salisbury et al., 2017). However, only a few prior works have focused on scientific figures. An image-captioning model specialized for scientific figures can improve the narration of scientific articles for the blind even when the original caption is unhelpful.

To this end, we introduce SciCap, a large-scale image-captioning dataset that contains real-world scientific figures and captions. SciCap was constructed using computer science papers collected and released by arXiv. With pre-processing complete – including figure-type classification, sub-figure identification, text normalization, and caption text selection – SciCap contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both exciting opportunities and steep challenges of generating captions for scientific figures.

2 Related Work

One of the few prior works attempting to caption scientific figures was by Chen et al. (2019a; 2019b;
2020). They created FigCAP, a caption-figure pair corpus where the figures are synthesized, and used an LSTM model with an attention mechanism to produce captions. FigCAP was built on research that aimed to analyze figure content automatically, including FigureSeer (Siegel et al., 2016), FigureQA (Kahou et al., 2017), and DVQA (Kafle et al., 2018). DVQA and FigureQA were both made using synthetic figures; FigureSeer contained over 60,000 figures across seven figure types extracted from research papers. Meanwhile, Qian et al. (2020) proposed a set of "caption units" (such as Title, Label Name, Min/Max, etc.) that are important to include in a caption of scientific figures; they created a model, FigJAM, to produce such units (Qian et al., 2021). Also relevant is the "data-to-caption" work, which takes a chart's source data table and metadata as input to generate a caption (Obeid and Hoque, 2020; Spreafico and Carenini, 2020). These models generate captions based on data tables, not the figures.

Differences Between Synthetic and Real-World Captions. Most prior work has tried to generate captions for scientific figures using synthetic images and texts (Chen et al., 2019a,b, 2020; Kahou et al., 2017). However, synthetic captions tend to be generic and describe features without conveying higher-level insights, for example: "This is a line plot. It contains 6 categories. Dark Magenta has the lowest value. Lawn Green has the highest value." (example from FigCAP). Human-written captions, on the other hand, tend to highlight the meaningful parts of the figure and bring more context, for example: "Train loss curve with respect to optimization steps. With prior coarse-tuning on NLI data, convergence becomes much faster and easier." [example from (Jin et al., 2020)].

3 Constructing SciCap Dataset

This section describes the process that massages real-world figure-caption data into an appropriate, easy-to-use format for the NLP community. This data-processing procedure was developed iteratively and empirically.

Step 1: Data Acquisition and Pre-processing. Data acquisition is a fundamental challenge for constructing a public scientific figure-caption dataset. Although there is a vast number of scientific papers, they are not all easy to access. SciCap is based on the arXiv dataset (Clement et al., 2019), available on Kaggle at https://www.kaggle.com/Cornell-University/arxiv. The arXiv dataset is licensed under CC-0, which grants remake and republish rights. It contains a repository of 1.7 million articles with relevant features, such as article titles, authors, categories, abstracts, full-text PDFs, and more.

We first downloaded all the scholarly articles from the arXiv dataset and froze the date on Dec 22, 2020 (a total of 1,921,287 papers). SciCap does not include any papers published after this date. We further narrowed our dataset to papers published between 2010 and 2020 in computer science (cs.) and machine learning (stat.ML) topics, which numbered 295,028 papers. We did not use these papers' "source files," which might contain the original LaTeX and figure files. Not all papers come with source files; some source files have complex dependencies that are hard to parse.

Step 2: Figure-Caption Pair Extraction. We then used PDFFigures 2.0 (Clark and Divvala, 2016) to extract the figures from papers in our paper collection. PDFFigures 2.0 is a Scala-based tool created to extract figures, captions, tables, and section titles from scholarly documents, with a focus on the computer science domain. In addition to the figures' images and captions, the tool also extracted all the text snippets inside the figures, such as legends, X-Y labels, and titles. The extracted information can be used to boost the performance of image-captioning models. This step resulted in 295,028 papers and 2,170,719 figures.

Step 3: Figure Type Classification. Given the high diversity in the figure types included in scientific articles, we did not aim to create a single captioning model for all types of figures. Instead, we aimed to create captioning models specialized for one particular figure type. We used an automatic figure type classifier (Siegel et al., 2016) to classify figure types in SciCap. This pre-trained classifier can identify seven types of figures: graph plots, flowcharts (also called node diagrams), equations (also called algorithms), bar plots, scatter plots, tables, and "other." Its reported accuracy is 86% over 60,000 samples (Siegel et al., 2016).

According to the classifier's prediction, out of 2,170,719 figures, 19.2% (416,804) are graph plots, 23.6% (511,984) are tables (in this work, tables are not considered to be figures due to drastically different visual features and contents), 5.9% (127,197) are
equations (including algorithms and pseudo codes), 8.5% (185,398) are flowcharts, 2.0% (44,052) are scatter plots, 4.7% (101,146) are bar charts, and 36.1% (784,138) are "other." In SciCap, we only focus on graph plots, which have the highest classification performance (Siegel et al., 2016) and are also the most common figure type.

Step 4: Removing Figures with Subfigures. Many scientific figures contain subfigures. For example, in our pilot study (Section 3.1), 35.72% of overall scientific figures had subfigures. SciCap focuses on generating captions for single figures, so we removed figures with subfigures from the dataset. We first used handcrafted rules to identify captions that explicitly mention or refer to subfigures [for example, (a), a), (b), b), (1), 1), (2), 2), etc.]. Furthermore, we also used FigureSeparator (Tsutsui and Crandall, 2017) to filter figures with subfigures out of our collection. FigureSeparator is a CNN-based model that separates compound figures in the ImageCLEF Medical dataset with 85.9% accuracy.

Of 416,804 graph plots identified in Step 3, the rule-based approach yielded 352,719 graph plots, and FigureSeparator further narrowed the collection down to 133,543 figures. An estimated 32.04% of the graph plots did not have subfigures.

Step 5: Text Normalization. We used NLTK (Loper and Bird, 2002) for tokenization and converted all the text to lowercase. We also removed the figure numbers, such as "Figure 1:" or "Fig. 1:", and only kept the main caption text. The following two text normalization strategies were then applied:

• Basic Normalization: We replaced all the numbers (e.g., 0, -0.2, 3.44%, 1,000,000) with [NUM].

• Advanced Normalization: We created regular expressions to identify equations in captions and replaced them with [EQUATION]. We also replaced all the text spans enclosed by any type of bracket pair, including {}, [], and (), with [BRACKET].

Step 6: Target Caption Text Selection. SciCap provides three different data collections, each sampled using different strategies:

• First Sentence (133,543 Figures): This collection includes all the figures. For each figure included, this collection only includes the first sentence of the caption.

• Single-Sentence Caption (94,110 Figures): This collection includes the complete caption of only the figures with a one-sentence caption. Of the graph plots, 70.47% had a one-sentence caption.

• Caption with No More than 100 Words (131,319 Figures): This collection includes the complete caption of only the figures whose captions contained no more than one hundred tokens (punctuation marks included). In this collection, a caption contains 1.66 sentences on average (SD=1.07).

On average, with advanced normalization (Step 5), a sentence in the "First Sentence" collection contains 23.19 tokens (SD=20.86); a sentence in the "Single-Sentence Caption" collection contains 14.05 tokens (SD=8.15); and a sentence in the "Caption with No More Than 100 Words" collection contains 22.04 tokens (SD=17.44).

Note that we first created the 80/10/10 train/val/test data split for the entire corpus and then proceeded with the caption selection step. This procedure ensured that we used the identical set of figures to construct each collection's test set; the same applied to their training and validation sets.

3.1 Data Analysis and Quality Measurement

To evaluate the quality of our data cleaning and processing pipeline, we randomly sampled 2,000 figures from the original arXiv dataset, and one author manually labelled each figure's figure type and whether it contained subfigures (Yes/No). Of these 2,000 figures, 1,926 figures had no extraction errors and were included in our follow-up calculation. As for types, 20.35% of the figures were graph plots, 4.1% were bar charts, and 3.11% were scatter plots. In terms of subfigures, 237 out of 1,926 figures (35.72%) contained subfigures: 33.14% of these figures contained graph plots as subfigures, 5.81% contained bar charts, and 6.83% contained scatter plots.

We used these 1,926 labeled images to evaluate the tools we employed in constructing SciCap. Table 1 shows the results. For the figure

Figure Type Classification (Class = Graph Plot)
  Approach                      P    R    F    Acc
  (Siegel et al., 2016)         .90  .83  .87  .95

Non-Subfigure Figure Classification
(For figures labeled as graph plots in Step 3.)
  Approach                      P    R    F    Acc
  Rule-Based                    .54  .95  .69  .59
  FigureSeparator               .98  .66  .79  .83
  Rule-Based+FigureSeparator    .98  .62  .76  .81

Table 1: The tools used to construct SciCap, evaluated on 1,926 labeled images. For figure type classification, the overall performance over graph plots was reliable. Regarding identifying the graph plots (as labeled automatically in Step 3) that do not contain subfigures, FigureSeparator achieved an exceptionally high precision.

First Sentence
  Subfig Filter    Norm.
  Rule   FigSep    B.   A.    #Fig.     Vocab Size   BLEU-4
  –      –         –    –     416,804   30,776       .0259
  ✓      –         –    –     352,719   24,355       .0236
  ✓      ✓         ✓    –     133,543   12,666       .0224
  ✓      ✓         ✓    ✓     133,543   11,946       .0219

Single-Sentence Caption Only
  Subfig Filter    Norm.
  Rule   FigSep    B.   A.    #Fig.     Vocab Size   BLEU-4
  –      –         –    –     247,649   21,765       .0291
  ✓      –         –    –     218,655   17,685       .0228
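The rule-based subfigure check of Step 4 and the two normalization strategies of Step 5 can be sketched with regular expressions. The patterns below are simplified, hypothetical stand-ins (the exact rules used to build SciCap are not given in the text, and equation detection for [EQUATION] is omitted):

```python
import re

# Hypothetical pattern for Step 4's rule-based filter: captions that
# explicitly refer to subfigures, e.g., "(a)", "b)", "(1)", "2)".
SUBFIG_PATTERN = re.compile(r"\(?\b([a-z]|\d{1,2})\)", re.IGNORECASE)

def mentions_subfigure(caption: str) -> bool:
    """Return True if the caption appears to refer to subfigures."""
    return SUBFIG_PATTERN.search(caption) is not None

# Hypothetical patterns for Step 5.
NUM_PATTERN = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?%?")
BRACKET_PATTERN = re.compile(r"[\(\[\{][^\(\)\[\]\{\}]*[\)\]\}]")

def normalize_caption(caption: str, advanced: bool = False) -> str:
    """Lowercase, strip the leading figure number, and replace tokens."""
    text = caption.lower()
    # Remove leading figure numbers such as "figure 1:" or "fig. 1:".
    text = re.sub(r"^fig(?:ure)?\.?\s*\d+\s*[:.]\s*", "", text)
    if advanced:
        # Advanced normalization: bracketed spans become [BRACKET].
        text = BRACKET_PATTERN.sub("[BRACKET]", text)
    # Basic normalization: all numbers become [NUM].
    text = NUM_PATTERN.sub("[NUM]", text)
    return text
```

For example, `normalize_caption("Figure 1: Accuracy vs. epochs (baseline), 3.44% improvement.", advanced=True)` yields `"accuracy vs. epochs [BRACKET], [NUM] improvement."`.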
single-sentence captions performed the best. This might be because the Single-Sentence Caption collection, which is a subset of the First Sentence collection, had the smallest vocabulary size.

Effects of Text Normalization. Our experiments did not show clear benefits of text normalization on the resulting BLEU-4 scores. We will explore other methods to normalize text, for example, using advanced techniques to identify equations in text (Mali et al., 2020; Mansouri et al., 2020).

Effects of Text and Vision Features. We also used Vision-Only, Text-Only, and Text+Vision features to develop models (Table 3). Vision-Only and Text-Only features yielded similar performance. Furthermore, the models performed slightly worse when training on combined features.

Data Collection           Feature       BLEU-4
First Sentence            Vision Only   .0219
                          Vision+Text   .0205
                          Text Only     .0213
Single-Sent Caption       Vision Only   .0207
                          Vision+Text   .0202
                          Text Only     .0212
Caption w/ <=100 words    Vision Only   .0172
                          Vision+Text   .0168
                          Text Only     .0165

Table 3: The experimental results of models using Vision-Only, Text-Only, and Vision+Text features. Vision-Only and Text-Only features yielded similar performance. (All the subfigure-filtering and text-normalization steps were applied.)
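The BLEU-4 scores reported in this section follow the standard corpus-level definition: clipped n-gram precision for n = 1..4 with uniform weights, multiplied by a brevity penalty. The text does not specify the exact evaluation script, so the following is only a self-contained sketch of that standard computation (one reference per figure, no smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu4(references, hypotheses):
    """Corpus-level BLEU-4 with one reference caption per hypothesis.
    references/hypotheses: lists of token lists (e.g., NLTK-tokenized)."""
    clipped = [0] * 4   # clipped n-gram matches, n = 1..4
    totals = [0] * 4    # total hypothesis n-grams, n = 1..4
    hyp_len = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, 5):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each hypothesis n-gram count by its reference count.
            clipped[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(clipped) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match
    log_precision = sum(0.25 * math.log(c / t)
                        for c, t in zip(clipped, totals))
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_precision)
```

A caption identical to its reference scores 1.0; a caption sharing no 4-gram with its reference scores 0.0 under this unsmoothed variant, which is one reason BLEU-4 on short captions is sensitive to the evaluation configuration.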
Xin Qian, Eunyee Koh, Fan Du, Sungchul Kim, and Joel Chan. 2020. A formative study on designing accurate and natural figure captioning systems. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–8.

Xin Qian, Eunyee Koh, Fan Du, Sungchul Kim, Joel Chan, Ryan A. Rossi, Sana Malik, and Tak Yeon Lee. 2021. Generating accurate caption units for figure captioning. In Proceedings of the Web Conference 2021, pages 2792–2804.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792.

Peng W. Zhang, Francis Lau, and Chiu-W. Sham. 2020. Protograph-based low-density parity-check Hadamard codes. arXiv preprint arXiv:2010.08285.