Controllable Citation Text Generation


Nianlong Gu and Richard H.R. Hahnloser


Institute of Neuroinformatics,
University of Zurich and ETH Zurich
{nianlong,rich}@ini.ethz.ch

arXiv:2211.07066v1 [cs.CL] 14 Nov 2022

[Figure 1: Example citation sentences of the same paper, taken from a paragraph in (Liu et al., 2019). The citations differ in intents and keywords. Cited paper "[53]": "Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision"; the shown examples carry intent "background" with keywords "encoder-decoder network; 3D shape", and intent "method" with keyword "mesh representation".]

Abstract

The aim of citation generation is usually to automatically generate a citation sentence that refers to a chosen paper in the context of a manuscript. However, a rigid citation generation process is at odds with an author's desire to control the generated text based on certain attributes, such as 1) the citation intent of, e.g., either introducing background information or comparing results; 2) keywords that should appear in the citation text; or 3) specific sentences in the cited paper that characterize the citation content. To provide these degrees of freedom, we present a controllable citation generation system. In data from a large corpus, we first parse the attributes of each citation sentence and use these as additional input sources during training of the BART-based abstractive summarizer. We further develop an attribute suggestion module that infers the citation intent and suggests relevant keywords and sentences that users can select to tune the generation. Our framework gives users more control over generated citations, outperforming citation generation models without attribute awareness in both ROUGE and human evaluations.

1 Introduction

A common practice in scientific writing is to cite and discuss relevant papers in support of an argument, in provision of background information, or in comparison of results (Penders, 2018). Recent studies aim to automate this process by using neural networks to generate a citation sentence based on information from the manuscript being written and/or the paper to be cited. For example, Nikiforovskaya et al. (2020) proposed a BERT-based extractive summarizer (Liu, 2019) that produces a paper review by extracting one sentence from each of the related papers. Chen and Zhuge (2019) proposed to automatically generate a related work section by extracting information on how papers in the reference list have been cited in previous articles. Xing et al. (2020) developed an RNN-based pointer-generator network that can copy words from the manuscript and the abstract of the cited paper based on cross-attention, which was further extended in (Ge et al., 2021) using information from the citation graph to enhance citation generation.

These efforts focused primarily on developing fully automated pipelines, and they left little room for users to control the generation process. However, we believe control is desirable for the following considerations. Authors often have a clear motivation before writing a citation sentence. For example, they may have a specific intent to cite, such as comparing results or presenting background information; they may have keywords in mind to appear in the citation sentence; or they may want to refer to a particularly relevant sentence in the body text of the cited paper, e.g., a specific experimental finding. Even in a given context, the motivation to cite a paper can be diverse (Figure 1). When the generated citations do not match an author's motivation, the author may wish to change the generation by specifying certain attributes, such as citation intent, keywords, or relevant sentences. However, recent works do not allow for this possibility (Wang et al.,
2022; Xing et al., 2020) because their generation process is not conditional on these properties.

In order to allow users to freely adjust the generation, we propose a controllable citation generation system consisting of two modules: a conditional citation generator and an attribute suggester.

The conditional citation generator is based on BART-large (Lewis et al., 2020), a transformer-based (Vaswani et al., 2017) text denoising autoencoder. During training, we let BART-large receive two sets of input sources. The first group is contextual text, including 1) the sentences preceding the target citation sentence as local context in the manuscript and 2) the title and abstract of the cited paper as global context. The second group is formed by the attributes associated with the target citation sentence, including the citation intent, the keywords, and the most relevant sentences in the body text of the cited paper. We add citation attributes as input to make the decoding process of BART-large conditional on these attributes.

The citation attribute suggestion module modifies and fine-tunes SciBERT (Beltagy et al., 2019) to infer possible intents and suggest keywords and related sentences based on the contextual text and the body text of the cited paper. The purpose of this module is to propose to a user attributes that guide the generation of citation sentences. We believe that such a suggestion-selection-generation pipeline can maintain a fair degree of automation while retaining good controllability.

Our contributions are summarized as follows: 1) We propose a controllable citation generation pipeline with an attribute suggestion module and a conditional citation generation module; 2) We evaluate the controllability of our system in response to different attributes using automated metrics and human evaluation; 3) We parse the contextual text and citation attributes of each citation sentence to build a large dataset that can be used for future studies on controllable citation generation.

2 Related Work

Our idea of a controllable citation generation system originated from the study of conditional generative models that were proposed in Sohn et al. (2015) and applied to controlled text generation tasks (Ficler and Goldberg, 2017; Keskar et al., 2019; Dathathri et al., 2019). For example, Keskar et al. (2019) used attributes (or control codes) such as genre or topic as conditions when pre-training an autoregressive language decoder based on self-attention (Vaswani et al., 2017) to encourage next-token prediction (during decoding) conditional on a specific attribute. In this paper, we apply this conditional approach to the citation generation task, where we introduce a set of citation-related attributes as conditional input text when training an encoder-decoder model to generate citation sentences.

Recently, Jung et al. (2022); Wu et al. (2021) proposed intent-controlled citation generation models that focus on controlling the intention of generated citations, while in this paper, we deal with more conditions beyond the intent of the citation. In addition, we propose solutions to automatically suggest potential attributes to better balance the automation and controllability of citation generation.

3 Model

We introduce our controllable citation generation system (outlined in Figure 2) in this section.

3.1 Conditional Citation Generation Module

The Conditional Citation Generation (CCG) module outputs a citation sentence Y given the contextual text X and the citation-related attributes C. C includes 1) the citation intent, 2) keywords, and 3) sentences in the body text that are related to the citation sentence. The training objective is to maximize the conditional generation probability:

    p(Y \mid X, C) = \prod_{i=1}^{n} p(y_i \mid y_1 y_2 \ldots y_{i-1}, X, C),    (1)

where y_i is the i-th token in the citation sentence Y. This task can be modeled as a sequence-to-sequence problem where we consider X and C as input text and Y as the output sequence, and train BART-large to minimize the negative log-likelihood -log p(Y | X, C). During testing, we let BART-large take the contextual information X and the attributes C as inputs to the encoder and generate the citation sentence on the decoder side by next-token prediction to maximize the probability in Equation (1).

For X, we define the local context as up to Ns (Ns = 5) sentences (from the same section) that precede the target citation sentence to be generated. The global context in X is formed by the title and abstract of the cited paper.
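The factorization in Equation (1) can be made concrete with a toy computation; the per-token probabilities below are made up for illustration and merely show how the sequence probability and the training loss -log p(Y | X, C) decompose over tokens:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood -log p(Y | X, C), given the model's
    per-token conditional probabilities p(y_i | y_1 ... y_{i-1}, X, C)."""
    return -sum(math.log(p) for p in token_probs)

# Toy per-token probabilities for a 4-token citation sentence
# (illustrative values only, not real BART-large outputs).
probs = [0.9, 0.5, 0.8, 0.7]

# The sequence probability of Equation (1) is the product of the
# per-token probabilities, i.e. exp(-NLL).
p_sequence = math.exp(-sequence_nll(probs))
print(round(p_sequence, 3))  # 0.252
```

Minimizing this token-level NLL over the corpus is exactly what teacher-forced seq2seq training does; conditioning on C only changes the encoder input, not the loss.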
[Figure 2 shows the pipeline. The manuscript provides the (local) context (e.g., "...Comparing to supervised learning, unsupervised 3D reconstruction is becoming increasingly important as collecting ground-truth 3D models is much more difficult than labeling 2D images."), and the paper to be cited provides the global context (title: "Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision"; abstract: "Understanding the 3D world is a fundamental problem in computer vision. However, ...") and its body text. The attribute suggestion module comprises a citation intent predictor (e.g., background 0.8, method 0.1, result 0.1), a keyword extractor (e.g., "3D objects; unsupervised 3D reconstruction; 2D image; ..."), and a relevant sentence extractor (top relevant sentences from the body text). The user (author) selects the intent, keywords, and sentences, which condition the Conditional Citation Generation (CCG) module. Generated citation sentence: "Perspective transformer nets [53] propose an encoder-decoder network which learns 3D shape from silhouette images in an unsupervised fashion."]

Figure 2: The pipeline for our controllable citation text generation system. The attribute suggestion module suggests candidate attributes, including citation intent, relevant keywords, and sentences. Users can select the desired attributes from the suggestions and use them as conditions to guide the citation generation module.

For C, when training the citation generator, we parse the ground truth citation sentence to obtain the oracle citation attributes (Section 4.3). During testing, the attributes C are either recommended by our attribute suggestion module or specified by the users.

To allow the BART-large encoder to distinguish between different input sources, we add a special field name (or control code) before each input source, following Raffel et al. (2020); Keskar et al. (2019). The final input text is structured as: "intent: [Citation intent] keywords: [Keywords relevant to the citation, separated by ';'] sentences: [Body text sentences relevant to the citation] context: [Local context in the manuscript] title: [Title of the cited paper] abstract: [Abstract of the cited paper]".

3.2 Attribute Suggestion Module

The attribute suggestion module consists of three submodules: a citation intent predictor, a keyword extractor, and a relevant sentence extractor.

The Citation Intent Predictor predicts the intent of a citation sentence based on the local context in the manuscript and the global context (title and abstract) of the cited paper. Following Cohan et al. (2019), we consider three possible intents i: 1) background, summarizing related work or concepts as background; 2) method, using a certain method or dataset (e.g., the second citation sentence in Figure 1); and 3) result, comparing results.

We connect the local and global context texts with a special token "[SEP]" and prepend a token "[CLS]" at the beginning. We use SciBERT to compute from the concatenated text the last hidden states of all tokens. We then input the last hidden state of "[CLS]" into the intent prediction header, a fully connected two-layer network, to obtain the output x_intent(i). Finally, we apply the softmax to x_intent(i) to produce the likelihood p_intent(i) for the three possible intents. After obtaining the ground truth intent i_true by parsing the target citation sentence (Section 4.3), we train SciBERT and the intent prediction header to minimize the cross-entropy loss:

    L = -\log \frac{\exp(x_{\text{intent}}(i_{\text{true}}))}{\sum_{i \in \text{all intents}} \exp(x_{\text{intent}}(i))}    (2)

The Keyword Extractor extracts keywords relevant to the target citation sentence from the contextual text. We obtain a set of candidate keywords by chunking all noun phrases in the contextual text (we use the noun phrase chunker from SpaCy, https://spacy.io/, and remove the articles of the noun phrases). We embed the contextual text q into a query embedding v_q and each candidate keyword k_i into a keyword embedding v_{k_i} using the same SciBERT. The text encoding is done by averaging the last hidden states of all tokens in the text. Then we rank the keywords based on the cosine similarity between v_{k_i} and v_q. Inspired by Zhong et al. (2020), we fine-tune the SciBERT encoder so that relevant keywords are ranked at the top positions. Here, we measure relevance using the average of the ROUGE-1 & 2 F1 scores between the keyword and the target citation sentence, and we assign a rank to each keyword based on its relevance score.
Higher relevance scores correspond to smaller rank values (minimum of 1), and keywords with the same relevance score are assigned the same rank. Within the candidate keyword list, given a pair of keywords k_i and k_j whose ranks r_i and r_j satisfy r_i < r_j, we fine-tune SciBERT with the triplet loss:

    L = \max(0, f(v_q, v_{k_j}) - f(v_q, v_{k_i}) + (r_j - r_i) \times \gamma), \quad r_i < r_j,    (3)

where f(v_a, v_b) = \frac{v_a^T v_b}{\|v_a\| \|v_b\|} denotes the cosine similarity between the vectors (v_a, v_b) and γ is the margin used in the triplet loss (Schroff et al., 2015).

In order to extract relevant and diverse keywords, the keyword extractor selects keywords based on maximal marginal relevance (MMR) (Carbonell and Goldstein, 1998): at each step, it picks an unselected keyword k_i that maximizes:

    (1 - \alpha) f(v_q, v_{k_i}) - \alpha \max_{k_j \in S_k} f(v_{k_i}, v_{k_j}),    (4)

where α is the diversity factor and S_k represents the set of selected keywords. We set α = 0.2 to reduce redundancy among extracted keywords.

The Sentence Extractor extracts the sentences most relevant to the target citation sentence from the body text of the cited paper. Similarly to the keyword extractor, we use a SciBERT text encoder to encode contextual text and body sentences as embeddings. We then use the contextual text as a query and rank the body sentences based on the cosine similarity between the query embedding and the sentence embedding. We fine-tune the SciBERT encoder to encourage sentences that are more relevant to target citation sentences (as measured by the average ROUGE-1 and ROUGE-2 F1 scores) to rank in the top positions, the same strategy used to fine-tune the keyword extractor. Furthermore, when calculating the triplet loss similar to Equation (3), we clamp the rank value of the sentences to 10: r_i = min(r_i, 10), because we focus only on the ranking order of the most relevant sentences.

4 Dataset Preparation

4.1 Cited Paper Filtering

For the training set, we filter a subset of S2ORC (Lo et al., 2020) papers that were published between 2000 and 2020 in the domains of biology, medicine, or computer science. Furthermore, we select only papers that are cited in at least 50 sentences contained in S2ORC. With this criterion, we obtain for each paper many citation sentences, hopefully with diverse citation attributes (intents and keywords). Training our conditional citation generator on such a dataset likely allows the generator to learn to cite papers under diverse conditions.

For the validation and test sets, we obtain citation sentences by parsing papers from arXiv (Kaggle, 2022) and PMCOA (of Medicine, 2003) that were published in 2022. We do not filter cited papers based on the number of citations, because we want to test our system in a real-world scenario where cited papers vary in terms of domain and number of citations.

4.2 Citation Sentence Filtering and Train/Test Decoupling

In the citation generation dataset proposed in Xing et al. (2020), some citation sentences citing more than one paper appeared in both the training and the test sets. For example, the sentence "co-occurrence of words [Paper A] and discourse relations [Paper B] also predict coherence." appeared as the target citation sentence in two samples, one in the training set and the other in the test set. It follows that the local context in the manuscript is identical for these two data samples. The only difference is that in the training sample paper A is considered as the cited paper, while in the test sample paper B is treated as the cited paper.

First, we argue that it is an ill-defined problem to train a model to generate a citation sentence that should cite two papers when only one paper is provided as the global context. Second, the fact that some samples from the training and test sets are coupled increases the chance of overfitting: a coupled pair of training and testing samples share the same local context as the input and the same citation sentence as the target output. Thus, the citation generation model can "remember" and thus "recite" the corresponding citation sentence during testing given a local context that has been used during training.

To this end, when creating our scientific controllable citation generation (SciCCG) dataset, we eliminated citation sentences that were too short (<5 words) or too long (>200 words), or those that cited more than one paper. In addition, we split the training, validation, and test sets with the following decoupling rule: Given a citation sentence s from one set (e.g. the training set), we refer to the paper from which s comes as the citing paper p_{s,citing} and the paper cited by s as the cited paper p_{s,cited}.
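The length and single-citation filter above can be sketched as follows (a minimal illustration; the actual pipeline additionally applies the decoupling rule when assigning samples to splits):

```python
def keep_citation_sentence(sentence, n_cited_papers):
    """Keep a citation sentence only if it is neither too short (< 5 words)
    nor too long (> 200 words) and cites exactly one paper."""
    n_words = len(sentence.split())
    return 5 <= n_words <= 200 and n_cited_papers == 1

print(keep_citation_sentence("Too short [1].", 1))                            # False
print(keep_citation_sentence("This method builds on prior work [1].", 1))     # True
print(keep_citation_sentence("This method builds on prior work [1] [2].", 2)) # False
```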
When splitting the training, validation, and test sets, we ensure that the citation sentences from the other two sets neither come from p_{s,citing} nor do they cite p_{s,cited}.

Intent Category           Background  Method  Result  Average (Macro)
(# samples)               (1,014)     (613)   (260)
Jurgens et al. (2018)     84.7        74.7    78.2    79.2
Cohan et al. (2019)       87.8        84.9    79.5    84.0
SciBERT+scaffolds (Ours)  89.1        87.1    84.0    86.7

Table 1: The F1 scores on three citation intent categories and the average (macro) F1, tested on the SciCite dataset created by Cohan et al. (2019).

Information           Training   Validation  Test
# cited papers        59,645     1,373       1,404
# citing papers       1,404,690  1,360       1,382
# citation sentences  2,678,983  1,385       1,411
Statistics of parsed citation attributes of the citation sentences
# background intent   2,188,066  1,075       1,093
# method intent       388,437    213         219
# result intent       102,480    97          99

Table 2: The statistics of our SciCCG dataset.

4.3 Parsing Citation Attributes

Training the conditional citation generator (Section 3.1) and the citation intent predictor (Section 3.2) requires citation attributes as labels. However, the true attributes of the citation sentences are usually not explicitly provided by the authors. For training we therefore use pseudolabels by inferring citation attributes from the target citation sentences.

Citation Intent. We infer the intent of citation sentences using a SciBERT-based intent classifier that has the same structure as our citation intent predictor, except that it differs in terms of input text and purpose. The intent predictor takes contextual text as input and aims to predict the citation intent before the citation sentence is written. In contrast, the intent classifier works as a tutor. It takes the actual citation sentence as input and infers the true intent, which can be used to "teach" the citation intent predictor to predict the most likely intent.

We use a multitask training strategy (Cohan et al., 2019) to train the intent classifier. In addition to the main task of classifying citation intent, we add two auxiliary classification tasks (scaffolds) (Cohan et al., 2019) to improve the performance on the main task: 1) predicting the title of the section to which the cited sentence belongs; and 2) detecting whether the sentence needs a reference (citation worthiness). For each auxiliary task, we use a separate functional head, a two-layer fully connected network, with the SciBERT-encoded "[CLS]" hidden states as input, to classify section titles and citation worthiness, respectively. The training loss is a weighted sum of the cross-entropy losses (Equation (2)) of the three tasks, with a weight of 1.0 for the main task, and weights of 0.05 and 0.01 for the auxiliary tasks 1) and 2), respectively.

We use our SciBERT-based intent classifier (performance shown in Table 1) to parse the intent of each citation sentence in the training/validation/test set and use it as the ground truth citation intent of the target citation sentence when training the intent predictor and the conditional citation generator.

Relevant Keywords. After chunking all noun phrases in the contextual text as candidate keywords, our goal is to select a set of keywords that as a whole have the highest ROUGE-1 & 2 F1 scores compared to the target citation sentence. Following the greedy selection strategy (Gu et al., 2022), we select the keyword whose addition to the already selected set of keywords maximally increases the ROUGE F1 scores, and this selection process stops when the ROUGE F1 scores no longer increase or at least 3 keywords have been selected. Greedily selected keywords are used as the ground truth relevant keywords of the target citation sentence when training the CCG module.

Relevant Sentences. We extract up to 2 sentences from the body text of the cited paper using the same greedy strategy as for extracting relevant keywords. The statistics of our dataset are shown in Table 2.

5 Experiment

We choose a learning rate of 1e-5 to fine-tune the modules. When training the CCG module, we truncated the input sequence to a maximum length of 512 tokens and truncated the output to 75 tokens. To make the CCG module robust to the unavailability of conditional citation attributes, we randomly drop certain citation attributes during training as follows: 1) We set the citation intent to the empty string with probability 0.5; 2) Given n relevant keywords/sentences of the target citation sentence, we randomly select m (0 ≤ m ≤ n) keywords/sentences and use them as the conditional input.

For the citation intent predictor, the size of the hidden layer of the intent prediction head is 32.
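The attribute-dropping scheme above can be sketched as follows (a minimal illustration of the described sampling, not the actual training code):

```python
import random

def drop_attributes(intent, keywords, sentences, rng=random):
    """Randomly withhold citation attributes during training so the CCG
    module stays robust when attributes are unavailable at test time."""
    # 1) Drop the citation intent with probability 0.5.
    if rng.random() < 0.5:
        intent = ""
    # 2) Keep a random subset of m (0 <= m <= n) keywords/sentences.
    keywords = rng.sample(keywords, rng.randint(0, len(keywords)))
    sentences = rng.sample(sentences, rng.randint(0, len(sentences)))
    return intent, keywords, sentences

rng = random.Random(0)  # seeded for reproducibility
intent, kws, sents = drop_attributes(
    "background",
    ["3D shape", "encoder-decoder network"],
    ["sentence A", "sentence B"],
    rng)
print(intent in ("", "background"), len(kws) <= 2, len(sents) <= 2)
```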
Citation                                Citation Attributes                                                      Performance
Generation  Model             citation intent    relevant keywords          relevant sentences          R-1    R-2    R-L
Mode
fully       PTGEN-Cross†      –                  –                          –                           24.95  4.12   18.52
automatic   BART-large        –                  –                          –                           29.62  7.56   22.51
            BART-large-CCG    –                  –                          –                           29.43  7.53   22.70
                              Random             –                          –                           28.16  6.57   21.37
                              intent predictor   –                          –                           29.32  7.26   22.52
                              (ours)
                              –                  KeyBERT                    –                           29.46  6.71   21.78
                              –                  keyword extractor (ours)   –                           30.58  7.78   22.87
                              –                  –                          Sentence-BERT               29.87  7.08   21.93
                              –                  –                          sentence extractor (ours)   30.91  7.76   23.02
                              intent predictor   keyword extractor          sentence extractor          31.31  7.98   22.92
user-       BART-large-CCG    Ground Truth       –                          –                           29.85  7.75   23.10
controlled                    –                  Ground Truth               –                           38.74  12.53  28.68
                              –                  –                          Ground Truth                36.29  12.28  27.52
                              Ground Truth       Ground Truth               Ground Truth                42.89  15.91  31.61

Table 3: We evaluated the performance of citation generation in a fully automated mode and a user-controlled mode. We used the same trained BART-large-CCG model for all evaluation tasks. † represents our own implementation. The symbol "–" means that the corresponding citation attribute was not used as the conditional input for generation.

We observed that the categories of citation intent were unbalanced and most of the citation sentences carried "background" intents (Table 2). Therefore, when training the citation intent predictor, we downsampled the "background" citation sentences to 1/5 of the original number to balance the number of samples for all intent categories. For training the keyword extractor and the sentence extractor, we used the margin γ = 0.01 in the triplet loss (Equation (3)). We evaluated the generated citation sentences using ROUGE-1, 2, and L (Lin, 2004) F1 scores.

6 Results and Discussion

We evaluated the performance of citation generation in both the fully automatic mode and the user-controlled mode.

In the fully automatic mode, we assume that the ground truth citation attributes are unknown. We adopted a suggestion-generation strategy to automatically generate citation sentences. We first use the attribute suggestion module to suggest citation attributes (the citation intent, 3 relevant keywords, and 2 relevant sentences) and then let BART-large-CCG take contextual text and suggested attributes as input and generate citation sentences. We compared our method with 1) PTGEN-Cross (Xing et al., 2020) and 2) BART-large, both trained without using citation attributes as conditions. In addition, we compared our attribute suggestion module with the following baselines: 1) randomly selecting one intent from all three intents; 2) extracting keywords using KeyBERT (Grootendorst, 2020); and 3) ranking and extracting body text sentences based on embeddings encoded by Sentence-BERT (Reimers and Gurevych, 2019). (For KeyBERT, we used the BERT embedding model "all-mpnet-base-v2" and set the diversity to 0.2 when using MMR.)

In the user-controlled mode, we assume that ground truth citation attributes are available. This happens when users want to control the generation by specifying the desired citation attributes.

6.1 Results Comparison

In automatic mode, BART-large-CCG taking citation attributes suggested by our attribute suggestion module as conditional input outperformed unconditional generation models, including BART-large and PTGEN-Cross (Table 3). For example, when we let BART-large-CCG take as input 1) contextual text and 2) keywords extracted also from the contextual text using our keyword extractor, BART-large-CCG achieved a higher ROUGE score than BART-large, even though both methods take the contextual text as the only source of information. We observed that BART-large-CCG performed best when it used all the citation attributes (intent, keywords, and sentences) suggested by our attribute suggestion module. These results show that our pipeline for generating citation sentences using automatically suggested citation attributes is effective without human guidance.
In addition, our SciBERT-based keyword extractor outperformed KeyBERT when used as a relevant keyword suggestion module, and the SciBERT-based sentence extractor outperformed Sentence-BERT when suggesting relevant body sentences. We attribute this improvement to the fine-tuning process using triplet loss, which allows the SciBERT text encoder to better estimate the similarity between contextual text and keywords/sentences.

In the user-controlled mode, the performance of BART-large-CCG improved when we added ground truth citation attributes as conditional input, especially when the generation was conditioned on relevant keywords or relevant body text sentences from the cited paper. And the best performance was observed when all three attributes were used as conditional inputs. The large performance gain over the fully automatic mode indicates that the citation attributes are crucial for generating good citation sentences and should be considered as conditional input whenever available, for example, when authors want to cite a paper with specific keywords or relevant sentences they want to discuss.

The results in Table 3 show that our BART-large-CCG model has stable performance regardless of whether attributes are suggested by our attribute suggestion module or specified by users, and is flexible and robust under different availability of citation attributes, indicating good controllability, which we will investigate further next.

6.2 Controllability of CCG Module

We tested whether the generated citation sentences reflect the provided conditional attributes as follows. For each test sample, we manually assigned a conditional intent ("background", "method", or "result") and let BART-large-CCG use the assigned intent attribute, leaving the other conditional attributes (e.g., relevant keywords and sentences) and the contextual text unchanged during generation of a citation sentence. We then determined the intent of the generated citation using our SciBERT-based intent classifier (Section 4.3). We calculated the intent category frequencies of the generated citations and plotted the confusion matrix (Figure 3a). The large values of the diagonal elements in the confusion matrix imply that the intents of the generated citations tend to be consistent with the desired intent provided as input, indicating that our model effectively adapts the generated text to the desired intent.

[Figure 3 (a): confusion matrix of the input intent attribute (rows) against the generated citation's intent (columns background/method/result): background → 0.82, 0.13, 0.047; method → 0.38, 0.55, 0.067; result → 0.18, 0.12, 0.7. (b): matching frequency for controlled citation attributes: keywords 0.63, sentences 0.65.]

Figure 3: We evaluated the controllability by the matching frequency between the controlled input attributes and the attributes of the generated citation sentences. (a) Citation intents as controlled attributes. (b) Relevant keywords and sentences as controlled attributes.

Unlike for citation intent, for relevant keywords and sentences there is no fixed number of predefined categories. For each test sample, we randomly selected two keywords A and B from the first 5 keywords extracted by our SciBERT-based keyword extractor. We first used each keyword separately as a condition to guide the generation of BART-large-CCG. We then determined the keyword (A or B) that was semantically closer to the generated citations by calculating the cosine similarity between the embeddings (encoded by Sentence-BERT (Reimers and Gurevych, 2019)) of the two keywords and those of the generated citation sentences. We calculated the frequency of the generated citation sentences being semantically closer to the associated keywords, e.g., the citation sentence generated with keyword A as a condition was closer to keyword A than to B. Similarly, to test the controllability by the sentence attribute, we randomly selected two sentences from the first 5 relevant sentences extracted by the SciBERT-based sentence extractor and conducted the same experiment. The results in Figure 3b show a good semantic match between the conditional keywords (or conditional sentences) and the generated citation sentences, indicating that the model effectively makes use of keywords and related sentences to guide the CCG generation process.
cited title: Low COVID-19 Vaccine Acceptance Is Correlated with Conspiracy Beliefs among University Students in Jordan
cited abstract: Vaccination to prevent coronavirus disease 2019 (COVID-19) emerged as a promising measure to overcome
the negative ... The intention to get COVID-19 vaccines was low: 34.9% (yes) compared to 39.6% (no) and 25.5% ...
context in manuscript: A third of the entire sample notes that they do not plan to vaccinate, another third doubts the decision
and focuses on the more distant results of the vaccination program conducted in the country, 11.6% are already vaccinated,
and 13.3% plan to vaccinate shortly. »generate a citation sentence HERE«
[background][acceptability] The acceptability of the COVID-19 vaccine in Jordan is low, with only 34.9% of respondents
stating that they intend to vaccinate [].
[method][questionnaire] The intention to vaccinate was assessed using the COVID-19 vaccine hesitancy questionnaire [].
[result][COVID-19 vaccines] This finding is in line with the results of a previous study conducted in Jordan, which showed
that the intention to get COVID-19 vaccines was low: 34.9% (yes) compared to 39.6% (no) and 25.5% (maybe) [].

Table 4: Generated citation sentences are guided by the citation intent and the keywords provided.
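The exact serialization of attributes into the generator's input is not shown in this excerpt; as an illustrative sketch, the intent, keywords, and relevant sentences could be prepended to the manuscript context as tagged prefixes before being fed to the seq2seq model. The tag format and the function below are hypothetical, not the paper's actual implementation:

```python
def build_conditional_input(context, intent="", keywords=None, sentences=None):
    """Serialize citation attributes plus manuscript context into one source
    string for a seq2seq generator (hypothetical format for illustration)."""
    parts = []
    if intent:
        parts.append(f"[intent] {intent}")
    if keywords:
        parts.append("[keywords] " + "; ".join(keywords))
    for sent in sentences or []:
        parts.append(f"[sentence] {sent}")
    parts.append("[context] " + context)
    return " ".join(parts)

# Attributes mirroring the first example in Table 4.
source = build_conditional_input(
    context="A third of the entire sample notes that they do not plan to vaccinate ...",
    intent="background",
    keywords=["acceptability"],
)
print(source)
```

A tokenizer would then encode `source` as usual; conditioning on attributes reduces to ordinary sequence-to-sequence training on these augmented inputs.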

good controllability, making it suitable for con- Metric BART-large-CCG Neutral BART-large
trolling the generated citations with conditional Informative 44.44* 28.89 26.67
attributes (see an example in Table 4). Coherent 51.11* 17.78 31.11
Intent-Matched 51.11* 26.67 22.22
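The intent-consistency analysis above (desired intent versus the intent a classifier assigns to each generated citation) amounts to a row-normalized confusion matrix. A minimal sketch in plain Python, where the predicted labels stand in for the output of the paper's SciBERT-based intent classifier:

```python
from collections import Counter

INTENTS = ["background", "method", "result"]

def intent_confusion(desired, predicted):
    """Row-normalized confusion matrix: rows are desired intents,
    columns are intents predicted for the generated citations."""
    counts = Counter(zip(desired, predicted))
    matrix = {}
    for d in INTENTS:
        row_total = sum(counts[(d, p)] for p in INTENTS)
        matrix[d] = {p: counts[(d, p)] / row_total if row_total else 0.0
                     for p in INTENTS}
    return matrix

desired   = ["background", "background", "method", "result", "result"]
predicted = ["background", "method",     "method", "result", "result"]
m = intent_confusion(desired, predicted)
# Large diagonal values mean generated citations follow the desired intent.
print(m["background"]["background"], m["result"]["result"])  # 0.5 1.0
```

With the full test set, plotting `matrix` as a heatmap reproduces a figure like Figure 3a.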

Table 5: Results of human preference ratings for sentences produced by the three methods (in %). "Neutral" means no preference for the sentence generated by either model. "*" represents statistical significance, p < 0.05.

6.3 Human Evaluation

We performed a human evaluation to test whether our BART-large-CCG-based controllable citation generation pipeline produces more satisfactory citation sentences than BART-large, which cannot be controlled by citation attributes. In the Web interface we created for the human evaluation of a test sample, we first provide the context text in the manuscript and the content of the cited paper (title, abstract, and body). In addition, we use our attribute suggestion module to suggest the citation intent, the top 5 relevant keywords, and the top 5 relevant body sentences. Participants were free to select suggested attributes and to contribute their own to guide the generation of citations if they deemed it necessary.

We then show two sentences: 1) the sentence generated by BART-large using only the contextual text as input; and 2) the sentence generated by our BART-large-CCG using the contextual text and user-specified attributes as input. These sentences were presented in random order to prevent participants from identifying the method behind each sentence based on the order of appearance. Inspired by Zhang et al. (2021), we asked participants to report their preference between the two sentences based on the criteria: 1) informative, whether the sentence is informative and faithful; 2) coherent, whether the sentence is logical and consistent with previous sentences in the manuscript; and 3) intent-matched, whether the sentence matches the tester's citation intent (or purpose).

Human evaluation results (Table 5) showed that our BART-large-CCG conditioned on citation attributes was preferred over the unconditional BART-large model. Compared with BART-large, BART-large-CCG generated sentences with higher coherence and more informative content. The higher value of the "Intent-Matched" metric reflects the good controllability of BART-large-CCG given the user-specified attributes. Additionally, we observed that participants selected on average 1.8 suggested keywords and 1.6 suggested sentences as attributes for citation generation, indicating that our attribute suggestion module holds promise for use in controllable citation generation systems.

7 Conclusion

We proposed a controllable citation generation framework that consists of a citation attribute suggestion module and a conditional citation generation module. Our framework not only outperforms previous uncontrolled approaches in fully automated generation mode, but also manifests good control of generated citations by reflecting user-specified attributes. Our approach allows users to efficiently select the suggested citation attributes to guide the generation process, thus giving the generated sentences a better chance of reflecting a user's intentions during scientific writing. Moreover, we filter out ambiguous citation sentences and decouple the training and test sets, which makes our SciCCG dataset a good testbed for future controllable citation generation studies.

References

Nianlong Gu, Elliott Ash, and Richard Hahnloser. 2022. MemSum: Extractive summarization of long documents using multi-step episodic Markov decision processes. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6507–6522, Dublin, Ireland. Association for Computational Linguistics.
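One common way to test whether such pairwise preference counts are significant is a two-sided exact sign test over the non-neutral ratings. The paper does not state which test produced the p < 0.05 marks in Table 5, so the choice of test here is an assumption, shown only as a sketch:

```python
from math import comb

def sign_test(prefer_a, prefer_b):
    """Two-sided exact sign test: under H0 each non-neutral rater prefers
    either sentence with probability 0.5. Neutral ratings are discarded."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # Tail probability P(X >= k) for X ~ Binomial(n, 0.5), doubled, capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy example: 18 of 20 non-neutral raters prefer one model.
p = sign_test(18, 2)
print(f"p = {p:.5f}")
```

For real evaluations, the percentages in a table like Table 5 would first be converted back to raw rater counts before running the test.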

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336.

Jingqiang Chen and Hai Zhuge. 2019. Automatic generation of related work through summarizing citations. Concurrency and Computation: Practice and Experience, 31(3):e4261.

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3586–3596, Minneapolis, Minnesota. Association for Computational Linguistics.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark. Association for Computational Linguistics.

Yubin Ge, Ly Dinh, Xiaofeng Liu, Jinsong Su, Ziyao Lu, Ante Wang, and Jana Diesner. 2021. BACO: A background knowledge- and content-based framework for citing sentence generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1466–1478, Online. Association for Computational Linguistics.

Maarten Grootendorst. 2020. KeyBERT: Minimal keyword extraction with BERT.

Shing-Yun Jung, Ting-Han Lin, Chia-Hung Liao, Shyan-Ming Yuan, and Chuen-Tsai Sun. 2022. Intent-controllable citation text generation. Mathematics, 10(10):1763.

David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. 2018. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406.

Kaggle. 2022. arXiv Dataset. https://www.kaggle.com/datasets/Cornell-University/arxiv. Accessed: 2022-08-01.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Shichen Liu, Weikai Chen, Tianye Li, and Hao Li. 2019. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. ArXiv.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.

Anna Nikiforovskaya, Nikolai Kapralov, Anna Vlasova, Oleg Shpynov, and Aleksei Shpilman. 2020. Automatic generation of reviews of scientific papers. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 314–319.

Bethesda (MD): National Library of Medicine. 2003. PMC Open Access Subset [Internet]. https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/. Accessed: 2022-07-30.

Bart Penders. 2018. Ten simple rules for responsible referencing. PLOS Computational Biology, 14(4):e1006036.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Yifan Wang, Yiping Song, Shuai Li, Chaoran Cheng, Wei Ju, Ming Zhang, and Sheng Wang. 2022. DisenCite: Graph-based disentangled representation learning for context-specific citation generation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11449–11458.

Jia-Yan Wu, Alexander Te-Wei Shieh, Shih-Ju Hsu, and Yun-Nung Chen. 2021. Towards generating citation sentences for multiple references with intent control.

Xinyu Xing, Xiaosheng Fan, and Xiaojun Wan. 2020. Automatic generation of citation texts in scholarly papers: A pilot study. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6181–6190, Online. Association for Computational Linguistics.

Yizhe Zhang, Siqi Sun, Xiang Gao, Yuwei Fang, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2021. RetGen: A joint framework for retrieval and grounded text generation modeling.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. arXiv:2004.08795 [cs].

A Computing Hardware

We trained the models on 8x NVIDIA GeForce RTX 3090 GPUs. During testing and evaluation, we used an NVIDIA RTX A6000 48GB GPU.
