
Hindawi

Computational Intelligence and Neuroscience


Volume 2022, Article ID 6241373, 14 pages
https://doi.org/10.1155/2022/6241373

Research Article
N-GPETS: Neural Attention Graph-Based Pretrained Statistical
Model for Extractive Text Summarization

Muhammad Umair,1 Iftikhar Alam,1 Atif Khan,2 Inayat Khan,3 Niamat Ullah,4 and Mohammad Yusuf Momand5
1 Department of Computer Science, City University of Science and Information Technology, Peshawar 25000, Pakistan
2 Department of Computer Science, Islamia College, Peshawar 25000, Pakistan
3 Department of Computer Science, University of Engineering and Technology, Mardan, Pakistan
4 Department of Computer Science, University of Buner, Buner 19290, Pakistan
5 Faculty of Computer Science, University of Nangarhar, Jalalabad 2600, Afghanistan

Correspondence should be addressed to Mohammad Yusuf Momand; yusuf.mohmand@gmail.com

Received 8 March 2022; Accepted 2 August 2022; Published 22 November 2022

Academic Editor: Rahim Khan

Copyright © 2022 Muhammad Umair et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The extractive summarization approach involves selecting the source document's salient sentences to build a summary. One of the most important aspects of extractive summarization is learning and modelling cross-sentence associations. Inspired by the popularity of the Transformer-based Bidirectional Encoder Representations (BERT) pretrained linguistic model and of the graph attention network (GAT), whose sophisticated structure captures intersentence associations, this research work proposes a novel neural model, N-GPETS, that combines a heterogeneous graph attention network with the BERT model and a statistical approach using TF-IDF values for the extractive summarization task. Apart from sentence nodes, N-GPETS also works with different semantic word nodes of varying granularity levels that serve as a link between sentences, improving intersentence interaction. Furthermore, the proposed N-GPETS becomes more feature-rich by integrating the graph layer with the BERT encoder at the graph initialization step rather than employing other neural network encoders such as CNN or LSTM. To the best of our knowledge, this work is the first attempt to combine the BERT encoder and TF-IDF values of the entire document with a heterogeneous attention graph structure for the extractive summarization task. The empirical outcomes on the benchmark news data set CNN/DM show that the proposed model N-GPETS obtains favorable results in comparison with other heterogeneous graph structures employing the BERT model and graph structures without the BERT model.

1. Introduction

The rise of the Internet and big data has resulted in massive and exponential growth of information. Because of this, numerous academics are working to develop technical methods for automatically summarizing texts. Automatic summarization approaches generate summaries containing the relevant information from the input documents so that they can be reviewed quickly without compromising their originality [1]. Extractive and abstractive are the two types of summarization. A subset of sentences from the input text is chosen through extractive summarization to provide a summary [2]. In contrast, abstractive summarization restructures the language in the text and, if necessary, introduces new words/phrases into the summary. In general, extractive summarization models are simple, and they express the summarization task as a classification problem for document sentences: whether or not to include each one in a summary [3].

The neural sequence-to-sequence encoder-decoder framework has demonstrated incredible performance on extractive summarization in the past few years. In research [4], the authors design a hierarchical encoder to summarize single-document text. They employ an extractor built on an attention mechanism that allows sentences or words to be extracted from documents for summarization tasks. The study [5] proposed a neural extractive

summarization model as a sentence ranking task, incorporating a reinforcement learning objective to optimize the ROUGE evaluation metric. Encoder-decoder sequence architectures for extractive summarization were mostly adopted by the researchers of [6–8], who utilize various neural components to encode each sentence uniquely. To produce a summary from the input content that includes significant sentences, learning and modelling cross-sentence linkage are crucial [9]. Recurrent neural networks (RNNs) have been used in the majority of the recently presented research, including [4, 10, 11], to learn and model the cross-relationships between texts. However, RNNs may suffer from the vanishing gradient problem, which causes gradient magnitudes to shrink as they propagate over time. Because of this phenomenon, the network's memory ignores long-term dependencies and fails to learn the correlation between temporally distant events [12].

To understand and model the cross-relationship between sentences, many scholars took advantage of graph structures. There have been numerous attempts to build effective graph networks for summarization tasks [9]. Recent research [12] used sentence personalization traits and discourse-aware intersentential interactions to create a summarization model (ADG). The authors in [13] used a Rhetorical Structure Theory (RST) graph to model cross-sentence association, utilizing joint extraction and syntactic compression to create a summary of a single-document text. Another approach is proposed in [14], where a different strategy is suggested: an unsupervised discourse-aware hierarchical graph network (HIPORANK) for lengthy scientific publications, which uses intra- and interconnections between document sentences as well as asymmetric edge weights for extracting important sentences.

The preceding approaches relied on third-party tools and did not consider the error propagation problem [9]. One of the simplest methods mentioned above is to model a fully connected graph at the sentence level. Recently, studies [6, 7] used the transformer architecture to learn pairwise interactions between phrases to model sentence-level graphs. A hybrid graph attention framework for learning the cross-sentence relationships was put forward by [8], utilizing GAT and CNN as encoders and TF-IDF values as edge features. However, this graph-building approach runs into the problem of capturing semantic-level relationships [14]. In a study [14], the authors developed a sentence-level graph-based model that employed BERT for sentence encoding and a joint neural network model (NTM) to discover latent topic information rather than semantic word nodes in a heterogeneous graph network. The authors in [15] proposed a heterogeneous graph structure for modelling the cross-sentence relationships between sentences. They used three types of nodes to capture the relationships between the EDUs, namely sentence nodes, EDU nodes, and entity nodes, together with RST discourse parsing, and they also used external discourse knowledge to improve the model's results. The creation of a useful graph model that enhances the extraction of important content for the formation of extractive summaries remains a challenging and unsolved research issue despite the success of the prior approaches [9].

This paper suggests an innovative pretrained statistical-based graph attention network (N-GPETS) for single-document extractive summarization, fusing the BERT pretrained framework and TF-IDF with a graph attention network. First, the whole document is fed to BERT for encoding, which has a strong architectural foundation and has been pretrained on enormous data sets. The BERT encoder generates word nodes and sentence nodes; these word nodes serve as an additional semantic unit. Second, in the graph layer, the output of the BERT encoder in the form of word and sentence nodes acts as the graph nodes, while the TF-IDF values of the whole document serve as edge features between the corresponding nodes. In the graph layer, the graph attention mechanism is applied and the node representations are updated. Finally, labels are assigned by the sentence selection module after it has extracted the representations of significant sentence nodes from the graph layer.

N-GPETS enhances the work of [9], which addressed the problem of capturing semantic-level relationships but did not use pretrained models such as BERT along with the graph attention mechanism. N-GPETS also differs from previous work [14], which uses topic nodes as an additional semantic unit with the help of a joint neural network model (NTM). The other difference of the proposed N-GPETS from previous models is that it generates TF-IDF values over the whole input text and uses these as edge features between graph nodes. The proposed N-GPETS graph structure has the following advantages: (i) during the graph propagation stage, the semantic word nodes (additional units), which are highly feature-rich due to the BERT framework [16], improve the sentence representations and gather information from sentences; (ii) semantic word nodes can also be employed as a bridge to link sentences and identify intersentence links; and (iii) our graph structure can use different levels of information during message passing. The following are our model's standout contributions:

(1) A novel approach and first attempt to build a BERT-based statistical graph attention network, N-GPETS, for summarizing single-input document text. An extractive summary is produced by the graph layer using sentence and word nodes produced by BERT and TF-IDF values of the entire manuscript.

(2) To assess the effectiveness of the suggested N-GPETS technique against cutting-edge approaches on the CNN/DM news data set using ROUGE evaluation metrics.

(3) The simulation findings on the benchmark news data set CNN/DM demonstrate that N-GPETS provides generally acceptable outcomes in comparison with existing graph attention networks utilizing BERT, or graph structures without BERT.

The remaining portions of the article are organized as follows. In Section 2, we take a critical look at the leading work on extractive summarization tasks. Section 3 describes in full the proposed N-GPETS model methodology. Section 4 details the proposed model compared to other existing

cutting-edge models, as well as the hyperparameters and model settings. Section 5 focuses on the results, and lastly, the paper concludes in Section 6, which also offers suggestions for future research.

2. Literature Review

This section discusses some traditional and advanced approaches/techniques for extractive summarization tasks. Initially, we look into how extractive summarization is performed using a deep neural sequence-to-sequence model. Then, we investigate how various statistical methods, such as TF-IDF, LDA, and TextRank, perform feature extraction and summary generation tasks. Then, deep-learning-based transformer architectures for extractive summarization are presented. We discuss how pretrained models, such as BERT, are used for various NLP tasks, particularly summarization. Finally, we look into how other neural graph-based structures are used for the task of summary generation. In the following section, we will briefly define some background concepts.

2.1. Text Summarization. Automatic text summarization (ATS) is a method that creates an overview containing all pertinent and important information by automatically summarizing a substantial amount of text. It is important to note that automatic text summarization is a text mining process that accepts a lengthy text document as input and produces an appropriate summary [17]. There is an abundance of text-based content on the Internet, including web publications, papers, news, and reviews, that must be summarised to get the document's gist [18]. ATS has many uses, such as short read generation, passage reduction, compaction, and extracting the most important information from sensitive reports, including legal reports produced by legal authorities [19]. ATS can also be used in news text summarizers to assist readers in finding the most interesting and important content in less time [20–22]. Other applications of ATS include sentiment summarization, legal text summarization, scientific document summarization, tweet summarization, book summarization, story/novel summarization, e-mail summarization, and bio-medical document summarization [23]. The fundamental design of the ATS system is shown in Figure 1. Before reviewing specific approaches, we briefly define the following background concepts.

(i) The Transformer is a fully self-attention-based deep learning model. It is a simple network that is completely free of recurrence and convolutions. It is one of the most advanced architectures in NLP and computer vision [12]. The Transformer does not need to comprehend the initial part of the sentence before the end, because the attention mechanism adds context to the input sequence at any point in the sentence. Instead of processing input sequentially as an RNN does, the Transformer allows for more parallelization, which results in a shorter training time. Because of the Transformer's parallelization feature, training on larger data sets is possible, allowing for the development of cutting-edge pretrained models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer) trained on massive corpora [12, 24].

(ii) BERT (Bidirectional Encoder Representations from Transformers) is a Google-developed machine learning model that has already been trained for natural language processing (NLP) tasks. The Google AI research team created and published BERT in 2018. It should be noted that Google announced in 2019 that their search engine uses the BERT approach. In 2020, one of the most recent surveys [24] claimed that BERT, in just one year, had evolved into a widely used baseline in NLP research, with over 150 research publications analyzing and improving the model. BERT is a new language representation baseline that extends word embedding models [25]. Two significant tasks were covered in BERT training: masked language modelling (15% of tokens were hidden, and BERT was taught to predict them based on context) and next sentence prediction (using the first sentence as a guide, BERT was trained to determine whether a particular statement would be expected to follow or not).

(iii) Graph Attention Networks (GATs). To address the limitations of earlier methods that just used graph convolutions, neural network designs called graph attention networks (GATs) deal with input that is arranged as a graph and employ masked self-attention layers [26]. By focusing on each node's neighbors, a self-attention method is employed to calculate each node's hidden representation. Some of the most useful characteristics of the graph attention architecture include the following: (i) the operation can be parallelized over node-neighbor pairs, which makes the attention mechanism efficient; (ii) by giving the neighbors different weights, this architecture can be applied to graph nodes of different degrees; and (iii) the graph attention model can be used directly for inductive learning problems, such as those requiring the model to extrapolate to previously unobserved graphs [26].

2.2. Extractive Text Summarization Approaches and Techniques. Finding the sentence's location and the frequency of words in the text was the most typical problem that surfaced from extractive summarization research [1]. Researchers in [27] used a deep learning technique called the Feed Forward Neural Network (FFNN) for single-document legal text summarization. This method generates a coherent extractive summary without needing features or domain expertise but fails miserably in summarizing difficult and long statements [1]. The study [11] presented the encoder-decoder architecture as a foundation for single-document summarization that contains an attention-based extractor and a hierarchical document encoder. The authors in [28] presented a classifier-based architecture (RNN based) that

Figure 1: ATS basic architecture for single and multidocument summarization: the source document(s) pass through preprocessing, processing, and postprocessing stages to produce the target summary [23].

accepts or rejects each sentence in the original document order for inclusion in the final summary in a sequential manner. For lengthy texts, the authors of [29] suggested a single-document extractive summary model that takes into account both the global and local contexts of the content. A novel technique for summarization was provided by the authors in [30] that relied on a neural sequence-to-sequence model with an attention mechanism and fuzzy characteristics that could be customised.

Statistical techniques such as TF-IDF, TextRank, LDA, and clustering, among others, have been used for extractive summarization tasks. The study [31] presented statistical topic modelling techniques such as latent Dirichlet allocation (LDA), which select important sentences in clusters based on automatically generated keywords. Additionally, the study in [32] mainly utilized TF-IDF and K-means clustering-based approaches for the creation of an extracted summary.

The authors in [33] presented two methods for extractive summarization of hotel reviews. The first method was used to select the most related sentences based on their TF-IDF score. The second method generated the phrase summary style by pairing adjectives with the closest nouns and taking polarity into account. The work done by [34] integrated the TF-IDF and TextRank techniques to extract keywords from input documents.

A deep learning model called the Transformer uses the self-attention process. This framework is utilized by numerous researchers for extractive summarization jobs. The researchers of [35] focused on the structured transformers HiBERT, presented by [36], and Extended Transformers, presented by [37], which offer an extractive encoder-centric stepwise strategy for summarizing documents. This model enabled stepwise summarization by inserting the previously created summary as an additional substructure into the structured Transformer. The authors in [38] presented an extractive summarization model based on layered trees, where the given document's discourse and syntactic trees are combined to form nested tree structures. The authors primarily focused on the existing model RoBERTa, presented by [39], for constructing this model. By lowering the size of the attention module, which constitutes the core of the transformer architecture, the authors in [40] presented an extractive summarization technique with discourse-based attention at the document level, utilizing a unique discourse-inspired approach. Two different transformer-based techniques for sentiment analysis were provided by the authors in [41], fetching the words that are crucial to the model's decision-making to produce a summary as the output explanation. To generate unsupervised extractive summaries, the researchers of [42] used a transformer attention mechanism to prioritize sentences. For extractive summarization of long text, the authors of [43] used the transformer model and introduced a heterogeneous framework called HETFORMER. BERT is a pretrained model used by many researchers for extractive summarization. The summarized literature review is depicted in Table 1.

The "lecture summarizing service," a Python-based RESTful service, chose relevant sentences near the cluster's centroid using the K-means clustering algorithm and the BERT model for text embeddings to generate a summary [47, 48]. Researchers in [49] utilize the bidirectional model BiLSTM and the BERT model for extracting temporal information from social media messages that is necessary for geographical applications. The authors of [50] developed a hybrid method for producing summaries of long scientific texts that combined the benefits of both extractive and abstractive designs. The authors in [51, 52] use the deep learning model BERT and the RISTECB model to answer important questions related to COVID-19 research articles. The authors of [44] demonstrated an excellent tuning-based approach for extractive summarization using the BERT model. The BERT model was also used by the authors of [7, 8, 16, 36, 46] for contextual representation in summarization tasks. The authors in [53] use the BERT model to automatically generate titles from a huge set of published literature or related work. Additionally, extractive summarization tasks using graph structures have been carried out by exploiting the linguistic and statistical information included in sentences [9]. Recent research has combined neural networks with graphs, or graph neural networks (GNNs), and used the encoder-decoder structure for extractive summarization [13, 54]. Many researchers nowadays use a heterogeneous graph neural network with multiple updated node types rather than a homogeneous graph structure with no updated nodes for extractive summarization tasks. The study [55] proposed a bipartite graph attention network for multihop reading comprehension (RC) across documents that encoded different documents and entities together. The authors in [48] presented an approach that modeled redundancy-aware heterogeneous graphs and refined sentence representations using neural networks for extractive summarization. The studies [9, 56] proposed a heterogeneous graph neural network for extractive summarization that used CNN with Bi-LSTM as encoders for the input text and TF-IDF values as edge features between graph nodes. Utilizing a graph attention network, cross-sentence links

Table 1: Literature review.

Author(s) | Research technique(s) | Data set
Huang et al., 2021 [15] | BERT, graph attention network (GAT) | New York Times (NYT) data set; CNN/Daily Mail data set
G. N. H, R et al., 2021 [33] | TF-IDF, combining adjectives | TripAdvisor website
Approach et al., 2021 [41] | Transformer, attention mechanism | IMDB Large Movie Review data set
Peng Cui et al., 2020 [14] | BERT, joint neural network model (NTM), graph attention network (GAT) | New York Times (NYT) data set; CNN/Daily Mail data set; PubMed and ArXiv data sets
Wang et al., 2020 [9] | CNN + BiLSTM encoders, TF-IDF, heterogeneous graph attention network | New York Times (NYT) data set; CNN/Daily Mail data set; Multi-News data set
Hernandez et al., 2020 [31] | LDA (latent Dirichlet allocation), TF-IDF, N-grams, clustering | DUC 2002 data set
Anand et al., 2019 [27] | Feed-forward neural network | Indian Supreme Court judgment documents
Zhou et al., 2018 [11] | RNN, encoder-decoder architecture | CNN/Daily Mail data set
Khan et al., 2019 [32] | TF-IDF, K-means clustering | News headlines data set
Xu et al., 2019 [13] | BERT, RST graph, coreference graph, graph convolution network | New York Times (NYT) data set; CNN/Daily Mail data set
Narayan et al., 2020 [35] | Structured transformer model | CNN/Daily Mail data set
Liu et al., 2019 [44] | BERT + LSTM, BERT + Transformer | New York Times (NYT) data set; CNN/Daily Mail data set
Liu and Lapata, 2019 [7] | BERT + Transformer model, encoder-decoder framework | New York Times (NYT) data set; CNN/Daily Mail data set; XSum data set
Linmei Hu et al., 2019 [45] | Semi-supervised, heterogeneous graph attention network, dual-level attention mechanism | AGNews, Snippets, Ohsumed, TagMyNews, PTE, TextGCN, and HAN data sets
Jiacheng Xu et al., 2019 [46] | BiLSTM, convolutional neural network (CNN), feed-forward neural network (FFNN) | New York Times (NYT) data set; CNN/Daily Mail data set
Zhang et al., 2018 [36] | Unsupervised method, BERT + Transformer | New York Times (NYT) data set; CNN/Daily Mail data set
Nallapati et al., 2016 [28] | RNN classifier-based framework | DUC 2002 data set; CNN/Daily Mail data set

between sentences are learned. The work done by [14] built a sentence-level graph-based model, using BERT for sentence encoding and a joint neural network model (NTM) for discovering latent topic information. The authors in [15] proposed a heterogeneous graph structure for modelling the cross-sentence relationships between sentences. To represent the relationships between the EDUs, they used three different types of nodes, including sentence nodes, EDU nodes, and entity nodes, together with RST discourse parsing, and they leverage external discourse expertise to enhance the model's performance. The next section goes over in depth the unique model methodology, N-GPETS, proposed in this study.

3. Methodology

In this paper, an innovative pretrained statistical model for the extractive summarization task, called N-GPETS, is presented. It is designed by combining the deep learning model BERT and a graph attention network along with a statistical approach. N-GPETS is broken down into four phases: document representation comes first, followed by three trainable modules: the BERT graph initializers, the graph layer, and the important sentences selector. The following subsections go over each of these phases in detail.

3.1. Representing Document as a Heterogeneous Graph. Consider a document represented by G = (V, E), where V represents the set of nodes and E represents the edges between the nodes. In our framework, the attention graph structure is made by taking the union of V_W and V_S, i.e., V = V_W ∪ V_S, where V_W = {W_1, W_2, ..., W_m} contains the m distinct words of the text and V_S = {S_1, S_2, ..., S_n} contains its n sentences. E = {e_11, ..., e_mn} is a real-valued edge weight matrix, and e_ij ≠ 0, with i ∈ {1, ..., n} and j ∈ {1, ..., m}, indicates that sentence i contains word j, as discussed in [9].
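To make this construction concrete, the sketch below builds the word node set V_W, the sentence node set V_S, and a TF-IDF edge weight matrix for a toy document with scikit-learn; the use of TfidfVectorizer and all variable names are our own illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the heterogeneous graph of Section 3.1 (assumed implementation).
# Word nodes V_W, sentence nodes V_S, and a real-valued edge weight matrix whose
# entry (i, j) is the TF-IDF weight of word j in sentence i (0 means "no edge").
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the rise of the internet results in massive growth of information",
    "extractive summarization selects salient sentences from the document",
    "graph attention networks capture the relations between sentences",
]

vectorizer = TfidfVectorizer(stop_words="english")   # word nodes exclude stop words
edge_weights = vectorizer.fit_transform(sentences)   # shape: (n sentences, m words)

word_nodes = vectorizer.get_feature_names_out()      # V_W: the unique words
sentence_nodes = list(range(len(sentences)))         # V_S: sentence indices

# Non-zero entries of the matrix define the word-sentence edges of the bipartite graph.
rows, cols = edge_weights.nonzero()
for i, j in zip(rows, cols):
    print(f"sentence {i} -- word '{word_nodes[j]}' (tf-idf = {edge_weights[i, j]:.3f})")
```

In a full model, these TF-IDF weights would be the e_ij edge features described above, while the node feature vectors come from the BERT graph initializers of the next subsections.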

According to Figure 2, three primary trainable modules make up N-GPETS: the BERT graph initializers, the graph layer, and the important sentences selector. The N-GPETS model works as follows: first, the pretrained BERT graph initializer module generates sentence and word nodes using the BERT encoder, as opposed to the alternative neural network encoders already employed in other works. These word and sentence nodes are then transmitted to the graph layer for the document graph, together with the TF-IDF values utilized as edge features. In the second step, the heterogeneous graph layer uses the graph attention network to relay messages between these word and sentence nodes, iteratively updating them as a consequence. Finally, the important sentence selection module extracts the important sentence nodes for the final summary.

Figure 2: A general framework of the suggested model (N-GPETS): the BERT encoder produces word and sentence nodes, TF-IDF edge features connect them in the graph layer, and the sentence selector chooses the important sentences.

3.2. BERT Graph Initializers. As suggested by the BERT model's basic structure, the output vectors of BERT are based on tokens rather than sentences [25]. However, it is clear that sentence-level representations are what is manipulated in the case of extractive summarization. The second thing that was noted is that the original BERT model's segmentation embeddings only apply to an input of two sentences. Nevertheless, the extractive summarization process requires us to encode and manage multisentential inputs [7]. In this study, each sentence begins with an external [CLS] token and ends with a [SEP] token, which overcomes the difficulties that arise for single sentence representation in a document, the same as done in [7]. The external tokens gather data for the preceding sentence, while segment embeddings are used to differentiate the different sentences in a document [7]. For example, suppose we have five different sentences in an input text, i.e., (sent1, sent2, sent3, sent4, and sent5). Each sentence then has the following segment embedding associated with it: [EA, EB, EA, EB, and EA]. This method allows for the hierarchical learning of the input document representations. Lastly, the vectors Ti, which are the vectors of the [CLS] tokens of every sentence senti generated by BERT and which carry all the information about that sentence, are forwarded to the graph layer for the graph attention mechanism. These vectors work as the sentence nodes in the graph layer of the proposed N-GPETS. The complete process of sentence node generation using BERT is depicted in Figure 3 [44].

3.3. Word Nodes and Edge Features. We employ the base framework of the BERT encoder [25], depicted in Figure 2, which takes the words of the input text, encodes them, and generates word vectors. To highlight how word and sentence nodes are connected, in the initialization step of our model we incorporate TF-IDF values into the edge weights, similar to [9]. Utilizing BERT to create word nodes is presented in Figure 4.

3.4. Overview of the BERT Graph Initializers Phase

(i) Creation of sentence nodes (Ti)
(ii) Creation of word nodes
(iii) These sentence vectors, word embeddings, and TF-IDF values of the whole input text are forwarded to the attention graph layer
(iv) The graph layer serves as a summarization layer in the N-GPETS model

3.5. Graph Layer. In the graph layer, for the construction of a bipartite graph, we give the word and sentence nodes along with the TF-IDF values to the graph layer. After that, to update the representations of the semantic nodes, the graph attention network is used, the same as in previous work [9], with the main difference being that the word and sentence node features used in the graph layer are encoded with the help of the BERT model at the graph initializers stage discussed above, rather than using different neural network encoders such as CNN or Bi-LSTM. The graph attention layer (GAT), with hidden states of the input nodes h_i ∈ R^{d_h}, can be constructed in the same way as demonstrated in [9]:

z_{ij} = \mathrm{LeakyReLU}\bigl(W_a [W_q h_i \,;\, W_k h_j]\bigr),
\alpha_{ij} = \frac{\exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(z_{il})},          (1)
u_i = \sigma\Bigl(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_v h_j\Bigr).

Here, W_a, W_q, W_k, and W_v denote trainable weights, and the attention weight between h_i and h_j is denoted by α_{ij}.

Figure 3: Using the BERT model for extractive summarizing [44]. (The figure shows an input document segmented as [CLS] sent one [SEP] [CLS] 2nd sent [SEP] ..., the corresponding token, interval segment, and position embeddings, BERT, the per-sentence vectors T1, T2, T3, and the summarization layers producing Y1, Y2, Y3.)
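As a rough sketch of the sentence-node construction described in Section 3.2 and Figure 3, the snippet below wraps every sentence in a [CLS]/[SEP] pair, alternates the segment ids, and reads off the BERT hidden state at each [CLS] position; it relies on the Hugging Face transformers package and is our own illustrative approximation, not the authors' released implementation.

```python
# Sketch: per-sentence [CLS] vectors used as sentence nodes (assumed, illustrative only).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["the match went to three sets.", "kerber won her first title since 2013."]

# [CLS] sent1 [SEP] [CLS] sent2 [SEP] ... with alternating segment ids (E_A, E_B, ...).
input_ids, segment_ids = [], []
for idx, sent in enumerate(sentences):
    ids = tokenizer.encode(sent)               # encode() already adds [CLS] ... [SEP]
    input_ids.extend(ids)
    segment_ids.extend([idx % 2] * len(ids))   # interval segment embeddings

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

with torch.no_grad():
    out = model(input_ids=input_ids, token_type_ids=segment_ids)

cls_positions = (input_ids[0] == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
sentence_nodes = out.last_hidden_state[0, cls_positions]   # vectors T_i, one per sentence
print(sentence_nodes.shape)                                 # (number of sentences, 768)
```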

Figure 4: Utilizing BERT to create word nodes [25]. (The figure shows BERT's input representation as the sum of token embeddings, segment embeddings, and position embeddings for a two-sentence input containing [CLS], [SEP], and [MASK] tokens.)
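Word nodes can be sketched in a similar spirit: one plausible reading of Section 3.3 and Figure 4 is to pool the BERT output vectors of all sub-word tokens that belong to the same vocabulary word, as below; the mean-pooling choice and every helper name are our own assumptions, since the paper does not spell this step out.

```python
# Sketch: one vector per unique word, mean-pooled over its BERT token outputs (assumed).
from collections import defaultdict
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

document = "graph attention networks link words and sentences in a graph"
enc = tokenizer(document, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**enc).last_hidden_state[0]        # (num tokens, 768)

# word_ids() maps every sub-word token back to the whitespace-separated word it came from.
buckets = defaultdict(list)
words = document.split()
for pos, word_id in enumerate(enc.word_ids(0)):
    if word_id is not None:                                  # skip [CLS] / [SEP]
        buckets[words[word_id]].append(token_vectors[pos])

word_nodes = {w: torch.stack(vecs).mean(dim=0) for w, vecs in buckets.items()}
print(len(word_nodes), next(iter(word_nodes.values())).shape)   # unique words, 768-dim each
```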

sentence-to-word and a word-to-sentence update process.


⎝ 􏽘 αk Wk h ⎞
ui � K σ ⎛ ⎠ Te process can be represented for the tth iteration [9].
ij i . (2)
k�1
j∈Ni
Ut+1 t t+1 t+1
s←w � GAT􏼐Hs , Hw , Hw 􏼑,
Te resultant output representation is as follows [9]: (5)
Ht+1
s � FFN􏼐Ut+1 t
s←w + Hs 􏼑.
h′i � ui + hi . (3)
3.7. Sentences Selection Module. Finally, the sentences se-
Now, equation (1) is changed to include edge weights eij lection module selected those important sentence nodes
in graph attention layer, which is given as follows [9]: from the graph layer which become the part of the fnal
extractive summary produced by the proposed model. For
zij � LeakyReLU􏼐Wa 􏽨Wq hi ; Wk hj ; eij 􏽩􏼑. (4) this task, node classifcation is done, which predicts labels 0
or 1 for each sentence in a document and cross-entropy loss
as the overall system’s training objective. Tose sentences
having label 1 include in the fnal summary while sentences
3.6. Iteratively Updated Nodes. Te information propagation with label 0 are not included in the fnal summary.
is used to send messages between the nodes of words and
sentences. Specifcally, after initialization, we use the GAT 4. Performance Evaluation
and FFN layers to change sentence nodes with their neighbor
nodes of words. Ten, using updated sentence nodes, we Tis segment evaluates the performance of the suggested
obtain new representations for word nodes and iteratively N-GPETS architecture to other latest models for the ex-
update sentence nodes. Each iteration includes both a tractive summarization job. Tis section covers the dataset
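A compressed sketch of one word-to-sentence update round (equation (5)) followed by the selection head of Section 3.7 is given below; it uses DGL's stock GATConv, which ignores the TF-IDF edge features of equation (4), together with invented dimensions and labels, so it should be read as an outline of the update-and-classify loop rather than the authors' implementation.

```python
# One word -> sentence update (equation (5)) plus the sentence-selection head (assumed).
import dgl
import torch
import torch.nn as nn
from dgl.nn import GATConv

num_words, num_sent, d = 12, 4, 64
# Bipartite word -> sentence graph: every sentence attends over the words it contains.
src = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
dst = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
g = dgl.heterograph({("word", "in", "sentence"): (src, dst)})

H_w = torch.randn(num_words, d)             # word node states  H_w^{t+1}
H_s = torch.randn(num_sent, d)              # sentence node states H_s^t

gat = GATConv((d, d), d, num_heads=1)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

U_s = gat(g, (H_w, H_s)).squeeze(1)         # U_{s<-w}^{t+1}: sentences attend over words
H_s_next = ffn(U_s + H_s)                   # H_s^{t+1} = FFN(U + H_s^t), residual update

# Sentence selection: binary label per sentence, trained with cross-entropy loss.
classifier = nn.Linear(d, 2)
logits = classifier(H_s_next)
labels = torch.tensor([1, 0, 1, 0])         # 1 = include in the extractive summary
loss = nn.functional.cross_entropy(logits, labels)
print(loss.item())
```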

4. Performance Evaluation

This section evaluates the performance of the suggested N-GPETS architecture against other recent models for the extractive summarization task. It covers the data set utilized in the proposed work, compares BERT-based and non-BERT models with the suggested model, and also provides information about the objective evaluation metrics utilized in the proposed system, as well as its hyperparameters and execution settings. The following subsections describe them in more detail.

4.1. Objective Evaluation Metrics. Metrics like precision, recall, F-measure, accuracy, and the ROUGE toolkit are adopted by the state of the art [9, 57–59]. They are defined below.

4.1.1. Precision. Precision (P) is the number of sentences appearing in both the system summary and the reference summary divided by the number of sentences present in the produced summary [57].

4.1.2. Recall. Recall (R) is the number of sentences appearing in both the produced system summary and the reference summary divided by the number of sentences present in the reference summary [57].

4.1.3. F-Score. The F-score is a composite metric that combines precision and recall; it is computed as their harmonic mean [57, 58].

4.1.4. Accuracy. Accuracy is the total number of correctly labeled sentences divided by the number of sentences present in the data set's test set.

4.1.5. N-Gram Co-Occurrence Statistics—ROUGE. ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation and was proposed by [58], is the most commonly used evaluation tool in text summarization research. It compares the quality of the summary produced by the system with a man-made summary to determine how good it is. The gold standard used consists of human-written reference summaries. ROUGE measures include ROUGE-N (N = 1, 2, 3, and 4), ROUGE-L, ROUGE-W, ROUGE-S, and others. ROUGE-N quantifies the number of n-gram matches between the system summary and a set of human summaries.

4.2. Data Set. This study examined N-GPETS using the CNN/Daily Mail (anonymized version) data set, a standard benchmark for news summarization [60]. The data set is split and processed according to the standard partition, with 287,227/13,368/11,490 examples (92.5%/4.2%/3.6%) for training, validation, and testing, respectively, similar to previous work [7, 9, 14, 15]. CNN/Daily Mail data set statistics are set out in Table 2 [14].

4.3. Models for Comparison. The N-GPETS model is compared with high-performing BERT-based graph models and with non-BERT neural graph models for extractive text summarization. Table 3 presents the details of the non-BERT graph structures, and Table 4 shows the graph structures using the BERT model.

4.3.1. Hyperparameters and Implementation Settings. N-GPETS encodes the document using a pretrained BERT-base model to produce sentence and word nodes. The vocabulary is limited to 50,000 words, and tokens are initialized with 300-dimensional embeddings obtained via BERT rather than the GloVe embeddings used in previous work [7, 9]. The BERT output consists of 768-dimensional vectors, and the tokens are 256-dimensional vectors. Since the output of BERT is 768-dimensional and the token embeddings taken by the graph layer should be 300-dimensional, as described in [8], we use a linear layer after the BERT output that converts the 768-dimensional vectors to 300-dimensional vectors. When creating word nodes, stop words and punctuation are filtered out. We limit the length of a sentence to 50. To address the problem of common noisy words, we remove the 10% of the vocabulary with the lowest TF-IDF values. For the sentence nodes, the maximum size is kept at d_s = 128, and the maximum size of the e_ij edge features is kept at d_e = 50. The FFN size is 512. The Deep Graph Library (DGL) is used to implement the graph neural network, as is the case with [9].

A batch size of thirty-two is used in training, with the Adam optimizer [50] and a learning rate of 5e-4. Early stopping is applied when the validation loss does not decrease for three consecutive epochs. Based on the performance on the validation set, the number of iterations is set to t = 1. N-GPETS selects the top three sentences as the system summary, matching the average length of the human-written summaries on CNN/Daily Mail (three sentences). The system on which we train our model has 8 GB of Random-Access Memory (RAM) and an Intel(R) Core(TM) i7-6600U CPU. The CPU is based on the x64 architecture and uses a 64-bit operating system. The installed operating system is Windows 10 Pro.

5. Results and Discussion

This section outlines the overall empirical findings generated by our suggested model, N-GPETS. N-GPETS is evaluated against the criterion of whether, when compared to other models, N-GPETS, which generates sentence and word nodes using the BERT encoder and also has TF-IDF connections between nodes, produces adequate results. First, the frequently used CNN/DM data set is utilized to train and then test the N-GPETS model. Against the reference summaries, we considered the unigram R-1, bigram R-2, and longest common subsequence R-L overlap. Second, a comparison has been made between the proposed framework N-GPETS and previous work, both BERT-based and non-BERT graph structures, as described in Section 4.3. Additionally, ablation research is carried out to show the importance of each model element.
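Gathering the settings reported in Section 4.3.1 into one place, a configuration object might look like the sketch below; the numeric values are copied from that subsection, while the field names and the 5e-4 reading of the learning rate are our own interpretation rather than the authors' code.

```python
# Hyperparameters of Section 4.3.1 collected into a single config (field names assumed).
from dataclasses import dataclass

@dataclass
class NGPETSConfig:
    encoder: str = "bert-base"       # pretrained BERT encoder for node initialization
    vocab_size: int = 50_000         # vocabulary limit for word nodes
    bert_dim: int = 768              # BERT output size
    graph_dim: int = 300             # linear projection 768 -> 300 before the graph layer
    sent_node_dim: int = 128         # d_s, maximum sentence node size
    edge_feat_dim: int = 50          # d_e, maximum TF-IDF edge feature size
    ffn_dim: int = 512               # size of the FFN blocks
    max_sentence_len: int = 50       # sentence length limit reported in the text
    low_tfidf_drop: float = 0.10     # fraction of low TF-IDF vocabulary removed
    batch_size: int = 32
    learning_rate: float = 5e-4      # Adam learning rate (our reading of the reported value)
    patience_epochs: int = 3         # early stopping on validation loss
    num_iterations: int = 1          # t, message-passing rounds
    summary_sentences: int = 3       # top-3 sentences selected for CNN/DM

config = NGPETSConfig()
print(config)
```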

Table 2: CNN/Daily Mail data set statistics.

Data set | Source | #Docs (Train) | #Docs (Val) | #Docs (Test) | Avg. tokens (Doc.) | Avg. tokens (Sum.)
CNN | News | 90,266 | 1,220 | 1,093 | 761 | 46
Daily Mail | News | 196,961 | 12,148 | 10,397 | 653 | 55

Table 3: Non-BERT graph structures.

Author(s) | Model name | Non-BERT graph structure
Zhou et al., 2018 [11] | NeuSum | A framework based on seq2seq attention to extract a text summary.
Wang et al., 2020 [9] | HSG | A heterogeneous graph model that forms a word-sentence document graph based on word occurrence; the word and sentence nodes are encoded with CNN and BiLSTM.
Xu et al., 2020 [13] | JECS | A compression model that selects sentences and, to reduce repetition, compresses these sentences by pruning the dependency tree.
Crawford et al., 2018 [61] | BanditSum | Treats sentence selection as a contextual bandit problem; policy gradient methods are used to train the model.

Table 4: BERT-based graph structures.

Author(s) | Model name | BERT-based graph structure
Liu et al., 2019 [44] | BERTSum (sent) | Uses several separating tokens in the document and obtains sequential sentence representations. It should be noted that this was the first BERT-based model for extractive summarization; we, like many other works, use its framework as a document encoder.
Zhang et al., 2018 [36] | HiBERT | Works by first transforming the BERT structure into a hierarchical structure and then pretraining it in an unsupervised manner.
Xu et al., 2019 [46] | DISCOBERT | One of the most modern extractive models; it uses the BERT model to encode sentences and updates these sentence representations with the help of a graph. DISCOBERT only uses sentence beginnings and endings, whereas our work uses sentence nodes and additional semantic word nodes to construct a bipartite graph.
Cui et al., 2020 [14] | Topic-GraphSum | Uses a BERT model in a graph-based framework for sentence encoding and for obtaining latent topic information; this topic information acts as an additional semantic unit obtained with a joint neural network model (NTM).
Huang et al., 2021 [15] | DiscoCorrelation-GraphSum | Proposes a different graph format that uses three types of nodes (sentence nodes, EDU nodes, and entity nodes) with RST discourse parsing to capture interactions between EDUs, and uses external discourse information to improve model outcomes.
Proposed | N-GPETS | Our neural heterogeneous graph-based pretrained statistical model builds strong relationships between sentences through additional semantic word nodes (sentence-word-sentence). Through node classification, sentences are selected to produce the summary of our proposed N-GPETS model.

Table 5: ROUGE F1 scores/results on the CNN/DM data set.

Models | R-1 | R-2 | R-L
Lead-3 | 40.42 | 17.62 | 36.67
Oracle | 55.61 | 32.84 | 51.88
Non-BERT graph models
NeuSum | 41.59 | 19.01 | 37.98
HSG | 42.31 | 19.51 | 38.74
HSG + tri-blocking | 42.95 | 19.76 | 39.23
BanditSum | 41.50 | 18.70 | 37.60
JECS | 41.70 | 18.50 | 37.90
BERT-based graph models
BERTSUM (sent) | 43.25 | 20.24 | 39.63
HiBERT | 42.37 | 19.95 | 38.83
DISCOBERT | 43.77 | 20.85 | 40.67
Topic-GraphSum | 44.02 | 20.81 | 40.55
DiscoCorrelation-GraphSum (EDU) | 43.61 | 20.81 | 41.1
N-GPETS (proposed) | 44.15 | 20.86 | 40.97
The bold values indicate that the corresponding model achieves the highest performance among all models listed in the table.
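Since Table 5 reports ROUGE F1 scores and Section 4.1.5 defines ROUGE-N in terms of n-gram overlap, a hedged example of how such scores can be computed with Google's rouge-score package is shown below; the package choice and the toy reference/system texts (adapted from Figure 7) are ours, not part of the original evaluation pipeline.

```python
# Computing ROUGE-1/2/L F-scores for one system summary against a reference
# with the rouge-score package (pip install rouge-score). Illustrative only.
from rouge_score import rouge_scorer

reference = ("angelique kerber beat madison keys 6-2, 4-6, 7-5 in the charleston final. "
             "it is the german's first wta title since linz in 2013.")
system = ("angelique kerber rallied past madison keys for a 6-2, 4-6, 7-5 victory. "
          "this was kerber's first title since linz in 2013.")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, system)

for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```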

Table 6: Effect of ablation of different modules on the CNN/Daily Mail test set.

Model | R-1 | R-2 | R-L
N-GPETS full (proposed) | 44.15 | 20.86 | 40.97
- Residual connections | 43.41 | 20.32 | 40.08
* TF-IDF values | 43.22 | 19.97 | 40.05
- BERT | 42.31 | 19.51 | 38.74
Modules marked with '-' were removed from the original N-GPETS, while the module marked with '*' was modified.

Figure 5: ROUGE-1 F1 findings on the CNN/DM data set for our full model N-GPETS and three ablated variations (without residual connections, without BERT, and with changed TF-IDF values).

Figure 6: ROUGE-2 F1 findings on the CNN/DM data set for our full model N-GPETS and three ablated variations (without residual connections, without BERT, and with changed TF-IDF values).

5.1. Overall Performance. On the CNN/DM data set, Table 5 displays the ROUGE F-scores for several models. The table is divided into four sections: the first section contains the Lead-3 and Oracle scores, the second section contains the scores of models that did not use BERT, the results of BERT-based models are contained in the third part, and the findings of the suggested model N-GPETS are shown in the fourth part. The results lead to the following conclusions. N-GPETS performs better than the cutting-edge non-BERT model HSG by 1.8/1.3/2.2 on the F-scores of R-1/R-2/R-L. This shows that our graph network, which is based on the BERT encoder, has a better comprehension of cross-sentence links. Additionally, N-GPETS performs better than each of the non-BERT models presented in Table 5. Turning to the comparison with models that utilized BERT: first, N-GPETS is contrasted with Topic-GraphSum, which utilizes topic data via NTM as an additional semantic unit. N-GPETS produced better results, beating the Topic-GraphSum framework by 0.13/0.05/0.42 on the F-scores of R-1/R-2/R-L. Second, when compared to BERTSum-sent, N-GPETS achieves better results, with an increase of 0.9/0.62/1.03 on the F-scores of R-1/R-2/R-L. Third, in contrast to DiscoCorrelation-GraphSum, which captures relationships between EDUs via entity nodes, EDU nodes, and RST discourse parsing, N-GPETS shows better outcomes on ROUGE R-1/R-2, with increases of 0.54/0.05, respectively, and a comparable score on R-L.

It should be mentioned that RST discourse parsing and third-party external tools are the foundation of DiscoCorrelation-GraphSum. Contrarily, N-GPETS does not utilize any outside tools or knowledge. Additionally, N-GPETS beats the cutting-edge extractive summarization model DISCOBERT, which depends on an external tool, on the R-1 metric and produces equivalent results on the R-2 and R-L metrics.

5.2. Ablation on CNN/Daily Mail. To understand the function and impact of the various contributed modules of our recommended model N-GPETS on performance, ablation research is conducted. First of all, the residual

Figure 7: Examples of summaries generated by the proposed model N-GPETS along with reference summaries.

Example 1. Reference summary: "angelique kerber beat madison keys 6-2, 4-6, 7-5 in the charleston final. kerber battled back to win six of the last seven games in the decider. it is the german's first wta title since linz in 2013." Proposed summary: "angelique kerber rallied past madison keys to win the family circle cup on sunday, capturing six of the last seven games for a 6-2, 4-6, 7-5 victory. this was kerber's fourth wta title and first since linz in 2013. kerber had defeated friend, countrywoman and defending champion andrea petkovic in the semifinals."

Example 2. Reference summary: "statue in wuhan, central china, depicts country's first ruler and his wife. tourists fondling her exposed breast has damaged statue, officials say. legend has it that yu was led to wife by a magical nine-tailed fox." Proposed summary: "officials in wuhan, the capital city of central china's hubei province, have accused tourists of damaging a statue of the country's first leader and his wife by fondling the woman's exposed breast. the sculpture, which has been in place for ten years, depicts yu the great, the founder of china's first xia dynasty in 2070 bc, meeting his wife. legend says that yu and his wife were brought together by a nine-tailed fox that led them to one another."

connections that are present between the GAT layers were removed and the word nodes were attached to the initial sentence features, similar to previous work [9]. The second thing we did: rather than using TF-IDF values computed from the entire document, TF-IDF values computed from individual sentences were used as features in the graph layer. In the third variant, we set aside the BERT model and made use of BiLSTM and CNN models for encoding the document, checking the overall performance. According to Table 6, cutting the residual connections between GAT layers lowers the F-scores for the R-1/R-2/R-L measures. This implies that residual connections are crucial in integrating the original representation with messages updated from other sorts of nodes, which cannot be substituted by straightforward concatenation [8]. As shown in Figure 5, we noticed a decline in the F-scores of the R-1/R-2/R-L metrics when TF-IDF values computed from individual sentences were used as edge features in the graph layer, demonstrating the effectiveness of TF-IDF values of the entire document. Lastly, by substituting a CNN encoder and BiLSTM in place of the BERT model, the model achieves lower F-score values than the proposed model N-GPETS, and the model is reduced to HSG, a non-BERT model [9]. Figure 6 shows the ROUGE-2 F1 findings on the CNN/DM data set for our full model N-GPETS and the three ablated variations.

5.3. Environment for Model Development and Training. Due to its superior GPU compared to the free version, Google Colab (Pro) is utilized for model coding and training instead of the simple Colab. Programmers can write and execute Python code directly from their browsers using Colab, a Google Research product. It is important to note that Google Colab is a great tool for a variety of deep learning jobs. The Jupyter notebook is hosted by Google Colab; consequently, no additional software is needed. The advantages of Google Colab include preinstalled libraries and the capacity to upload files to the cloud. With the help of the Colab collaboration tool, several developers can collaborate on the same project and use free GPUs and TPUs. We train our model for five epochs on 287,000 CNN/DM news articles. Figure 7 shows examples of summaries generated by the proposed model N-GPETS along with reference summaries.

6. Conclusion and Future Work

The process of creating an extractive summary relies heavily on modelling the relationships across the input sentences. Inspired by the popularity of the Transformer-based Bidirectional Encoder Representations (BERT) pretrained linguistic model and the graph attention network (GAT) that captures intersentence associations, this study proposes a novel neural model (N-GPETS) for the extractive summarization task by combining a heterogeneous graph attention network with the BERT model and a statistical approach using TF-IDF values. In contrast to earlier research, nobody has employed BERT for both sentence and word node formation along with the TF-IDF values for the creation of an attention graph network. The following benefits are associated with constructing the N-GPETS model: (i) during graph propagation, the addition of feature-rich semantic word nodes encoded using BERT strengthens the sentence representations and compiles information from the updated sentences; (ii) additionally, semantic word nodes can be used to link sentences together and identify links between them; and (iii) our graph structure can use different levels of information during message passing. According to the simulation findings on the widely used CNN/Daily Mail benchmark data set, our model performed better than other heterogeneous graph structures that used the BERT model as well as graph structures without BERT. N-GPETS is based on the summarization of a single document.

However, it can be expanded to summarize numerous documents rather than just one. Using this graph structure to condense lengthy research publications is the second direction for future work. Other semantic units, such as topic and paragraph semantic units, can also be used in the graph structure to improve summarization performance.

Data Availability

The data that support this study are available upon request from the first author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was performed in the Department of Computer Science, City University of Science and Information Technology, Peshawar, Pakistan.

References

[1] A. P. Widyassari, S. Rustad, G. F. Shidik, E. Noersasongko, A. Syukur, and A. Affandy, "Review of automatic text summarization techniques & methods," Journal of King Saud University-Computer and Information Sciences, vol. 34, 2020.
[2] M. Shabir, N. Islam, Z. Jan et al., "Real-time pashto handwritten character recognition using salient geometric and spectral features," IEEE Access, vol. 9, Article ID 160238, 2021.
[3] J. Pilault, R. Li, S. Subramanian, and C. Pal, "On extractive and abstractive neural document summarization with transformer language models," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9308–9319, September 2020.
[4] J. Cheng and M. Lapata, "Neural summarization by extracting sentences and words," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 484–494, June 2016.
[5] K. Arumae and F. Liu, "Reinforced extractive summarization with question-focused rewards," in Proceedings of the ACL 2018, Student Research Workshop, pp. 105–111, June 2018.
[6] S. Narayan, S. B. Cohen, and M. Lapata, "Ranking sentences for extractive summarization with reinforcement learning," in Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1747–1759, Association for Computational Linguistics (ACL), Pennsylvania, PA, USA, 2018.
[7] M. Zhong, P. Liu, D. Wang, X. Qiu, and X.-J. Huang, "Searching for effective neural extractive summarization: what works and what's next," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1049–1058, ACL, Pennsylvania, PA, USA, July 2019.
[8] R. Nallapati, F. Zhai, and B. Zhou, "Summarunner: a recurrent neural network based sequence model for extractive summarization of documents," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press, California, CF, USA, July 2017.
[9] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, "Neural document summarization by jointly learning to score and select sentences," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 654–663, Beijing, China, May 2018.
[10] A. Vaswani, "Attention is all you need," in Advances in Neural Information Processing Systems, pp. 5998–6008, arXiv, New York, NY, USA, 2017.
[11] D. Wang, P. Liu, Y. Zheng, X. Qiu, and X. Huang, "Heterogeneous graph neural networks for extractive document summarization," 2020, https://arxiv.org/abs/2004.12393.
[12] J. Xu and G. Durrett, "Neural extractive text summarization with syntactic compression," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3292–3303, July 2019.
[13] P. Cui, Le Hu, and Y. Liu, "Enhancing extractive text summarization with topic-aware graph neural networks," in Proceedings of the 28th International Conference on Computational Linguistics, January 2020.
[14] Y. J. Huang and S. Kurohashi, "Extractive summarization considering discourse and coreference relations based on heterogeneous graph," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 3046–3052, June 2021.
[15] D. K. Bangotra, Y. Singh, A. Selwal, N. Kumar, P. K. Singh, and W.-C. Hong, "An intelligent opportunistic routing algorithm for wireless sensor networks and its application towards e-healthcare," Sensors, vol. 20, no. 14, p. 3887, 2020.
[16] M. N. Uddin, B. Li, Z. Ali, P. Kefalas, I. Khan, and I. Zada, "Software defect prediction employing BiLSTM and BERT-based semantic feature," Soft Computing, vol. 26, pp. 1–15, 2022.
[17] D. Suleiman and A. Awajan, "Deep learning based abstractive text summarization: approaches, datasets, evaluation measures, and challenges," Mathematical Problems in Engineering, vol. 2020, pp. 1–29, 2020.
[18] L. Nkenyereye, L. Nkenyereye, B. Adhi Tama, A. G. Reddy, and J. Song, "Software-defined vehicular cloud networks: architecture, applications and virtual machine migration," Sensors, vol. 20, no. 4, p. 1092, 2020.
[19] P. Sethi, S. Sonawane, S. Khanwalker, and R. Keskar, "Automatic text summarization of news articles," in Proceedings of the 2017 International Conference on Big Data, IoT and Data Science (BID), pp. 23–29, IEEE, Pune, India, May 2017.
[20] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, "Automatic text summarization: a comprehensive survey," Expert Systems with Applications, vol. 165, Article ID 113679, 2021.
[21] K. Anwar, T. Rahman, A. Zeb et al., "Improving the convergence period of adaptive data rate in a long range wide area network for the internet of things devices," Energies, vol. 14, no. 18, p. 5614, 2021.
[22] K. Anwar, T. Rahman, A. Zeb, I. Khan, M. Zareei, and C. Vargas-Rosales, "Rm-adr: resource management adaptive data rate for mobile application in lorawan," Sensors, vol. 21, no. 23, p. 7980, 2021.
[23] P. J. Liu, "Generating wikipedia by summarizing long sequences," in Proceedings of the International Conference on Learning Representations, ACL, Pennsylvania, PA, USA, April 2018.
[24] J. D. M.-W. C. Kenton and L. K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the NAACL-HLT, pp. 4171–4186, July 2019.

[25] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," Stat, vol. 1050, p. 20, 2017.
[26] D. Anand and R. Wagh, "Effective Deep Learning Approaches for Summarization of Legal Texts," Journal of King Saud University-Computer and Information Sciences, vol. 34, 2019.
[27] R. Nallapati, B. Zhou, and M. Ma, Classify or Select: Neural Architectures for Extractive Document Summarization, New York, NY, USA, 2016.
[28] W. Xiao and G. Carenini, "Extractive summarization of long documents by combining global and local context," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3011–3021, August 2019.
[29] R. Sahba, N. Ebadi, M. Jamshidi, and P. Rad, "Automatic text summarization using customizable fuzzy features and attention on the context and vocabulary," in Proceedings of the 2018 World Automation Congress (WAC), pp. 1–5, IEEE, Stevenson, WA, USA, July 2018.
[30] Á. Hernández-Castañeda, R. A. García-Hernández, Y. Ledeneva, and C. E. Millán-Hernández, "Extractive automatic text summarization based on lexical-semantic keywords," IEEE Access, vol. 8, Article ID 49896, 2020.
[31] R. Khan, Y. Qian, and S. Naeem, "Extractive based text summarization using K-means and TF-IDF," International Journal of Information Engineering and Electronic Business, vol. 11, no. 3, pp. 33–44, 2019.
[32] R. Siautama, R. Siautama, D. Suhartono, and D. Suhartono, "Extractive hotel review summarization based on TF/IDF and adjective-noun pairing by considering annual sentiment trends," Procedia Computer Science, vol. 179, pp. 558–565, 2021.
[33] L. Yao, Z. Pengzhou, and Z. Chi, "Research on news keyword extraction technology based on TF-IDF and TextRank," in Proceedings of the IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), pp. 452–455, IEEE, Beijing, China, June 2019.
[34] S. Narayan, J. Maynez, J. Adamek, D. Pighin, B. Bratanic, and R. McDonald, "Stepwise extractive summarization and planning with structured transformers," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4143–4159, August 2020.
[35] X. Zhang, F. Wei, and M. Zhou, "HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL, Pennsylvania, PA, USA, July 2019.
[36] A. Ravula, "ETC: Encoding Long and Structured Inputs in Transformers," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), June 2020.
[37] J. Kwon, N. Kobayashi, H. Kamigaito, and M. Okumura, "Considering nested tree structure in sentence extractive summarization with pre-trained transformer," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4039–4044, January 2021.
[38] Y. Liu, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.
[39] W. Xiao, P. Huber, and G. Carenini, "Do we really need that many parameters in transformer for extractive summarization? Discourse can help," in Proceedings of the First Workshop on Computational Approaches to Discourse, pp. 124–134, June 2020.
[40] L. Bacco, A. Cimino, F. Dell'Orletta, and M. Merone, "Explainable sentiment analysis: a hierarchical transformer-based extractive summarization approach," Electronics, vol. 10, no. 18, p. 2195, 2021.
[41] S. Xu, X. Zhang, Y. Wu, F. Wei, and M. Zhou, "Unsupervised extractive summarization by pre-training hierarchical transformers," in Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1784–1795, May 2020.
[42] Y. Liu, J. Zhang, Y. Wan, C. Xia, L. He, and S. Y. Philip, "HETFORMER: heterogeneous transformer with sparse attention for long-text extractive summarization," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 146–154, October 2021.
[43] Y. Liu, "Fine-tune BERT for extractive summarization," 2019, https://arxiv.org/abs/1903.10318.
[44] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), August 2019.
[45] T. Yang, L. Hu, C. Shi, H. Ji, X. Li, and L. Nie, "HGAT: heterogeneous graph attention networks for semi-supervised short text classification," ACM Transactions on Information Systems, vol. 39, no. 3, pp. 1–29, 2021.
[46] J. Xu, "Discourse-aware neural extractive text summarization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, January 2020.
[47] D. Miller, "Leveraging BERT for Extractive Text Summarization on Lectures," 2019, https://arxiv.org/abs/1906.04165.
[48] I. Zada, S. Ali, I. Khan et al., "Performance evaluation of simple K-mean and parallel K-mean clustering algorithms: big data business process management concept," Mobile Information Systems, vol. 2022, pp. 1–15, Article ID 1277765, 2022.
[49] K. Ma, Y. Tan, M. Tian et al., "Extraction of temporal information from social media messages using the BERT model," Earth Science Informatics, vol. 15, pp. 573–584, 2022.
[50] V. Tretyak and D. Stepanov, "Combination of Abstractive and Extractive Approaches for Summarization of Long Scientific Texts," 2020, https://arxiv.org/abs/2006.05354.
[51] A. Widad, B. L. El Habib, and E. F. Ayoub, "BERT for question answering applied on COVID-19," Procedia Computer Science, vol. 198, pp. 379–384, 2022.
[52] I. Ahmad, S. J. Xu, A. Khatoon, U. Tariq, I. Khan, and S. S. Rizvi, "Analytical study of deep learning-based preventive measures of COVID-19 for decision making and aggregation via the RISTECB model," Scientific Programming, vol. 2022, Article ID 6142981, 17 pages, 2022.
[53] K. Ma, M. Tian, Y. Tan, X. Xie, and Q. Qiu, "What is this article about? Generative summarization with the BERT model in the geosciences domain," Earth Science Informatics, vol. 15, no. 1, pp. 21–36, 2022.
[54] M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan, and D. Radev, "Graph-based neural multi-document summarization," in Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 452–462, June 2017.
[55] M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou, "Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2704–2713, May 2019.
[56] Z. U. Rahman, S. I. Ullah, A. Salam, T. Rahman, I. Khan, and B. Niazi, "Automated detection of rehabilitation exercise by stroke patients using 3-layer CNN-LSTM model," Journal of Healthcare Engineering, vol. 2022, Article ID 1563707, 12 pages, 2022.
[57] S. Saziyabegum and P. Sajja, "Review on text summarization evaluation methods," Indian J. Comput. Sci. Eng., vol. 8, no. 4, Article ID 497500, 2017.
[58] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Text Summarization Branches Out, pp. 74–81, 2004.
[59] A. Nawaz, M. Bakhtyar, J. Baber, I. Ullah, W. Noor, and A. Basit, "Extractive text summarization models for Urdu language," Information Processing & Management, vol. 57, no. 6, Article ID 102383, 2020.
[60] K. M. Hermann, "Teaching machines to read and comprehend," Advances in Neural Information Processing Systems, vol. 28, 2015.
[61] Y. Dong, Y. Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung, "BanditSum: extractive summarization as a contextual bandit," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3739–3748, September 2018.