Professional Documents
Culture Documents
Research Article: N-GPETS: Neural Attention Graph-Based Pretrained Statistical Model For Extractive Text Summarization
Research Article: N-GPETS: Neural Attention Graph-Based Pretrained Statistical Model For Extractive Text Summarization
Research Article
N-GPETS: Neural Attention Graph-Based Pretrained Statistical
Model for Extractive Text Summarization
Muhammad Umair ,1 Iftikhar Alam,1 Atif Khan ,2 Inayat Khan ,3 Niamat Ullah,4
and Mohammad Yusuf Momand 5
1
Department of Computer Science, City University of Science and Information Technology, Peshawar 25000, Pakistan
2
Department of Computer Science, Islamia College, Peshawar 25000, Pakistan
3
Department of Computer Science, University of Engineering and Technology, Mardan, Pakistan
4
Department of Computer Science, University of Buner, Buner 19290, Pakistan
5
Faculty of Computer Science, University of Nangarhar, Jalalabad 2600, Afghanistan
Copyright © 2022 Muhammad Umair et al. Tis is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Te extractive summarization approach involves selecting the source document’s salient sentences to build a summary. One of the most
important aspects of extractive summarization is learning and modelling cross-sentence associations. Inspired by the popularity of
Transformer-based Bidirectional Encoder Representations (BERT) pretrained linguistic model and graph attention network (GAT)
having a sophisticated network that captures intersentence associations, this research work proposes a novel neural model N-GPETS by
combining heterogeneous graph attention network with BERTmodel along with statistical approach using TF-IDF values for extractive
summarization task. Apart from sentence nodes, N-GPETS also works with diferent semantic word nodes of varying granularity levels
that serve as a link between sentences, improving intersentence interaction. Furthermore, proposed N-GPETS becomes more improved
and feature-rich by integrating graph layer with BERT encoder at graph initialization step rather than employing other neural network
encoders such as CNN or LSTM. To the best of our knowledge, this work is the frst attempt to combine the BERTencoder and TF-IDF
values of the entire document with a heterogeneous attention graph structure for the extractive summarization task. Te empirical
outcomes on benchmark news data sets CNN/DM show that the proposed model N-GPETS gets favorable results in comparison with
other heterogeneous graph structures employing the BERT model and graph structures without the BERT model.
summarization model as a sentence ranking task, incor- Tis paper suggests an innovative pretrained statistical-
porating a reinforcement learning objective to optimize the based graph attention network (N-GPETS) for single-doc-
ROUGE evaluation metric. Encoder-decoder sequence ar- ument extractive summarization, fusing BERT pretrained
chitecture for extractive summarization was mostly adopted framework and TF-IDF with graph attention network. First,
by the researchers [6–8] who utilize various neural com- the whole document is fed to BERT for encoding, which has
ponents to encode each sentence uniquely. To produce a a strong architectural foundation and has been pretrained on
summary from the input content that includes signifcant enormous data sets. Te BERT encoder generates word
sentences, learning and modelling cross-sentence linkage are nodes and sentence nodes. Tese word nodes served as an
crucial [9]. Recurrent neural networks (RNNs) have been additional semantic unit. Second, for the graph layer, the
used in the majority of the recently presented research, output of the BERT encoder in the form of word and
including [4, 10, 11], to learn and simulate the cross-rela- sentence nodes act as graph nodes while the values of TF-
tionships between texts. However, RNNs may sufer from IDF of the whole document serve as edge features between
the vanishing gradient problem, which causes gradient corresponding nodes. In graph layer, the graph attention
magnitudes to shrink as they propagate over time. Because of mechanism is applied and the representation of nodes ware
this phenomenon, the network’s memory ignores long-term updated. Finally, labels are assigned by the sentence selection
dependencies and fails to learn the correlation between module after it has extracted the representation of signifcant
temporally distant events [12]. sentences node from the graph layer.
To understand and simulate the cross-relationship be- N-GPETS enhanced the work of [9], which was actually
tween sentences, many scholars took advantage of the graph about the problem of capturing semantic-level relationships,
structure. Tere have been numerous attempts made to model but this work was completely unaware of using pretrained
efective graph networks for jobs requiring summarization [9]. models such as BERT along with graph attention mecha-
Recent research [12] used sentence personalization traits and nism. N-GPETS also difers from previous work [14] as they
discourse-aware intersentential interactions to create a use topic nodes as an additional semantic unit with the help
summarization model (ADG). Te authors in [13] used a of a joint neural network model (NTM). Te other diference
Rhetorical Structure Teory (RST) graph to model cross- of proposed N-GPETS from previous models is to generate
sentence association by utilizing joint extraction and syntactic TF-IDF values of the whole input text and used these fea-
compression to create a summary of a single-document text. tures between edges of graph nodes. Te proposed N-GPETS
Another approach is proposed in [14], where a diferent graph structure has the following advantages: (i) during the
strategy is suggested. It looks at an unsupervised discourse- graph propagation stage, semantic word nodes (additional
aware hierarchical graph network (HIPORANK) for lengthy units) which are highly featured rich, due to the BERT
scientifc publications, which uses intra- and interconnection framework [16], improve the sentence representation and
between document sentences as well as model asymmetric gather information from sentences; (ii) to link sentences and
edge weights for extracting important sentences. identify intersentence links, semantic word nodes can also be
Te preceding approaches relied on third-party tools and employed as a bridge; and (iii) our graph structure can use
did not consider the error propagation problem [9]. One of diferent levels of information during message passing. Te
the simplest methods mentioned above is to model a fully following are our model’s standout contributions:
connected graph at the sentence level. Recently, studies [6, 7]
used transformer architecture to learn pairwise interactions (1) a novel approach and frst attempt to build a BERT-
between phrases to model sentence-level graphs. A hybrid based statistical graph attention network N-GPETS
graph attention framework for learning the cross-sentence for summarizing single-input document text. An
relationships was put forward by [8] utilizing GAT and CNN extractive summary is produced by the graph layer
as encoders while TF-IDF values as edge features. However, using sentence and word nodes produced by BERT
this graph-building approach runs into the problem of and TF-IDF values of the entire manuscript.
capturing semantic-level relationships [14]. In a study [14], (2) to assess the efectiveness of the suggested N-GPETS
the authors developed a sentence-level graph-based model technique against cutting-edge approaches using
that employed BERT for sentence encoding and a joint CNN/DM News data sets using ROUGE evaluation
neural network model (NTM) to discover latent topic in- metrics.
formation rather than semantic word nodes in a hetero-
(3) Te simulation fndings on benchmark news data
geneous graph network. Te authors in [15] proposed a
sets: CNN/DM demonstrates that N-GPETS pro-
heterogeneous graph structure for modelling the cross-
vides generally acceptable outcomes in comparison
sentence relationships between sentences. Tey used three
with existing graph attention networks utilizing
types of nodes to capture the relationships between the
BERT or the absence of BERT in combination with
EDUs: nodes of the sentence, nodes of EDU, and nodes of
graph structures.
entity, RST, and they also used external discourse knowledge
to improve the model’s results. Te creation of a useful graph Te remaining portions of the article are organized as
model that enhances the extraction of important stuf for the follows. In Section 2, we take a critical look at the leading
formation of extractive summaries remains a challenging work on extractive summarization tasks. Section 3 describes
and unsolved research issue despite the success of the prior in full the proposed N-GPETS model methodology. Section
approaches [9]. 4 details the proposed model compared to other existing
Computational Intelligence and Neuroscience 3
cutting-edge models and also the hyper-parameters and models such as BERT (Bidirectional Encoder
model settings. Section 5 focuses on Results, and lastly, the Representations from Transformers) and GPT
study paper concludes in section 6, which also ofers sug- (Generative Pretrained Transformer) trained on
gestions for future research. massive corpora [12, 24].
(ii) BERT (Bidirectional Encoder Representations from
2. Literature Review Transformers) is a Google-developed machine
learning model that has already been trained for
Tis section discusses some traditional and advanced ap- natural language processing (NLP) tasks. Te
proaches/techniques for extractive summarization tasks. Google AI research team created and published
Initially, we look into how extractive summarization is BERT in 2018. It should be noted that Google an-
performed using a deep neural sequence-to-sequence model. nounced in 2019 that their search engines are using
Ten, we investigate how various statistical methods, such as the BERT approach. In 2020, one of the most recent
TF-IDF, LDA, and TextRank, perform feature extraction and surveys [24] claims that BERT in just one year has
summary generation tasks. Ten, deep-learning-based evolved into a widely used baseline in NLP research,
transformer architectures for extractive summarization are with over 150 research publications analyzing and
presented. We discuss how pretrained models, such as improving the model. BERT is a new language
BERT, are used for various NLP tasks, particularly sum- representation baseline that extends word embed-
marization. Finally, we look into how other neural graph- ding models [25]. Two signifcant duties were
based structures are used for the task of summary genera- covered in BERT training: language modelling (15%
tion. In the following section, we will briefy defne some of tokens were hidden, and BERT was taught to
background concepts. anticipate them based on context) and next sentence
prediction (using the frst sentence as a guide, BERT
2.1. Text Summarization. Automatic text summarization was trained to determine whether a particular
(ATS) is a method that creates an overview that contains all statement would be expected or not).
pertinent and important information by automatically (iii) Graph Attention Networks (GATs). To address the
summarizing a substantial amount of text. It is important to limitations of earlier methods that just used graph
note that automatic text summarization is a text mining convolutions, neural network designs called graph
process that accepts a lengthy text document as input and attention networks (GATs) deal with input that is
produces an appropriate summary [17]. Tere is an abundance arranged as a graph and employ masked self-at-
of text-based content on the Internet, including web publi- tention layers [26]. By focusing on each node’s
cations, papers, news, and reviews, that must be summarised neighbors, it is intended to employ a self-attention
to get the document’s gist [18]. ATS has many uses, such as method to calculate each node’s hidden represen-
short read generation, passage reduction, compaction, tations. Some of the most useful characteristics of
extracting, and the most important information from sensitive the graph attention architecture include the fol-
reports, including legal reports produced by legal authorities lowing: (i) the parallelization characteristic is sur-
[19]. ATS can also be used in news text summarizers to assist rounded by node-neighbor pairs, which makes the
readers in fnding the most interesting and important content attention mechanism efective. (ii) By giving the
in less time [20–22]. Other applications of ATS include neighbors diferent weights, this architecture is
sentiment summarization, legal text summarization, scientifc especially efcient because it can be used with graph
document summarization, tweet summarization, book nodes of diferent degrees. (iii) Te graph attention
summarization, story/novel summarization, e-mail summa- model can be used to directly address learning is-
rization, and bio-medical document summarization [23]. Te sues, such as those requiring the model to extrap-
fundamental design of the ATS system, as shown in Figure 1, olate to previously unobserved graphs [26].
includes the following functions.
(i) Te Transformer is a fully self-attention-based deep 2.2. Extractive Text Summarization Approaches and
learning model. It is a simple network that is Techniques. Finding the sentence’s location and the fre-
completely free of recurrence and convolutions. It is quency of words in the text was the most typical problem
one of the most advanced architectures in NLP and that surfaced from extractive summarization research [1].
computer vision [12]. Te Transformer does not Researchers in [27] used a deep learning technique called
need to comprehend the initial part of the sentence Feed Forward Neural Network (FFNN) for single-document
before the end because the attention mechanism legal text summarization. Tis method generates a coherent
adds context to the input sequence at any point in extractive summary without needing features or domain
the sentence. Instead of processing input sequen- expertise but fails miserably in summarizing difcult and
tially as RNN does, the Transformer allows for more long statements [1]. Te study [11] presented the encoder-
parallelization, which results in a shorter training decoder architecture as a foundation for single-document
time. Because of Transformer’s parallelization fea- summarization that contains an attention-based extractor
ture, training on larger data sets is possible, allowing and a hierarchical document encoder. Te authors in [28]
for the development of cutting-edge pretrained presented a classifer-based architecture (RNN based) that
4 Computational Intelligence and Neuroscience
(a) (b)
Figure 1: ATS basic architecture for single and multidocument [23].
accepts or rejects each sentence in the original document making to produce a summary as the output explanation. To
order for inclusion in the fnal summary in a sequential generate unsupervised extractive summaries, the researchers
manner. For a lengthy text that takes into account both the of [42] used a transformer attention mechanism to prioritize
global and local contexts of the content, the authors of [29] sentences. For extractive summarization of long text, the
suggested a single-document extractive summary model. A authors of [43] used the transformer model and introduced a
novel technique for summarization was provided by the type of heterogeneous framework called HETFORMER
authors in [30] that relied on a neural sequence-to-sequence framework. BERT is a pretrained model used by many re-
model with an attention mechanism and fuzzy character- searchers for extractive summarization. Te summarized
istics that could be customised. literature review is depicted in Table 1.
Statistical techniques such as TF-IDF, TextRank, LDA, Te “lecture summarizing service,” a Python-based
and clustering, among others, have been used for extractive RESTful service, chose relevant sentences near the cluster’s
summarization tasks. Te study [31] presented statistical centroid using the K-means clustering algorithm and the
topic modelling techniques such as latent BERT model for text embeddings to generate a summary
Dirichlet allocation (LDA), which select important sentences [47, 48]. Researchers in [49] utilize the bidirectional model
in clusters based on automatically generated keywords. BiLSTM and BERTmodel for extracting temporal information
Additionally, the study in [32] mainly utilized TF-IDF and from messages from social media platforms that are necessary
K-means clustering-based approaches for the creation of an for geographical applications. Te authors of [50] developed a
extracted summary. hybrid method for producing summaries of long scientifc
Te authors in [33] presented two methods for extractive texts that combined the benefts of both extractive and ab-
summarization of hotel reviews. Te frst method was used stractive designs. Te authors in [51, 52] use the deep learning
to select the most related sentences based on their TF-IDF model BERT and RISTECB model to answer important
score. Te second method generated the phrase summary questions related to the COVID-19 research articles. Te
style by pairing adjectives with the closest nouns and taking authors of [44] demonstrated an excellent tuning-based ap-
polarity into account. Te work done by [34] integrated the proach for extractive summarization using the BERT model.
TF-IDF and TextRank techniques to extract keywords from Te BERT model was also used by the authors of
input documents. [7, 8, 16, 36, 46] for contextual representation in summari-
A deep learning model called Transformer uses the self- zation tasks. Te authors in [53] use the BERT model to
attention process. Tis framework is utilized by numerous automatically generate titles from a huge set of published
researchers for extractive summarization jobs. Te re- literature or related work. Additionally, extractive summari-
searchers of [35] focused on the structured transformers zation tasks using graph structures have been carried out by
HiBERT presented by [36] and Extended Transformers exploiting linguistic and statistical information included in
presented by [37], which ofer an extractive encoder-centric sentences [9]. Recent research has combined neural networks
stepwise strategy for summarizing documents. Tis model with graphs, or (GNNs), and used the encoder-decoder
enabled stepwise summarization by inserting the previously structure for extractive summarization [13, 54]. Many re-
created summary as an additional substructure into the searchers nowadays use a heterogeneous graph neural net-
structured Transformer. Te authors in [38] presented an work with multiple updated nodes rather than a homogeneous
extractive summarization model based on layered trees, graph structure with no updated nodes for extractive sum-
where the given document’s discourse and syntactic trees are marization tasks. Te study [55] proposed a bipartite graph
combined to form nested tree structures. Te authors pri- attention network for multihop reading comprehension (RC)
marily focused on the existing model RoBERTa presented by across documents that encoded diferent documents and
[39] for constructing this model. By lowering the size of the entities together. Te authors in [48] presented an approach
attention module, the authors in [40] presented an extractive that modeled redundancy-aware heterogeneous graphs and
summarization technique for discourse-based attention at refned sentence representation using neural networks for
the document level; this constitutes the core of the trans- extractive summarization. Te studies [9, 56] proposed a
former architecture, utilizing a unique discourse-inspired heterogeneous graph neural network for extractive summa-
approach. Two diferent transformer-based techniques for rization that used CNN with Bi-LSTM as encoders for input
sentiment analysis were provided by the authors in [41] while text and TF-IDF values as edge features between graph nodes.
fetching the words that are crucial to the model’s decision- Utilizing a graph attention network, cross-sentence links
Computational Intelligence and Neuroscience 5
between sentences are learned. Te work done by [14] built a BERT and graph attention network along with a statistical
sentence-level graph-based model, using BERT for sentence approach. N-GPETS is broken down into four phases:
encoding and joint neural network model (NTM) for dis- document representation comes frst, then there are three
covering latent topic information. Te authors in [15] pro- trainable modules: BERT graph initializers, graph layer, and
posed a heterogeneous graph structure for modelling cross- important sentences selector. Te following subsections go
sentence relationship between sentences. To represent the over each of these phases in detail.
relationships between the EDUs, they used three diferent
types of nodes, including sentence nodes, EDU nodes, and
3.1. Representing Document as a Heterogeneous Graph.
entity nodes, and RST discourse parsing and leverage external
Consider a document represented by G � (V, E), where V
discourse expertise to enhance the model’s performance. Te
represents the set of nodes, and E represents the edges in
next section goes over the unique model N-GPETS meth-
between the nodes. In our framework, the attention graph
odology that is proposed in this study in depth.
structure is made by taking the union of VW and VS , i.e.,
V � VW UVS , E � e11 , . . . , emn , and VW � W1 , W2 , . . . , Wm .
3. Methodology Te quantity of distinct words in the text VS � S1 , S2 , . . . , Sn
indicates the quantity of sentences. E is now an edge weight
In this paper, an innovative pretrained statistical model for matrix with real values and eij ≠ 0 where i ∈ {1, . . . , n} and
extractive summarization task called N-GPETS is presented, j ∈ {1, . . . , m} demonstrate that jth sentence has the ith word
which is designed by combining the deep learning model as discussed in [9].
6 Computational Intelligence and Neuroscience
BERT Encoder
3.2. BERT Graph Initializers. As suggested by the BERT
model’s basic structure, the output vectors of BERTare based
on tokens instead of sentence tokens [25]. But it is clear to us
Word Sentence
that sentence-level representation is manipulated in the case
of extractive summarization. Te second thing that was Figure 2: A general framework of the suggested model (N-
noted was that the original BERT model’s segmentation GPETS).
embeddings just apply to the input of two sentences.
Nevertheless, the extractive summarization process requires
us to encode and manage multisentential inputs [7]. In this
(iii) Tese sentence vectors, word embeddings, and TF
study, each sentence begins with a [CLS] external token and
values of the whole input text are forwarded to the
ends with a [SEP] that overcomes the difculties that arise
attention graph layer
for single sentence representation in a document, same as
done in [7]. For the preceding sentence, external tokens (iv) Te graph layer serves as a summarization layer in
gather data while to diferentiate diferent sentences in a the N-GPETS model
document, segment embeddings are used [7]. For example,
we have fve diferent sentences in an input text, i.e., (sent1 ,
sent2 , sent3 , sent4 , and sent5 ). Each sentence has the fol- 3.5. Graph Layer. In the graph layer for the construction of
lowing embeddings associated with it: [EA, EB, EA, EB, and a bipartite graph, we gave the nodes of words and sen-
EA]. Tis method allows for the hierarchical learning of the tences along with TF values to the graph layer. After that
input document representations. In last, the vectors Ti which to update the representation of the semantic nodes, the
are the vectors of [CLS] tokens of every sentence generated graph attention network is used, same as previous work
by BERT having all information about each sentence senti [9], with the main diference being that the word and
are forwarded to the graph layer for the graph attention sentence node features used in the graph layer are encoded
mechanism. Tese vectors work as sentence nodes in the with the help of BERT model at the graph initializers stage
graph layer in the proposed N-GPETS. Te complete process discussed above rather than using diferent neural net-
of sentence nodes generation using BERT is depicted in work encoders such as CNN or Bi-LSTM. Te graph at-
Figure 3 [44]. tention layer (GAT) and hidden state of input nodes
hi ∈ Rdh can be constructed in the same way as demon-
strated in [9]:
3.3. Word Nodes and Edge Features. We employ the base
framework of the BERT encoder [25], depicted in Figure 2, zij � LeakyReLUwa wq hi ; wk hj ,
which takes the word of input text, encodes the words, and
generates word vectors. To highlight how word and sentence expzij
nodes are connected, in the initialization step of our model, αij � ,
l∈Ni exp zil (1)
we incorporate TF-IDF values into the edge weights, similar
to [9]. Utilizing BERT to create word, nodes are presented in
Figure 4. ui � σ ⎝
⎛ αij Wv hj ⎠
⎞.
j∈Ni
Input Document [CLS] sent one [SEP] [CLS] 2nd sent [SEP] [CLS] sent again [SEP]
Token Embeddings E[CLS] Esent Eone E[SEP] E[CLS] E2nd Esent E[SEP] E[CLS] Esent Eagain E[SEP]
Interval Segment
EA EA EA EA EB EB EB EB EA EA EA EA
Embeddings
BERT
Summarization Layers
Y1 Y2 Y3
[MASK] [MASK]
Token
E[CLS] Emy E[MASK] Eis Ecute E[SEP] Ehe E[MASK] Eplay E##ing E[SEP]
Embeddings
+ + + + + + + + + + +
Segment
EA EA EA EA EA EA EB EB EB EB EB
Embeddings
+ + + + + + + + + + +
Position
E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Embeddings
utilized in the proposed work and compares BERT-based 4.3.1. Hyperparameters and Implementation Settings.
and non-BERT models to the suggested model and also N-GPETS encodes the document using a pretrained BERT-
provides information about the objective evaluation ma- based model to produce sentence and word nodes. Vo-
trices utilized in the proposed system, as well as its cabulary is limited to 50,000 words, and tokens are activated
hyperparameters and execution settings. Te following in 300-dimensional embedding using BERT than the em-
subsections described them in a little bit of detail: bedded GloVe used in previous applications [7, 9]. Te
BERT output is 768-dimensional vectors, and the tokens are
4.1. Objective Evaluation Matrices. Te matrices like preci- 256-dimensional vectors.
sion, recall, F-measure accuracy, and ROUGE toolkit are Since the output of BERT is 768-dimensional vectors and
adopted by state of the art [9, 57–59]. Tey are defned the tokens taken by the Graph layer should be equal to the
below. 300-dimensional embedding as described in [8], we use a
linear layer after the BERT output that converts the 768-
4.1.1. Precision. Te number of sentences appearing in both dimensional vector to a 300-dimensional vector. When you
the system and the abbreviations for reference divided by the create word nodes, stops and punctuation are fltered. We
number of sentences present in the summary produced is limit the length of a sentence to 50 characters. To address the
called precision (P) [57]. problem of common noisy words, we remove 10% of vo-
cabulary from databases with low TF-IDF values. To start a
sentence node, the maximum size is kept at ds � 128, and the
4.1.2. Recall. Recall (R) is the number of sentences from
maximum size of the eij edge elements is kept at de � 50. FFN
both produced systems and reference abbreviations divided
is 512. Deep Graph Library (DGL) is used to use the graph
by the number of sentences present in the reference sum-
neural network, as is the case with [9].
mary [57].
Tirty-two batch size is used in training, and Adam’s
provided [50] with a reading rate of 5 and 4 is used. Pre-
4.1.3. F-Score. Te F-score is a compact matrix that com-
mature stops are made when the allowable loss is not less
bines accuracy with memory. Calculating the corresponding
than three consecutive epochs. Based on the functionality of
measure of accuracy and memory is a basic method for
the verifcation set, the number of duplicates is set to t = 1.
calculating the efect of the F-score [57, 58]. N-GPETS selects the top three sentences as a system-based
summary of the average length produced by human beings
4.1.4. Accuracy. Accuracy is the total number of well-labeled on CNN/Daily Mail, three sentences. Te specifcation of the
sentences divided by the number of sentences present in the system in which we train our model is 8 GB Random-Access
data set test set. Memory (RAM) and Intel (R) Core (TM) 7-6600U. Te CPU
is based on ×64 architecture and uses a 64-bit operating
4.1.5. N-Gram Co-Occurrence Statistics—ROUGE. system. Te Windows operating system installed is Win-
ROUGE stands for Recall-Oriented Understudy for Gisting dows 10 Pro.
Evaluation proposed by [58] is the most commonly used
testing tool in text summary research. Tis system compares 5. Results and Discussion
the quality of the summary produced by the system with the
man-made summary to determine how good it is. Te gold Tis section outlines the overall empirical fndings generated
standard we used includes two personal abbreviations and two by our suggested model, N-GPETS. N-GPETS is evaluated
human quotes. ROUGE test steps include ROUGE-N (N = 1, 2, using the criteria that when compared to other models, does
3, and 4) and ROUGE-L, ROUGE-W, ROUGE-S, and others. N-GPETS, which generates sentence and word nodes using
ROUGE-N quantifes the number of ’n-gram’ matches between BERT encoder and also has TF-IDF connections between
system summary and set of human summaries. nodes, produce adequate results? First, a frequently used
CNN/DM data set is utilized to train and then test the
4.2. Data Set. Tis study examined N-GPETS using a CNN/ N-GPETS model. For reference summaries, we considered
Daily Mail (anonymous version) data set, a standard data for the unigram R-1, bigram R-2, and longest common sub-
marking news [60]. Te data set is classifed and processed sequence R-L overlap. Second, a comparison has been made
according to standard classifcation, with 287,227/13,368/ between the proposed framework N-GPETS and previously
11,490 examples (92.5%/4.2%/3.6%) for training, validation, working both BERT-based and non-BERT graph structures
and sequence of tests, respectively, similar to previous tasks as depicted in Section 4.3. Additionally, ablation research is
[7, 9, 14, 15]. CNN/Daily Mail data set statistics are set out in carried out to show the importance of each model element.
Table 2 [14].
5.1. Overall Performance. On the CNN/DM data set, Table 5
4.3. Models for Comparison. Te N-GPETS model compares displays the ROUGE F-scores for several models. Tis table
high-resolution BERT-based graphs with non-BERT neural is divided into four sections: the frst section contains the
graph models for extracting text. Table 3 presents the details Lead-3 and Oracle scores, the second section contains the
of the non-BERT graph structures, and Table 4 shows the scores of models that did not use BERT, the results of BERT-
graph properties using the BERT model. based models are contained in the third part, and the
Computational Intelligence and Neuroscience 9
44.5
44
43.5
43
42.5
42
41.5
41
N-GPETS W/o W/o BERT TF-IDF
Residual changes
connections
Figure 5: ROUGE-1, F1 fndings on the CNN/DM data set for our full model N-GPETS, and three ablated variations.
21
20.5
20
19.5
19
18.5
N-GPETS W/o Residual W/o BERT TF-IDF
connections changes
Figure 6: ROUGE-2, F1 fndings on the CNN/DM data set for our full model N-GPETS, and three ablated variations.
fndings of the suggested model N-GPETS are shown in the between EDUs via entity nodes, EDU nodes, and RST
fourth part. Te results lead to the following conclusions: discourse parsing, N-GPETS shows better outcomes on
N-GPETS performs better than the cutting-edge non-BERT ROUGE R-1/R-2 having an increase of (0.54/0.05), re-
model HSG by a 1.8/1.3/2.2 on the F-score of R-1/R-2/and spectively, and having the same score on R-L.
R-L. Tis shows that our graph network, which is based on It should be mentioned that RST discourse parsing and
the BERT algorithm, has a better comprehension of learning third-party external tools are the foundation of Dis-
cross-sentence links. Additionally, N-GPETS performs coCorrelation-GraphSum. Contrarily, N-GPETS does not
better than each of the non-BERT models presented in utilize any outside tools or knowledge. Additionally, N-GPETS
Table 5. After that comparison to models that utilized BEET, beats the cutting-edge extractive summarization model DIS-
frst, N-GPETS is contrasted with Topic-Graph-Sum, which COBERT depended on the external tool in R1 metrics and
utilizes topic data via NTM as an additional semantic unit. produces results that are equivalent in R2 and RL metrics.
N-GPETS produced better results beating the Topic-Graph-
Sum framework by 0.13/0.05/0.42 on the F-score of R-1/R-2/
R-L. Second, when compared to BERTSum-sent, N-GPETS 5.2. Ablation on CNN/Daily Mail. To understand the func-
achieves better results having an increase of 0.9/0.62/1.03 on tion and impact of various contributed modules revealed in
the F-score of R-1/R-2/R-L. Tird, in contrast to Dis- our recommended model N-GPETS on performance, abla-
coCorrelation-GraphSum, which captures relationships tion research is conducted. First of all, the residual
Computational Intelligence and Neuroscience 11
angelique kerber beat madison keys 6-2 , 4-6 , 7-5 in statue in wuhan , central china , depicts country's frst
the charleston fnal. ruler and his wife.
kerber battled back to win six of the last seven games tourists fondling her exposed breast has damaged
in the decider. statue , ofcials say.
it is the german 's frst wta title since linz in 2013. legend has it that yu was lead to wife by a magical
nine-tailed fox.
angelique kerber rallied past madison keys to win the ofcials in wuhan , the capital city of central china 's
family circle cup on sunday, capturing six of the last hubei province , have accused tourists of damaging a
seven games for a 6-2 , 4-6 , 7-5 victory. statue of the country 's frts leader and his wife by
fonding the woman 's exposed breast.
this was kerber 's fourth wta title and frst since linz in
2013. the sculpture , which has been in place for ten years ,
depicts yu the great , the founder of chine 's frst xia
kerber had defeated friend , countrywoman and dynasty in 2070 bc , meeting his wife.
defending champion andrea petkovic in the
semifnals. legend says that yu and his wife were brought
together by a nine-tailed fox that lead them to one
another
Figure 7: Examples of summaries generated by proposed model N-GPETS along with reference summaries.
connections that are present between GAT layers were re- advantages of Google Colab include preinstalled libraries
moved and word nodes were attached to the initial sentence and the capacity to upload fles to the cloud. With the help of
feature, similar to previous work [9]. Te second thing that we the COLAB collaboration tool, several developers can col-
have done, rather than using TF-IDF values from the entire laborate on the same project and use free GPUs and TPUs.
document, TF-IDF values from the individual sentences were We train our model for fve epochs on 287000 CNN/DM
used as features in the graph layer. In the third one, we gave a news articles. Figure 7 shows examples of summaries gen-
sideline to the BERT model and make use of BiLSTM and erated by proposed model N-GPETS along with reference
CNN models for encoding the document and checking the summaries.
overall performance. According to Table 6, cutting of re-
sidual connections between GAT layers lowers the F-score for 6. Conclusion and Future Work
the R1/R2/and RL measures. Tis implies that residual
connections are crucial in integrating genuine representation Te process of creating an extractive summary relies heavily on
with messages updated from other sorts of nodes that cannot modelling the relationships across the input sentences. Inspired
be substituted by straightforward concatenation [8]. As by the popularity of Transformer-based Bidirectional Encoder
shown in Figure 5, we noticed a decline in the F-score of R1/ Representations (BERT) pretrained linguistic models and graph
R2/RL metrics when TF-IDF values (computed from indi- attention network (GAT) that captures intersentence associa-
vidual sentences) were used as edge features in the graph tions, this study proposes a novel neural model (N-GPETS) for
layer, demonstrating the efectiveness of TF-IDF values the extractive summarization task by combining heterogeneous
entire document. Lastly, by substituting CNN-Encoder and graph attention network with BERT model and statistical ap-
BiLSTM in place of the BERTmodel, the model achieves lower proach using TF-IDF values. In contrast to earlier research,
F-score values than the proposed model N-GPETS, and the nobody employed BERT for both sentence and word node
model is reduced to the HSG, a non-BERTmodel [9]. Figure 6 formation along with the TF values for the creation of an at-
shows ROUGE-2, F1 fndings on the CNN/DM data set for tention graph network. Te following benefts are associated
our full model N-GPETS, and three ablated variations. with constructing the N-GPETS model: (i) during graph
propagation, the addition of feature-rich semantic word nodes
encoded using BERTstrengthens sentence representation. It can
5.3. Environment for Model Development and Training. compile information from modifed sentences. (ii) Additionally,
Due to its superior GPU compared to the free version, semantic word nodes can be used to link sentences together and
Google Colab (pro) is utilized for model coding and training identify links between them. (iii) Our graph structure can use
instead of using simple COLAB. Programmers can write and diferent levels of information during message passing.
execute Python code directly from their browsers using According to the simulation fndings on the widely used CNN/
Colab, a Google Research product. It is important to note Daily Mail benchmark data set, our model performed better
that Google Colab is a great tool for a variety of deep learning than other heterogeneous graph structures that used the BERT
jobs. Te Jupyter notebook is hosted by Google Colab. model as well as graph structures that are opposed to BERT.
Consequently, no additional software is needed. Te N-GPETS is based on the summarization of a single document.
12 Computational Intelligence and Neuroscience
However, it can be expanded to include summarizing nu- of the Association for Computational Linguistics,
merous documents rather than just one. Using this graph vol. 1pp. 654–663, Beijing, China, May 2018.
structure to condense lengthy research publications is the [10] A. Vaswani, “Attention is all you need,” in Advances in Neural
second direction for the future. Other semantic units like topic Information Processing Systems, pp. 5998–6008, arXiv, New
and paragraph semantic units in graph structure can also be York, NY, USA, 2017.
used to improve summarization performance. [11] D. Wang, P. Liu, Y. Zheng, X. Qiu, and X. Huang, “Het-
erogeneous Graph Neural Networks for Extractive Document
Data Availability Summarization,” 2020, https://arxiv.org/abs/2004.12393.
[12] J. Xu and G. Durrett, “Neural extractive text summarization
Te data that support this study are available upon request with syntactic compression,” in Proceedings of the 2019
from the frst author. Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural
Conflicts of Interest Language Processing (EMNLP-IJCNLP), pp. 3292–3303, July
2019.
Te authors declare that they have no conficts of Interest. [13] P. Cui, Le Hu, and Y. Liu, “Enhancing extractive text sum-
marization with topic-aware graph neural networks,” in
Acknowledgments Proceedings of the 28th International Conference on Compu-
tational Linguistics, January 2020.
Tis work was performed in the Department of Computer [14] Y. J. Huang and S. Kurohashi, “Extractive summarization
Science, City University of Science and Information Tech- considering discourse and coreference relations based on
nology, Peshawar, Pakistan. heterogeneous graph,” in Proceedings of the 16th Conference of
the European Chapter of the Association for Computational
References Linguistics, pp. 3046–3052, June 2021.
[15] D. K. Bangotra, Y. Singh, A. Selwal, N. Kumar, P. K. Singh,
[1] A. P. Widyassari, S. Rustad, G. F. Shidik, E. Noersasongko, and W.-C. Hong, “An intelligent opportunistic routing al-
A. Syukur, and A. Afandy, “Review of automatic text sum- gorithm for wireless sensor networks and its application
marization techniques & methods,” Journal of King Saud towards e-healthcare,” Sensors, vol. 20, no. 14, p. 3887, 2020.
University-Computer and Information Sciences, vol. 34, 2020. [16] M. N. Uddin, B. Li, Z. Ali, P. Kefalas, I. Khan, and I. Zada,
[2] M. Shabir, N. Islam, Z. Jan et al., “Real-time pashto hand- “Software Defect Prediction Employing BiLSTM and BERT-
written character recognition using salient geometric and Based Semantic Feature,” Soft Computing, vol. 26, pp. 1–15,
spectral features,” IEEE Access, vol. 9, Article ID 160238, 2021. 2022.
[3] J. Pilault, R. Li, S. Subramanian, and C. Pal, “On extractive and [17] D. Suleiman and A. Awajan, “Deep learning based abstractive
abstractive neural document summarization with transformer text summarization: approaches, datasets, evaluation mea-
language models,” in Proceedings of the 2020 Conference on sures, and challenges,” Mathematical Problems in Engineering,
Empirical Methods in Natural Language Processing (EMNLP), vol. 2020, pp. 1–29, 2020.
pp. 9308–9319, September 2020. [18] L. Nkenyereye, L. Nkenyereye, B. Adhi Tama, A. G. Reddy,
[4] J. Cheng and M. Lapata, “Neural summarization by extracting and J. Song, “Software-defned vehicular cloud networks:
sentences and words,” in Proceedings of the 54th Annual architecture, applications and virtual machine migration,”
Meeting of the Association for Computational Linguistics, Sensors, vol. 20, no. 4, p. 1092, 2020.
pp. 484–494, June 2016. [19] P. Sethi, S. Sonawane, S. Khanwalker, and R. Keskar, “Au-
[5] K. Arumae and F. Liu, “Reinforced extractive summarization
tomatic text summarization of news articles,” in Proceedings of
with question-focused rewards,” in Proceedings of the ACL
the 2017 International Conference on Big Data, IoT and Data
2018, Student Research Workshop, pp. 105–111, June 2018.
Science (BID), pp. 23–29, IEEE, Pune, India, May 2017.
[6] S. Narayan, S. B. Cohen, and M. Lapata, “Ranking sentences
[20] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and
for extractive summarization with reinforcement learning,” in
H. K. Mohamed, “Automatic text summarization: a com-
Proceedings of the 16th Annual Conference of the North
prehensive survey,” Expert Systems with Applications, vol. 165,
American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 1747–1759, Article ID 113679, 2021.
Association for Computational Linguistics (ACL), Pennsyl- [21] K. Anwar, T. Rahman, A. Zeb et al., “Improving the con-
vania, PA, USA, 2018. vergence period of adaptive data rate in a long range wide area
[7] M. Zhong, P. Liu, D. Wang, X. Qiu, and X.-J. Huang, network for the internet of things devices,” Energies, vol. 14,
“Searching for efective neural extractive summarization: no. 18, p. 5614, 2021.
what works and what’s next,” in Proceedings of the 57th [22] K. Anwar, T. Rahman, A. Zeb, I. Khan, M. Zareei, and
Annual Meeting of the Association for Computational C. Vargas-Rosales, “Rm-adr: resource management adaptive
Linguistics, pp. 1049–1058, ACL, Pennsylvania, PA, USA, July data rate for mobile application in lorawan,” Sensors, vol. 21,
2019. no. 23, p. 7980, 2021.
[8] R. Nallapati, F. Zhai, and B. Zhou, “Summarunner: a recurrent [23] P. J. Liu, “Generating wikipedia by summarizing long se-
neural network based sequence model for extractive sum- quences,” in Proceedings of the International Conference on
marization of documents,” in Proceedings of the Tirty-First Learning RepresentationsACL, Pennsylvania, PA, USA, April
AAAI Conference on Artifcial Intelligence, AAAI Press, 2018.
California, CF, USA, July 2017. [24] J. D. M.-W. C. Kenton and L. K. Toutanova, “BERT: Pre-
[9] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, training of Deep Bidirectional Transformers for Language
“Neural document summarization by jointly learning to score Understanding,” in Proceedings of the NAACL-HLT,
and select sentences,” Proceedings of the 56th Annual Meeting pp. 4171–4186, July 2019.
Computational Intelligence and Neuroscience 13
[25] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, [40] L. Bacco, A. Cimino, F. Dell’Orletta, and M. Merone, “Ex-
and Y. Bengio, “Graph attention networks,” Stat, vol. 1050, plainable sentiment analysis: a hierarchical transformer-based
p. 20, 2017. extractive summarization approach,” Electronics, vol. 10,
[26] D. Anand and R. Wagh, “Efective Deep Learning Approaches no. 18, p. 2195, 2021.
for Summarization of Legal Texts,” Journal of King Saud [41] S. Xu, X. Zhang, Y. Wu, F. Wei, and M. Zhou, “Unsupervised
University-Computer and Information Sciences, vol. 34, 2019. extractive summarization by pre-training hierarchical trans-
[27] R. Nallapati, B. Zhou, and M. Ma, Classify or Select: Neural formers,” in Proceedings of the Findings of the Association for
Architectures for Extractive Document Summarization, New Computational Linguistics: EMNLP, 2020, pp. 1784–1795,
York, NY, USA, 2016. May 2020.
[28] W. Xiao and G. Carenini, “Extractive summarization of long [42] Y. Liu, J. Zhang, Y. Wan, C. Xia, L. He, and S. Y. Philip,
documents by combining global and local context,” in Pro- “HETFORMER: heterogeneous transformer with sparse at-
ceedings of the 2019 Conference on Empirical Methods in tention for long-text extractive summarization,” in Proceed-
Natural Language Processing and the 9th International Joint ings of the 2021 Conference on Empirical Methods in Natural
Conference on Natural Language Processing (EMNLP- Language Processing, pp. 146–154, October 2021.
IJCNLP), pp. 3011–3021, August 2019. [43] Y. Liu, “Fine-tune BERT for extractive summarization,” 2019,
[29] R. Sahba, N. Ebadi, M. Jamshidi, and P. Rad, “Automatic text https://arxiv.org/abs/1903.10318.
summarization using customizable fuzzy features and at- [44] Y. Liu and M. Lapata, “Text summarization with pretrained
tention on the context and vocabulary,” in Proceedings of the encoders,” in Proceedings of the 2019 Conference on Empirical
2018 World Automation Congress (WAC), pp. 1–5, IEEE, Methods in Natural Language Processing and the 9th Inter-
Stevenson, WA, USA, July 2018. national Joint Conference on Natural Language Processing
[30] Á. Hernández-Castañeda, R. A. Garcı́a-Hernández, (EMNLP-IJCNLP), August 2019.
Y. Ledeneva, and C. E. Millán-Hernández, “Extractive au- [45] T. Yang, L. Hu, C. Shi, H. Ji, X. Li, and L. Nie, “HGAT:
tomatic text summarization based on lexical-semantic key- heterogeneous graph attention networks for semi-supervised
words,” IEEE Access, vol. 8, Article ID 49896, 2020. short text classifcation,” ACM Transactions on Information
[31] R. Khan, Y. Qian, and S. Naeem, “Extractive based text Systems, vol. 39, no. 3, pp. 1–29, 2021.
summarization using K-means and TF-IDF,” International [46] J. Xu, “Discourse-aware neural extractive text summariza-
Journal of Information Engineering and Electronic Business,
tion,” in Proceedings of the 58th Annual Meeting of the As-
vol. 11, no. 3, pp. 33–44, 2019.
sociation for Computational Linguistics, January 2020.
[32] R. Siautama, R. Siautama, D. Suhartono, and D. Suhartono,
[47] D. Miller, “Leveraging BERT for Extractive Text Summari-
“Extractive hotel review summarization based on TF/IDF and
zation on Lectures,” 2019, https://arxiv.org/abs/1906.04165.
adjective-noun pairing by considering annual sentiment
[48] I. Zada, S. Ali, I. Khan et al., “Performance evaluation of
trends,” Procedia Computer Science, vol. 179, pp. 558–565,
simple K-mean and parallel K-mean clustering algorithms:
2021.
big data business process management concept,” Mobile In-
[33] L. Yao, Z. Pengzhou, and Z. Chi, “Research on news keyword
formation Systems, vol. 2022, pp. 1–15, Article ID 1277765,
extraction technology based on TF-IDF and TextRank,” in
2022.
Proceedings of the IEEE/ACIS 18th International Conference
[49] K. Ma, Y. Tan, M. Tian et al., “Extraction of temporal in-
on Computer and Information Science (ICIS), pp. 452–455,
formation from social media messages using the BERT
IEEE, Beijing, China, June 2019.
[34] S. Narayan, J. Maynez, J. Adamek, D. Pighin, B. Bratanic, and model,” Earth Science Informatics, vol. 15, pp. 573–584, 2022.
R. McDonald, “Stepwise extractive summarization and [50] V. Tretyak and D. Stepanov, “Combination of Abstractive and
planning with structured transformers,” in Proceedings of the Extractive Approaches for Summarization of Long Scientifc
2020 Conference on Empirical Methods in Natural Language Texts,” 2020, https://arxiv.org/abs/2006.05354.
Processing (EMNLP), pp. 4143–4159, August 2020. [51] A. Widad, B. L. El Habib, and E. F. Ayoub, “Bert for question
[35] X. Zhang, F. Wei, and M. Zhou, “HIBERT: document level answering applied on covid-19,” Procedia Computer Science,
pre-training of hierarchical bidirectional transformers for vol. 198, pp. 379–384, 2022.
document summarization,” in Proceedings of the 57th Annual [52] I. Ahmad, S. J. Xu, A. Khatoon, U. Tariq, I. Khan, and
Meeting of the Association for Computational LinguisticsACL, S. S. Rizvi, “Analytical study of deep learning-based preventive
Pennsylvania, PA, USA, July 2019. measures of COVID-19 for decision making and aggregation
[36] A. Ravula, “ETC: Encoding Long and Structured Inputs in via the RISTECB model,” Scientifc Programming, vol. 2022,
Transformers,” in Proceedings of the 2020 Conference on Article ID 6142981, 17 pages, 2022.
Empirical Methods in Natural Language Processing (EMNLP), [53] K. Ma, M. Tian, Y. Tan, X. Xie, and Q. Qiu, “What is this
June 2020. article about? Generative summarization with the BERT
[37] J. Kwon, N. Kobayashi, H. Kamigaito, and M. Okumura, model in the geosciences domain,” Earth Science Informatics,
“Considering nested tree structure in sentence extractive vol. 15, no. 1, pp. 21–36, 2022.
summarization with pre-trained transformer,” in Proceedings [54] M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan,
of the 2021 Conference on Empirical Methods in Natural and D. Radev, “Graph-based neural multi-document sum-
Language Processing, pp. 4039–4044, January 2021. marization,” in Proceedings of the 21st Conference on Com-
[38] Y. Liu, RoBERTa: A Robustly Optimized, BERT Pretraining putational Natural Language Learning (CoNLL 2017),
Approach, 2019. pp. 452–462, June 2017.
[39] W. Xiao, P. Huber, and G. Carenini, “Do we really need that [55] M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou,
many parameters in transformer for extractive summariza- “Multi-hop reading comprehension across multiple docu-
tion? Discourse can help,” in Proceedings of the First Work- ments by reasoning over heterogeneous graphs,” in Pro-
shop on Computational Approaches to Discourse, pp. 124–134, ceedings of the 57th Annual Meeting of the Association for
June 2020. Computational Linguistics, pp. 2704–2713, May 2019.
14 Computational Intelligence and Neuroscience