Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Natural Language Processing Journal 3 (2023) 100014

Contents lists available at ScienceDirect

Natural Language Processing Journal


journal homepage: www.elsevier.com/locate/nlp

Rare words in text summarization


Danila Morozovskii, Sheela Ramanna ∗
Department of Applied Computer Science, University of Winnipeg, Winnipeg, Manitoba R3B 2E9, Canada

ARTICLE INFO ABSTRACT


Keywords: Automatic text summarization is a difficult task, which involves a good understanding of an input text
Abstractive summarization to produce fluent, brief and vast summary. The usage of text summarization models can vary from legal
Extractive summarization document summarization to news summarization. The model should be able to understand where important
Pointer-generator
information is located to produce a good summary. However, infrequently used or rare words might limit
Transformer
model’s understanding of an input text, as the model might ignore such words or put less attention on them,
Rare words
especially, when the trained model is tested on the dataset, where the frequency distribution of used words
is different. Another issue is that the model accepts only a limited amount of tokens (words) of an input text,
which might contain redundant information or not including important information as it is located further in
the text. To address the problem of rare words, we have proposed a modification to the attention mechanism of
the transformer model with pointer-generator layer, where attention mechanism receives frequency information
for each word, which helps to boost rare words. Additionally, our proposed supervised learning model
uses the hybrid approach incorporating both extractive and abstractive elements, to include more important
information for the abstractive model in a news summarization task. We have designed experiments involving
a combination of six different hybrid models with varying input text sizes (measured as tokens) to test our
proposed model. Four well-known datasets specific to news articles were used in this work: CNN/DM, XSum,
Gigaword and DUC 2004 Task 1. Our results were compared using the well-known ROUGE metric. Our best
model achieved R-1 score of 38.22, R-2 score of 15.07 and R-L score of 35.79, outperforming three existing
models by several ROUGE points.

1. Introduction Text summarization models can be separated into two main groups:
extractive (Verma and Nidhi, 2018; Zhang et al., 2016) and abstrac-
The amount of textual data generated on the internet in recent years tive (Nallapati et al., 2016; Rush et al., 2015). An extractive method
has been enormous and is only growing (Clissa, 2022). Automatic text generates summaries using sentences from the input text. In contrast,
summarization helps to reduce textual information into a convenient in an abstractive approach, the text is generated using an external
summary, that is easier to understand. The summary, that is generated, vocabulary of words, which might include words that are not present
should be as informative as possible, at the same time being fluent, in the original input. The combination of extractive and abstrac-
brief and vast (Syed et al., 2021). Text summarization can be used in tive approaches is called a hybrid approach. This approach intends
a variety of applications, such as a search engine to provide a direct to overcome the weaknesses of the extractive and abstractive ap-
answer to a query (Dehru et al., 2021), legal document summariza- proach (Goodfellow et al., 2014; Chanb and Kinga, 2021). With the
tion (Kanapala et al., 2019) or a headline generation (Panthaplackel help of transformer model (Vaswani et al., 2017) and attention mech-
et al., 2022). Another application is news summarization, which helps anism (Kim et al., 2017) it became possible to not rely on sequential
to expedite the understanding of an article for a human (El-Kassas et al., elements of a model (such as recurrent neural networks (RNN) Elman,
2021). 1990 or long short-term memory (LSTM) Hochreiter and Schmidhuber,
Automatic text summarization can be based on different aspects (El- 1997), as these networks are not good in handling long text.
Kassas et al., 2021), such as input size (single-document, multi-
document) or multi-media (input information is gathered from several 1.1. Problem definition
sources, such as text, image and/or video Jangra et al., 2021), sum-
marization algorithm (supervised, unsupervised or semi-supervised), In natural language processing (NLP), word embedding encodes
summarization approach (extractive, abstractive or hybrid), summary words in a fixed-length vector (Almeida and Xexéo, 2019) by using
type (headline, sentence-level, highlights or full summary) and others. surrounding words to define the vector. However, the problem occurs

∗ Corresponding author.
E-mail addresses: morozovskii-d@webmail.uwinnipeg.ca (D. Morozovskii), s.ramanna@uwinnipeg.ca (S. Ramanna).

https://doi.org/10.1016/j.nlp.2023.100014
Received 13 December 2022; Received in revised form 25 April 2023; Accepted 5 May 2023

2949-7191/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Fig. 1. Overview of different models.

when a word is not present in the vocabulary, due to its usage being We refer to the transformer model with a pointer-generator layer
infrequent or being a completely new word. Such words are called out- as M1 model. Our proposed model uses a transformer model with a
of-vocabulary (OOV) words, which are encoded with the same vector. pointer-generator layer (See et al., 2017) as an abstractive approach.
The input text or the summary might contain OOV words, resulting Model M2 is the transformer model with a pointer-generator layer as
in the reduction of the amount of information that the model can well as frequency information added as an additional parameter to
process (Zhang et al., 2022). Some techniques to deal with OOV words the attention mechanism. Finally, to test the efficacy of the extractive
include pointer-generator model (where the model can ‘‘point’’ to the approach, models M1 and M2 have been tested with an input text
input text to copy the word) (See et al., 2017) or usage of dual encoder size consisting of 200 tokens (e.g., M1-200, M2-200) and 400 tokens
(where the first encoder encodes the text regularly, and the second (e.g., M1-400, M2-400) resulting in a total of 6 models (shown in
encoder encodes the importance of words Yao et al., 2020). Fig. 1). The goal of using different token sizes was to test if the
A similar but somewhat different problem is a rare words problem, extractive approach helps improve the performance of the models as
where the words are not new, but appear less frequently in the training well as makes it possible to observe the effects of changes to the input
dataset. Training the vector representation of such words tends to text size.
worsen the general performance of the model (Gulcehre et al., 2016). The reason for applying the extractive approach prior to model
Since OOV words are also encoded using exactly the same vector, no training is to remove redundant information from the input. The model
differentiation is made between words that are either used less or more testing is done using an annotated (labeled) summary. The pointer-
frequently. To the best of our knowledge, there has been limited work generator layer decides whether to copy a word from the input text
regarding usage of rare words in text summarization problem (Song or generate a new word using the probabilities from the transformer.
et al., 2019; Schick and Schütze, 2020), since the focus of most of the
The main part of the transformer model is an attention mechanism,
research has been on the OOV problem.
which the transformer relies on. Attention (Bahdanau et al., 2015) was
In addition, the deep learning model has a fixed input size; there-
designed to overcome the problem of long dependencies. It is used for
fore, only a limited amount of the input text (tokens) can be used. Some
the model to identify how much attention should be placed on other
papers (Nallapati et al., 2016; See et al., 2017; Devlin et al., 2019;
words in the context given the input word and the decoder hidden
Zhang et al., 2020) take the first 𝑁 tokens as input text. As a result,
layer.
it is possible that important information might be located further in
We use four well-known datasets: CNN/DM (Nallapati et al., 2016),
the text, if it is too long for the deep learning model. Including the first
XSum (Narayan et al., 2018), Gigaword (Graff et al., 2003) and DUC
𝑁 tokens might also lead to the inclusion of irrelevant information.
2004 Task 1 (Over et al., 2007). These datasets are based on news arti-
Before including the first 𝑁 tokens, important information should be
cles, and are used to generate a summary from the article. The Recall-
extracted. The approach taken in this paper, is to include tokens after
important information in the input text is extracted. In this paper, we Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) score
also focus on the problem of rare words. is used as a metric to evaluate the performance of the models with the
most commonly used variations which are ROUGE-1 (R-1), ROUGE-2
1.2. Proposed approach (R-2) and ROUGE-L (R-L) scores.
Our proposed approach can be used in situations, when the text
Our proposed model puts more attention on rare words and less domain is different than the one the model was trained on (i.e., testing
attention on frequently used words. Our intuition is that boosting on a different dataset, which has shorter summaries). Frequency infor-
attention for rarely used words might provide additional information mation can help the model to concentrate on rare words, which might
to the model, which might help improve the performance and generate contain important information. Our hybrid model helps to extract
better summaries. We do not use pre-trained embedding encoders or important parts of the text, before the model generates a summary,
transfer learning (like BERT Devlin et al., 2019, BART Lewis et al., 2020 which might be helpful, when the input text is very long and important
or PEGASUS Zhang et al., 2020), which are used to achieve current information might be located further in the text.
state-of-the-art results, as transfer learning might be biased against The contribution of this research are as follows: We have combined
rare words. Our proposed supervised learning approach is to use the the transformer model with the pointer-generator model, which is the
hybrid model (shown in Fig. 1), where the summary generated from an first attempt in this field. We have proposed a modification to the
extractive approach is used as input to the abstractive model training. attention mechanism of the transformer model with a pointer-generator

2
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

layer, which adds frequency information about words in an effort to Probability 𝑝𝑔𝑒𝑛 is calculated by concatenating the context vector (ℎ),
boost the attention of rarely used words. We have implemented a the hidden layer of the decoder(𝑠) and the decoder input (𝑥) and
hybrid approach, which uses extractive and abstractive elements and bias 𝑏𝑝𝑡𝑟 after the positional encoding (embedding of summarized text
showed that it outperforms models which use only an abstractive ap- after positional encoding) and all these parameters are passed through
proach. We have demonstrated that our proposed model with frequency individually weighted matrices and the bias is added. Then, this con-
information, trained on CNN/DM and XSum datasets can perform catenated result is input into a sigmoid function. Later, the context
better when tested on the DUC 2004 Task 1 dataset. We present case- vector’s attention is used to decide whether the word should be copied
studies of summarization experiments to test the transformer model and multiplied by 1 − 𝑝𝑔𝑒𝑛 and the output of the model (𝑠) is used with
with a pointer-generator layer as well as our proposed model with the context vector to decide whether the word should be generated in
two different input text sizes (measured as tokens). Our best model which case it is multiplied by 𝑝𝑔𝑒𝑛 .
achieved R-1 score of 38.22, R-2 score of 15.07 and R-L score of 35.79, The final distribution 𝑃 (𝑤), is calculated using Eq. (2) (See et al.,
outperforming other models by several ROUGE points. 2017), where 𝑃𝑣𝑜𝑐𝑎𝑏 (𝑤) is the probability distribution and 𝑎𝑖 is the
The rest of this paper is organized as follows: Section 2 describes attention distribution over all words respectively:
the background information for the proposed transformer-based model, ∑
𝑃 (𝑤) = 𝑝𝑔𝑒𝑛 𝑃𝑣𝑜𝑐𝑎𝑏 (𝑤) + (1 − 𝑝𝑔𝑒𝑛 ) 𝑎𝑖 (2)
as well as hyper parameters used; Section 3 explains four different 𝑖∶𝑤𝑖 =𝑤
datasets that have been used as well as preprocessing steps; Section 4
Probabilities of a word being generated are multiplied by 𝑝𝑔𝑒𝑛 , and
describes experiments that have been performed and analyzes the
the attention distribution is multiplied by 1 − 𝑝𝑔𝑒𝑛 and those two
results; Section 5 concludes the paper and suggests improvements for
distributions are added together to form a final distribution of the
future work.
next word. The attention distribution 𝑎 represents the copy mechanism
which contains the attention for all words in the input text, so if the
2. Preliminaries
word is in OOV or belongs to OOV, the first part (which is multiplied
by 𝑝𝑔𝑒𝑛 ) will be closer to zero. The attention for this OOV word, which
The main backbone model that is our model built upon is the
appears in the second part, will be added to the final distribution, which
transformer model. It was firstly described in Vaswani et al. (2017) and
leads to ‘‘copying’’ the word with the highest attention. On the contrary,
can be used for text summarization. The transformer model does not
if the word does not appear in the input text, then the probability 1−𝑝𝑔𝑒𝑛
contain any sequential elements similar to RNN or LSTM, that are used
will be closer to zero, and the decoder prediction of the next word
in Sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014). The
being generated will be used. However, when the summary contains
authors argued that even though the Seq2Seq model with attention tries
an unknown word (OOV token is used), which cannot be encoded with
to solve the problem of the usage of sequential elements, it still uses the
a vocabulary or being encoded, the model will learn to also generate
fundamental concept of a sequential algorithm. The transformer model
an OOV token.
solely relies on the attention mechanism using a simple feed forward
The copy-generator mechanism combines distributions from the
neural network for training and uses an encoder–decoder structure. context vector (attention to words from the input text) and the proba-
bility distribution of a word being generated from the decoder. Using
2.1. Proposed transformer with pointer-generator layer these two probabilities and the 𝑝𝑔𝑒𝑛 parameter, model can regulate how
much attention from the input text or probability distribution from the
Our proposed model shown in Fig. 2 has a transformer backbone. decoder affect the final distribution, which predicts the probability of
To overcome the problem of OOV words, we have added a pointer- a word being generated.
generator layer from See et al. (2017), which decides whether to copy
or generate a word. 2.2. Attention with frequency information
The authors See et al. (2017) created a pointer-generator model
based on a sequence-to-sequence attentional model (similar to the The self-attention mechanism inside the encoder shows how much
study Nallapati et al., 2016), which can copy words from the original attention the model should put on the input words. However, the model
text (pointer) while being able to generate new words (generator). does not have information on the importance of a word and learns it
Attention is used to produce a context vector of an input text, which throughout the training phase. Therefore, we propose an approach of
is used later to produce a pointer distribution (a probability of a word adding information about the frequency of a given word. Our intuition
being copied). The model itself outputs the distribution of a word being is that it might help the model to focus more on the most important
generated, which later are added together with the pointer distribution words and less on others. For instance, if the word is rarely used after
to get the final probability distribution. Using pointer/generator the adding frequency information, then the attention for this word will be
model can regulate, how much of a distribution will be taken from much higher than for commonly used words, such as ‘‘the’’ or ‘‘an’’.
the pointer and how much from the generator distributions using To add this information, we multiply the self-attention of the en-
the 𝑝𝑔𝑒𝑛 parameter (Eq. (2)). A more detailed explanation how the coder layer with the score of a word, which is obtained by the inverse
pointer-generator mechanism works is described below. logarithm of how many times the word appears in the dataset (given
The original pointer-generator model (See et al., 2017) uses the in Eq. (3)). This equation is similar to the TF-IDF, however, as TF-IDF
sequence-to-sequence (Sutskever et al., 2014) attention model with an is used when several documents are present, we have used only the
encoder and decoder structure. In our model, the pointer-generator IDF part of it with total documents equal to 1 and we have added the
layer is inserted as the last layer in a transformer model shown in a ‘‘shift’’ variable, which is described later. For most numbers, its inverse
shaded green. It uses a context vector (ℎ), which uses information about is lower than 1, therefore, the model will decrease attention for almost
the input text and the hidden state of the decoder. To obtain the context all words and only boost a few. To boost values for more words, we
vector, we use another multi-head attention layer (shown in Fig. 3), shift all scores to a mean value of 1 using the ‘‘shift’’ variable, which
where K and V are the output of an encoder, and the Q is an output of is calculated dynamically. The scores are shifted to a mean value of
the normalization layer inside the decoder. 1, as to boost attention of rare words, the score should be larger than
Eq. (1) (See et al., 2017) gives the probability value of the pointer 1, and the output of the inverse logarithm has only 9 values which
generator layer 𝑝𝑔𝑒𝑛 where 𝒙 represents the decoder input after posi- are greater than 1. To have more words boosted, we can regulate the
tional encoding, 𝒔 is the output of the decoder and 𝒉 is a context vector, shifting parameter. This approach will decrease attention to the most
𝑤𝑇_ represents the weight matrix that is used inside hidden layers. commonly used words, while it will increase attention for rare words.
1
𝑝𝑔𝑒𝑛 = 𝜎(𝑤𝑇ℎ∗ ℎ + 𝑤𝑇𝑠 𝑠 + 𝑤𝑇𝑥 𝑥 + 𝑏𝑝𝑡𝑟 ), (1) 𝑤𝑜𝑟𝑑𝑓 𝑟𝑒𝑞 = + 𝑠ℎ𝑖𝑓 𝑡, (3)
𝑙𝑜𝑔(𝑜𝑐𝑐)

3
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Fig. 2. Transformer with pointer-generator layer (highlighted in green). The drawing is based on a figure from Vaswani et al. (2017). 𝒙 represents the decoder input after positional
encoding, 𝒔 is the output of the decoder and 𝒉 is the context vector. (For interpretation of the references to color in this figure legend, the reader is referred to the web version
of this article.)

Table 1
Example: Dictionary of frequencies and scores for each word.
Word Number of occurrences Score
The 11,896,156 0.619
mid-flight 170 0.926
million-a-year-deal 1 3.800

This information is later passed to the attention matrix inside the


encoder self-attention mechanism through the linear neural network,
multiplying it with the attention values (Eq. (4)). The aim here is that
by boosting the attention of rarely used words and limiting the model’s
attention to frequently used words can help the model to put higher
attention on uncommon words, which might contain some important
information.
Fig. 3. Multi-Head Attention with frequency information (frequency score for each 𝑎𝑡𝑡𝑛𝑓 𝑟𝑒𝑞 = 𝑎𝑡𝑡𝑛 ∗ 𝑊𝑠𝑇 𝑠𝑓 𝑟𝑒𝑞 (𝑡), (4)
word is added (highlighted in green)). (For interpretation of the references to color 𝑓 𝑟𝑒𝑞

in this figure legend, the reader is referred to the web version of this article.) where 𝑊𝑠𝑇 represents the weight matrix that is used inside hidden layers.
Source: The drawing is based on a figure from Vaswani et al. (2017). 𝑓 𝑟𝑒𝑞
𝑠𝑓 𝑟𝑒𝑞 (𝑡) is the frequency information about the word at a given time step ‘‘t’’.
The revised attention mechanism in the encoder is shown in shaded
green in Fig. 3 where the frequency score is added to the ‘‘Scaled
where ‘‘occ’’ is the number of occurrences of a specific word in the training Dot-Product Attention’’ and multiplied by a weight matrix.
dataset. In the following section, we describe our algorithm for the extrac-
The scores of all words are predefined; therefore, it is saved in a tive approach before applying the abstractive approach so that the
dictionary before training. Table 1 illustrates a predefined dictionary, model is provided with important information.
where the lower the number of occurrences of a word, the higher the
score. The shift value for this example is 0.478. If a word does not 2.3. Proposed extractive approach for rare information
appear in this dictionary (unknown word), a maximum score value will
be assigned, since it might contain some important information (for We propose adding extractive approach algorithm before an abstrac-
instance, a name/surname or a city, which is rarely used). tive approach to provide the model with more important information.

4
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Table 2 Table 3
Dictionary of 3-gram frequencies and scores. The DUC 2004 Task 1 dataset has only a testing dataset; therefore, for training and
3-gram Number of occurrences Score validation DUC 2004 Task 1 dataset it is ‘‘None’’.
Dataset Training Validation Testing
(‘wales’, ‘british’, ‘irish’) 45 0.605
(‘british’, ‘irish’, ‘lions’) 504 0.370 CNN/DM 286,817 13,368 11,487
(‘irish’, ‘lions’, ‘fly-half’) 6 1.285 XSum 204,045 11,332 11,334
Gigaword 3,803,957 189,651 1951
DUC 2004 Task 1 None None 500

The algorithm for an extractive approach is described in Alg. 1. To


tokenize sentences (separate the text into individual sentences) a pre-
(R-2) and ROUGE-L (R-L) scores to test the performance, as these scores
defined function from the nltk library1 is used (line 4). Stop words
are removed (line 5). Usually, the first few sentences contain the most are standard performance measurements and used as a benchmark.
important information, and it is useful to take the first 𝑘 sentences 𝑅𝑂𝑈 𝐺𝐸𝐹1 score (Eq. (7)) is calculated using recall (Eq. (5)) and preci-
(line 11). For instance, we take the first ten sentences when the limit sion (Eq. (6)) measures, where 𝑟𝑒𝑐𝑎𝑙𝑙 calculates the number of N-grams
of an input text is 400 tokens. In our approach, we extract the level common to both texts and divided by the number of N-grams in the
of importance by using the indicator representation of 3-gram scores second text, and 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 calculates the number of N-grams common
of phrases in each sentence, which are later used to select important to both texts and divided by the number of N-grams in the first text.
sentences (line 8). 𝑐𝑜𝑢𝑛𝑡(𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡1 ∩ 𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡2 )
𝑟𝑒𝑐𝑎𝑙𝑙 = (5)
𝑐𝑜𝑢𝑛𝑡(𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡2 )
Algorithm 1 An algorithm for an extractive approach 𝑐𝑜𝑢𝑛𝑡(𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡1 ∩ 𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡2 )
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = (6)
1: procedure extractiveApproach(𝑑𝑎𝑡𝑎𝑠𝑒𝑡, 𝑓 𝑖𝑟𝑠𝑡_𝑘, 𝑁) 𝑐𝑜𝑢𝑛𝑡(𝑁-𝑔𝑟𝑎𝑚𝑡𝑒𝑥𝑡1 )
2: for each 𝑖𝑛𝑝𝑢𝑡_𝑡𝑒𝑥𝑡 in 𝑑𝑎𝑡𝑎𝑠𝑒𝑡 do
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑟𝑒𝑐𝑎𝑙𝑙
3: 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑡_𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑠 ← [ ] 𝑅𝑂𝑈 𝐺𝐸𝐹1 = 2 ∗ (7)
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙
4: 𝑡𝑜𝑘𝑒𝑛𝑖𝑧𝑒𝑑_𝑠𝑒𝑛𝑡𝑠 ← 𝑛𝑙𝑡𝑘.𝑠𝑒𝑛𝑡_𝑡𝑜𝑘𝑒𝑛𝑖𝑧𝑒(𝑖𝑛𝑝𝑢𝑡_𝑡𝑒𝑥𝑡)
5: 𝑡𝑜𝑘𝑒𝑛𝑖𝑧𝑒𝑑_𝑠𝑒𝑛𝑡𝑠 ← 𝑟𝑒𝑚𝑜𝑣𝑒_𝑠𝑡𝑜𝑝𝑤𝑜𝑟𝑑𝑠(𝑖𝑛𝑝𝑢𝑡_𝑡𝑒𝑥𝑡) 2.5. Greedy algorithm and beam search
6: 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑡_𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑠 ← 𝐹 𝑖𝑟𝑠𝑡𝐾𝑆𝑒𝑛𝑡𝑠(𝑓 𝑖𝑟𝑠𝑡_𝑘)
7: for each 𝑠𝑒𝑛𝑡 in 𝑡𝑜𝑘𝑒𝑛𝑖𝑧𝑒𝑑_𝑠𝑒𝑛𝑡𝑠 do
We use the beam search algorithm (Medress et al., 1977) to choose
8: 𝑠𝑐𝑜𝑟𝑒𝑠 ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒𝑆𝑐𝑜𝑟𝑒𝐹 𝑜𝑟𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒(3_𝑔𝑟𝑎𝑚)
the predicted word. It is possible to modify the beam search using
9: end for
different approaches. In our version of a beam decoder, we use the same
10: 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑡_𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑠 ← 𝑆𝑒𝑙𝑒𝑐𝑡𝐼𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑡𝑆𝑒𝑛𝑡𝑠(𝑁,
length control algorithm as in Wu et al. (2016). Some papers suggested
𝑠𝑐𝑜𝑟𝑒𝑠)
different algorithms to control the length (for instance, the paper Yu
11: 𝐴𝑑𝑑𝑃 𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑑𝑇 𝑒𝑥𝑡(𝑜𝑢𝑡_𝑑𝑎𝑡𝑎𝑠𝑒𝑡, 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑡_𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑠)
et al., 2021 used decoder hidden state and context vector to generate
12: end for
the context length vector); however, the beam search length control
13: end procedure
function is much easier to implement.

N-gram frequency scoring works the following way: the predefined 2.6. Hyper parameters
dictionary of N-gram words from a training dataset is extracted, where
we count how many times each N-gram is used in the entire dataset. We use Kullback–Leibler divergence loss function (Kullback and
Then, the inverse logarithm of a frequency is calculated, using the same Leibler, 1951) with label smoothing. The label smoothing (Szegedy et al.,
formula as frequency information of words given in Eq. (3) but without 2016) is used to increase the robustness of a model, as it penalizes
shifting, as we need the information about the frequency of the word overconfident outputs. We chose the dimension of our model (and
(N-grams, which are mentioned only once, are not included, due to embedding size) as 256 with an inner layer dimensionality of 1024.
the problem of calculating the inverse logarithm of 1). Table 2 gives a AdaGrad optimizer (Duchi et al., 2011) is used. We did not use Adap-
sample dictionary of 3-gram words with the number of times they occur tive Moment Estimation (Adam) optimizer (Kingma and Ba, 2015), as
and their scores. At the end, each sentence is scored based on their N- AdaGrad showed better results. We use a dropout (Srivastava et al.,
gram scores, where each sentence’s mean score is calculated. Unknown 2014) of 20. We use a beam size of 4 and 8 layers and 8 heads for
N-grams are assigned a maximum score since they might contain some the encoder–decoder structure. We have used a 3-gram model for the
important information. We have experimented with different values extractive approach. For 400 tokens, we take the first ten sentences,
for the N-gram approach, and selected the 3-gram approach since it and for 200 tokens, we take the first five sentences, as the input text is
achieved the best results. shorter. All parameters have been chosen based on models trained on
a partial dataset containing around 10,000 samples.
2.4. Performance measure
3. Dataset preparation
To evaluate the model’s performance, the Recall-Oriented Under-
study for Gisting Evaluation (ROUGE) (Lin, 2004; Ganesan, 2018; Liu We have used the following four datasets for text summarization:
and Liu, 2010; Yang et al., 2018) score is used as a benchmark based CNN/ DM, XSum, Gigaword and DUC 2004 Task 1 (Task 1 in DUC
on the well-known recall measure. It calculates the degree of overlap 2004 dataset refers to very short single-document summarizations).
between the annotated and generated summaries using the number of Most experiments have been done on the CNN/DM dataset because it
N-grams (recall). ROUGE-N measures how many numbers of N-gram contains the longest summary, and the annotated summary contains
words in the annotated summary appear in the generated summary. extractive and abstractive elements. It is most commonly used as a
R-1 uses 1-gram words, R-2 uses 2-gram words and R-L uses Longest standard text summarization dataset, so it is possible to compare our
Common Subsequence. We use 𝐹1 score of ROUGE-1 (R-1), ROUGE-2 model with other models. In Table 3, the number of samples from each
dataset used in training, validation and testing is shown. These datasets
1
https://www.nltk.org/. have the same structure: index, input text, and annotated summary.

5
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Fig. 4. Text preprocessing pipeline.

3.1. Datasets description Table 4


The number of tokens for input text, annotated summary and maximum generated
summary for different datasets. For the DUC 2004 Task 1 dataset, the length of the
The CNN/Daily Mail dataset (CNN/DM) (Hermann et al., 2015; See input text depends on the dataset the model has been trained on, as it has only a test
et al., 2017) is the most popular dataset for text summarization task dataset.
and is usually used as a benchmark. This dataset was collected using Dataset Input text Annotated Maximum generated
the CNN2 and DM3 websites, and bullet points from articles are used as summary summary
an annotated summary. It has long input text and annotated summary, CNN/DM 400/200 100 120
and on average the input text consists of 30 sentences and 724 tokens, XSum 400/200 20 30
whereas the annotated summary consists of 4 sentences and 52 tokens. Gigaword 50 18 25
DUC 2004 Task 1 – 18 25
The Extreme Summarization (XSum) (Narayan et al., 2018) is a highly
abstract dataset, which was designed to answer the question ‘‘What is
the article about?’’. It does not contain useless or redundant information
and rarely has long extracted phrases from the input text. It was will add a punctuation to the word (e.g., ‘‘soon, i will go home.’’ will
collected using the British Broadcasting Corporation (BBC) articles. The become [‘‘soon’’, ‘‘i’’, ‘‘will’’, ‘‘go’’, ‘‘home.’’]). We also simply cannot
input text, on average, contains 436 tokens; however, compared to the remove punctuations, as they are crucial to understanding the end of
CNN/DM dataset, it has some samples with over 15,000 tokens. As the a sentence, due to its importance in the extractive approach. Addition-
purpose of this dataset is to generate as condensed summary as possible ally, in words such as ‘‘u.s.a.’’, we cannot separate the word into three
and focus on the most important information, the annotated summary different sentences because it contains a dot. Therefore, we separate the
contains only 9 tokens on average. sentence into words by using regular expressions (regex) pattern described
The Annotated English Gigaword dataset is based on the Gigaword in NLTK Book, Chapter 3.7,5 where words such as ‘‘u.s.a.’’ or ‘‘poster-
corpus (Graff et al., 2003). This dataset contains the headline of each print’’ are retained in their original form. Next, words and punctuations
article as a summary text and the first sentence as an input text. The are concatenated to recreate a sentence, where each word is separated
average number of tokens for an input text sample is 31, and for an by space (e.g., ‘‘soon, i will go home .’’). In step 5, we expand the
annotated summary is 8. contractions, for example, ‘‘don’t’’ will be changed to ‘‘do not’’ or
The DUC 2004 (Document Understanding Conference) Task 1 ‘‘i’ll’’ will be changed to ‘‘i will’’. One can use a contractions library6 ;
dataset (Over et al., 2007) is used for testing purposes and contains however, their use will result in incorrect contractions. For example,
500 short articles from the New York Times and Associated Press Wire. ‘‘u.s.a’’ will be changed to ‘‘you.s.a’’. Therefore, a hard-coded dictionary
It has a structure similar to the Gigaword dataset, which contains was used, as all processed contractions can be stored. This preprocessed
an average of 36 tokens for input test and 11 tokens for annotated text is used in our extractive approach to extract important information.
summary. We use the nltk function sent_tokenize to separate sentences from each
other.
3.2. Preprocessing for text summarization After preprocessing, the vocabulary is created, where each unique
word is assigned a number in increasing order. Then, each sentence is
Various methods exist for text preprocessing such as removing stop encoded using the vocabulary, and the encoded list is sent to the model.
words, lemmatization, stemming, to name a few (Widyassari et al., During training, the model generates the same number of tokens
2022). However, only a limited number of preprocessing techniques are as presented in ‘‘annotated summary’’ shown in Table 4; however,
applicable for a text summarization task, as the model needs to generate during testing, the model generates ‘‘maximum generated summary’’
readable text, and some techniques can make it inappropriate. For number of tokens. The ‘‘maximum generated summary’’ has longer
instance, if stop words are removed, then the model will The authors length, as later it will be compared to the full annotated summary
of paper (Ledeneva, 2008) suggest that usage of preprocessing does (not cropped one); therefore, we generate a bit longer summary to
not affect the accuracy of the model. On the contrary, our experiments get higher ROUGE score. However, with most datasets, the model will
showed that preprocessing methods can help boost the model accuracy. generate EOS (end of sentence) token before this limit. Annotated
For text preprocessing, we have used nltk library.4 For all datasets, the summaries mainly consist of words which also appear in the input
same preprocessing techniques have been applied, and the pipeline can text. However, some words in the annotated summaries are not used
be seen in Fig. 4. in the input text vocabulary, which worsens the results and leads to
To preprocess the text, we use the following approach: in step 1, the generation of an < 𝑢𝑛𝑘 > token. Additionally, annotated summaries
extra symbols such as ‘‘-lrb-’’ are removed (which might be present might contain words which are missing in the input text; however,
in the HTML text). In step 2, the input text and annotated summary there are only a few such examples.
are converted into lowercase, since it does not change the meaning
of the sentence. Step 3 will separate the character ‘‘s’’ in the word 3.3. An example of rarely used words
(e.g., ‘‘amy’s’’) and remove all Unicode characters. We use ASCII (Amer-
ican Standard Code for Information Interchange) characters only and Table 5 shows an example of an input text and an annotated
not Unicode characters, as the used datasets contain standard English summary from the CNN/DM dataset. In the following table, the ex-
characters. In step 4, we separate a sentence into words. However, this ample of a rarely used words are ‘‘mayweather’’ or ‘‘pacquiao’’, which
process cannot be done by merely separating words using spaces, as it represent surnames of people. Even though from the context it might be

2 5
https://www.cnn.com. https://www.nltk.org/book/ch03.html#nltk-s-regular-expression-
3
https://www.dailymail.co.uk. tokenizer.
4 6
https://www.nltk.org/. https://pypi.org/project/contractions/.

6
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Table 5 Table 6
A fragment of input text and annotated summary from the CNN/DM dataset. Results of different models, which were trained for 5 epochs and tested on the CNN/DM
Input text Annotated summary dataset. Best results are shown in bold-face.
Model R-1 R-2 R-L
The build-up for the blockbuster fight Floyd mayweather holds an open
between floyd mayweather and manny media workout from 12 am uk 7 M1 35.56 12.90 33.08
pacquiao in las vegas on may 2 steps up pm edt. the american takes on M2 35.59 12.51 33.22
a gear on tuesday night when the manny pacquiao in las vegas on
M1-200 35.36 12.65 32.82
american holds an open workout for the may 2. mayweather ’s training is
M2-200 35.09 11.98 32.62
media. the session will be streamed live being streamed live across the
across the world and you can watch it world. M1-400 36.29 13.24 33.90
here from 12 am uk 7 pm edt. M2-400 35.83 13.09 33.48

understandable that it represents the surname, the rareness of this word 4.2. Trained models: Results and analysis
might help the model to identify that this word might be important for
the final summary, and it can be seen, that those surnames are used in The following section is separated into several subsections, where
the annotated summary. each subsection shows and analyzes results of models trained and
However, rarely used words can be represented not only as sur- tested on a specific dataset (CNN/DM, XSum, Gigaword datasets). We
names: other examples are ‘‘maternal’’, ‘‘deliberating’’ or ‘‘aberbar- perform tests on different models. Firstly, we test the performance of
goed’’ (city name). The idea behind boosting such rarely used words the transformer with the pointer-generator layer (M1) and perform
is that the attention of such words will be higher, therefore, there is a another test with the same model, but with the addition of information
higher chance that the model will copy rarely used words or that this about the frequency of each word (M2). Secondly, we add the extractive
word will affect more to the final distribution. approach to the abstractive approach by including the number of tokens
(M1-x and M2-x, where x is number of tokens used 200 or 400).
3.4. Implemented transformer model algorithm
4.2.1. Models trained on the CNN/DM dataset
A high level description of the implemented transformer model is In this subsection, we discuss models that were trained and tested
given in Alg. 2. The model uses the encoder (line 4) which gets encoded on CNN/DM dataset (Table 6). As we can see, the M2 model performs
input text and frequency information about words in the input text better than the M1 model in terms of R-1 and R-L scores; however,
(each word is encoded with a score). The encoder calculates the atten-
the R-2 score is larger for the M1 model. Nevertheless, the difference
tion of an input text and passes this information to the decoder. The
between R-1 and R-L of both models is small, so one can conclude that
decoder then uses this information and a part of annotated summary
both models perform similarly with respect to these scores.
with mask to calculate 𝑠𝑡 , 𝑥𝑡 and ℎ𝑡 (line 5 and 6). Then, the pointer-
However, when we test the same model with the addition of the
generator layer decides, whether the word should be copied from the
extractive approach layer, we can see that the performance suffers
input text or a new word should be generated in line 7. At the very
significantly indicating that frequency information does not result in
end, the loss function is used to calculate the total loss for a batch in
any gain in performance. Comparing models trained with 200 vs.
line 8.
400 tokens (M1-200, M2-200, M1-400 and M2-400), we can observe
that the model with 400 tokens performed much better overall, and
Algorithm 2 An algorithm for a transformer model significantly better than models without the extractive approach. It can
1: procedure Transformer(𝑒𝑝𝑜𝑐ℎ𝑠, 𝑏𝑎𝑡𝑐ℎ𝑒𝑠) be seen that the extractive approach helped to improve the performance
2: for each 𝑒𝑝𝑜𝑐ℎ in 𝑒𝑝𝑜𝑐ℎ𝑠 do of models, where ROUGE scores shown an increase from 35.56 to 36.29
3: for each 𝑏𝑎𝑡𝑐ℎ in 𝑏𝑎𝑡𝑐ℎ𝑒𝑠 do (M1-200).
4: 𝑒𝑛𝑐𝑜𝑑𝑒𝑑 ← 𝑒𝑛𝑐𝑜𝑑𝑒𝑟(𝑏𝑎𝑡𝑐ℎ.𝑠𝑟𝑐, 𝑏𝑎𝑡𝑐ℎ.𝑠𝑟𝑐_𝑓 𝑟𝑒𝑞) It is noteworthy that our experiments show that frequency infor-
5: 𝑠𝑡 , 𝑥𝑡 , 𝑚ℎ_𝑎𝑡𝑡𝑛_𝑑𝑒𝑐𝑜𝑑𝑒𝑟 ← 𝑑𝑒𝑐𝑜𝑑𝑒𝑟( mation helps to distribute attention throughout several words rather
𝑒𝑛𝑐𝑜𝑑𝑒𝑑, 𝑏𝑎𝑡𝑐ℎ.𝑡𝑟𝑔_𝑖𝑛𝑝, than relying on a single word or copying words. Adding frequency
𝑏𝑎𝑡𝑐ℎ.𝑡𝑟𝑔_𝑚𝑎𝑠𝑘) information also helps to generate new rarely used words or copy words
6: ℎ𝑡 ← 𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑉 𝑒𝑐𝑡𝑜𝑟(𝑒𝑛𝑐𝑜𝑑𝑒𝑑, 𝑚ℎ_𝑎𝑡𝑡𝑛_𝑑𝑒𝑐𝑜𝑑𝑒𝑟) from the input text, which are rarely used.
7: 𝑔𝑒𝑛_𝑠𝑢𝑚_𝑝𝑟𝑜𝑏 ← 𝑝𝑜𝑖𝑛𝑡𝑒𝑟_𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑜𝑟(𝑠𝑡 , 𝑥𝑡 , ℎ𝑡 , In general, for models trained on the CNN/DM dataset, M2 model
𝑒𝑛𝑐𝑜𝑑𝑒𝑑) performs better than the M1 model in terms of R-1 and R-L scores;
8: 𝑙𝑜𝑠𝑠 ← 𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝐿𝑜𝑠𝑠(𝑔𝑒𝑛_𝑠𝑢𝑚_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑖𝑒𝑠, however, the R-2 score is larger for the M1 model. Nevertheless, the
𝑏𝑎𝑡𝑐ℎ.𝑡𝑟𝑔_𝑒𝑥𝑡) difference between R-1 and R-L scores of both models is small, so one
9: end for can conclude that both models perform similarly with respect to these
10: end for scores. When we test same models with an extractive approach layer,
11: end procedure we can see that the performance suffers significantly for the model with
frequency information with no gain in performance. Comparing models
4. Experiments trained with 200 vs. 400 tokens (M1-200, M2-200, M1-400 and M2-
400), we can observe that 400 tokens performed much better, even
4.1. Experimental setup better than models without an extractive approach. It can be seen that
the extractive approach helped to improve the performance of models,
Two different machine configurations were used for our exper- where ROUGE scores show an increase from 35.56 to 36.29 (M1-200).
iments. One machine had one 8 GB GPU RTX 2080 and Intel(R)
Core(TM) i7-9700K CPU @ 3.60 GHz, and a second machine had four 4.2.2. Models trained on the XSum dataset
12 GB GPUs TITAN X and Intel(R) Core(TM) i7-5930K CPU @3.50 GHz. The process used to train and test the 6 models on the XSum dataset
All models have been trained on 5 epochs, which usually takes around was identical to the training method used on the CNN/DM dataset
a day to train. The best models using the CNN/DM dataset were since both datasets share a similar structure (i.e., long input text and
trained for 30 epochs to enable comparison with the models from short annotated summary, even though the CNN/DM dataset has longer
papers (Nallapati et al., 2016; See et al., 2017; Deaton et al., 2019). annotated summaries). The results are shown in Table 7. The M1 model

7
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Table 7 Table 9
Results of different models, which were trained for 5 epochs and tested on the XSum Results of different models, which were tested on the DUC 2004 Task 1 dataset. Best
dataset. Best results are shown in bold-face. results are shown in bold-face.
Model R-1 R-2 R-L Model R-1 R-2 R-L
M1 28.26 8.95 25.25 M1 CNN/DM 19.64 4.36 18.75
M2 27.73 9.13 24.92 M2 CNN/DM 20.86 4.50 18.82
M1-200 CNN/DM 21.13 5.03 19.18
M1-200 25.07 7.35 22.30
M2-200 CNN/DM 21.19 5.15 19.27
M2-200 24.09 7.02 21.73
M1-400 CNN/DM 20.92 4.81 18.90
M1-400 27.76 8.92 24.67 M2-400 CNN/DM 21.59 5.17 19.46
M2-400 26.61 8.49 23.83
M1 XSum 12.97 2.21 11.76
M2 XSum 12.07 2.22 11.16
Table 8 M1-200 XSum 10.59 1.64 9.98
Results of different models, which were trained for 5 epochs and tested on the Gigaword M2-200 XSum 14.45 2.62 13.27
dataset. Best results are shown in bold-face. M1-400 XSum 13.46 1.86 12.24
M2-400 XSum 14.67 2.63 13.83
Model R-1 R-2 R-L
M1 Gigaword 26.35 9.23 24.50
M1 33.09 15.64 31.14
M2 Gigaword 26.24 8.89 24.40
M2 32.31 14.68 30.49

datasets have, on average, around 400 tokens for an input text, meaning
showed better results in terms of R-1 and R-L scores and comparable
that for models, which were trained on these datasets, most of the
in R-2 score. Similar to the models that were trained on the CNN/DM
tokens will have a padding symbol. These models were trained to
dataset, in this model, adding an extractive layer helped it to achieve
gather information from different parts in the text, and when the text
higher ROUGE score in both cases. However, we can see more clearly,
is short, we can see that models, which were trained on these datasets,
that models without frequency information outperformed other models
performed poorly.
by almost one point score in terms of R-1 and R-L scores.
Models, which were trained on the CNN/DM dataset and used
Adding extractive layer seems to worsen the result for models
frequency information, always show better results than those without
trained on the XSum dataset. This maybe due to the fact that annotated
frequency information. Having only 200 tokens in an input text with an
summaries do not contain as much copied information from the input
extractive approach shows better results in both models, compared to
text as, for example, in the CNN/DM dataset.
the default ones. The M2-400 model trained on the CNN/DM dataset
showed the best ROUGE score (R-1 score of 21.59, R-2 score of 5.12
4.2.3. Models trained on the Gigaword dataset
and R-L score of 19.46). For the model without frequency information,
The Gigaword dataset is different from the previous two datasets.
the M1-200 model is better (R-1 score of 21.13, R-2 score of 5.03 and
Table 8 gives the experimental results without the extractive approach,
R-L score of 19.18). It might be connected to the fact that the CNN/DM
as the input text has only a few sentences, with an average of 31 tokens
dataset has longer input text and longer annotated summary when the
used in the input text. Hence, the extractive approach is not applicable. DUC 2004 Task 1 dataset is shorter; therefore, models trained on a
The results show that frequency information does not help for shorter input text (like the Gigaword dataset) show better results.
this dataset either. The difference between two models is much more Models trained on the XSum dataset with frequency information
significant than for models which were trained on other datasets, where are almost always better, except for the M1 model. Having 200 tokens
the difference is more than 1.5 ROUGE points. worsens the results for the M1 model, and shows improvement for
the M2-400 model. Having frequency information boosts the results
4.2.4. Why and where models with added frequency information can be of the M2-200 and M2-400 models compared to the one without this
used information, where the M2-400 model has the highest overall score
From previous results we can see, that most of the time models (R-1 score of 14.67, R-2 score of 2.63 and R-L score of 13.89).
with added frequency information (M2) performed poorly, than models The M1 and M2 models trained on the Gigaword dataset have a
without (M1). However, our experiments showed that models with difference in ROUGE score of 0.1–0.3 points, with the M1 model having
added frequency information (M2) tends to generate sentences using the highest score (R-1 score of 26.35, R-2 score of 9.23 and R-L score
synonyms and include in the final summary words from the input text, of 24.50). It is a significant improvement for the M2 model, tested on
that are rarely used. Unfortunately, due to the fact that ROUGE score the Gigaword dataset, where the difference was 1–2 points.
reduces the score for sentences, which use synonyms, models with The reason why the Gigaword dataset performs much better than
added frequency information showed worse ROUGE score, even though models trained on any other datasets is that the Gigaword dataset
our experiments showed that sometimes those models would generate has short input text and short annotated summary (the number of
more understandable summaries. tokens for the input text and the annotated summary is similar to
Models with added frequency information preferably to be used in the DUC 2004 Task 1 dataset) whereas the XSum and the CNN/DM
situations, when the test dataset has different frequency distribution of datasets have much longer input text. Even though the XSum dataset
used words, as it is shown in the next section (in experiments with the has a shorter annotated summary compared to the one in the CNN/DM
DUC 2004 Task 1 dataset). dataset, it showed worse results, which might be connected to the
fact that the XSum dataset is extremely abstractive, meaning that the
4.3. Models tested on the DUC 2004 Task 1 dataset annotated summary contains changed phrases, compared to the one in
the input text. The model trained on the CNN/DM dataset learned to
We have models that were trained on 3 different datasets. To copy phrases or words; therefore, there is a higher chance of scoring
compare the performance of all models, we have tested them on the a higher ROUGE score on the DUC 2004 Task 1 dataset, which might
DUC 2004 Task 1 dataset. contain copied information.
In Table 9, we can observe that the model trained on the Gigaword
dataset performed the best. It is because the DUC 2004 Task 1 dataset 4.3.1. DUC 2004 dataset results analysis
has short input text and short annotated summary, with only 36 and The ROUGE score of the model strongly depends on the dataset it
11 tokens on average, respectively. Both the CNN/DM and the XSum was trained on. Moreover, models trained on separate datasets generate

8
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Table 10 number of words from the annotated summary which appear in the
Comparing our models (trained for 30 epochs) to the models from other papers using
input text. We can see that the extractive approach with 200 tokens
the CNN/DM dataset.
has much less important information, when with 400 tokens more
Model R-1 R-2 R-L
important information is present, which improves the overall result. For
words-lvt2k-temp-att (Nallapati et al., 2016) 35.46 13.30 32.65
instance, sentence number 12 (missing in the text without an extractive
pointer-generator (See et al., 2017) 36.44 15.66 33.42
Transformer + Pointer-Generator + N-Gram 25.31 4.16 15.99 approach) contains ten words, which also appears in the annotated
Blocking (2-gram) (Deaton et al., 2019) summary. The total number of words from the annotated summary
M1-400 (trained on the CNN/DM dataset) 38.22 15.07 35.79 present in the input text of a dataset without an extractive approach,
M2-400 (trained on the CNN/DM dataset) 37.52 14.46 35.02 with extractive (400 tokens), and with extractive (200 tokens) are 94,
102, and 51, respectively. By including more important information,
the model can generate better summaries.
differently structured summaries and focus on different information. In most cases, the first few sentences contain the most important
The Gigaword dataset has a similar structure to the DUC 2004 Task information, which can be used to generate a summary. However, there
1 dataset, which is why the model trained on the Gigaword dataset might be text, where important information is located at the very end
performed so well on the latter one. The model, which is trained on the of the article, and it would not be included in the model. In See et al.
XSum dataset, generated a summary, which is highly abstractive, and (2017) where the pointer-generator approach is used, a smaller number
it copies less information from the input text compared to the models of tokens did not lead to a higher ROUGE score. It is probably due to
trained on the CNN/DM dataset. the fact that they were cutting off important information, which was
Adding an extractive approach helps to increase a ROUGE score of located farther in the text.
the model, which was trained on the CNN/DM and even the XSum Because our extractive approach sorts sentences based on impor-
dataset. In addition to that, adding frequency score inside the self- tance, this approach cannot be used on short text (e.g., 1–2 sentences)
attention of an encoder helped to achieve higher results; however, if the optimally, since all sentences will be included. This approach is useful
model is tested on the same dataset, adding frequency information fails only when the input text is long (for example, longer than 10–15
to outperform the M1 model, even though it generates more interesting sentences). Therefore, we have applied the extractive approach only
summaries, which might contain additional information. for the CNN/DM and XSum datasets.
Experiments on the DUC 2004 Task 1 dataset showed, that when the
test dataset format is different from the training dataset format (it was 5. Conclusion and future work
collected using different methods or different information), the model
with added word frequency information performs better than without We have successfully implemented a hybrid model using extractive
the frequency information. Frequency information helps the model to and abstractive approaches with a transformer and pointer-generator
concentrate on rare words, which might contain important information. layer, which helped to achieve a higher ROUGE score and boost rarely
Our experiments show that when the input text is short, it is better not used words. In addition, we showed that the extractive approach is
to use attention, for instance, in the Gigaword dataset. essential, since it serves as a preprocessing of the input text where
important information located in different parts of the text can be
4.4. Comparative analysis: Benchmark models captured rather than taking the first N tokens.
Adding frequency information to the model improves results only in
The current state-of-the-art models (Qi et al., 2020; Bao et al., some cases; however, the model might add extra information using less
2021; Xiao et al., 2022) use a pre-trained encoder and can achieve common words because the model distributes the attention to different
higher accuracy (more than 44 for R-1). However, since we are not words rather than to a single word. M1-400 achieved a high ROUGE
using pre-trained embedding, we have selected three models as our score of R-1 score of 38.22, R-2 score of 15.07 and R-L score of 35.79
benchmark (Nallapati et al., 2016; See et al., 2017; Deaton et al., 2019), and outperformed (See et al., 2017) model in R-1 and R-L score by
which used CNN/DM dataset for training and testing. The model words- at least 2 points, even though the R-2 score is slightly lower. M2-400
lvt2k-temp-att (Nallapati et al., 2016) and pointer-generator (See et al., model showed better results in almost all tests on the DUC 2004 Task
2017) are based on Seq2Seq backbone with pointer-generator layer. 1 dataset compared to either the M1 or the M1-400 models, which is
The model from Transformer + Pointer-Generator + N-Gram Blocking because the M2 model has additional information about the rarity of
(2-gram) (Deaton et al., 2019) tried to combine transformer (Vaswani each word which seemed helpful when tested on a different dataset
et al., 2017) with the pointer-generator layer from See et al. (2017). that the model was trained on.
The results are shown in Table 10. Both models (M1-400 and M2- In future work, coverage mechanism (Tu et al., 2016) can be im-
400 models) have been trained for 30 epochs. We can see that both plemented, which tracks what information has been included and
of our models showed better results in terms of R-1 and R-L scores eliminates repetitive words or phrases. In some cases that have been
compared to the selected benchmark papers and slightly lower in terms shown (for instance, examples of a generated summary of models that
of the R-2 score than the model from See et al. (2017). We were able have been trained on the CNN/DM dataset), important information
to achieve results similar to the ones in See et al. (2017) in terms of has been repeated several times. In addition, a more robust evaluation
the R-1 and R-L scores when our model was trained only for 5 epochs. technique should be tested. 𝑅𝑂𝑈 𝐺𝐸𝐹1 score is not the best evaluation
Even though adding frequency information lowered a ROUGE score, technique. For instance, if a synonym of a word is used, which does not
the results are still comparable. change the meaning of a sentence, the ROUGE score will be lowered,
These results show, that our model outperforms the selected bench- as in this case, it will not be counted towards the N-gram metric.
mark models, which use the pointer-generator model or pointer- Even though adding frequency information to the encoder’s atten-
generator with a transformer. Adding a pre-trained encoder or coverage tion improved ROUGE score of models, which were tested on the DUC
mechanism to our models could boost the results further. 2004 Task 1 dataset, in most of cases adding frequency information
worsened the ROUGE score and the generated summary sometimes
4.5. Analysis: Important sentences was including more information than necessary. Changing how the
frequency scoring is passed to the encoder could be tested, as well as
It is hard to predict whether the redundant information has been adding frequency scoring to the decoder part, which can help to focus
eliminated, however, the extractive approach can help to extract the on important information that has been generated so far. This might
most important sentences from the text. For instance, Fig. 5 shows the help to eliminate repetition of important information in the generated

9
D. Morozovskii and S. Ramanna Natural Language Processing Journal 3 (2023) 100014

Fig. 5. Summarization statistics from the annotated summary, which appear in input text from the CNN/DM dataset.

Overall, our best model achieved an R-1 score of 38.22, an R-2 score of 15.07 and an R-L score of 35.79, and outperformed the See et al. (2017) model in R-1 and R-L by at least 2 points, even though its R-2 score is slightly lower. The M2-400 model showed better results than either the M1 or the M1-400 model in almost all tests on the DUC 2004 Task 1 dataset, because the M2 model has additional information about the rarity of each word, which appears to be helpful when the model is tested on a dataset different from the one it was trained on.

In future work, a coverage mechanism (Tu et al., 2016) can be implemented, which tracks what information has already been included and eliminates repetitive words or phrases. In some of the cases shown (for instance, the example summaries generated by models trained on the CNN/DM dataset), important information was repeated several times. In addition, a more robust evaluation technique should be explored. The ROUGE F1 score is not an ideal evaluation measure: if a synonym of a word is used, which does not change the meaning of a sentence, the ROUGE score is still lowered, because the synonym is not counted towards the N-gram overlap.
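For concreteness, the coverage mechanism referred to above maintains a running sum of past attention distributions and adds a loss term that penalizes re-attending to already covered source positions. A minimal sketch of that formulation (as popularized for summarization by See et al. (2017)), with illustrative variable names, is given below; it is not part of the models evaluated in this work.

```python
import numpy as np

def coverage_step(coverage, attention):
    """One decoding step of the coverage mechanism.

    coverage  : (S,) sum of attention distributions from previous steps
    attention : (S,) attention distribution of the current step
    Returns the updated coverage vector and the step's coverage loss,
    cov_loss = sum_i min(attention_i, coverage_i), which grows when the
    decoder re-attends to source positions it has already covered.
    """
    cov_loss = np.minimum(attention, coverage).sum()
    return coverage + attention, cov_loss

# Toy usage over three decoding steps with a 5-token source.
rng = np.random.default_rng(1)
coverage = np.zeros(5)
for _ in range(3):
    attention = rng.dirichlet(np.ones(5))
    coverage, cov_loss = coverage_step(coverage, attention)
    print(f"coverage loss: {cov_loss:.3f}")
```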
Even though adding frequency information to the encoder's attention improved the ROUGE scores of the models tested on the DUC 2004 Task 1 dataset, in most cases it worsened the ROUGE score, and the generated summary sometimes included more information than necessary. Changing how the frequency scores are passed to the encoder could be tested, as could adding frequency scores to the decoder, which may help the model focus on the important information it has generated so far. This might help to eliminate repetition of important information in the generated text, as the model would be able to "notice" this information through the attention, which might lower the probability of generating particular words again. In addition, the method used to compute the frequency can be changed, so that it is not based only on how frequently a word appears in the input text. The encoder-decoder network could also be trained in parallel to speed up the process (as is done in Chen and Bansal (2018)); however, it is not clear whether the speed-up would be significant for the model described in this research.
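One simple way to expose word-frequency information to the attention mechanism is to add a rarity bonus to the attention logits before the softmax. The sketch below illustrates that general idea only; the additive form, the rarity function and the alpha weight are assumptions made for the example and do not reproduce the exact modification used in our model.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def frequency_aware_attention(scores, token_counts, alpha=1.0):
    """Boost rare source tokens inside an attention distribution.

    scores       : (S,) raw attention logits for one query
    token_counts : (S,) how often each source token occurs in the input text
    alpha        : strength of the rarity boost (assumed hyperparameter)
    Rarer tokens receive a larger additive bonus, so the softmax shifts
    some attention mass towards them.
    """
    rarity = 1.0 / np.log(np.e + token_counts)   # decreases with frequency
    return softmax(scores + alpha * rarity)

# Toy usage: five source tokens, the last two occur only once in the input.
scores = np.array([0.2, 0.1, 0.3, 0.1, 0.1])
counts = np.array([12, 9, 7, 1, 1])
print(frequency_aware_attention(scores, counts))
```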
CRediT authorship contribution statement

Danila Morozovskii: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing. Sheela Ramanna: Conceptualization, Supervision, Validation, Resources, Writing – review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded by the NSERC Discovery Grants Program (no. 194376).

References

Almeida, F., Xexéo, G., 2019. Word embeddings: A survey. arXiv preprint arXiv:1901.09069.
Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bao, H., Dong, L., Wang, W., Yang, N., Wei, F., 2021. S2s-ft: Fine-tuning pretrained transformer encoders for sequence-to-sequence learning. arXiv preprint arXiv:2110.13640. URL https://arxiv.org/pdf/2110.13640.pdf.
Chanb, H.P., Kinga, I., 2021. A condense-then-select strategy for text summarization. Knowl.-Based Syst. 227, 107235, URL https://doi.org/10.1016/j.knosys.2021.107235.
Chen, Y.-C., Bansal, M., 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pp. 675–686, URL https://aclanthology.org/P18-1063.
Clissa, L., 2022. Survey of big data sizes in 2021. arXiv preprint arXiv:2202.07659.
Deaton, J., Jacobs, A., Kenealy, K., See, A., 2019. Transformers and pointer-generator networks for abstractive summarization. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15784595.pdf.
Dehru, V., Tiwari, P.K., Aggarwal, G., Joshi, B., Kartik, P., 2021. Text summarization techniques and applications. IOP Conf. Ser.: Mater. Sci. Eng. 1099 (1), 012042, URL http://dx.doi.org/10.1088/1757-899X/1099/1/012042.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186, URL https://aclanthology.org/N19-1423.
Duchi, J., Hazan, E., Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159.
El-Kassas, W.S., Salama, C.R., Rafea, A.A., Mohamed, H.K., 2021. Automatic text summarization: A comprehensive survey. Expert Syst. Appl. 165, 113679, URL https://www.sciencedirect.com/science/article/pii/S0957417420305030.
Elman, J.L., 1990. Finding structure in time. Cogn. Sci. 14 (2), 179–211.
Ganesan, K., 2018. ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks. URL https://arxiv.org/abs/1803.01937.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680.
Graff, D., Kong, J., Chen, K., Maeda, K., 2003. English gigaword. Linguist. Data Consort. Phila. 4 (1), 34.
Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., Bengio, Y., 2016. Pointing the unknown words. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pp. 140–149, URL https://aclanthology.org/P16-1014.
Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., Blunsom, P., 2015. Teaching machines to read and comprehend. In: NIPS. pp. 1693–1701, http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Jangra, A., Jatowt, A., Saha, S., Hasanuzzaman, M., 2021. A survey on multi-modal summarization. arXiv preprint arXiv:2109.05199.
Kanapala, A., Pal, S., Pamula, R., 2019. Text summarization from legal documents: A survey. Artif. Intell. Rev. 51 (3), 371–402, URL https://doi.org/10.1007/s10462-017-9566-2.


Kim, Y., Denton, C., Hoang, L., Rush, A.M., 2017. Structured attention networks. arXiv preprint arXiv:1702.00887.
Kingma, D.P., Ba, J.L., 2015. Adam: A method for stochastic optimization. In: ICLR (Poster).
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Stat. 22 (1), 79–86, URL https://doi.org/10.1214/aoms/1177729694.
Ledeneva, Y., 2008. Effect of preprocessing on extractive summarization with maximal frequent sequences. In: MICAI 2008: Advances in Artificial Intelligence. Springer Berlin Heidelberg, pp. 123–132.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L., 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 7871–7880, URL https://aclanthology.org/2020.acl-main.703.
Lin, C.-Y., 2004. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, pp. 74–81, URL https://aclanthology.org/W04-1013.
Liu, F., Liu, Y., 2010. Exploring correlation between ROUGE and human evaluation on meeting summaries. IEEE Trans. Audio Speech Lang. Process. 18 (1), 187–196.
Medress, M., Cooper, F., Forgie, J., Green, C., Klatt, D., O'Malley, M., Neuburg, E., Newell, A., Reddy, D., Ritea, B., Shoup-Hummel, J., Walker, D., Woods, W., 1977. Speech understanding systems: Report of a steering committee. Artificial Intelligence 9 (3), 307–316, URL https://www.sciencedirect.com/science/article/pii/0004370277900261.
Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., Xiang, B., 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, Berlin, Germany, pp. 280–290, URL https://aclanthology.org/K16-1028.
Narayan, S., Cohen, S.B., Lapata, M., 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pp. 1797–1807, URL https://aclanthology.org/D18-1206.
Over, P., Dang, H., Harman, D., 2007. DUC in context. Inf. Process. Manage. 43 (6), 1506–1520, URL https://www.sciencedirect.com/science/article/pii/S0306457307000404.
Panthaplackel, S., Benton, A., Dredze, M., 2022. Updated headline generation: Creating updated summaries for evolving news stories. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pp. 6438–6461.
Qi, W., Yan, Y., Gong, Y., Liu, D., Duan, N., Chen, J., Zhang, R., Zhou, M., 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, pp. 2401–2410, URL https://aclanthology.org/2020.findings-emnlp.217.
Rush, A.M., Chopra, S., Weston, J., 2015. A neural attention model for abstractive sentence summarization. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pp. 379–389, URL https://aclanthology.org/D15-1044.
Schick, T., Schütze, H., 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. Proc. AAAI Conf. Artif. Intell. 34 (05), 8766–8774.
See, A., Liu, P.J., Manning, C.D., 2017. Get to the point: Summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, pp. 1073–1083, URL https://aclanthology.org/P17-1099.
Song, S., Huang, H., Ruan, T., 2019. Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools Appl. 78 (1), 857–875, URL https://doi.org/10.1007/s11042-018-5749-3.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (56), 1929–1958, URL http://jmlr.org/papers/v15/srivastava14a.html.
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS '14, MIT Press, Cambridge, MA, USA, pp. 3104–3112.
Syed, A.A., Gaol, F.L., Matsuo, T., 2021. A survey of the state-of-the-art models in neural abstractive text summarization. IEEE Access 9, 13248–13265.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR, pp. 2818–2826.
Tu, Z., Lu, Z., Liu, Y., Liu, X., Li, H., 2016. Modeling coverage for neural machine translation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Berlin, Germany, pp. 76–85.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS '17, Curran Associates Inc., Red Hook, NY, USA, pp. 6000–6010.
Verma, S., Nidhi, V., 2018. Extractive summarization using deep learning. Res. Comput. Sci. 147, 107–117, URL https://arxiv.org/abs/1708.04439.
Widyassari, A.P., Rustad, S., Shidik, G.F., Noersasongko, E., Syukur, A., Affandy, A., Setiadi, D.R.I.M., 2022. Review of automatic text summarization techniques & methods. J. King Saud Univ. - Comput. Inf. Sci. 34 (4), 1029–1046, URL https://www.sciencedirect.com/science/article/pii/S1319157820303712.
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., et al., 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. URL https://arxiv.org/pdf/1609.08144.pdf.
Xiao, W., Beltagy, I., Carenini, G., Cohan, A., 2022. PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, pp. 5245–5263, URL https://aclanthology.org/2022.acl-long.360.
Yang, A., Liu, K., Liu, J., Lyu, Y., Li, S., 2018. Adaptations of ROUGE and BLEU to better evaluate machine reading comprehension task. URL https://arxiv.org/abs/1806.03578.
Yao, K., Zhang, L., Du, D., Luo, T., Tao, L., Wu, Y., 2020. Dual encoding for abstractive text summarization. IEEE Trans. Cybern. 50 (3), 985–996.
Yu, Z., Wu, Z., Zheng, H., XuanYuan, Z., Fong, J., Su, W., 2021. LenAtten: An effective length controlling unit for text summarization. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, pp. 363–370.
Zhang, Y., Meng, J.E., Pratama, M., 2016. Extractive document summarization based on convolutional neural networks. In: IECON 2016 - 42nd Annual Conference of the IEEE Industrial Electronics Society. pp. 918–922.
Zhang, J., Zhao, Y., Saleh, M., Liu, P.J., 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In: Proceedings of the 37th International Conference on Machine Learning. ICML '20, JMLR.org, pp. 11328–11339.
Zhang, M., Zhou, G., Yu, W., Huang, N., Liu, W., 2022. A comprehensive survey of abstractive text summarization based on deep learning. Comput. Intell. Neurosci. 7132226.
