Norrköping 2021-06-04
Copyright
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/
Abstract

Extractive text summarization has over the years been an important research area in Natural Language Processing. Numerous methods have been proposed for extracting information from text documents. Recent work has shown great success on English summarization tasks by fine-tuning the language model BERT using large summarization datasets. However, less research has been done for low-resource languages. This work contributes by investigating how BERT can be used for Norwegian text summarization. Two models are developed by applying a modified BERT architecture, called BERTSum, to pre-trained Norwegian and Multilingual BERT. The resulting models are able to predict key sentences from articles to generate bullet-point summaries. These models are evaluated with the automatic metric ROUGE, and in this evaluation the Multilingual BERT model outperforms the Norwegian model. The multilingual model is further evaluated in a human evaluation by journalists, revealing that the generated summaries are not entirely satisfactory in some aspects. With some improvements, the model shows promise as a valuable tool for journalists to edit and rewrite generated summaries, saving time and workload.
Acknowledgments
Contents

Abstract
Acknowledgments
Contents
List of Tables

1 Introduction
1.1 Background
1.2 Motivation
1.3 Aim
1.4 Research question
1.5 Delimitations

2 Theory
2.1 Natural Language Processing
2.1.1 Text Processing
2.1.2 Statistical Methods
2.1.3 Artificial Neural Networks
2.2 Sequential models
2.2.1 RNN
2.2.2 Encoder-Decoder
2.2.3 Attention
2.2.4 Transformers
2.3 BERT
2.3.1 Input and Output Embeddings
2.3.2 Pre-training
2.3.3 Fine-tuning
2.3.4 Pretrained BERT models
2.4 Extractive Text Summarization Methods
2.4.1 TF-IDF
2.4.2 TextRank
2.4.3 BERTSum
2.5 Evaluation metrics for summarization
2.5.1 Precision, Recall and F-Score
2.5.2 ROUGE
2.5.3 Qualitative Evaluation

3 Method
3.1 Datasets
3.1.1 CNN/DailyMail
3.1.2 Aftenposten/Oppsummert
3.2 Implementation
3.2.1 Restructure of the AP/Oppsummert dataset
3.2.2 Truncation of articles
3.2.3 Oracle Summary Generation
3.2.4 Models
3.2.5 Hyper-Parameters
3.2.6 Fine-tuning
3.2.7 Prediction
3.2.8 Hardware
3.3 Evaluation
3.3.1 ROUGE Evaluation
3.3.2 Human Evaluation with Journalists

4 Results
4.1 Implementation
4.2 Evaluation
4.2.1 ROUGE Evaluation
4.2.2 Human Evaluation with Journalists

5 Discussion
5.1 Results
5.1.1 ROUGE Scores
5.1.2 Sentence Selection
5.1.3 Human Evaluation with Journalists
5.2 Method
5.2.1 Datasets
5.2.2 Implementation
5.2.3 Metrics
5.3 The work in a wider context
5.3.1 Ethical Aspects

6 Conclusion
6.1 Conclusion
6.2 Future Work

Bibliography

A Appendix
A.1 All responses from Human Evaluation
A.1.1 Article 1
A.1.2 Article 2
A.1.3 Article 3
A.1.4 Article 4

List of Tables

3.1 Average token and sentence count for news articles and summaries in the CNN/DailyMail dataset
3.2 Dataset split
3.3 Article data type in the AP/Oppsummert dataset
3.4 Summary data type in the AP/Oppsummert dataset
3.5 Average token and sentence count for news articles and summaries for AP/Oppsummert
1 Introduction

Over recent years, the amount of data available to both users and companies has kept increasing massively. In response, creating summaries of data has become a popular topic in data science. Text summarization is part of this trend, focusing on representing content in a shorter format. News and media, with their large amounts of text data, are an example of a field where automatic text summarization could be beneficial.
1.1 Background

Aftenposten is Norway's largest daily newspaper, based in Oslo. It is a private company owned by Schibsted and has an estimated 1.2 million readers. To save readers time, Aftenposten developed a daily brief called Oppsummert, which features the most important stories of the day in a summarized format. The idea is to help readers stay updated on the most important daily news in a time-efficient way, while offering a consistent and standardized reading experience.
There are two main text summarization strategies: extractive and abstractive.
1.2 Motivation

In the massive flood of news and media seen today, it can be challenging for newsreaders to filter out the most important daily news. Furthermore, due to the pace of our daily lives, readers often want to be as time-efficient as possible. The motivation for news summaries is therefore to help readers stay updated on the most important daily news in a time-efficient way. However, writing these summaries manually increases the workload for the journalists. This is where machine learning shows potential for generating these summaries automatically. By implementing a model that can extract key sentences from an article, we can reduce the workload for journalists and at the same time deliver summaries with the most important content to the newsreaders.
1.3 Aim

The thesis project aims to develop a model that can extract the most relevant sentences from an article written in Norwegian, on which journalists can base their summaries. This will be done by investigating possible approaches for extractive text summarization using BERT with a limited labeled Norwegian dataset, and by evaluating the results.
1.4 Research question

This thesis investigates the following questions:

• How news summaries can be used to generate the labeled data that is required for a supervised learning model.

• How BERT can be used for extractive text summarization on a low-resource language.
1.5 Delimitations
The study focuses on BERT-based models for extractive text summarization.
However, it will also explore traditional methods for comparison purposes.
The articles and summaries from Aftenposten are in Bokmål, one of Norway’s
official writing standards. Therefore, we narrow the scope of the language to
only Bokmål.
2 Theory
This chapter presents different methods for extractive text summarization and how these models can be evaluated.

2.1 Natural Language Processing
Stop Words: Stop words are usually words with little semantic content, and are therefore considered not to provide any relevant information for the task. The English language contains several hundred stop words, such as "the" or "and", which carry no meaning by themselves and are therefore often removed from documents [39].
POS tagging: Part-of-Speech (POS) tagging is the method of identifying the part of speech of each word (noun, adverb, adjective, etc.). Some words in a language can be used as more than one part of speech. For example, consider the word "book", which can be used both as a noun, as in "The firefighter read a book", and as a verb, as in "Did the firefighter book a table?". This example shows why POS tagging becomes important when processing sentiment in a text.
• Input: x = (x_1, x_2, ..., x_d)

• Output: y

The perceptron makes its predictions based on the prediction function presented in Eq. 2.1. An illustration of this equation is also shown in Fig. 2.1, where f is an activation function that can differ between types of neurons, w is the weight vector representing the strength of the nodes, T denotes the transpose, and b is the bias.

y = f(w^T x + b)    (2.1)
Training Perceptrons

Perceptrons are trained with a process known as gradient descent. The goal of gradient descent is to find optimal parameters w that minimize the loss function. A training set is used during training, which contains input values X and their corresponding correct values Y. The model predicts a value ŷ from the input x, and this prediction is then compared with the actual value y. The difference between predicted values and actual values is the error E of the model. E can be calculated in different ways depending on the output type of the model; a model with a binary output typically uses the binary cross-entropy loss. Since the goal is to minimize the error of the loss function, we are looking for where the gradient of the loss function is zero. The method is called gradient descent because the goal is to go down the gradient until it no longer has a slope, i.e., until the error becomes as small as it can possibly get.
The model parameters are updated via the equation presented in Eq. 2.2. Here, we calculate the new values of the parameters as the old parameters minus a step in the direction of the derivative. ε is called the learning rate, and it determines how big the step should be. The step size is important: if it is too large, we risk stepping over the optimal point, and if it is too small, the descent takes too much computational time.

w_i(t + 1) = w_i(t) − ε ∂E/∂w_i    (2.2)
Training of perceptrons happens in epochs. An epoch is defined as a full cycle through the training data. In the standard gradient descent method, we accumulate the loss for every example in the training set before updating the weights. The problem with this is that if the number of training samples is large, it may be time-consuming, because we have to run through the whole training set for every parameter update. Instead, Stochastic Gradient Descent (SGD) is often applied for faster convergence. There are two main variants of SGD: updating the weights after every single sample, or updating them after a small minibatch of samples.

Updating the weights for each sample is fast, but it makes the model very sensitive to noise in the training set. Minibatch SGD is both fast and robust to noise, which is why it is often preferred in training. A minimal sketch follows below.
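The following numpy code sketches the training loop described above: a perceptron-style binary classifier trained with minibatch SGD. The data, learning rate and batch size are illustrative choices, not values from the thesis.

```python
# A minimal sketch of minibatch SGD for a perceptron with a sigmoid activation
# and binary cross-entropy loss. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                                       # inputs
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 > 0).astype(float)    # labels

w, b = np.zeros(4), 0.0
lr, batch_size, epochs = 0.1, 32, 3      # one epoch = one full pass over the data

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):
    perm = rng.permutation(len(X))       # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        y_hat = sigmoid(xb @ w + b)      # prediction, Eq. 2.1 with f = sigmoid
        grad = y_hat - yb                # gradient of binary cross-entropy w.r.t. logits
        w -= lr * xb.T @ grad / len(idx) # parameter update, Eq. 2.2
        b -= lr * grad.mean()
```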
2.2 Sequential models
2.2.1 RNN

A Recurrent Neural Network (RNN) is a family of neural networks for processing sequential data. RNNs can process long sequences and sequences of variable length, meaning that the input sequence does not have to be the same length as the output sequence.
Figure 2.3 shows an RNN with loops. The model A takes an input sequence x_t and outputs a value h_t. The model also passes its past state to the next step. The same RNN can be visualized as an unrolled network instead, as shown in figure 2.4.

RNNs can have different structures and combinations. For example, an RNN can have multiple layers, so that the output from one layer is used as input to another layer. Such layering is often called a deep RNN. Goldberg [13] observed empirically that deep RNNs work better than shallower RNNs on some tasks. However, it is not theoretically clear why they perform better.
Simple RNN

The most conventional RNN is called the simple RNN (S-RNN), proposed by Elman [10]. Mikolov [27] later explored the S-RNN for use in NLP [13]. It builds a strong foundation for tasks such as sequence tagging and language modelling. However, the S-RNN suffers from a problem where the gradients that carry information for a parameter update increase or decrease rapidly over time. This is known as the exploding or vanishing gradients problem: the gradients become so large or so small that the parameter updates carry no significant changes. In other words, the problem causes the model not to learn. In later work, Hochreiter [15] proposed an architecture known as Long Short-Term Memory which managed to overcome the exploding and vanishing gradient problem.
LSTM

Long Short-Term Memory networks (LSTMs) are a special kind of RNN capable of learning long-term dependencies [30]. The main difference between an RNN and an LSTM is that an LSTM is made up of a memory cell, input and output gates, and a forget gate [24]. The memory cell is responsible for remembering dependencies in the input sequence, while the gates control how much of the previous states should be memorized.
2.2.2 Encoder-Decoder

The encoder-decoder architecture is a standard method for sequence-to-sequence (seq2seq) NLP tasks such as translation. For RNNs (section 2.2.1), an encoder-decoder structure was first proposed by Cho et al. (2014) [4]. The encoder takes a sequence as input and produces an encoder vector that is used as input to the decoder. The decoder then predicts an output at each step with respect to the previous states (auto-regression) until some END token is generated. Figure 2.5 shows an RNN encoder-decoder architecture for seq2seq tasks, where h_i is the hidden state, x_i is the input sequence and y_j is the output sequence.
2.2.3 Attention

An apparent disadvantage of the conventional encoder-decoder models described in section 2.2.2 is that the input sequence is compressed into a fixed-length vector. This limits the model's ability to learn from later parts of a sequence, which are effectively truncated. Additionally, early parts of long sequences are often forgotten once the model has processed the entire sequence [4].
The attention mechanism, introduced by Bahdanau et al., allows the model to focus on relevant parts of an input sequence. The process is done in two steps. First, instead of only passing the last encoder hidden state (the context vector) to the decoder, the encoder passes all of its hidden states to the decoder. Second, the decoder gives each encoder hidden state a score, where each of these states is associated with a certain word from the input sequence. This way, the model does not rely on a single context vector but rather learns which parts of a sequence to pay attention to. Bahdanau et al. provide an example, shown in figure 2.6. It illustrates attention when translating the English input sequence [", This, will, change, my, future, with, my, family, ., ", the, man, said] to the French target sequence [", Cela, va, changer, mon, avenir, avec, ma, famille, ", a, dit, l', homme, .]. The figure shows that the alignment of the words is largely monotonic, hence the high attention scores along the diagonal. However, some words are non-monotonic. For example, the English word "man" is "l'homme" in French, and in the example we find high attention scores both for "l'" and "homme".
2.2.4 Transformers

The transformer is an attention-based architecture consisting of two main components, an encoder and a decoder. The model was introduced by Vaswani et al. [43] to solve existing problems with the recurrent models presented in section 2.2.1, which preclude parallelization, resulting in longer training times and a drop in performance for longer dependencies. Due to the attention-based, non-sequential nature of transformers, they can be highly parallelized and reach a constant sequential and path complexity, O(1). Transformers were originally used to solve translation problems, where the aim is to find relationships between the words of a sequence.
Encoder: The encoder consists of multiple encoder layers, where each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism. Looking at the same example as in section 2.2.3, self-attention means that the target sequence is the same as the input sequence. In other words, self-attention is simply a form of attention mechanism that relates different positions of the same input sequence. The term "multi-head" means that instead of computing the attention once, the model uses scaled dot-product attention in multiple attention computations in parallel [43]. The second sub-layer is a simple position-wise, fully connected feed-forward network. All sub-layers have a residual connection and layer normalization, whose purpose is to add the output of each sub-layer to its input. The left block of figure 2.7 shows the encoder of a transformer. A sketch of scaled dot-product attention follows below.
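The following numpy code is a minimal sketch of scaled dot-product attention: the formula softmax(QKᵀ/√d_k)V follows Vaswani et al. [43], while the shapes and data are illustrative.

```python
# A minimal sketch of the scaled dot-product attention used in the
# transformer's self-attention sub-layer.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # one score per query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of the values

# Self-attention: queries, keys and values all come from the same sequence.
seq = np.random.randn(5, 8)                            # 5 tokens, dimension 8
out = scaled_dot_product_attention(seq, seq, seq)
print(out.shape)                                       # (5, 8)
```

In multi-head attention, this computation is simply run several times in parallel on different learned projections of the same sequence, and the results are concatenated.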
For instance, in translation tasks, the encoder is fed with words of a specific language, processing each word simultaneously. It then generates embeddings for each word, which are vectors that describe the meaning of the words in the form of numbers; similar words have similar values in their respective vectors. The decoder can then predict the translation of a word in a sentence by combining the embeddings from the encoder and the previously generated translated sentence. The decoder predicts one word at a time until the end of the sentence is reached.
Transformers as used here have a token limit of 512. The reason is that the memory and computation requirements of a transformer grow quadratically with sequence length, making it impractical to process long sequences [43]. This means that such transformers can only process inputs of up to 512 tokens. Later, new solutions were introduced, such as Transformer-XL, which uses a recurrence mechanism to handle text sequences longer than 512 tokens [6]. However, in most cases it is sufficient to truncate sequences longer than 512 tokens to make them fit.
2.3 BERT
BERT is a transformer-based model introduced by Devlin et al. [7]. The authors argue that previous language representation models, such as RNNs, were limited in how they encode tokens by only considering the tokens in one direction. Unlike RNNs, the authors utilize transformers, described in section 2.2.4, to design a Bidirectional Encoder Representation from Transformers (BERT), which is able to encode a token using tokens from both directions.

Figure 2.8: The input embeddings and embedding layers for BERT, illustrated by Devlin et al. [7]
WordPiece

BERT adopts the WordPiece tokenization proposed by Wu et al. [44]. The aim of WordPiece tokenization is to improve the handling of rare words. The solution is to divide words into sub-words (WordPieces) using a fixed vocabulary.

Figure 2.9: Two words broken down into sub-words using WordPiece tokenization
Finally, a "[CLS]" token is added to the start and "[SEP]" token is added to the
end of a tokenized sentence. The objective of adding the extra tokens is to
distinguish a pair of sentences which will help create segment embeddings
2.3.1. Since BERT uses default transformer encoders (2.2.4), BERT is limited to
process input sequences up to 512 tokens.
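A minimal sketch of this tokenization uses the Hugging Face transformers library, which is also used in our implementation (section 3.2.5). The checkpoint name is an assumption for illustration, and the exact sub-word splits depend on the vocabulary of the chosen model.

```python
# A minimal sketch of WordPiece tokenization with Hugging Face transformers.
# The checkpoint name is illustrative; any BERT checkpoint works the same way.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Rare words are split into sub-words marked with "##",
# e.g. something like ['sum', '##mar', '##ization'] depending on the vocabulary.
print(tokenizer.tokenize("summarization"))

# Encoding adds the special [CLS] and [SEP] tokens around the sentence.
ids = tokenizer.encode("Cat is stuck")
print(tokenizer.convert_ids_to_tokens(ids))   # starts with '[CLS]', ends with '[SEP]'
```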
Token Embeddings

The first step is to create vector representations of the tokenized input in the token embeddings layer. Each token is represented by a vector with a hidden size of 768 (1×768). For N input tokens, the token embeddings form a matrix of shape N×768 or, as a tensor, 1×N×768.
Segment Embeddings

BERT can handle a pair of input sentences, as shown in figure 2.10. The inputs are tokenized and concatenated to create a pair of tokenized sentences. Thanks to the [SEP] token, BERT can distinguish the two sentences and label the sequence in binary.

The label sequence is then expanded into the same matrix shape as the token embeddings, N×768, where N is the number of tokens. For example, for the paired input in figure 2.10, the segment embeddings would result in a matrix of shape 8×768.
Position Embedding

BERT is a transformer-based model and therefore does not process tokens sequentially. Thus, to keep BERT from losing the order of the tokens, position embeddings are required. The position embeddings layer can be used as a look-up table, as illustrated in figure 2.11, where the index of a row represents a token position. For example, the two sentences "Cat is stuck" and "Call the firefighter" have identical position embeddings for the word pairs "Cat"–"Call", "is"–"the" and "stuck"–"firefighter".
2.3.2 Pre-training
The pre-training phase is done by training on two unsupervised tasks simulta-
neously, which are Masked Language Model (MLM) and Next Sentence Prediction
(NSP) [16].
2.3.3 Fine-tuning

Fine-tuning allows the pre-trained BERT model to be used for specific NLP tasks through supervised learning. It works by replacing the output layers of the pre-trained BERT model with a new set of output layers that can output an answer to the NLP problem at hand. The new model performs supervised learning with labeled data to update the model weights. Since the model starts from pre-trained weights and typically needs only a few epochs, fine-tuning is relatively fast [7].
Norwegian BERT: At the time of writing, the state-of-the-art monolingual BERT model supporting the Norwegian language (both Bokmål and Nynorsk) is made by the National Library of Norway 1. It is based on the same structure as multilingual BERT (section 2.3.4) and trained on a wide variety of Norwegian text, in both Bokmål and Nynorsk, from the last 200 years.
2.4 Extractive Text Summarization Methods

2.4.1 TF-IDF

TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic that reflects how important a word is to a document within a corpus [39].

Term weighting based on term frequency was first introduced by Luhn [25]. Luhn stated that the importance of a term is proportional to its frequency; in mathematical terms, the weight of a term t in a document d is proportional to its frequency f_{t,d}.
Here, terms are weighted based on the inverse fraction of the documents containing a term. The fraction is calculated by dividing the total number of documents n by the number of documents n_t containing the term t.

The combination of TF and IDF favors more unique terms and damps common terms that occur in several documents. The combined equation is:

tf-idf(t, d) = f_{t,d} × log(n / n_t)    (2.5)
For sentence weighting, the same principle can be used. Document d in Eq. 2.5 is reformulated as a sentence s, and term t is represented by a word w. In this case, n is the total number of sentences, and n_w is the number of sentences containing the word w. The final equation for sentence weighting, with a sketch below, is:

tf-idf(w, s) = f_{w,s} × log(n / n_w)    (2.6)
2.4.2 TextRank

TextRank is a graph-based ranking algorithm proposed by Mihalcea and Tarau [26]. The ranking is done by deciding the importance of a vertex within a graph, based on global information drawn recursively from the entire graph. This is done by linking one vertex to another. The importance of a vertex is measured by the number of links to other vertices as well as the scores of the vertices casting the votes.
The score of a vertex V_i is computed recursively as:

S(V_i) = (1 − d) + d × Σ_{j ∈ In(V_i)} S(V_j) / |Out(V_j)|,   where 0 < d < 1    (2.7)

For sentence extraction, the similarity between two sentences S_i and S_j is measured by their word overlap, normalized by sentence length:

Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)    (2.8)
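A minimal sketch of sentence-level TextRank under these equations follows, using networkx's PageRank implementation for the recursion in Eq. 2.7 (damping factor d = 0.85); the sentences and naive tokenization are illustrative.

```python
# A minimal sketch of TextRank for sentence extraction: build a graph with
# the similarity of Eq. 2.8 as edge weights and rank vertices with PageRank.
import math
import networkx as nx

sentences = [
    "The cat is stuck in a tree.",
    "A firefighter rescued the cat from the tree.",
    "The firefighter was praised by the neighbours.",
]

def similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    denom = math.log(len(w1)) + math.log(len(w2))
    return overlap / denom if denom > 0 else 0.0

graph = nx.Graph()
for i, si in enumerate(sentences):
    for j, sj in enumerate(sentences):
        if i < j:
            graph.add_edge(i, j, weight=similarity(si, sj))

# PageRank implements the recursion of Eq. 2.7 with damping d = 0.85.
scores = nx.pagerank(graph, alpha=0.85, weight="weight")
best = max(scores, key=scores.get)
print(sentences[best])
```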
2.4.3 BERTSum

BERT cannot be directly used for extractive summarization; Liu (2019) [23] points out two problems. Firstly, BERT is trained using a masked language model (section 2.3.2), so the output vectors correspond to tokens rather than sentences. Secondly, although BERT has segment embeddings for indicating different sentences, it can only differentiate a pair of sentences, because BERT is also trained on next sentence prediction (section 2.3.2).

Liu [23] proposes a method for handling multiple sentences with BERT by inserting a [CLS] token before each sentence and a [SEP] token after each sentence. To distinguish multiple sentences rather than just two, interval segment embeddings are used: each token in a sentence is assigned E_A or E_B depending on whether the position of the sentence is odd or even. As seen in figure 2.12, the outputs of the BERT layer, shown as T_i, are the vectors of the corresponding [CLS] tokens from the top BERT layer. Each T_i is treated as the representation of sentence i.
In Liu's experiments, the summarization layer with two inter-sentence transformer layers showed the best performance. The loss of the model is the binary cross-entropy of a prediction against its gold labels [23].

Like BERT, the input sequence for BERTSum is limited to 512 tokens. A sketch of the BERTSum input construction follows below.
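The following is a minimal sketch of this input construction. It uses plain whitespace tokens instead of WordPieces for readability; real BERTSum input would be produced by the BERT tokenizer.

```python
# A minimal sketch of how BERTSum prepares multi-sentence input: a [CLS]
# token before every sentence, a [SEP] token after it, and interval segment
# ids alternating between 0 (E_A) and 1 (E_B) for odd/even sentence positions.
def bertsum_input(sentences):
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        seg = i % 2                          # interval segment embedding: E_A / E_B
        cls_positions.append(len(tokens))    # index of this sentence's [CLS] token
        sent_tokens = ["[CLS]"] + sent.split() + ["[SEP]"]
        tokens.extend(sent_tokens)
        segment_ids.extend([seg] * len(sent_tokens))
    return tokens, segment_ids, cls_positions

tokens, segments, cls_pos = bertsum_input(["the cat is stuck", "call the firefighter"])
print(tokens)     # ['[CLS]', 'the', 'cat', ..., '[SEP]', '[CLS]', 'call', ...]
print(segments)   # 0s for the first sentence, 1s for the second
print(cls_pos)    # positions of the [CLS] tokens used as sentence representations
```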
2.5 Evaluation metrics for summarization
2.5.2 ROUGE

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics presented by Lin [21] in 2004 for the automatic evaluation of summaries. The metrics compare machine-generated summaries against one or multiple reference summaries created by humans. The ROUGE-1, ROUGE-2, and ROUGE-L metrics are commonly used for benchmarking document summarization models, such as on the leaderboard for document summarization on the CNN/Daily Mail dataset [5]. For each metric, the recall, precision, and F1 score are generally computed. With ROUGE, the true positives are the units (words or n-grams) that occur in both the candidate summary and the reference summaries.
ROUGE Example

To clarify the ROUGE scores, let us investigate the following example, where sentence one is a reference sentence and sentences two and three are candidates.
Considering ROUGE-1, we can see that sentence three gives the best match, with a recall score of 4/5 = 0.8 and a precision score of 4/4 = 1. In the case of ROUGE-L, sentence two is preferred, with a recall score of 3/5 = 0.6 and a precision score of 3/4 = 0.75. For ROUGE-2, both candidate summaries give a recall score of 2/4 = 0.5 and a precision score of 2/5 = 0.4. The point of this example is that focusing on only one type of ROUGE score does not always provide good insight. In our example, it can intuitively be agreed that sentence two best fits the reference sentence, since sentence three completely changes the context; yet according to ROUGE-1, sentence three is preferred. This shows why combining the three metrics is often a good idea, and why it is important to complement the results with a qualitative evaluation. A sketch of the ROUGE-N computation follows below.
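The following sketch shows how ROUGE-N recall and precision can be computed from overlapping n-grams; a real evaluation would use a library such as py-rouge (section 3.3.1), which also handles stemming and ROUGE-L.

```python
# A minimal sketch of ROUGE-N recall and precision: count the (clipped)
# n-grams shared between a candidate and a reference sentence.
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

reference = "the cat sat on the mat"
candidate = "the cat sat on mat"
print(rouge_n(candidate, reference, 1))   # unigram overlap (ROUGE-1)
print(rouge_n(candidate, reference, 2))   # bigram overlap (ROUGE-2)
```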
ROUGE Limitations

Regardless of ROUGE's popularity in papers on automatic text summarization, it has some limitations that must be addressed:

1. ROUGE only considers content selection and not other aspects such as fluency and coherence.

2.5.3 Qualitative Evaluation

Qualitative evaluation can collect data through, for example:

• Open-ended interviews.

• Direct observations.

• Written documents.
The purpose of these methods is to gather information and insights that are useful for decision-making. Qualitative methods should therefore be appropriate for the task, which means that it is essential to choose qualitative strategies, data collection options, and analysis approaches based on the evaluation's purpose. An example of a method that combines quantitative measurements and qualitative data is a questionnaire, or interview, that asks both fixed-choice and open-ended questions.
3 Method
3.1 Datasets
The following section presents the features and properties of the two datasets
used in this work.
3.1.1 CNN/DailyMail

The CNN/DailyMail dataset 1 was initially developed for machine reading and comprehension and abstractive question answering by Hermann et al. [14]. The dataset contains over 300k unique news articles written in English by journalists at CNN and the Daily Mail. Their script for obtaining the data was later modified by Nallapati et al. [29] to support training models for abstractive and extractive text summarization using multi-sentence summaries. Both of these datasets are anonymized versions, where the data has been pre-processed to replace named entities with unique identifier labels. A third version of the dataset also exists, which operates directly on the original (non-anonymized) text, see [41].
1 https://cs.nyu.edu/~kcho/DMQA/
Furthermore, the dataset is split into a train, validation and test set according
to Table 3.2.
Table 3.2: Dataset split
Dataset Split Number of Instances
Train 287,113
Validation 13,368
Test 11,490
3.1.2 Aftenposten/Oppsummert

The Norwegian articles and summaries provided by Aftenposten (AP) and Oppsummert make up two datasets: one with 162k articles and one with 979 summaries. The columns of the article dataset are presented in Table 3.3. The summary dataset contains an array of article IDs, which are the articles that each summary is based on. Table 3.4 presents each column in the summary dataset.
To get an idea of how many articles from the article dataset were used to create the summaries, we plot this relation in Figure 3.1. As for the CNN/DailyMail dataset, we were interested in examining the average number of sentences in the AP/Oppsummert summaries; this is presented in Figure 3.2. Table 3.5 also displays the average token count and the average number of sentences in the article and summary datasets.
2 https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail
Table 3.3: Article data type in the AP/Oppsummert dataset

Field Description
ARTICLE_ID The article's ID
ARTICLE_TITLE The title of the article
ARTICLE_TEXT Raw article text data
ARTICLE_NEWSROOM The newsroom is Aftenposten
LAST_UPDATE The date when the article was last updated

Table 3.4: Summary data type in the AP/Oppsummert dataset

Field Description
ARTICLE_ID The summary's ID
ARTICLE_TITLE The title of the summary
ARTICLE_TEXT Raw summary text data
ARTICLE_NEWSROOM The newsroom is always Aftenposten
LAST_UPDATE The date when the summary was last updated
SUMMARIZED_ARTICLES An array of connected article IDs
Table 3.5: Average token and sentence count for news articles and summaries
for AP/Oppsummert
3.2 Implementation

In this section, we describe the implementation of an automatic extractive text summarization model and the different problems we had to overcome. Firstly, in sections 3.2.1, 3.2.2 and 3.2.3, the dataset is restructured, truncated and labeled. Secondly, in section 3.2.4, the different model implementations are described. Lastly, in sections 3.2.6, 3.2.7 and 3.2.8, the fine-tuning, the prediction and the hardware used for implementing a BERT-based model are described.
From the graphs presented in Figure 3.3 we can draw two important conclusions about the AP/Oppsummert dataset:

1. The summaries based on only one article are predominantly extractively written (since they have high ROUGE-2 and ROUGE-L scores).

2. The summaries based on several articles regularly use sentences from only one main article (since the scores for the second-best article are far worse than the scores for the top-scoring article).

With these two conclusions, the dataset was restructured so that summaries with multiple IDs of related articles were connected only to the highest-scoring article in that set. The field SUMMARIZED_ARTICLES was therefore updated from an array of IDs to a single article ID.
Figure 3.3: ROUGE-2 and ROUGE-L recall scores for summaries with one article in (a) and (b), summaries with more articles and the top-scoring articles in (c) and (d), and summaries with more articles and the second-best articles in (e) and (f).
3.2. Implementation
for maximizing the ROUGE-2 score; however, for many combinations, the al-
gorithm can be time-consuming.
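The following sketch approximates the combination search greedily, repeatedly adding the sentence that most improves the ROUGE-2 recall. The scoring helper rouge_2_recall is assumed to exist (cf. the ROUGE sketch in section 2.5.2).

```python
# A greedy sketch of oracle label generation: keep adding the sentence that
# most improves the ROUGE-2 recall of the selection against the reference
# summary. rouge_2_recall(candidate_text, reference_text) is an assumed helper.
def greedy_oracle(article_sentences, reference, max_sentences, rouge_2_recall):
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(article_sentences)):
            if i in selected:
                continue
            candidate = " ".join(article_sentences[j] for j in selected + [i])
            score = rouge_2_recall(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:        # no remaining sentence improves the score
            break
        selected.append(best_idx)
    # The returned indices receive label 1; all other sentences receive label 0.
    return sorted(selected)
```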
3.2.4 Models

Six types of models were implemented for the task of extractive text summarization. With each model, we predicted three, seven, and ten sentences for the summaries. Of the six models, TextRank and TF-IDF are only used as comparisons to the BERTSum models.
Oracle

Oracle was used not only as the label generation method described in section 3.2.3, but also as an upper limit for our BERT models. Since oracle summaries are used as labels for our BERT models, the models cannot score higher than the oracle summaries. The oracle summaries can therefore be seen as a ceiling for our BERT models.
LEAD

We used LEAD as a baseline, which simply selects the first sentences of an article. From the analysis presented in section 3.2.2, with the positions of the highest ROUGE-scoring sentences shown in Fig 3.4, we considered LEAD a good baseline.

With Oracle and LEAD, we had a range of ROUGE scores within which we wanted our models to perform: above the LEAD score and at or below the Oracle score.
TextRank

We adopted a Python implementation 3 of TextRank based on the approach followed in [26]. This implementation performs both keyword extraction and text summarization. We used the Natural Language Toolkit (NLTK) 4 to download the necessary files for stop words, tokenization, and stemming.
TF-IDF

An implementation 5 of TF-IDF was adopted for extractive text summarization. The source code was updated to support Norwegian using Spacy 6, an NLP toolkit similar to NLTK. Key sentences could then be extracted by ranking the scores of each sentence in descending order.
BERTSum

We used BERTSum, described in section 2.4.3, to fine-tune two pre-trained BERT models for the task of extractive text summarization. The original BERTSum code uses an older version of PyTorch and is therefore not directly suitable for integrating new models. Thus, an updated version of BERTSum by Microsoft 7 was used. PyTorch, introduced by Klein et al. [18], is a toolkit for deep learning in Python. There are other popular deep learning libraries, such as TensorFlow (Abadi et al. [1]). However, PyTorch was chosen for our task since both the original and the updated BERTSum code are built on PyTorch, and PyTorch is also well suited for this kind of development.
3 https://github.com/acatovic/textrank
4 https://www.nltk.org/
5 https://github.com/luisfredgs/extractive-text-summarization
6 https://spacy.io/
7 https://github.com/microsoft/nlp-recipes
The updated BERTSum code from Microsoft includes other natural language processing tools and is still maintained today; however, only the relevant functions were included in our source code. The updated code also provides a more readable code structure, better optimization techniques, and a more scalable solution for adding pre-trained BERT models than the original BERTSum code by its author, Yang Liu.
Data loader: Data comes in many different formats and shapes. The provided code from Microsoft supports both text files and lists of strings as data input. For loading and processing the CNN/DM dataset, a dataset loader is already included by Microsoft. For the AP/Oppsummert dataset, however, a script was implemented to shuffle and split the data into a train and a test set matching the input format.
3.2.5 Hyper-Parameters
Before fine-tuning our models, some hyper-parameters had to be set:
Batch size: To keep memory consumption low, Peltarion [28] suggests as a rule of thumb to keep the product batch size × sequence length < 3000. Therefore, with the sequence length set to 512, the batch size used is approximately 6.

8 https://huggingface.co/transformers/
Learning rate: A low learning rate was chosen to avoid catastrophic forgetting, which can happen when a model is fine-tuned: the newly fine-tuned model can forget previously learned information. This issue is avoided by setting a very low learning rate, on the order of 10^-5.
Epochs: Fine-tuning requires only a few epochs. Devlin et al. [7] used three epochs for their fine-tuning experiments, and Peltarion [28] also suggests using three epochs or less. We therefore aimed to use three epochs.

Max steps: The number of steps was calculated to represent the number of epochs we wanted to train. With the following formula (3.1), we estimated the maximum number of steps:

max steps = (training size × epochs) / batch size    (3.1)

where training size is the total number of data points in the training set, epochs is the number of epochs used, and batch size is the number of data points used per step. A sketch of these calculations follows below.
3.2.6 Fine-tuning

We used the data loader described in section 3.2.4 to load the AP/Oppsummert dataset, which was shuffled and split into a train and a test set with 90% and 10% of the data, respectively. Using the provided code from Microsoft, binary labels were generated through oracle summary generation (section 3.2.3). Using transformers, the pre-trained Norwegian BERT model is downloaded and automatically initialized. The training data is also tokenized automatically, using the Auto Class feature to convert the sequence of words of each article in the training set to IDs representing the pre-trained model's word embeddings (section 2.1.1). Then, using PyTorch and the specified hyper-parameters (section 3.2.5), a fitting is done, which automatically divides the training data into batches and iteratively calculates the loss, backpropagates to compute gradients, and updates the model weights.
The process of fine-tuning Multilingual BERT was done in two steps. Firstly, the code from Microsoft provides a data loader for the CNN/DM dataset, which we used to load the dataset. Because of the hardware limitations mentioned in section 3.2.8, 10,000 articles were used in the training set and 1,000 articles in the test set. Similar to the Norwegian BERT, the data is tokenized and the model fine-tuned through BERTSum using the pre-trained Multilingual BERT model.
3.2.7 Prediction

Finally, predictions were made on the AP/Oppsummert test set, using the model to obtain a score for each sentence of every article in the test set. The sentences of each article are then ranked by their scores from highest to lowest, and the top N sentences are selected as the summary, where N is the number of oracle sentences the model was fine-tuned on. A sketch of this selection follows below.
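A minimal sketch of the prediction step follows. Trigram blocking, as used in BERTSum and in our results (section 4.1), is sketched here under the assumption that it skips any candidate sentence sharing a trigram with the sentences already selected.

```python
# A minimal sketch of the prediction step: rank sentences by model score,
# optionally apply trigram blocking, and keep the top N sentences.
def trigrams(sentence):
    words = sentence.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_summary(sentences, scores, n, trigram_blocking=True):
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected, seen = [], set()
    for i in ranked:
        tri = trigrams(sentences[i])
        if trigram_blocking and tri & seen:
            continue                    # redundant with an already chosen sentence
        selected.append(i)
        seen |= tri
        if len(selected) == n:
            break
    return [sentences[i] for i in sorted(selected)]   # keep original article order
```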
3.2.8 Hardware

The training was done through Google Colab 9, which allows anyone to write and execute Python code through the browser. Google Colab provides single GPUs such as the Nvidia K80, T4, P4, and P100. However, there is no way for users to choose which type of GPU to use; by displaying the allocated GPU during runtime, we noticed that an Nvidia K80 was used in our case. In terms of memory, we were limited to 12 GB. Also, free Google Colab sessions last at most 12 hours.
9 https://colab.research.google.com
3.3 Evaluation

We used the automatic evaluation metric ROUGE and a human evaluation by journalists from Aftenposten to evaluate our fine-tuned models. In this section, we describe our implementation of these.

3.3.1 ROUGE Evaluation

The ROUGE evaluation made use of the following language-specific components:

• Sentence Splitter

• Word Tokenizer

• Stemmer

• Word Splitter
Our models were then evaluated using the test data of AP/Oppsummert. The models' predictions were used as candidate summaries, and the summaries written by the journalists at AP were used as reference summaries. Once the ROUGE scores were computed, we saved the ROUGE-1, ROUGE-2, and ROUGE-L scores. These metrics were chosen for easy comparison with results on the CNN/DailyMail dataset and because they are the most commonly used metrics for summarization tasks in other scientific research.

10 https://pypi.org/project/pyrouge/
11 https://pypi.org/project/py-rouge/
12 https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp/eval/rouge
3.3.2 Human Evaluation with Journalists
1. How well do you think this summary managed to extract key sentences
from the article?
3. How do you rate the summary in terms of not being redundant (not
having repetitive sentences)?
4 Results

In this chapter, we present the results of our methods. First, in section 4.1, we cover our implemented models and how they manage to predict which sentences should be picked from the articles. Second, in section 4.2, we present how well the models performed in our evaluations.
4.1 Implementation

As mentioned in section 3.2.4, six models for sentence prediction were implemented. When these models were used on the test data, they managed to extract a selected number of sentences from different positions in the original articles. The time it took to fine-tune the Norwegian BERT and Multilingual BERT models is presented in Table 4.1.
Table 4.1: Time taken to fine-tune the Norwegian and Multilingual BERT models
For the Norwegian and Multilingual BERT models, the positions of the selected sentences in their original articles are shown in Figures 4.1 and 4.2. Six subplots are made for each model. The first subplot shows the respective model fine-tuned on the Oracle-3 training set with a prediction of three sentences on the Oracle-3 test set. The second and third subplots follow the same principle, but for Oracle-7 with seven predicted sentences and Oracle-10 with ten predicted sentences. This is then repeated in subplots four, five, and six, but without the use of trigram blocking. Oracle sentence positions are also presented in these subplots, showing where the top-scoring sentences occur in the original articles.
Since BERTSum has a limit of 512 tokens, the articles are truncated to fit that limit, and the predicted sentences will lie within the truncated part. Approximating one token per word, counting the number of tokens per sentence for each article in our AP/Oppsummert test set shows that 512 tokens correspond to an average of 26 sentences.
Figure 4.1: Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking, and on (d) Oracle-3, (e) Oracle-7 and (f) Oracle-10 without trigram blocking.
Figure 4.2: Sentence selection for Multilingual BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking, and on (d) Oracle-3, (e) Oracle-7 and (f) Oracle-10 without trigram blocking.
4.2 Evaluation

We present the evaluation results of the implemented models in this section. In section 4.2.1, the automatic ROUGE evaluation is presented, and in section 4.2.2, the human evaluation that was carried out with the journalists of Aftenposten.
Figure 4.3: Average human evaluation scores for each category, where the highest possible score for each example is 20
Table 4.2: ROUGE scores on the AP/Oppsummert test data (116 articles). *With trigram blocking

Model Dataset R1 R2 RL
Table 4.3: Journalists' opinions reflecting their satisfaction with generated summaries

Key sentences:
"The algorithm is good at picking out the key sentences at the start of the articles"

Content coverage:
"I think the summary all in all has captured the most important aspects of the article"
"The most important content of the main story is present in the summary"
"Most of the story is well covered here"

Summary length:
"I think it was impressive to reduce the (quite long) story to a summary this short"
"The length is good for such a long article"
"I think the length is good"
Response

1. "Better coverage of the whole article"

2. "The content. It doesn't extract the most important points of the article, and the way it's written is not very good."

3. "Mellomtitler [subheaders] should never be included in a summary - this should be easy to avoid. In crime stories it is often touchy ethical questions involved. I do not think the summaries handled this in a satisfying way (the lawyer's quote in the Prinsdal-story)."

4. "See my comments on each story."

5. "The algorithm is good at picking out the key sentences at the start of the articles, but seems to struggle with contextualisation and order. Also, towards the end of the articles, it often misses out on key information or fails to summarize longer parts efficiently."
Response

1. "This is impressively good, but not quite there. Some important points/facts from some of the articles are missing (when the main article is long, the second half of it seems to be missing in the summary?)"

2. "I think the two first summaries were the best ones."

3. "No answer"

4. "Overall this was not too bad considering."

5. "With some improvements, this could be a useful tool for creating a base structure for us to edit and rewrite. That could save time. But as it stands, it would not be a reliable automated tool for writing summaries."
5 Discussion

This chapter presents our thoughts and reflections on different parts of the thesis. In section 5.1, we analyze our results and discuss their meaning. Then, in section 5.2, we discuss our methods and how they could be improved. Moreover, in section 5.3, we consider our work in a wider context, giving our thoughts on related ethical and societal aspects.

5.1 Results

In the following, we highlight the main findings from the results presented in chapter 4.
The results presented in Table 4.2 show that the LEAD baselines give high ROUGE scores.
It can be seen that the models without trigram blocking in general give a slightly better ROUGE score than the ones with trigram blocking. This result is expected considering the lead bias. However, the ROUGE score alone does not show how well trigram blocking reduced redundancy among the selected sentences. This is something the human evaluation gives insight into, and it was the reason for selecting the model with trigram blocking for the human evaluation. The results from this are discussed in section 5.1.3.

The statistical models did not perform well in terms of ROUGE score, so we do not discuss their scores and performance further.
Sometimes the model also picks sentences from positions eight, nine, and ten. In figure 4.2(c), ten predictions are made; in this case, the proportions of the selected sentences are more evenly distributed across the sentence positions in AP/Oppsummert's test set.
Zhu et al. [45] discuss that, especially in news articles, the leading sentences are often the most important part of the article. Therefore LEAD is, in general, a hard baseline to beat for many deep learning models. Moreover, comparing the LEAD-3 score for CNN/DM mentioned in Liu's paper [23] to our LEAD-3 score for AP/Oppsummert, we see that ours is significantly higher. One explanation is that the data used in our work may have a higher positional bias than the CNN/DM dataset used in [23].
In figures 4.1 (d-f) and 4.2 (d-f), we observe a stronger lead bias when our BERT models predict summaries without trigram blocking. Our BERT models got a slightly better ROUGE score when predicting without trigram blocking than with it. However, we see a better distribution of the selected sentences with trigram blocking. Therefore, trigram blocking is applied in the model we developed in this work.
We consider the model to have performed well enough on an example if every block in the bar has a score above three. For instance, examples one and two, shown in figure 4.3, have scores above three, which means that the model managed to satisfy the journalists for those examples. However, the two last examples proved less satisfying to the journalists, because the model underperformed in aspects such as content coverage and key sentence extraction.

As described in the results (section 4.2.2), the first two examples were chosen based on summaries with a high ROUGE score, while the last two examples were selected based on summaries with a low ROUGE score. This is interesting because, as observed in figure 4.3, the journalists scored the first two examples as satisfying, while the last two examples underperformed in terms of the four categories presented in the figure. This suggests that there could be a mutual agreement between how the journalists assess the quality of a summary and the ROUGE metric.
Qualitative assessment

For the qualitative human evaluation, the results in section 4.2.2 present the journalists' thoughts on different aspects of the summaries. In the following, we discuss these comments.

Key Sentences: Regarding key sentence extraction, we notice that the journalists consider the summaries to miss meaningful sentences and to include sentences that do not bring any value. Sentences picked from the beginning of an article are considered relevant for the summary; however, these sentences alone are not sufficient for a satisfying summary.

Content Coverage: The results in 4.3 indicate that, in some cases, the summaries managed to cover the essential aspects of their main article. This means that, for those cases, the leading sentences indeed cover the main content of the whole article, which matches the discussion about LEAD sentence selection in section 5.1.2. However, most comments in 4.4 and 4.5 state that the summaries did not have sufficient content coverage, which means that considering only leading sentences is not enough.
Context: According to the results, most cases where sentences are considered to be out of context occur when the model has extracted quotes. This could be because quoted persons are usually introduced before or after the actual quote; if the model misses this introduction, the quoted sentence appears out of context. Another problem arises when it is not clear that a selected sentence is part of a quote. This happens when the quote consists of several sentences, as in the following example: - "Sentence A. Sentence B. Sentence C.", said the firefighter. In this case, if only sentence B is extracted, nothing in the summary indicates that this sentence is a quote.
One reason for the journalists' dissatisfaction with context could be our extractive approach to text summarization. As mentioned earlier, the goal of this approach was to avoid misleading context by building an extractive rather than an abstractive summarization model; it was believed that summaries generated from extracted sentences would, in most cases, not go out of context. However, from the previous discussion we observe that even though the sentences are extracted from the article, the summary as a whole can give a different impression than the original article. On the other hand, one strength of the extractive summarization technique is that the summaries are guaranteed at least not to introduce new words and definitions, which could cause more significant problems.
Subheaders in summary: We also observed from the comments that a summary could contain the subheaders of an article. This problem can be linked to the format of AP/Oppsummert's dataset, which we discuss further in section 5.2.1.

Coherence: The journalists find the summaries to be more of a listing of facts than a coherent summary. However, the summaries were never intended to be a readable narrative, but rather bullet points of the most important sentences of the longer article.
This result does not favor either BERT's and BERTSum's 512-token limit or the lead bias during fine-tuning. It seems that even though the first sentences often contain essential information, that is not reason enough to truncate articles and use only the first sentences for summarization.
5.2 Method

In this section, our chosen methods are discussed. More specifically, we look at resource limitations and things that could have been done differently. We also reflect on how these relate to the results and whether other approaches could have led to better results.
5.2.1 Datasets

In general, when training a model to perform a specific task with neural networks, it is essential to have datasets of high quality. In this section, we therefore discuss the quality of the datasets that were used to implement our models.
CNN/DM

CNN/DM is one of the most commonly used datasets for comparing text summarization models. It was also the dataset used by Liu [23] in the BERTSum paper; we therefore used CNN/DM as our dataset for fine-tuning the Multilingual BERT model. The advantage is that the results of our model can be compared with those of Liu [23]. However, some drawbacks of using this dataset for this project have been realized:
AP/Oppsummert
The initial limitation with the data that was provided from Aftenposten and
Oppsummert was known to be the limited amount of 979 summaries. How-
ever, during implementation, more limitations were observed. In the follow-
ing, we list some of these issues, which could be improved in the future:
1. The summaries mainly consist of sentences picked from the beginning
of the article. We believe this was the main reason behind the model's
tendency to choose most of its sentences from the beginning of the
article (see the sketch after this list).
2. The variation in the number of related articles for each summary, which
led to the removal of articles. In our work, restructuring the dataset was
essential for training a single-document classifier. However, summaries
based on several articles can lead to non-optimal results, since we remove
potential top Oracle sentences when removing these articles.
3. The raw article data does not contain HTML tags. The articles are pre-
sented online with HTML, but the data in the dataset consisted only of
raw text. This made it impossible for us to separate headlines from
paragraphs, which resulted in headlines being treated the same as
sentences during model training and prediction.
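To make the first issue concrete, the lead bias in a dataset like AP/Oppsummert can be estimated by checking where in each article the labeled sentences sit. The following is a minimal sketch, not our actual implementation; it assumes a hypothetical `dataset` of (sentences, oracle_labels) pairs, where oracle_labels[i] == 1 marks sentence i as selected.

    from collections import Counter

    def lead_bias(dataset):
        """Count, per sentence position, how often the oracle labels
        select that position across a dataset of labeled articles."""
        position_counts = Counter()
        total_selected = 0
        for sentences, oracle_labels in dataset:
            for position, label in enumerate(oracle_labels):
                if label == 1:
                    position_counts[position] += 1
                    total_selected += 1
        if total_selected == 0:
            return position_counts, 0.0
        # Fraction of all selected sentences that come from the first three
        # positions; a value near 1.0 indicates a strongly lead-biased dataset.
        lead_fraction = sum(position_counts[p] for p in range(3)) / total_selected
        return position_counts, lead_fraction

A high lead fraction would explain the model's tendency to pick its sentences from the top of the article.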
Dealing with the first issue could not be done during the project's implemen-
tation period: the human-written summaries would have to be replaced with
new summaries that also consider other sentences of the articles.
The third issue concerns how the dataset was processed before we got
access to the data. It could have been dealt with by manually going through
all 979 summaries or by re-doing the data extraction process. However, since
we did not have the resources for this, we decided to stick with the current
state of the data to save both time and labor.
5.2.2 Implementation
All fine-tuning of the BERT models was done using Google Colab, which
worked efficiently for the AP/Oppsummert dataset but not for CNN/DM, as
that dataset is much larger. Fine-tuning on CNN/DM resulted in a longer
training period, causing sessions in Google Colab to time out because of the
limitations described in section 3.2.8. Therefore, as mentioned in section 3.2.6,
where CNN/DM was used in our experiments, we extracted only 10 000 sam-
ples for our training set. The time it took to fine-tune the BERT models is
shown in table 4.1. It would have been interesting to fine-tune M-BERT on
all samples in CNN/DM, but because of the limitations of Google Colab, that
was out of the scope of this work.
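As an illustration of the subsampling step, the 10 000-sample training set can be obtained by slicing the corpus before fine-tuning. The sketch below assumes the Hugging Face `datasets` package and its hosted `cnn_dailymail` corpus; the thesis does not specify which loader was used, so this is only one possible realization.

    from datasets import load_dataset

    # Load only the first 10,000 training examples of CNN/DailyMail to keep
    # fine-tuning within Google Colab's session limits (an assumption about
    # tooling; any loader that yields (article, highlights) pairs would do).
    train_subset = load_dataset("cnn_dailymail", "3.0.0", split="train[:10000]")

    print(train_subset[0]["article"][:200])   # source article text
    print(train_subset[0]["highlights"])      # bullet-point reference summary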
The initial thought for model optimization was to include a validation set
during the fine-tuning process, giving us insight into when to stop the
training. However, we chose to skip this step for two main reasons.
5.2.3 Metrics
The evaluation was done in two parts. In this section, we discuss the ade-
quacy of ROUGE and our human evaluation.
ROUGE
The limitations of ROUGE, as mentioned in section 2.5.2, have been widely
discussed across different papers. Many authors question its suitability as an
evaluation metric for summaries, and yet it is used to claim state-of-the-art
performance of models. However, to our knowledge, there is no other
established metric today that can be used to compare evaluation results across
papers, and developing a new metric would be out of scope for this project.
Therefore, we decided to use ROUGE as one of our metrics. However, we
were also interested in how journalists assess the model's performance.
Therefore, an evaluation study was carried out as a complement to ROUGE.
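For reference, ROUGE scores between a candidate summary and a gold summary can be computed as in the sketch below. The `rouge-score` package is an assumption on our part; the thesis does not name a specific implementation, and the example strings are invented for illustration.

    from rouge_score import rouge_scorer

    # ROUGE-1/2 measure unigram/bigram overlap; ROUGE-L measures the longest
    # common subsequence between the candidate and the reference summary.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)

    reference = "The fence was too low and did not comply with regulations."
    candidate = "The fence did not comply with safety regulations."
    scores = scorer.score(reference, candidate)  # precision/recall/F1 per metric
    print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)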
Using ROUGE, we assumed that the gold summary is the true best summary
of the article. This assumption has some disadvantages. There is no reason
why one summary has to be better than another, since quality is subjective:
two very different summaries can be equally good simply because both
capture the full context of an article. This means that potentially good
summaries are penalized because they do not use words similar to the
reference summaries. In evaluation, this problem can be addressed by
complementary evaluation methods. However, that is not possible when we
use ROUGE to create Oracle sentences for fine-tuning our models; there, the
model's learning is limited to the "true best" summary that we have for each
article.
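The Oracle construction mentioned here greedily adds the article sentence that most improves the ROUGE score against the gold summary, in the spirit of the labeling used by Liu [23]. The sketch below is a simplified stand-in, scoring with unigram F1 rather than the ROUGE-1/ROUGE-2 combination used in practice; the function name and signature are our own.

    def greedy_oracle(article_sentences, gold_summary, max_sentences=3):
        """Greedily pick sentence indices whose union maximizes overlap
        with the gold summary (unigram F1 as a simplified ROUGE proxy)."""
        gold_tokens = set(gold_summary.lower().split())

        def overlap(selected):
            tokens = set(" ".join(selected).lower().split())
            if not tokens or not gold_tokens:
                return 0.0
            precision = len(tokens & gold_tokens) / len(tokens)
            recall = len(tokens & gold_tokens) / len(gold_tokens)
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        selected, selected_idx = [], []
        for _ in range(max_sentences):
            current = overlap(selected)
            best_gain, best_i = 0.0, None
            for i, sentence in enumerate(article_sentences):
                if i in selected_idx:
                    continue
                gain = overlap(selected + [sentence]) - current
                if gain > best_gain:
                    best_gain, best_i = gain, i
            if best_i is None:  # no sentence improves the score; stop early
                break
            selected.append(article_sentences[best_i])
            selected_idx.append(best_i)
        return sorted(selected_idx)  # 0/1 training labels follow from these

Whatever summary this greedy search converges to becomes the only target the classifier can learn from, which is the limitation discussed above.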
One limitation that we did attempt to deal with is that ROUGE only
considers content selection. We did this by using Liu's [23] algorithm with
trigram blocking to reduce redundancy among the selected sentences. Accord-
ing to the results presented in section 4.2, the usage of trigram blocking did
indeed lead to a broader range of selected sentences. This means that the
algorithm managed to filter out sentences that were considered redundant.
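Trigram blocking skips a candidate sentence if it shares any word trigram with the sentences already selected. A minimal sketch of the idea follows; the function names are ours, and `ranked_sentences` is assumed to be sorted by the model's sentence scores, best first.

    def trigrams(text):
        """Return the set of word trigrams in a sentence."""
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    def select_with_trigram_blocking(ranked_sentences, top_k=3):
        """Pick the top-k ranked sentences, skipping any candidate that
        repeats a trigram already present in the selection."""
        selected, seen = [], set()
        for sentence in ranked_sentences:
            grams = trigrams(sentence)
            if grams & seen:    # overlapping trigram => likely redundant
                continue
            selected.append(sentence)
            seen |= grams
            if len(selected) == top_k:
                break
        return selected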
Human Evaluation
We believe that the questionnaire was formed in a way that covered the
quality of the summaries well. What makes a human evaluation of summaries
hard is how to interpret different opinions about quality, i.e., what is really
meant by a summary being "really good" or "really bad". This needs to be
made explicit somehow. Our way of doing this was to have the journalists
rank how well each summary performed in specific categories. We defined
that a good summary captures key sentences from the article, has a decent
length, is not redundant, and covers the original article's content well.
One could argue that the human evaluation is less reliable due to the small
number of participants in the questionnaire. This would indeed be a
significant problem if the results fluctuated. However, since the journalists
more or less share the same opinions, we consider the human evaluation still
sufficient to draw conclusions from.
5.3 The work in a wider context
Firstly, when a journalist writes a summary, they are aware of what should
be highlighted and how the summary should be structured to convey the
article's main points transparently. However, a machine-generated summary
might not pick up the sentences that the journalist finds important. This
could result in a shift of the main point in the summary.
Depending on the topic and content, the issues mentioned can at some point
become critical. In its current state, the machine cannot make these judg-
ments itself. Therefore, we advise using the extractive summarization model,
BERTSum, as a tool to help journalists write their summaries.
6 Conclusion
This chapter will summarize our concluding thoughts on the purpose and
the research questions described in our thesis. Ideas for improvements will
be discussed at the end of the chapter.
6.1 Conclusion
This thesis project aimed to develop a model that could extract the most rele-
vant sentences from Norwegian news articles. This aim has, to an extent, been
achieved, but with limitations. In the following, we conclude our investiga-
tions and then answer the main research question.
• How news summaries can be used to generate labeled data that is re-
quired for a supervised model
To generate labeled data from summaries, two algorithms have been pre-
sented.
The first method has limitations regarding summary quality, while the
second method is not sustainable on a larger scale. Additionally, manual
experiments are difficult to compare across papers. Therefore, until
another metric is developed, ROUGE will continue to be the method of
choice for evaluating text summarization models at larger scales.
Together with the results and discussion, the main research question can now
be answered from these investigations:
1. We have seen that the model can learn sentence selection based on the
data it is provided with. Therefore, we expect that with a better-structured
and less lead-biased dataset, the model should learn to pick sentences
based on context rather than on position.
Bibliography
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry
Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan,
Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. “Tensor-
Flow: A system for large-scale machine learning”. In: 12th USENIX Sym-
posium on Operating Systems Design and Implementation (OSDI 16). 2016,
pp. 265–283. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Ma-
chine Translation by Jointly Learning to Align and Translate. 2016. arXiv:
1409.0473 [cs.CL].
[3] Miguel Romero Calvo. Dissecting BERT Part 1: Understanding the Trans-
former. https://medium.com/@mromerocalvo/dissecting-bert-part1-6dcf5360b07f. Accessed: 2020-12-06.
[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learn-
ing Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation”. In: Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Doha, Qatar: Asso-
ciation for Computational Linguistics, Oct. 2014, pp. 1724–1734. DOI:
10.3115/v1/D14-1179. URL: https://www.aclweb.org/anthology/D14-1179.
[5] Papers With Code. Document Summarization on CNN / Daily Mail.
https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail. Accessed: 2021-03-18.
[6] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le,
and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context. 2019. arXiv: 1901.02860 [cs.LG].
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
“BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805.
URL: http://arxiv.org/abs/1810.04805.
[8] Henning Carr Ekroll and Kjetil Magne Sørenes. Avgåtte regjeringspoli-
tikere får karantenelønn på grunn av egne selskaper. Enkelte har vært
helt uvirksomme. https://www.aftenposten.no/norge/politikk/i/P9BX5J/avgaatte-regjeringspolitikere-faar-karanteneloenn-paa-grunn-av-egne-selska. Accessed: 2021-05-13.
[9] Khalid N Elmadani, Mukhtar Elgezouli, and Anas Showk. “BERT
Fine-tuning For Arabic Text Summarization”. In: arXiv preprint
arXiv:2004.14135 (2020).
[10] Jeffrey L. Elman. “Finding structure in time”. In: Cognitive Science
14.2 (1990), pp. 179–211. ISSN: 0364-0213. DOI: https://doi.org/10.1016/0364-0213(90)90002-E. URL: https://www.sciencedirect.com/science/article/pii/036402139090002E.
[11] Wenche Fuglehaug Fallsen. Tiltalte: Ble provosert da Mohammed sa «jeg
elsker henne». https://www.aftenposten.no/norge/i/Jo2V6P/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.
[12] Jan Gunnar Furuly and Hans O. Torgersen. Havarikommisjon kritis-
erer sikkerhetsbrudd etter at 15-åring omkom i strømulykke. https://www.aftenposten.no/norge/i/8mo9bw/havarikommisjon-kritiserer-sikkerhetsbrudd-etter-at-15-aaring-omkom-i-s. Accessed: 2021-05-13.
[13] Yoav Goldberg. “A Primer on Neural Network Models for Natural
Language Processing”. In: CoRR abs/1510.00726 (2015). arXiv: 1510.00726.
URL: http://arxiv.org/abs/1510.00726.
[14] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Es-
peholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. “Teaching Ma-
chines to Read and Comprehend”. In: NIPS. 2015, pp. 1693–1701. URL:
http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.
[15] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”.
In: Neural computation 9 (Dec. 1997), pp. 1735–1780. DOI: 10.1162/neco.1997.9.8.1735.
[16] Rani Horev. BERT Explained: State of the art language model for NLP.
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270. Accessed: 2020-12-06.
[17] Karen Spärck Jones. “A statistical interpretation of term specificity
and its application in retrieval”. In: Journal of Documentation 28 (1972),
pp. 11–21.
[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexan-
der M. Rush. “OpenNMT: Open-Source Toolkit for Neural Machine
Translation”. In: CoRR abs/1701.02810 (2017). arXiv: 1701.02810.
URL: http://arxiv.org/abs/1701.02810.
[27] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and San-
jeev Khudanpur. “Recurrent neural network based language model”.
In: vol. 2. Jan. 2010, pp. 1045–1048.
[28] Multilingual BERT snippet. URL: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/pretrained-snippets/multilingual-bert-snippet.
[29] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al.
“Abstractive text summarization using sequence-to-sequence rnns and
beyond”. In: arXiv preprint arXiv:1602.06023 (2016).
[30] Christopher Olah. Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 2021-03-17.
[31] Oppsummert. Egne selskaper ga regjeringspolitikerne etterlønn. https://www.aftenposten.no/norge/i/70gbJ4/egne-selskaper-ga-regjeringspolitikerne-etterloenn. Accessed: 2021-05-13.
[32] Oppsummert. Kritikk etter dødsulykken på Filipstad. https://www.aftenposten.no/verden/i/xPWQ6G/kritikk-etter-doedsulykken-paa-filipstad. Accessed: 2021-05-13.
[33] Oppsummert. Prinsdal: Politiet knytter ny siktet til gjengmiljøet. https://www.aftenposten.no/norge/i/WbBReG/prinsdal-politiet-knytter-ny-siktet-til-gjengmiljoeet. Accessed: 2021-05-13.
[34] Oppsummert. Tiltalte: Ble provosert da Mohammed sa «jeg elsker henne». https://www.aftenposten.no/norge/i/RR23Jd/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.
[35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The PageRank Citation Ranking: Bringing Order to the Web. Technical Re-
port 1999-66. Previous number = SIDL-WP-1999-0120. Stanford Info-
Lab, Nov. 1999. URL: http://ilpubs.stanford.edu:8090/422/.
[36] Michael Quinn Patton. “Qualitative evaluation checklist”. In: Evaluation
checklists project 21 (2003), pp. 1–13.
[37] Romain Paulus, Caiming Xiong, and Richard Socher. “A Deep
Reinforced Model for Abstractive Summarization”. In: CoRR
abs/1705.04304 (2017). arXiv: 1705.04304. URL: http://arxiv.org/abs/1705.04304.
[38] Telmo Pires, Eva Schlinger, and Dan Garrette. “How multilingual is
Multilingual BERT?” In: CoRR abs/1906.01502 (2019). arXiv: 1906.01502.
URL: http://arxiv.org/abs/1906.01502.
[39] Anand Rajaraman and Jeffrey David Ullman. Mining of massive datasets.
Cambridge University Press, 2011.
[40] Wasim Riaz, Daniel Røed-Johansen, Frøydis Braathen, and Harald Stolt-
Nielsen. Politiet: 20-åringen som er siktet i Prinsdal-saken, har tilknytning til
gjengmiljøet. https://www.aftenposten.no/norge/i/mRdr1l/politiet-20-aaringen-som-er-siktet-i-prinsdal-saken-har-tilknytning-t. Accessed: 2021-05-13.
[41] Abigail See, Peter J Liu, and Christopher D Manning. “Get to the point:
Summarization with pointer-generator networks”. In: arXiv preprint
arXiv:1704.04368 (2017).
[42] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. “How to Fine-Tune
BERT for Text Classification?” In: CoRR abs/1905.05583 (2019). arXiv:
1905.05583. URL: http://arxiv.org/abs/1905.05583.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
Is All You Need. 2017. arXiv: 1706.03762 [cs.CL].
[44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Moham-
mad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson,
Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil,
Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol
Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. “Google’s
Neural Machine Translation System: Bridging the Gap between Hu-
man and Machine Translation”. In: CoRR abs/1609.08144 (2016). arXiv:
1609.08144. URL: http://arxiv.org/abs/1609.08144.
[45] Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong
Huang. Leveraging Lead Bias for Zero-shot Abstractive News Summariza-
tion. 2021. arXiv: 1912.11602 [cs.CL].
A Appendix
A.1 All responses from Human Evaluation
A.1.1 Article 1
The article can be found in [12] and its gold summary in [32].
• The accident happened on 24 February 2019, when the three youths had
entered the Filipstad rail depot, right next to the Ruseløkka youth club.
• When they climbed onto a parked train set, all three were struck by
current from the overhead line.
• The fence had holes and was too low Before the accident, the youths had
passed through a hole in a fence that did not comply with the safety
regulation: it should have been 180 cm high, but was at its lowest point
only 106 cm.
A.1.2 Article 2
The article can be found in [40] and its gold summary in [33].
• The police have charged yet another 20-year-old man with the murder
of, or complicity in the murder of, Halil Kara (21) in Prinsdal.
• The 20-year-old man is from Oslo and is the third person to be charged
in the murder case in Prinsdal in southern Oslo.
• He has been caught twice, within a short period of time, with an
extremely dangerous firearm».
• Police inspector Grete Lien Metlid has stated that the police do not yet
know the motive for the murder.
• Aftenposten is aware that there have over time been several conflicts
between young men from Holmlia and young men from Mortensrud.
A.1.3 Article 3
The article can be found in [8] and its gold summary in [31].
• As Aftenposten has previously written, Erna Solberg has set a record in
the number of cabinet minister replacements in her government project.
• Aftenposten has taken a closer look at what has happened in these
companies.
• Then I thought that I would have to support myself with what I did
before I went into politics.
A.1.4 Article 4
The article can be found in [11] and its gold summary in [34].
• In Oslo District Court sits a 21-year-old man who is charged with
aggravated bodily harm and with leaving Mohammed Altai in a helpless
state.
• Heart defect?
1 The summary seems to just cover the first half of the main
article (?)
2 The length is good. The summary is missing some key infor-
mation, and contains some information that is not necessary
at all. Still, it doesn't seem like a summary (text), but more
like a listing of facts.
3 The summary is quite confusing to read and I do not think
I would understand much had I not known the whole orig-
inal story beforehand. The summary sentences seem very
randomized when I read them. If the third sentence (• In
Oslo District Court sits a 21-year-old man who is charged
with aggravated bodily harm and with leaving Mohammed
Altai in a helpless state.) had been first in the summary, it
may have been easier to get into the story. On this note, it
should be noted that the original text is quite complicated,
and the text is quite long. The motive behind the (possible)
crime is not really made clear in the summary (the story
about the relationship with the sister). It also seems that
the summary emphasizes the first and the middle part of
the original text, and not the last third of the original text.
Also, this word is just a subheading («mellomtittel») and
could have been dropped: • Heart defect?
4 Bullet point 4 is a quote, but it does not say from whom.
Bullet point 5 is also a quote, but it is not presented as such.
Bullet point 6 is just a one-word subtitle. The summary
does not include anything about the accused's younger sis-
ter.
5 This one did not work so well. It is a good example that the
first sentences often carry key information, but they still need
to be contextualized and ordered properly in a summary.
Sentences 2 and 3 should have been switched around, and as
the summary goes on, more bullet points appear totally out
of context.