
LiU-ITN-TEK-A--21/016-SE

Extractive Text Summarization


of Norwegian News Articles
Using BERT
Thomas Indrias Biniam
Adam Morén

2021-06-04

Department of Science and Technology Institutionen för teknik och naturvetenskap


Linköping University Linköpings universitet
SE-601 74 Norrköping, Sweden    601 74 Norrköping
LiU-ITN-TEK-A--21/016-SE

Extractive Text Summarization


of Norwegian News Articles
Using BERT
The thesis work carried out in Datateknik
at Tekniska högskolan at
Linköpings universitet

Thomas Indrias Biniam


Adam Morén

Norrköping 2021-06-04

Upphovsrätt (Copyright)

This document is made available on the Internet – or its possible future replacement –
for a considerable time from the date of publication barring exceptional
circumstances.
Access to the document implies permission for anyone to read, download,
print out single copies for personal use and to use it unchanged for
non-commercial research and for teaching. A later transfer of copyright
cannot revoke this permission. All other use of the document requires the
copyright owner's consent. To guarantee the authenticity, security and
accessibility of the document, there are solutions of a technical and
administrative nature.
The moral rights of the author include the right to be named as the author
to the extent required by good practice when the document is used as described
above, as well as protection against the document being altered or presented
in a form or in a context that is offensive to the author's literary or artistic
reputation or character.
For additional information about Linköping University Electronic Press, see
the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/

© Thomas Indrias Biniam, Adam Morén


Abstract

Extractive text summarization has over the years been an important re-
search area in Natural Language Processing. Numerous methods have
been proposed for extracting information from text documents. Recent
work has shown great success on English summarization tasks by fine-
tuning the language model BERT using large summarization datasets.
However, less research has been done on low-resource languages. This
work contributes by investigating how BERT can be used for Norwegian
text summarization. Two models are developed by applying a modified
BERT architecture, called BERTSum, on pre-trained Norwegian and Mul-
tilingual BERT. The resulting models are able to predict key sentences from
articles to generate bullet-point summaries. These models are evaluated
with the automatic metric ROUGE, and in this evaluation the Multilin-
gual BERT model outperforms the Norwegian model. The multilingual
model is further evaluated in a human evaluation by journalists, reveal-
ing that the generated summaries are not entirely satisfactory in some
aspects. With some improvements, the model shows promise as a valuable
tool for journalists to edit and rewrite generated summaries, saving time
and workload.
Acknowledgments

We want to start by giving our gratitude to our supervisor, Elmira Zohrevandi,
and examiner, Pierangelo Dellacqua, at Linköping University for their com-
mitment and valuable advice. We also want to express our gratitude to our
supervisor, Björn Schiffler at Schibsted, for actively committing and guiding
us with both technical and academic advice throughout our work. We thank
the contextual team at Schibsted for welcoming us into the team and provid-
ing us with advice and the necessary data to perform our research.

Norrköping, June 2021


Adam Morén and Thomas Indrias

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures viii

List of Tables ix

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Research question . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theory 4
2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Text Processing . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Statistical Methods . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Artificial Neural Networks . . . . . . . . . . . . . . . . . 7
2.2 Sequential models . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Encoder-Decoder . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Input and Output Embeddings . . . . . . . . . . . . . . . 15
2.3.2 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Pretrained BERT models . . . . . . . . . . . . . . . . . . . 18
2.4 Extractive Text Summarization Methods . . . . . . . . . . . . . . 19
2.4.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.2 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4.3 BERTSum . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Evaluation metrics for summarization . . . . . . . . . . . . . . . 23
2.5.1 Precision, Recall and F-Score . . . . . . . . . . . . . . . . 23
2.5.2 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.3 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . 25

3 Method 27
3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 CNN/DailyMail . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 Aftenposten/Oppsummert . . . . . . . . . . . . . . . . . 28
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Restructure of the AP/Oppsummert dataset . . . . . . . 31
3.2.2 Truncation of articles . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Oracle Summary Generation . . . . . . . . . . . . . . . . 33
3.2.4 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.5 Hyper-Parameters . . . . . . . . . . . . . . . . . . . . . . 36
3.2.6 Fine tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.8 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 ROUGE Evaluation . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Human Evaluation with Journalists . . . . . . . . . . . . 40

4 Results 41
4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 ROUGE Evaluation . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Human Evaluation with Journalists . . . . . . . . . . . . 45

5 Discussion 51
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 ROUGE Scores . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.2 Sentence Selection . . . . . . . . . . . . . . . . . . . . . . . 52
5.1.3 Human Evaluation with Journalists . . . . . . . . . . . . 53
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 The work in a wider context . . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . 60

6 Conclusion 62
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

Bibliography 66

A Appendix I
A.1 All responses from Human Evaluation . . . . . . . . . . . . . . . I
A.1.1 Article 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
A.1.2 Article 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . III
A.1.3 Article 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
A.1.4 Article 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII

List of Figures

2.1 Perceptron model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


2.2 Illustration of a Multilayer neural network . . . . . . . . . . . . . . 9
2.3 RNN illustrated by C. Olah [30] . . . . . . . . . . . . . . . . . . . . . 10
2.4 RNN unpacked illustrated by C. Olah [30] . . . . . . . . . . . . . . . 10
2.5 RNN Encoder-Decoder sequence-to-sequence model illustrated by
Kostadinov [19] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Attention example illustrated by Bahdanau et al. [2] . . . . . . . . . 13
2.7 Model architecture of a transformer illustrated by Vaswani et al.
[43] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 The input embeddings and embedding layers for BERT illustrated
by Devlin et al. [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Two words broken down into sub-words using WordPiece tok-
enization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Binary labels generated by a pair of input sentences. . . . . . . . . . 17
2.11 Position embeddings layer. . . . . . . . . . . . . . . . . . . . . . . . . 18
2.12 Architecture of BERTSum proposed by Liu [23] . . . . . . . . . . . . 22

3.1 Summaries associated with x articles in the AP/Oppsummert dataset 30


3.2 Number of sentences in the AP/Oppsummert summaries dataset . 30
3.3 ROUGE-2 and ROUGE-L recall scores for summaries with one arti-
cle in (a) and (b), summaries with more articles and the top-scoring
articles in (c) and (d), and summaries with more articles and the
second-best articles in (e) and (f). . . . . . . . . . . . . . . . . . . . . 32
3.4 Proportion of sentences with highest ROUGE score according to
their position in the original article . . . . . . . . . . . . . . . . . . . 34

4.1 Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3
(b) Oracle-7 (c) Oracle-10 with trigram blocking and on (d) Oracle-
3 (e) Oracle-7 and (f) Oracle-10 without trigram blocking. . . . . . . 43
4.2 Sentence selection for Multilingual BERT fine-tuned on (a) oracle-3
(b) oracle-7 (c) oracle-10 with trigram blocking and on (d) oracle-3
(e) oracle-7 and (f) oracle-10 without trigram blocking. . . . . . . . . 44
4.3 Average human evaluation scores for each category where the
highest score for each example is 20 . . . . . . . . . . . . . . . . . . 46

List of Tables

3.1 Average token and sentence count for news articles and summaries
in the CNN/DailyMail dataset . . . . . . . . . . . . . . . . . . . . . 28
3.2 Dataset split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Article data type in the AP/Oppsummert dataset . . . . . . . . . . 29
3.4 Summary data type in the AP/Oppsummert dataset . . . . . . . . . 29
3.5 Average token and sentence count for news articles and summaries
for AP/Oppsummert . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1 Time it took to fine-tune the Norwegian and Multilingual
BERT models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 ROUGE scores on AP/Oppsummert test data (116 articles). *With
trigram blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Journalists’ opinion reflecting their satisfaction with generated
summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Journalists’ opinion on generated summaries mentioning features
on which they found the algorithm performed weakly. . . . . . . . . 49
4.5 Journalists’ opinion on generated summaries reflecting potential
for improvement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Overall comments from journalists . . . . . . . . . . . . . . . . . . . 50

A.1 Responses from the journalists on article 1 . . . . . . . . . . . . . . . III


A.2 Responses from the journalists on article 2 . . . . . . . . . . . . . . . V
A.3 Responses from the journalists on article 3 . . . . . . . . . . . . . . . VII
A.4 Responses from the journalists on article 4 . . . . . . . . . . . . . . . IX

1 Introduction

Over recent years, the amount of data available to both users and companies
has kept increasing massively. In response to this, creating summarizations
of data has become a popular topic in data science. Text summarization is
part of this, focusing on representing content in a shorter format. Con-
sidering the amount of text data involved, news and media are an example of a field
where automatic text summarization could be beneficial.

1.1 Background
Aftenposten is Norway’s largest daily newspaper, based in Oslo. It is a private
company owned by Schibsted and has an estimated 1.2 million readers. To
save readers time, Aftenposten developed a daily brief called Oppsummert,
which features the most important stories of the day, offered in a summarized
format. The idea is to help readers stay updated on the most important
daily news in a time-efficient way and at the same time offer a consistent and
standardized reading experience.

Summarizing articles manually leads to an increased workload for jour-
nalists. Additionally, many journalists want to focus on great journalism and
innovation, not re-writing shorter versions of already written articles. The
challenge here is to provide daily briefs for readers and, at the same
time, use fewer resources from the newsroom and their journalists. For this
challenge, we see potential in automatic text summarization, teaching
machines to understand and process the human language.

There are two main text summarization strategies: extractive and abstractive.


Extractive techniques are about identifying the most important sentences
of a text and extracting them. In contrast, abstractive techniques produce
new, shorter text that captures the context of the original longer text. When
implementing automatic text summarization in this thesis, the approach will
be to use extractive techniques. The motivation for this is that we want the
summaries to use sentences written by the original journalist. Abstractive
summarization techniques can sometimes lead to misinformation or biased
generated results, which we want to prevent.

Traditional approaches for extractive text summarization are based on statis-
tical and graph-based methods such as TF-IDF and TextRank. However, these
methods have recently started to be replaced by methods based on neural
networks, such as BERT. BERT is a new state-of-the-art language model that
can learn to perform specific tasks using labeled data.

In our case, the amount of summarized articles from Aftenposten is limited
since Oppsummert is a newly released feature. Therefore, we hypothesize
that it will be challenging to train a Norwegian BERT model and get good
results. Our approach will investigate this and try different BERT models and
methods.

1.2 Motivation
In the massive flood of news and media seen today, it can be challenging for
newsreaders to filter out the most important daily news. Furthermore, due
to the fast pace of our daily lives, users often want to be as time-efficient as
possible. Therefore, the motivation of news summaries is to help readers stay
updated on the most important daily news in a time-efficient way. However,
writing these summaries manually leads to an increased workload for the
journalists. This is where machine learning shows potential for generat-
ing these summaries automatically. By implementing a model that can extract
key sentences from an article, we can reduce the workload for journalists and,
at the same time, deliver summaries with the most important content to the
newsreaders.

1.3 Aim
The thesis project aims to develop a model that can extract the most relevant
sentences from an article written in Norwegian on which journalists can base
their summaries. This will be done by investigating possible approaches for
extractive text summarization using BERT with a limited labeled Norwegian
data set and evaluating the results.


1.4 Research question


The current work aims to answer the following research question:

• How can a high-performance BERT-based extractive summarization
model be developed based on a limited amount of news summaries in
Norwegian?

To this end we aim to investigate:

• How news summaries can be used to generate labeled data that is re-
quired for a supervised learning model.

• How the model’s performance should be evaluated and assessed.

• How BERT can be used for extractive text summarization on a low re-
source language.

• Limitations with BERT and how they should be dealt with.

1.5 Delimitations
The study focuses on BERT-based models for extractive text summarization.
However, it will also explore traditional methods for comparison purposes.
The articles and summaries from Aftenposten are in Bokmål, one of Norway’s
official writing standards. Therefore, we narrow the scope of the language to
only Bokmål.

2 Theory

Automatic text summarization is the process of a machine condensing a
longer text into a shorter comprehensive synopsis. The technique can be
either abstractive or extractive. The abstractive approach aims to present a
text with newly generated sentences, and the extractive technique aims to
find and re-use key sentences from the original text. The output format can
be in the form of bullet points, quotes, questions or speakable summaries.
These outputs are usually analyzed and rated relative to how well they cap-
ture main points, grammar, text quality, etc. An automatic summarization
architecture must be able to capture the essence of longer articles. Therefore,
summary evaluation becomes a crucial part of automatic text summarization.

Extractive text summarization can be treated either as a sentence scoring
and selection task or as a sentence classification task. Sentence scoring and
selection is the traditional approach where each sentence is scored based
on its importance to the text, and the sentences with the highest scores are
selected for the final summary. With the approach of sentence classification,
each sentence is instead classified into one of two different classes: extracted
or not extracted. The former approach is part of statistical methods, and the
latter approach utilizes neural networks for task-learning.

In the following, we cover common methods and tasks with a focus on
text and text summarization. In section 2.1, we introduce the research area
of automatic text summarization known as natural language processing. Sec-
ondly, in section 2.2, previous methods within textual tasks through neural
networks are introduced. Thirdly, in section 2.3, the current state-of-the-art
language model BERT is introduced. Finally, in sections 2.4 and 2.5, we


present different methods for extractive text summarization and how these
models can be evaluated.

2.1 Natural Language Processing


Natural language processing (NLP) is the field of computer science and artifi-
cial intelligence that deals with the enabling of computers to understand and
process the human language. This includes the understanding of both written
and spoken language, which comes with many complex challenges. Today,
computer applications are expected to translate, answer voice commands,
give directions, and even produce human-like texts. These challenges are
hard for computers to manage because of how abstract and inconsistent the
human language is in its nature. Things like humor, sarcasm, irony, intent,
and sentiment are a few examples that vary not only between languages,
but also between people.

NLP aims to solve these challenges by converting language to numerical
computational inputs that a computer can understand and process. By
then combining computer algorithms with statistics, it is possible to extract,
classify, and label language.

2.1.1 Text Processing


For a computer to be able to work with text and solve larger NLP tasks, the
text input must first be processed. Text processing contains several subtasks,
such as:

Tokenization: Tokenization is usually the first subtask in text processing. It
is used to separate a chunk of continuous text into tokens that help the com-
puter to better understand the text. For example, the sentence "The firefighter
saved the cat. Awesome!" would with a tokenization of words be converted
into ["The", "firefighter", "saved", "the", "cat", ".", "Awesome", "!"], and with a
tokenization of sentences into ["The firefighter saved the cat.", "Awesome!"].
After a text input has been tokenized, the computer can use the processed
input for other important processes, such as stemming and lemmatization.

Stemming and Lemmatization: Stemming and lemmatization are methods
for trimming down words to their root form. For example, the word "saved"
has the root "save". The difference between the two methods is that stemming
solely focuses on trimming the form of a word, whereas lemmatization actually
finds a dictionary form of the word, meaning that after applying lemmatiza-
tion, we will always get a valid word.
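
As an illustration of these preprocessing steps, the short sketch below uses the
NLTK library (an assumption made for this example; the thesis does not prescribe
a specific toolkit, and NLTK's "punkt" and "wordnet" resources must be downloaded
before running it):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # nltk.download("punkt"); nltk.download("wordnet")   # required resources

    text = "The firefighter saved the cat. Awesome!"
    print(sent_tokenize(text))   # ['The firefighter saved the cat.', 'Awesome!']
    print(word_tokenize(text))   # ['The', 'firefighter', 'saved', 'the', 'cat', '.', 'Awesome', '!']

    print(PorterStemmer().stem("saved"))                     # 'save' (stemming)
    print(WordNetLemmatizer().lemmatize("saved", pos="v"))   # 'save' (lemmatization)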


Stop Words: Stop words are usually words with no semantics, and are
therefore considered not to provide any relevant information for the task. The
English language contains several hundred stop words, such as "the" or "and",
which do not carry any signification by themselves and are therefore often
removed from documents [39].

POS tagging: Part-of-Speech tagging (POS tagging) is the method for iden-
tifying the part of speech for words (noun, adverb, adjective, etc.). Some
words in a language can be used as more than one part of speech. For ex-
ample, consider the word "book" which can be used both as a noun, in the
sentence "The firefighter read a book" and as a verb, in the sentence "Did the
firefighter book a table?". This example shows why POS tagging becomes
important to process sentiment in a text.

Sentence boundary identification: Sentence boundary identification is im-
portant in order for the system to recognize the end of sentences in the docu-
ment. Establishing where a sentence ends and the next one begins is impor-
tant for a clear sentence structure for many NLP tasks.

Word Embeddings: Word embeddings are a method for representing words
that can capture syntactical and semantic information. Usually, words are
mapped to a vector space of numbers with a fixed dimension, where words
with similar meaning are closer together in the vector space. This makes
it possible to for example detect synonymous words, or suggest additional
words for sentences. Word embeddings are obtained from, and used by, lan-
guage models that use neural networks to train and learn word associations
from a large corpus of text.
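
As a toy illustration of this idea (the three-dimensional vectors below are made up
for the example; real embeddings typically have hundreds of dimensions), words with
similar meaning end up with a high cosine similarity:

    import numpy as np

    vectors = {
        "cat":  np.array([0.8, 0.1, 0.3]),
        "dog":  np.array([0.7, 0.2, 0.4]),
        "fire": np.array([0.1, 0.9, 0.2]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(vectors["cat"], vectors["dog"]))    # high: related words lie close together
    print(cosine(vectors["cat"], vectors["fire"]))   # lower: unrelated words lie further apart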

2.1.2 Statistical Methods


For many years the most common NLP methods were based solely on statis-
tical methods. For texts, this includes algorithms and rules for the statistics
of the words and sentences in a text document. An example is TF-IDF, a
numerical statistic that can reflect a word’s importance in a collection of text
documents. Machine learning algorithms should also be mentioned here, as
they revolutionised natural language processing when introduced in the 1980s.
Popular machine learning classifiers and algorithms are Naive Bayes, Sup-
port Vector Machines, decision trees and graph structures.

Today, statistical methods in NLP have been largely replaced by neural
networks. However, statistical methods continue to be relevant in some con-
texts, for example, when the amount of training data is insufficient. In Section
2.4 we investigate two statistical methods for extractive text summarization:
TF-IDF and a graph-based method called TextRank.


2.1.3 Artificial Neural Networks


Most methods that are currently achieving state-of-the-art results for NLP
tasks employ neural networks. A neural network is an artificial intelligence
system that mimics how a biological brain works through artificial neurons.
It enables models to learn tasks iteratively, and one of the reasons for its suc-
cess in recent years has to do with the massive increase of data on which the
models can train. In this section, we give an overview of the main concepts of neural
networks to better understand what happens when a model learns to
perform a specific task.

Perceptron: Single Layer Neural Net


The simplest form of a neural network is a single-layer perceptron, capable of
classifying linearly separable data. A perceptron is an algorithm that can be
explained as a simplified biological neuron. It takes in a signal and outputs a
binary signal. A vector of numbers represents the input, and the classification
of this input represents the output. The framework for a perceptron is the
following:

• Input: x = (x_1, x_2, ..., x_d)

• Output: y

• Model: Weight vector w = (w_1, w_2, ..., w_d) and bias b

The perceptron makes its predictions based on the prediction function pre-
sented in Eq 2.1. An illustration of this equation is also shown in Fig 2.1,
where f is an activation function that can be different for different types of
neurons, w is the weight vector that represents the strength of the nodes, T is
the transpose, and b is the bias.

y = f(w^T x + b)    (2.1)

Figure 2.1: Perceptron model
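
A minimal sketch of Eq 2.1 in code (an illustrative example with arbitrary numbers,
not part of the thesis implementation) could look as follows, using a binary step
function as the activation f:

    import numpy as np

    def step(z):
        # binary activation: 1 if the weighted sum is positive, otherwise 0
        return 1 if z > 0 else 0

    def perceptron_predict(x, w, b):
        # y = f(w^T x + b), Eq 2.1
        return step(np.dot(w, x) + b)

    x = np.array([1.0, 0.5, -0.2])    # input vector (x_1, ..., x_d)
    w = np.array([0.4, -0.1, 0.8])    # weight vector (w_1, ..., w_d)
    b = 0.1                           # bias
    print(perceptron_predict(x, w, b))   # 1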

Training Perceptrons
The training of perceptrons is a process known as gradient descent.
The goal of gradient descent is to find optimal parameters w that minimize
the loss function. A training set is used during training, which contains input
values X and their corresponding correct values Y. The model predicts a
value ŷ from the input x, and this prediction is then compared with the actual
value y. The difference between predicted values and actual values is the
error E of the model. E can be calculated in different ways depending on
the output type of the model. A model with a binary output usually
uses the binary cross-entropy loss. Since the goal is to
minimize the error of the loss function, we are looking for where the gradient
of the loss function is zero. This is why it is called gradient descent because
the goal is to go down the gradient until it no longer has a slope, i.e., the error
becomes as small as it possibly can get.

The model parameters are updated via the equation presented in Eq 2.2.
Here, we calculate the new values of the parameters as the old parameters
minus a step in the direction of the derivative. ε is called the learning rate, and it is
a value that determines how big the step should be. The step size is
important because if it is too large, we risk stepping over the optimal
point, and if it is too small, the descent takes too much computational time.

w_i(t + 1) = w_i(t) − ε ∂E/∂w_i    (2.2)
Training of perceptrons happens in epochs. An epoch is defined as a full cycle
through the training data. In the standard gradient descent method, we accu-
mulate the loss for every example in the train set before updating the weights.
The problem with the standard gradient descent method is that if the number
of training samples is large, it may be time-consuming because we have to run
through the whole training set for every parameter update. Instead, Stochastic
Gradient Descent (SGD) is often applied for faster convergence. There are two
main methods for SGD:

• Update weights for each sample

• Minibatch SGD: Update weights for a small set of samples

Updating the weights for each sample is fast, but it makes the model very
sensitive to noise in the training set. Using minibatch SGD is both fast and
robust to noise. That is why it is often preferred in training.
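
The update rule in Eq 2.2 combined with minibatch SGD can be sketched as below
(an illustrative example for a single-layer model with a sigmoid output and binary
cross-entropy loss; the learning rate, batch size and data are arbitrary and not taken
from the thesis):

    import numpy as np

    def sgd_epoch(X, Y, w, b, lr=0.1, batch_size=2):
        # One epoch of minibatch SGD: shuffle the data, then update the
        # parameters once per minibatch (Eq 2.2, gradient averaged over the batch).
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            y_hat = 1.0 / (1.0 + np.exp(-(X[batch] @ w + b)))   # sigmoid output
            error = y_hat - Y[batch]                   # dE/dz for binary cross-entropy
            w = w - lr * X[batch].T @ error / len(batch)   # w_i(t+1) = w_i(t) - eps * dE/dw_i
            b = b - lr * error.mean()
        return w, b

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    Y = np.array([1.0, 0.0, 1.0, 0.0])
    w, b = sgd_epoch(X, Y, np.zeros(2), 0.0)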


Figure 2.2: Illustration of a Multilayer neural network

Multi-Layer NN as a non-linear classifier


The problem with a single-layer neural net is that it can only be used as a
linear classifier and cannot be used for feature learning. To solve this, multi-
ple perceptrons can be combined to form a neural network with more layers,
called hidden layers. An illustration of such a neural network, consisting of an
input layer, two hidden layers, and an output layer, is shown in Fig 2.2. The
advantage of a multi-layer neural network is that it is able to model non-linear
functions, unlike single-layer neural networks that can only model linear func-
tions.

2.2 Sequential models


When working with textual data in NLP, the data is generally transformed
into a sequence. It is essential to keep the order of the sequence since if the or-
der of words is changed, the sentence’s context could also change. For exam-
ple, ["the", "firefighter", "saved", "a", "cat"] has a different meaning then ["a",
"cat", "saved", "the", "firefighter"]. Data, where the order is important, is called
sequential data and models working with this type of data is called sequen-
tial models. Another requirement for sequential models is that the processed
sequence should remember previous important parts of sequences. For exam-
ple, if the data is a sequence of sentences, such as ["X was walking home", ...,
"he forgot to buy milk on the way"], it is essential to remember specific keywords
such as "X" and "home".

2.2.1 RNN
A Recurrent Neural Network (RNN) is a family of neural networks for pro-
cessing sequential data. RNNs can process long sequences and sequences


with variable length, meaning that the input sequence does not have to be the
same length as the output sequence.

Figure 2.3: RNN illustrated by C. Olah [30]

Figure 2.3 above shows an RNN with loops. The model A takes an input
sequence x_t and outputs a value h_t. The model also passes its past state to the
next step. The same RNN can be visualized as an unpacked network instead,
as shown in figure 2.4 below.

Figure 2.4: RNN unpacked illustrated by C. Olah [30]

RNNs can have different structures and combinations. For example, an RNN can
have multiple layers so that an output from one layer can be used as input
to another layer. Such layered networks are often called deep RNNs. Goldberg [13]
observed empirically that deep RNNs work better than shallower RNNs on
some tasks. However, it is not theoretically clear why they perform better.

Another extension of RNN is the bidirectional RNN (BI-RNN). A conventional
RNN only uses the past state, as seen in figure 2.4. However, the future state
might also hold useful information about the following words in a sequence. BI-
RNN attempts to deal with this by maintaining two separate states. Each state
has two layers. BI-RNN runs the input in two ways: one from front-to-back
and one from back-to-front.


Simple RNN
The most conventional RNN is called simple RNN (S-RNN), and it was pro-
posed by Elman [10]. Mikolov [27] later explored S-RNN for use in NLP [13].
It builds a strong foundation for tasks such as sequence tagging and language
modelling. However, S-RNN introduces a problem that causes the gradients
that carry information used in a parameter update to increase or decrease
rapidly over time. This problem is known as the exploding or vanishing gra-
dients problem, resulting in the gradients becoming so big or small that the
parameter updates carry no significant changes. In other words, this problem
causes the model not to learn. In later works, Hochreiter [15] proposed an
architecture known as Long Short-Term Memory which managed to overcome
the exploding and vanishing gradient problem.

LSTM
Long Short-Term Memory networks (LSTMs) are a special kind of RNN capable
of learning long-term dependencies [30]. The main difference between RNN
and LSTM is that an LSTM is made up of a memory cell, input and output
gate, and a forget gate [24]. The memory cell is responsible for remembering
dependencies in the input sequence, while the gates control how much of the
previous states should be memorized.
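
As a small illustration (assuming PyTorch; the dimensions are arbitrary example
values and not those used later in the thesis), an LSTM layer can be run over a
sequence of word embeddings as follows:

    import torch
    import torch.nn as nn

    embedding_dim, hidden_dim, seq_len = 16, 32, 5
    lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    x = torch.randn(1, seq_len, embedding_dim)     # one sequence of 5 word vectors
    output, (h_n, c_n) = lstm(x)
    print(output.shape)    # torch.Size([1, 5, 32]): one hidden state per input token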

2.2.2 Encoder-Decoder
Encoder-Decoder architecture is a standard method used in sequence-to-sequence
(seq2seq) NLP tasks such as translation. For RNN (section 2.2.1), an Encoder-
Decoder structure was first proposed by Cho et al (2014) [4]. The encoder
takes a sequence as an input and produces an encoder vector used as input to
the decoder. The decoder then predicts an output at each step with respect to
the previous states (auto-regression) until some END-token is generated. Fig-
ure 2.5 shows an RNN encoder-decoder architecture for seq2seq tasks where
hi is the hidden state, xi is the input sequence and y j is the output sequence.

2.2.3 Attention
An apparent disadvantage of the conventional encoder-decoder models de-
scribed in Section 2.2.2 is that the input sequence is compressed into a fixed-length vector.
This limits the model's ability to learn from later parts of a sequence, which may be truncated.
Additionally, early parts of long sequences are often
forgotten once the model has processed the entire sequence [4].

Bahdanau et al. (2016) [2] proposed an approach to solve the limitations
of encoder-decoder models by extending the encoder-decoder into a tech-
nique called Attention. Unlike the conventional encoder-decoder, Attention
allows the model to focus on relevant parts of an input sequence. The process
is done in two steps. First, instead of only passing the encoder's last hidden
state (context vector) to the decoder, the encoder passes all of its hidden states
from the previous time steps to the decoder. Second, the decoder gives each
encoder hidden state a score, where each of these states is associated with
a certain word from the input sequence. This way, the model does not train
on using one context vector but rather learns which parts of a sequence to
pay attention to. Bahdanau et al. provide an example shown in figure 2.6.
It illustrates Attention when translating the English input sequence: [", This,
will, change, my, future, with, my, family, ., ", the, man, said], to the French
target sequence: [", Cela, va, changer, mon, avenir, avec, ma, famille,", a, dit,
l’, homme, .]. It can be seen in the figure that the alignment of the words is
largely monotonic, hence the high attention score along the diagonal. How-
ever, some words are non-monotonic. For example, the English word "man"
is "l’homme" in French, and in the example, we can find high attention scores
both for "l’" and "homme".

Figure 2.6: Attention example illustrated by Bahdanau et al. [2]

2.2.4 Transformers
Transformers are an attention-based architecture consisting of two main compo-
nents, an encoder and a decoder. The model was introduced by Vaswani et al. [43]
to solve existing problems with recurrent models, presented in section 2.2.1,
that preclude parallelization, resulting in longer training times and a
drop in performance for longer dependencies. Due to the attention-based
non-sequential nature of transformers, they can be highly parallelized and reach
a constant sequential and path complexity, O(1). Transformers are used to
solve translation problems. The aim is to find a relationship between words
in an input sentence and combine it with an existing translation of that sen-
tence [3].

Encoder: The encoder consists of multiple encoder layers where each layer
has two sub-layers. The first sub-layer is a multi-head self-attention mecha-
nism. For instance, looking at the same example as mentioned in the previous
section (2.2.3), self-attention means that the target sequence is the same as the
input sequence. In other words, self-attention is just another form of atten-
tion mechanism that relates different positions of the same input sequence.
The term "multi-head" means that instead of computing the attention once,
it utilizes scaled dot-product attention, allowing multiple attention compu-
tations in parallel [43]. The second sub-layer is a simple position-wise, fully
connected feed-forward network. All sub-layers have a residual connection
and a layer normalization. The purpose of this is to add the output of each
sub-layer to its input. The left block in figure 2.7 shows the en-
coder of a transformer.
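
The scaled dot-product attention at the core of this sub-layer can be sketched as
follows (a simplified single-head NumPy example with random vectors, not the
thesis implementation; masking and the learned query/key/value projections are
omitted):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarities
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                                 # weighted sum of value vectors

    # Self-attention: queries, keys and values all come from the same sequence.
    X = np.random.rand(5, 8)            # 5 tokens with embedding size 8
    print(scaled_dot_product_attention(X, X, X).shape)     # (5, 8)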

Decoder: The decoder is structured similarly to the encoder. Still, it has an
additional sub-layer called masked multi-head attention, which is a modified
multi-head attention mechanism that, unlike self-attention, prevents it from
attending to subsequent positions. The goal of masking positions is to ensure
that the predictions made are not looking into the future of the target se-
quence [43]. The right block in figure 2.7 shows the decoder of a transformer.

Figure 2.7: Model architecture of a transformer illustrated by Vaswani et al. [43]

For instance, in translation tasks, the encoder is fed with words of a specific
language, processing each word simultaneously. It then generates embed-
dings for each word, which are vectors that describe the meaning of the words
in the form of numbers. Similar words have closer numbers in their respective
vectors. The decoder can then be used to predict the translation of a word in a
sentence by combining the embeddings from the encoder and the previously
generated translated sentence. The decoder predicts one word at a time until
the end of the sentence is reached.

Transformers have a token limitation of 512. The reason is that the mem-
ory and computation requirements for a transformer grow quadratically
with sequence length, making it impractical to process long sequences [43].
This means that transformers can only process input that is below 512 tokens.
Later, new solutions were introduced, such as Transformer XL that uses a
recurrent mechanism to handle text sequences longer than 512 tokens [6].
However, in most cases, it is sufficient to truncate sequences longer than 512
tokens to make them fit.

In general, the encoder learns what a word is in relation to the origin of
language, grammar and, more importantly, context. In contrast, the decoder
language, grammar and, more importantly, context. In contrast, the decoder
learns how the origin word relates to the target word in terms of language.


2.3 BERT
BERT is a transformer-based model, introduced by Devlin et al. [7]. The au-
thors motivate that previous language representation models, such as RNNs,
were limited in how they encode tokens by only considering the tokens in
one direction. Unlike RNNs, the authors utilize transformers, described in
section 2.2.4, to design a Bidirectional Encoder Representation from Transformers
(BERT), which is able to encode a token using tokens from both directions.

BERT can solve various types of problems such as question answering,
sentiment analysis, and text summarization. However, these problems re-
quire an understanding of language, which is solved by a pre-training and
a fine-tuning phase. The first phase consists of pre-training BERT to under-
stand language, and then fine-tuning is done so that BERT can learn to solve
a specific task.

2.3.1 Input and Output Embeddings


Similar to other language models, BERT processes each input token through a
token embedding layer to create a vector representation. Additionally, BERT
has two more embedding layers, called segment and position embeddings.
In figure 2.8, an illustration of the three embedding layers can be seen.

Figure 2.8: The input embeddings and embedding layers for BERT illustrated
by Devlin et al. [7]

Token, segment, and position embeddings are
summed into an input embedding for a given token. Before the embedding
layers process the input, the input text is tokenized using WordPiece.

WordPiece
BERT adopts WordPiece tokenization proposed by Wu et al. [44]. The aim
of using WordPiece tokenization was to improve the handling of rare words.
The solution was to divide words into sub-words (WordPieces) using a fixed


vocabulary set. BERT has a vocabulary size of 30,000 Word-
Pieces. In figure 2.9, an example of two words broken down into subwords is
shown. When the rarity of a word increases, the word can be broken down
into single characters. Additionally, every subword except the first subword
of a word is marked with a "##" prefix. The first subword is distinguished
from the rest of the subwords because it often carries the core meaning of the
whole word. For example, the word "bedding" can be split into the subwords
"bed" and "##ding". The subword "bed" conveys meaning to "bedding" because
the two words are closely related.

Figure 2.9: Two words broken down into sub-words using WordPiece tok-
enization

Finally, a "[CLS]" token is added to the start and a "[SEP]" token is added to the
end of a tokenized sentence. The objective of adding the extra tokens is to
distinguish a pair of sentences, which helps create the segment embeddings
described in section 2.3.1. Since BERT uses default transformer encoders (section
2.2.4), BERT is limited to processing input sequences of up to 512 tokens.
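
For illustration, the tokenizer behaviour described above can be reproduced with
the HuggingFace transformers library (an assumption made for this example, not a
tool prescribed by the thesis; the exact sub-word splits depend on the vocabulary of
the chosen checkpoint, which must be downloadable):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    print(tokenizer.tokenize("bedding"))            # e.g. ['bed', '##ding']

    ids = tokenizer.encode("The firefighter saved the cat.")
    print(tokenizer.convert_ids_to_tokens(ids))     # ['[CLS]', ..., '[SEP]'] added automatically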

Token Embeddings
The first step is to create vector representations of the tokenized input in the
token embeddings layer. Each token is represented by a 1x768 vector (hidden
size 768). For N input tokens, the token embeddings result in a matrix of shape
Nx768 or, as a tensor, 1xNx768.

Segment Embeddings
BERT can handle a pair of input sentences as shown in figure 2.10. The in-
puts are tokenized and concatenated to create a pair of tokenized sentences.
Thanks to the [SEP] token, BERT can distinguish two sentences and label the
sequence in binary.

Figure 2.10: Binary labels generated by a pair of input sentences.

The label sequence is then expanded into the same matrix shape as for token
embeddings, Nx768, where N is the number of tokens. For example, for the
paired input in figure 2.10, the segment embedding would result in a matrix
shape of 8x768.

Position Embedding
BERT is a transformer-based model and, therefore, will not process tokens
sequentially. Thus, to avoid BERT forgetting the order of tokens, position
embeddings are required. The position embeddings layer can be used as a
look-up table as illustrated in figure 2.11, where the index of a row represents
a token position. For example, in the two sentences "Cat is stuck" and "Call the
firefighter", the word pairs "Cat" - "Call", "is" - "the" and "stuck" - "firefighter"
have identical position embeddings.

Figure 2.11: Position embeddings layer.

2.3.2 Pre-training
The pre-training phase is done by training on two unsupervised tasks simulta-
neously, which are Masked Language Model (MLM) and Next Sentence Prediction
(NSP) [16].

Masked Language Model


Masked Language Modeling (MLM) is an unsupervised task done during the
pre-train of BERT. The goal of MLM is to help BERT understand deep bidi-
rectional representations. MLM is done by randomly masking 15% of all WordPiece
tokens in the input sequence. The tokens are masked by replacing them
with a [MASK] token, which BERT then identifies and predicts.
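
A simplified sketch of this masking step is shown below (illustrative only; the actual
BERT procedure additionally replaces a fraction of the selected tokens with random
tokens or leaves them unchanged instead of always using [MASK]):

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Randomly replace roughly 15% of the tokens with [MASK]; the model is
        # trained to predict the original tokens at the masked positions.
        masked, targets = [], []
        for tok in tokens:
            if random.random() < mask_prob:
                masked.append(mask_token)
                targets.append(tok)
            else:
                masked.append(tok)
        return masked, targets

    print(mask_tokens(["the", "fire", "##fighter", "saved", "the", "cat"]))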


Next Sentence Prediction


Next Sentence Prediction is another unsupervised task done during the pre-
train of BERT. The objective of this task is to capture the relationship between
two sentences. To capture sentence relationships, BERT is pre-trained for a bi-
narized next sentence prediction that can be generated from any monolingual
corpus [7]. This is done by setting 50% of the inputs to sentence pairs where
the second sentence is the subsequent sentence from the corpus. The other 50%
contain the same sentence pairs, except that the second sentence is instead
a random sentence selected from the corpus. For example, if A is a sentence
from the corpus, then 50% of the time B is the subsequent sentence of A, and
the other 50% of the time B is a random sentence from the corpus.

2.3.3 Fine-tuning
Fine-tuning allows the pre-trained BERT model to be used for specific NLP
tasks through supervised learning. It works by replacing the fully connected
output layers of the pre-trained BERT model with a new set of output layers
that can output an answer with respect to the NLP problem at hand. The new
model performs supervised learning with labeled data to update the weights
of the output layers. Since only the output layer weights are updated during
fine-tuning, the learning during fine-tuning is relatively fast [7].

2.3.4 Pretrained BERT models


As described in the previous section (2.3), a BERT model has to be pre-trained
before it is fine-tuned on different tasks because the model needs to be taught
to encode language. This process is both time and resource-consuming. For
example, Devlin et al. [7] pre-trained the BERT model for four days using
four cloud TPUs (16 TPU chips in total). Therefore many BERT models are
released as pre-trained models with initialized parameters ready for specific
tasks. In turn, fine-tuning can be done on the pre-trained model to be used
for particular tasks.

Norwegian BERT: At the time of writing, the SOTA monolingual BERT model
supporting the Norwegian language (both Bokmål and Nynorsk) is made by
the National Library of Norway 1. It is based on the same structure as the
multilingual BERT (2.3.4) and trained on a wide variety of Norwegian text in
both Bokmål and Nynorsk from the last 200 years.

Multilingual BERT: Multilingual BERT (M-BERT) is a BERT-based model
pre-trained on concatenated monolingual Wikipedia corpora from 104 lan-
guages 2. A study by Pires et al. [38] shows that M-BERT
performs exceptionally well on zero-shot cross-lingual model transfer, mean-
ing that M-BERT can be fine-tuned using task-specific supervised data from one
language and evaluated in a different language. This was done in a paper
by Elmadani et al. [9]. They applied M-BERT to Arabic text summariza-
tion and showed how effective it could be in low-resource situations for both
extractive and abstractive approaches.

1 https://github.com/NBAiLab/notram
2 https://github.com/google-research/bert/blob/master/multilingual.md
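
For illustration, a pre-trained model of this kind can be loaded with the HuggingFace
transformers library as sketched below (an assumption made for this example; the
checkpoint name is one of the publicly released multilingual models and is not
necessarily the exact checkpoint used in this thesis):

    from transformers import BertModel, BertTokenizer

    name = "bert-base-multilingual-cased"            # publicly released M-BERT checkpoint
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertModel.from_pretrained(name)

    inputs = tokenizer("Brannmannen reddet katten.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)           # (1, N, 768) token representations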

2.4 Extractive Text Summarization Methods


Today there exist different extractive methods for text summarization. In this
section, the two well-known unsupervised methods TF-IDF and TextRank
are presented in sections 2.4.1 and 2.4.2. Furthermore, section 2.4.3 presents a
supervised method called BERTSum, which utilizes the language model BERT
for text summarization.

2.4.1 TF-IDF
TF-IDF is short for term frequency-inverse document frequency. It is a nu-
merical statistic that reflects how important a word is to a document within a
corpus [39].

Term weighting based on term frequency was first introduced by Luhn [25].
Luhn stated that the importance of a term is proportional to its frequency. In
mathematical terms, this can be described as:

tf(t, d) = f_{t,d}    (2.3)

As seen in eq. 2.3, the term frequency tf is equal to the frequency f of a
term t found in a document d. For example, in the following sentence: The
firefighter rescued a cat. The cat is safe now. The term "cat" would have high
importance because it is mentioned multiple times. However, we can see that
common terms such as "the" are also weighted as important.

To solve the issue of common terms appearing as important words, Jones
[17] proposed a metric called Inverse Document Frequency (IDF). The idea is
to reduce the weighting of common terms and increase the weights of terms
that occur infrequently, see Equation 2.4.

idf(t, d) = log(n / n_t)    (2.4)

Here, terms are weighted based on the inverse fraction of the documents
containing a term. The fraction is calculated by dividing the total number of
documents n by the number of documents n_t containing the term t.

The combination of both TF and IDF favors more unique terms and dampens
common terms that occur in several documents. The combined equation is
presented as:

tf-idf(t, d) = f_{t,d} × log(n / n_t)    (2.5)

For sentence weighting, the same principle can be used. Document d in eq 2.5
can be reformulated as a sentence s and term t can be represented as a word
w. In this case, n would be the total number of sentences, and n_w would be the
number of sentences containing the word w. The final equation for sentence
weighting is:

tf-idf(w, s) = f_{w,s} × log(n / n_w)    (2.6)
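
A small sketch of sentence weighting with Eq 2.6 is shown below (an illustrative
example, not the thesis implementation; tokenization here is a simple whitespace
split):

    import math
    from collections import Counter

    def tfidf_sentence_scores(sentences):
        # Score each sentence as the sum of tf-idf weights of its words (Eq 2.6).
        tokenized = [s.lower().replace(".", "").split() for s in sentences]
        n = len(tokenized)
        # n_w: number of sentences containing each word
        df = Counter(word for sent in tokenized for word in set(sent))
        scores = []
        for sent in tokenized:
            tf = Counter(sent)                        # f_{w,s}
            scores.append(sum(tf[w] * math.log(n / df[w]) for w in tf))
        return scores

    sents = ["The firefighter rescued a cat.", "The cat is safe now."]
    print(tfidf_sentence_scores(sents))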

2.4.2 TextRank
TextRank is a graph-based ranking algorithm proposed by Mihalcea and
Tarau [26]. The ranking is done by deciding the importance of a vertex in a
graph based on global information drawn recursively from the entire graph.
This is done by linking one vertex to another. The importance of a vertex is
measured by the number of links to other vertices as well as the score of the
vertices casting the votes.

A directed graph can be defined as G = (V, E), where V is a set of ver-
tices and E is a set of edges. E is, in turn, a subset of V × V. For a vertex V_i,
let In(V_i) be the set of vertices pointing to it and let Out(V_i) be the vertices V_i


points to [26]. The score of a vertex V_i, indicating its importance, is defined
following Brin et al. [35] as:

S(V_i) = (1 − d) + d × Σ_{j ∈ In(V_i)} S(V_j) / |Out(V_j)|,    where 0 < d < 1    (2.7)

In equation 2.7, d is a damping factor that sets the probability of jumping
from a given vertex to another.

TextRank can be applied for sentence extraction as proposed by Mihalcea
and Tarau. This is done by setting the number of vertices of a TextRank graph
equal to the number of sentences, so that each vertex represents a sentence. The
edge weight between two vertices is given by the similarity between the two
sentences. Additionally, the similarity function proposed by Mihalcea and Tarau
normalizes the content overlap by the sentence lengths to
avoid promoting long sentences.

Given two sentences S_i and S_j, where a sentence is represented by a set
of N_i words (S_i = ω^i_1, ω^i_2, ..., ω^i_{N_i}), the similarity between S_i and S_j is defined
as (Mihalcea and Tarau):

Similarity(S_i, S_j) = |{ω_k : ω_k ∈ S_i & ω_k ∈ S_j}| / (log |S_i| + log |S_j|)    (2.8)
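
Sentence extraction with TextRank can be sketched as below (illustrative only; it
assumes the networkx library, whose pagerank function performs the iterative
scoring of Eq 2.7 with the damping factor d passed as alpha, and uses the similarity
of Eq 2.8 as edge weights):

    import math
    import networkx as nx

    def similarity(s1, s2):
        # Eq 2.8: word overlap normalized by the log of the sentence lengths.
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        denom = math.log(len(w1)) + math.log(len(w2))
        return len(w1 & w2) / denom if denom > 0 else 0.0

    def textrank(sentences, top_n=2):
        graph = nx.Graph()
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                graph.add_edge(i, j, weight=similarity(sentences[i], sentences[j]))
        scores = nx.pagerank(graph, alpha=0.85, weight="weight")   # alpha = damping d
        best = sorted(scores, key=scores.get, reverse=True)[:top_n]
        return [sentences[i] for i in sorted(best)]                # keep original order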

2.4.3 BERTSum
BERT cannot be directly used for extractive summarization. There are two
problems at hand that Liu (2019) [23] points out. Firstly, BERT is trained using
a masked language model (section 2.3.2). Therefore, the output vectors corre-
spond to tokens rather than sentences. Secondly, although BERT has segmentation
embeddings for indicating different sentences, it can only differentiate a pair
of sentences because BERT is also trained on next sentence prediction (section
2.3.2).

Liu [23] proposes a method for handling multiple sentences using BERT
by inserting [CLS] tokens before each sentence and a [SEP] token after each
sentence. To distinguish multiple sentences rather than just two, in-
terval segment embeddings can be used. This means that each token in a
sentence will be assigned the same segment embedding E_A or E_B depending
on whether the position of the sentence is odd or even. As seen in figure 2.12,
the outputs of the BERT layer, shown as T_i, are the vectors of the corresponding
[CLS] tokens from the top BERT layer. Each T_i is treated as a sentence
representation of sentence i.
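
The input construction described above can be sketched as follows (a simplified
illustration with whitespace tokenization instead of WordPiece; not Liu's actual
implementation):

    def bertsum_input(sentences):
        # Insert a [CLS] token before and a [SEP] token after every sentence, and
        # assign alternating interval segment ids (E_A / E_B) per sentence.
        tokens, segments, cls_positions = [], [], []
        for i, sent in enumerate(sentences):
            cls_positions.append(len(tokens))            # index of this sentence's [CLS]
            sent_tokens = ["[CLS]"] + sent.lower().split() + ["[SEP]"]
            tokens += sent_tokens
            segments += [i % 2] * len(sent_tokens)
        return tokens, segments, cls_positions

    toks, segs, cls_pos = bertsum_input(["The firefighter saved the cat.", "The cat is safe now."])
    print(cls_pos)    # positions whose output vectors T_i represent each sentence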


Figure 2.12: Architecture of BERTSum proposed by Liu [23]

After obtaining sentence representations for multiple sentences, Liu sug-
gests several methods to capture document-level features for extracting
summaries:

1. Using a simple classifier on top of the BERT outputs and a sigmoid function
to get a predicted score.

2. An inter-sentence transformer, adding more transformer layers on top of the
BERT outputs, followed by a simple classifier together with a sigmoid function.

3. Applying an LSTM layer on top of the BERT outputs together with a simple
classifier and a sigmoid function.

In Liu's experiments, the second option in the list above, with
two transformer layers, showed the best performance. The loss
of the model is the binary cross-entropy of a prediction against its gold labels
[23].

The predicted sentences from BERTSum are ranked by their importance, which is
represented by a score. Before selecting sentences by their score, the author
(Liu) implemented Trigram Blocking, introduced by Paulus et al. (2018) [37], to
reduce redundancy by minimizing similarity between the selected sentences
in the predicted summary.
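
Trigram blocking can be sketched as below (an illustrative simplification, not Liu's
exact code): a candidate sentence is skipped if it shares any trigram with the
sentences already selected.

    def get_trigrams(sentence):
        words = sentence.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    def select_sentences(ranked_sentences, max_sentences=3):
        # ranked_sentences are assumed to be sorted by predicted score, best first.
        selected, seen_trigrams = [], set()
        for sent in ranked_sentences:
            trigrams = get_trigrams(sent)
            if trigrams & seen_trigrams:
                continue                      # shares a trigram with the summary so far
            selected.append(sent)
            seen_trigrams |= trigrams
            if len(selected) == max_sentences:
                break
        return selected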

Like BERT, the sequence input for BERTSum has a limit of 512 tokens.


2.5 Evaluation metrics for summarization


Metrics that would traditionally be used to evaluate text summaries are co-
herence, conciseness, grammaticality, readability, and content [21]. These are
metrics that experts consider when writing summaries, and since experts are
going to use the developed tool, they become important. Evaluating sum-
maries manually does not scale well since it would require huge amounts
of time and effort to evaluate the hundreds, or even thousands, of summaries
that exist. Therefore, it is crucial to complement human evaluation with quan-
titative evaluation methods and metrics that evaluate summaries automatically.

2.5.1 Precision, Recall and F-Score


An extractive text summary can be seen as a binary classification problem
where 1 indicates that a sentence from the document is extracted and 0 in-
dicates that it is not. In statistical analysis of binary classification, Precision,
Recall, and F-Score measure the test’s accuracy. The precision score is the
number of true positive results divided by all selected positive results. The
recall score is the number of true positives divided by all positive values. An-
other way to interpret these values is to think of the precision score as how
many of the selected items that are relevant and the recall score as how many
of the relevant items that are selected. It then becomes clear that these values
alone are not always applicable. For example, if we were to pick out three
red apples in a bowl of ten apples, we could achieve a high precision score
by simply picking one red apple. Similarly, we would achieve a high recall
score by simply picking all ten apples in the bowl. In these cases, the F1 score,
known as the harmonic mean of precision and recall, becomes useful. The F1
score is calculated as in Equation 2.9.

F1 = 2 · (precision score · recall score) / (precision score + recall score)    (2.9)
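
As a small worked example (illustrative only), Eq 2.9 applied to the apple scenario
above gives a high precision but a low recall, and hence a moderate F1:

    def f1_score(true_pos, selected, relevant):
        precision = true_pos / selected if selected else 0.0
        recall = true_pos / relevant if relevant else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Picking one red apple out of the three relevant ones:
    print(f1_score(true_pos=1, selected=1, relevant=3))   # 0.5 (precision 1.0, recall 1/3)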

2.5.2 ROUGE
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of met-
rics presented by Lin [21] in 2004 for automatic evaluation of summaries. The
metrics compare machine-generated summaries against one or multiple refer-
ence summaries created by humans. The ROUGE-1, ROUGE-2, and ROUGE-
L metrics are commonly used for benchmarking of document summariza-
tion models, such as on the leaderboard for document summarization on the
CNN/Daily Mail dataset [5]. For each metric, the recall, precision, and F1
score are generally computed. With ROUGE, the true positives are the words
in the sentences of the reference summaries.


ROUGE-N: N-gram Co-Occurrence Statistics


ROUGE-N is defined as the overlap of n-grams between the candidate sum-
mary and the reference summary. As mentioned, the most common metrics
are ROUGE-1 and ROUGE-2, where ROUGE-1 refers to the overlap of un-
igrams (single words), and ROUGE-2 refers to the overlap of bigrams (two
adjacent words).
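The n-gram counting behind ROUGE-N can be sketched as follows; this is a simplified illustration, whereas the actual evaluation in this work uses an existing ROUGE package:

from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision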

ROUGE-L: Longest Common Subsequence


ROUGE-L refers to the Longest Common Subsequence (LCS) of words be-
tween a candidate summary and a reference summary. It reflects similarity
on sentence-level based on the longest in-sequence matches of words.
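The longest common subsequence that ROUGE-L builds on can be computed with standard dynamic programming; again, this is only an illustration:

def lcs_length(candidate_tokens, reference_tokens):
    m, n = len(candidate_tokens), len(reference_tokens)
    # table[i][j] = LCS length of the first i candidate and j reference tokens
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate_tokens[i - 1] == reference_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

# ROUGE-L recall and precision are then lcs / len(reference) and lcs / len(candidate).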

ROUGE-W: Weighted Longest Common Subsequence


One disadvantage of the longest common subsequence used in ROUGE-L is that it does not favor consecutive matches. This means that word sequences with small gaps between the matched words receive the same ROUGE score as sequences with larger gaps. ROUGE-W deals with this problem by taking the length of consecutive matches into account, giving a weighted longest common subsequence [21].

ROUGE-S: Skip-Bigram Co-Occurrence Statistics


One can think of ROUGE-S as a generalization of ROUGE-2. Instead of measuring the overlap of bigrams, ROUGE-S measures the overlap of skip-bigrams between the candidate and the set of reference summaries. A skip-bigram is any pair of words in their sentence order. A sentence with x words has x!/(2!(x − 2)!) = x(x − 1)/2 skip-bigrams; a four-word sentence, for example, has 4 · 3/2 = 6 skip-bigrams.

ROUGE-SU: Extension of ROUGE-S


ROUGE-SU is an extension of ROUGE-S that additionally considers unigrams
as a counting unit. The extension could be necessary if, for example, a candi-
date sentence is the exact reverse of the reference summary. In that case, only
using ROUGE-S would result in a score of zero even though the sentence has
single word co-occurrences. With ROUGE-SU, a reversed candidate sentence
would get a higher score than sentences that do not have a single word co-
occurrence with the reference sentence.

ROUGE Example
For clarification on ROUGE scores, let us investigate the following example:
sentence one is a reference sentence, and sentences two and three are candi-
dates.


1. The firefighter saved the cat.

2. The firefighter rescued cat.

3. Cat saved the firefighter.

Considering ROUGE-1, sentence three gives the best match, with a recall score of 4/5 = 0.8 and a precision score of 4/4 = 1. In the case of ROUGE-L, sentence two is preferred, with a recall score of 3/5 = 0.6 and a precision score of 3/4 = 0.75. For ROUGE-2, sentence two shares only the bigram "the firefighter" with the reference, giving a recall score of 1/4 = 0.25 and a precision score of 1/3 ≈ 0.33, whereas sentence three shares the bigrams "saved the" and "the firefighter", giving a recall score of 2/4 = 0.5 and a precision score of 2/3 ≈ 0.67. The importance of this example is to understand that focusing on only one type of ROUGE score does not always provide good insight. Intuitively, it can be agreed that sentence two is the one that best fits the reference sentence, since sentence three completely changes the meaning. But according to ROUGE-1 and ROUGE-2, sentence three is preferred. This example shows why combining several ROUGE metrics is often a good idea, and why it is important to complement the results with a qualitative evaluation.

ROUGE Limitations
Despite ROUGE's popularity in papers on automatic text summarization, there are some limitations that must be addressed:

1. ROUGE only considers content selection and not other aspects such as fluency and coherence.

2. ROUGE relies only on exact overlaps, but a summary can express the same content as an article without exact overlaps, using other words and synonyms.

3. ROUGE was first designed to be used with multiple reference summaries, with the consideration that summaries are subjective. However, most datasets today only provide a single reference summary for each input.

2.5.3 Qualitative Evaluation


A qualitative evaluation method is often used together with quantitative data
to deepen the understanding of the statistical numbers. Patton [36] suggests
three kinds of data collection methods for qualitative evaluation:

• Open-ended interviews.

• Direct observations.

• Written documents.


The purpose of these methods is to gather information and insights that are
useful for decision-making. Qualitative methods should therefore be appro-
priate and suitable, which means that it is essential to determine qualitative
strategies, data collection options, and analysis approaches based on the eval-
uation’s purpose. An example of a method that combines quantitative mea-
surements and qualitative data is a questionnaire, or interview, that asks both
fixed-choice questions and open-ended questions.

3 Method

In this chapter, the methodology for creating a text summarization model is


described. Firstly, in section 3.1, the datasets that were used, their properties
and features are introduced. Secondly, in section 3.2, implementation tech-
niques are described in three parts: pre-processing, binary label generation,
and fine-tuning. Finally, in section 3.3, we cover the methods used for evalu-
ating our different models.

3.1 Datasets
The following section presents the features and properties of the two datasets
used in this work.

3.1.1 CNN/DailyMail
The CNN/DailyMail dataset1 was initially developed for machine reading comprehension and abstractive question answering by Hermann et al. [14]. The dataset contains over 300k unique news articles written in English by journalists at CNN and the Daily Mail. Their script to obtain the data was
later modified by Nallapati et al. [29] to support training models of abstrac-
tive and extractive text summarization, using multi-sentence summaries.
Both of these datasets are anonymized versions, where the data has been
pre-processed to replace named entities with unique identifier labels. A third
version of the dataset also exists, which operates directly on the original text
(non-anonymized), see [41].

1 https://cs.nyu.edu/~kcho/DMQA/


The CNN/DailyMail dataset consists of two main features: articles, which


are strings containing the body of the news article, and multi-sentence sum-
maries, which are strings containing the highlight of the article as written by
the article author. Table 3.1 displays the average token count and number of
sentences in the dataset.
Table 3.1: Average token and sentence count for news articles and summaries
in the CNN/DailyMail dataset

Type Average Token count Average nr of sentences


News Articles 781 29.74
Multi-Sentence Summaries 56 3.75

Furthermore, the dataset is split into a train, validation and test set according
to Table 3.2.
Table 3.2: Dataset split
Dataset Split Number of Instances
Train 287,113
Validation 13,368
Test 11,490

Model performance on the CNN/DailyMail dataset is measured by the ROUGE scores of a model's predicted summaries compared to the golden summaries. The highest-achieving models can be found on the Papers With Code leaderboard2.

3.1.2 Aftenposten/Oppsummert
The Norwegian articles and summaries provided by Aftenposten (AP) and Oppsummert make up two datasets: one with 162k articles and one with 979 summaries. The columns of the article dataset are presented in Table 3.3. The summary dataset additionally contains an array of article IDs, which are the articles that the summary is based on. Table 3.4 presents each column in the summary dataset.

To get an idea of how many articles from the article dataset were used to create
the summaries, we plot this relation in Figure 3.1. As for the CNN/DailyMail
dataset, we were interested in examining the average number of sentences in
the AP/Oppsummert summaries. This plot is presented in Figure 3.2. Table
3.5 also displays the average token count and the average number of sentences
in the articles and summaries datasets.
2 https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail


Table 3.3: Article data type in the AP/Oppsummert dataset

Field Description
ARTICLE_ID The article’s ID
ARTICLE_TITLE The title of the article
ARTICLE_TEXT raw article text data
ARTICLE_NEWSROOM The newsroom is Aftenposten
LAST_UPDATE The date of when the article was last updated

Table 3.4: Summary data type in the AP/Oppsummert dataset

Field Description
ARTICLE_ID The summary’s ID
ARTICLE_TITLE The title of the summary
ARTICLE_TEXT raw summary text data
ARTICLE_NEWSROOM The newsroom is always Aftenposten
LAST_UPDATE The date of when the summary was last updated
SUMMARIZED_ARTICLES An array of connected article IDs

Table 3.5: Average token and sentence count for news articles and summaries
for AP/Oppsummert

Feature Mean Token count Average nr of sentences


News Articles 703 40.3
Multi-Sentence Summaries 154 9.5

Compared to the CNN/DailyMail dataset, the AP/Oppsummert dataset is more varied, both in the number of sentences per summary and in the number of articles associated with each summary.


Figure 3.1: Summaries associated with x articles in the AP/Oppsummert dataset

Figure 3.2: Number of sentences in the AP/Oppsummert summaries dataset

3.2 Implementation
In this section, the implementation of an automatic extractive text summarization model is described, together with the different problems we had to overcome. Firstly, in sections 3.2.1, 3.2.2 and 3.2.3, the dataset is restructured, truncated and labeled. Secondly, in section 3.2.4, the different model implementations are described. Lastly, in sections 3.2.6, 3.2.7 and 3.2.8, the fine-tuning, the prediction and the hardware used for implementing a BERT-based model are described.

3.2.1 Restructure of the AP/Oppsummert dataset


When training a model, one of the most important aspects is to have good
training data. The CNN/DailyMail dataset is relatively straightforward to
use, having one summary per article. However, this was not the case for the
AP/Oppsummert dataset since some of the summaries have multiple related
articles. We, therefore, analyzed these summaries, together with their related
articles, to identify where the summaries’ content comes from. Our method
for this was to plot the ROUGE scores for articles that maximize the ROUGE-2
and ROUGE-l recall scores against their gold summaries, an approach similar
to the sentence selection algorithm presented in the BERTSum paper[23]. The
objective was to visualize and compare the score between the top-scoring
articles and the second scoring articles. For the summaries with only one ar-
ticle, we plot their ROUGE-2 and ROUGE-l recall score against the summary
directly to understand how extractive they are, i.e., high ROUGE score mean-
ing that the summaries use similar words and sentences as the connected
article. These plots are presented in Figure 3.3. The reason for using the
recall score is that we were not interested in the length variation of the arti-
cles, only to what extent the summaries use content from the different articles.

From the graphs presented in Figure 3.3 we can draw two important conclu-
sions about the AP/Oppsummert dataset:

1. The summaries with only one article are predominantly extractively written (since they have high ROUGE-2 and ROUGE-L scores).

2. The summaries with more articles regularly use sentences from only one main article (since both scores from the second-best article are far worse than the scores from the top-scoring article).

Based on these two conclusions, the dataset was restructured so that summaries with multiple related article IDs were connected only to the highest-scoring article in that set. The field SUMMARIZED_ARTICLES was therefore updated from an array of IDs to a single article ID.
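A sketch of this restructuring step is shown below. The helper rouge_recall, assumed here to return the ROUGE-2 and ROUGE-L recall of an article against a summary, and the variable names are illustrative:

def best_source_article(summary_text, article_ids, articles, rouge_recall):
    """Keep only the article whose combined ROUGE-2 and ROUGE-L recall
    against the gold summary is the highest (illustrative sketch)."""
    def combined_recall(article_id):
        r2, rl = rouge_recall(articles[article_id], summary_text)
        return r2 + rl
    return max(article_ids, key=combined_recall)

# The SUMMARIZED_ARTICLES field is then reduced from a list of IDs to this single ID.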



Figure 3.3: ROUGE-2 and ROUGE-L recall scores for summaries with one
article in (a) and (b), summaries with more articles and the top-scoring articles
in (c) and (d), and summaries with more articles and the second-best articles
in (e) and (f).


3.2.2 Truncation of articles


Before text articles can be used in a model like BERTSum, token limits must
be addressed. As mentioned in 2.4.3, both BERT and BERTSum have an input
limit of 512 tokens. The news articles in the CNN/DailyMail dataset have
a mean token count of 781, and the news articles in the AP/Oppsummert
dataset have a mean token count of 703. This means that the token limit of
BERTSum is indeed a problem when using these datasets.

Different approaches to handling token limitation have been suggested in


previous works [23] [42]. A standard method is to truncate longer texts to fit
the model’s token limit. The problem with this method is the loss of data that
it introduces. If important information is discarded, it will result in a poorly
trained model.

For news articles, important information is primarily presented in the


first third of the article [20]. This observation is also the case for the
CNN/DailyMail dataset, as demonstrated by Liu [23]. Using ROUGE, we examined whether this is also true for the AP/Oppsummert dataset. We did this by plotting the positions of every document's Oracle sentences, i.e., the sentences with the highest ROUGE scores against the document's golden summary, see Figure 3.4. In this particular plot, we chose Oracle-3, i.e., the three top-scoring sentences.

Figure 3.4 shows that the top-scoring sentences in the AP/Oppsummert


dataset mainly occur at the beginning of the articles. Therefore, it was decided that longer articles that do not fit the token limit of 512 should be truncated to keep only the first sentences, since this is where the most important information tends to exist in each document. The same approach was chosen for the truncation of the CNN/DailyMail dataset, based on the results from Liu [23].
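A sketch of this truncation strategy is shown below. The tokenizer is assumed to be compatible with the pre-trained model, and the per-sentence special tokens used by BERTSum are ignored for simplicity:

def truncate_to_token_limit(sentences, tokenizer, max_tokens=512):
    """Keep leading sentences until the token budget is reached (sketch)."""
    kept, used = [], 0
    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence))
        if used + n_tokens > max_tokens:
            break
        kept.append(sentence)
        used += n_tokens
    return kept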

3.2.3 Oracle Summary Generation


Similar to the approach followed in [22], we are indirectly using the abstrac-
tive summaries (gold summaries) to create oracle summaries for supervised
learning. Since the gold summaries from AP/Oppsummert are abstractive,
they can not directly be used for supervised learning. Therefore a greedy al-
gorithm based on calculating and maximizing the ROUGE-2 score between
sentences in the gold summary and the article is performed to generate an
oracle summary. An oracle summary contains label 1 for selected sentences
and 0 for the rest. Liu [22], also suggests a second algorithm for oracle sum-
mary generation. The second algorithm considers all sentence combinations

for maximizing the ROUGE-2 score; however, for many combinations, this algorithm can be time-consuming.

Figure 3.4: Proportion of sentences with the highest ROUGE score according to their position in the original article

3.2.4 Models
Six types of models were implemented for the task of extractive text summa-
rization. With each model, we predicted three, seven, and ten sentences for
the summaries. Out of the six models, TextRank and TF-IDF will only be used
as a comparison to the BERTSum models.

Oracle
Oracle was not only used for the label generation method described in section 3.2.3, but also as an upper limit for our BERT models. Since the oracle summaries are used as labels for our BERT models, the models cannot score higher than the oracle summaries; the oracle scores therefore serve as a ceiling for our BERT models.

We also experimented with both the greedy and combination algorithm


mentioned in section 3.2.3. However, the combination algorithm resulted in
slow performance, as Liu [22] mentioned, especially when selecting more
than three sentences. Thus, the greedy algorithm was used throughout the
experiments.
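A sketch of the greedy selection is shown below, where rouge_2 is an assumed scoring function; at each step, the sentence that most improves the ROUGE-2 score against the gold summary is added:

def greedy_oracle_labels(article_sentences, gold_summary, rouge_2, max_sentences=3):
    """Greedy oracle selection (sketch): repeatedly add the sentence that
    increases the ROUGE-2 score of the selection the most."""
    selected = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_candidate = None
        for i, _ in enumerate(article_sentences):
            if i in selected:
                continue
            candidate = " ".join(article_sentences[j] for j in sorted(selected + [i]))
            score = rouge_2(candidate, gold_summary)
            if score > best_score:
                best_score, best_candidate = score, i
        if best_candidate is None:  # no remaining sentence improves the score
            break
        selected.append(best_candidate)
    return [1 if i in selected else 0 for i in range(len(article_sentences))]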


LEAD
We used LEAD as a baseline, which selects the first sentences in an article.
From the previously presented analysis in section 3.2.2, with the position of
the highest ROUGE scoring sentences shown in Fig 3.4, we considered LEAD
to be a good baseline to use.

With Oracle and LEAD, we had a range within which we wanted our models' ROUGE scores to fall. For clarification: our models should perform above the LEAD score and at or below the Oracle score.

Next, we implemented two models based on statistical methods and two


models based on BERT and BERTsum.

TextRank
We adopted a Python implementation3 of TextRank based on the approach
followed in [26]. This implementation performs both keyword extraction and text summarization. We used the Natural Language Toolkit (NLTK)4 to download the necessary resources for stop word removal, tokenization, and stemming.

TF-IDF
An implementation5 of TF-IDF was adopted for extractive text summariza-
tion. The source code was updated to support Norwegian using Spacy6 , an
NLP toolkit similar to NLTK. Key sentences could then be extracted by rank-
ing the scores for each sentence in descending order.

BERTSum
We used BERTSum described in section 2.4.3 to fine-tune two pre-trained
BERT models for the task of extractive text summarization. The original
BERTSum code is currently using an older version of PyTorch and therefore is
not directly suitable for integrating new models. Thus, an updated version of
BERTSum by Microsoft 7 was used. PyTorch, introduced by Klein et al. [18], is
a toolkit for deep learning in Python. There are other currently popular deep
learning libraries, such as TensorFlow (Abadi et al. [1]). However, PyTorch was chosen for our task since both the original and the updated BERTSum code are written in PyTorch. PyTorch is also more suitable for development in Python as it is a more pythonic framework than TensorFlow. In this case, PyTorch is used for model processing, fitting, and prediction.

3 https://github.com/acatovic/textrank
4 https://www.nltk.org/
5 https://github.com/luisfredgs/extractive-text-summarization
6 https://spacy.io/
7 https://github.com/microsoft/nlp-recipes

The updated BERTSum code from Microsoft includes other natural language processing tools and is still maintained today. However, only the relevant functions were included in our source code. Compared to the original BERTSum code from the author Yang Liu, the updated code also provides a more readable code structure, better optimization techniques, and a more scalable solution for adding pre-trained BERT models.

Data loader: Data comes in many different formats and shapes. The pro-
vided code from Microsoft supports both text files and lists of strings as data
input. For loading and processing the CNN/DM dataset, a dataset loader is
already included by Microsoft. However, for the AP/Oppsummert dataset, a
script was implemented to shuffle and split the data into a train and test set
to match the input format.
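Such a split script can be as simple as the following sketch (the function name and the pair representation are illustrative):

import random

def shuffle_and_split(pairs, test_fraction=0.1, seed=42):
    """pairs is a list of (article_text, summary_text) tuples (sketch)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]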

Pre-trained models: The pre-trained BERT models were added using a


Python-based library called Transformers, provided by Hugging Face8. Transformers provides general-purpose architectures for different tasks in both PyTorch and TensorFlow, currently with over 32 pre-trained models spanning many different languages. Transformers includes an Auto Class feature that enables dynamic tokenizer and model initialization. Both pre-trained models mentioned in section 2.3.4, Norwegian BERT and Multilingual BERT, were found through Hugging Face and could be integrated using Transformers.
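As an illustration, loading a pre-trained model and its tokenizer through the Auto Classes can look like the following minimal sketch; the multilingual model identifier is shown as an example, and the exact identifiers used in this work may differ:

from transformers import AutoTokenizer, AutoModel

# Example identifier; the Norwegian BERT model is loaded the same way
# with its own identifier from the Hugging Face model hub.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The tokenizer converts text to IDs in the model's vocabulary.
input_ids = tokenizer("Dette er en setning.", return_tensors="pt")["input_ids"]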

3.2.5 Hyper-Parameters
Before fine-tuning our models, some hyper-parameters had to be set:

Summarization layer: As described in section 2.4.3, one can choose mul-


tiple summarization layer modes. The author, Liu [23] suggests using two
transformer layers as summarization layers. Thus, this is what we used.

Sequence Length: Since we want to use as many tokens as possible, the


sequence length was set to the maximum sequence length that could be used
with transformers, which was 512 tokens.

Batch size: To keep the memory consumption low, Peltarion [28] suggests a rule of thumb of keeping the product batch size × sequence length below 3000. Therefore, with the sequence length set to 512, the batch size used is approximately 6 (3000/512 ≈ 5.9).
8 https://huggingface.co/transformers/


Learning rate: A low learning rate was chosen to avoid catastrophic forgetting, which can happen when a model is fine-tuned: the newly fine-tuned model can forget the previously learned information. This issue is avoided by setting a very low learning rate, of the order of 10⁻⁵.

Epochs: Fine-tuning requires only a few epochs. Devlin et al. [7] used three
epochs for their fine-tuning experiments, and Peltarion [28] also suggests us-
ing three epochs or less. Thus we aimed to use three epochs.

Max steps: The number of steps was calculated to represent the number of epochs we want to train. With the following formula (3.1), we estimated the maximum number of steps:

Max steps = (training size × epochs) / batch size    (3.1)

Here, training size is the total number of data points in the training set, epochs is the number of epochs we use, and batch size is the number of data points used per step.
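As a hypothetical example, a training set of 800 articles trained for three epochs with a batch size of six gives 800 × 3 / 6 = 400 steps.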

3.2.6 Fine tuning


Once the pre-trained models had been integrated with BERTSum, we fine-
tuned Norwegian BERT and Multilingual BERT for the task of extractive text
summarization. Each model was fine-tuned with a different number of oracle
sentences to test how it affects the performance of a model during sentence
prediction. In this case, both models were fine-tuned with three, seven, and
ten oracle sentences. The motivation for choosing these numbers was based on Figure 3.2. Firstly, the figure shows that most AP/Oppsummert summaries contain around seven sentences, which motivates the selection of seven oracle sentences. Secondly, three oracle sentences were chosen to experiment with since the CNN/DM dataset by default only uses three oracle sentences. Thirdly, ten oracle sentences were chosen to investigate how a longer oracle summary would affect the performance of the models.

We used the data loader described in the previous section 3.2.4 to load
the AP/Oppsummert dataset, which in turn was shuffled and split into a
train and test set with 90 % and 10 % data, respectively. Using the provided
code from Microsoft, binary labels were generated through the oracle summary generation described in section 3.2.3. Using Transformers, the pre-trained Norwegian BERT model is downloaded and automatically initialized. The training data is also tokenized automatically using the Auto Class feature, converting the sequence of words of each article in the training set to IDs representing the pre-trained model's word embeddings (section 2.1.1). Then, using PyTorch and the specified hyper-parameters (section 3.2.5), a fitting is done, which automatically divides the training data into batches and iteratively calculates the loss, backpropagates to compute gradients, and updates the model weights.

The process of fine-tuning Multilingual BERT was done in two steps. Firstly,
the code from Microsoft provides a data loader for the CNN/DM dataset,
which we used to load the dataset. Because of hardware limitations men-
tioned in 3.2.8, 10 000 articles were used in the training set and 1 000 articles
in the test set. Similar to Norwegian BERT, the data is tokenized and the model is fine-tuned through BERTSum, using the pre-trained Multilingual BERT model.

Secondly, to improve the fine-tuned Multilingual BERT model, an additional fine-tuning was done using the AP/Oppsummert dataset. This additional fine-tuning started from the weights of the Multilingual BERT model fine-tuned on the CNN/DM dataset and fitted the model using the processed AP/Oppsummert data. The process is otherwise similar to the first step.

3.2.7 Prediction
Finally, predictions were made on the AP/Oppsummert test set, using the model to obtain a score for each sentence in every article of the test set. The sentences of each article are then ranked by their scores from highest to lowest, and the top-N sentences are selected as the summary, where N is the number of oracle sentences the model was fine-tuned on.
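A sketch of this ranking step (before any trigram blocking) is shown below; presenting the selected sentences in their original article order is a choice made here for illustration:

def top_n_summary(article_sentences, scores, n):
    """Rank sentences by predicted score and return the top-N (sketch)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # keep the original article order
    return [article_sentences[i] for i in chosen]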

As mentioned in section 2.4.3, BERTSum uses trigram blocking by default


during prediction to reduce redundancy. However, to check the necessity of
that, predictions with and without trigram blocking were tested.

3.2.8 Hardware
The training was done through Google Colab 9 which allows anyone to write
and execute python code through the browser. Google Colab provides single
GPUs such as Nvidia K80, T4, P4, and P100. However, there is no way for
users to choose which type of GPU to use. By displaying the allocated GPU
during runtime, we noticed that an Nvidia K80 was used in our case. In terms of memory, we were limited to 12 GB. Also, free Google Colab sessions last at most 12 hours.
9 https://colab.research.google.com


3.3 Evaluation
We used the automatic ROUGE evaluation metrics and human evaluation by
journalists from Aftenposten to evaluate our fine-tuned models. In this sec-
tion, we describe our implementation of these.

3.3.1 ROUGE Evaluation


The most popular package for computing ROUGE scores is the pyrouge pack-
age10 which is a wrapper for the ROUGE summarization evaluation package.
Pyrouge does, however, not support other languages than English at the time
of writing. It was therefore decided to use another package called py-rouge11, which is a Python implementation of the ROUGE metric that can be extended to other languages. Since we must evaluate articles in Norwegian, this package seemed to be the best fit.

We adopted the evaluation utilities provided by nlp-recipes12 to implement a script based on py-rouge that supports both English and Norwegian text. To add the language support for Norwegian, we added five Norwegian language-specific arguments:

• Sentence Splitter

• Word Tokenizer

• Pattern of characters to remove

• Stemmer

• Word Splitter

Our models were then evaluated using the AP/Oppsummert test data.
The models’ predictions were used as candidate summaries, and the sum-
maries written by the journalists at AP were used as reference summaries.
Once the ROUGE scores were computed, we saved the ROUGE-1, ROUGE-2,
and ROUGE-L scores. These metrics were chosen for easy comparison with
results on the CNN/DailyMail dataset and because they are most commonly
used for summarization tasks in other scientific research.
10 https://pypi.org/project/pyrouge/
11 https://pypi.org/project/py-rouge/
12 https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp/eval/rouge


3.3.2 Human Evaluation with Journalists


The human evaluation was performed with a questionnaire, containing both fixed-choice questions and open-ended questions, given to journalists at Aftenposten. The fixed-choice questions represented the quantitative part of the human evaluation and were designed to investigate our model's performance as assessed by the journalists. The following measures were considered: extracted key sentences, length, redundancy, and content coverage. The open-ended questions represented the qualitative part of the human evaluation. The journalists were presented with four articles and, for each of them, a summary predicted by M-BERT. The examples were taken from the test set: two examples had summaries with a high ROUGE score, and two examples had summaries with a low ROUGE score. This selection was chosen to avoid giving our model either an advantage or a disadvantage in terms of ROUGE score.

The journalists were then prompted to rank these summaries in different


categories by answering the following questions:

1. How well do you think this summary managed to extract key sentences
from the article?

2. How do you rate the summary in terms of length adequacy?

3. How do you rate the summary in terms of not being redundant (not
having repetitive sentences)?

4. How do you rate the summary in terms of content coverage?

For the open-ended questions, we asked the journalists to comment on other weaknesses and strengths of the summaries. At the end of the questionnaire, we also asked for overall improvement suggestions and comments.

4 Results

In this chapter, we present the results from our methods. First, in section 4.1,
we cover our implemented models and how they manage to predict which
sentences should be picked from the articles. Secondly, in section 4.2, we
present how well the models performed in our evaluations.

4.1 Implementation
As mentioned in section 3.2.4, six models for sentence prediction were im-
plemented. When these models were used on the test data, they managed to
extract a selected number of sentences from different positions in the original
articles. The time it took to fine-tune the Norwegian BERT and Multilingual
BERT is presented in Table 4.1.

Table 4.1: Time taken to fine-tune the Norwegian and Multilingual BERT models

Model Dataset Fine-tune time

nb-BERT AP/Oppsummert 45m 33s


mBERT CNN/DM + AP/Oppsummert 11h 8m 48s + 43m 6s

For Norwegian BERT and Multilingual BERT, the positions of the selected sentences in their original articles are shown in Figures 4.1 and 4.2. Six subplots are made for each model. The first subplot shows the respective model fine-tuned on the Oracle-3 training set with a prediction of three sentences on the Oracle-3 test set. The second and third subplots follow the same principle but for Oracle-7 with seven predicted sentences and Oracle-10 with ten predicted sentences. This is then repeated in subplots four, five and six, but without the use of trigram blocking. Oracle sentence positions are also presented in these subplots, showing where the top-scoring sentences are located in the original articles.

Since BERTSum has a limit of 512 tokens, the articles are truncated to fit that limit, and the predicted sentences will therefore come from within the truncated part. If one word is approximated as one token, counting the number of tokens per sentence for each article in our AP/Oppsummert test set shows that 512 tokens correspond to an average of 26 sentences.



Figure 4.1: Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3
(b) Oracle-7 (c) Oracle-10 with trigram blocking and on (d) Oracle-3 (e) Oracle-
7 and (f) Oracle-10 without trigram blocking.



Figure 4.2: Sentence selection for Multilingual BERT fine-tuned on (a) oracle-3
(b) oracle-7 (c) oracle-10 with trigram blocking and on (d) oracle-3 (e) oracle-7
and (f) oracle-10 without trigram blocking.


4.2 Evaluation
We present the evaluation results of the implemented models in this section. In section 4.2.1, the automatic ROUGE evaluation is presented, and in section 4.2.2 the human evaluation that was performed with the journalists of Aftenposten.

4.2.1 ROUGE Evaluation


The performance of TextRank, TF-IDF, NB-BERT, and M-BERT is presented in Table 4.2. The BERT models are also evaluated with and without trigram blocking (section 3.2.7).

4.2.2 Human Evaluation with Journalists


The results of the fixed-choice, quantitative human evaluation with journalists are shown in Figure 4.3. The scores, ranging from 1 to 5, indicate whether the quality of a measure is very weak, weak, satisfactory, strong or very strong. Here, examples one and two were chosen based on summaries with a high ROUGE score, and examples three and four were chosen based on summaries with a low ROUGE score. With regard to the categories in Figure 4.3, it is seen that the summaries with a high ROUGE score are:

1. strong in content coverage

2. satisfactory in non-redundancy

3. strong in summary length

4. strong in key sentence extraction

Furthermore, the summaries with low ROUGE scores are:

1. weak in content coverage

2. satisfactory in non-redundancy

3. satisfactory in summary length

4. weak in key sentence extraction

Overall, the journalists found the summaries with high ROUGE scores (examples 1 and 2) to perform better than those with low ROUGE scores (examples 3 and 4).

For the qualitative open-ended questions, we present comments on the summaries' strengths in Table 4.3 and comments about the summaries' weaknesses in Table 4.4. The overall improvement suggestions and overall comments are presented in Table 4.5 and Table 4.6, respectively.


Figure 4.3: Average human evaluation scores for each category where the
highest score for each example is 20


Table 4.2: ROUGE scores on AP/Oppsummert test data (116 articles). *With
trigram blocking

Model Dataset R1 R2 RL

Oracle-3 - 58.69 50.06 57.71


Oracle-7 - 71.66 59.81 70.53
Oracle-10 - 72.64 59.88 71.5
Oracle-all - 72.11 59.06 70.87

LEAD-3 - 39.6 28.61 38.42


LEAD-7 - 55.44 40.8 53.92
LEAD-10 - 41.48 29.94 40.63

Textrank-3 - 33.87 15.1 31.3


Textrank-7 - 41.38 20.23 39.08
Textrank-10 - 40.56 20.98 38.74

TF-IDF-3 - 33.6 12.37 30.72


TF-IDF-7 - 38.68 17.85 36.46
TF-IDF-10 - 38.11 19.5 36.39

nb-BERT-3 AP/Oppsummert 35.62 23.1 30.49


nb-BERT-7 AP/Oppsummert 55.12 40.3 48.18
nb-BERT-10 AP/Oppsummert 54.16 39.25 47.4

mBERT-3 CNN/DM + AP/Oppsummert 39.6 28.61 34.75


mBERT-7 CNN/DM + AP/Oppsummert 55.41 40.82 48.61
mBERT-10 CNN/DM + AP/Oppsummert 54.58 39.86 47.96

nb-BERT-3* AP/Oppsummert 35.59 23.07 30.48


nb-BERT-7* AP/Oppsummert 54.85 39.85 48.08
nb-BERT-10* AP/Oppsummert 53.14 37.62 46.3

mBERT-3* CNN/DM + AP/Oppsummert 39.67 28.72 35


mBERT-7* CNN/DM + AP/Oppsummert 55 40.11 48.27
mBERT-10* CNN/DM + AP/Oppsummert 53.43 37.92 46.64


Table 4.3: Journalists’ opinion reflecting their satisfaction with generated sum-
maries

Measures Sample comments

Key sentences: "The algorithm is good at picking out the key sen-
tences at the start of the articles"
Content Coverage "I think the summary all in all has captured the most
important aspects of the article"
"The most important content of the main story is
present in the summary"
"Most of the story is well covered here"
Summary Length "I think it was impressive to reduce the (quite long)
story to a summary this short"
"The length is good for such a long article"
"I think the length is good"


Table 4.4: Journalists’ opinion on generated summaries mentioning features


they found the algorithm performed weak.

Measures Sample comments

Key sentences: "The summary is missing some key information,


and contains some information that is not neces-
sary at all."
"Bullet point 5 and 6 doesn’t provide any insight
to the security issues. Bullet point 4 does nothing
else than provide gender"
Content Coverage: "The summaries doesn’t reflect all aspects of the
articles"
"The summary doesn’t get the main and most
important point here"
Context: "I don’t think quotes should be used in sum-
maries, especially if it has no context (which is
the case here)
"Bullet point 9 is a quote, but that is not clear in
the summary"
"Bullet point 4 is a quote, but it doesn’t say from
whom"
Redundancy: "A few sentences are unnecessarily repetitive"
"Bullet point 6 is redundant when you have
point 7"
"feels a little repetitive and excessive"
Subtitles in summary: "In several sentences, the algorithm has com-
bined subheaders and text"
"Bullet point 6 is just a one word subtitle"
Coherence: "It doesn’t seem like a summery (text), but more
like a listing of facts."
Truncation: "The summary seems to just cover the first half
of the main article (?)"
"Sometimes it seems as if the summaries did not
include the last third of the original stories"


Table 4.5: Journalists’ opinion on generated summaries reflecting potential for


improvement.

Response
1 "Better coverage of the whole article"
2 "The content. It doesn’t extract the most important
points of the article, and the way it’s written is not
very good."
3 "Mellomtitler should never be included in a summary
- this should be easy to avoid. In crime stories it is
often touchy ethical questions involved. I do not think
the summaries handled this in a satisfying way (the
lawyer’s quote in the Prinsdal-story)."
4 "See my comments on each story."
5 "The algorithm is good at picking out the key sen-
tences at the start of the articles, but seems to struggle
with contextualisation and order. Also, towards the
end of the articles, it often misses out on key informa-
tion or fails to summarize longer parts efficiently."

Table 4.6: Overall comments from journalists

Response
1 "This is impressively good, but not quite there. Some
important points/facts from some of the articles are
missing (when the main article is long, the second half
of it seems to be missing in the summary?)"
2 "I think the two first summaries were the best ones."
3 "No answer"
4 "Overall this was not too bad considering."
5 "With some improvements, this could be a useful tool
for creating a base structure for us to edit and rewrite.
That could save time. But as it stands, it would not be
a reliable automated tool for writing summaries."

5 Discussion

This chapter presents our thoughts and reflections about different parts of the
thesis. In section 5.1, we analyze our results and discuss their meaning. Then,
in section 5.2, we discuss our methods and how they could be improved.
Moreover, in section 5.3, we focus on our work in a wider context, giving
our thoughts on related ethical and societal aspects.

5.1 Results
In the following, we highlight our main findings from the results that were presented in chapter 4.

5.1.1 ROUGE Scores


Table 4.2 compares ROUGE scores between the various models. However, it is worth noticing that the Oracle models cannot be used directly for sentence
predictions since they are created from the golden summaries. The Oracles
are presented as the upper limit for our models to understand better how well
a model should perform if it always extracted the articles’ exact top ROUGE
scoring sentences. Keep in mind that the Oracles select sentences with a
greedy algorithm, making it possible to dynamically choose the number of
sentences that best match the golden summary for each article. In contrast,
the prediction models have a fixed output. They will, therefore, never per-
form above or similar to Oracle scores simply because they will not be able
to dynamically choose the number of sentences to predict for different articles.

The results presented in Table 4.2 show that the LEAD baselines give high


ROUGE scores compared to golden summaries. This observation strengthens


the statement that the summaries provided in the AP/Oppsummert dataset
are strongly extractive and usually take sentences from the beginning of
the document. What is a bit surprising is the fact that this naive approach
performs better than any other model for three and seven-sentence predic-
tions. This means that to create a summary with three or seven sentences,
selecting the first three or seven sentences in an article would result in higher
ROUGE scores. Thus, the models have not yet learned to sufficiently pick the
top-scoring Oracle sentences that we see in Table 4.2, a problem that will be
further discussed in 5.1.2.

When predicting summaries with ten sentences, we see a better performance


comparing the BERT-models with LEAD-10. We, therefore, consider this score
to be more relevant in terms of seeing what the models have learned during
training. When predicting ten sentences, the model evidently manages to
pick better sentences than those just at the beginning of each document.

When comparing nb-BERT with m-BERT, it can be noticed how m-BERT,


fine-tuned with the CNN/DM dataset, outperforms nb-BERT in all different
experiments. This indicates that m-BERT performs better than nb-BERT in
few-shot learning. Furthermore, the result is supported by the work of [9]
and [38], in which it was shown that m-BERT performs stronger than other
models when training data is limited.

It can be seen that the models without trigram blocking in general give a slightly better ROUGE score than the ones with trigram blocking. This result is expected considering the lead bias. However, the ROUGE score alone does not show how well the trigram blocking reduced the redundancies in the selected sentences. This is something that the human evaluation gives insight into, which was the reason for selecting the model using trigram blocking for the human evaluation. The results from this are discussed in section 5.1.3.

The statistical models did not perform well in terms of ROUGE scores, and we will not discuss their scores and performance further.

5.1.2 Sentence Selection


It is seen from the figures in section 4.1 that the models are prone to picking sentences at the beginning of an article, which results in a lead bias. For example, in Figure 4.2(a), where three sentences are predicted, we see that the first three sentences have a much higher proportion of being selected than the rest of the sentences in AP/Oppsummert's test set. The same pattern can be seen in Figure 4.2(b), where the model predicts seven sentences, but


sometimes it also picks sentences from positions eight, nine, and ten. In figure
4.2(c), ten predictions are made. However, in this case, the proportions of
the selected sentences are more distributed across the sentence positions in
AP/Oppsummert’s test set.

The behavior described above comes from the convention of placing


key sentences in the first part of an article, as shown in figure 3.4 for
AP/Oppsummert. The same problem is mentioned by Liu [23], who tested a non-pre-trained transformer called TransformerEXT that uses the same architecture as BERTSum. The author observed a clear bias towards the first sentences of the articles in the CNN/DM dataset and stated that TransformerEXT relied more on shallow positional features and less on deeper document representations. In our case, the same statement can be made. It is, however, unclear why the model is biased towards the first sentences. One factor could be that the limited data we currently have is not enough for the model to learn deeper document representations.

Zhu et al. [45] discuss that, especially in news articles, the leading sen-
tences are often the most important part of the article. Therefore LEAD, in
general, is a hard baseline to beat for many deep learning models. Moreover,
comparing the LEAD-3 score for CNN/DM mentioned in Liu’s paper [23] to
our LEAD-3 score for AP/Oppsummert, we see that we have a significantly
higher LEAD score. One explanation for this is that the data used in our work might have a higher positional bias than the CNN/DM dataset used in the work of [23].

In Figures 4.1 (d-f) and 4.2 (d-f), we observe a stronger lead bias when our BERT models predict summaries without trigram blocking. The predictions without trigram blocking got slightly better ROUGE scores than those with trigram blocking. However, we see a better distribution of the selected sentences with trigram blocking. Therefore, trigram blocking is applied in the model we developed in our work.

5.1.3 Human Evaluation with Journalists


Quantitative assessment
Regarding our quantitative human evaluation shown in figure 4.3, we pre-
sented four different examples to five journalists. Every block of a stacked bar in the figure shows the average score of the journalists' responses. The blocks represent the different categories of a summary that we want to assess, presented through the questions shown in section 3.3.2. Since
the scores range from one to five, we expect that our model has performed


well enough in an example if every block in the bar has a score above three. For instance, examples one and two in Figure 4.3 show scores above three, which means that the model has succeeded in satisfying the journalists for those examples. However, the two last examples proved to be less satisfying to the journalists, because the model underperformed in aspects such as content coverage and key sentence extraction.

As described in the results (section 4.2.2), the first two examples are cho-
sen based on summaries with a high ROUGE score. In comparison, the last
two examples are selected based on summaries with a low ROUGE score.
This is interesting because, as observed in Figure 4.3, the journalists scored the first two examples as satisfying, while the last two examples underperformed in terms of the four categories presented in the figure. This similarity suggests that there could be an agreement between how the journalists assess the quality of a summary and the ROUGE metric.

Qualitative assessment
For the qualitative human evaluation, the results in section 4.2.2 present the
journalist’s thoughts for different aspects of the summaries. In the following,
we discuss these comments:

Key Sentences: Regarding key sentence extraction, we notice that the journalists consider the summaries to miss meaningful sentences and to include sentences that do not bring any value. It is, however, observed that sentences picked from the beginning are considered relevant for the summary, although these sentences alone are not sufficient for a satisfying summary.

Content Coverage: The results in Table 4.3 indicate that, in some cases, the summaries managed to cover the essential aspects of their main articles. This means that, for those cases, the leading sentences indeed cover the main content of the whole article, which matches the discussion about LEAD sentence selection in section 5.1.2. However, most comments in Tables 4.4 and 4.5 state that the summaries did not have sufficient content coverage, which means that only considering leading sentences is not enough.

Summary length: The summary length seems to be appreciated, which indicates a correct analysis of the previously written summary lengths, as presented in Figure 3.2.

Context: According to the results, most cases where sentences are consid-
ered to be out of context are when the model has extracted quotes. This could
be due to the fact that quoted persons are usually presented before or after the


actual quote. If the model then misses selecting this presentation, the quote
sentence will appear out of context. Another problem with context is when
it is not clear that a selected sentence is from a quote. This happens in cases
where the quote consists of several sentences, like in the following example: "Sentence A. Sentence B. Sentence C.", said the firefighter. In this case, if only sentence B is extracted, there will be nothing in the summary indicating that this sentence is part of a quote.

One reason for the journalists' dissatisfaction with context could be our extractive approach to text summarization. As mentioned earlier, the goal behind this approach was to avoid misleading context by creating an extractive rather than an abstractive summarization model. It was believed that summaries generated from extracted sentences would, in most cases, not go out of context. However, from the previous discussion, we observe that even though the sentences are extracted from the article, the summary as a whole can still give a different impression than the original article. On the other hand, one strength of the extractive summarization technique is that it guarantees that the summaries at least do not introduce new words and definitions, which might result in more significant problems.

Redundancy: Even though trigram blocking was used, the journalists were still not entirely satisfied with the model's performance in terms of redundancy. As a future direction, we aim to introduce techniques that further reduce redundancy. These comments are interesting considering the model's tendency to select most of the sentences from the beginning of the article, which indicates that the beginnings of these articles also contain redundant sentences. One would think that redundancy would become a bigger problem once the model learns to select sentences from a broader range of positions in the article.

Sub headers in summary: We also observed from the comments that a sum-
mary could contain the subheaders of an article. This problem could be linked
to AP/Oppsummert’s dataset format, which we discuss further in section
5.2.1.

Coherence: The journalists feel that the summaries read more like a listing of facts than a coherent summary. However, the summaries were never intended to be a fluent text, but rather bullet points of the most important sentences of the longer article.

Truncation: It is evident that the journalists notice the model's limitation of only covering the first part of the articles and that this has a negative impact on the summary. This result speaks against both BERT's and BERTSum's 512-token limit and the lead bias during fine-tuning. It seems that even though the first sentences often contain essential information, that is not enough reason to truncate articles and only use the first sentences for summarization.

We interpreted the journalists' general opinion to be that they do not believe the model in its current state can be used directly for article summarization. However, with some improvements, the model's potential can be seen.

5.2 Method
In this section, our chosen methods are discussed. More specifically, we look
at resource limitations and things that could have been done differently. We
also reflect on how these relate to the outcome of the results and whether
other approaches could lead to better results.

5.2.1 Datasets
In general, when training a model to perform a specific task with neural networks, it is essential to have datasets of high quality. In this section, we therefore discuss the quality of the datasets that were used to implement our models.

CNN/DM
The CNN/DM dataset is one of the most commonly used datasets for comparing text summarization models. It was also the dataset used by Liu [23] in the BERTSum paper. Therefore, we used CNN/DM as our dataset for fine-tuning the multilingual BERT model. The advantage of this is that the results of our model can be compared with those of Liu [23]. However, some drawbacks of using this dataset for this project have been identified:

1. The average number of sentences in the summaries is 3.75, while for AP/Oppsummert it is 9.5. The fine-tuning of mBERT could have made more sense if it was first trained on a dataset with a similar number of sentences in the summaries.
2. Like the AP/Oppsummert dataset, the CNN/DM dataset also has its most important sentences at the beginning of the document (see the corresponding figure in [23]). This means that after fine-tuning on CNN/DM, the model could already have a bias towards picking the first sentences of the articles. As already discussed in section 5.1.2, such a phenomenon is observed in the results of the model implemented in the current work as well. With this in mind, having the model train on CNN/DM, which likewise is biased towards the first sentences, may not be the best approach. It would have been interesting to investigate how the output would look if the model instead were trained on a dataset with a more distributed sentence selection.

AP/Oppsummert
The initial known limitation of the data provided by Aftenposten and Oppsummert was the limited amount of 979 summaries. However, during implementation, more limitations were observed. In the following, we list some of these issues, which could be improved in the future:

1. The fact that the summaries mainly consist of sentences that have just
been picked from the beginning of the article. We believe that was the
main reason behind the model’s tendency towards choosing most of the
sentences from the beginning of the article.

2. The variation in the number of related articles for each summary, which led to the removal of articles. In our work, restructuring the dataset was essential for training a single-document classifier. However, removing articles from summaries based on several articles can lead to non-optimal results, since potential top Oracle sentences are removed along with those articles.

3. Raw article data not containing HTML tags. The articles online are pre-
sented with HTML, but the data in the dataset only consisted of raw text.
This made it impossible for us to separate headlines from paragraphs,
which resulted in headlines being treated the same as sentences during
model training and prediction.

4. Summary data contained additional unrelated data. This was a problem


we discovered late in the development process. Some of the summaries
include text that is not necessarily relevant to the summary. For exam-
ple, it could be a prompt to read more of the topic in the main article.

Dealing with the first issue could not be done during the project’s implemen-
tation period. The human-written summaries have to be updated with new
summaries that consider other sentences of the articles.

For the second issue, a solution could be to implement a multi-document approach for the summaries written from more than one article. This is something that we never tested and that we considered out of scope for the project, especially since the goal was to implement a model that performs sentence selection on single documents.

The third and fourth issues relate to how the dataset was processed before we got access to the data. They could have been dealt with by manually going through the 979 summaries or by re-doing the data extraction process. However, since we did not have the resources, it was decided to stick with the current state of the data to save both time and labor.

5.2.2 Implementation
All fine-tune for the BERT models were done using Google Colab, which
worked efficiently for AP/Oppsummert dataset but not for CNN/DM as
the dataset was much larger. Fine-tuning on CNN/DM resulted in a longer
training period causing sessions in Google Colab to timeout because of the
limitations described in section 3.2.8. Therefore, as mentioned in section 3.2.6,
where CNN/DM was used in our experiments: we only extract 10 000 sam-
ples to our training set (section 3.2.6). The time it took to fine-tune the BERT
models are shown in table 4.1. It would have been interesting to fine-tune
M-BERT on all samples in CNN/DM. However, because of the limitations of
Google Colab, that became out of the scope of this work.

The initial thought for model optimization was to include a validation set during our model fine-tuning process, giving us insight into when to stop the training. However, we chose to skip this step for two main reasons:

1. We are limited by the number of samples we have in AP/Oppsummert’s


dataset. Splitting that into an additional set for validation would reduce
the number of samples we have in our training set.

2. As mentioned in section 3.2.5, we decided to fine-tune our models for a


fixed number of three epochs. With the number of epochs fixed, there is
no need to have a validation set for deciding when training should stop.

5.2.3 Metrics
The evaluation was done in two parts. In this section, we discuss the ade-
quacy of ROUGE and our human evaluation.

ROUGE
The limitations of ROUGE, as mentioned in section 2.5.2, have been widely discussed in the literature. Many authors question its suitability as an evaluation metric for summaries and the fact that it is used to claim state-of-the-art performance of models. However, to our knowledge, there is no
other good metric today that can be used to compare evaluation results across
papers, and developing a new metric would be out of scope for the project.
Therefore, we decided to use ROUGE as one of our metrics. However, we
were also interested to know how journalists assess the model’s performance.
Therefore, an evaluation study was carried out as a complement to ROUGE.


Using ROUGE, we assumed that the golden summary is the true best summary of the article. Such an assumption has some disadvantages. There is no reason why one summary has to be better than another, since this is subjective. Two very different summaries could be equally good, simply because both can capture the full context of an article. This means that potentially good summaries are ignored because they do not use similar words as the reference summaries. For evaluation, this problem can be addressed by additional evaluation. However, that is not the case when we use ROUGE to create Oracle sentences for fine-tuning our models. This means that the model will limit its learning to the "true best" summary that we have for that article.

One way to address the issue above is to formulate a new assumption which does not rely on the golden summary being the only solution. ROUGE does support comparison against several reference summaries, so one solution could be to train with a dataset that has more than one summary for each article. This, however, can become costly, since each article would need more than one journalist for summary writing, which there were simply no resources for during the limited project period.

One limitation that we did attempt to deal with is that ROUGE only considers content selection. We did this by using Liu’s [23] algorithm with trigram blocking to reduce redundancy among the selected sentences. According to the results presented in 4.2, the usage of trigram blocking did indeed lead to a broader range of selected sentences, which means that the algorithm managed to identify sentences that would otherwise have been redundant. A minimal sketch of the trigram-blocking idea is shown below.
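The sketch below illustrates trigram blocking as we understand it from Liu [23]: a candidate sentence is skipped if it shares any trigram with the sentences already selected. The function and variable names are our own illustration, not the implementation used in this work:

# Trigram blocking: skip a candidate sentence if it repeats a trigram
# that already occurs in the selected summary sentences.
def get_trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(ranked_sentences, max_sentences=3):
    """ranked_sentences: token lists sorted by the model's sentence score."""
    selected, seen_trigrams = [], set()
    for sentence in ranked_sentences:
        trigrams = get_trigrams(sentence)
        if trigrams & seen_trigrams:   # overlap means the sentence is redundant
            continue
        selected.append(sentence)
        seen_trigrams |= trigrams
        if len(selected) == max_sentences:
            break
    return selected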

Regarding aspects of summary quality that ROUGE's focus on content selection does not cover, such as coherence, we did not directly implement anything to counteract this. This is because our summaries are presented as bullet points; therefore, considering such measures was out of the scope of our work. If the summaries had been presented as fluent text, more effort would have been spent on covering this limitation.

Human Evaluation with Journalists
We consider having our model’s performance evaluated by journalists highly valuable to the research. Since these journalists write summaries themselves, we consider their answers highly reliable. However, we should have used this resource to a greater extent, especially at the beginning of the project. Throughout the implementation of the project, ROUGE was the only metric used for tracking progress, while the extensive human evaluation was only applied at the end.


We believe that the questionnaire was formed in a way that covered the quality of the summaries well. What makes human evaluation of summaries hard is how to interpret different opinions about quality, i.e., what is really meant by a summary being "really good" or "really bad". This needs to be made explicit somehow. Our way of doing this was to have the journalists rank how well each summary performed in specific categories. We defined that a good summary captures key sentences from the article, has a decent length, is not redundant, and covers the original article’s content well.

One could argue that the human evaluation is less reliable due to the small number of participants in the questionnaire. This would indeed be a problem if the results fluctuated. However, since the journalists more or less share the same opinion, we consider the human evaluation to still be sufficient to draw conclusions from.

5.3 The work in a wider context


Language models such as BERT require large datasets to be trained on to
achieve good results. Because of this, current advancements for monolin-
gual language models have mostly been made on large-resource languages
such as English. This makes it hard to train a monolingual model for other
low-resource languages. However, we see a potential in using a multilingual
model such as M-BERT, which supports many languages as mentioned in sec-
tion 2.3.4. From the results explained in section 5.1.1, we see that fine-tuning
the model with a larger dataset (CNN/DM) in a large-resource language, such
as English, can boost the performance of the task-specific model significantly.
When fine-tuning M-BERT with CNN/DM and our limited dataset in Norwe-
gian, we could outperform the monolingual Norwegian BERT model. Simi-
larly, our work can be used in other tasks involving a BERT model in other
low-resource languages.

5.3.1 Ethical Aspects
In this section, we highlight the ethical aspects of the summaries that our model automatically generates. In this case, the model extracts sentences from an article, which ensures that we are using the journalists’ own sentences. However, there are still some primary concerns that need to be addressed.

Firstly, when a journalist writes a summary, they are aware of what should be highlighted and how the summary should be structured to convey the article’s main points transparently. However, a machine-generated summary might not pick up the sentences that the journalist finds important. This could result in a shift of the main point of the summary. This problem can be seen in some of our extracted summaries and should be expected.

Secondly, a machine-generated summary might fail to pick sentences that depend on each other for their meaning. This can be problematic because if the generated summary only picks up one of two such coherent sentences, the information conveyed can be false. For example, if the coherent sentences are "A is true." and "However, not if B is false.", and only the first sentence is picked, the summary states that A is always true, which was not the case in the original context.

Depending on the topic and content, the issues mentioned can be critical. In its current state, the machine cannot make these judgments itself. Therefore, we advise using the extractive summarization model, BERTSum, as a tool to help the journalists write their summaries.

6 Conclusion

This chapter will summarize our concluding thoughts on the purpose and
the research questions described in our thesis. Ideas for improvements will
be discussed at the end of the chapter.

6.1 Conclusion
This thesis project aimed to develop a model that could extract the most relevant sentences from Norwegian news articles. This aim has, to an extent, been achieved, but with limitations. In the following, we conclude our investigations and then answer the main research question.

• How news summaries can be used to generate labeled data that is required for a supervised model
To generate labeled data from summaries, two algorithms have been presented:

1. Greedy sentence selection algorithm


2. Combination sentence selection algorithm

These algorithms create labels by maximizing the ROUGE score between the summaries and their original articles. In this work, the greedy algorithm is used because the combination algorithm quickly becomes computationally heavy the more sentences that are selected. Thus, in terms of computational efficiency, the greedy algorithm proved to be the better label generator for this task. A simplified sketch of the greedy approach is shown below.
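The sketch below is a simplified illustration of the greedy labeling idea, not the exact implementation: it uses a basic ROUGE-1 recall as a stand-in for the full ROUGE computation, and all function names are our own:

# Greedy oracle-label generation (simplified): repeatedly add the article
# sentence that most improves the ROUGE score against the gold summary.
from collections import Counter

def rouge1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram recall of the reference, a simplified ROUGE-1."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def greedy_oracle(article_sentences, summary_tokens, max_sentences=3):
    """article_sentences: list of token lists; returns indices used as labels."""
    selected, selected_tokens, best_score = [], [], 0.0
    while len(selected) < max_sentences:
        best_index = None
        for i, sentence in enumerate(article_sentences):
            if i in selected:
                continue
            score = rouge1_recall(selected_tokens + sentence, summary_tokens)
            if score > best_score:
                best_score, best_index = score, i
        if best_index is None:        # no remaining sentence improves the score
            break
        selected.append(best_index)
        selected_tokens += article_sentences[best_index]
    return sorted(selected)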


• How the model’s performance should be evaluated and assessed
The performance of a summarization model can be evaluated by comparing the model’s generated summaries with other, ideal summaries. Two ways in which this can be done have been presented:

1. With the automatic metric ROUGE, which compares the generated summaries with human-written summaries using n-grams and word sequences.
2. With human evaluation, where humans compare the generated summaries with the best summary they have in mind.

The first method has limitations regarding summary quality, while the
second method is not sustainable on a larger scale. Additionally, man-
ual experiments are difficult to compare across papers. Therefore, until
another metric is developed, ROUGE will continue to be the method of
choice to evaluate text summarization models on larger scales.
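As a small illustrative example of the first method (our own example, with simplified tokenization): if a generated summary contains the words "politiet har siktet en mann" and the reference summary reads "politiet har siktet enda en mann", then five of the six reference unigrams also occur in the generated summary, giving a ROUGE-1 recall of 5/6 ≈ 0.83. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common word subsequence.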

• How BERT can be used for extractive text summarization on a low-resource language
Two approaches have been presented for using BERT for extractive text summarization on a low-resource language, both using a modified BERT architecture called BERTSum. The first approach is to, if available, use a monolingual model provided in the target language, which in this case is the Norwegian BERT model. Otherwise, Multilingual BERT can be used, allowing cross-lingual fine-tuning. On the limited Norwegian news dataset provided by AP/Oppsummert, Multilingual BERT performed slightly better than the Norwegian BERT in terms of ROUGE score. In conclusion, Multilingual BERT is the better model, as it can be trained on a limited Norwegian dataset and reach a higher performance level than the monolingual BERT model. Furthermore, this shows great potential for applying the multilingual model to other low-resource languages where data is limited.

• Limitations with BERT and how they should be dealt with
The two main limitations with using BERT in this project have been shown to be:

1. That it cannot directly be used for extractive text summarization, due to BERT only being able to differentiate a pair of sentences.
2. That it has a token limit of 512, which means that longer articles cannot directly be used as input to the model.

The first limitation was dealt with by implementing BERTSum, which modifies BERT’s input sequence and embeddings to differentiate and extract sentences. The second limitation was dealt with by truncating longer texts to only use the first 512 tokens. This method was motivated by the fact that key sentences were found at the beginning of each article in the dataset. However, it has become clear from the human evaluation that this option was not sufficient for creating qualitative summaries. Even if the first sentences contain key information, we still need to consider sentences later in the article to create a satisfying summary in terms of content coverage. A minimal sketch of this truncation step is shown below.
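The truncation step can be sketched as follows. This is our own illustration, assuming the Hugging Face transformers tokenizer for Multilingual BERT; it shows only the 512-token truncation, not BERTSum's full per-sentence [CLS]/[SEP] preprocessing:

# Truncating an article to BERT's 512-token input limit (illustrative only).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

def truncate_to_bert_limit(article_text):
    """Return the first 512 wordpiece tokens of the article."""
    encoded = tokenizer(article_text, truncation=True, max_length=512)
    return tokenizer.convert_ids_to_tokens(encoded["input_ids"])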

Together with the results and discussion, the main research question can now
be answered from these investigations:

How can a high-performance BERT-based extractive summarization model be developed based on a limited amount of news summaries in Norwegian?
By fine-tuning a pre-trained model using a modified BERTSum architecture, we developed an extractive summarization model. By investigating two approaches, we found Multilingual BERT to perform best in terms of ROUGE score. Even though the model outperformed the other presented models in ROUGE score, the journalists pointed out that, in its current state, the model would not be a reliable automated tool for writing summaries. However, with some future work, it could become a valuable tool for creating a base structure that journalists at AP/Oppsummert can edit and rewrite, saving them time and workload.

6.2 Future Work


To improve this work further, some changes can be made to the implementation that we suspect could positively impact the final results of this project.

1. We have seen that the model can learn sentence selection based on the data it is provided with. Therefore, we expect that with a better-structured and less lead-biased dataset, the model should learn to pick sentences based on context rather than position.

2. As shown by the human evaluation, experiments with other solutions to BERT’s token limit have to be done to see if they will yield better final results. One idea could be to split the input articles into multiple sub-articles, classify them, and combine the results (a sketch of this idea is given after this list). This approach would be more expensive, but it would potentially solve the content coverage problem.


3. Another experiment that could be made is to replace the CNN/DM dataset with a dataset that contains summaries with more sentences, to see if this gives better results when fine-tuning M-BERT.

4. Continued work must also be done on the problem with subheaders and non-cited quotes appearing inside the generated summaries. One possible solution would be to filter these from the article during prediction. This solution requires some automatic tagging to differentiate subheaders, quotes, and the article text itself.
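Relating to point 2 above, the following is a minimal sketch of the sub-article idea. It is our own illustration of a possible future direction, not an implemented part of this work; the scoring function is assumed to come from a model such as BERTSum:

# Split an article into windows that fit BERT's 512-token limit, score each
# window separately, and keep the best-scoring sentences overall.
def split_into_windows(sentences, tokens_per_sentence, max_tokens=512):
    """sentences: sentence strings; tokens_per_sentence: their token counts."""
    windows, current, current_length = [], [], 0
    for sentence, n_tokens in zip(sentences, tokens_per_sentence):
        if current and current_length + n_tokens > max_tokens:
            windows.append(current)
            current, current_length = [], 0
        current.append(sentence)
        current_length += n_tokens
    if current:
        windows.append(current)
    return windows

def summarize_long_article(sentences, tokens_per_sentence, score_window, top_k=3):
    """score_window is assumed to return one relevance score per sentence."""
    scored = []
    for window in split_into_windows(sentences, tokens_per_sentence):
        scored.extend(zip(window, score_window(window)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in scored[:top_k]]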

Bibliography

[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry
Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan,
Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. “Tensor-
Flow: A system for large-scale machine learning”. In: 12th USENIX Sym-
posium on Operating Systems Design and Implementation (OSDI 16). 2016,
pp. 265–283. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Ma-
chine Translation by Jointly Learning to Align and Translate. 2016. arXiv:
1409.0473 [cs.CL].
[3] Miguel Romero Calvo. Dissecting BERT Part 1: Understanding the Transformer. https://medium.com/@mromerocalvo/dissecting-bert-part1-6dcf5360b07f. Accessed: 2020-12-06.
[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learn-
ing Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation”. In: Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Doha, Qatar: Asso-
ciation for Computational Linguistics, Oct. 2014, pp. 1724–1734. DOI:
10.3115/v1/D14-1179. URL: https://www.aclweb.org/anthology/D14-1179.
[5] Papers With Code. Document Summarization on CNN / Daily Mail.
https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail. Accessed: 2021-03-18.


[6] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le,
and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context. 2019. arXiv: 1901.02860 [cs.LG].
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
“BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805.
URL: http://arxiv.org/abs/1810.04805.

[8] Henning Carr Ekroll and Kjetil Magne Sørenes. Avgåtte regjeringspoli-
tikere får karantenelønn på grunn av egne selskaper. Enkelte har vært
helt uvirksomme. https://www.aftenposten.no/norge/politikk/i/P9BX5J/avgaatte-regjeringspolitikere-faar-karanteneloenn-paa-grunn-av-egne-selska. Accessed: 2021-05-13.
[9] Khalid N Elmadani, Mukhtar Elgezouli, and Anas Showk. “BERT
Fine-tuning For Arabic Text Summarization”. In: arXiv preprint
arXiv:2004.14135 (2020).
[10] Jeffrey L. Elman. “Finding structure in time”. In: Cognitive Science
14.2 (1990), pp. 179–211. ISSN: 0364-0213. DOI: https://doi.org/10.1016/0364-0213(90)90002-E. URL: https://www.sciencedirect.com/science/article/pii/036402139090002E.
[11] Wenche Fuglehaug Fallsen. Tiltalte: Ble provosert da Mohammed sa «jeg
elsker henne». https://www.aftenposten.no/norge/i/Jo2V6P/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.
[12] Jan Gunnar Furuly and Hans O. Torgersen. Havarikommisjon kritis-
erer sikkerhetsbrudd etter at 15-åring omkom i strømulykke. https://www.aftenposten.no/norge/i/8mo9bw/havarikommisjon-kritiserer-sikkerhetsbrudd-etter-at-15-aaring-omkom-i-s. Accessed: 2021-05-13.
[13] Yoav Goldberg. “A Primer on Neural Network Models for Natural
Language Processing”. In: CoRR abs/1510.00726 (2015). arXiv: 1510.00726. URL: http://arxiv.org/abs/1510.00726.
[14] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Es-
peholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. “Teaching Ma-
chines to Read and Comprehend”. In: NIPS. 2015, pp. 1693–1701. URL:
http://papers.nips.cc/paper/5945-teaching-machines-
to-read-and-comprehend.
[15] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”.
In: Neural computation 9 (Dec. 1997), pp. 1735–80. DOI: 10.1162/neco.
1997.9.8.1735.


[16] Rani Horev. BERT Explained: State of the art language model for NLP.
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270. Accessed: 2020-12-06.
[17] Karen Spärck Jones. “A statistical interpretation of term specificity
and its application in retrieval”. In: Journal of Documentation 28 (1972),
pp. 11–21.
[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexan-
der M. Rush. “OpenNMT: Open-Source Toolkit for Neural Machine
Translation”. In: CoRR abs/1701.02810 (2017). arXiv: 1701.02810. URL: http://arxiv.org/abs/1701.02810.

[19] Simeon Kotadinov. Understanding Encoder-Decoder Sequence to Sequence Model. https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346. Accessed: 2021-03-18.
[20] Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming
Xiong, and Richard Socher. “Neural text summarization: A critical eval-
uation”. In: arXiv preprint arXiv:1908.08960 (2019).
[21] Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Sum-
maries”. In: Text Summarization Branches Out. Barcelona, Spain: Associ-
ation for Computational Linguistics, July 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[22] Yang Liu. “Fine-tune BERT for Extractive Summarization”. In: CoRR
abs/1903.10318 (2019). arXiv: 1903.10318. URL: http://arxiv.org/abs/1903.10318.
[23] Yang Liu and Mirella Lapata. “Text summarization with pretrained en-
coders”. In: arXiv preprint arXiv:1908.08345 (2019).
[24] Infolks Pvt Ltd. Recurrent Neural Network and Long Term Dependencies.
https://medium.com/tech-break/recurrent-neural-network-and-long-term-dependencies-e21773defd92. Accessed: 2021-03-18.
[25] H. P. Luhn. “A Statistical Approach to Mechanized Encoding and
Searching of Literary Information”. In: IBM Journal of Research and De-
velopment 1.4 (1957), pp. 309–317. DOI: 10.1147/rd.14.0309.
[26] Rada Mihalcea and Paul Tarau. “TextRank: Bringing Order into Text”.
In: Proceedings of the 2004 Conference on Empirical Methods in Natural Lan-
guage Processing. Barcelona, Spain: Association for Computational Lin-
guistics, July 2004, pp. 404–411. URL: https://www.aclweb.org/
anthology/W04-3252.


[27] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and San-
jeev Khudanpur. “Recurrent neural network based language model”.
In: vol. 2. Jan. 2010, pp. 1045–1048.
[28] Multilingual BERT snippet. URL: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/pretrained-snippets/multilingual-bert-snippet.
[29] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al.
“Abstractive text summarization using sequence-to-sequence rnns and
beyond”. In: arXiv preprint arXiv:1602.06023 (2016).
[30] Christopher Olah. Understanding LSTM Networks. https://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 2021-03-17.
[31] Oppsummert. Egne selskaper ga regjeringspolitikerne etterlønn. https://www.aftenposten.no/norge/i/70gbJ4/egne-selskaper-ga-regjeringspolitikerne-etterloenn. Accessed: 2021-05-13.
[32] Oppsummert. Kritikk etter dødsulykken på Filipstad. https://www.aftenposten.no/verden/i/xPWQ6G/kritikk-etter-doedsulykken-paa-filipstad. Accessed: 2021-05-13.
[33] Oppsummert. Prinsdal: Politiet knytter ny siktet til gjengmiljøet. https://www.aftenposten.no/norge/i/WbBReG/prinsdal-politiet-knytter-ny-siktet-til-gjengmiljoeet. Accessed: 2021-05-13.
[34] Oppsummert. Tiltalte: Ble provosert da Mohammed sa «jeg elsker henne». https://www.aftenposten.no/norge/i/RR23Jd/tiltalte-ble-provosert-da-mohammed-sa-jeg-elsker-henne. Accessed: 2021-05-13.
[35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The PageRank Citation Ranking: Bringing Order to the Web. Technical Re-
port 1999-66. Previous number = SIDL-WP-1999-0120. Stanford Info-
Lab, Nov. 1999. URL: http://ilpubs.stanford.edu:8090/422/.
[36] Michael Quinn Patton. “Qualitative evaluation checklist”. In: Evaluation
checklists project 21 (2003), pp. 1–13.
[37] Romain Paulus, Caiming Xiong, and Richard Socher. “A Deep
Reinforced Model for Abstractive Summarization”. In: CoRR
abs/1705.04304 (2017). arXiv: 1705.04304. URL: http://arxiv.org/abs/1705.04304.
[38] Telmo Pires, Eva Schlinger, and Dan Garrette. “How multilingual is
Multilingual BERT?” In: CoRR abs/1906.01502 (2019). arXiv: 1906.01502. URL: http://arxiv.org/abs/1906.01502.


[39] Anand Rajaraman and Jeffrey David Ullman. Mining of massive datasets.
Cambridge University Press, 2011.
[40] Wasim Riaz, Daniel Røed-Johansen, Frøydis Braathen, and Harald Stolt-
Nielsen. Politiet: 20-åringen som er siktet i Prinsdal-saken, har tilknytning til
gjengmiljøet. https://www.aftenposten.no/norge/i/mRdr1l/politiet-20-aaringen-som-er-siktet-i-prinsdal-saken-har-tilknytning-t. Accessed: 2021-05-13.
[41] Abigail See, Peter J Liu, and Christopher D Manning. “Get to the point:
Summarization with pointer-generator networks”. In: arXiv preprint
arXiv:1704.04368 (2017).
[42] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. “How to Fine-Tune
BERT for Text Classification?” In: CoRR abs/1905.05583 (2019). arXiv:
1905.05583. URL: http://arxiv.org/abs/1905.05583.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
Is All You Need. 2017. arXiv: 1706.03762 [cs.CL].
[44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Moham-
mad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson,
Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil,
Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol
Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. “Google’s
Neural Machine Translation System: Bridging the Gap between Hu-
man and Machine Translation”. In: CoRR abs/1609.08144 (2016). arXiv:
1609.08144. URL: http://arxiv.org/abs/1609.08144.
[45] Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong
Huang. Leveraging Lead Bias for Zero-shot Abstractive News Summariza-
tion. 2021. arXiv: 1912.11602 [cs.CL].

A Appendix

A.1 All responses from Human Evaluation


Each subsection presents an article, the generated summary sent to the journalists, and their responses.

A.1.1 Article 1
The article can be found in [12] and its gold summary in [32].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Statens havarikommisjon for transport retter kritikk om mangler ved


inngjerding, skilting og sikkerhetsvurdinger etter ulykken ved Filipstad
der en 15-åring døde og to ble skadet.

• Ulykken skjedde 24. februar 2019 da de tre ungdommene hadde tatt seg
inn på Filipstad driftsbanegård, like ved Ruseløkka fritidsklubb.

• Da de klatret opp på et hensatt togsett, ble alle tre utsatt for strøm fra
kjøreledningen.

• En 15-årig gutt døde, mens en 15 år gammel gutt og en 16 år gammel


jente ble alvorlig skadet.

• Den ene av de to ungdommene som falt på bakken, klarte ifølge rapporten å gå ut av togtunnelen på stedet og tilkalle hjelp etter det som hadde skjedd.


• Vedkommende løp deretter tilbake til ulykkesstedet for å hjelpe sine


venner.

• Gjerdet hadde hull og var for lavt Ungdommene hadde forut for
ulykken tatt seg gjennom et hull i et gjerde som ikke var i henhold til
sikkerhetsforskriften: Det skulle være 180 cm, men var på det laveste
kun 106 cm.

• I tillegg manglet denne delen av gjerdet korrekt skilting med «Adgang


forbudt».

• Oppe på taket beveget de seg slik at de kom i berøring med kontaktledningsanlegget og fikk strømgjennomgang.

• I det kraftige strømstøtet som fulgte ble den omkomne liggende på


taket, mens de to andre ble kastet ned på bakken.

Qualitative responses from the journalists are presented in table A.1:


Table A.1: Responses from the journalists on article 1

Response Weaknesses/strength of the summary?

1 This summary doesn’t reflect all the aspects of the article


2 I think the summary is a bit long, but that’s mostly because
it sometimes is quite repetitive. If you could take out the
repetitive parts, I think the summary all in all has captured
the most important aspects of the article.
3 When thinking about Oppsummert-articles, I think the
shorter, the better. Hence the grade 3 on length adequacy.
I think I would have preferred the text to have a little less
details about the actual event and one sentence containing
an answer from Bane Nor. I think these two sentences could
have been dropped: • En 15-årig gutt døde, mens en 15 år
gammel gutt og en 16 år gammel jente ble alvorlig skadet
(some of it has already been said in the first sentence ). •
Oppe på taket beveget de seg slik at de kom i berøring med
kontaktledningsanlegget og fikk strømgjennomgang. Also,
there is something strange with this sentence (but maybe
it is simply missing a bullet point before the word Ung-
dommene?). • Gjerdet hadde hull og var for lavt Ung-
dommene hadde forut for ulykken tatt seg gjennom et hull i
et gjerde som ikke var i henhold til sikkerhetsforskriften: Det
skulle være 180 cm, men var på det laveste kun 106 cm.
4 Bullet point 5 and 6 doesn’t provide any insight to the se-
curity issues. Bullet point 4 does nothing else than provide
gender.
5 The summary is accurate, but feels a little repetitive and
excessive. For instance two sentences are about the num-
ber of deaths/injuries. The faults found in the report could
have been summarized more succinctly and it should have
included one quote/sentence about the response from Bane
NOR.

A.1.2 Article 2
The article can be found in [40] and its gold summary in [33].

The generated summary, where each bullet point is an extracted sentence from the original article:


• Politiet har siktet enda en 20 år gammel mann for drapet eller medvirkn-
ing til drapet på Halil Kara (21) i Prinsdal.

• Ifølge en fersk kjennelse er han godt kjent for politiet.

• Den 20 år gamle mannen er fra Oslo og er den tredje som blir siktet i
drapssaken i Prinsdal sør i Oslo.

• En kamerat av ham ble skadet i basketaket som fulgte.

• 20-åringen som fredag ble siktet i drapssaken, var allerede varetektsfengslet for fire uker på bakgrunn av grov ulovlig bevæpning på offentlig sted.

• Han ble fremstilt for fengsling i Oslo tingrett torsdag.

• Han ble fengslet med brev- og besøksforbud i to uker.

• Han er blitt tatt to ganger, i løpet av kort tid, med svært farlig
skytevåpen».

• Politiinspektør Grete Lien Metlid har gitt uttrykk for at politiet foreløpig
ikke kjenner motivet for drapet.

• Aftenposten er kjent med at det over tid har vært flere konflikter mellom
unge menn fra Holmlia og unge menn fra Mortensrud.

Qualitative responses from the journalists are presented in table A.2:


Table A.2: Responses from the journalists on article 2

Response Weaknesses/strength of the summary?

1 The most important content of the main story is present in


the summary. One small error, and some small repeats. But
all over, a quite good summary.
2 Some spelling errors, and it’s sometimes difficult to under-
stand if the text is talking about the deceased or the accused.
It seems more like a summary, then a coherent text, so it’s
not the best reading experience.
3 I think one of the sentences in the summary is in fact in-
correct – the sentence «En kamerat av ham ble skadet i bas-
ketaket som fulgte». I think «ham» in this case actually is
the killed person, Halil Kara (21). But I think the original
text is written in a way that makes it easy to misunder-
stand this. I think it was impressive to reduce the (quite
long) story to a summary this short. This (quite impor-
tant sentence) is the only one I am missing from the sum-
mary (would have been included if I had written it myself):
Øyvind Bratlien er forsvarer for 20-åringen som ble siktet
fredag. Han skriver i en SMS til Aftenposten at hans klient
sier at han er uskyldig.
4 Bullet point 4 is confusing and lacks context. Bullet point
3 provides little new information with many words. Bullet
point 6 is redundant when you have point 7.
5 Most of the story is well covered here, but a few sentences
are unnecessarily repetitive. For instance, "20-åringen som
fredag ble siktet i drapssaken," could have just been "Den
siktede". Also, this sentence appears out of context and
should just be removed: "En kamerat av ham ble skadet i
basketaket som fulgte."

A.1.3 Article 3
The article can be found in [8] and its gold summary in [31].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Flere av Erna Solbergs tidligere regjeringskollegaer har fått ekstra etterlønn fordi de har opprettet selskaper som gir interessekonflikter.


• Eksstatsråd Robert Erikssons selskap er allerede avviklet – uten at det


har gitt inntekter.

• Som Aftenposten tidligere har skrevet, har Erna Solberg satt rekord i
antallet statsrådsutskiftninger i sitt regjeringsprosjekt.

• Den store gjennomtrekken av politikere i regjeringsapparatet har ført til


en stadig voksende etterlønnsregning til skattebetalerne.

• 20 millioner i etterlønn før Frp-exit Aftenposten har sammenstilt data


fra Statsministerens kontor og Karantenenemnda.

• Også Stavanger Aftenblad har omtalt omfanget av etterlønnsutbetalingene.

• Aftenposten har sett nærmere på hva som har skjedd i disse virk-
somhetene.

• Konsulentselskap uten inntekter ga høy etterlønn Blant dem som har


fått aller mest etterlønn er Fremskrittspartiets Robert Eriksson, som gikk
av som arbeidsminister i desember 2015.

• Da tenkte jeg at jeg får livnære meg selv, med det jeg gjorde før jeg gikk
inn i politikken.

• Da drev jeg med rådgivning om pensjon og forsikring, sier Eriksson,


som i dag er administrerende direktør i arbeidsgiverorganisasjonen Sjø-
matbedriftene.

Qualitative responses from the journalists are presented in table A.3:


Table A.3: Responses from the journalists on article 3

Response Weaknesses/strength of the summary?

1 This is a rather long and complicated main article. The sum-


mary doesn’t get the main and most important point here,
that this is something the politicians are doing on purpose.
And some of the sentences in the summary are not impor-
tant at all.
2 I think this summary is missing some key points. The part
of the article that is summarized, does not make me more
able to understand the pressure points of the case. It seems
more like a list of facts, then a summary of this particular
case. And I don’t think quotes should be used in summaries,
especially if it has no context (which is the case here)
3 The original article is a bit complicated, so making a sum-
mary here is a real challenge, I would say. A couple of the
sentences contain both a sentence AND the «mellomtittel».
I have here marked the mellomtittel with () - they should ide-
ally have been removed: • (20 millioner i etterlønn før Frp-
exit) Aftenposten har sammenstilt data fra Statsministerens
kontor og Karantenenemnda. AND «• (Konsulentselskap
uten inntekter ga høy etterlønn) Blant dem som har fått aller
mest etterlønn er Fremskrittspartiets Robert Eriksson, som
gikk av som arbeidsminister i desember 2015.» It should ide-
ally be made clearer that this sentence is a quote from R.
Erikssen: • Da tenkte jeg at jeg får livnære meg selv, med
det jeg gjorde før jeg gikk inn i politikken.
4 Bullet point 3 is not very relevant. Bullet point 5 and 8
have included subtitles that give no meaning in the sum-
mary. Bullet points 6 and 7 give nothing in terms of con-
tent. Bullet point 9 is a quote, but that is not clear in the
summary.
5 This one had quite a few weaknesses. In several sentences,
the algorithm has combined subheaders and text (like here:
" 20 millioner i etterlønn før Frp-exit Aftenposten har sam-
menstilt data fra Statsministerens kontor og Karantenen-
emnda.") The length is good for such a long article, but
the review of different politicians could more efficiently have
been merged into one or two summarizing sentences. Some
sentences are also irrelevant in a summary, such as the one
about Stavanger Aftenblad. Also, key pieces of information,
like the total sum of money (20 million), is not included
properly.


A.1.4 Article 4
The article can be found in [11] and its gold summary in [34].

The generated summary, where each bullet point is an extracted sentence from the original article:

• Mohammed Altai (16), som døde etter en voldsepisode på Holmlia i


2017, skal ha gått på hjertemedisiner da ugjerningen mot ham skjedde.

• Den tiltalte 21-åringen nekter å forklare seg for retten.

• I Oslo tingrett sitter en 21 år gammel mann som er tiltalt for grov kropp-
skrenkelse og for å ha etterlatt Mohammed Altai i en hjelpeløs tilstand.

• – Min klient opplever saken som svært belastende, og avisoppslagene


de siste dagene har vært en ekstra påkjenning for ham.

• Saken er dratt ut av proporsjoner der både ære og drap er trukket inn.

• Hjertefeil?

• I retten fortalte aktor Børge Enoksen at obduksjonsrapporten av Mohammed Altai viser at slagene fra tiltalte var relativt beskjedne, og at de var med flat hånd.

• Mohammed døde uansett av skadene.

• Han fikk hjerteflimmer som følge av oksygenmangel til hjernen, falt i


koma og døde fem uker senere, 25. juli 2017.

• Mohammeds mulige hjertesykdom skal belyses senere denne uken når


den medisinske sakkyndig rapporten senere skal legges frem i retten.

Qualitative responses from the journalists are presented in table A.4:


Table A.4: Responses from the journalists on article 4

Response Weaknesses/strength of the summary?

1 The summary seems to just cover the first half of the main
article (?)
2 The length is good. The summary is missing some key infor-
mation, and contains some information that is not necessary
at all. Still it doesn’t seem like a summery (text), but more
like a listing of facts.
3 The summary is quite confusing to read and I do not think
I would understand much had I not known the whole orig-
inal story beforehand. The summary sentences seem very
randomized when I read them. If the third sentence (• I
Oslo tingrett sitter en 21 år gammel mann som er tiltalt for
grov kroppskrenkelse og for å ha etterlatt Mohammed Altai
i en hjelpeløs tilstand.) had been first in the summary, it
may have been easier to get into the story. On this note, it
should be noted that the original text is quite complicated,
and the text is quite long. The motive behind the (possible)
crime is not really made clear in the summary (the story
about the relationship with the sister). It also seems that
the summary emphasizes the first and the middle part of
the original text – and not the last third of the original txt.
Also, this word is just a «mellomtittel» and could have been
dropped: • Hjertefeil?
4 Bullet point 4 is a quote, but it doesn’t say from whom. Bul-
let point 5 is also a quote, but it is not presented as such.
Bullet point 6 is just a one word subtitle. The summary
does not include anything about the accused’s younger sis-
ter.
5 This one did not work so well. It is a good example that the
first sentences often have key information, but still needs
to be contextualized and ordered properly in a summary.
Sentence 2 and 3 should have been switched around, and as
the summary goes on, more bullet points appear totally out
of context.
