Acknowledgements
I would like to thank my supervisor Francis Palma for introducing me to the topic of sentiment analysis and for his time, feedback and guidance throughout this project. Secondly, I would like to thank Victor Heijler for his support and valuable comments, which helped me improve my report greatly. I would also like to thank my mother for always being by my side and supporting me during this journey. Finally, I would like to thank my friends and everyone who participated in the labelling process of my dataset.
Contents

1 Introduction
1.1 Background
1.2 Motivations
1.3 Problem Statement
1.4 Research Questions & Objectives
1.5 Contributions
1.6 Target groups
1.7 Report Structure
2 Background
2.1 Theoretical background
2.1.1 Traditional sentiment classification techniques
2.1.2 Text sentiment classification with deep learning
2.1.3 Recurrent Neural Networks
2.1.4 LSTM
2.1.5 Bidirectional RNN
2.1.6 Convolutional Neural Networks
2.2 Related work
2.2.1 Related work discussion
3 Method
3.1 Data collection
3.1.1 Video collection
3.1.2 Comment collection
3.2 Data cleanup
3.3 Data labelling
3.4 Data processing
3.4.1 Removal of nonessential data
3.4.2 Simplification of data
3.4.3 Word vector generation
3.5 Model implementation
3.6 Model evaluation
3.7 Reliability and Validity
3.7.1 Reliability
3.7.2 Construct validity
3.7.3 Internal validity
3.7.4 External validity
3.8 Ethical considerations
4 Implementation
4.1 Environment setup
4.2 Data collection and clean up
4.3 Dataset construction
4.3.1 Web application for comment labelling
4.3.2 Data processing
4.4 Implementation, training and evaluation of LSTM-based models
4.4.1 Data preparation
4.4.2 Hyperparameter tuning
4.4.3 Implementation of models and their training
4.4.4 Evaluation of model performance
5 Results
5.1 YouTube comment dataset
5.2 YouTube video statistics
5.3 Model performance
5.3.1 Cross-validation
5.3.2 Evaluation on testing subset
5.4 Performance evaluation on IMDB dataset
6 Discussion
6.1 Accuracy of sentiment prediction
6.2 LSTM-based model performance
6.3 LSTM-based model performance on IMDB dataset
6.4 Relationship between video sentiment and users’ preferences
7 Conclusions and Future Work
References
A Appendix 1
A.1 Video statistics and predicted sentiment
1 Introduction
Emotions and opinions play an essential role in human behaviour, as we often make decisions based on other people’s experiences and thoughts. Since everyone’s opinions are subjective and affected by their personal experiences and beliefs, learning about them helps to gain a broader perspective. Public opinion has long been of interest to organisations, companies and politicians. The collective opinion of people can help predict who will win an election and generate knowledge about future trends and topics. It also helps determine the general opinion about products and services, which helps marketing teams decide on a marketing strategy, improve an existing product or develop a new one, and improve customer support [1]. Therefore, recognising sentiment in various fields is of the essence.
Initially, people’s opinions were collected through surveys or questionnaires and reviewed manually. However, as internet usage increased, people started expressing their opinions online more frequently. The boom of social media platforms in recent years has allowed users to share all kinds of information and has introduced a wider variety of ways to express opinions. Blogs, discussion forums, reviews, comments and micro-blogging services, such as Twitter or Facebook, serve as rich data sources composed of audio files, videos, images and opinions.
One such platform, YouTube, is currently one of the most popular social media platforms in the world1. It allows its users to share their videos or view videos uploaded by other users and provide feedback. A user can support or reject a video by rating it positively or negatively, respectively. Users may also provide textual feedback in the form of comments posted to videos. Each video can hold thousands of comments, making YouTube a platform of interest for text classification and categorisation.
This project focuses on building a YouTube comment dataset, developing a sentiment classifier and performing sentiment analysis on the dataset. To help understand the fundamentals of this topic, the terms sentiment analysis, text classification and text sentiment classification are described in the next section.
1.1 Background
Sentiment analysis - sometimes referred to as opinion mining - is a natural language processing (NLP) technique used to determine the sentiment polarity of textual data. Over the years, it has become one of the most popular research areas in NLP [2]. It analyses people’s sentiment and attitude towards products, digital content, organisations, events, issues and more [2]. While large volumes of opinion data can provide an in-depth understanding of overall sentiment, they require a lot of time to process. Not only is it time-consuming and challenging to review large quantities of text, but some texts might also be long and complex, expressing reasoning for different sentiments, making it hard to grasp the overall sentiment quickly. Currently, YouTube allows comments of up to 10 000 characters in length, which is considerable and can result in long texts. Naturally, this amount of data calls for tools to ease and automate the process of classification and sentiment extraction from text.
1 https://www.alexa.com/topsites
Text classification is the process of categorising texts into groups. It is applied in various domains, such as the organisation of news articles, opinion mining of product reviews, spam filtering, and document organisation in digital libraries and social feeds, among others [3]. Text can be classified in different ways - by its topic, by whether it meets specific criteria, or by its sentiment [3].

Text sentiment classification is the process of determining the overall sentiment of a given text [4]. This problem can be framed as either binary classification or multi-label classification, where binary classification produces two outputs, positive and negative, and multi-label classification produces more than two outputs, for example very negative, negative, neutral, positive and very positive [4]. Common techniques for text sentiment classification use traditional machine learning algorithms or deep-learning-based approaches. Convolutional neural networks (CNN) and recurrent neural networks (RNN), such as Long Short-Term Memory (LSTM) and their bidirectional variants, like Bidirectional LSTM (BiLSTM), have been able to successfully perform text and sentiment classification tasks [5]. Section 2.1 describes these concepts in more detail.
1.2 Motivations
It is common to publicly post opinions on social media platforms such as Twitter, Facebook or YouTube, and these opinions can provide essential data on what people think about products and services. Furthermore, these opinions might influence other people’s opinions and affect their decisions on whether or not they would be interested in a particular product, organisation or topic. This user feedback might, in turn, inform about general trends such as popularity and quality.

YouTube is primarily a platform where users can upload their personal videos, but it is also a platform for companies to advertise their products, for news outlets to share information on current events, and more. YouTube offers a partner program for its users, where content creators can monetise their videos and earn advertisement revenue [6]. The amount of money a creator can earn is directly related to the amount of interest, in the form of views, ratings and comments, that they can generate with their content [7]. Furthermore, popular videos might gather important user feedback, such as comments, where automatic ways to discern sentiment polarity are desirable. The video authors’ dependence on their viewers’ approval raises the importance of this feedback, which also makes it attractive to study the connection between the various forms of user feedback.
the classifier with different text patterns of positive and negative sentiment, which can only be done by using comments from videos of a variety of categories. In addition, the semantics of languages differ, therefore it is unlikely that classifiers trained on other languages and in single domains would generalise to perform equally well on English YouTube video comments from multiple domains. Lastly, based on the research done, there seems to be no sentiment-labelled English YouTube comment dataset available publicly online.

This thesis addresses the gap in available research on YouTube video comment classification across a variety of video categories with various LSTM-based techniques, as well as the lack of a publicly available sentiment-labelled English YouTube comment dataset.
1.4 Research Questions & Objectives

RQ1 What is the highest accuracy that an LSTM-based model can achieve in predicting YouTube video sentiment?

RQ3 How accurate are the LSTM-based models with another sentiment dataset, such as the IMDB movie review dataset?
The four types of LSTM-based models are evaluated against the Internet Movie Database (IMDB) movie review dataset [8], which consists of 50 000 highly polar movie reviews.
RQ4 What is the relationship between the video sentiment and users’ preferences?

The viewers of a YouTube video can comment and can express their opinion by rating the video with either a like or a dislike. A comparison between the overall sentiment of the comments and the positive and negative video ratings is performed. In addition, the relationship between a video’s view count and its sentiment is examined.

In order to answer the research questions, the objectives listed in Table 1.1 are achieved.
O1 Construct an English YouTube comment dataset covering several domains.
O2 Construct a YouTube video statistics dataset.
O3 Implement four LSTM-based models: LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM.
O4 Evaluate the performance of the four LSTM-based models using accuracy, precision, recall, F1 score and confusion matrix.
O5 Analyse inconsistencies in different forms of user feedback.
The results of objective O1 are used to complete objectives O3 and O4, which are needed to answer research questions RQ1, RQ2 and RQ3. Objectives O1 and O2 have to be completed to achieve objective O5, which helps answer research question RQ4.
1.5 Contributions
This project produces four novel contributions. The following section (Section 1.6) discusses the importance of these contributions.
1.7 Report Structure
In this section, a brief background for the project was presented, research questions and objectives were defined, and the contributions and their use were described. The rest of the report describes the project in more depth. Background (Section 2) provides a more detailed description of the background and discusses related work. Method (Section 3) details the approach used to answer the research questions; each step is described in more detail, and the reasoning behind each decision is given. Implementation (Section 4) explains the more technical details of the method used and describes the environment setup for the whole project. Results (Section 5) contains the results of the data collection and model evaluations. Discussion (Section 6) presents a more in-depth analysis of the results and the dataset. Finally, Conclusions and Future Work (Section 7) summarises the findings of this project and presents suggestions for future work.
2 Background
This section begins by describing the theoretical background, where traditional and deep-learning sentiment techniques are covered (Section 2.1), followed by a review of related work (Section 2.2).
data to a higher level [15]. Deep learning is often used for pattern recognition, analysis and classification, and feature extraction [16]. Some of the most well-known deep learning architectures include feed-forward neural networks, convolutional neural networks and recurrent neural networks. CNNs, RNNs and their variants have achieved good results in various NLP tasks, especially sentiment classification [5]. Sections 2.1.3, 2.1.4 and 2.1.5 provide a more in-depth introduction to RNNs, while Section 2.1.6 covers CNNs.
Figure 2.1: An "unfolded" visual representation of an RNN over time, with inputs X1, X2, X3, hidden layers, and outputs Y1, Y2, Y3.
Since the input from the past is added to the memory after each input in a sequence is processed, RNNs do not require a fixed input length and can handle arbitrarily sized inputs without the model growing in size. In addition, the output length can vary as well, so a single output can be produced for multiple inputs. These qualities allow RNNs to predict the next word in a text when solving text-related problems.

While RNNs are, in theory, excellent for problems with sequential data, a regular RNN suffers from vanishing (too low) and exploding (too high) gradient problems [17, 18]. These problems, especially the vanishing gradient, are mitigated by introducing Long Short-Term Memory [18].
2.1.4 LSTM
LSTM counters the vanishing and exploding gradient and long-term dependency problems by introducing hidden units [18]. LSTM works similarly to RNN - information about the past input is used when processing the next input; however, each LSTM cell has gates that decide whether information should be kept or deleted from the memory [18]. Memory manipulation is based on how important that information is. LSTM has three information-controlling gates - the input, output and forget gates. In addition, a cell state Cn is used to carry the information [19]. The cell state is affected by vector operations, such as addition or multiplication [19]. Figure 2.2 visualises the LSTM network unfolded over time.
Figure 2.2: An "unfolded" visual representation of LSTM network over time. (1)
forget gate, (2) input gate and (3) output gate.
The forget gate decides which information should be disposed of. The sigmoid function determines this using the current input and the information from the previous input, and produces a number between 0 and 1, where 0 indicates that the information should be dropped and 1 that it should be kept [19].

The input gate determines which new information should be kept. The sigmoid function is responsible for updating the relevant values that describe the sequence, while the tanh function creates a vector of new candidate values that could be added to the memory [19]. The old cell state is then updated by multiplying it with the values generated by the forget gate and adding the values generated by the input gate. This results in an updated cell state, and the new information is carried to the upcoming iteration.

Lastly, the output gate determines the output for the given input. This is done by first applying the sigmoid function to the current input and the previous output [19]. The tanh function is then applied to the updated cell state and multiplied with the sigmoid function’s output [19].
2.1.5 Bidirectional RNN
At its core, an RNN preserves only information about the past and is therefore unidirectional. In the textual analysis domain, it is sometimes essential to look both forwards and backwards to understand the context and meaning. A bidirectional RNN (BiRNN) trains two RNN layers concurrently, where the first layer receives the given input and the second one receives the reversed input [20]. Such an approach makes use of both past and future information of the given data. Figure 2.3 shows a visual representation of the bidirectional RNN.
Figure 2.3: A visual representation of a bidirectional RNN, with forwards and backwards hidden layers whose outputs are combined for each time step.
BiLSTM works on the same concept as BiRNN but with an LSTM layer instead of a traditional RNN layer.
Figure 2.4: A convolutional layer. The dashed lines indicate that the kernel is applied several times, producing values in the feature map.
Day and Lin [28] used a bidirectional LSTM-based model to classify the sentiment of Google Play reviews. They used several dictionaries to calculate and generate the features - HowNet, iSGoPaSD and NTUSD. For the word embedding, CBOW and Skip-gram, proposed by Mikolov et al. [29], were used. The authors compared BiLSTM’s performance against that of the SVM and NB algorithms. BiLSTM outperformed the traditional machine learning approaches and reached 94% accuracy.
Zhou et al. [30] attempted to solve the sentiment classification problem by applying a combination of a convolutional neural network and LSTM, calling this model C-LSTM. The experiments were performed on the Stanford Sentiment Treebank, a dataset consisting of movie reviews. The authors used one-dimensional convolutions to extract n-gram feature maps. These feature maps are then rearranged as feature representations and fed into the LSTM layer. The C-LSTM model outperformed the SVM algorithm on both binary and fine-grained classification experiments.
Other works have also combined CNN and LSTM to solve the sentiment classification problem. Hassan and Mahmood [31] combined CNN and LSTM, feeding the model pre-trained word vectors. They merged convolutional and LSTM layers into a single model, which outperformed the NB and SVM algorithms and CNNs while reaching the highest binary classification accuracy of 88.3%.
Huang et al. [32] used word embeddings pre-trained on their own dataset and combined CNN and LSTM, where two LSTM layers were stacked on top of the CNN. Their proposed model outperformed standalone CNN, LSTM, a combination of CNN and single-layer LSTM, and SVM, with an accuracy of 87.2%.
Cunha et al. [33] built a convolutional neural network for sentiment classification of comments posted on Brazilian political videos on YouTube. They evaluated the performance of the proposed classifier by separating user feedback into several categories - what the user thinks about the content creator and the video, the user’s opinion on how well the topic is covered, and the user’s opinion on how relevant the video is. Their classifier classified the first category of comments best, reaching 84%.
An LSTM-based approach to sentiment classification has been studied on the previously mentioned IMDB data as well. Yenter and Verma [34] combined several CNN-LSTM kernels for sentiment detection. They did not use pre-trained word embeddings. Their multi-kernel CNN-LSTM model achieved 89.5% accuracy.

Mathapati et al. [35] analysed the performance of several deep learning techniques, including LSTM, CNN and CNN-LSTM. They used NB as a baseline algorithm, which was outperformed by all the deep learning techniques. The deep learning models were trained with pre-trained word vectors, and CNN-LSTM performed best, achieving 88.3% accuracy.
view data. Reviews tend to be written more formally and in proper language. This suggests that the sentiment classifiers proposed in the studies reviewed in Section 2.2 would not perform satisfactorily on the YouTube sentiment classification task using video comments. The importance of having comments from videos of different categories is mentioned by Cunha et al. [33] as well, where the authors suggest that future work should include comments from videos of several domains. This thesis project deals with diverse text data, including proper speech, internet slang, and discussions of various topics. In order to cover several video domains, videos of several categories are selected when building the YouTube comment dataset (see Section 3.1.1).
In addition, only one of the reviewed studies uses pre-trained word embeddings [32]; however, its word vectors are pre-trained on the authors’ own dataset. Pre-training word vectors on one’s own dataset poses a risk of the vectors not correctly representing a word based on its context, especially if the dataset is not large and the word appears only a few times. This thesis project uses pre-trained word vectors covering 1.2 million words, allowing better word context representation and more accurate predictions.

Lastly, this project uses deep learning model architectures similar to those in the studies reviewed in Section 2.2, such as LSTM, BiLSTM and CNN-LSTM. In addition, a CNN-BiLSTM architecture is evaluated.
3 Method
The following section describes the research methodology for data collection, data cleanup, data labelling and data processing, followed by an explanation of how the model implementation and evaluation are performed. The methodology is visualised in Figure 3.6. The section concludes by discussing the reliability, validity and ethical considerations of the chosen methods.
Figure 3.6: The methodology pipeline: (1) data collection produces the YouTube video dataset and the YouTube comment dataset; (2) data cleanup produces a clean YouTube comment dataset; (3) data labelling produces a labelled YouTube comment dataset; (4) data processing produces processed and labelled YouTube comments; (5) model implementation produces the LSTM-based models; and (6) model evaluation produces the model performance results. Each artifact is used for the step that follows.
3.1.1 Video collection
Before the YouTube video comments can be collected, a list of videos has to be defined. Since it is unclear whether the overall video comment sentiment differs from the video ratings or correlates with view count, the videos are collected based on upload date and a variety of categories. An initial list of videos is collected using mixed data sources - a public dataset of trending YouTube video statistics [36] and user-constructed playlists of popular videos. The base collection of videos is then refined using the following inclusion criteria:

• A video must have at least 100 000 views and at least 2000 comments
• The intent of the video can not be to ask a question of the audience
• The author of the video can not be encouraging the audience to rate the video negatively

A video must have comments and ratings enabled, be public, and have a certain number of views and comments to ensure that the video contains all the data needed. The date criterion ensures that the video is not actively undergoing changes in statistics (views, ratings) and limits the timespan of the collected videos in case there are changes to how ratings and views are counted on videos. Given that changes in YouTube’s platform implementation details, such as the algorithms used for view counting, are unknown, this measure attempts to minimise their effect on the data. To prevent data manipulation, each video is then reviewed to exclude videos that ask questions of the audience (e.g. "Respond in the comments below") or videos where the author encourages the audience to rate the video negatively [37]. Finally, it is ensured that the creator of the selected video is not recently deceased, as this could cause the most recent comments to discuss the death of the author rather than the contents of the video.

YouTube does not provide functionality to search videos by the year they were uploaded or by comment and view numbers, therefore the video collection is done manually.
The video selection process is crucial as it has a significant impact on the later construction of the YouTube comment dataset. The final YouTube comment dataset must have an equal number of positive and negative sentiment comments. If a significantly larger portion of either sentiment is present, the LSTM-based models might be biased towards one of the sentiments.

The selection of videos with the criteria mentioned above was finalised on February 02, 2021 and resulted in a collection of 49 videos. The selected videos cover the following categories:
• Entertainment
• People & Blogs
• Music
• Gaming
• Comedy
• Sports
The list of selected videos and their statistics can be found in this project’s GitHub repository2.
An example row of the collected comment data:

body: "You could roast so many chestnuts with this"
positive: 0, negative: 0, neutral: 0, rated: 0
comment_id: Ugxz_FdO-AA0mEz8FMJ4AaABAg
video_id: ItYOdWRo0JY
date: 2021-01-31T20:24:17Z
Figure 3.9: The comment labelling tool.
3.4.2 Simplification of data
Commonly, YouTube comments contain acronyms and emojis, which introduce inconsistencies into the data. To prevent this, a list of the most commonly used emojis4 and acronyms5 is collected, and these inconsistencies are automatically replaced with their full terms and simplified textual meanings. The list of used acronyms and emojis can be found in the project’s GitHub repository6.
According to Angiani et al., repeating vowels and consonants should also be removed [38]. Hence, if a vowel or consonant is repeated more than two times in a word, the occurrences of that letter are reduced to just two (for example, in the sentence "this is so coooooool!" the word "coooooool" has the vowel "o" repeated more than two times, so it is replaced by two occurrences of "o", making the sentence "this is so cool!").

In addition, Angiani et al. claim that ignoring negation words, or the variety of negation words, is one of the leading causes of misclassification in sentiment classification tasks [38]. The English language has several different negation words, such as "don’t", "won’t", "can’t" and "shouldn’t", and to make the text more consistent, any negation word is replaced with the word "not", as in the sketch below.
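A minimal sketch of these two simplification steps (the negation list shown is an illustrative subset, not the project’s full list):

    import re

    # Illustrative subset of English negation words; the project's full list may differ.
    NEGATIONS = {"don't", "won't", "can't", "shouldn't", "isn't", "didn't"}

    def simplify(text: str) -> str:
        # Reduce any letter repeated more than twice to exactly two occurrences,
        # e.g. "coooooool" -> "cool".
        text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
        # Normalise negation words to the single token "not".
        return " ".join("not" if w.lower() in NEGATIONS else w for w in text.split())

    print(simplify("this is so coooooool!"))   # -> this is so cool!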
Moreover, text tokenisation is performed on the comment dataset. Text tokenisation is the process of turning text into a sequence of integers, where each integer represents a word. A dictionary of all unique words is generated, and the integers used in the tokenisation are the indices of these words in the dictionary. This results in a dataset of sequences of numbers, making it usable for deep learning model training.

Finally, the generated sequences of integers are padded. Padding ensures that all the sequences in the dataset are of the same length. The longest sequence is not padded, but every shorter sequence has one or several values inserted before it to match the longest sequence’s length. The first index of a word in the dataset vocabulary generated during tokenisation is 1, therefore 0 is used to pad the integer sequences. Figure 3.10 visualises this process.
Integer sequence: 1, 30  →  padded integer sequence: 0, 0, 0, 1, 30

Figure 3.10: Example of a tokenised text with padding applied.
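As a hedged illustration of both steps with the Keras utilities used in this project (see Section 4.3.2), with made-up example texts:

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    texts = ["this is so cool", "cool"]   # made-up example comments
    tokenizer = Tokenizer()               # word indices start at 1, leaving 0 for padding
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences)     # default "pre" padding inserts 0s in front
    print(padded)                         # [[2 3 4 1], [0 0 0 1]]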
The labelled and processed dataset can be found in the GitHub repository7. In this dataset, a rating of 0 indicates negative sentiment, 1 positive and -1 neutral.
4 https://emojipedia.org/people/
5 https://www.netlingo.com/acronyms.php
6 https://github.com/indrer/ytcsc/blob/main/util/acronyms_smileys.py
7 https://github.com/indrer/ytcsc/blob/main/datasets/processed/processed_full.csv
3.4.3 Word vector generation
Word vectors (also known as word embeddings or a word embedding matrix) are distributed representations of words as vectors in a vector space. Such a representation allows words that have a different meaning in another context to maintain their meaning and helps with the detection of synonyms. In addition, it can be used to predict possible follow-up words. This project uses pre-trained GloVe8 word vectors trained on 2 billion Twitter tweets with a vocabulary of 1.2 million words [39]. Preferably, word vectors trained on YouTube comment data should be used, because the vocabulary of YouTube comments can differ from that of Twitter tweets; however, no such pre-trained word vectors exist yet, and the dataset compiled during this project does not contain enough data to train accurate word vectors.
the data. Three-fold cross-validation is chosen due to time limitations, as a larger number of folds takes longer to execute. Evaluation on unseen data, on the other hand, simulates real-world use, where the model is trained on the entire training and validation subset and predicts on unseen data. During cross-validation, the training and validation subsets are joined and randomised and then, in turn, split into three groups, where some data in each group is used for training and some for evaluation. This process is illustrated in Figure 3.11. The model evaluation on unseen data produces metrics that aid in evaluating each model’s performance, including accuracy, precision, recall, F1 score, confusion matrix and ROC curve. These metrics are chosen because they not only show the accuracy of a model but also provide different insights into how the model is performing, such as whether it is biased towards some sentiment, how accurate it is in guessing a specific sentiment, or how well it does in comparison to other models (the metrics are explained in more depth in Section 4.4.4). The goal of analysing a model’s performance on the testing data is to see how well it performs on unseen data and to assess whether it over-fits during the training process. Higher model performance on unseen data suggests a more reliable model when applied in real-life use cases.
Figure 3.11: The three-fold cross-validation process: in each iteration a different part of the data is held out for testing while the rest is used for training; the metrics of the three iterations are averaged.
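A minimal sketch of this loop with scikit-learn (`build_model`, `X` and `y` are placeholders for the project’s model factory and the joined training and validation data):

    import numpy as np
    from sklearn.model_selection import KFold

    kfold = KFold(n_splits=3, shuffle=True, random_state=42)   # fixed seed: same splits
    scores = []
    for train_idx, test_idx in kfold.split(X):
        model = build_model()                                  # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=30, verbose=0)
        loss, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(accuracy)        # assumes the model is compiled with an accuracy metric
    print("average accuracy:", np.mean(scores))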
3.7.1 Reliability
Reliability concerns how reproducible the approach of this project is and whether it will yield the same results as presented in the report. During this project, various steps were taken to ensure that the data and the results can be reproduced as closely to the described numbers as possible. However, there are some particular considerations that should be emphasised.

Attempting to compile YouTube comments and YouTube video statistics will most likely not result in exactly the same data as collected during this project. Some videos are still viewed and interacted with occasionally, therefore the comments, view count and rating count might differ. Further, the authors of the videos used in this project can make them unlisted or private, rendering them inaccessible.
Moreover, due to the nature of deep learning models, randomness plays a large part in the model training process, as deep neural networks are stochastic. Therefore, using the same architecture will not yield exactly the same results. The same random seed was used throughout all the experiments to reduce randomness; however, some randomness cannot be controlled. To address this, all the models’ weights and their implementations are saved in separate files. To reproduce the results achieved during this project, the models can be loaded and evaluated with the provided YouTube comment dataset, which should result in the accuracies described in the report.

Finally, following the data processing steps is essential for good model performance, therefore the same processing as described in this report should be applied to any new data to achieve similar or identical accuracy metrics.
3.7.4 External validity
External validity refers to whether the research generalises to the broader field or to other situations.

Concerns regarding the external validity of the project are primarily whether the finished models are applicable to YouTube videos and comments of different categories, different texts or other contexts, and with what accuracy. To address these concerns, the four LSTM-based models are evaluated on the IMDB dataset. As shown in the model evaluation (Section 6.3), the model performance is acceptable even for a dataset of movie reviews, which indicates that the models are generalisable to some extent.
4 Implementation
This section describes the technical details and implementation of the methods mentioned in Section 3. The implementation choices and their reasons are clarified. Further, the environment setup (Section 4.1), data collection and clean up (Section 4.2) and final YouTube dataset construction (Section 4.3) are explained, followed by details about the implementation of the four models used in this project (Section 4.4). All the code is available in a GitHub repository9.
1. Google API Client (1.12.5)10 - Data retrieval from YouTube Data API
The computations for model training and evaluation are done on the GPU (NVIDIA GeForce GTX 1660 Ti). PostgreSQL (version 12.6) is used to temporarily store video statistics and comments. NodeJS (version 14.15.3) and Nginx (version 1.18.0), together with ExpressJS (version 4.17.1), are used to develop and run a web tool for comment labelling.
a request to YouTube’s Data API is made for each video. Each request specifies which video resource properties17 should be retrieved. For this project, the properties of interest are snippet and statistics, which contain the information needed to construct the dataset - title, id, dislikeCount (negative ratings), likeCount (positive ratings), viewCount, commentCount and publishedAt. The video data of the 49 videos is then saved into the PostgreSQL database.

After the essential video information is retrieved, the video comments are collected. This is again done by making requests to the YouTube Data API. Each request is provided with a video ID, a comment page token, a specification of which comment properties to retrieve (snippet) and the order (newest first). Only the newest top-level comments are collected. A response provides only 100 comments, therefore 10 requests need to be made per video to collect 1000 comments. The comment page token ensures that no comments from the same page are fetched twice. The following information is stored for each comment - textOriginal (body of the comment), id, video_id and updatedAt (when the comment was posted).
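A hedged sketch of these requests using the Google API Client (API_KEY and VIDEO_ID are placeholders):

    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=API_KEY)

    # Video statistics: the snippet and statistics resource parts described above.
    video = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

    # Newest top-level comments, 100 per page; the page token advances the paging.
    comments, page_token = [], None
    for _ in range(10):  # 10 pages of 100 comments -> up to 1000 comments
        response = youtube.commentThreads().list(
            part="snippet", videoId=VIDEO_ID, maxResults=100,
            order="time", pageToken=page_token,
        ).execute()
        comments.extend(response["items"])
        page_token = response.get("nextPageToken")
        if page_token is None:
            break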
Before the comments are stored in the database, they are cleaned up. First, any duplicate comments are removed. Then, comments consisting solely of URLs or non-alphanumeric characters are identified using regular expressions and removed. Afterwards, a language identification model [40, 41], together with the fastText library18, is used to identify and remove non-English comments. Many YouTube comments contain misspelt words and slang, so the language identification model cannot remove all non-English comments; such comments are later removed manually during the comment labelling process.
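A hedged sketch of this filtering step (lid.176.bin is fastText’s published language identification model; the exact model file used in the project is an assumption):

    import fasttext

    # Load fastText's language identification model.
    model = fasttext.load_model("lid.176.bin")

    def is_english(comment: str) -> bool:
        # predict() returns a tuple of labels and their probabilities.
        labels, _ = model.predict(comment.replace("\n", " "))
        return labels[0] == "__label__en"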
Finally, the cleaned-up comments are saved into the PostgreSQL database. The database is made up of two tables - comment and video. The primary key of the video table is the video’s URL, while the primary key of the comment table is the comment’s ID. Each comment references the video it belongs to with a foreign key pointing to the video table. The entity-relationship diagram for the database can be seen in Figure 4.12.
Comment                        Video
comment_id  varchar (PK)       url           varchar (PK)
body        text               title         varchar
positive    int8               upvote        int8
negative    int8               downvote      int8
neutral     int8               views         int8
rated       int8               commentcount  int8
video_id    varchar (FK)       date          varchar
date        varchar

Figure 4.12: The entity-relationship diagram of the comment and video tables.
17 https://developers.google.com/youtube/v3/docs/videos/list
18 https://fasttext.cc/
4.3 Dataset construction
The YouTube comment dataset construction has two main steps: labelling the data
and processing the labelled data. This section describes the implementation of both
of the steps.
Figure 4.13: The architecture of the comment labelling tool: the client (web browser) connects through an Nginx reverse proxy to the web server, which communicates with the database.
Nginx19 is used as a reverse proxy for security, easier logging and an encrypted connection. Nginx handles the HTTP requests first and passes them to the web server. The web server is implemented with NodeJS and ExpressJS and is responsible for updating a comment’s sentiment in the database, fetching an unlabelled comment from the database, and displaying the comment to be labelled to the user.
to change the desired data. In addition, DataFrames allow the storage of different data types. The YouTube comments are processed using the natural language preprocessing methods mentioned in Section 3.4 before they are passed into the model. The positive, negative and neutral sentiment data columns are transformed into a single column, where positive sentiment is marked with the integer 1, negative with 0 and neutral with -1. The Natural Language Toolkit21 library is used to get a list of stopwords; then, the thirty-five most frequently occurring stopwords in the comment dataset are removed, as sketched below. The dataset is saved into a CSV file.
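A sketch of the stopword step (illustrative; `comments` stands in for the dataset as a list of token lists):

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))

    # Count stopword occurrences across all comments, then drop the 35 most frequent.
    counts = Counter(w for comment in comments for w in comment if w in stop_words)
    top_35 = {w for w, _ in counts.most_common(35)}
    comments = [[w for w in c if w not in top_35] for c in comments]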
The next step is to generate the word vectors. The pre-trained GloVe word vectors trained on Twitter data are used. They are loaded using the Gensim22 Python library. The word vector matrix is constructed by including only the words that occur in the comment dataset’s vocabulary.
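A hedged sketch of building such a matrix (the 100-dimensional GloVe variant and the `tokenizer` object are assumptions):

    import numpy as np
    import gensim.downloader as api

    # Load pre-trained GloVe Twitter vectors; the report does not state the
    # exact dimension used, so 100 is an assumption.
    glove = api.load("glove-twitter-100")

    # Build an embedding matrix covering only the dataset vocabulary;
    # `tokenizer` is assumed to be the fitted Keras Tokenizer from Section 3.4.
    vocab_size = len(tokenizer.word_index) + 1    # index 0 is reserved for padding
    embedding_matrix = np.zeros((vocab_size, glove.vector_size))
    for word, index in tokenizer.word_index.items():
        if word in glove:
            embedding_matrix[index] = glove[word]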
Lastly, the comments are tokenised using the Tokenizer provided by the Keras library; the process is described in more depth in Section 3.4. Three datasets of the processed and labelled comments are created - positive, negative and mixed, where the mixed dataset consists of positive, negative and neutral comments. All the datasets have the following columns: comment_id, video_id, rating, date and body. These datasets can be found in the GitHub repository, in the datasets/processed folder.
Figure 4.14: The model implementation process: each of the four LSTM-based model types is trained twice, once with pre-trained word vectors and once without.
Figure 4.15: The model evaluation process: each trained variant (with and without word vectors) is evaluated on the testing subset, producing accuracy, precision, recall, F1 score, confusion matrix and ROC curve.
Figure 4.14 illustrates the simplified process of how the models are implemented. Each of the four model types is trained on the YouTube comment training and validation subsets, with or without word vectors in the embedding layer. This produces a history of model accuracy and loss values.

Figure 4.15 illustrates the process of model evaluation. The testing subset of the comment dataset is used for model evaluation, resulting in several more metrics that help determine how a model performs on unseen data. The produced metrics are accuracy, precision, recall, F1 score, confusion matrix and ROC curve.

This section describes the implementation, training and evaluation of the four LSTM-based models.
Dataset          Number of comments
Training set     6041
Validation set   863
Testing set      1726
The same subsets are used for all experiments and hyperparameter tuning. The dataset subsets can be found in the GitHub repository23.
Figure 4.16: The architectures of four LSTM-based models. Green nodes are input
layers, white nodes are hidden layers and purple nodes are output layers.
All four models’ architectures are defined using Keras’ methods. Each model is
built as a Sequential model, which is a group of linearly stacked layers. All four
neural networks share the same input and output layers but are different in their
hidden layers. The architecture of each model is visualised in Figure 4.16.
23 https://github.com/indrer/ytcsc/tree/main/datasets/split
Input layer. The input layer is implemented as a Keras Embedding layer. If not specified otherwise, the embedding layer is initialised with random weights and turns integer-represented word sequences into dense vectors [43]. Models with two types of embedding layers are evaluated - with initialised weights (word vectors) and without. The input dimension of the embedding layer depends on whether word vectors are used - if they are, the input dimension is the size of the word vectors’ vocabulary; otherwise, it is the maximum integer index of the tokenised set + 1. The output dimension is either the word vector size (if word vectors are used) or 32.
Output layer. The output layer produces a single value for all the models, representing the sentiment of a comment. It is a Dense layer with a single unit and a sigmoid activation function. The produced value is between 0 and 1; if the value is 0.5 or higher, the comment is classified as positive, and if it is lower than 0.5, the comment is classified as negative.
Hidden layers. The hidden layers of an individual model are what make each model unique. The LSTM model has a single LSTM layer with 128 units followed by a fully connected layer with 32 units. The BiLSTM model has a single bidirectional LSTM layer with 128 units. The CNN-LSTM model has two one-dimensional convolutional layers with 128 and 64 filters and a kernel size of 2 for both layers, followed by a max-pooling layer with a pool size of 2 and an LSTM layer with 64 units. Lastly, the CNN-BiLSTM model has the same configuration of convolutional and max-pooling layers as the CNN-LSTM model, but these layers are followed by a bidirectional LSTM layer with 128 units.
Model compilation. The models are compiled using binary cross-entropy as the loss function together with the Adam optimiser. The learning rate for the optimiser is selected based on the hyperparameter tuning results.
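As a hedged illustration, a Keras definition of the CNN-BiLSTM variant following the layer description above (`vocab_size`, `embedding_matrix` and `max_len` are assumed to come from the data preparation; drop the `weights` argument for the variant without word vectors):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                         Bidirectional, LSTM, Dense)

    model = Sequential([
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  input_length=max_len),
        Conv1D(128, kernel_size=2, activation="relu"),
        Conv1D(64, kernel_size=2, activation="relu"),
        MaxPooling1D(pool_size=2),
        Bidirectional(LSTM(128)),
        Dense(1, activation="sigmoid"),   # single output in [0, 1]
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])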
Model training. The models are trained with the best performing hyperparameters for their type and 30 epochs. The number of epochs can be arbitrarily large, since the early stopping method is used, which stops model training as soon as no improvement in the loss is observed; a sketch follows below.
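A minimal sketch of such a training call (the monitored quantity and patience value are assumptions, not the project’s exact settings):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor="val_loss", patience=3,
                               restore_best_weights=True)
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=30, callbacks=[early_stop])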
After each model is trained, its weights and the architecture are saved in a folder
to be loaded and used later. Saved models can be found in the GitHub repository, in
the models/models folder.
as negative. Then, accuracy, precision, recall, F1 score, confusion matrix and ROC curve are calculated using the scikit-learn methods accuracy_score(), precision_score(), recall_score(), f1_score(), confusion_matrix() and roc_curve(), respectively. The performance measurement values are calculated by classifying prediction results as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP and TN are correctly predicted values for the positive and negative classes respectively, while FP and FN are values where the model predicted one class, but the true class was the other.
Accuracy is a ratio of how many predictions were correct out of all the predictions and can be calculated as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision shows how well the model avoids labelling a sample as positive when it is actually negative [44]:

\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall shows how many of the actual positive samples were predicted as positive [45]:

\[ \text{Recall} = \frac{TP}{TP + FN} \]
The F1 score is a weighted average of precision and recall [46]. The value is between 0 and 1, and higher values indicate better performance:

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
The confusion matrix displays the number of TP, TN, FP and FN predictions.

Finally, ROC curves are generated. The ROC curves are constructed using true positive rates (TPR) and false positive rates (FPR). The TPR is the rate of correct positive predictions out of all positive samples, while the FPR is the rate of incorrect positive predictions out of all negative samples. The generated curve provides insight into the trade-off between the model’s sensitivity and specificity, and the closer the curve is to the top left of the graph, the more accurate the classifier is [47]. A random guess would produce points along the diagonal of the graph, therefore the diagonal line is used as a baseline to see whether the model is doing better or worse than random predictions.
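A sketch of this evaluation step with scikit-learn (`y_true` and `y_prob` are placeholders for the testing labels and a model’s sigmoid outputs):

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix, roc_curve)

    y_pred = (y_prob >= 0.5).astype(int)    # threshold at 0.5, as described above

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("f1       :", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

    # The ROC curve is computed from the raw probabilities, not the thresholded labels.
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)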
5 Results
This section describes the final labelled YouTube comment dataset (Section 5.1), followed by an introduction to the video statistics (Section 5.2). Then, the results of model performance on the YouTube comment dataset are presented (Section 5.3). Lastly, the model performance on the IMDB movie review dataset is presented (Section 5.4).
5.1 YouTube comment dataset

Figure 5.17: The sentiment distribution of the finalised YouTube comment dataset (10 492 comments, 52%; 5 144 comments, 26%; 4 315 comments, 22%).
5.2 YouTube video statistics

Video ID      Upvotes   Downvotes   Views       Sentiment
xuCn8ux2gbs   4467927   71547       112060703   0.869
WeA7edXsU40   129840    2236        2826612     0.766
HVjlcUtuENM   25551     31465       1741401     0.123
u_J0Ng5cUGg   57341     642191      4679316     0.062
ffxKSjUwKdU   7036447   316928      977824338   0.953

Table 5.3: An example of simplified video data (title, comment count and date removed). This video data was collected on 2021-02-04.
5.3 Model performance

5.3.1 Cross-validation
Table 5.4 shows the accuracy scores of the models over the three iterations of cross-validation; the final column displays the average accuracy each model achieved. Cross-validation for each model was performed using the same random seed to ensure that the splits made on the dataset are identical for all the models. From the results, the BiLSTM model trained with word vectors achieved the highest average accuracy of 0.8782 (87.8%). Cross-validation was performed using the training and validation subsets; the testing subset was used for the model evaluation described in the next section (Section 5.3.2).
Model             Iteration 1   Iteration 2   Iteration 3   Average
LSTM              0.8662        0.8696        0.8644        0.8667
LSTM + WV         0.8858        0.8627        0.8757        0.8747
BiLSTM            0.861         0.8648        0.8618        0.8625
BiLSTM + WV       0.8905        0.867         0.877         0.8782
CNN-LSTM          0.8566        0.8557        0.8462        0.8528
CNN-LSTM + WV     0.8814        0.864         0.8722        0.8725
CNN-BiLSTM        0.8627        0.8605        0.8527        0.8586
CNN-BiLSTM + WV   0.8853        0.8696        0.8727        0.8759
the models were trained with and without word vectors.

Table 5.5 displays the accuracy scores achieved by all the models. A higher accuracy indicates that the model made a larger number of correct predictions. The BiLSTM model trained with word vectors achieved the highest accuracy of 0.8905 (or 89%); moreover, all the models trained with word vectors perform better than the same models with no initial word embedding matrix specified.
Model             Accuracy
LSTM              0.876
LSTM + WV         0.8835
BiLSTM            0.8766
BiLSTM + WV       0.8905
CNN-LSTM          0.8691
CNN-LSTM + WV     0.8824
CNN-BiLSTM        0.8621
CNN-BiLSTM + WV   0.883
Table 5.6 displays the recall scores achieved by the four models trained with and without pre-trained word vectors. The recall score describes how accurately a model predicted the positive class out of all actual positive samples. The results show that the CNN-LSTM model with word vectors recognised the positive class best of all the models, with a recall score of 0.9027.
Model             Recall
LSTM              0.8436
LSTM + WV         0.8714
BiLSTM            0.8783
BiLSTM + WV       0.8864
CNN-LSTM          0.8749
CNN-LSTM + WV     0.9027
CNN-BiLSTM        0.8737
CNN-BiLSTM + WV   0.883
Precision scores for the models trained with and without word vectors can be seen in Table 5.7. Precision refers to how accurate a model was in predicting the positive class out of all positive class predictions. The results show that the LSTM model had the largest share of positive class predictions that were actually positive, achieving a precision score of 0.9021.
Model             Precision
LSTM              0.9021
LSTM + WV         0.8931
BiLSTM            0.8753
BiLSTM + WV       0.8937
CNN-LSTM          0.8648
CNN-LSTM + WV     0.8675
CNN-BiLSTM        0.8539
CNN-BiLSTM + WV   0.883
F1 scores for all four models with and without word vectors can be seen in Table 5.8. The F1 score provides a balanced metric of precision and recall. The highest F1 score, 0.8901, was achieved by the BiLSTM model trained with word vectors.
Model             F1 score
LSTM              0.8719
LSTM + WV         0.8821
BiLSTM            0.8768
BiLSTM + WV       0.8901
CNN-LSTM          0.8698
CNN-LSTM + WV     0.8847
CNN-BiLSTM        0.8637
CNN-BiLSTM + WV   0.883
The correct and incorrect predictions are displayed in confusion matrices, which show how many predictions were correct for both the positive and negative classes. Tables 5.9, 5.10, 5.11 and 5.12 display the confusion matrices for LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM, respectively.
(a) Trained without word vectors:

                     True Negative   True Positive
Predicted Negative   755             108
Predicted Positive   105             758

(b) Trained with word vectors:

                     True Negative   True Positive
Predicted Negative   772             91
Predicted Positive   98              765

Table 5.10: Confusion matrices of the BiLSTM model trained (a) without and (b) with word vectors.
Finally, the models’ prediction abilities are displayed in ROC curve graphs, which can be seen in Figures 5.18, 5.19, 5.20 and 5.21. The dashed diagonal line indicates random predictions. The area value is the area under the ROC curve (AUC), which expresses the probability that the model will rank a random positive comment higher than a random negative comment.
Figure 5.18: The ROC curve of the LSTM model trained without (AUC = 0.941) and with (AUC = 0.951) word vectors.
Figure 5.19: The ROC curve of the BiLSTM model trained without (AUC = 0.946) and with (AUC = 0.959) word vectors.
Figure 5.20: The ROC curve of the CNN-LSTM model trained without (AUC = 0.939) and with (AUC = 0.956) word vectors.
Figure 5.21: The ROC curve of the CNN-BiLSTM model trained without (AUC = 0.941) and with (AUC = 0.957) word vectors.
was achieved by the LSTM model, and the highest F1 score was achieved by the LSTM model as well.

Lastly, the ROC curves of all the models are generated to make comparing the models’ performance easier. The ROC curve graph can be seen in Figure 5.22.
6 Discussion
During this research, a dataset of YouTube comments with their sentiment was constructed, and four LSTM-based models were built to predict the sentiment of YouTube videos based on their comments. The performance of the four models was evaluated, and the results of the evaluation were presented in the previous section. In this section, the results are analysed, discussed and compared with previous research, and the research questions presented in Section 1.4 are answered.
• RQ1 What is the highest accuracy that an LSTM-based model can achieve in predicting YouTube video sentiment?

In order to answer this question, objectives O1, O3 and, partially, O4 were completed. Objective O4 was only partially required because RQ1 concerns only the accuracy of the models. Two approaches are used to measure the accuracy of the four LSTM-based models - cross-validation and model evaluation on the testing subset. The results are presented in Table 5.4 for cross-validation and in Table 5.5 for the model evaluation on the testing subset.

Looking at the achieved accuracy scores, specifying the word embedding matrix in the embedding layer results in either better or similar performance compared to the model without specified word vectors. Overall, the average model accuracy in cross-validation is lower than the model accuracy on the testing subset. Several reasons can cause this difference in accuracy:

1. Models are trained on smaller subsets during cross-validation than models trained and evaluated on the testing subset. Smaller subsets might not provide enough data to train on, resulting in lower accuracy during model evaluation; the training and validation subset sizes should therefore be increased.

2. Models trained during cross-validation do not use the early stopping method, therefore they can overfit the data they are trained on.
The cross-validation results show that, on average, BiLSTM with word vectors achieved the highest accuracy, followed by CNN-BiLSTM with word vectors and LSTM with word vectors. The evaluation on the testing subset shows the same three model architectures at the top of the accuracy scores, with BiLSTM with pre-trained word vectors reaching the highest accuracy of 89%. Therefore, RQ1 is answered as follows: the highest accuracy achieved by an LSTM-based model in predicting YouTube video sentiment was 89% (0.8905), reached by the BiLSTM model trained with pre-trained word vectors.
The accuracy results achieved in this research differ from other studies, where LSTM models with convolutional layers outperformed LSTM networks without them in binary sentiment classification problems [32, 48, 35]. However, those studies evaluated their models on different datasets and used different data preprocessing techniques.

Lastly, previous work on YouTube data has used traditional machine learning techniques. Benkhelifa and Laallam assessed the performance of an SVM classifier in predicting the sentiment of YouTube cooking videos based on comments and achieved 95.3% accuracy [12]; however, it is unclear how their classifier would perform on comments from videos of other categories. In this project, the LSTM-based models were exposed to comments from various domains, possibly covering a larger variety of terms and vocabulary.
CNN-LSTM with pre-trained word vectors achieved the highest recall, meaning that out of all the models it detected the most positively labelled comments correctly. The LSTM model reached the highest precision score, meaning that if this model classified a comment as positive, it had the highest chance of being truly positive. However, BiLSTM with pre-trained word vectors achieved the highest F1 score, meaning that it has the best balance between precision and recall of all the models. In addition, BiLSTM with word vectors achieved the highest accuracy in cross-validation and in the model evaluation on unseen data.
Lastly, it is worth mentioning that CNN-LSTM has the highest recall but low precision (second lowest of all the models). Such results could suggest that CNN-LSTM tends to classify comments as positive more often; even if a comment is negative, it might still classify it as positive. In addition, LSTM has the lowest recall but the highest precision, suggesting that the model is least likely to produce false-positive classifications. Finally, BiLSTM with word vectors achieved the highest accuracy in cross-validation and the highest F1 score, indicating a balance between recall and precision. These results suggest that BiLSTM with word vectors is the model that performed best overall on the YouTube comment dataset.
• RQ3 How accurate are the LSTM-based models with another sentiment dataset, such as the IMDB movie review dataset?

To answer the third research question, all four LSTM-based models were evaluated on the IMDB dataset. Pre-trained word vectors were not used. The results of the evaluation can be seen in Table 5.13. The highest accuracy was achieved by the BiLSTM model (87.87%). The highest recall score was reached by the LSTM model (0.8896), and the highest precision was achieved by the CNN-LSTM model (0.8996). The LSTM model reached the highest F1 score (0.8803). Looking at the ROC curves (Figure 5.22), CNN-LSTM shows the worst performance with the lowest AUC, while BiLSTM and LSTM have the highest and similar AUC values and the shortest distance to the top left corner.

Therefore, the third research question is answered as follows:
All four models achieved an average of 87% accuracy on the dataset - 2% less than the accuracy attained on the YouTube comment dataset. This indicates that the models can perform on different datasets.

Lastly, some previous studies have used LSTM-based models to predict IMDB movie review sentiment. Mathapati et al. reached an accuracy score of 88.3% with a CNN-LSTM model [35]; however, their model was trained with pre-trained word vectors, meaning that the four LSTM-based models might have achieved higher accuracy scores had word vectors been used. Yenter and Verma implemented a multi-kernel CNN-LSTM model, which achieved 89.5% accuracy on the IMDB dataset [34]. This suggests that if the architectures of the four LSTM-based models used in this project had consisted of joint kernels, the accuracy might have been higher. Finally, Peng compared the performance of RNN and LSTM and their bidirectional variants [49] and discovered that the bidirectional variants outperform regular LSTM and RNN on IMDB data. In this research, the bidirectional variant of the LSTM model reached a higher accuracy on IMDB data but underperformed on the other metrics. In addition, looking back at the predictions on the YouTube comment dataset, the bidirectional variants achieved slightly higher accuracy there as well.
• RQ4 What is the relationship between the video sentiment and users’ preferences?

To answer the final research question, objective O2 was completed, and box plots were generated to visualise how the sentiment relates to the number of video upvotes, downvotes and views. Furthermore, ANOVA tests were performed between positive and negative video statistics to see whether there is any significant difference between the groups.
[Box plot: the number of upvotes for videos grouped by overall sentiment (positive, negative).]